We often say that reliability must be designed in, not tested in. What do we mean?
I recall a debate from early in my career as a young software engineer about the merits of optimizing for average-case behavior versus optimizing for the worst case. The product in question had a lot of serial interfaces. Those arguing for the average case advocated interrupt routines tied to each of the serial ports. That way, if only a subset of ports were active, the inactive ports would not use any processing cycles. This, they reasoned, would give optimal performance for a typical workload. My approach was to create one timer interrupt running at the specified highest byte rate for the serial ports. My interrupt handler checked all the ports for data. This guaranteed a worst-case throughput equal to the timer interrupt rate on all ports.
The average case team argued that they could pump data through the system on a small number of ports at rates higher than my specified highest byte rate. But when all the ports were active, they could not keep up with my implementation—not even close. The overhead of all those per-port interrupts swamped the processor, and interrupts were missed. What guarantees could they provide? When using more than a couple of ports, they really couldn’t say. They made some empirical measurements and waved their hands about interrupt stacking and preemption. Which approach would you prefer: A guaranteed rate on all channels simultaneously, or a slightly better rate… on some channels… most of the time?
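A minimal sketch of the polled approach, not the original product code: the `SerialPort` type, the port count, and the `receive` hook are hypothetical stand-ins for real UART registers. It shows the property that mattered in the debate: the handler does a fixed, bounded amount of work on every tick, no matter how many ports are active.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a UART receive register: holds at most one byte.
struct SerialPort {
    std::optional<std::uint8_t> rx;
};

constexpr std::size_t kPortCount = 8;  // assumed number of serial ports

struct PolledSerial {
    std::array<SerialPort, kPortCount> ports{};
    std::array<std::size_t, kPortCount> bytes_read{};

    // Simulation hook (assumption): the hardware latches a received byte.
    void receive(std::size_t port, std::uint8_t byte) {
        assert(port < kPortCount);
        ports[port].rx = byte;
    }

    // Body of the single timer interrupt. It visits every port on every tick,
    // so the loop length is fixed regardless of how many ports are active.
    void on_timer_tick() {
        for (std::size_t i = 0; i < kPortCount; ++i) {
            if (ports[i].rx.has_value()) {
                ports[i].rx.reset();  // consume the waiting byte
                ++bytes_read[i];
            }
        }
    }
};

// Each port is serviced once per tick, so the guaranteed per-port rate
// in bytes per second equals the timer tick rate in hertz.
constexpr std::uint32_t guaranteed_bytes_per_second(std::uint32_t tick_hz) {
    return tick_hz;
}
```

Because the loop bound is a compile-time constant, the worst-case execution time of the handler can be measured once and then used in a timing analysis.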
This debate was about utilization of a telephone switching fabric. Nowadays, everyone uses packet switching on the Internet. Telecoms thinking was different then. Designing for the average call volume back in the day would have meant that any number of calls could be attempted. With this approach, at some point you have too many callers and the system gets bogged down, and you end up with static; you can’t hear the conversation, or calls get dropped after a while. Designing reliability in, on the other hand, means there will be busy signals at some point when you attempt to place a call; this happens before the system gets too bogged down to give everyone with a dial tone a high-quality phone call.
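The busy-signal idea is a form of admission control. Here is a small sketch under assumed names (`CallAdmission`, `try_place_call`, and `hang_up` are illustrative, not taken from any real switch): the system accepts work only up to the capacity it was designed to serve well, and refuses the rest up front.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative admission controller: capacity is fixed at design time.
class CallAdmission {
public:
    explicit CallAdmission(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the call is admitted, false for a "busy signal".
    // Calls already in progress are never degraded by a new attempt.
    bool try_place_call() {
        if (active_ >= capacity_) {
            return false;
        }
        ++active_;
        return true;
    }

    // Releases one circuit; there must be a call in progress.
    void hang_up() {
        assert(active_ > 0);
        --active_;
    }

    std::size_t active() const { return active_; }

private:
    std::size_t capacity_;
    std::size_t active_ = 0;
};
```

The refusal path is the design decision: rejecting the call that would exceed capacity is what keeps quality predictable for every call already admitted.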
I know what I would choose, as I did then: I want deterministic behavior in all situations. I want that behavior by design, proven by analysis, and yes, verified with testing. That was true with the phone lines: even if it meant the occasional busy signal, you knew what you were getting. It’s also true with medical devices, when lives are on the line.
Unfortunately, many developers are trained on systems that are general purpose computers, expecting a dynamic and flexible workload. On those systems it may make sense to optimize for the average case, and our desktop and server operating systems are designed that way. For reliable systems, we need a different mindset.
There are similar tradeoffs with communication protocols. Sometimes you want “guaranteed delivery.” This would be a good idea when you are downloading a contract from a website: You need the whole contract, no missing bits, so the communication protocol must verify delivery and resend missing bits until the whole contract is delivered. Nobody wants to sign a contract with a missing page or even a missing sentence. Document delivery is worthless in this situation unless you have guaranteed delivery. That means that if a piece goes missing, the sender retransmits it, and the receiver checks each piece so that you end up with the whole thing.
You have to adjust your mindset when what matters is real-time delivery. When you’re delivering data that’s being used to control a real-time instrument, you know within a few seconds or microseconds if data didn’t go through. You know you can’t do anything about it, though, because it already happened. Bogging down the communication protocol with the mechanisms needed to go back and get that data would just degrade the performance of the device for no benefit.
Think of watching a movie. If a few movie frames are garbled, you might prefer the video to continue rather than automatically stopping, rewinding, and retrying the frames. It would be very annoying to have a blip of some milliseconds turn into many seconds of retrying and rebuffering. You can miss that minuscule fraction of the action without missing out on the movie, or you can get stuck waiting for what feels like ages to see if the movie will ever resume.
In real-time systems, guaranteed delivery is generally a bad idea. Latency is critical for control loops, so you don’t want the real-time communication stream interrupted to retry sending a value, especially when the next sample of the value is coming “soon.” To support guaranteed delivery, the sender needs to retain sent values until they are acknowledged, which adds complexity, additional memory use, and another class of errors: What do we do when the retry buffer fills up? Do we need to resend an old value of a real-time variable when a new sample is available? These complications make it very difficult to reason about meeting performance requirements.
The alternative is to use a protocol that sends data periodically, and at a multiple of the required rate. So, if a value is damaged or lost, it will be sent again soon without any “data not received” signals from the intended receiver. As in the worst-case versus average-case debate above, we are using more bandwidth than needed in the average case, but not as much more as you might think, since we have much less protocol overhead. The impact of a communication error is obvious: A value is delayed until its next scheduled time slot. Analysis is straightforward using well-known statistical models.
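To make the analysis concrete, here is a small sketch of the arithmetic. It assumes each frame is lost independently with the same probability, which is a simplification chosen for illustration; real links may show burst errors and need a different model. The function names are hypothetical.

```cpp
#include <cstdint>

// Probability that `copies` consecutive transmissions of a value are all lost,
// ASSUMING independent losses with probability `p` per frame.
constexpr double prob_all_lost(double p, unsigned copies) {
    double result = 1.0;
    for (unsigned i = 0; i < copies; ++i) {
        result *= p;
    }
    return result;
}

// Time from one good value to the next good value when `lost` consecutive
// frames in between were lost: one send period per lost frame, plus one.
constexpr std::uint32_t update_gap_us(std::uint32_t send_period_us, unsigned lost) {
    return send_period_us * (lost + 1);
}

// Example: the control loop needs a fresh value every 10 ms and we send
// every 2.5 ms (4x oversampling). Three frames in a row can be lost and
// the fourth still arrives within the 10 ms requirement.
static_assert(update_gap_us(2500, 3) == 10000, "fourth copy meets the deadline");
```

With four copies per required interval, a per-frame loss probability of one in a thousand becomes roughly one in a trillion for missing the interval entirely, under the independence assumption stated above.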
Recently, we designed a life-sustaining cardiac device that is required to fail operative for single faults. We designed the system with two CAN busses to each ASIL-D processor. A consultant we retained for an independent review commented that we’d never get the communication protocol verified because of the elaborate protocol he assumed we would need, and all the special cases of error handling.
We applied the alternative method described above, but redundantly. In other words, we sent all data periodically, duplicated on both CANs, at higher-than-necessary rates. We designed a blackboard communication model that simply placed any correctly received (CRC checked) CAN data on the blackboard with a timestamp. No protocol. Well, we had a periodic handshake among the processors, and the CAN hardware has an error retry. It’s pretty easy to see that if either CAN fails, we still get all data. No elaborate protocol, no software special cases for error handling, and trivial verification. Consequently, there’s no change in behavior when there’s a failure. We want all things to be deterministic, and we don’t want to change our behavior because an error happened. We want all the data to get through because it’s critical to operating the device.
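The blackboard idea can be sketched in a few lines. This is an illustration under assumed names and types (`Blackboard`, `Entry`, `Bus`, and a millisecond tick counter), not the device's actual code. The point to notice is that the receive path never asks which bus a frame came from, so losing one bus changes nothing in the software's behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Entry {
    std::int32_t value = 0;
    std::uint32_t timestamp_ms = 0;
    bool valid = false;  // false until the first good frame arrives
};

enum class Bus { A, B };

template <std::size_t N>
class Blackboard {
public:
    // Called for every received frame from either bus. A frame that failed
    // its CRC check is simply dropped; no retry is requested, because the
    // same value is already scheduled to be sent again.
    void on_frame(Bus /*bus*/, std::size_t id, std::int32_t value,
                  bool crc_ok, std::uint32_t now_ms) {
        if (!crc_ok || id >= N) {
            return;
        }
        entries_[id] = Entry{value, now_ms, true};
    }

    // A value is usable only if it was received recently enough.
    // Unsigned subtraction keeps the age correct across counter wraparound.
    bool is_fresh(std::size_t id, std::uint32_t now_ms,
                  std::uint32_t max_age_ms) const {
        return id < N && entries_[id].valid &&
               (now_ms - entries_[id].timestamp_ms) <= max_age_ms;
    }

    std::int32_t value(std::size_t id) const {
        assert(id < N);
        return entries_[id].value;
    }

private:
    std::array<Entry, N> entries_{};
};
```

A consumer checks `is_fresh` before using a value; a stale timestamp is the only failure signal it needs, whether the cause is a dead bus, a dead sender, or a run of corrupted frames.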
Another big advantage of sending all data periodically is that we can perform a mathematical analysis of the CAN configuration, and prove that it is schedulable. The CAN will never be overloaded. Our C++ software team took this one step further and built the schedulability analysis into the product source code using constexpr functions. So, the analysis is done on every build of the software.
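As an illustration of the idea, and not a reproduction of our production analysis, here is a sketch of a compile-time bus load check. The message table, bit rate, and 70% budget are made-up values. The frame length uses the commonly cited worst-case bound for a classic CAN frame with an 11-bit identifier, 55 + 10 × payload bytes including stuff bits; treat that as an assumption to confirm for your own controller and frame format. A utilization check is a necessary condition only; a full analysis would also bound each message's worst-case response time.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct CanMessage {
    std::uint32_t payload_bytes;  // 0..8 for classic CAN
    std::uint32_t period_us;      // transmission period, must be nonzero
};

// Assumed worst-case frame length in bits (11-bit ID, stuff bits included).
constexpr std::uint64_t worst_case_frame_bits(std::uint32_t payload_bytes) {
    return 55u + 10u * static_cast<std::uint64_t>(payload_bytes);
}

// Total bus utilization in parts per million, rounded up per message so the
// result never understates the load.
template <std::size_t N>
constexpr std::uint64_t bus_utilization_ppm(
        const std::array<CanMessage, N>& msgs, std::uint64_t bit_rate_bps) {
    std::uint64_t ppm = 0;
    for (std::size_t i = 0; i < N; ++i) {
        // utilization = bits / (bit_rate * period_s); scaled by 1e6 for
        // microseconds and by 1e6 again for parts per million.
        const std::uint64_t num =
            worst_case_frame_bits(msgs[i].payload_bytes) * 1'000'000'000'000ULL;
        const std::uint64_t den = bit_rate_bps * msgs[i].period_us;
        ppm += (num + den - 1) / den;
    }
    return ppm;
}

// Hypothetical message set on a 500 kbit/s bus.
constexpr std::array<CanMessage, 3> kMessages{{
    {8, 1000},   // 8 bytes every 1 ms
    {8, 5000},   // 8 bytes every 5 ms
    {4, 10000},  // 4 bytes every 10 ms
}};

// The build fails if anyone adds traffic that exceeds the assumed 70% budget.
static_assert(bus_utilization_ppm(kMessages, 500'000) <= 700'000,
              "CAN bus load exceeds budget");
```

Because the check is a `static_assert`, adding a message to the table or shortening a period re-runs the analysis automatically, and an overloaded configuration never reaches a target.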
We took a similar approach with an autonomous robotic surgery device we designed. With this device, a surgeon uses imaging to plan an excision, lines the robotic tool up with a marker, then sets the robot in motion to do the work. The plan can be fairly complex with lots of curves and angles, and the surgeon controls the motors and the actuators that excise the tissue, so everything needs to go according to plan. If we don’t do things deterministically, then the plan will go awry and we’re going to be cutting tissue that we didn’t want to cut. If for some reason we lose a millisecond of data, there’s no value in resetting that part of the plan because the cut has already been made. The tool is still moving, so there’s no value in going back and determining what it should have done on the previous cut. What’s important is to get back on track and get the correct value for the next cut. In this case, a guaranteed delivery protocol doesn’t help, because it doesn’t take into account that the most valuable information is not the data you already missed. It’s what’s coming next.
Other techniques for reliable software we use are to:
- Build timing instrumentation into firmware, and use it in verification testing and in runtime testing; this includes both watchdog timers and high-speed performance timers
- Use a Rate or Deadline Monotonic Scheduler to enable mathematical analysis showing that we are meeting timing requirements
- Avoid dynamic memory allocation
- Carefully restrict and control sharing among tasks, threads and interrupts
- Protect critical code and data with CRCs or hashes
- Use memory protection hardware when it is available
- Use algorithms that have well-established performance complexity, and that we can analyze to prove there is no resource exhaustion
- Avoid data structures that degrade with use (FAT file system is a villain here!)
All of these techniques are considered early in a project, and spelled out in a software architecture document. That document sets the strategies and guidelines for developers to ensure success in meeting our reliability and performance objectives.
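As one example of the kind of analysis a fixed-priority scheduler makes possible, here is a sketch of the standard response-time test for rate monotonic scheduling. The `Task` type and the task values are hypothetical, and the model is deliberately simple: deadlines equal periods, and blocking, release jitter, and context-switch overhead are ignored. A real analysis would have to account for those.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Task {
    std::uint32_t wcet_us;    // worst-case execution time
    std::uint32_t period_us;  // period, also used as the deadline; nonzero
};

// Tasks must be ordered highest priority first, which for rate monotonic
// scheduling means shortest period first. For each task, iterate
//   R = C_i + sum over higher-priority j of ceil(R / T_j) * C_j
// until R stops changing (deadline met) or exceeds the period (missed).
template <std::size_t N>
constexpr bool all_deadlines_met(const std::array<Task, N>& tasks) {
    for (std::size_t i = 0; i < N; ++i) {
        std::uint64_t r = tasks[i].wcet_us;
        while (true) {
            std::uint64_t next = tasks[i].wcet_us;
            for (std::size_t j = 0; j < i; ++j) {
                const std::uint64_t releases =
                    (r + tasks[j].period_us - 1) / tasks[j].period_us;
                next += releases * tasks[j].wcet_us;
            }
            if (next > tasks[i].period_us) {
                return false;  // worst-case response time exceeds the deadline
            }
            if (next == r) {
                break;  // fixed point reached: r is the worst-case response time
            }
            r = next;
        }
    }
    return true;
}

// Hypothetical task set: worst-case response times are 1, 3, and 7 ms.
constexpr std::array<Task, 3> kTasks{{
    {1000, 4000},
    {2000, 8000},
    {3000, 16000},
}};

static_assert(all_deadlines_met(kTasks), "task set is not schedulable");
```

The iteration always terminates, because the response time estimate only grows and is cut off as soon as it passes the deadline.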
We need to take software reliability very seriously… lives are on the line. We should always endeavor to:
- Keep things simple
- Use real-time designs that can be analyzed and proven correct
- Be conservative in assumptions
- Use hardware resources to simplify software and runtime verification
- Optimize the worst case behavior, not the average case
It is also necessary to eliminate the mindset in some circles that the discipline for designing software is somehow less demanding than that for hardware. The make-it-up-as-we-go-along approach (spiral design) is fundamentally sloppy, as is “just good enough” and the idea that we don’t have to worry much because in the case of an “anomaly” (meaning it’s broken) we can fix it with an “update” (meaning a fix for something that was never right in the first place).