Most Realtime systems focus on hardware fault tolerance, while software fault tolerance is often overlooked. This is surprising, because hardware components are generally far more reliable than the software that runs on them. Most system designers go to great lengths to limit the impact of a hardware failure on system performance. However, they pay little attention to the system's behavior when a software module fails.
In this article we cover several techniques that can be used to limit the impact of software faults (that is, bugs) on system performance. The main idea is to contain the damage caused by software faults. Software fault tolerance is not a license to ship the system with bugs. The real objective is to improve system performance and availability when the system encounters a software or hardware fault.
Most Realtime systems use timers to keep track of feature execution. A timeout generally signals that some entity involved in the feature has misbehaved and that corrective action is required. The corrective action can take one of two forms: retrying the operation or aborting the feature.
The choice between retrying and aborting on a timeout depends on several factors. Consider all of them before deciding either way.
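Whichever way the decision goes, the two corrective actions can be combined: retry a bounded number of times, then abort. The following is a minimal sketch of that pattern, not a definitive implementation. The names `request_with_retry`, `FeatureAborted`, `send` and `replies`, and the default timeout and attempt count, are assumptions made for illustration.

```python
import queue


class FeatureAborted(Exception):
    """Raised when every attempt has timed out and the feature is abandoned."""


def request_with_retry(send, replies, timeout_s=0.05, max_attempts=3):
    """Send a request and wait for its reply, retrying on timeout.

    send    -- hypothetical callable(attempt_number) that transmits the request
    replies -- queue.Queue on which the peer's reply is expected
    """
    for attempt in range(1, max_attempts + 1):
        send(attempt)
        try:
            # The timer: block until a reply arrives or the timeout expires.
            return replies.get(timeout=timeout_s)
        except queue.Empty:
            continue  # timeout: move on to the next attempt
    # Retries exhausted: abort so the caller can release resources.
    raise FeatureAborted(f"no reply after {max_attempts} attempts")
```

The caller is expected to catch `FeatureAborted` and release whatever resources the feature had acquired.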
Most Realtime systems consist of software running across multiple processors. This implies that data is also distributed. The distributed data may become inconsistent while the system is running, for reasons such as:
The system must behave reliably under all these conditions. A simple strategy for overcoming data inconsistency is to implement audits. An audit is a program that checks the consistency of data structures across processors by performing predefined checks.
Let's consider the Xenon Switching System. If the call occupancy on the system is much lower than the maximum it can handle and calls are still failing for lack of space-slot resources, the call processing subsystem will detect this condition and trigger a space-slot audit. The audit runs on the XEN and CAS processors and cross-checks whether each space-slot that is busy at CAS actually has a corresponding call at XEN. If no active call is found on XEN for a space-slot, the audit rechecks the condition several times, with a small delay before each recheck. If the inconsistency holds on every attempt, the space-slot resource is marked free at CAS. The audit performs several rechecks to rule out the scenario in which the space-slot release message is still in transit.
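The recheck logic of such an audit can be sketched as follows. This is an illustration under assumptions, not the Xenon implementation: the CAS view is modelled as a set of busy slot ids, the XEN view as a hypothetical query function `xen_has_call`, and the recheck count and delay are arbitrary.

```python
import time


def audit_space_slots(cas_busy_slots, xen_has_call, rechecks=3, delay_s=0.01):
    """Free CAS space-slots that have no matching call on XEN.

    cas_busy_slots -- set of slot ids marked busy at CAS (mutated in place)
    xen_has_call   -- hypothetical callable(slot) -> bool querying XEN
    Returns the set of slots that were freed.
    """
    # First pass: slots busy at CAS with no active call on XEN are suspects.
    suspects = {s for s in cas_busy_slots if not xen_has_call(s)}
    for _ in range(rechecks):
        if not suspects:
            break
        # Wait so that a release message still in transit can arrive.
        time.sleep(delay_s)
        # A suspect is cleared if CAS has freed it or XEN now reports a call.
        suspects = {s for s in suspects
                    if s in cas_busy_slots and not xen_has_call(s)}
    # Inconsistent on every attempt: mark the slot free at CAS.
    cas_busy_slots.difference_update(suspects)
    return suspects
```

A slot is freed only if it fails the initial check and every recheck, so a transient mismatch does not cause a live call to lose its slot.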
Whenever a task receives a message, it performs a series of defensive checks before processing it. The defensive checks should verify the consistency of the message as well as the internal state of the task. The exception handler should be invoked when a defensive check fails.
Depending on the severity, the exception handler can take any of the following actions:
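A minimal sketch of defensive checking on message receipt is shown here. The `CallTask` class, its two states, the message fields `type` and `call_id`, and the `ALLOWED` table are all hypothetical. The exception handler in this sketch only records the failure.

```python
from dataclasses import dataclass, field


class DefensiveCheckFailure(Exception):
    """Raised when a message or the task's own state fails validation."""


@dataclass
class CallTask:
    state: str = "IDLE"
    errors: list = field(default_factory=list)

    # Hypothetical table: message types that are legal in each state.
    ALLOWED = {"IDLE": {"SETUP"}, "ACTIVE": {"RELEASE"}}

    def receive(self, msg: dict) -> bool:
        """Validate msg, then process it. Returns True if it was processed."""
        try:
            self._check(msg)
        except DefensiveCheckFailure as exc:
            self._exception_handler(exc)
            return False
        self.state = "ACTIVE" if msg["type"] == "SETUP" else "IDLE"
        return True

    def _check(self, msg):
        # Message consistency: required field present and well formed.
        if not isinstance(msg.get("call_id"), int) or msg["call_id"] < 0:
            raise DefensiveCheckFailure("bad call_id")
        # Internal state consistency: the task must be in a known state.
        if self.state not in self.ALLOWED:
            raise DefensiveCheckFailure(f"corrupt state {self.state!r}")
        # The message must be legal in the current state.
        if msg.get("type") not in self.ALLOWED[self.state]:
            raise DefensiveCheckFailure(f"{msg.get('type')!r} in {self.state}")

    def _exception_handler(self, exc):
        # Placeholder: a real handler would choose an action by severity.
        self.errors.append(str(exc))
```

Because all checks run before any state is modified, a rejected message leaves the task exactly as it was.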
Leaky-bucket counters are used to detect a flurry of error conditions. Each occurrence of the error increments the counter. So that rare error conditions are ignored, the counters are periodically leaked, i.e. decremented. If a counter reaches a certain threshold, the appropriate exception handling is triggered. Note that rare occurrences of the associated error condition will never cross the threshold. However, if the error condition occurs in rapid succession, the counter will overflow, i.e. cross the threshold.
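A leaky-bucket counter takes only a few lines. In this sketch the class name, the `on_overflow` callback, and the choice to reset the counter after it fires are assumptions. `leak()` is meant to be called from a periodic timer.

```python
class LeakyBucketCounter:
    """Counts errors, forgets them slowly, and fires when they come too fast."""

    def __init__(self, threshold, on_overflow):
        self.threshold = threshold
        self.on_overflow = on_overflow  # hypothetical exception-handling hook
        self.count = 0

    def record_error(self):
        """Call once for every occurrence of the error condition."""
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # start afresh after escalating
            self.on_overflow()

    def leak(self):
        """Call from a periodic timer; forgets one old error."""
        if self.count > 0:
            self.count -= 1
```

The threshold and the leak period together define the tolerated error rate: errors arriving no faster than one per leak period never accumulate.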
In a complex Realtime system, a software bug in one task leading to a processor reboot may not be acceptable. A better option in such cases is to isolate the erroneous task and handle the failure at the task level. The task in turn may decide to roll back, i.e. restart operation from a known or previously saved state. In other cases, discarding the context may be inexpensive: the offending task is simply deleted and the associated tasks are informed.
For example, if the Space Slot Manager on the CAS card encounters an exception condition leading to task rollback, it might resume operation by recovering the space-slot allocation status from the connection memory. On the other hand, an exception in a call task might be handled simply by clearing the call task and releasing all the resources assigned to it.
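The rollback idea can be sketched as follows. This is an illustration only: the saved state here is an in-memory checkpoint standing in for the connection memory, and `SpaceSlotManager`, `save_checkpoint`, `rollback` and `run_isolated` are hypothetical names.

```python
import copy


class SpaceSlotManager:
    def __init__(self, slots):
        self.alloc = {s: None for s in range(slots)}  # slot -> call id
        self._checkpoint = copy.deepcopy(self.alloc)

    def save_checkpoint(self):
        """Record the current allocation as the known good state."""
        self._checkpoint = copy.deepcopy(self.alloc)

    def rollback(self):
        """Resume from the last known good state."""
        self.alloc = copy.deepcopy(self._checkpoint)

    def allocate(self, call_id):
        for slot, owner in self.alloc.items():
            if owner is None:
                self.alloc[slot] = call_id
                return slot
        raise RuntimeError("no free space-slot")


def run_isolated(task, operation):
    """Run one operation; on an exception roll back the task, not the processor."""
    try:
        return operation(task)
    except Exception:
        task.rollback()
        return None
```

The failure is contained inside `run_isolated`, so other tasks on the same processor keep running while the faulty task returns to its checkpoint.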
Task rollback may be triggered by any of the following events:
Software processor reboots can be time consuming, leading to an unacceptable amount of downtime. To reduce the system reboot time, complex Realtime systems often implement incremental system initialization procedures. For example, a typical Realtime system may implement three levels of system reboot:
N-version programming is a technique used in mission-critical systems where a software failure may lead to loss of human life, e.g. aircraft navigation software. Here, the Realtime system software is developed by at least three distinct teams, each working independently. In a live system, all three implementations run simultaneously. Every input is fed to all three versions, and their outputs are voted on to determine the actual system response. In such systems, a bug in one of the three modules is voted out by the other two versions.
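The voting step can be sketched as a strict-majority voter. The function names are assumptions, and the sketch requires the outputs to be hashable and exactly comparable. Real systems often have to vote on values that agree only within a tolerance.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of the versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value has more than half the votes: fail rather than guess.
        raise RuntimeError("no majority among versions")
    return value


def run_n_versions(versions, x):
    """Feed the same input to every version and vote on the results."""
    return vote([version(x) for version in versions])
```

With three versions, any single faulty output is outvoted by the two that agree. If all three disagree, the voter raises an error instead of picking one arbitrarily.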