Fault Tolerant Systems

DOI: 10.4018/978-1-4666-9429-3.ch011


An alternative to derating, which was described in the previous chapter, is the application of fault tolerant structures to the power converter. Fault tolerance is the property that enables a system to continue operating properly in the event of a failure of (or one or more faults within) some of its components. Fault tolerant systems can continue operating after a fault occurs with no degradation of their basic functional requirements. This is the main difference between fault tolerant systems and derated systems. In this chapter, some methods for fault tolerance in electric power converters are presented. Fault tolerance is almost the only method for achieving a desired reliability in a converter that operates with a non-zero probability of faults. There are two main approaches to this aim: reconfiguration of the faulty system and the use of redundant systems. Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. Various types of redundant systems, such as passive and active redundancy, are described and their application in power supply systems is presented. A new approach for a reliable and fault tolerant power supply is proposed and justified with experimental results. The concept of fault tolerance in electrical machines is also presented.

Introduction: Robustness Against Faults

The methods of reliability improvement and the reliability calculation techniques presented so far help to obtain a safe converter without catastrophic failures. Derating is usually the only method of reliability improvement available for a faulty converter. However, derating is derating! It usually means that the derated converter continues to operate with new rated characteristics that are lower than its original nominal specifications. In many cases this is not acceptable, and the original nominal rating of the converter must be kept. For example, consider a DC power distribution unit with several output voltage levels. In a power distribution unit, as presented for a satellite in Chapter 1, several voltage regulators provide a number of output voltage levels from a common DC input voltage source such as a battery. In this system, a failure in one of the output voltage levels causes the subsystem supplied by that output to fail. It is true that only one of the output channels has failed, not the whole power distribution unit. However, the system does not operate properly even though the other output channels operate normally (Gorginpour, Jandaghi, & Oraee, 2013).

Fault tolerant methods are the solution to this drawback, and they are discussed in this chapter. Figure 1 shows the position of this chapter in the flowchart of the book. Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, in contrast to a naïvely designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely when some part of the system fails. Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it so that the system can move forward. Roll-back recovery reverts the system state to some earlier, correct version, for example by using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or for different parts of one error.
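The two recovery styles can be illustrated with a minimal Python sketch. It is not taken from the chapter: the `Controller` class, the set-point limit of 10.0, and the operation identifiers are all hypothetical, chosen only to show a checkpoint, an idempotent operation, a roll-back, and a roll-forward correction.

```python
import copy


class Controller:
    """Toy system (hypothetical) whose state can be checkpointed and restored."""

    def __init__(self):
        self.state = {"setpoint": 0.0, "applied_ops": set()}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Store a copy of the current state, assumed here to be correct.
        self._saved = copy.deepcopy(self.state)

    def roll_back(self):
        # Roll-back recovery: revert to the last checkpointed state.
        self.state = copy.deepcopy(self._saved)

    def apply(self, op_id, delta):
        # Idempotent operation: replaying an already applied op_id does nothing,
        # so operations can be replayed safely after a roll-back.
        if op_id in self.state["applied_ops"]:
            return
        self.state["setpoint"] += delta
        self.state["applied_ops"].add(op_id)


def state_is_valid(state, limit=10.0):
    # Error detection (assumed rule): the set-point must stay within [0, limit].
    return 0.0 <= state["setpoint"] <= limit


def roll_forward(state, limit=10.0):
    # Roll-forward recovery: correct the erroneous state in place and continue.
    state["setpoint"] = min(max(state["setpoint"], 0.0), limit)


if __name__ == "__main__":
    c = Controller()
    c.apply("op1", 4.0)
    c.checkpoint()
    c.apply("op2", 9.0)          # pushes the set-point outside the valid range
    if not state_is_valid(c.state):
        c.roll_back()            # back to the state saved after op1
    c.apply("op1", 4.0)          # replay is harmless because apply is idempotent
    print(c.state["setpoint"])
```

The idempotence check in `apply` is what makes replay after a roll-back safe; without it, replaying `op1` would add its effect a second time.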

Figure 1. Position of Chapter 11 in the flowchart of the book


Fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them and, in general, by aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making the system sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequences of a system failure are catastrophic, the system must be able to revert to a safe mode. If each component, in turn, can continue to function when one of its subcomponents fails, the total system can continue to operate as well.
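As a rough illustration of why duplication helps, assume (none of this is stated in the source) that two identical units each have reliability $R$ over the mission time, that they fail independently, that either unit alone can carry the full load, and that fault detection and switching are perfect. Under those assumptions the duplicated system fails only if both units fail:

```latex
R_{\mathrm{sys}} = 1 - (1 - R)^2
\qquad\text{e.g. } R = 0.9 \;\Rightarrow\; R_{\mathrm{sys}} = 1 - (0.1)^2 = 0.99
```

Common-cause failures and imperfect switching, which this simple expression ignores, reduce the benefit in a real system.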



Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment (Hao, Covic, & Boys, 2014). This can consist of backup components which automatically “kick in” should one component fail. The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s.
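A backup that takes over when the active unit fails can be sketched as follows. This is a minimal illustration only, not the design proposed in the chapter: the `Regulator`, `RegulatorFault`, and `RedundantChannel` names, and the 5.0 V output, are hypothetical, and fault detection is reduced to an exception raised by the failed unit.

```python
class RegulatorFault(Exception):
    """Raised when a regulator cannot deliver its output (hypothetical)."""


class Regulator:
    def __init__(self, name, nominal_voltage):
        self.name = name
        self.nominal_voltage = nominal_voltage
        self.healthy = True

    def output(self):
        # A failed unit is modelled simply as raising a fault.
        if not self.healthy:
            raise RegulatorFault(f"{self.name} failed")
        return self.nominal_voltage


class RedundantChannel:
    """One output channel served by a primary unit and a backup unit."""

    def __init__(self, primary, backup):
        self.units = [primary, backup]
        self.active_index = 0

    def output(self):
        # Try the active unit first; on a fault, switch to the next unit.
        for index in range(self.active_index, len(self.units)):
            try:
                voltage = self.units[index].output()
            except RegulatorFault:
                continue
            self.active_index = index
            return voltage
        raise RegulatorFault("all redundant units have failed")


if __name__ == "__main__":
    channel = RedundantChannel(Regulator("A", 5.0), Regulator("B", 5.0))
    print(channel.output())           # served by the primary unit
    channel.units[0].healthy = False  # inject a fault in the primary
    print(channel.output())           # the backup takes over at the same rating
```

The channel keeps its nominal output after the first fault, which is the distinction from derating drawn earlier; the price is a second regulator that would be unnecessary if no fault ever occurred.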

Providing a fault-tolerant design for every component is normally not an option. The associated redundancy brings a number of penalties: increases in weight, size, power consumption, and cost, as well as in the time needed to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant:
