Conclusion


Vincenzo De Florio (PATS Research Group, University of Antwerp and iMinds, Belgium)
Copyright: © 2009 | Pages: 24
DOI: 10.4018/978-1-60566-182-7.ch011


An Introduction And Some Conclusions

We have reached the end of our discussion about application-level fault-tolerance protocols, which were defined as the methods, architectures, and tools that allow the expression of fault-tolerance in the application software of our computers. Several “messages” have been given:

  • First of all, fault-tolerance is a “pervasive” concern, spanning all the layers of the system. Neglecting one layer, for instance the application, means leaving a backdoor open for problems.

  • Next, fault-tolerance is not abstract: It is a function of the target platform, the target environment, and the target quality of service. The tools to deal with this are the system model and the fault model, plus the awareness that (1) all assumptions have coverage and (2) coverage means that, sooner or later, perhaps much later but “as sure as eggs is eggs,” cases will show up where each coverage assumption will fail.

  • This means that there is a need, even an ethical one, to design our systems with the consequences of coverage failures at mission time in mind, especially for safety-critical missions. I coined a word for those supposed fault-tolerant software engineers who do not take this need into account: Endangeneers. Three well-known accidents have been presented and interpreted in view of coverage failures in the fault and system models.

  • Next, the critical role of the system structure for the expression of fault-tolerance in computer applications was put forth: From this stemmed the three properties characterizing any application-level fault-tolerance protocol: Separation of concerns, adequacy to host different solutions, and support for adaptability. Those properties address the following questions: Given a certain fault-tolerance provision, is it able to guarantee an adequate separation of the functional and non-functional design concerns? Does it tolerate a fixed set of faulty scenarios, or does it dynamically change that set? And is it flexible enough to host a large number of different strategies?

  • Then it has been shown that there exist a large number of techniques, and hence of system structures, able to enhance the fault-tolerance of the application. Each of these techniques has its pros and cons, which we tried to point out as best we could. We also attempted to qualify each technique with respect to the above-mentioned properties. A summary of the results of this process is depicted in Figure 1.

  • Another key message is that complexity is a threat to dependability, and we must make sure that the extra complexity needed to manage fault-tolerance does not become another source of potential failures. In other words, simplicity must be a key ingredient of our fault-tolerance protocols: faulty fault-tolerant software may have the same consequences as faulty non-fault-tolerant software, or even worse ones.

  • Finally, we showed with some examples that adaptive behavior is the only way to match the ever-changing and unstable environments characterizing mobile systems. For instance, static designs would make poor use of the available redundancy.

    Figure 1. Application-level software fault-tolerance protocols according to the three structural attributes SC, SA, and A
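
The last message, that a static design makes poor use of the available redundancy, can be illustrated with a minimal sketch. It is not taken from the chapter: the names `majority_vote` and `AdaptiveRedundancy`, the bounds of 3 and 9 replicas, and the step of 2 are all assumptions chosen for illustration. The idea is simply that the number of replicas taking part in a vote grows when disagreement is observed and shrinks again when the environment is quiet.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


class AdaptiveRedundancy:
    """Hypothetical controller that resizes the replica set after each vote."""

    def __init__(self, minimum=3, maximum=9):
        # Bounds are illustrative assumptions, not values from the chapter.
        self.minimum = minimum
        self.maximum = maximum
        self.replicas = minimum

    def update(self, results):
        """Vote on one round of results, then adapt the replica count."""
        outcome = majority_vote(results)
        dissenting = sum(1 for r in results if r != outcome)
        if dissenting > 0:
            # Disagreement observed: add two replicas (keeps the count odd).
            self.replicas = min(self.maximum, self.replicas + 2)
        else:
            # Quiet round: release redundancy that is not currently needed.
            self.replicas = max(self.minimum, self.replicas - 2)
        return outcome
```

A static design would correspond to fixing `replicas` once and for all: too low a value risks losing the majority in a harsh environment, while too high a value wastes resources in a benign one.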
