Introduction for second edition / Preface
When, in 1989, an anonymous reviewer commented on my short paper (Yet another approach to classification of redundancy, CIM IMEKO Symp. 1990, Helsinki, pp. 117-124) that “this classification should be extended to description of distributed systems,” I was really excited, because people in the research community were thinking much deeper and wider than I was (I had just defended my PhD).
Further, fault tolerance was migrating to dependability (Jean-Claude Laprie was an indisputable authority and expert in this domain; see www.springer.com/gb/book/9783709191729), which later evolved into the concept of resilience.
In principle, all these new properties had concrete reasoning and meaning behind them: when something erroneous happens, any system of our design should be able to cope with the problem. Options vary, as do circumstances and areas of application, thus:
- If it stops the error from propagating and freezes in a safe state, it is fail-stop, or fail-safe;
- If it can cope with permanent faults inside the system, it is a fault-tolerant system;
- If it continues with reduced functionality, it exhibits graceful degradation;
- If it is designed with attention paid to reliability, availability and maintenance or serviceability, it is a dependable system;
- If it is capable of tolerating obstacles caused by internal and external factors and can spring back, recover and continue, then it can be considered resilient.
There are two major ways to achieve any of the properties mentioned above: at the system level, or at the local (technological) level. Obviously, any reasonable combination of both levels is also welcome. In this book we do not want to repeat our papers and books (https://www.springer.com/gb/book/9783319150680, https://www.springer.com/gb/book/9783319468129) but rather to incorporate into the second edition any significant progress that has emerged.
Speaking about ICT systems, especially safety-critical and real-time ones, we might think about the implementation of resilience from the system level down to hardware and system software. In addition, we need to consider that the parts will interact with and support each other.
NFRs (Non-Functional Requirements) of each part of the system were considered, such as:
- Performance;
- Reliability;
- Efficiency (mostly energy efficiency).
Therefore, the systems that we design should be PRE-smart and provide these properties throughout the life cycle.
Neither of our books to date has appeared complete. These books have been used in China, Switzerland, Russia and the USA (mostly by Masters and PhD students), and we have received substantial feedback, such as:
- While reliability of hardware and availability at the system level are explained well, there are no sections or chapters on performance, especially where parallel and distributed systems are concerned;
- How should the classification and properties of resilience be applied to and within distributed systems (as mentioned in the review above)?
- How should real-time and safety-critical applications be treated with respect to system resilience: have the rules for the system and for packages changed?
It was especially satisfying to discover that these segments are being updated by researchers around the globe, who provide excellent contributions to the content.
Thus, our book became an evolving system in itself, aggregating our further efforts with the efforts and results of our colleagues from China, Switzerland, the UK and Russia. Our book has therefore itself become resilient, benefiting from the following contributions:
The performance chapter (including element-level performance and parallel design) was prepared with materials and contributions from:
- Professor Hao Kai, Shantou University, China;
- Simon Monkman, IT-ACS Ltd researcher.
The system software chapters were the result of substantial efforts by:
- Professor Eugeny Zuev and his team in Technopolis, Kazan, Russia.
In turn, the consideration of system-level resilience for distributed systems, requested in 1989, was developed as two chapters, on the system level and on algorithmic implementation, prepared by me and Stephen Farrell. In these chapters we introduce the concept of desperation (for transactions within distributed systems) and show that our existing and new results, some of them patented (https://www.ipo.gov.uk/p-ipsum/Case/PublicationNumber/GB2448351), can be extremely useful in making the whole network really resilient and in achieving far better service for applications, especially where a critical level of use is assumed.
I Schagaev