High Integrity Systems Engineering
Subject Notes for SWEN90010
School of Computing and Information Systems
The University of Melbourne

Contents

1 Introduction to High Integrity Systems Engineering
  1.1 Subject information
  1.2 Introduction to high integrity systems
  1.3 Subject outline

2 Safety-critical and Security-critical systems engineering
  2.1 An introduction to software safety
  2.2 Why safety engineering?
  2.3 Safety engineering
  2.4 Accidents and safety engineering
    2.4.1 Case Study: The A320 Airbus
    2.4.2 Case study: Warsaw, Okecie – 14 September 1993
    2.4.3 Case Study: Boeing 737 MAX
    2.4.4 The role of accident analysis in safety engineering
  2.5 Tools and techniques for safety engineering
    2.5.1 Safety standards and safety lifecycles
    2.5.2 Preliminary hazard analysis
  2.6 Hazard Analysis Methods
    2.6.1 HAZOPS
    2.6.2 Fault Tree Analysis
    2.6.3 FMEA
  2.7 Threat Modelling for Security
    2.7.1 The STRIDE Method
    2.7.2 Attack Trees

3 Model-based specification
  3.1 What is formal specification?
  3.2 What is model-based specification?
  3.3 The costs of formal specification
  3.4 Logic and set theory
  3.5 Introduction to Alloy
    3.5.1 Key characteristics
  3.6 Alloy Logic
    3.6.1 Everything is a relation
    3.6.2 Operators
    3.6.3 Temporal Logic: specifying systems that change over time
  3.7 Alloy Language
    3.7.1 Modules
    3.7.2 Signatures
    3.7.3 Facts
    3.7.4 Predicates
    3.7.5 Functions
    3.7.6 Assertions
    3.7.7 The complete LastPass example
  3.8 Analysis with the Alloy Analyser
    3.8.1 Runs
    3.8.2 Checks
    3.8.3 Scope
  3.9 Abstraction and incremental specification of state machines using Alloy operators
    3.9.1 The specification process
  3.10 Trace-Based Modelling in Alloy
  3.11 Verification and validation

4 Introduction to Ada
  4.1 The history of Ada
  4.2 Ada for programming “in the large”
  4.3 Hello World
  4.4 Specifications and program bodies
  4.5 Comments
  4.6 Types
    4.6.1 The predefined Ada type system
    4.6.2 Arrays
    4.6.3 Defining new types
  4.7 Control Structures
  4.8 Procedural Abstractions and Functional Abstractions
    4.8.1 Parameter Modes
    4.8.2 Calling subprograms
  4.9 access Types: Passing Parameters By Reference
    4.9.1 Access Types to Read-Only Data
    4.9.2 Summary
  4.10 Example: Insertion sort
  4.11 Concurrency with tasks
  4.12 Further reading

5 Safe programming language subsets
  5.1 Principles of correctness by construction
  5.2 Safe programming language subsets
  5.3 SPARK
    5.3.1 SPARK Structure
    5.3.2 What is left out of Ada?
    5.3.3 Why bother with SPARK or other safe programming language subsets?
  5.4 SPARK Examiner
    5.4.1 Annotations
    5.4.2 SPARK Examiner

6 Design by contract
  6.1 Introduction to Design by Contract
    6.1.1 History
    6.1.2 Some Definitions
  6.2 Specifying ADT interfaces
    6.2.1 ADT interfaces as natural-language comments
    6.2.2 ADT interfaces as invariants, preconditions, and postconditions
    6.2.3 ADT interfaces as contracts
    6.2.4 Formal contracts
    6.2.5 Advantages and disadvantages of formal contracts
  6.3 Contracts in SPARK
    6.3.1 Syntax
    6.3.2 Similarity to Alloy
    6.3.3 Tools
  6.4 Additional reading

7 Reasoning about program correctness
  7.1 Introduction to reasoning about programs
    7.1.1 The correctness of binary search
  7.2 A small programming language
  7.3 Hoare logic
    7.3.1 Introduction to Hoare logic
    7.3.2 An introductory example — Incrementing an integer
  7.4 The rules of Hoare logic
    7.4.1 Assignment axiom
    7.4.2 Consequence rule
    7.4.3 Sequential composition rule
    7.4.4 Empty statement axiom
    7.4.5 Conditional rule
    7.4.6 Iteration rule
    7.4.7 Establishing loop invariants
    7.4.8 Array assignment rule
    7.4.9 Procedure call rule
    7.4.10 Other rules
  7.5 Mechanising Hoare logic
  7.6 Dijkstra’s weakest precondition calculus
    7.6.1 Transformers
    7.6.2 Program proof
  7.7 Additional reading

8 Advanced Verification
  8.1 Pointers and Aliasing
  8.2 Safe Programming with Pointers: Avoiding Unsafe Aliasing
    8.2.1 Unsafe Aliasing
    8.2.2 Avoiding Unsafe Aliasing in SPARK
    8.2.3 Anti-Aliasing Assumptions in SPARK
    8.2.4 SPARK Pointer Checks
    8.2.5 Other Programming Languages: Rust
  8.3 Separation Logic
    8.3.1 Pointers in C
    8.3.2 Separation Logic (for C)
    8.3.3 The Meaning of Separation Logic Statements
    8.3.4 The Rules of Separation Logic
    8.3.5 An Example Proof
  8.4 Security: Information Flow

9 Fault-tolerant design
  9.1 Introduction to fault-tolerance
  9.2 Hardware Redundancy
    9.2.1 Redundancy
    9.2.2 Static pairs
    9.2.3 Redundancy and voting
    9.2.4 N-Modular Redundancy (NMR)
  9.3 Software Redundancy
    9.3.1 N-Version Programming
    9.3.2 Recovery blocks
  9.4 Byzantine Failures
  9.5 Information Redundancy
    9.5.1 Duplication
    9.5.2 Parity coding
    9.5.3 Checksums
  9.6 Fault tolerance in practice
    9.6.1 Airbus A310–A380
    9.6.2 Boeing 777
    9.6.3 Boeing 737 MAX Flight Control Redesign
  9.7 Additional reading

Chapter 1
Introduction to High Integrity Systems Engineering

1.1 Subject information

Subject aims

The main aim of the subject is to explore the principles, techniques and tools used to analyse, design and implement high integrity systems. High integrity systems are systems that must operate with critical levels of security, safety, reliability, or performance. The engineering methods required to develop high integrity systems go well beyond the standards for less critical software. Topics will include:

• high integrity system definition;
• programming languages for high-integrity systems;
• safety and security analysis;
• modelling and analysis;
• fault tolerance; and
• proving program correctness.

Subject outcomes

On completion of this subject students should be able to:

• Demonstrate an understanding of the issues facing high integrity systems developers.
• Demonstrate the ability to analyse highly dependable systems and to synthesise requirements and operating parameters from their analysis.
• Demonstrate the ability to choose and make trade-offs between different software and systems architectures to achieve multiple objectives.
• Develop algorithms and code to meet high integrity requirements.
• Demonstrate the ability to assure high integrity systems.

Generic Skills

The subject is a technical subject. The aim is to explore the various approaches to building software with specific attributes into a system. The subject will also aim to develop a number of generic skills.
• We encourage independent research in order to develop your ability to learn independently, assess what you learn, and apply what you learn.
• You will be developing experience at empirical software engineering, and the interpretation and assessment of quantitative and qualitative data.
• Finally, and perhaps most importantly, you will be encouraged to develop critical analysis skills that will complement and round out what you have learnt so far in your degree.

Preconditions

The reason for giving preconditions is that they specify the knowledge assumed at the start of the subject, and consequently the knowledge and skills upon which the subject builds. Preconditions also allow students from universities other than The University of Melbourne, who do not have the formal prerequisites listed above, to judge whether or not they have the knowledge required for the subject. Specifically, the subject assumes:

1. Knowledge and experience of software engineering processes and practices, the risks associated with software and system development, quality and quality assurance, software process, and software project management.
2. Knowledge and experience of requirements elicitation and analysis techniques, requirements modelling and specification, use-case modelling, and UML or similar modelling notations.
3. Knowledge and experience of software architectures, architectural styles and design patterns, design with UML or similar graphical notation, and developing code to meet architectural and detailed designs.
4. Knowledge and experience of discrete mathematics, including predicate logic and formal proof.

Resources

The LMS contains information about the subject staff, mode of delivery, and assessment criteria. It is the primary resource for the subject. All announcements will be made via the LMS. It also contains all assignment sheets, notes, lecture recordings, tutorial/workshop sheets and solutions. http://www.lms.unimelb.edu.au/.
1.2 Introduction to high integrity systems

What is a high integrity system?

High integrity systems are software-controlled systems whose failure could result in harm to humans (including loss of life), harm to the environment, mass destruction of property, or harm to society in general. There are four broad areas that are commonly considered as high integrity systems:

• Safety-critical systems: Systems whose failure may result in harm to humans or the environment. Examples include aerospace systems, medical devices, automotive systems, power systems, and railway systems.

• Security-critical systems: Systems whose failure may result in a breach of security. Examples include defence systems and government data stores.

• Mission-critical systems: Systems whose failure may result in the failure of some deliberate mission. Examples include navigation systems on autonomous aerospace vehicles, computer-aided dispatch systems (e.g. in emergency services), and command-and-control systems in armed forces.

• Business-critical systems: Systems whose failure may result in extreme financial loss to a business or businesses. Examples include stock exchange systems, trade platforms, and even accounting systems (e.g. in a bank).

What is high integrity systems engineering?

The high impact of failure of a high integrity system means that the software controlling the system must be highly dependable. The standard software engineering methods used for constructing good software systems typically do not lead to such dependability. As such, specialised methods are required for engineering high integrity software.

Avizienis et al.’s work on dependability attributes helps us to understand high integrity systems based on six key attributes, shown in Figure 1.1. For a single high integrity system, one or more of these attributes will be of high importance.

How do we achieve high-integrity?
High integrity is achieved using a combination of:

• Rigorous engineering processes, such as strict configuration management, auditing, and review processes. We will not introduce any new theory on these topics in this subject.

• Specialised analysis and design methods, such as safety & hazard analysis and fault prediction methods. These will be covered in Part II of the subject.

• Specialised modelling and analysis methods/notations, such as mathematical logic and set theory. These will be covered in Part III of the subject.

• Fault tolerant design methods, such as redundancy, design diversity, and fault detection mechanisms. These will be covered in Part IV of the subject.

• Specialised assurance methods, such as model checking, proof, and model-based testing. These will be covered in Part V of the subject.

[Figure 1.1: Dependability model from “Avizienis, Laprie & Randell: Fundamental Concepts of Reliability”. The figure decomposes dependability into six attributes:
  Availability — readiness for use
  Reliability — continuity of service
  Safety — no catastrophic consequences for users or the environment
  Confidentiality — no unauthorised disclosure of confidential information
  Integrity — no improper system changes
  Maintainability — the ability to undergo repairs and evolution]

How do we achieve high-integrity?

Methods that are otherwise considered too expensive to apply to “standard” software systems become cost-effective in high integrity domains. This is because the cost of failure far outweighs the effort required to engineer a dependable product.

Of importance is that high dependability cannot be demonstrated by just running a large set of tests. Although testing still plays an important role in high integrity systems, even a really large set of tests usually exercises only a fraction of all possible behaviour. As such, testing does not demonstrate dependability.
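To make the “only a fraction of all possible behaviour” point concrete, here is a back-of-the-envelope sketch. The figures (two 32-bit inputs, one million tests per second) are illustrative assumptions, not taken from these notes:

```python
# Illustrative sketch (hypothetical figures): why exhaustive testing is
# infeasible even for a tiny pure function.

def years_to_test_exhaustively(input_bits: int, tests_per_second: float) -> float:
    """Years needed to run one test per possible input combination."""
    total_cases = 2 ** input_bits              # size of the input space
    seconds = total_cases / tests_per_second   # total testing time in seconds
    return seconds / (365 * 24 * 3600)         # convert seconds to years

# A pure function of two 32-bit integers has 2^64 possible inputs.
# At an optimistic one million tests per second:
years = years_to_test_exhaustively(64, 1_000_000)
print(f"{years:,.0f} years")   # on the order of hundreds of thousands of years

# Even a "huge" suite of ten million tests exercises a vanishing fraction:
fraction = 10_000_000 / 2 ** 64
print(f"{fraction:.1e}")
```

The exact numbers are immaterial; the point is that any feasible test campaign samples a minuscule fraction of the behaviour space, which is why the subject also turns to analysis and proof.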
Perhaps the most important lesson to learn from this introduction is:

“Program testing can be used to show the presence of bugs, but never to show their absence!” — Edsger W. Dijkstra [1].

1.3 Subject outline

Now that we have a better idea about high integrity systems engineering, we define the subject outline:

Part I: Introduction to high integrity systems.

Part II: Safety- and security-critical systems. Topics will be drawn from: safety engineering, accident & hazard analysis for high-integrity systems, security analysis, and cryptography.

Part III: Modelling and analysis. Topics will be drawn from: discrete maths for software engineering, model-based specification and analysis, proof, and lightweight verification.

Part IV: High integrity systems design. Topics will be drawn from: reliability, fault detection, fault containment, fault tolerant design, and design by contract.

Part V: Assurance. Topics will be drawn from: proving program correctness, programming languages for high integrity systems, safe-subset programming languages, programming tools for high integrity verification, and model-based testing.

Bibliography

[1] E. W. Dijkstra. Notes on structured programming. T.H. Report 70-WSK-03, Technological University Eindhoven, Department of Mathematics, 1970.

Chapter 2
Safety-critical and Security-critical systems engineering

2.1 An introduction to software safety

The trend in systems engineering is toward multi-disciplinary systems. Many of these systems are deployed outside of the standard desktop domain on which programming is typically taught. Such a trend has safety implications: if a piece of software is used to control a hardware system, it may have safety-related issues.
For example, consider the following list of applications that have severe safety implications if they fail:

• Computer Aided Dispatch (CAD) systems;
• The A320 Airbus Electronic Flight Control System (EFCS);
• The Boeing 737 MAX 8 MCAS system;
• Biomedical technology, such as pacemakers and surgical robots;
• Train protection systems;
• The space shuttle;
• Automated vehicle braking systems; and
• Chemical plant control systems.

Such systems have influence well beyond the desktop and an individual user. Software itself cannot harm people – it is merely an abstraction. However, software, when used to control hardware or other parts of a system (e.g. to make decisions or send orders on behalf of a human), can be a root cause of accidents and incidents. The systems described above all rely on an integration of different technologies and different engineering disciplines. What is perhaps new to most people in this subject is the reliance on computer systems and software to perform safety-critical system functions.

Aims

The aim of the module on safety engineering is to develop an understanding of the safety engineering approach to developing safety-critical systems. Specifically:

• We will study accidents to get an idea of how they arise, how they are analysed, and what can be done to prevent or mitigate accidents.
• We will study safety engineering processes as the pathway to developing and assuring safety-critical systems.
• We will study the methods of safety analysis and assurance that can be applied at all stages of the development process.

2.2 Why safety engineering?

In the examples above, software is part of a much larger and much more complex environment. Running a huge set of tests to show that your software behaves as you expected does not really impact safety a great deal. In safety-critical systems, testing is a necessary but insufficient means of assuring a safe system.
Remember that testing is about finding faults – not about showing correctness, and especially not about showing safety.

Safety and security are not just software issues. System safety and security involve the hardware that the software is running on, the physical devices to which the software is connected, the people who must operate the systems, and the operating procedures in force.

Software and its relation to safety

The following are important themes in safety engineering:

• Software is always part of a larger system.
• Design faults can exist at various levels:
  – Written requirements do not conform to the actual requirements.
  – The specification does not conform to the written requirements.
  – Source code does not conform to the specification.
  – Object code does not match the source code.
• Incorrect requirements are a major cause of accidents.
• Simple design faults rarely cause accidents on their own.
• Human error is inevitable: “Components developed over several years are hard to modify. Components which have taken several million years to evolve are almost impossible to modify.”

Is safety equivalent to correctness?

One initial mistake is to think that, provided our software is correct, it will be safe. This is fallacious thinking.

Correctness – Correctness alone does not guarantee safety.

Consider a knife. If we assume that it is a good knife and so does not contain any “faults”, then is it safe? The answer of course relies on how it interacts with its environment, not on whether or not it contains faults. If I do not get my thumb out of the way while cutting, I can harm my thumb. The knife is not safe. But it can still function correctly.

Now, let’s turn to a systems engineering example. I can design a braking system for a car that is 100% correct, but 0% safe. My design consists of a braking system with a brake pedal that does nothing. Implementing the design correctly is straightforward: do nothing.
But this is surely not safe – unless the wheels are designed to be square. Again, safety results from how our system interacts with its environment. This does not imply that we should not also strive for correctness. An incorrectly implemented design is more likely to result in safety concerns than a correctly implemented one.

Is safety equivalent to reliability?

Is safety then about reliability?

Reliability – Reliability alone does not guarantee safety.

Consider a box of matches. An unreliable match — that is, one that fails to perform its function according to specification — is far more likely to be “safe” than a reliable match. A common misconception is that if we just get all of the faults out of our system and stop it from failing, then all will be well and our system will be “safe”.

Reliability is the probability that a system will work without failure for a specified period of time in a given environment. Thus, reliability is about failure. Reliability measurements for software are statistical; for example, failure intensity (the number of failures observed per unit time), the Mean Time Between Failures (MTBF), or the failure probability gained from testing the computer-based part of the system. The word failure is the key to understanding software reliability.

Reliability and safety

A failure is a deviation from the specified system behaviour. A failure in a safety-critical system is more commonly a deviation from the intended behaviour – which is to be safe – and especially where the deviation can lead to harm or loss. In the reliability usage of the word failure, a system may be unreliable but still safe, and worse, it may be completely reliable but totally unsafe.

Example 2.1 Let’s consider an example to demonstrate the difference between reliability and safety. We have already seen that correctness is not the same as safety through the example of the braking system. A similar example can distinguish the difference between reliability and safety.

Consider a collection of systems for aircraft. If I fly on an aircraft, I want a set of systems that always fly (reliable), while also not crashing (safe). However, I can easily implement an unreliable but safe system: never start the engines. The plane will not be able to move, so it will be safe. However, it will not be reliable.

Now, consider the converse: something that is reliable but unsafe. We can adapt the braking system example above: simply design our aircraft system to have no brakes. I can implement the braking system reliably, but the aircraft will not be safe. While this example is somewhat extreme, it distinguishes the difference between reliability and safety well. □

2.3 Safety engineering

Safety engineering is the set of processes, methods, and techniques for assuring system safety. The safety engineering approach is characterised by monitoring safety throughout the product development process – just like other quality attributes. The safety engineering process ends by ensuring that there is enough evidence to build a safety case to demonstrate that the product is safe. Typically, it is not sufficient just to say (or show) that a large set of tests have been run, or that a design has been carefully crafted. In most countries, regulatory authorities require that a safety case demonstrates that safety concerns have been identified, that the system has been designed to consider them, and that it has been implemented to its design.

Definition 2.1 A safety case is a high-level argument that is presented to regulatory bodies, or customers, to demonstrate that a system is safe. Safety cases are generally built up using all of the evidence gathered during the exploratory, deductive, inductive and design processes. □

There are several phases throughout the system development lifecycle in which safety engineering plays an important role:
Requirements Analysis and Product Definition – Preliminary Hazard Analysis and other meth- ods are used to determine the hazards or potential situations that can cause harm or environmental damage. Their potential causes and impacts are listed. 2. Design and Implementation Phases – To guide the design, one typically uses a deductive ap- proach: one that starts with potential situation of harm or injury and works back to design or implementation elements. is used. This is typically followed up by an inductive approach: one that starts from component or subsystem failures. During these phases, the system designers and safety engineers work closely to ensure that the design considers the high-risk safety concerns. Systems that must defend against situations of harm or damage must typically be ultra-reliable or must have sufficient redundancy so that a safety related subsystem failure does not lead to a safety related system failure. This is the principle of 15 “Defending in Depth” against potential accidents. We will discuss more about redundancy later in this subject. 3. Safety Cases – Finally, throughout the process, a safety case is constructed, which is used to convinced the customer or regulator authority that the system is safe. That is, all non-trivial hazards are identified and the design considers these and deals with them appropriately. Concepts in safety engineering The terminology used here is consistent with common usage but beware that terminology differs between standards. Safety as it concerns us in this subject is chiefly concerned with physical products. • Platform — The largest engineered artifact that contains the software, the physical devices on which the software will run (and which connect people to the system), and the operating proce- dures. • Unsafe — A product is unsafe if it causes unacceptable harm; for example, in the loss of life, injury or environmental damage. Note an important point here. 
Considering software in isolation from the rest of the system does not tell us anything about its safety or otherwise. When analysing a system for safety, the whole system – including hardware, other software systems, humans, and the environment – must be considered.

• Safe — A product is safe if it is not unsafe.

• Accident — An accident is an unintended event, or more typically an unintended sequence of events, leading to harm.

• Incident — An event that significantly reduces the safety of a platform but does not lead to an accident.

• Hazard — A physical situation or state of a platform that can lead to an accident.

• Failure — An unintended condition of a system that can lead to a hazard. IEC–61508 states that a failure occurs when the delivered service deviates from the specified service.

• Error — An unintended state that can lead to a failure.

• Flaw — A design defect that can give rise to an error under specific conditions.

• Fault — An event that occurs when a flaw is activated, resulting in an error. Note that the definition of fault in safety engineering differs from that in software testing.

2.4 Accidents and safety engineering

Understanding accidents is key to understanding safety. In this section, we look at accidents and causality. First, let’s consider some pertinent examples of accidents in which software was a primary contributor:

Therac-25 – Between June 1985 and January 1987, six patients received massive overdoses – more than 100 times the intended radiation dose – from the Therac-25 radiation therapy machine; three of them died. A commission into the accidents concluded that the primary cause of the overdoses was poor software design, not coding errors.

A320 Airbus Accidents – The Lufthansa A320 accident at Warsaw airport, and the Mulhouse-Habsheim Airport accident (discussed in more detail below).
London Ambulance – In October 1992, the failure of the London Ambulance Service’s new Computer Aided Dispatch (CAD) system was linked to reports of up to 30 deaths, and to problems in dealing with life-threatening incidents [1]. An investigation concluded that the system was poorly designed and implemented, leading to significant delays in the assigning of ambulances. In some cases, an ambulance took up to 11 hours to arrive. An upgrade in July 2006 led to repeated system crashes in July and August, and operators reverted to the old pen-and-paper methods for ambulance dispatch. On 8 June 2011, a completely new CAD system, called CommandPoint, was commissioned. On 9 June 2011, the system developed technical problems and operators again had to revert to the old pen-and-paper methods. The London Ambulance Service reverted to the previous CAD system on the same day, and CommandPoint was not re-instated until May 2012.

Boeing 737 MAX – Two Boeing 737 MAX 8 aircraft crashed, killing all passengers and crew on board, on 29 October 2018 and 10 March 2019. At the time of writing (February 2020) the causes of these accidents are still under investigation. However, a flight-control software system of the 737 MAX called MCAS (Maneuvering Characteristics Augmentation System) is widely believed to have been a major contributing factor.

2.4.1 Case Study: The A320 Airbus

The Airbus A320 is a fly-by-wire aircraft, in which the pilot’s commands are transmitted to the flight control surfaces as electronic signals rather than mechanically via cables and pulleys. It has been described as a “network with an aircraft wrapped around it”. There are about 150 individual computers on board, each of which must communicate with the others. Fly-by-wire (and drive-by-wire) systems are ubiquitous in modern transport. They have many advantages over the old mechanical systems [3]:

• Savings in weight.
• Enhanced control of the aircraft (e.g. anti-lock brakes).
• Simplified retraining.
• Simplified maintenance (i.e.
software instead of hardware).
• The flight engineer is made redundant.
• Enhanced safety, by creating a protective flight envelope and thereby reducing pilot error.
• Marketing advantage.

A320 On Board Systems

The A320 has the following on-board systems:

• Electronic Flight Control System (EFCS).
• Flight Management and Guidance Control system (FMGC).
• Full Authority Digital Engine Control system (FADEC).
• Electronic Instruments – the “Glass Cockpit”.

Habsheim – 26 June 1988

The intention of the flight was to demonstrate the aircraft’s capability at low altitude. It made the first pass slowly at 34ft/10m, flaps extended and landing gear down, successfully. The second pass was to be done at speed in a clean configuration (landing gear raised). The aircraft failed to regain altitude, ploughed into a forest at the end of the runway, and caught fire, killing 3 of the 130 passengers and injuring 34 others. An official investigation of the accident attributed it to the following causes:

• low fly-over, lower than surrounding obstacles;
• low speed with engines at flight idle; and
• late application of “go around” thrust.

Together with the conclusion that the aircraft was flightworthy and that no malfunction of the on-board systems contributed to the crash, the verdict was pilot error. However, many safety experts have observed that aircraft accidents are attributed to pilot error far more often when the pilot dies than when the pilot lives. This suggests a culture of blaming the pilot first, and seeking their explanation second. Further to this, many accidents attributed to “pilot error” are of the form: “the aircraft did X, but the pilots did not notice Y; therefore, they caused the accident”. For example, in some cases, pilots and co-pilots have failed to notice warning lights.
This can be considered quite unfair on the pilots: if neither the pilot nor the co-pilot noticed a warning light, surely that could be considered a failure of the warning-light system for not being noticeable? If we agree with the quote in Section 2.2 that human error is inevitable, we should build our systems on the assumption that humans will make errors, and the system must try to accommodate them.

The Pilot’s Version

The pilot actually survived this crash, and so his version has been recorded as well as the official finding. The pilot’s version stated:

• Company operating procedures were not followed:
  – flying lower than the specified minimum runway fly-over altitude (100ft in landing configuration and 300ft in clean configuration, according to Air France specifications).
  – The flight dossier, including Habsheim and its surrounding environment, arrived too late for the crew to study it, and the crew were never verbally briefed. The first the crew knew about the 30–40ft high trees was when they were 15s flight time away.

• The height chosen for the fly-over was 100ft. The crew thought that they were at 100ft.
  – The captain decided to use the vertical speed indicator and the barometric altimeter to control height (accuracy ±30ft). The reason was that the radio altimeter uses a digital display: at low altitude the terrain numbers change so rapidly that the display is impossible to read. The barometric altimeter, however, uses an analogue display.
  – The barometric altimeter was checked in the pre-flight checks but was faulty; it displayed 67ft just before take-off, before the wheels had left the ground.
  – The crew were used to a large airport, where the control tower is 30m high and the runway is 3km long, rather than a control tower 12m high and a runway 900m long as at Habsheim. The crew were also deceived by other visual cues.
• But technical factors contributed to the accident as well:
  – There are two Flight Management Guidance Computers (FMGCs), one acting as the primary guidance system and one as a standby. These were not communicating prior to the accident. The pilots had to enter all of the flight data by hand into both the primary and the standby computers. Note: if there is a discrepancy between the two FMGCs and the standby is needed, then the navigation screens go blank (a fault noted a day earlier). The crew decided to enter only one data point to minimise data entry errors.
  – The engines took almost 9s to spool up, rather than the 5s that the engines are physically capable of. The pilot’s view here disagrees with the official report, which found that the engines performed according to specification.
  – Pitch control anomalies in the functioning of the Electronic Flight Control Systems (EFCSs).

What can be concluded? “L’affaire Habsheim is not over yet” – Mellor [3].

• The spool-up timing of the engines is critical. If it is 5s, as officially claimed, then this makes the pilot’s story suspect; but if it is 9s, as claimed by the pilot, then the official findings become seriously questionable.

• Whatever the crew’s mistakes were that led to the situation over Habsheim, there is evidence that the functioning of the on-board systems led the pilot to a situation (a hazard) that he could not recover from. This is a point of some controversy still.

• The complexity of the A320’s systems seems to be part of the cause for the controversy.

However, the most important conclusion that we should take from the case study is: the accident, like many others, was a failure of a system which included operational procedures, crew training and environment, as well as the on-board systems.

2.4.2 Case study: Warsaw, Okecie – 14 September 1993

The Kulmbach, a Lufthansa A320-200, was making its final approach on runway 11 at Okecie airport, Warsaw, with 64 passengers and a crew of six on board.
There was a violent storm and the water was not draining well from the newly resurfaced runway. The runway is 2800m in length, but the aircraft touched down 750m along the runway at a speed of 170 knots (see Figure 2.1). This was too late and too fast, but a fully laden A320 can stop in 1000m. In this case there was an accident: the aircraft hit an earthen embankment at the end of the runway, killing the co-pilot and one passenger.

[Figure 2.1: The Lufthansa A320-200 at Okecie airport, Warsaw. Approximate analysis; detailed reports are not yet available. Touchdown: D = 750m, S = 170 knots, T = 0s (no braking); D = 1750m, S = 154 knots, T = 12s (braking); D = 2800m, S = 70 knots, T = 30s (braking).]

Braking logic: landing an A320 normally

Figure 2.2 presents an architectural view of the A320 braking system. The weight on the wheels (WoW) and the altitude (measured by the altimeter) are the two inputs. The logic of the braking system is as follows. Brake if and only if:

    (WoW > 12 tonnes) and
    (wheel speed > 72 knots or
     (wheels spinning and radio altitude < 10 feet))

The Okecie accident sequence

1. The crew were informed that the wind on the ground was coming from the right and slightly in front of the aircraft at 10 knots. The crew were not informed that during the final approach the wind changed to a tail wind of 25 knots.

2. To compensate for the cross wind coming from the front of the aircraft, the crew had chosen a landing speed of 152 knots rather than the standard 132 knots, which was normal procedure in bad weather of this kind.

3. Expecting a side wind, the crew had banked the plane to the right. The result was that only the right landing gear touched the runway; the left gear did not come down for a further 9s. The air/ground (A/G) state is indicated only when both WoW switches are set, so the LGCIU could not signal A/G; consequently the SEC did not deploy the ground spoilers and the FADECs did not deploy reverse thrust.
[Figure 2.2: Braking system for an A320 landing normally, showing the WoW and altimeter inputs feeding the Landing Gear Control Interface Unit (LGCIU), Electronic Flight Control System (EFCS), Spoiler and Elevator Control (SEC), Braking and Steering Control Unit (BSCU), Full Authority Digital Engine Control (FADEC), and the pilot interface.]

4. The wheel that had made contact with the runway was aquaplaning, and so did not rotate at the required speed corresponding to 72 knots. Thus the other condition for the SEC to deploy spoilers was not true.

Conclusions from the Okecie accident

There is a lack of published evidence in the case of the Kulmbach accident at Okecie. Some conclusions are drawn by Mellor [3]:

• The crash was not due solely to pilot error. The pilot may have contributed to the accident by his selection of the landing speed, but the Kulmbach should have been able to stop in the distance even at the speed of 170 knots.

• The drainage of water from the runway and the weather conditions were necessary conditions for the accident, but probably not sufficient on their own. Several other aircraft landed successfully on runway 11, both before and after the Kulmbach, in similar conditions.

• The A320 systems may have performed to specification, but they did not behave as required. In retrospect, the braking logic is too restrictive.

Again, it was the failure of the entire system that led to the accident.

2.4.3 Case Study: Boeing 737 MAX

On 29 October 2018, Lion Air Flight 610 crashed, killing all passengers and crew. On 10 March 2019, Ethiopian Airlines Flight 302 also crashed, leaving no survivors. Both aircraft were Boeing 737 MAX 8s; 346 people were killed in total. The precise causes of each of these accidents are still to be determined, yet it is widely believed that a major contributing factor to both was a software flight-control system of the 737 MAX known as the Maneuvering Characteristics Augmentation System (MCAS). MCAS is a software component of the 737 MAX’s flight control computer.
It is understood to have been added to the 737 MAX in an attempt to have the 737 MAX mimic the handling of a prior 737 series aircraft, the 737 NG. The following is a synthesis of the current understanding of MCAS and its role in these accidents, as drawn from a range of news reports available online in February 2020. This synthesis should therefore be treated as informed speculation gleaned from reputable news media; it should not be considered authoritative. Because both accidents are still under investigation, authoritative sources on the precise causes of the accidents are not yet available.

Why was MCAS added?

The 737 MAX has larger engines than prior 737 series, and those engines are fitted in a different position. Specifically, in order to achieve increased efficiency compared to the 737 NG, the MAX uses a different kind of engine. Those engines are bigger than the ones on the NG and, because of their increased size, they need to be mounted in a different place on the wings compared to prior 737 series. Changing the placement of the engines changes the aerodynamics of the aircraft. In particular, it means that the aircraft becomes more likely to tilt upwards more steeply than is desired while in flight. MCAS’s goal is to compensate for the MAX’s tendency to tilt upwards.

There seems to be some debate about whether MCAS was introduced because (a) the tendency to tilt upwards could lead to the plane stalling (e.g. at low speeds), or (b) the tendency to tilt upwards would simply mean that the MAX handled differently from the prior NG series 737s. In either case, any deviation in its handling from the prior NG series 737s would have required extra training for pilots who were already trained to fly the NGs. As we understand it, avoiding extra training was important for ensuring the MAX was as cheap as possible for airlines to buy and to use. We discuss this more below.
How MCAS Operated

The MCAS system attempts to detect when the aircraft’s nose tilts up too steeply. Technically, the tilt of the aircraft is referred to as its angle of attack (AOA). MCAS attempts to detect when the AOA is too high and, in that case, its job is to automatically push the nose of the plane downwards, lowering the AOA. The aircraft includes AOA sensors. MCAS determines when the AOA is too high by reading one of the AOA sensors. When it determines that the AOA is too high, it adjusts the aircraft’s horizontal stabiliser, which causes the nose to pitch downward. When it does this, we say that MCAS has activated.

Immediate Cause of the Accidents

It is believed that the immediate cause of the accidents was that MCAS repeatedly activated. Each activation pushed the nose of the plane downward, forcing the plane to fly downwards without the pilots being able to correct its trajectory, until it eventually crashed into the ground. It is believed that the repeated activation of MCAS was caused by it receiving faulty AOA sensor readings.

Contributing Factors

There are a number of factors that contributed to these accidents that are worth understanding.

Lack of Redundancy

The first is that, while the 737 MAX has two AOA sensors, the MCAS system took input from only one of them. Thus there was no redundancy that would have allowed it to detect that it was receiving erroneous AOA data. We will discuss more about the need for redundancy later in the subject. However, note for now that the single AOA sensor becomes a single point of failure in the MCAS design. As both accidents appear to indicate, a failure of a single AOA sensor led to the plane crashing without the pilots being able to intervene.

Lack of Human Override

It is believed that in each accident, the pilots repeatedly attempted to bring the nose of the plane upwards to correct its trajectory, by moving the control column.
However, MCAS was designed so that it could not be overridden by the pilots moving the control column. This was in contrast to previous 737 series.

Lack of Documentation and Crew Training

Because MCAS was considered to be a part of the flight control system, it was not described in the flight manual or in training materials. It is believed that this was a deliberate choice to maximise the appearance of similarity to the previous-generation 737 NG. As noted earlier, the goal was to minimise the cost of the aircraft to airlines by ensuring that pilots and crew who were already trained on the NG series would not need much additional training to fly the MAX series. Thus the pilots were likely unaware of the MCAS system, of its potential effects on the plane, and of how to correct for those effects.

Culture: Economics vs Safety

Studying the 737 MAX 8 accidents and their believed causes shows that a range of factors contributed to them. These factors share some common themes, perhaps the most striking of which is an apparent, repeated prioritisation of lowering cost over redundancy and safety. For example, the desire to ensure as much commonality as possible between the handling of the MAX and the earlier NG series, while wanting to use larger, more efficient engines, is believed to have been driven by economic concerns. Firstly, more efficient engines mean that the plane is cheaper for airlines to run, because it uses less fuel. Secondly, commonality with the earlier NG series means that little additional training is required for pilots, which further reduces the cost to airlines who purchase the plane. So both of these choices appear to have been driven by economic concerns: to make the 737 MAX as attractive as possible to potential buyers. As we have seen, these choices nevertheless led to the introduction of the flawed MCAS software system. We have already discussed how MCAS’s design was flawed, because it took readings from only one of the AOA sensors, leading to a single point of failure.
It is reported that the reason for this design choice was also economic: specifically, that design ensured that extra flight simulator training would not be required for pilots. If true, this indicates that the MCAS design violated one of the most fundamental principles of engineering reliable systems — avoiding single points of failure that can lead to catastrophic consequences — for the sake of saving (or, rather, making) money. So not only did economic factors drive the introduction of MCAS; they also drove its flawed design.

Since these incidents, whistleblowers have also raised larger concerns about Boeing’s aircraft and its engineering processes. These include concerns about the oxygen system on the 787 Dreamliner, as well as concerns about the operation of certain Boeing factories in the US, in which safety was not given sufficient priority. While we of course cannot make judgements about the culture at Boeing specifically, together these reports highlight the critical role that culture plays in an organisation. Decisions to prioritise economics over safety (whether made implicitly or explicitly) are ultimately a product of an organisation’s culture and values.

Response and Fix

MCAS Fix

At the time of writing (February 2020), Boeing is working to fix the problems identified with the MCAS system. Its reported fix involves three changes:

1. Having MCAS take AOA readings from both AOA sensors. If the two sensors disagree too much, MCAS will assume one is faulty and will not activate. This guards against a single faulty sensor. MCAS will also only activate when both sensors agree and move from a low AOA reading to a high one. This guards against a fault in which both sensors get stuck reading a faulty high value.

2. When MCAS activates, it will now activate once only, rather than repeatedly.

3. MCAS’s ability to move the horizontal stabiliser has been reduced, thereby allowing the pilots to override it by moving the control column.
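The sensor-agreement logic in the first reported change can be sketched as follows. This is a hypothetical illustration only: the threshold values, function name, and sensor interface are invented for the example, and the actual flight-control logic is not public.

```python
# Hypothetical sketch of the reported MCAS fix: activate only when both
# AOA sensors agree (within a tolerance) AND both move from a low reading
# to a high one. All numeric thresholds are invented for illustration.

AOA_LIMIT_DEG = 15.0          # assumed "too high" angle of attack
MAX_DISAGREEMENT_DEG = 5.5    # assumed allowed disagreement between sensors

def mcas_should_activate(prev_left, prev_right, left, right):
    """Return True only if both sensors agree and both transition
    from a low AOA reading to a high one."""
    # Guard against a single faulty sensor: require agreement.
    if abs(left - right) > MAX_DISAGREEMENT_DEG:
        return False
    # Guard against both sensors stuck at a faulty high value:
    # require a low-to-high transition, not merely a high reading.
    was_low = prev_left < AOA_LIMIT_DEG and prev_right < AOA_LIMIT_DEG
    now_high = left >= AOA_LIMIT_DEG and right >= AOA_LIMIT_DEG
    return was_low and now_high
```

With this sketch, a disagreement between the sensors suppresses activation (removing the single point of failure), and two sensors stuck at a constant high value also suppress it, because no low-to-high transition is observed.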
As part of the ongoing investigation into the 737 MAX, an additional point of vulnerability was identified in the MAX’s flight control systems: the potential for bit-flip errors to occur in the hardware running the flight control computer. As a result, the architecture of the flight control system was overhauled to increase redundancy. We discuss this further in subsection 9.6.3, when we discuss fault-tolerant design.

Broader Consequences and Reflections

Following the second accident on 10 March 2019, all 737 MAX aircraft were indefinitely grounded, pending investigation of both accidents. As of February 2020, all 737 MAX aircraft remain grounded. Aside from the 346 deaths, it remains unclear what precise economic cost the grounding will have on Boeing (remembering, of course, that the MAX’s original design was motivated by a desire to save money). In December 2019, Boeing removed its CEO over the accidents and their handling. In 2019, airlines reportedly cancelled 183 orders for the 737 MAX. Each MAX reportedly costs at least 100 million US dollars, so those cancelled orders alone represent a potential loss of 18.3 billion US dollars. That figure does not account for indirect costs, including those of investigation, remediation and re-certification of the aircraft, nor those due to lost production (production was halted in January 2020). It has been reported that the grounding is projected to lead to losses of 18.4 billion US dollars in future.

The 737 MAX case study neatly illustrates the oft-cited engineering maxim that the best (i.e. the cheapest and fastest over the long term) way to build a system is to build it right the first time, as opposed to taking shortcuts that lead to having to re-engineer a system once it is in production.
Returning to the theme of culture, we may conclude that in safety-critical engineering, a culture that prioritises economics over safety may lead to short-term economic gains, but may ultimately lead not only to catastrophic loss of life but also to higher economic costs over the long term.

2.4.4 The role of accident analysis in safety engineering

Accidents, and the study of accidents, are an important part of safety engineering. The most useful way to identify hazards in safety-critical systems is to look at similar systems, study the factors that contributed to accidents (and incidents) in those systems, and specify safety requirements from both these and the system models (see Figure 2.3). Accident analyses are used during Preliminary Hazard Analysis (for example during HAZOP) to understand potential accidents, their causes, consequences and risks.

[Figure 2.3: The role of accidents in safety engineering. Accident analysis and system models feed preliminary hazard analysis, which produces safety requirements.]

Analysing Accidents: Event Chains

The common method for analysing accidents is to create an event chain leading back to a set of accident causes. Figure 2.4 shows an example of an event chain for an accident in a steam boiler.

[Figure 2.4: An event chain for a steam boiler: moisture → corrosion → weakened metal → (together with moisture) tank ruptured → fragments projected → people injured or equipment damaged.]

Each event in the chain is the cause of the next; for example, moisture is the cause of corrosion, corrosion is the cause of weakened metal in the boiler, moisture and weakened metal are the cause of a ruptured tank, and so on.

Causality

When can it be said that event A causes event B? In accident analysis, one common answer is given by the counterfactual definition of causality.

Definition 2.2 A counterfactual is commonly defined as a conditional expressing what could have happened under different conditions, but did not happen.
It is typically expressed in the subjunctive mood: a conditional (or if-then) statement indicating what would be the case if its antecedent were true, although it is not true. □

As an example, consider the counterfactual: “They would have done better on their assignments, if they had first done the workshops” (take a hint from this). In accident analysis, “A causes B” if, had A not occurred, B would not have occurred. Note that this does not mean that A is the sole cause of B. As in the accidents described above, often multiple factors contribute, in unlikely sequences, to cause a single accident.

Counterfactuals and Causality

In the steam boiler example, we would reason as follows: if the metal had not been corroded, then the metal would not have been weakened. The metal was corroded, but the counterfactual deals with what could have been — in this case, if we remove the corrosion then we can imagine that the metal would not have been weakened (based on our knowledge of corrosion and metal). Consequently the counterfactual is true, and we conclude that the corrosion of the metal causes the weakening.

The essential nature of counterfactuals in accident analysis is the imagined absence of one of the precondition events leading directly to the event being studied. If the consequent is true when we imagine that the antecedent is true, then the counterfactual is true — that is, we imagine the absence of the cause A, and if the event B no longer occurs, then A causes B.

Counterfactuals and correlation

We must also be careful not to confuse causal relationships with correlations. Consider the claim: “if the barometer had not fallen, then the storm would not have occurred”. The barometer falling and the storm occurring are correlated (both are effects of falling atmospheric pressure), but they are not causally linked — the counterfactual is incidental and not true.

2.5 Tools and techniques for safety engineering

The aim of this section is to examine some of the concepts, tools and techniques involved in safety engineering in a bit more depth.
The problem in safety engineering is to develop – with a high level of assurance:

• Systems that do not lead to unintended harm or loss;
• Systems that continue to function in the presence of component faults;
• Systems that perform within hard real-time bounds; and
• Systems that protect data.

2.5.1 Safety standards and safety lifecycles

Software engineering for safety-critical systems involves numerous standards, each of which requires additional activities in order to guarantee due care with safety-critical functions. For example, in IEC–61508, the safety lifecycle is used to identify:

• key activities where safety analysis techniques are used; and
• the relation between safety-critical analysis methods and traditional lifecycle processes.

Engineering for safety, like engineering for other product attributes such as performance, correctness, and usability, is something that must be done over the entire lifecycle. Figure 2.5 shows a possible safety engineering process, and notes some of the techniques and types of methods used in safety engineering. For each project, or each iteration of a project, the phases of requirements, design, and implementation have corresponding verification and validation techniques. At the requirements phase, the focus is more on exploratory hazard analysis to determine what could go wrong, while at later phases, it is about designing the system to prevent these problems. In the remainder of this chapter, we look at some safety engineering techniques. First, we present a more general overview of preliminary hazard analysis.

[Figure 2.5: A safety engineering process. The lifecycle phases (requirements, architecture, detailed design, implementation, unit test, integration test, acceptance test) are annotated with the analysis used at each: exploratory preliminary hazard analysis (HAZOP) at requirements; deductive analysis (fault trees) at design; deductive or inductive analysis (FTA or FMEA) at implementation; inductive analysis during testing; and a safety case built from the deductive and inductive evidence.]

2.5.2 Preliminary hazard analysis

A preliminary hazard analysis is used to identify the hazards and accident scenarios associated with a new system. It is essentially an information collation and filtering exercise, and is usually the first activity carried out in the safety lifecycle. Preliminary hazard analysis methods ask: “What possible accidents or incidents could occur?” In this section, we look at some of the basic concepts in preliminary hazard analysis.

Phase 1 — Preliminary hazard identification

This is the phase that identifies the hazards and accident scenarios associated with the new system.

• Inputs – Information sources
  – Incident and hazard data: previous in-service incident and accident data of similar systems (usually as checklists).
  – Analyses: previous dependability analyses for similar systems.

• Method
  – Group based: members experienced in the domain. This is important because people who have domain knowledge and knowledge about accidents are far more likely to think of accidents for the new system.
  – What-ifs: hypothesise safety-related difficulties.

• Output
  – Preliminary hazard list: a list of all hazards and their causes.

Phase 2 — Preliminary Hazard Analysis (PHA)

This occurs after the preliminary hazard identification, usually at the later stages of requirements analysis and the early stages of design. It has the following properties:

• Possibly included in preliminary design reviews.
• Early identification of hazards and causes in designs.
• It is a poorly defined method:
  – uses brainstorming and questionnaires; and
  – depends on experience and domain knowledge.
• Checklists are commonly used, in an attempt to make the method more systematic and less reliant on experience and domain knowledge.

Inputs
• design (or perhaps a detailed requirements model).
• preliminary hazard list.

Outputs
• list of causes for hazards, based on the design.
• risk assessment and risk classes (provisional).
• integrity levels (provisional).
• target failure rates (provisional).

Hazard Logs

The results of the preliminary hazard identification and preliminary hazard analysis form the basis of a hazard log.

Definition 2.3 A hazard log lists the hazards, their causes and their severity, as well as other data such as target frequencies or hazard types. It is a living document: it evolves during the course of the safety analysis, and can be part of the common literature in special fields such as avionics. □

For example, a preliminary hazard analysis may be summarised in a hazard log with columns such as:

    Ref. | Hazard | Frequency | Severity | Cause | Integrity Level | Target
    ...  | ...    | ...       | ...      | ...   | ...             | ...

Risk

Risk is a function of the severity of a hazardous event and the likelihood of it occurring. The frequency of the event is the probability that the event occurs in a given period of time.

Assessing Risk

To arrive at a qualitative risk assessment, the idea is to assess consequences against frequencies. Standards usually provide a table for this. The safety standard IEC–61508 classifies consequences into: catastrophic, critical, marginal, negligible. It classifies frequencies into: frequent, probable, occasional, remote, improbable, incredible. From these two classifications, the standard defines the following table, which classifies each consequence/frequency pair into one of four risk classes.

                  Consequence
    Frequency     Catastrophic  Critical  Marginal  Negligible
    Frequent      I             I         I         II
    Probable      I             I         II        III
    Occasional    I             II        III       III
    Remote        II            III      III        IV
    Incredible    IV            IV       IV         IV

The four risk classes, according to IEC–61508, are:

Class I Intolerable risk.
Class II Undesirable risk, tolerable only if risk reduction is impractical.
Class III Tolerable risk if the cost of risk reduction would exceed the improvement gained.
Class IV Negligible risk.

Using this, we can interpret that frequent events with a catastrophic consequence must not be tolerated, while incredibly rare events in general can be tolerated.

Integrity Levels

Definition 2.4
Safety integrity is the likelihood of a safety-related system satisfactorily performing the required safety functions under all stated conditions. 2

For example, for a system with known failure rates for subsystems:

Integrity Level    Probability of failure to perform system function
4                  >= 10^-5 to < 10^-4
3                  >= 10^-4 to < 10^-3
2                  >= 10^-3 to < 10^-2
1                  >= 10^-2 to < 10^-1

Note the distinction between risk classes and integrity levels: risk classes are assigned to hazards; integrity levels are assigned to systems.

Assigning Integrity Levels

It is not always possible to remove hazards entirely. Safety engineers must instead reduce hazard risks to tolerable risk levels. The amount of risk reduction determines the integrity level: the more that the system is required to reduce the risk of hazards, the higher the system's integrity level.

There are a number of ways of assigning integrity levels to subsystems, and these can be broken into two categories:

1. Quantitative methods: first determine the uncontrolled risk and then target risk levels numerically, by analysing the failure rates of subsystems from similar situations;
2. Qualitative methods: use risk graphs to determine risk qualitatively.

Some standards, such as IEC 61508, provide for integrity levels that are set according to the severity of the accident.

2.6 Hazard Analysis Methods

Hazard analysis methods fall into three categories:

1. exploratory: these methods use expert knowledge and some systematic techniques to imagine potential hazards and their impacts;
2.
causal: these methods work backwards from hazards, using the system design, to deduce what failures could cause them; and
3. consequence: these methods work forwards from possible failures to determine what effects they could have.

Table 2.1 summarises these three types of methods. In the remainder of this chapter, we will explore the three methods listed as examples in Table 2.1.

2.6.1 HAZOPS

HAZard and OPerability Studies (HAZOPS) is a technique that originated in the chemical processing industry. It has been adapted to many other industries, and specifically in software engineering to software HAZOPS and, surprisingly, the study of interfaces in human-computer interaction.

Technique      Example Methods                              Good For . . .
Exploratory    HAZOP (Hazard and Operability Studies)       Exploring possible hazards and their causes, and for completeness, but is labour intensive and hard to change.
Causal         Fault Trees                                  The classic deductive approach; good for determining what combinations of component failures lead to hazards.
Consequence    FMEA (Failure Modes and Effects Analysis)    The classic inductive approach, used for working forwards from component failures (failure modes) to hazards; good for verifying ideas and designs.

Table 2.1: Types of hazard analysis methods.

A HAZOP study is a well-established technique for preliminary hazard analysis (PHA). The process of identifying hazards requires creativity, lateral thinking, and expert knowledge. The HAZOPS method was introduced into software engineering to provide a repeatable and systematic process for hazard analysis.

The basic technique is straightforward: a system has some intended behaviour. A HAZOP study explores the behaviour by taking a set of generic guidewords that describe some alternative to the intended behaviour, and exploring what hazards could occur if that alternative behaviour occurred. Note: this is different to testing, where we wish to find deviations from specified behaviour.
Guidewords

The guidewords for every HAZOP study are the same (although there is nothing to prevent software engineers from adding or removing guidewords), and are listed in Table 2.2.

Guide Word    Deviation
NO or NONE    The complete negation of the design intention. No part of the intention is achieved and nothing else happens.
MORE          A quantitative increase.
LESS          A quantitative decrease.
AS WELL AS    All the design intention is achieved, together with additions.
PART OF       Only some of the design intention is achieved.
REVERSE       The logical opposite of the intention is achieved.
OTHER THAN    Complete substitution: no part of the original intention is achieved, but something quite different happens.
EARLY         Something happens earlier than expected, relative to clock time.
LATE          Something happens later than expected, relative to clock time.
BEFORE        Something happens before it is expected, relating to order or sequence.
AFTER         Something happens after it is expected, relating to order or sequence.

Table 2.2: HAZOPS guidewords

Roles

There are several roles that take part in a HAZOP study, including:

1. A team leader, who is responsible for:
   • planning and preparation;
   • chairing the HAZOPS meetings;
   • signing any documentation;
   • ensuring follow-up work is completed.
2. A recorder, who is responsible for:
   • documenting the analysis;
   • participating in the study.
3. Designers, who:
   • understand and explain the (potential) design;
   • answer questions about the process.
4. User(s), who:
   • may be operators, or representatives of other user groups such as maintainers of the system;
   • supply information about the context in which the system will be used;
   • help decide issues that may affect safety.
5. Expert(s), whose key function is to explore:
   • ask questions;
   • suggest deviations, causes and effects.

Process

The process for a HAZOP study is outlined in Figure 2.6, taken from [2].
Figure 2.6: A HAZOP study process.

From this figure, we can see that each design item, such as an event in a process or a transition in a state machine, is inspected. For each item, the item is analysed using the HAZOP guidewords.

For example, consider our earlier idea of a brake-by-wire system for a car. One design item contains the intended behaviour that, when the brake pedal is pushed, a signal must be sent to the ECU immediately to let it know that this event has occurred. Using a HAZOP study to analyse what happens if the system deviates from intention, we can apply the guidewords NONE, EARLY, and LATE, and then reason the following:

• NONE: If no signal is sent, the ECU will not know that the brake has been pushed, so will not send the signal to apply the brakes to the wheels. This will result in the vehicle not decelerating, ultimately resulting in a hazard.
• EARLY: We want the signal to be sent immediately, so a signal cannot be sent early, and therefore no hazard occurs.
• LATE: If the signal is sent late, the ECU will process it late and send a signal late to the brakes. This will result in the vehicle decelerating later than anticipated, ultimately resulting in a hazard.

The result is a log of the entire analysis. For each design item + guideword pair that may result in a hazard, the following details are recorded:

1. The deviation from intent (the interpretation of the design item + guideword for this system).
2. The causes of the deviation.
3. The consequences of the deviation.
4. Any safeguards that are in place to prevent the deviation or its cause.
5. A list of recommendations for mitigating the problem.

A template for this is shown in Table 2.3.

ID    Guideword    Deviation    Causes    Consequences    Safeguards    Recommendations
1.1   NONE         . . .        . . .     . . .           . . .         . . .
1.2   EARLY        . . .        . . .     . . .           . . .         . . .
1.3   LATE         . . .        . . .     . . .           . . .         . . .

Table 2.3: A template for a HAZOP log.
Limitations

There are several limitations to using HAZOPS as a preliminary hazard analysis technique:

1. It requires a well-defined description of the intended behaviour. In some projects, this may not be available until quite late.
2. It is extremely time and resource intensive. Several people in a team analysing every design item for every guideword is time consuming, and the people in the team are generally highly experienced. Despite this, the process is creative and challenging, and many people find it rewarding and interesting. Not me though . . .
3. To conform to most safety standards, to devise a safety case, and to enable traceability of decisions, the result of the entire analysis must be recorded. That is, the outcomes of each guideword + design item pair must be recorded, including causes, consequences, and recommendations for mitigating the problem. This results in large amounts of documentation that need to be reviewed and maintained.
4. It focuses on single failures that cause a deviation from intent. However, an accident can occur due to a series of unlikely circumstances along a chain of events in a system. Despite this, the use of guidewords may prompt system designers to consider chains of deviations.

Example 2.2
The following example of a stepper motor HAZOP study is taken from Wei and Zhangjin [5]. A stepper motor is an electric motor used for rotation of mechanical parts. The rotation is divided into a discrete number of equal-sized steps. The position of the motor can therefore be determined without the use of sensors for feedback, provided the step size is known. Figure 2.7 shows the workings of a stepper motor: a single step moves the teeth around until they are locked in, so counting the steps means that the position of the motor is known. The system is real-time, so when the controller of the motor sends a message for it to rotate, power on, etc., it must do so within a specified (and short!) time frame.
The controller uses a C interface to control the motor, shown in Figure 2.8. Motor identifiers can be created, and a specified motor can be powered on/off, moved forward a number of steps, and braked. First, we need to infer what the deviations from intent are for each guideword:

Figure 2.7: An overview of a stepper motor.

void write_lpt(struct motor_t *motor);
struct motor_t stepper_motor_setup(int motor);
void stepper_motor_power_on(struct motor_t *motor);
void stepper_motor_power_off(struct motor_t *motor);
void stepper_motor_move_steps(struct motor_t *motor);
void stepper_motor_set_safe(struct motor_t *motor);
void stepper_motor_brake(struct motor_t *motor);

Figure 2.8: C interface for a stepper motor.

Item    NONE        MORE           MORE           EARLY         LATE        BEFORE        AFTER          REVERSE
Power   No voltage  Over voltage   Under voltage  Voltage       Voltage     —             —              Backflow
                                                  change        change
Step    No steps    More steps     Less steps     —             Late time   —             Wrong order    Wrong direction
Brake   No brake    Too quickly    Too slowly     Too early     Too late    Wrong order   Wrong order    Accelerate
. . .

Now that we have the deviations from intent, we need to analyse the possible hazards that arise as a consequence of these. We do just part of the functionality, for moving the stepper motor:

3.0 stepper_motor_move_steps(struct motor_t *motor);

ID    Guideword    Deviation    Causes                    Consequences                Safeguards    Recommendations
3.1   NONE         No steps     No power to motor         Motor does not move
3.2   MORE         More steps   Power too high            1. Motor goes too fast.                   Install a voltmeter. Do not apply
                                                          2. Motor may burn out.                    voltage if it is over a specified
                                                                                                    threshold.
3.3   LESS         Less steps   Low power due to old      Motor rotates too slowly                  Install a voltmeter. Raise an error
                                wire or large resistor                                              if voltage is under a specified
                                                                                                    threshold.
. . .

From just these few guidewords, we have noted some potential hazards and provided recommendations for them.
Clearly, from the report, some knowledge of stepper motors and their operation is required, even for this small part of the analysis. 2

2.6.2 Fault Tree Analysis

Fault trees are one of the classic deductive techniques for hazard analysis, and also one of the most common deductive techniques in use today. Fault trees are typically drawn by working back from hazards to basic causes. Fault trees are a graphical notation and consist of events and gates.

Event and gate symbols

Figures 2.9 and 2.10 show the symbols that are used in fault-tree analysis, divided into event symbols and gate symbols respectively.

Method

The basic method for working with fault trees is the following:

1. Select an event: Initially we select top-level events, which are typically the hazards, or events that can lead to the hazards, identified in the PHA.
2. Identify immediate, necessary and sufficient causes of this event:
   • Immediate: means that we need to avoid missing out intermediate events, and should apply the "think small" principle;
   • Necessary: means determining those events which must occur for the top-level event to occur; these are linked with an AND gate;
   • Sufficient: means determining those events which alone are sufficient to cause the top-level event on their own; these are linked with an OR gate.

Basic Event: an initiating fault requiring no further action.
Undeveloped Event: an event which is not developed further, either because it is considered unnecessary or because insufficient information is available.
Intermediate Event: an event arising from the combination of other, more basic events.
Normal Event: an event which is expected to occur as part of the normal operation of the system.
Conditioning Event: specific restrictions that apply to some types of logic gates, for example the PRIORITY AND or INHIBIT gates.
Figure 2.9: Fault tree symbols

AND gate: all of the input events must occur for the output to occur.
OR gate: one or more of the input events must occur for the output to occur.
INHIBIT gate: the output occurs if the single input event occurs in the presence of the enabling condition, which is described in the conditioning event.
EXCLUSIVE OR gate: the output occurs if exactly one of the input events occurs.
PRIORITY AND gate: the output occurs if the input events occur in a specific sequence, which is described in a conditioning event.

Figure 2.10: Fault tree gates

3. Classify intermediate events:
   • A basic event requires no further decomposition.
   • A compound defect, which may be one of: (1) a primary failure, which is a simple component failure and may be classified as a basic event; (2) a secondary failure, in which components fail because of external influences; this usually requires further investigation; or (3) a command failure, in which components receive incorrect signals; this always requires further investigation.
   • A system defect, which is not attributable to a single component and always requires further investigation.
4. Repeat as required.

Some heuristics

There are some heuristics that have evolved over time for the correct construction of fault trees:

1. All inputs to a gate should be defined before any one is examined in more detail.
2. The output of a gate must never be directly linked to the input of another gate: all outputs of gates must go to events.
3. The text in event boxes should be complete, in the sense that it should always say what the event is and when it occurs.
4. Causes always chronologically precede consequences.

Example: Chemical mixing plant

Consider the simple chemical mixing plant shown in Figure 2.11 and its accompanying fault tree [2]. The top-level event that leads to a hazard in this case is that the tank may overflow and cause environmental damage.
The immediate, necessary and sufficient causes of this event are that the valve marked A is closed and that the valve marked B is open. Note that these events must occur before the tank overflows. The fault tree does not investigate the causes of the event "valve A closed" further. On the other hand, valve B is open if valve B has failed, which is a component failure but is not investigated further, or an incorrect signal has been sent to valve B, which is a command failure and must be investigated further.

Figure 2.11: A design of a simple chemical mixing plant and its corresponding fault tree. (The tree decomposes "Tank overflows" into "Valve A closed" and "Valve B open"; "Valve B open" decomposes into "Valve B failed" or "Incorrect control to valve B", the latter arising from a failed controller or failed level sensing via sensors X and Y.)

2.6.3 FMEA

Failure Modes and Effects Analysis (FMEA) is a tabular technique and is one of the most widely applied inductive (consequence analysis) techniques in industry. The method specifies that safety engineers should focus on a single component failure, or failure mode, and work forward through subsystems to determine the consequence of the failure mode on the rest of the system. There is a great deal of variation in FMEA practice and, like most hazard analysis methods, it is hard to give a general algorithm for conducting an FMEA.

Applications

FMEA is good at identifying individual elements or operations within a system that render it vulnerable; that is, finding single points of failure. Industries and organisations that frequently use FMEA include: consumer products, automotive, home appliances, aerospace, NASA, the Department of Defence, and process industries (e.g. chemical processing).

Failure modes

An FMEA is about identifying the failure modes of a system.

Definition 2.5
A failure mode is a manner in which a failure can occur, in terms of failure of the item function (such as a function of a sub-system or component) being studied. 2

Typical failure modes are:

1. Premature operation.
2.
Failure to operate at the required time.
3. Failure to stop operation at the required time.
4. Failure during operation (this is specific to the equipment).
5. Degraded operational capacity.
6. Excessive operational capacity.

There is a symmetry here between failure modes and the guidewords for HAZOPS; e.g. excessive operational capacity and the guideword "MORE".

Method

Step 1: Define the system to be analysed and obtain the necessary designs, descriptions, diagrams, charts, component lists, etc. The idea is to understand in depth what exactly it is that you are analysing.
• All of a subsystem, or just part of it?
• What subsystems or operators must be protected?
• What other subsystems must the subsystem being analysed interact with?

Step 2: The FMEA analysis then uses two questions to guide the analysis.
1. For each analysed element, what are the failure modes? The failure modes:
   • can be identified from testing or prior experience with the component;
   • can be identified by appeal to similar components used before;
   • may be technology-specific failure modes; or
   • can be analysed from damaged components.
2. For each failure mode, what are the failure effects? The safety engineer will provide general descriptions of the failure mode and effects, which may also include severity and probability assessments.

Worksheet

The result of an FMEA is an FMEA worksheet: a monster of a spreadsheet that documents the failure modes and effects, along with other items, such as causes, probabilities, severities, and recommendations. A worksheet template is shown in Figure 2.12.

Limitations

Like other safety analysis methods, FMEA has several limitations:

• Frequently, human errors and hostile environments are overlooked.
• Because the technique examines individual faults of system elements taken singly, the combined effects of coexisting failures are not considered.
• If the system is at all complex, the process can be tedious and time consuming.
• Failure probabilities can be hard to estimate.
• Sometimes FMEA is done only to satisfy the altruistic urge to "do safety". FMEA will find and summarise system vulnerability to single-point failures, and it will require lots of time, money and effort. An important question to ask before undertaking an FMEA is: how will the results be used?

2.7 Threat Modelling for Security

In this section we will discuss common threat modelling methods. Similar to hazard analysis, which aims to enumerate hazards, threat modelling aims to enumerate potential security threats to a system. The STRIDE methodology, developed at Microsoft, is a common threat modelling method that has some similarities with the HAZOPS method (see subsection 2.6.1). Attack trees are another common method for threat modelling that has similarities to fault tree analysis (see subsection 2.6.2). We briefly introduce both, drawing heavily on the opening chapters of [4]: Threat modeling: Designing for security, by Adam Shostack. Wiley, 2014. You can learn more there, and this book is currently available as an e-book from the University's library: http://library.unimelb.edu.au.

2.7.1 The STRIDE Method

The STRIDE method, akin to HAZOPS, can be seen as a systematic process for brainstorming security threats to a system. While HAZOPS focuses on design items, in which one tends to concentrate on the interactions and communications between system components, STRIDE instead focuses on the interactions that occur across a system's trust boundaries.

Trust Boundaries

Definition 2.6
A trust boundary separates one part of a system from another, where the two parts are under the control of different entities who are trusted to differing degrees. 2

In any system, there will be parts that we as system designers control, such as the software or the hardware components that we supply ourselves.
There will also be other parts that we don't control: for instance, if we are using an external cloud provider then we likely will not have full control over those parts of our system that run on the cloud platform. A trust boundary exists, therefore, between the parts of the system that we control and those parts that we do not control. Of the parts we don't control, some may be under the control of one entity while others might be under the control of another. If we trust these entities to differing degrees (e.g. perhaps we trust one more than the other because we have a contractual relationship with them) then a trust boundary will also exist between those parts of the system.

Example 2.3
Suppose we are building a website for a business, like the University. The website itself will be hosted on University servers. However, media like videos and images will be served via a third-party content distribution network (CDN). This scenario is depicted in Figure 2.13. Each component in this system lives inside a separate trust boundary. In particular, while the web server might be under our control, the user and her web browser are not, and neither is the third-party CDN. However, since we have a contractual relationship with the third-party CDN, we are willing to trust it not to act in a way that would violate that contract. Therefore, we trust it more than we trust the users (and potential attackers who might be on the network). 2

The STRIDE Acronym

STRIDE is an acronym: each letter corresponds to a class of threats. The STRIDE approach is about considering each class of threats and how it might apply to the system. This is somewhat analogous to HAZOPS, in which each guideword is applied to a system to enumerate hazards. In a similar way, for each of the classes of threat that STRIDE describes, we consider all of the ways in which that class of threat might apply to our system, i.e. what sorts of attacks might be carried out.
STRIDE's purpose is to help systematically enumerate the attacks or threats that our system might be exposed to, and to force us to think about which threats are more serious and should be mitigated in our system's design. STRIDE therefore forces security engineers to "think like attackers". The threat classes that the letters of the STRIDE acronym correspond to are:

Spoofing: pretending to be something or someone that you are not.
Tampering: modifying something you are not supposed to.
Repudiation: claiming that you didn't do something (when you actually did do it).
Information Disclosure: revealing or stealing information that you are not supposed to be able to access.
Denial of Service: preventing others from accessing a system, e.g. by crashing it, making it too slow to use, exhausting its storage, etc.
Elevation of Privilege: being able to do something that, technically, you are not allowed to do (e.g. obtaining Administrator or root access on a server).

We consider each of these briefly, illustrating each using the example system above by giving a few examples. Note that the lists below are not exhaustive. Also note that the different threat categories are not mutually exclusive: some threats might legitimately belong in multiple categories. For instance, suppose an attacker is able to modify the access logs for a system. This threat could be a repudiation threat, since changing the logs might allow them to claim that they did not do some action that would have otherwise been recorded in the log. However, this threat is also a tampering threat. The goal with STRIDE is not to achieve a perfect categorisation of threats, but rather to act as a tool to aid threat brainstorming.

Spoofing

This threat occurs when an attacker pretends to be someone or something that they are not. For instance, an attacker might impersonate a legitimate user of a system (e.g. if they can guess the user's password).
Or, in the case of a phishing attack, an attacker might impersonate a legitimate website. Spoofing threats violate the security goal of authentication.

Example 2.4 (Spoofing Threats for Figure 2.13)
• One user pretends to be another user of the website.
• The attacker impersonates the university's website (via a phishing attack).
• A third-party site contains a link to one of the university's pages that causes the user to perform some action on the university site; a Cross-Site Request Forgery (CSRF) attack. 2

Tampering

This threat occurs when some data is modified that is not allowed to be modified. The modification could occur to data at rest (e.g. stored in a database or on a website) or to data in transit (e.g. as it is being transmitted over the network). Tampering threats violate the security goal of integrity.

Example 2.5 (Tampering Threats for Figure 2.13)
• A rogue administrator for the CDN company tampers with the website content hosted on the CDN.
• An attacker tampers with data in transit between the web server and the CDN.
• The attacker modifies the website contents (website defacement). 2

Repudiation

This threat occurs when an attacker is able to do something bad, but then later on claim that they did not do it. It usually occurs when a system does not have sufficient logging of actions, or when the logs are not properly protected. Repudiation threats violate the security goal of non-repudiation.

Example 2.6 (Repudiation Threats for Figure 2.13)
• The attacker deletes or modifies the access logs for the web server.
• The attacker fills up the disk space on which the log file is being stored. 2

Information Disclosure

This threat occurs when an attacker is able to learn information that they should not be able to learn. It also includes attacks in which an attacker is able to cause sensitive information to be publicly released. Information disclosure threats violate the security goal of confidentiality.
Example 2.7 (Information Disclosure Threats for Figure 2.13)
• The attacker is able to steal the private key used to secure users' connections to the university web site.
• A rogue administrator at the CDN accesses sensitive website media.
• An attacker is able to steal the password file for the website. 2

Denial of Service

This threat occurs when an attacker is able to prevent the system from providing service to a legitimate user. Denial of service threats are often much harder to stop than other kinds of threats. Denial of service threats violate the security goal of availability.

Example 2.8 (Denial of Service Threats for Figure 2.13)
• The attacker commands a bot army to flood the website with traffic.
• The offsite CDN server goes down. 2

Elevation of Privilege

This threat occurs when an attacker is able to perform some action that the system's security policy does not allow. The most common form of this threat is when an attacker is able to gain "superuser" access (i.e. Administrator or root access) to a system when logged in as an ordinary user. Elevation of privilege threats violate the security goal of access control.

Example 2.9 (Elevation of Privilege Threats for Figure 2.13)
• Somebody connects directly to the CDN server to access website content that would otherwise be password-protected.
• The web server does not perform access control checks, allowing one website user to access another user's content by issuing appropriate HTTP GET requests. 2

2.7.2 Attack Trees

Similar to fault trees, attack trees aim to deductively explore all of the ways in which a threat might be realised. The simple variant of attack trees that we consider here has only one type of node, representing some event that takes place. Similarly to fault trees, the root node of the tree is the ultimate threat or event whose occurrence we are worried about. The attack tree answers the question: "how could this event occur?"
The children of each node in the tree are the immediate causes or ways in which the event that the node represents might be realised or might occur. When a node has multiple children, they usually represent different ways in which an event can occur. In comparison to fault trees, they therefore represent an implicit OR node. However, different attack tree variants can contain explicit AND and OR nodes, similarly to fault trees. Given their similarity to fault trees, we do not discuss attack trees in detail here. Consult Shostack [4] for further details, or see the various online resources on attack trees for more information.

Bibliography

[1] Anthony Finkelstein and John Dowell. A comedy of errors: the London Ambulance Service case study. In Proceedings of the 8th International Workshop on Software Specification and Design, page 2. IEEE Computer Society, 1996.
[2] J. McDermid, M. Nicholson, D. Pumfrey, and P. Fenelon. Experience with the application of HAZOP to computer-based systems. In Proceedings of the Tenth Annual Conference on Computer Assurance, pages 37–48. IEEE, 1995.
[3] Peter Mellor. CAD: computer-aided disaster. High Integrity Systems, 1(2):101–156, 1994.
[4] Adam Shostack. Threat modeling: Designing for security. John Wiley & Sons, 2014.
[5] Z. Wei and W. Zhangjin. HAZOP introduction.

Figure 2.12: An FMEA worksheet template

Figure 2.13: STRIDE Example System. (A user's browser, the university web server, and a third-party CDN, each inside its own trust boundary.)

Chapter 3

Model-based specification

In this chapter, we will look at model-based specification, one style of formal specification. We will look at the Alloy language for specification, and how this can be used to model systems and check properties about the models. In the workshops, we will also look at a tool, called the Alloy Analyser, for bounded checking of Alloy specifications. This chapter currently covers Alloy 6, released in late 2021.

3.1 What is formal specification?
Formal specification is the use of mathematical notation to describe properties of a system. The use of mathematical notation provides precision that is not possible with natural language specification, which is subject to ambiguity and incompleteness. Formal specifications provide the specifier with a way to formally define what a system must do, without saying how it should do it. Thus, they are as formal as a programming language, but they do not constrain the design of the underlying algorithms or data structures.

Because formal specifications do not constrain design detail, and are independent of their implementation, they can be produced early in the project lifecycle; e.g. at the requirements and architectural design phases. At these phases, they are particularly valuable. As well as providing an unambiguous reference for system designers and implementers, the mathematical nature of formal specifications provides software engineers with the opportunity to prove properties about the specification, such as correctness, completeness, and consistency, but also other properties, including safety and security.

In this chapter, we will use the Alloy specification language [1]. Alloy is based on the Z language [4], which is the most widely used and probably most widely known formal specification language. Alloy sacrifices the expressiveness of Z for computational tractability; that is, some expressive features of Z are removed to allow us to automatically check properties of Alloy specifications. In this subject, we use Alloy for two reasons:

1. It offers an intuitive notation.
2. The Alloy Analyser tool for checking Alloy specifications is a state-of-the-art research prototype, which has been applied to many industry systems, and is supported by a large community of users.

3.2 What is model-based specification?

Model-based specification is an approach to formal system specification in which the system is defined as a state machine model.
That is, there is a state representing the data in the system, and operations specify the transitions that can take the system from one state to another. This is much like the idea of a module/package in Ada: the variables define the state, and the subprograms define the transitions. The clean mapping from model-based specifications to modules is one reason that model-based specification has become the preferred method for formal specification, as opposed to axiomatic or algebraic approaches, which specify the properties of systems via sets of axioms.

Model-based specification in the software engineering lifecycle. Model-based specification is not a requirements elicitation technique, nor is it generally a requirements engineering technique. Many model-based specifications are finalised during architectural design, once the components of the system are known. However, the formalisation of the behaviour of system components implicitly leads to a formalisation of those system requirements that the components implement. As such, model-based specification spans the phases of requirements engineering and architectural design. We should be comfortable with this, as in modern software engineering, the line between requirements and architectural design is blurred.

Figure 3.1, adapted from Sommerville [3, Ch. 27], shows the relationship between different activities in the early software engineering phases, and how model-based specification is typically carried out in parallel with other activities. Information flow between the activities is two-way — information is always passed between the requirements and design phases.

Figure 3.1: Model-based specification in the software process.

Languages for model-based specification. There are several well-known languages for model-based specification, including VDM, B, Z, and Alloy.
These languages all have commonalities, especially that data types are based on sets and relations, and operations are specified using predicate logic. UML statecharts are another example of model-based specification, although the level of formality in these is less than in VDM, B, Z, and Alloy, as the intended areas of use are different.

Model-based specification and natural language. Like programming languages, formal, model-based specifications cannot be understood fully by a simple read over. As such, documenting the specification using natural language is important, just like commenting program code. Natural language descriptions not only allow the reader to understand the specification at a superficial level without having to worry about detail, but also help the reader to understand the formal details, by guiding them through the meaning.

3.3 The costs of formal specification

Using a rigorous process and method for formally specifying system properties can clearly be a costly exercise. The activity of specifying the precise detail of systems and sub-systems using mathematical notation requires careful thought and increased knowledge about the systems. Further, formal specifications must come under the same scrutiny of quality assurance, using reviews and inspections, as other requirements specification techniques. The question to be asked is: "Is the cost worth it?".

A major barrier to people using formal specification, even in high integrity domains, is that the perceived cost is too high. However, studies repeatedly show the following: the cost of specification is higher, but some of the added costs are recouped later in the process. Figure 3.2, taken from Phillips [2], shows data from a project, comparing the number of errors found per thousand lines of code when using the Z specification language, against not using formal specification at all. The project from which the data comes is the CICS project undertaken at IBM.
The project introduced 268,000 lines of new code into a system already containing 500,000 lines of code. Of the new code, 48,000 lines were engineered from a formal specification written in Z, and Z was used at the product-, component-, and module-level design stages. Other studies show similar results.

Let's analyse this figure and ask ourselves why formal specification would reduce the costs of later phases. There are a number of reasons for this:

1. Using mathematical notation removes ambiguity from specifications, which is a cause of many problems in software. This helps to limit the amount of re-work in later phases, driving down their costs. In high-integrity systems, requirements typically change much less frequently than in, say, enterprise systems, so the models do not require major changes every increment.

2. Some cost incurred during design and implementation is simply trying to decide exactly what the behaviour of the system should be. Using natural language to specify requirements means that we can be shorthanded, and therefore miss or ignore important detail. Using mathematical notation does not afford this, because it is easy to show incompleteness, and many of the issues that generally present themselves during design and implementation must be solved during specification instead. In other words, the people doing the specification do some of the work typically undertaken at the design and implementation phases.

3. Despite opinion to the contrary, designing and implementing a system from a formal specification is more straightforward than doing so from natural language requirements. For a novice, this may not be the case, but for someone who is as comfortable with mathematical notation as they are with programming languages, this is most definitely the case.

4. Using formal notation offers the possibility of proving that our specification satisfies certain properties.
Discovering in later phases that our specification does not satisfy some properties is more costly than discovering this in early phases.

Figure 3.2: The costs of formal specification on the CICS project.

5. Post-release costs are reduced due to the fact that fewer mistakes are introduced over the process; or more accurately, the same number of mistakes are made, but more are made and found during specification, rather than post-release.

A simple (and common) objection to formal specification is that it is just too hard: the formality is too difficult for the average programmer to grasp. I agree that this is the case. I would not advocate employing formal specification on a project with programmers who have no experience of it. However, I disagree that this has to be the case. The average programmer is already intelligent enough to have learnt a formal language: all programming languages are formal. As such, we know they are equipped to deal with formal languages, even at the specification level. However, an important extension to this argument is the following: formal specification is much simpler than programming — we only have to specify what a system should do, whereas our program must specify how to do it.

If programmers were trained in formal specification languages the same way they were trained in programming languages, reading and using formal specifications would not be an issue. Thus, the cost of formal specification would be significantly less than the cost of implementation – although its value may not be. That is not to say that formal specification should be used in every organisation or on every project: it is only cost effective when the cost of releasing software with faults is greater than the cost of preventing them. In environments where the biggest issue is responding to change (e.g.
change of customer preferences, change of requirements from a client), ensuring the software is correct is not as high a priority as getting something out that works for most normal cases. In high integrity systems, formal specification is generally only applied to those parts of the system that contain high-integrity requirements. Other parts of the system, for example, data storage or the user interface, are engineered using more standard techniques.

3.4 Logic and set theory

Most formal model-based specification languages are based on first-order predicate logic and set theory; both topics with which all students in this subject are expected to be familiar. The set theory part of Alloy contains the standard set-theoretic concepts, such as set comprehension, power sets, set operators (union, intersection, subset, etc.), Cartesian products, relations, and functions. The predicate logic part of Alloy contains propositional operators ("and", "or", "implies", etc.) and existential and universal quantifiers ("for all", "there exists").

The most important concepts with which you are expected to be familiar are:

• First-order logic: conjunction ("and"), disjunction ("or"), implication, negation, universal quantification ("for all"), and existential quantification ("there exists").

• Set theory: set membership, set comprehension, set union, set intersection, set difference, and subsets.

• Relations: Cartesian product, domain and range of relations, sequences, partial functions, and total functions.

If you are not familiar with these concepts, or the details are hazy, Chapter 3 of Daniel Jackson's book Software Abstractions: Logic, Language, and Analysis [1] provides a detailed description of these concepts, and how to express them in Alloy. If you cannot access this, Chapters 2–9 of Woodcock and Davies' book "Using Z" [4] provide a detailed description of basic logic and set theory, which can also be read lightly to pick up the most important aspects.
A PDF version of this book is available on the LMS for download.

Why sets and predicates? Most model-based languages are based on sets and predicates. The reasons are quite simple:

1. Set theory (sets, relations, functions, etc.) provides us with enough notation to specify the data of any system. Just as importantly though, it restricts us from specifying design detail, and forces us to think more abstractly — something which is important at the specification level.

2. Predicate logic provides us with sufficient expressive power to specify what changes the data of a system can undergo over time (or the effect of operations on data). Just as importantly, it restricts us from specifying how those changes should be implemented. These first two points are important: requirements engineering is already difficult enough without us thinking about design and implementation problems at the same time.

3. Together, the proof theory and understanding of set theory and predicate logic are strong enough that we can prove many properties specified using these parts. We can take advantage of hundreds of years of work in these areas and apply this to software engineering — a field that is relatively new in comparison.

Example 3.1

As an example of using logic and set theory to specify software and system behaviour, and how this differs from programming, consider the trivial example of sorting a list of integers. To specify the behaviour of a procedure that takes a (possibly unsorted) list and sorts it, we have to say that:

1. each element in the input list is present in the output list (that is, the output list is a permutation of the input list); and

2. each element of the output list is less than or equal to the next (that is, the list is sorted).

We can specify what it means for a list of integers to be sorted:

∀ list : seq Z • sorted(list) ⇔ (∀ i : 1 .. #list − 1 • list(i) ≤ list(i + 1))

This states that, for all sequences of integers, that sequence is sorted if and only if: for each index (except the last) in the list, the integer at that index is less than or equal to the integer at the next index.

We can also specify what it means for one list to be a permutation of the other:

∀ a, b : seq Z • permutation(a, b) ⇔ (∀ el : Z • count(el, a) = count(el, b))

in which count is defined as:

∀ el : Z; list : seq Z; c : N • count(el, list) = c ⇔ c = #(list ▷ {el})

That is, count(el, list) = c if and only if c is the number of elements when we take the range restriction of list on el only.

Finally, we can specify the precondition and postcondition of a procedure using these concepts:

procedure sort(in_list : in seq Z; out_list : out seq Z) is
  pre  #in_list > 0
  post sorted(out_list) ∧ permutation(in_list, out_list)

Here, we have added a superficial precondition that the list is not empty. So, this example specifies the functional behaviour of sorting a list, but does NOT specify how the list should be sorted. Any correct algorithm for sorting integers will suffice to fulfil the specification above. □

Example 3.2

Using the definitions above, we can specify the behaviour of a binary search algorithm in a straightforward manner:

procedure binary_search(list : in seq Z; target : in Z; index : out Z) is
  pre  sorted(list)
  post target = list(index) ∨ (index = −1 ⇔ target ∉ ran(list))

In this example, we have called the procedure "binary_search", but only because we specify in the precondition that the input list must be sorted. Other than this, the algorithm is not constrained to any particular binary search algorithm (e.g. recursive or iterative), and in fact, is not truly constrained to a binary search algorithm at all. We can see from this example: specifying binary search behaviour is much neater than specifying the algorithm to achieve it. □

3.5 Introduction to Alloy

In this section, we will present the Alloy language.
We will present only a small subset of the language, but it is enough to allow specification of, and reasoning about, abstract state machines. Alloy has a rich notation for combining predicates, which allows flexible and incremental building of specifications. We will not go into detail on all of these operators, but we will use some of the basic components to enable clean specification.

An Alloy reference manual is available online at: https://alloytools.org/spec.html It provides detailed syntax and semantic definitions.

3.5.1 Key characteristics

There are a few key ideas to Alloy that are worth noting.

Alloy = logic + language + analysis

The complete Alloy framework consists of three parts:

1. Logic: first-order predicate logic, plus a simple relational calculus.

2. Language: a handful of operators for structuring specifications, such as specifying types, predicates, functions, and assertions.

3. Analysis: an exhaustive search (in bounded domains) for counterexamples, using SAT (satisfiability checking).

Non-specialised logic

The logic of Alloy is not tied to a particular paradigm, such as concurrent modelling or state machines. However, as of Alloy 6, it has a number of useful features for modelling state machines. In this subject, we will focus primarily on using Alloy as a model-based language for specifying state machines.

Not proof, but counterexamples

The focus of the Alloy Analyser is not to prove that a model has a particular property, but instead to find counterexamples that demonstrate the property does not hold. Finding such a counterexample demonstrates that the property does not hold, while failing to find one demonstrates that the property may hold. In particular, Alloy encourages the use of small, finite domains on which to do analysis.
That is, rather than searching for a counterexample in a large or infinite set of objects, it encourages us (at least during initial analysis) to use small domains of only a few objects; e.g. only three usernames, URLs, and passwords. This approach has two advantages:

1. Using small domains means the analysis is much faster, as there are fewer objects to check.

2. Most assertions are (initially) wrong, and small counterexamples to them can be found. Having small counterexamples means that understanding the counterexample itself is more straightforward.

The aim of the Alloy Analyser, though, is not to search a handful of arbitrary "test" cases in a small domain, but to search all possible cases in that domain. The real domain may be infinite, but a small, finite set representing that domain will find almost all flaws.

3.6 Alloy Logic

In this section, we present the logic used in Alloy, which is based on predicate logic and relational calculus. We will use the example of a password management system, such as LastPass, which remembers the username and password combinations for particular websites (represented by their URL).

3.6.1 Everything is a relation

Alloy uses relations for all data types, including scalars and sets. Sets are just unary relations; e.g.:

Username = {(U0), (U1), (U2)}
URL = {(UR0), (UR1), (UR2)}
Password = {(P0), (P1), (P2)}

Scalars are just singleton sets:

myUsername = {(U0)}
myPassword = {(P1)}

and relations are just sets of tuples:

passwords = {(U0, P1), (U1, P2)}

The above represents a mapping of usernames to passwords, in which username U0 maps to password P1. An alternative could be to have a password for a particular URL:

urlPasswords = {(U0, UR0, P1), (U0, UR1, P2), (U1, UR0, P2)}

Here, user U0 has two passwords for two different URLs (UR0 and UR1), while user U1 has only one password. An important restriction of relations in Alloy is that all relations are first-order; that is, we cannot have sets of sets, or relations of relations.
If using a set of sets is a natural way to represent a problem, we instead need to find another representation. This restriction is what permits the Alloy language to be (relatively) easily analysed by automated tools.

3.6.2 Operators

Set operators. Sets (represented as unary relations) are a commonly-used data type in formal modelling. In Alloy, the main set operators are included in the logic:

Operator   Meaning        Example
+          union          {(U0)} + {(U1)}
&          intersection   {(U0)} & {(U1)}
-          difference     {(U0), (U1)} - {(U1)}
in         subset         {(U1)} in {(U0), (U1)}
=          equality       {(U0), (U1)} = {(U0), (U1)}

Cross product. A cross product, represented using the operator × in set theory, is represented as follows in Alloy:

Operator   Meaning        Example
->         cross product  Username->Password

As an example, consider the cross products taken from URLs, usernames, and passwords:

Username = {(U0), (U1), (U2)}
URL = {(UR0), (UR1), (UR2)}
Password = {(P0), (P1), (P2)}

Username->Password = {(U0, P0), (U0, P1), (U0, P2),
                      (U1, P0), (U1, P1), (U1, P2),
                      (U2, P0), (U2, P1), (U2, P2)}

Username->URL->Password = {(U0, UR0, P0), (U0, UR0, P1), (U0, UR0, P2),
                           (U0, UR1, P0), (U0, UR1, P1), (U0, UR1, P2),
                           ...
                           (U2, UR2, P0), (U2, UR2, P1), (U2, UR2, P2)}

Join operators. There are two operators for joining relations in Alloy:

Operator   Meaning    Example
.          dot join   URL.passwords
[]         box join   (Username->Password)[myUsername]

These operators are related via the following properties:

a.b = b[a]
a.b.c[d] = d.(a.b.c)

A dot join is a relational join, which matches the range of the relation on the left with the domain of the relation on the right.
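Since everything in Alloy is a set of tuples, these operators have direct analogues over sets in an ordinary programming language. The following Python sketch is illustrative only (it is not Alloy, and not part of the notes' examples): it models the relations above as sets of tuples, and mirrors union, difference, subset, cross product, and the dot join.

```python
from itertools import product

# Sets are unary relations (sets of 1-tuples); a scalar is a singleton set.
Username = {("U0",), ("U1",), ("U2",)}
Password = {("P0",), ("P1",), ("P2",)}
myUsername = {("U0",)}
passwords = {("U0", "P1"), ("U1", "P2")}   # a binary relation

# Alloy's set operators + - in correspond to Python's | - <= on sets of tuples
assert {("U0",)} | {("U1",)} == {("U0",), ("U1",)}   # union (+)
assert {("U0",), ("U1",)} - {("U1",)} == {("U0",)}   # difference (-)
assert {("U1",)} <= {("U0",), ("U1",)}               # subset (in)

# Cross product (->): concatenate each pair of tuples into one longer tuple
cross = {u + p for u, p in product(Username, Password)}
assert len(cross) == 9 and ("U0", "P1") in cross

def dot_join(a, b):
    # Dot join (.): match the last column of a with the first column of b,
    # and drop the matched column from the result
    return {x[:-1] + y[1:] for x in a for y in b if x[-1] == y[0]}

# myUsername.passwords joins the scalar with the relation: U0's password
assert dot_join(myUsername, passwords) == {("P1",)}
# A box join b[a] is, by the property above, just another spelling of a.b.
```

This is only a simulation of the semantics on small, concrete relations; the point is that every Alloy operator is an ordinary operation on sets of same-length tuples.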
For example, for the relations:

myUsername = {(U0)}
myUrl = {(UR1)}
urlPasswords = {(U0, UR0, P1), (U0, UR1, P2), (U1, UR0, P2)}

the following hold:

myUsername.urlPasswords = {(UR0, P1), (UR1, P2)}
myUsername.urlPasswords[myUrl] = {(P2)}

The first join matches the U0 in myUsername to the same value in urlPasswords.

Restriction and override. The concepts of domain restriction, range restriction, and function override are represented as follows:

Operator   Meaning             Example
<:         domain restriction  someUsernames <: urlPasswords
:>         range restriction   urlPasswords :> Password
++         function override   urlPasswords ++ (myUsername->myUrl->newPassword)

Function override is defined as: A ++ B = (A - (domain[B] <: A)) + B.

Informally, domain restriction, B <: A, takes the relation A and restricts it to contain only the tuples whose domain is in B, while range restriction, A :> C, restricts A to contain only the tuples whose range is in C. For example, if we have:

myUsername = {(U0)}
myUrl = {(UR1)}
someUsernames = {(U0), (U1)}
somePasswords = {(P1), (P3)}
newPassword = {(P4)}
urlPasswords = {(U0, UR0, P1), (U0, UR1, P2), (U1, UR0, P2), (U2, UR3, P3)}

then the following hold:

someUsernames <: urlPasswords = {(U0, UR0, P1), (U0, UR1, P2), (U1, UR0, P2)}
urlPasswords :> somePasswords = {(U0, UR0, P1), (U2, UR3, P3)}

For override, we supply at least one new tuple, which is added to the relation, but with the further constraint that it overrides the value of any existing tuple with the same domain; e.g.:

urlPasswords ++ (myUsername->myUrl->newPassword) =
    {(U0, UR0, P1), (U0, UR1, P4), (U1, UR0, P2), (U2, UR3, P3)}

Propositional logic operators

Operator   Alternative   Meaning
!          not           negation
&&         and           conjunction
||         or            disjunction
=>         implies       implication
<=>        iff           if-and-only-if

Predicate logic quantifiers. Assuming that x is a variable, e is a set expression, and F is a formula, the following are valid formulae:

Operator         Meaning
all x : e | F    F holds for every x in e
no x : e | F     F holds for no x in e
some x : e | F   F holds for at least one x in e
lone x : e | F   F holds for at most one x in e
one x : e | F    F holds for exactly one x in e

For example, the following two predicates specify that: (1) there is at least one password for the user corresponding to myUsername; and (2) no username has more than one password:

(1) some pwd : Password | (myUsername->pwd) in passwords
(2) all user : Username | lone pwd : Password | pwd in user.passwords

The last four of these keywords can also be applied to relations:

Operator   Meaning
no e       e has no elements/tuples
some e     e has at least one element/tuple
lone e     e has at most one element/tuple
one e      e has exactly one element/tuple

For example, we can re-write the two examples above as:

(1) some myUsername.passwords
(2) all user : Username | lone user.passwords

Set declarations. In predicates and other constructs that use sets, modifiers are used to restrict the elements in declarations:

Operator      Meaning
x : set e     x is a subset of e
x : one e     x is a singleton subset of e
x : lone e    x is an empty or singleton subset of e
x : some e    x is a non-empty subset of e
x : e         shorthand for x : one e

Set comprehensions. New sets can be constructed from expressions:

Operator             Meaning
{x : e, y : f | F}   The set of tuples that satisfy formula F

For example, we can construct the set of usernames that have the password myPassword:

{user : Username | user in passwords.myPassword}

Cardinalities

Operator            Meaning
#e                  The number of tuples in expression e
0, 1, 2             Integer literals
+, -                Addition, subtraction
=, <=, <, >=, >     Comparison operators

Note that there is no multiplication or division!
Multiplication and division are not included in the language because SAT solvers are typically unable to deal with them in the general case.

let formulae and expressions. A let expression is a shorthand:

Operator               Meaning
let x = e | F          Within the formula F, x abbreviates the expression e
let x = e | expr       Within the expression expr, x abbreviates e

For example:

all user : Username | let p = user.passwords | lone p && some p

3.6.3 Temporal Logic: specifying systems that change over time

As of Alloy 6, released in late 2021, Alloy's logic includes temporal operators that allow one to talk not only about whether something is true in the current state, but also about whether it is true in future or even past states. These additions make it much easier to specify state machines and, in particular, transitions between states and properties that need to hold at certain times.

Next state. The ' (prime) operator is used to evaluate an expression in the next state. For instance, we can specify that the value of user.passwords does not change between the current state and the next:

user.passwords' = user.passwords

Temporal predicate operators. Given predicates P and Q, the temporal operators allow us to talk about whether they hold in certain states before or after the current one. Specifically, a series of states of our model is called a trace. Given a trace, suppose the current state is the ith state in the trace.
Then we define what it means for each of the temporal operators to hold in state i as follows:

Operator       Meaning
always P       P holds in all states ≥ i
after P        P holds in the (i+1)th state
eventually P   P holds in some jth state where j ≥ i
P until Q      Q holds in some jth state where j ≥ i, and P holds in every kth state such that i ≤ k < j
P releases Q   Q holds in every state ≥ i up to and including a state k in which P holds; or there is no such k, in which case Q holds in every state ≥ i
P ; Q          P holds in the ith state and Q holds in the (i+1)th state

For instance, we can specify that in the next state user.passwords is empty:

after (no user.passwords)

3.7 Alloy Language

The Alloy language provides constructs for structuring specifications using the logic presented in the previous section.

3.7.1 Modules

Alloy modules are analogous to modules in programming languages, and are used to model self-contained specifications that can be imported into other modules and re-used.

3.7.2 Signatures

Signatures are essentially type declarations:

Operator               Meaning
sig A {}               Declare a set of atoms called A
sig B {}               Declare a set of atoms called B, which is disjoint from A
sig A, B {}            As above
sig B extends A {}     Declare a set B that is a subset of A (B in A)
abstract sig A {}      An abstract set of atoms
sig B, C extends A {}  B and C are disjoint, and ((B+C) in A)
sig A {f : e}          f is a binary relation of type f : A -> one e

The var keyword can be applied to signatures or to fields to specify things that can change over time. For example, we can declare a set A of atoms as a variable set: var sig A {}. Doing so means that the atoms contained in A in the current state might not be contained in A in future states, and vice-versa; i.e. atoms from A can appear and disappear over time, depending on how we constrain the transitions of our model.

The var keyword is more often applied to signature fields, to model fields whose values can change over time.
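The trace-based reading of the temporal operators above can be simulated outside Alloy. The following Python sketch is illustrative only and approximates the semantics over a finite list of states (real Alloy 6 traces are infinite); a "predicate" here is simply a function from a state to a boolean, and the states are hypothetical sets of (user, url, password) triples.

```python
# Evaluate temporal operators at position i of a finite trace of states.
# This is an approximation: Alloy 6 traces are infinite (lasso-shaped).

def always(p, trace, i=0):
    return all(p(s) for s in trace[i:])

def after(p, trace, i=0):
    return i + 1 < len(trace) and p(trace[i + 1])

def eventually(p, trace, i=0):
    return any(p(s) for s in trace[i:])

def until(p, q, trace, i=0):
    # Q holds in some state j >= i, and P holds in every state k, i <= k < j
    return any(q(trace[j]) and all(p(trace[k]) for k in range(i, j))
               for j in range(i, len(trace)))

# A hypothetical trace: each state is the set of stored password triples.
trace = [set(),
         {("U0", "UR0", "P1")},
         {("U0", "UR0", "P1"), ("U0", "UR1", "P2")}]

assert after(lambda s: len(s) == 1, trace)                    # next state has 1 entry
assert eventually(lambda s: ("U0", "UR1", "P2") in s, trace)  # P2 eventually stored
assert until(lambda s: len(s) < 3, lambda s: len(s) == 2, trace)
assert not always(lambda s: len(s) > 0, trace)                # initial state is empty
```

The same shape of definition underlies the Analyser's checking, except that it reasons about all traces of the model symbolically rather than evaluating one concrete trace.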
For example, we can construct types for keeping track of passwords at particular URLs:

sig URL {}
sig Username {}
sig Password {}
sig PassBook {var password : Username -> URL -> Password}

Here, PassBook models a set of "known" username and URL combinations, and the passwords for these. There is, at most, one password per pair (password is a function).

Note that if A is a non-var set, then we cannot write var sig B extends A {}; i.e. variable signatures cannot extend non-variable ones. However, non-variable sigs can extend variable ones.

3.7.3 Facts

Facts are (possibly named) constraints that are assumed to hold in all initial states. They are introduced using the fact keyword.

Operator        Meaning
fact f { F }    Declare a fact called f that introduces a formula F

For example, we can express that a PassBook can have at most one password for each user/URL pair:

fact NoDuplicates {
  always all pb : PassBook, user : Username, url : URL | lone pb.password[user][url]
}

Note that this fact uses the always keyword to say that it holds not only in the initial state, but in all subsequent states too (e.g. after certain operations might be performed on the PassBook). Such facts can be used to model invariants over the data types: constraints that always hold for the data types.

3.7.4 Predicates

Predicates are named formulae with declared parameters. Predicates can be accessed from within other predicates.

Operator                Meaning
pred p [x : e] { F }    Declare a predicate called p, which has variable(s) x, and formula F

In this subject, we will primarily be using predicates to introduce operations over a state. That is, a set of signatures will define the state space, and predicates will define the possible transitions over that state space.
For example, we can have operations that add and delete passwords to a PassBook:

//Add a password for a new user/url pair
pred add [pb : PassBook, url : URL, user: Username, pwd: Password] {
  no pb.password[user][url]
  pb.password' = pb.password + (user->url->pwd)
}

//Delete an existing password
pred delete [pb : PassBook, url : URL, user: Username] {
  one pb.password[user][url]
  pb.password' = pb.password - (user->url->Password)
}

Notice how we use pb.password' to refer to the password mapping in the next state, after these operations have run.

The add predicate specifies an operation over a PassBook pb, and three input variables user, url, and pwd. The first line of the predicate part states that there is not already a password for this user/URL pair (the set formed by pb.password[user][url] is empty). This is a precondition: the behaviour of the operation is not defined if this does not hold. A precondition specifies an assumption made about the initial state from which an operation occurs. We refer to that state as the pre-state.

The second line is the postcondition, which specifies what transition should happen if the precondition holds. In this case, a union (+) is used with the ' symbol to specify that in the subsequent state pb.password is the same as in the pre-state, but with the new details added. The subsequent state reached after an operation is performed is called the post-state, so pb.password' refers to the value of pb.password in the post-state.

The delete predicate specifies an operation that deletes a password from the PassBook. The precondition requires that a password exists, and the postcondition uses set difference to remove the tuple from the relation of passwords.

Predicates can be accessed within other predicates.
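The pre-state/post-state reading of these predicates can be made concrete with a small Python sketch (illustrative only, not the notes' Alloy semantics): model the PassBook state as a set of (user, url, password) triples, and model add and delete as boolean checks that relate a pre-state to a post-state, exactly as the Alloy predicates relate pb.password to pb.password'.

```python
# A state is the set of (user, url, password) triples currently stored.

def add(pre, post, user, url, pwd):
    # precondition: no existing password for this user/URL pair
    no_existing = not any(u == user and r == url for (u, r, _p) in pre)
    # postcondition: post-state is the pre-state plus the new triple (+)
    return no_existing and post == pre | {(user, url, pwd)}

def delete(pre, post, user, url):
    # precondition: exactly one password exists for this user/URL pair
    existing = {(u, r, p) for (u, r, p) in pre if u == user and r == url}
    # postcondition: post-state is the pre-state minus that triple (-)
    return len(existing) == 1 and post == pre - existing

s0 = set()
s1 = {("U0", "UR0", "P1")}
assert add(s0, s1, "U0", "UR0", "P1")       # a valid add transition
assert not add(s1, s1, "U0", "UR0", "P2")   # precondition fails: pair exists
assert delete(s1, s0, "U0", "UR0")          # deleting restores the pre-state
```

Note that, like the Alloy predicates, these functions do not "execute" an operation: they merely check whether a given pair of states is a permitted transition.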
For example, we could break the precondition of the add predicate into its own predicate, and reference this:

//True if and only if the user has no password for this URL
pred preAdd [pb : PassBook, url : URL, user: Username] {
  no pb.password[user][url]
}

//Add a password for a new user/URL pair
pred add [pb : PassBook, url : URL, user: Username, pwd: Password] {
  preAdd [pb, url, user]
  pb.password' = pb.password + (user->url->pwd)
}

The semantics of the add operation is exactly as if the body of preAdd were included where it was referenced, with the appropriate variable renamings. Note that this is not sequential composition. That is, preAdd is not "executed", followed by the line after it. The formulae in preAdd and add are evaluated as one larger conjunction.

3.7.5 Functions

Functions are named expressions with declared parameters. They can be 'invoked' by predicates or other functions by providing an expression for each parameter.

Operator                     Meaning
fun f [x : e1] : e2 { E }    Declare a function called f, which has variable(s) x, and whose body is the expression E, of type e2

For example, we can look up the password for a given user/URL pair:

//Return the password for a given user/URL pair
fun lookup [pb: PassBook, url : URL, user : Username] : lone Password {
  pb.password[user][url]
}

Note the expression/type given to the function: (lone Password). This function can be used as an expression as part of another formula. For example:

//Check if a user's passwords for two urls are the same
pred samePassword [pb : PassBook, url1, url2 : URL, user : Username] {
  lookup [pb, url1, user] = lookup [pb, url2, user]
}

3.7.6 Assertions

Assertions are named constraints that are intended to follow from a model.
They are expressed as named formulae:

Operator          Meaning
assert a { F }    Declare an assertion called a, with the formula F

For example, we can express the following assertion, which says that the add predicate "works":

//If we add a new password, then we get this password when we look it up
assert addWorks {
  all pb : PassBook, url : URL, user : Username, p : Password |
    add [pb, url, user, p] => (after (lookup [pb, url, user]) = p)
}

Notice how we use the after operator to talk about what should be true in the state after add has occurred (i.e. in the post-state). Assertions can be checked by the Alloy Analyser (see Section 3.8). Typically, assertions play two roles: (1) to express properties that should hold on the model, which we can then check to see if there are flaws in our model (or flaws in our assertions!); and (2) to express essential, higher-level properties of the model, such as relationships between predicates, which give us a better understanding of the model itself, and possible ways to refine or restructure it.

3.7.7 The complete LastPass example

module LastPass

/*
 * LastPass password map
 *
 * A simple example to explain basics of Alloy.
 *
 * The 'PassBook' keeps track of a set of users' passwords for a set of URLs.
 * For each User/URL pair, there is one password. Passwords can be added,
 * deleted, looked up, and changed.
 * A user can also request all of their passwords for all URLs.
 *
 * author: Tim Miller; updated by Toby Murray for Alloy 6, including simplifications
 */

sig URL {}
sig Username {}
sig Password {}
sig PassBook {var password : Username -> URL -> Password}

fact NoDuplicates {
  always all pb : PassBook, user : Username, url : URL |
    lone pb.password[user][url]
}

//Add a password for a new user/url pair
pred add [pb : PassBook, url : URL, user: Username, pwd: Password] {
  //no pb.password[user][url]
  pb.password' = pb.password + (user->url->pwd)
}

//Delete an existing password
pred delete [pb : PassBook, url : URL, user: Username] {
  one pb.password[user][url]
  pb.password' = pb.password - (user->url->Password)
}

//Update an existing password
pred update [pb : PassBook, url : URL, user: Username, pwd: Password] {
  one pb.password[user][url]
  pb.password' = pb.password ++ (user->url->pwd)
}

//Return the password for a given user/URL pair
fun lookup [pb: PassBook, url : URL, user : Username] : lone Password {
  pb.password[user][url]
}

//Check if a user's passwords for two urls are the same
pred samePassword [pb : PassBook, url1, url2 : URL, user : Username] {
  lookup [pb, url1, user] = lookup [pb, url2, user]
}

//Retrieve all of the passwords and the url they are for, for a given user
pred retrieveAll [pb: PassBook, user : Username, pwds : URL -> Password] {
  pwds = (pb.password)[user]
}

//Initialise the PassBook to be empty
pred init [pb: PassBook] {
  no pb.password
}

//If we add a new password, then we get this password when we look it up
assert addWorks {
  all pb : PassBook, url : URL, user : Username, p : Password |
    add [pb, url, user, p] => (after (lookup [pb, url, user]) = p)
}

//If we update an existing password, then we get this password when we look it up
assert updateWorks {
  all pb : PassBook, url : URL, user : Username, p, p2 : Password |
    after (lookup [pb, url, user]) = p =>
      (add [pb, url, user, p2] => (after (lookup [pb, url, user]) = p2))
}

//If we add and then delete a password, we are back to 'the start'
assert deleteIsUndo {
  all pb : PassBook, url : URL, user : Username, pwd : Password |
    (add [pb, url, user, pwd] ; delete [pb, url, user]) =>
      pb.password = pb.password''
}

run add for 3 but 1 PassBook
run add for 5 URL, 5 Username, 10 Password, 1 PassBook
check addWorks for 3 but 1 PassBook expect 0
check updateWorks for 3 but 1 PassBook expect 0
check deleteIsUndo for 1 but 1 PassBook

3.8 Analysis with the Alloy Analyser

The Alloy language supports two types of construct that are related specifically to the Alloy Analyser. That is, they say nothing about the model itself, but merely tell the analyser what to check, and provide some parameters for this checking. In these notes, we briefly discuss the main concepts in analysing Alloy models, and we will explore this more in the workshops.

3.8.1 Runs

The run command instructs the analyser to search for instances of values that satisfy a predicate:

Operator         Meaning
run p scope      Find a valid binding of variables for predicate or function p, such that these values satisfy p's formula, using the declared scope.

For example, given the operation add defined in the previous section, we can check that add is satisfiable in a domain of size three using:

run add for 3

This instructs the analyser to search for values for the variables, assuming that there are three instances of each signature in the universe; that is, three usernames, three URLs, three passwords, and three passbooks. We can restrict different signatures to have different sizes. For instance, add talks only about one PassBook pb, so it makes sense to restrict the size of this set to just 1 when running the predicate.

run add for 3 but 1 PassBook

As well as bounding the size of signatures, we can also bound the number of transitions that Alloy explores. For instance, add talks only about a pre-state and a post-state, i.e. only one transition.
Therefore it makes sense to run it for just a transition between two states:

run add for 3 but 1 PassBook, 1..2 steps

This specifies that the domain should be of size three, except only one PassBook, and only two states (states 1 and 2), i.e. one transition. If an instance can be found, this means that the predicate/function is indeed satisfiable. If an instance cannot be found, this means that the predicate/function is possibly unsatisfiable: it may be that the scope we used is simply not large enough to find a combination that satisfies the formula. The idea of scope is discussed in more detail in Section 3.8.3.

3.8.2 Checks

The check command instructs the analyser to search for instances of values that violate an assertion, that is, a counterexample to the assertion:

Operator         Meaning
check a scope    Find a counterexample to the assertion a: values that satisfy not F, where F is a's formula, using the declared scope.

For example, given the assertion addWorks defined in the previous section, we can attempt to find a counterexample to the assertion in a domain of size three using:

check addWorks for 3

If a counterexample can be found, this means that the assertion does not hold on the model; i.e. the assertion is incorrect. If a counterexample cannot be found, then the assertion may hold, or there may be a counterexample that exists outside of the specified scope.

3.8.3 Scope

The idea of scope is straightforward. The Alloy Analyser does bounded checking, meaning that it can (potentially) try every combination of assignments of instances to variables, although the search algorithms are much more sophisticated and eliminate many infeasible combinations early on. For models that use the var keyword, the scope also includes the number of transitions in the model to search up to. The scope can be modified to increase or decrease the number of instances that are used in the search.
For example, we can use the following:

run add
run add for 5
run add for 5 but 3 PassBook
run add for 5 URL, 5 Username, 10 Password, 2 PassBook
run add for 5 but 1..2 steps

The first line omits a scope, in which case the default scope of three is used. When the scope doesn't include a bound on the steps, the default is to check for 1..10 steps. The second command specifies that five instances of all signatures should be used, while the third says the same, except only three instances of PassBook are created. The fourth one explicitly lists the number of instances for each signature type. The final one also specifies a bound on the number of steps (transitions to search). Perhaps confusingly, 1 steps refers to 0 transitions, i.e. to traces that contain a single state. In general, k steps means traces containing k states and k-1 transitions. Therefore "1..2 steps" means traces with a single transition. One can of course search for instances of add within longer traces, but doing so doesn't make much sense for this model, since add only talks about two states (one transition). There are several other ways to restrict the scope used in analysis. See Jackson's book on Alloy [1] for information on these.

The small scope hypothesis

By finding instances and counterexamples to predicates, functions, and assertions, we are essentially just doing exhaustive testing using a bounded set of instances. As such, we cannot prove that an assertion holds for all instances in the universe. This is essentially like testing but, in actuality, it is much more powerful. The Alloy philosophy relies on the small scope hypothesis, which is:

Most faults have small counterexamples.

That is, if an assertion is incorrect, it often, but not always, has a counterexample that exists in a small scope of around 3-5 instances per signature and a handful of transitions (around 10).
The Alloy Analyser exhaustively analyses all cases within the declared scope; therefore, if a counterexample exists within that scope, it will be found. This is in contrast to testing, which will select a small subset of the scope as test inputs. Using larger scopes increases the chance of finding a counterexample of course, but also takes more CPU time and memory to analyse. A good strategy is to start with a small scope of three and a few transitions, which is usually sufficient to find flaws in our models early on, eliminate these flaws, and once all of our runs and checks pass, extend the scope to be much larger, and leave the analyser to search for a period while we go for a coffee.

3.9 Abstraction and incremental specification of state machines using Alloy operators

In the previous section, we saw how to create operation predicates from smaller operation predicates; e.g. by specifying the precondition of the add operation in its own predicate preAdd and referencing this. Using predicates as building blocks in this way is an excellent abstraction mechanism that can be used to our advantage in specifying systems. The formal specification process is not as simple as sitting down and writing a specification out. As with programming, specifications should be put together incrementally. Abstraction mechanisms can be used to perform this top down, allowing us to specify greater detail once we have a better understanding of the problem at hand.

3.9.1 The specification process

Figure 3.3 outlines a process model for defining a formal specification. Clearly, this model is idealised, and at any step there is a likelihood of returning to any of the earlier steps; however, it serves well as a model. The problem is tackled top down. Given a set of informal requirements, we define the system components and specify them. After this, we bring all components together.
Figure 3.3: A process to develop a model-based specification of a system.

To specify a single component in isolation, we first consider the data types, and then the state, init operation, and normal operations. This is a specification of how the component works in an ideal world. After this, we add the exceptional behaviour. The total operations are then composed from the normal and exceptional behaviour operations. Finally, the component specifications are composed into a system specification.

State machines

In these notes, we focus on the use of Alloy to specify and analyse abstract state machines: that is, an abstract model containing a state (e.g. PassBook) and transitions between these states. The LastPass password storage system defined in this chapter is one such example. To do this, we used the abstract state machine pattern. Using this pattern, we use predicates to specify operations on a global state. An abstract state machine in Alloy will have a structure something similar to:

sig State { ... }

pred init [s: State] { ... }
pred inv [s: State] { ... }

pred op1 [s: State] { ... }
...
pred opN [s: State] { ... }

in which init is the initialisation operation, inv is the state invariant, and op1 ... opN are a set of N operations, such as add and delete in the password storage system. A state invariant is a formula that all operations in the specification must preserve.

Operation preconditions

We have already seen an example of splitting a precondition from its postcondition in an operation. However, an alternative way, which is preferred by many modellers, is to specify the precondition and postcondition together as separate predicates, and then bring them together in a single operation predicate.
For example, we can declare predicates preAdd and postAdd for the precondition and postcondition respectively, and then conjoin them in addOk:

//True if and only if the user has no password for this URL
pred preAdd [pb: PassBook, url : URL, user: Username] {
  no pb.password[user][url]
}

//Add a password for a user/url pair
pred postAdd [pb : PassBook, url : URL, user: Username, pwd: Password] {
  pb.password' = pb.password + (user->url->pwd)
}

//Add a password for a *new* user/url pair
pred addOk [pb : PassBook, url : URL, user: Username, pwd: Password] {
  preAdd [pb, url, user]
  postAdd [pb, url, user, pwd]
}

This is more verbose, but provides a clear separation between the preconditions and the postconditions that specify the transitions, which many people find preferable.

Exceptional cases

An important step in Figure 3.3 is the separation of "normal" cases from "exceptional" cases. The idea is to specify the normal cases for a model first, which are usually the most difficult parts, and then to enumerate all the relatively-easier exceptional cases after this, many of which just "fall into place".
To achieve this, we specify the normal and exceptional cases in separate predicates, and then combine them using disjunction in one operation predicate:

//Add a password for a *new* user/url pair
pred addOk [pb : PassBook, url : URL, user: Username, pwd: Password, report : one Report] {
  preAdd [pb, url, user]
  postAdd [pb, url, user, pwd]
  report in Success
}

//Fail to add a password that already exists
pred addDuplicate [pb : PassBook, url : URL, user: Username, report : one Report] {
  not preAdd [pb, url, user]
  pb.password' = pb.password
  report in Failed
}

//Add a password for a *new* user/url, otherwise, add nothing
pred add [pb : PassBook, url : URL, user: Username, pwd: Password, report : one Report] {
  addOk [pb, url, user, pwd, report] or
  addDuplicate [pb, url, user, report]
}

So, the add predicate specifies an operation in which either we attempt to add a duplicate entry, in which case nothing changes, or the entry is not a duplicate, in which case we add the new entry. Note the introduction of the variable report : Report, which is used as an output variable to specify whether the operation was successful or not. The Report signature is declared as:

//Report signatures for informing whether an operation was successful or not
abstract sig Report {}
sig Failed extends Report {}
sig Success extends Report {}

Thus, report must take on a value from either Success or Failed, which are disjoint sets. This is useful in assertions, where we can specify a case that we expect to fail (for example), and check that it failed simply by checking that report in Failed.

Invariant preservation

When using the abstract state machine pattern, Alloy assertions provide the opportunity to check that the invariant is preserved by an operation, using the following template:

assert initEstablishesInv {
  all s : State | init [s] => inv [s]
}
check initEstablishesInv

//for each operation
assert opNPreservesInv {
  all s : State, x : E, ... | inv [s] and opN [s, x, ...
] => after (inv [s])
}
check opNPreservesInv

This says that the initial state satisfies the state invariant, and, for each operation, if the state invariant holds on an arbitrary pre-state s and we execute the operation, the state invariant holds on the post-state reached after the operation finishes. As an example, consider this for the init and add operations:

assert initEstablishesInv {
  all pb : PassBook | init [pb] => inv [pb]
}

assert addPreservesInv {
  all pb : PassBook, user : Username, url : URL, pwd : Password, report : Report |
    inv [pb] and add [pb, url, user, pwd, report] => after (inv [pb])
}

Such checks can be useful not only for checking state invariants, which may capture safety properties, but also for checking the sanity of operations. For example, checking this assertion using the Alloy Analyser on the delete operation actually revealed a fault that had not been detected before, which was subtle, but which modelled the semantics completely wrong. The line that removed the tuple from the pb.password relation was:

pb.password' = pb.password' - (user->url->Password)

Can you see what the fault was? This predicate states that pb.password is equal in the post-state to itself with the tuple subtracted. The second occurrence of pb.password' should have been pb.password. This major semantic problem had not been discovered using the check assertions deleteIsUndo and deleteDuplicateDoesNothing, due to a strange coincidence in how the operation was modelled and the way these assertions were written.

Bounds

When checking that the invariant holds initially, of course only one State is needed. Since no operation occurs, a steps bound of 1..1 is sufficient. When checking that each operation preserves the invariant, likewise only one State is needed, and a steps bound of 1..2 is sufficient (assuming operations only cause a single state transition to occur).
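For the LastPass model, these bounds might be supplied to the Alloy Analyser as follows. This is a sketch, assuming the invariant-preservation assertions are named initEstablishesInv and addPreservesInv following the template above:

```alloy
//Initialisation: one state, no transitions
check initEstablishesInv for 3 but 1 PassBook, 1..1 steps

//Operation: two states, one transition
check addPreservesInv for 3 but 1 PassBook, 1..2 steps
```

Keeping the steps bound this tight avoids wasting analysis effort exploring longer traces that these single-transition assertions cannot distinguish anyway.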
Summary

So, by using the abstract state machine pattern, and structuring the model in a way that separates normal from exceptional behaviour, we can incrementally build up a model using a process such as the one in Figure 3.3. Further, the abstract state machine pattern lends itself naturally to assertions such as checking whether invariants hold.

3.10 Trace-Based Modelling in Alloy

Alloy 6's features for modelling variable signatures and fields, and its temporal logic, are ideal for trace-based models, i.e. ones that talk not just about single transitions (involving at most two states) but about sequences of such transitions. In that case one can specify temporal properties that need to hold over all such sequences. We call those sequences traces. To construct such a model we use fact declarations to constrain the set of traces that Alloy considers. Specifically, we need to specify the allowed initial states, from which traces can begin, and the allowed transitions that can occur between states.

Specifying the initial state

Recall that we use a predicate init to specify the initial state. Recall also that a fact declaration specifies what needs to be true of the initial state in a trace. Therefore we can specify that the initial states in the trace must satisfy init as follows:

fact { all s : State | init[s] }

Note that we do not use the always keyword here: doing so would specify that not only the initial state satisfies init but all subsequent states in the trace do too!

Specifying the allowed transitions

To specify the allowed transitions between states we use a second fact declaration. Assume we have predicates op1...opN to specify the operations (transitions) in our system. Also assume we have a predicate unchanged[s] which holds for a State s when the post-state is equal to the pre-state, and so specifies what it means for a state to be unmodified by a transition.
Then we can specify that the only allowed transitions are those given by op1...opN or unchanged as follows:

fact {
  always {
    all s : State | op1[s] or op2[s] or ... or opN[s] or unchanged[s]
  }
}

This fact says that in all states of the trace (not just the initial ones) one of the operation predicates must hold, i.e. there is a transition to a subsequent state given by one of the system's operations, or the state remains unchanged (the subsequent state is equal to the current one). The inclusion of the unchanged predicate is important for technical reasons to do with how traces are represented in Alloy. In particular, Alloy represents infinite sequences of transitions as lasso traces: traces where the final state transitions to some state already in the trace (possibly the final state itself). Without including the unchanged[s] possibility above, we would rule out many such traces that Alloy should consider when analysing our model. Including it means Alloy will consider at least all traces of finite sequences of operations, each drawn from op1...opN above, up to the steps bound, since the final state is always free to loop back to itself via the unchanged transition. We will see examples of this kind of model in lectures, including use of the temporal operators to search for interesting behaviours of it.

3.11 Verification and validation

As with any software engineering artefact, verification and validation (V&V) are important. There are several techniques for V&V of formal specifications:

1. Animation: Animation is the process of executing a specification with the intention of finding problems. The formal, unambiguous nature of specifications means that execution is often possible. However, because Alloy (and many other formal specification languages) is built on first-order logic, not all specifications are executable.

2.
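As a concrete sketch of these two facts for the LastPass model, assuming the single-transition operations add, update, and delete from Section 3.7.7 (in their simple forms without report parameters), and using existential quantification to close over the operations' arguments, which the generic template above leaves implicit:

```alloy
//A stuttering step: the PassBook is unmodified
pred unchanged [pb : PassBook] {
  pb.password' = pb.password
}

//Traces begin in a state satisfying init
fact { all pb : PassBook | init [pb] }

//Every step is one of the operations, or a stutter
fact {
  always {
    all pb : PassBook |
      (some url : URL, user : Username, pwd : Password |
        add [pb, url, user, pwd] or update [pb, url, user, pwd])
      or (some url : URL, user : Username | delete [pb, url, user])
      or unchanged [pb]
  }
}
```

Note that facts like these constrain every run and check in the module, so trace constraints of this kind are usually kept in a dedicated trace model rather than mixed into the single-transition model.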
Model checking: This is the process of attempting to execute every operation in every possible state of a system to show that certain properties hold; or, more accurately, to find counterexamples to the properties. Model checking for model-based languages is difficult because the sheer number of states may be exponentially large, if not infinite. There is a lot of research around the world looking at how to reduce the number of states for model checking while maintaining its soundness. If you have taken the subject Modelling Complex Software Systems, you would have used model checking to detect deadlock and to prove temporal logic properties of concurrency models.

3. Review: The specification can be subject to peer review, like any other document. This remains one of the most important and useful styles of review for formal specifications, despite advanced tools that can reason about them automatically.

4. Proof: Automated proof is the holy grail of formal specification. Proof is the most valuable of all techniques because, unlike animation, model checking, reviews, and testing, if a property about a model can be proved, then we can say with strong conviction that it is true (although in reality, there is always the possibility that the proof itself is wrong!).

In this section, we discuss proof in more detail. This is not a major part of the subject; however, proof is an important motivation as to why practitioners adopt formal methods in software and systems engineering. Woodcock and Davies [4] present a far more thorough and valuable use of proof in specification. Below, we discuss the implications of proof and present a trivial example.

The importance of proof

If we have a specification of a system, then by reasoning about it using proof, we are likely to detect problems. If we have not started implementing the system, this provides us with an opportunity to find those problems before they become costly to fix.
Further to this, the very process of constructing a proof can assist us in finding new requirements. That is, if we notice that we cannot prove a property about a system because some exceptional state has not been considered, we may need to add a requirement for this exceptional state. As Woodcock and Davies state: "The practice of proof makes for better specifications". There is a commonly-held opinion that proof is not possible on large-scale systems. Industrial application of formal specification and proof shows this opinion to be wrong. Proof can be applied to real systems running in real environments, and can provide cost savings too. There are situations where proof is not relevant or necessary, but for high integrity systems especially, there are situations where it is desirable and almost necessary. There may be some situations where a sketch of a proof is sufficient. The trick in applying proof is to know which proofs are worth attempting, which are worth sketching, and which are worth leaving. It is worth referencing the following quote:

"Things like even software verification, this has been the Holy Grail of computer science for many decades but now in some very key areas, for example, driver verification we're building tools that can do actual proof about the software and how it works in order to guarantee the reliability." — Bill Gates, Chairman of Microsoft, speaking at WinHEC 2002 in Seattle, Washington, USA.

What can be proved?

Proof is used in some high integrity systems for cases in which the outcome of failure is catastrophic; e.g. resulting in death or complete mission failure. The types of properties that can be proved vary; however, some interesting properties that have been proved about formal models include:

• Proofs that certain security protocols cannot be cracked. More importantly, there have been cases of security protocols where proofs have demonstrated that they can be cracked.
The Needham-Schroeder Public-Key Protocol is a protocol intended to provide mutual authentication between two parties communicating over a network, proposed by Roger Needham and Michael Schroeder in 1978. It was considered to be a foolproof way to establish identity between two communicators; however, it was not until Gavin Lowe applied formal proof to the protocol in 1995 that it was discovered that the protocol failed to prevent "man-in-the-middle" attacks.

• Proofs about safety properties. Many safety-critical systems need to ensure certain safety properties hold. For example, railway signalling systems need to ensure that, given an intersection of tracks, at no stage will overlapping tracks both have non-red signals. That is, only one train will be allowed on the intersection at any time. These properties are specified as temporal logic properties, and it can be shown that a formal specification preserves these properties.

• In many life-critical systems, formal proof is applied to minimise the risk of causing harm or death. In many systems employed on NASA space missions, proof is applied to demonstrate that certain cases cannot go wrong. In particular, NASA has been known to reject entire proposals for aerospace systems or life-support systems for astronauts unless model checking or proof have been used to demonstrate mission-critical properties (mainly around safety).

What can't proof do?

It is important not to say that if we "prove" our specification, then it is correct. When we do proofs about a formal specification, we are only proving that certain properties hold. We can say with some conviction that, if our proof can be discharged, then the property holds. However, this is only valuable if the property is correct. Further, we can do many proofs on a system, but omit an important one. Thus, a set of proofs is only as strong as the set of properties they show. In most cases, proofs are not automated.
That is, we cannot simply provide a specification and a property, and have the property proved automatically. There is a large set of proofs for which this is possible, but there is another set for which it is currently not. The holy grail of theorem proving is exactly to prove things automatically, but for now, it remains a pipedream. For many logics, it has been shown that not all true properties of a system can be automatically proved.

An example

As a small example of how proof works, we'll use the chemical storage system outlined in Chapter 2.

Example 3.3 In the chemical storage system, the storage tank has a maximum limit, and if that limit is breached, serious safety consequences may result. However, the overfilling problem is resolved using careful specification, and we can prove that the storage tank will not overflow, provided our specification is implemented correctly. If we consider a Fill operation for filling the tank, it will consist of a precondition checking that the amount going into the tank will not overflow it, and a postcondition calculating the new amount. The safety property we want to check is that the new amount does not overflow the tank:

• Precondition: contents + amount? ≤ capacity
• Postcondition: contents′ = contents + amount?
• Safety property: contents′ ≤ capacity (which is also the state invariant).

We can then structure our proof as follows:

Precondition ⇒ (Postcondition ⇒ Safety property)

That is, if the precondition is true, then the postcondition implies that the safety property holds. For the Fill operation, this is straightforward to discharge. We have to prove:

contents + amount? ≤ capacity ⇒ (contents′ = contents + amount? ⇒ contents′ ≤ capacity)

Using the one-point rule, we can replace all instances of contents′ with contents + amount?:

contents + amount? ≤ capacity ⇒ (contents + amount? = contents + amount? ⇒ contents + amount? ≤ capacity)

The middle proposition contents + amount? = contents + amount?
is trivially true, so can be removed from the entire proposition, leaving:

contents + amount? ≤ capacity ⇒ contents + amount? ≤ capacity,

which is itself trivially true. So, we have proved that, for every possible state of the system, if the precondition holds, the storage container will not be filled beyond capacity. This is much stronger than throwing a million tests at it to see if we can find an input/state pair that violates it. 2

The above example is trivial in comparison to the types of proofs that are discharged on industrial-scale systems; however, it illustrates how proof can be applied to a model-based specification. Despite the small size, the proof required a few steps and several lines of text. However, significant tool support is available for theorem proving. Using automated and interactive theorem provers, many large proofs can be discharged in a fairly straightforward manner, and the documentation of the proof can be generated automatically and stored digitally.

Bibliography

[1] Daniel Jackson. Software Abstractions: Logic, Language, and Analysis. MIT Press, 2012.
[2] Mark Phillips. CICS/ESA 3.1 experiences. In Z User Workshop, Oxford, pages 179-185, 1989.
[3] Ian Sommerville. Software Engineering. Pearson Higher Education, USA, 9th edition, 2010.
[4] Jim Woodcock and Jim Davies. Using Z: Specification, Refinement, and Proof, volume 1. Prentice Hall, 1996.

Chapter 4

Introduction to Ada

In this chapter, the programming language Ada is introduced. We will use Ada in this subject for two reasons. First, due to its features, it is heavily used in high integrity domains, especially safety-critical systems. Second, later in the course, we will look at a language called SPARK, which is a safe programming subset of Ada, and has tool support for high integrity systems.

4.1 The history of Ada

Ada is a structured, imperative programming language with static typing, object-orientation, and concurrency built in.
It was designed by a company called CII Honeywell Bull, who were contracted by the United States Department of Defence (DoD) in 1977 to provide a new language to replace the hundreds of programming languages they were using at the time. The DoD were concerned that many of the languages were specific to particular hardware, and thus not reusable, and were becoming obsolete. Further, many of the languages did not support safe, modular programming, which had emerged as the best way to handle large-scale, complex software systems. At the time, four contractors were asked to provide proposals for a new language, having been given a set of criteria. The CII Honeywell Bull group's proposal was chosen as the successful proposal, and the Ada language was then designed. Initially intended for embedded and real-time systems, Ada has become highly used in high integrity systems in general, especially in the safety-critical domain. Further, it was used as a general-purpose programming language in the 1980s; however, due to its property of being designed to be read rather than written, many people do not consider it suitable for the more "enterprise-y" applications, where time-to-market is considered more important than software integrity. The language is named after Augusta Ada King, Countess of Lovelace (more generally known as Ada Lovelace), who is credited as being the first ever computer programmer. In 1843, when translating notes from a seminar by Charles Babbage on his Analytical Engine (the first programmable machine) from French to English, Ada Lovelace included an addendum showing an algorithm for calculating Bernoulli numbers using the Analytical Engine, which is considered the first program ever written. There is some controversy as to her actual intellectual input into the algorithm, although much of this controversy is inconsistent with Babbage's own recollections.
4.2 Ada for programming "in the large"

A major criterion given by the DoD for a new programming language was that it must provide strong support for good software engineering practices that scale well to very large software systems (of the order of 10^6 lines of code, and large development teams). There are several aspects of Ada that contribute to this:

• A strong, static and safe type system. In particular, all type errors can be detected automatically at compile time.

• Modularity. Ada modules (called "packages") can be compiled separately from all other packages. Interfaces can also be compiled without the need for an implementation, meaning that some design errors can be detected before implementation begins.

• Information hiding. In providing interfaces/packages, Ada separates interfaces from implementation, and provides fine-grained control over visibility.

• Readability. This is an important aspect. Ada favours the reader of the program over the writer. That is, the aim is to make Ada programs easier to read (compared to other languages), possibly at the expense of making them slightly harder to write. This is because a program is written once but read many times during maintenance.

• Portability. Ada does not include any platform-specific operators or data types, and can thus be compiled on many platforms.

• Standardisation. Ada is an ANSI and ISO standard.

It is these properties, plus others, that make Ada a great fit for high integrity systems, and that are why it is still used in so many high integrity domains today. The language itself continues to evolve, and a new ISO/IEC standard for Ada was published in December 2012.

4.3 Hello World

A great way to introduce a programming language to people with programming experience is to start with some example programs.
So, let's start with a very traditional program to introduce Ada:

Listing 4.1: hello_world.adb

with Ada.Text_IO;

procedure Hello_World is
begin
   Ada.Text_IO.Put_Line("Hello, world!");
end Hello_World;

Note the following about this program:

• The with statement adds the Ada.Text_IO package to the program. This is basically a library import, similar to the "import" functionality in Java, or the #include directive in C.
• Block structure is given by begin...end pairs. Thus, a procedure begins and ends with these keywords.
• An Ada main procedure does not need to be called "main", as in some other languages. Any simple name is fine, so we have used Hello_World.
• The name-space is not flat. Therefore, to access the functions declared in the Ada.Text_IO package, we need to prefix them with the package name.

Package renaming

To avoid having to prefix functions with this long name each time, we can use renaming:

Listing 4.2: hello_world_renaming.adb

with Ada.Text_IO;

procedure Hello_World_Renaming is
   package IO renames Ada.Text_IO;
begin
   IO.Put_Line("Hello, world!");
   IO.New_Line;
   IO.Put_Line("...using package renaming!");
end Hello_World_Renaming;

Compiling Hello World

Ada programs are divided into compilation units. An Ada specification (or interface) is usually given the extension .ads (ADa Specification) and the body of the program is given the extension .adb (ADa Body). The file name must be the same as the name of the compilation unit it holds¹. For example, the hello world program must be placed in the file hello_world.adb. All compilation units must be compiled, bound and linked.

The notes for this subject will use the GNAT Ada compiler, which is downloadable from http://libre.adacore.com/download/ for Linux and Windows (native, JVM, and .NET)... and Mindstorms! The notes and workshops assume GNAT on Linux.
Students are free to use the Windows compiler, but note that the staff cannot give much useful support on using these tools on Windows.

The GNAT Ada compiler can be used on the command line. To compile the hello world program, use the following commands:

1. gcc -c hello_world.adb generates the files hello_world.o – the object code – and hello_world.ali – library information used for binding and static analysis.
2. gnatbind hello_world uses hello_world.ali to bind the files.
3. gnatlink hello_world uses hello_world.ali to link the files.

¹Although some Ada compilers are forgiving and just produce a warning in this case.

Alternatively, and more conveniently, use the following:

1. gnatmake hello_world.adb is a single command that performs all three of the above steps.

The result is an executable program called hello_world. You can use the command gnatclean -c hello_world to delete any compiler-generated files.

4.4 Specifications and program bodies

Ada programs consist of packages, procedures, and functions. Packages are synonymous with modules, functions are defined as in other languages, and procedures are functions with no return value. More on procedures and functions later.

Specifications show the types, functions, procedures and tasks that are publicly available for an Ada program. As an example, we can make a simple Greetings package with the interface below, defined in the file greetings.ads:

Listing 4.3: greetings.ads

package Greetings is
   procedure Hello;
   procedure Goodbye;
   function Talk(Text : in String) return Integer;
end Greetings;

This declares two procedures, Hello and Goodbye, and a function Talk, but does not define their implementations.
The package body defines the implementation, possibly including hidden data types (we'll see an example of this later), defined in the file greetings.adb:

Listing 4.4: greetings.adb

with Ada.Text_IO; use Ada.Text_IO;

package body Greetings is
   procedure Hello is
   begin
      Put_Line("Hello, world!");
   end Hello;

   procedure Goodbye is
   begin
      Put_Line("Goodbye, world!");
   end Goodbye;

   function Talk(Text : in String) return Integer is
   begin
      Put_Line(Text);
      return 1;
   end Talk;
end Greetings;

The names, parameter types, and return types must match between the specification and the body.

Because the two procedures are contained in a package, neither of them is directly executable. Instead, they must be executed from another procedure. Below is a code listing that calls the two procedures in succession:

Listing 4.5: greetings_main.adb

with Greetings; use Greetings;

procedure Greetings_Main is
   I : Integer;
begin
   Hello;
   I := Talk(Text => "What's happening?");
   Goodbye;
end Greetings_Main;

Note in this example a new keyword: use. The statement use Ada.Text_IO makes all of the public entities of that package accessible without qualification in the body of the package, meaning that we no longer need to prefix Put_Line with the package name. However, use this with caution: two packages can each declare a public entity with the same name, so importing both results in an ambiguity.

Exercise 4.1 Create two packages that contain a procedure of the same name. Import both of these, using the with and use keywords, into a new program that simply calls the ambiguously declared procedure. Try to compile and run this program to see what happens. □

Following the one-unit-per-file rule, the program must be placed in three separate files:

1. greetings.ads: contains the specification of the package.
2. greetings.adb: contains the implementation of the package.
3. greetings_main.adb: contains the main program that makes use of the Greetings package.
To compile this program, we can compile the package and the program, and then bind them, as follows:

1. gcc -c greetings_main.adb
2. gcc -c greetings.adb
3. gnatbind greetings_main
4. gnatlink greetings_main

Alternatively, issue the single command:

1. gnatmake greetings_main

The gnatmake command will locate and compile any packages used in the main program, provided that it can find them on the path. To avoid having to set up paths, we'll only deal with programs in a single directory.

4.5 Comments

Comments can be added to programs using the prefix --:

Listing 4.6: Comments

-- this part ignored
Put_Line("Hello, world!"); -- this part too!

4.6 Types

The Ada type system is a major part of its success. It is designed for programmers to create abstractions that model real-world objects. As such, the type system is not based on a collection of pre-defined types, but instead allows designers to create new types. Of particular note in the type system are the following properties:

• Strong typing: types are incompatible with one another.
• Static typing: types are checked while compiling.
• Abstraction: types represent the real-world data of the problem at hand, rather than the computer representation of that data.
• Name equivalence: not structural equivalence as used in most other languages. Two types are compatible if and only if they have the same name; not if they just happen to have the same size or bit representation.

4.6.1 The predefined Ada type system

Despite this, there are still a handful of predefined types in Ada. Predefined types are used as interfaces between different packages or libraries. The predefined library uses these types too. The following predefined types are declared in the standard package:

• Integer: The standard integer type, which must cover at least the integers in the range −2^15 + 1 .. 2^15 − 1.
• Float: The type of floating point numbers with mantissa and exponent, but the implementation is not required to follow the IEEE standard.
• Duration: A fixed point type used for timing and representing a period of time in seconds.
• Character: A special form of enumeration type. There are three predefined kinds of character types: 8-bit characters, 16-bit characters, and 32-bit characters.
• String: Three indefinite array types, of 8-bit characters, 16-bit characters, and 32-bit characters respectively.
• Boolean: A Boolean in Ada is an enumeration of False and True with special semantics.

The packages System and System.Storage_Elements predefine some types that are primarily useful for low-level programming and interfacing to hardware:

• System.Address: An address in memory.
• System.Storage_Elements.Storage_Offset: An offset, which can be added to an address to obtain a new address. You can also subtract one address from another to get the offset between them. Together, Address and Storage_Offset and their operators provide address arithmetic.
• System.Storage_Elements.Storage_Count: A subtype of Storage_Offset which cannot be negative, and represents the memory size of a data structure (similar to C's size_t).
• System.Storage_Elements.Storage_Element: In most computers, this is a byte. Formally, it is the smallest unit of memory that has an address.
• System.Storage_Elements.Storage_Array: An array of Storage_Elements without any meaning, useful when doing raw memory access.

Figure 4.1 shows the Ada type hierarchy. A type inherits properties from types above it in the hierarchy.

Ada Types
├── Elementary Types
│   ├── Access Types
│   └── Scalar Types
│       ├── Discrete Types
│       │   ├── Enumeration Types
│       │   └── Integer Types
│       │       ├── Signed Types
│       │       └── Modular Types
│       └── Real Types
│           ├── Float Types
│           └── Fixed Types
│               ├── Ordinary Types
│               └── Decimal Types
└── Composite Types
    ├── Array Types
    ├── Record Types
    ├── Protected Types
    └── Task Types

Figure 4.1: The Ada Type hierarchy.

Converting to/from strings

All scalar types (refer to Figure 4.1) have two special attributes called 'Value and 'Image, which are used for converting strings to values, and values to strings.
Both are functions that are called on the type, not on individual values (the Java metaphor is a static method, as opposed to an instance method). The following code listing demonstrates how to convert a string to an integer, and back again:

Listing 4.7: value_image_attributes.adb

with Ada.Text_IO; use Ada.Text_IO;

procedure Value_Image_Attributes is
   An_Integer : Integer;
   A_String : String := "5";
begin
   An_Integer := Integer'Value(A_String);
   Put_Line("Hello, world!" & Integer'Image(An_Integer));
end Value_Image_Attributes;

Exercise 4.2 Write a program that converts a string containing a non-number to an integer. Compile and run this program to see what happens. □

4.6.2 Arrays

Arrays in Ada are somewhat different to those in many other high-level programming languages – and in fact, are more powerful as well. One key difference is in the indices of the array. In many languages, the indices of arrays are integers, starting at either 0 or 1. However, in Ada, array indices can be of any discrete type, such as characters, enumerations, etc.

As an example, consider the following program, which declares two arrays of type Character, but each with a different type of index: in one case, a range of integers; and in the other, a range of characters. Elements in the array are accessed using integers for the array Int_Index, but characters for the array Ch_Index.

Listing 4.8: array_examples.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Array_Examples is
   Int_Index : array(Integer range 5..10) of Character :=
      ('a', 'b', 'c', 'd', 'e', 'f');
   Ch_Index : array(Character) of Character;
begin
   Put(Int_Index(5)); -- accessing using an integer index
   New_Line;

   Ch_Index('a') := 'z'; -- setting using a char index
   Put(Ch_Index('a')); -- accessing using a char index
   New_Line;

   -- setting and getting an array 'slice'
   Int_Index(6 ..
8) := ('X', 'Y', 'Z');

   -- array attributes "'First" and "'Last"
   Put(Ch_Index'First);
   New_Line;
   Put(Int_Index'Last);
   New_Line;
end Array_Examples;

Arrays have four special attributes: 'First, 'Last, 'Length, and 'Range. The first two of these do not return the first and last elements of the array, but the first and last indices of the array. Thus, in the above example, Int_Index'Last will return 10, and Ch_Index'First will return the first ASCII character (NUL). The 'Length attribute returns the length of the array type (and of the array – the entire array is allocated in Ada), while the 'Range attribute returns the range of indices; e.g. Int_Index'Range will return 5..10.

Array       'First   'Last   'Length   'Range
Int_Index   5        10      6         5 .. 10
Ch_Index    NUL      ÿ       256       NUL .. ÿ

The pre-defined type String is a character array. However, the indices need not start at 0 or 1. Consider the following example, taken from the Ada Wikibook²:

Listing 4.9: Attributes of strings

Hello_World : constant String := "Hello World!";
World : constant String := Hello_World (7 .. 11);
Empty_String : constant String := "";

In this example, the attribute values are:

Array          'First   'Last   'Length   'Range
Hello_World    1        12      12        1 .. 12
World          7        11      5         7 .. 11
Empty_String   1        0       0         1 .. 0

Thus, strings can be indexed from any range of integers, and it is not necessarily the case that 'Length = 'Last.

4.6.3 Defining new types

A new type is introduced with the syntax: type T is ...

Consider the following example of defining a type for dates.

Listing 4.10: A data type for dates

type Day_type is range 1 .. 31;
type Month_type is range 1 .. 12;
type Year_type is range 1800 ..
2100;
type Hours is mod 24;
type DayOfWeek is (Monday, Tuesday, Wednesday, Thursday,
                   Friday, Saturday, Sunday);

type Date is
   record
      Day : Day_type;
      Month : Month_type;
      Year : Year_type;
   end record;

²See http://en.wikibooks.org/wiki/Ada_Programming/Types/array

A record is a composite type that groups one or more data types together into a single data type, similar to a struct in C.

All types are incompatible with each other. Therefore, the following code snippet is illegal, because Day_type and Month_type are different and incompatible types, even though the value 8 can be assigned to either:

Listing 4.11:

A : Day_type := 8;
B : Month_type := A; -- illegal!

4.7 Control Structures

As a structured language, the control flow of the program is structured into statements. The following control structures are the most pertinent in the Ada language.

Assignment

Assignment in Ada is written as X := E, where X is a variable and E is an expression. Assignment is read as "X becomes equal to E".

Conditionals

if-then-else and case:

Listing 4.12: A conditional

if x < y then
   Put_Line ("True");
else
   Put_Line ("False");
end if;

Listing 4.13: A case statement

case x is
   when 1 => Put_Line ("First");
   when 2 => Put_Line ("Second");
   when 3 => Put_Line ("Third");
   when others => Put_Line ("None!");
end case;

Unconditionals

return and goto:

Listing 4.14: Returning a value

function Use_Return return Integer is
   X : Integer := 5;
   Y : Integer := 10;
begin
   return X + Y;
end Use_Return;

Listing 4.15: Using a goto

function Use_Goto return Integer is
   X : Integer := 5;
   Y : Integer := 10;
   Z : Integer;
begin
   Z := X + Y;

   goto Label;

   Z := 0;

   <<Label>>

   return Z;
end Use_Goto;

Loops

There are a number of different kinds of loops in Ada:

• Endless loops:

Listing 4.16: endless_loop.adb

with Ada.Text_IO; use Ada.Text_IO;

procedure Endless_Loop is
begin
   loop
      Put_Line("Hello again, world!");
   end loop;
end Endless_Loop;

• While loops, which specify the termination condition at the start of the loop:

Listing 4.17: while_loop.adb

with Ada.Text_IO; use Ada.Text_IO;

procedure While_Loop is
   X : Integer := 0;
begin
   while X < 10 loop
      Put_Line("Hello again, world!");
      X := X + 1;
   end loop;
end While_Loop;

• Exit loops, which specify the termination condition in the middle of the loop:

Listing 4.18: exit_loop.adb

with Ada.Text_IO; use Ada.Text_IO;

procedure Exit_Loop is
   X : Integer := 0;
   Y : Integer := 0;
begin
   loop
      Put_Line("Hello again, world!");
      X := X + 1;
      exit when X >= 10;
      Y := Y + 2;
   end loop;
end Exit_Loop;

• Until loops, which specify the termination condition at the end of the loop (which is just an exit loop with the exit condition at the end of the loop):

Listing 4.19: until_loop.adb

with Ada.Text_IO; use Ada.Text_IO;

procedure Until_Loop is
   X : Integer := 0;
begin
   loop
      Put_Line("Hello again, world!");
      X := X + 1;
      exit when X >= 10;
   end loop;
end Until_Loop;

• For loops, which allow iteration over all values in a range:

Listing 4.20: for_loop.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure For_Loop is
begin
   for I in Integer range 1 .. 10 loop
      Put("Hello again, world!:");
      Put(I);
      New_Line;
   end loop;
end For_Loop;

If the range is omitted, the loop will iterate over all possible values of the type.
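As a small illustration of this, the following sketch (our own example, not from the original notes; the names Traffic_Light and Enum_Loop are made up) uses an enumeration type's name as the loop range, so the loop visits every value of the type:

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Enum_Loop is
   type Traffic_Light is (Red, Amber, Green);
begin
   -- No range is given, only the type name, so the loop
   -- iterates over every value of Traffic_Light in turn.
   for Colour in Traffic_Light loop
      -- 'Image converts each enumeration value to a string
      Put_Line(Traffic_Light'Image(Colour));
   end loop;
end Enum_Loop;
```

Compiled with gnatmake enum_loop.adb, this prints RED, AMBER and GREEN on separate lines ('Image yields the identifier in upper case).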
• Array loops, which allow iteration over all elements in an array (or any other iterable type, for that matter):

Listing 4.21: array_loop.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Array_Loop is
   X_Array : array(Integer range 1..10) of Integer :=
      (29, 28, 27, 26, 25, 24, 23, 22, 21, 20);
begin
   for I in X_Array'Range loop
      Put("Hello again, world!:");
      Put(X_Array(I));
      New_Line;
   end loop;
end Array_Loop;

4.8 Procedural Abstractions and Functional Abstractions

The Ada language explicitly distinguishes between procedures and functions:

Definition 4.1 A procedure does not return any value and is considered a statement. □

Definition 4.2 A function returns a value and is itself an expression. □

As such, functions can be used as part of larger expressions, but procedures cannot. Earlier, in the hello/goodbye example (in the Greetings package), we saw an example of how to call a procedure (which had no parameters). For succinctness, we use the term subprogram to include both functions and procedures.

4.8.1 Parameter Modes

When defining a procedure, each parameter of the procedure must have one of the following modes:

1. in: The formal parameter is an input parameter only, and may be read from, but cannot be written to. It is treated as a constant in the subprogram. The actual parameter value is copied into the formal parameter and is not changed by the call.
2. out: The formal parameter is an output parameter only. That is, the calling code will want to read the value of the parameter after the call is finished. The actual parameter's value before the call is irrelevant, because it will get a value in the subprogram. The formal parameter may be read and written in the subprogram.
3. in out: The formal parameter is both an input and output parameter, and can thus be both read and written. The actual parameter may be redefined in the subprogram body.
4.
access: The formal parameter is an access type (a pointer) to some variable.

If a parameter is declared without one of these modes, then by default its mode is in. For functions, parameters must be of mode in or access. To write a subprogram that returns more than one value, use a procedure in which all of the values to be returned are out parameters.

4.8.2 Calling subprograms

There are three syntax variations for calling subprograms:

1. Calling subprograms that have no parameters – called as just the subprogram name, with no brackets.
2. Calling via positional arguments. This is found in most high-level programming languages, in which the position of the arguments in the call must be the same as in the declared subprogram.
3. Calling via named associations. This is where parameter variable names are explicitly specified in the calling code.

The code listing below provides examples of all three call types:

Listing 4.22: calling_subprograms.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Calling_Subprograms is
   An_Int : Integer := 50;
   Another_Int : Integer := 60;
begin
   Put(An_Int, 20);                  -- positional parameters
   New_Line;                         -- no parameters
   Put(Width => 20, Item => An_Int); -- named associations
end Calling_Subprograms;

Note that the order of the arguments in the final call (using named associations) is different to the declared order.

4.9 access Types: Passing Parameters By Reference

Consider the following program.
Listing 4.23: pass_by_value.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Pass_By_Value is
   type Rec is record
      X : Integer;
   end record;

   R : Rec;

   procedure Print_R is
   begin
      Put("R.X: ");
      Put(R.X);
      New_Line;
   end Print_R;

   procedure CheckR(givenR : in out Rec) is
   begin
      Put("In CheckR"); New_Line;
      givenR.X := R.X + 1;
      if GivenR = R then
         Put("Equal");
      else
         Put("Not Equal");
      end if;
      New_Line;
      Print_R;
   end CheckR;

begin

   R.X := 5;
   CheckR(R);
   Put("CheckR done"); New_Line;
   Print_R;

end Pass_By_Value;

It declares a new record type Rec, which in this case has just a single Integer field X; in general, however, records can be very large. The program declares a record variable R and sets its X field to 5. The CheckR procedure is designed to take a Rec as argument; if that record is equal to the record R then it prints out "Equal", otherwise it prints "Not Equal". CheckR modifies the given record first, before performing this test.

What happens when we run this program? What does it print?

$ ./pass_by_value
In CheckR
Not Equal
R.X: 5
CheckR done
R.X: 6

It prints "Not Equal". The reason is that when CheckR is called, the record R is passed by value. This means that its contents are copied into a new record givenR that is given to CheckR. This is why, when R is printed during CheckR, the value 5 is still printed – the modification to givenR has not (at least yet) modified R. When CheckR returns, the contents of givenR are copied back to the original record R, which is why at that point the value 6 is then printed.

Programming languages like C support pointers to enable data structures to be passed by reference rather than by value. A reference to the structure can be passed to a subprogram (a function or procedure like CheckR), allowing modifications performed during the subprogram to immediately update the original data structure.
The following program is a small variation on pass_by_value.adb above, showing how R can be passed by reference to CheckR instead.

Listing 4.24: pass_by_reference.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Pass_By_Reference is
   type Rec is record
      X : Integer;
   end record;

   type RecPtr is access all Rec;

   R : aliased Rec;

   procedure Print_R is
   begin
      Put("R.X: ");
      Put(R.X);
      New_Line;
   end Print_R;

   procedure CheckR(givenR : in RecPtr) is
   begin
      Put("In CheckR"); New_Line;
      givenR.X := R.X + 1;
      if GivenR = R'Access then
         Put("Equal");
      else
         Put("Not Equal");
      end if;
      New_Line;
      Print_R;
   end CheckR;

begin

   R.X := 5;
   CheckR(R'Access);
   Put("CheckR done"); New_Line;
   Print_R;

end Pass_By_Reference;

Before getting into the details of how this program is implemented, let's first see what it prints when it runs.

$ ./pass_by_reference
In CheckR
Equal
R.X: 6
CheckR done
R.X: 6

Observe that the modification performed inside CheckR directly updates the record R. Also observe that CheckR now prints out "Equal", since it is now comparing two pointers (i.e. two references) that both point to the same data structure, namely to R.

This program declares a new type RecPtr that holds pointers to Recs. Values of this type are not records of type Rec, but rather point to such records. One can think of variables of type RecPtr as storing memory addresses that are the locations in memory of Rec objects. This type is an access type with the all qualifier, meaning that it can point to Recs that are declared as local variables like R (i.e. those that are allocated on the stack).

CheckR now takes a RecPtr as its argument. When it is called, rather than passing R we pass R'Access, which is the expression that evaluates to a pointer to R. Ada won't allow us to get a pointer to R unless we declare R as aliased.
This is why this qualifier is added to the declaration of R. It tells the compiler that more than one reference to the memory occupied by R might exist at some point during the execution of the program.

Inside CheckR, the given RecPtr is compared against R'Access, i.e. the procedure checks whether the given pointer points to the record R. Of course it does, which is why the program prints out "Equal".

4.9.1 Access Types to Read-Only Data

What if you want to pass data by reference, but don't want to allow it to be modified? This is done by using access constant instead of access all. For example, if we replace access all by access constant in pass_by_reference.adb and try to compile it, we get a type error caused by the assignment statement that attempts to modify the record pointed to by givenR. This is because, with this change, the RecPtr type is now declared as pointing to read-only Recs. Therefore any procedure that is passed a RecPtr is not allowed to modify the record that the pointer points to.

4.9.2 Summary

To summarise, access types are used to pass data structures by reference, which avoids them having to be copied between procedures. They also allow one to check whether two given references refer to the same data structure in memory. Access types get more interesting when used in conjunction with dynamic memory allocation, which we don't cover in this course. Here we have only scratched the surface. You should consult the references at the end of this chapter to learn more.

4.10 Example: Insertion sort

Before we go any further, it is time for a complete example. The example we use is insertion sort³ – a well-known sorting algorithm with worst-case complexity O(n²), which works by taking the next element from the unsorted part of the array and inserting it into its correct position in the sorted part of the array. This example uses many of the constructs seen already, declaring the procedure in a package.
The package specification consists of a type declaration and a procedure declaration:

Listing 4.25: sort.ads

package Sort is
   type Data_Array is array(Natural range <>) of Integer;
   procedure Insertion_Sort(Item : in out Data_Array);
end Sort;

³Taken from Rosetta Code: http://rosettacode.org/wiki/Sorting_algorithms/Insertion_sort#Ada

The package body implements the procedure specification:

Listing 4.26: sort.adb

package body Sort is

   procedure Insertion_Sort(Item : in out Data_Array) is
      First : Natural := Item'First;
      Last : Natural := Item'Last;
      Value : Integer;
      J : Integer;
   begin
      for I in (First + 1)..Last loop
         Value := Item(I);
         J := I - 1;
         while J in Item'Range and then Item(J) > Value loop
            Item(J + 1) := Item(J);
            J := J - 1;
         end loop;
         Item(J + 1) := Value;
      end loop;
   end Insertion_Sort;
end Sort;

The following serves as a small demonstration of creating an unsorted array, sorting it, and printing the result:

Listing 4.27: sort_main.adb

with Sort; use Sort;
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Sort_Main is
   Int_Array : Data_Array := (5, 3, 9, 1, 5, 4);
begin
   -- sort the array
   Insertion_Sort(Int_Array);

   -- iterate over the array and print the elements
   for I in Int_Array'First .. Int_Array'Last loop
      Put(Int_Array(I), 2);
      Put(", ");
   end loop;
end Sort_Main;

4.11 Concurrency with tasks

The final topic that we cover in this chapter is tasks. Tasks are the basic unit of concurrency in Ada. As such, they are much like threads in Java – but have several differences.

Tasks are important in Ada because they help to support real-time systems. Many (perhaps most?) safety- and mission-critical systems are real-time. That is, timing is part of the functional specification of the system, and failure to complete some function before a deadline means that the system has failed.
Tasks start executing the moment they are created. Tasks are not called, unlike functions and procedures. Tasks can communicate with other tasks:

• Tasks can pass messages – message passing;
• Tasks can share variables – shared memory communication.

Like programs, tasks are scheduled; but unlike with programs, the effects of task scheduling are often visible. The example below⁴ shows the declaration and body of two tasks: one for "backing up" and one for "checking the CPU".

Listing 4.28: housekeeping.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Housekeeping is

   task Check_CPU;
   task Backup_Disk;

   task body Check_CPU is
   begin
      for count in 1..50 loop
         Put (Item => "Check CPU status ");
         Put (Item => count, Width => 1);
         New_Line;
      end loop;
   end Check_CPU;

   task body Backup_Disk is
   begin
      for count in 1..50 loop
         Put (Item => "Backup disk status ");
         Put (Item => count, Width => 1);
         New_Line;
      end loop;
   end Backup_Disk;

   -- the two tasks are automatically created and begin execution

begin -- Housekeeping
   null;
   -- Housekeeping waits here for them to terminate
end Housekeeping;

⁴Taken from https://en.wikibooks.org/wiki/Ada_Programming/Tasking

The master of the tasks is the procedure Housekeeping. The procedure itself does nothing except declare the tasks, which then execute automatically. The body of Housekeeping is empty and just waits for its child tasks to terminate.

Entry points

Tasks communicate with each other via entry points (or entry calls). Tasks can have one or more entry points. Explicitly named entry points are declared in the task declaration. The entry keyword declares an entry point into a task. Entries are declared in much the same way as procedures. They have identifiers and parameters (which can be in, out, or in out parameters). An entry is executed by the called task, not the calling task.
The calling task is suspended until the call completes. If the called task cannot process the entry call immediately (most commonly because it is already executing on behalf of another call), the call is placed in a FIFO queue associated with that entry. This interaction between the calling task and the called task is known as a rendezvous. A task accepts a rendezvous with any caller for a specific entry by executing an accept statement for that entry. The following code snippet demonstrates how entry and accept statements work:

Listing 4.29: entry_call.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Entry_Call is

   task type Encapsulated_Variable_Task_Type is
      entry Store (An_Integer : in Integer);
      entry Fetch (An_Integer : out Integer);
   end Encapsulated_Variable_Task_Type;

   task body Encapsulated_Variable_Task_Type is
      Datum : Integer;
   begin
      loop
         select
            accept Store (An_Integer : in Integer) do
               Datum := An_Integer;
               Put ("Storing ");
               Put (An_Integer);
               New_Line;
            end Store;
         or
            accept Fetch (An_Integer : out Integer) do
               An_Integer := Datum;
               Put ("Fetching ");
               Put (An_Integer);
               New_Line;
            end Fetch;
         or
            terminate;
         end select;
      end loop;
   end Encapsulated_Variable_Task_Type;

   X, Y : Encapsulated_Variable_Task_Type;
   It : Integer;

begin
   It := 5;
   X.Store(It);
   X.Fetch(It);
   It := It + 1;
   Y.Store(It);
end Entry_Call;

The above example also demonstrates several other important aspects of tasks:

1. Task types: The two tasks used are of the same task type, Encapsulated_Variable_Task_Type. This allows tasks to be created dynamically, and also to be included in new data structures.
2. Selective waits: The select keyword is used to prevent a task being held up when it could be doing something. In the above example, the task can choose between storing or fetching an item. If there is only one pending call, that is executed.
If there is more than one, any one of the pending calls can be chosen, thus introducing non-determinism into the program.

3. Termination: The loop in the tasks above will execute indefinitely. By giving the tasks a terminate alternative, they will terminate when: (1) they have no pending calls; (2) no other tasks that are children of the task's master have any pending calls; and (3) the task's master has completed.

Exercise 4.3 Remove the terminate statement from the Encapsulated_Variable_Task_Type task and re-run the example. □

Guards

Entries can have guards that only allow the entry to execute under specific circumstances. For example, in Listing 4.29, we could add a guard to the Store entry restricting the input to values below a certain number:

Listing 4.30:

when An_Integer < 100 =>
   accept Store (An_Integer : in Integer) do

If a guard is false, the entry point is not open, and another entry point can be executed instead.

4.12 Further reading

An excellent introduction to Ada, which includes chapters on systems programming, real-time systems, generic types, and distributed systems, is Ben-Ari's book "Ada for Software Engineers" [1]. This is available as an e-book from the University of Melbourne library, and is far more comprehensive than these notes.

An excellent online reference is the "Ada Programming" Wikibook [2], which offers a substantial overview of the Ada language and many complete examples, some of which are used in this chapter.

Bibliography

[1] Mordechai Ben-Ari. Ada for Software Engineers. Springer, 2009.

[2] Wikibooks. Ada Programming. https://en.wikibooks.org/wiki/Ada_Programming.

Chapter 5

Safe programming language subsets

In this chapter, we introduce the concept of safe programming language subsets. These are subsets of programming languages in which unsafe parts of the language are not permitted. They are intended to make it impossible to implement certain types of faults in software. The particular language that we will look at is SPARK – a subset of Ada.
SPARK is intended to be a simple language that allows structured programs to be written, while preventing many types of serious faults that can affect the system and are difficult to detect. It does this by constraining the syntax and the semantics of the language to be straightforward to reason about, by both humans and machines.

5.1 Principles of correctness by construction

"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other is to make it so complicated that there are no obvious deficiencies." — Professor Tony Hoare, 1980 ACM Turing Award Lecture

Correctness by construction is an approach to developing software used at Altran Praxis, a small company in the U.K. that specialises in safety- and security-critical software. Altran Praxis is the company that defined SPARK, and it continues to provide high-quality tool support for it. There are seven key principles adhered to at Praxis:

1. Expect requirements to change.

2. Know why you're testing: finding faults, not showing correctness.

3. Eliminate errors before testing. Testing is the second most expensive way of finding errors. The most expensive is to let your customers find them for you.

4. Write software that is easy to verify. Testing takes up to 60% of total effort; implementation typically takes 10%. Doubling implementation effort is cost-effective if it reduces verification cost by as little as 20%.

5. Develop incrementally. Make small changes, incrementally. After each change, verify that the updated system behaves according to its updated specification.

6. Some aspects of software development are just plain hard. The best tools and methods take care of the easy problems, allowing you to focus on the difficult problems.

7. Software is not useful by itself.
The executable software is only part of the picture: user manuals, business processes, design documentation, well-commented source code, and test cases should not just be added at the end.

Learning outcomes

Throughout the next few chapters, we will explore just some of the principles above, and how Praxis (and others) go about adhering to them. The learning outcomes of the next four chapters are:

1. To motivate the use of safe programming language subsets.

2. To understand the motivation for design by contract.

3. To specify formal contracts for small- and medium-scale packages in Ada.

4. To prove the correctness of a program against its contract.

5. To apply SPARK Ada static analysis tools for semi-automated proof of contract satisfaction.

6. To systematically construct tests from specifications and contracts.

In this chapter, we look at the first item above: safe programming language subsets.

5.2 Safe programming language subsets

The evolutionary nature of programming languages means that there are many problems with our everyday programming languages. For example, C and Ada, the two most widely used programming languages in embedded software, have no agreed semantics. Furthermore, they both permit programs that have no defined behaviour. Yet these are two of the most well understood and strict languages in common use today. Other languages, such as Python, Ruby, and Java, have far greater problems.

Structured programming is where the procedural logic of a program is a structure composed of sub-structures in a limited number of ways; most commonly, constructed of sequencing, selection, and iteration.

Safe programming language subsets use structured programming to provide a small but powerful programming language that is easier to verify than non-safe languages.

Some programming language subsets

Some well-known safe programming language subsets include:

• SafeD, a safe subset of D, which is itself related to C++.

• MISRA C, a safe subset of C.
• Joe-E, a secure subset of Java.

• SPARK, a safe subset of Ada, extended with annotations for additional static checking.

In this subject, we will focus on SPARK, as it is the most widely used safe programming language subset, and also has the most mature tools supporting it.

Example 5.1 The following is an example of a safe programming subset of C, defined by a strict coding standard. The coding standard [1] was defined by engineers at NASA's Jet Propulsion Laboratory, for use on NASA's Mars rover Curiosity. In fact, the standard is a subset of MISRA C, a safe subset of C defined by the Motor Industry Software Reliability Association (MISRA), which defines standards for the automotive industry. For NASA's Curiosity, the following rules had to be abided by:

1. Language Compliance
(a) Do not stray outside the language definition.
(b) Compile with all warnings enabled; use static source code analysers.

2. Predictable Execution
(a) Use verifiable loop bounds for all loops meant to be terminating.
(b) Do not use direct or indirect recursion.
(c) Do not use dynamic memory allocation after task initialisation.
(d) Use IPC messages for task communication.
(e) Do not use task delays for task synchronisation.
(f) Explicitly transfer write-permission (ownership) for shared data objects.
(g) Place restrictions on the use of semaphores and locks.
(h) Use memory protection, safety margins, barrier patterns.
(i) Do not use goto, setjmp or longjmp.
(j) Do not use selective value assignments to elements of an enum list.

3. Defensive Coding
(a) Declare data objects at the smallest possible level of scope.
(b) Check the return value of non-void functions, or explicitly cast to (void).
(c) Check the validity of values passed to functions.
(d) Use static and dynamic assertions as sanity checks.
(e) Use U32, I16, etc. instead of predefined C data types such as int, short, etc.
(f) Make the order of evaluation in compound expressions explicit.
(g) Do not use expressions with side effects.

4. Code Clarity
(a) Make only very limited use of the C pre-processor.
(b) Do not define macros within a function or a block.
(c) Do not undefine or redefine macros.
(d) Place #else, #elif, and #endif in the same file as the matching #if or #ifdef.
(e) Place no more than one statement or declaration per line of text.
(f) Use short functions with a limited number of parameters.
(g) Use no more than two levels of indirection per declaration.
(h) Use no more than two levels of dereferencing per object reference.
(i) Do not hide dereference operations inside macros or typedefs.
(j) Do not use non-constant function pointers.
(k) Do not cast function pointers into other types.
(l) Do not place code or declarations before an #include directive.

5. MISRA "shall" compliance
(a) All MISRA "shall" rules not already covered at Levels 1-4 (73 rules in total).

6. MISRA "should" compliance
(a) All MISRA "should" rules not already covered at Levels 1-4 (16 rules in total).

The rules above are not sufficiently detailed to be understood in isolation, but the standard itself [1] (available at http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf) presents a set of well-defined rules, as well as justifications for each rule. 2

5.3 SPARK

SPARK is a subset of Ada, in that any correct SPARK program is also a correct Ada program. However, calling it just a subset is misleading, because SPARK also extends Ada via the use of annotations. Annotations are a major part of the language. Thus, SPARK = safe(Ada) + annotations.

5.3.1 SPARK Structure

The steps used to construct SPARK are:

• Take Ada, which is a safe language relative to most others.

• Remove unsafe features to get a powerful subset that is still Turing complete.

• Define a set of annotations to support static checking.
To enable SPARK annotations, we must tell the tools that we are using SPARK mode, using syntax such as:

package IncorrectReassignment with
   SPARK_Mode => On
is

SPARK mode can be enabled for packages, for individual procedures/functions, or at the project level. Because every SPARK program is also a legal Ada program, any SPARK program can be compiled using a standard Ada compiler. In fact, there is no SPARK compiler: just a set of tools for checking conformance to the SPARK language. If these tools confirm that an Ada program is also a valid SPARK program, then an Ada compiler is used to compile the program.

Chapter 6 presents the use of two types of SPARK annotation: preconditions and postconditions. These are the only annotations we will use in this subject; however, several other useful annotations exist that can be used to cross-check source code.

5.3.2 What is left out of Ada?

SPARK removes many of the features of Ada that are considered either unsafe or difficult to verify. These include language features such as:

• Dynamic memory allocation
• Tasks
• Gotos
• Exceptions
• Generics
• Access types (similar to references in other languages)
• Recursion

Ada heavily favours the reader over the writer in its design, the reason being that verification is more difficult, and more important, than coding. SPARK carries this principle to further extremes. Why? Because the reader may be a software tool used to verify the program. Furthermore, writing code takes a fraction of the time it takes to verify it, especially in certifiable software.

Why no dynamic memory allocation?

One of the key constraints in SPARK is that memory cannot be allocated dynamically. That is, all arrays etc. must be statically allocated. The reason for this is certifiability: if all memory is allocated at design time, then we can calculate the maximum amount of memory required to run the program, and ensure that we have enough memory for the program to execute.
If SPARK allowed dynamic memory allocation, there would be a possibility that the program could attempt to allocate more memory than is available on the physical platform, which can have catastrophic consequences.

In addition to aiding memory management, static memory allocation makes it straightforward to reason about range correctness; e.g. we can prove that there is no possibility of an array being indexed out of bounds.

Why no tasks?

Tasks are forbidden because tasking programs are inherently more complex than simple sequential programs:

1. The state of a tasking program is a much more complex object than that of a non-tasking program.

2. Task interactions are complex. In particular, verifying tasking programs that share memory is exceedingly difficult, because we can no longer verify a module as a stand-alone state machine: its internal data may not be hidden.

While SPARK does not allow tasks, RavenSPARK 95, a superset of SPARK, does allow tasks (although with strong restrictions), and also supports verification of these.

Why no gotos?

Gotos are not really needed in most programming: at best they are only a minor convenience, but they complicate formal analysis. When gotos are used, the effect of a sequence of code can no longer be represented as the composition of the effects of its sequential components. In other words, they violate the principle of structured programming.

Why no exceptions?

Because exceptions make the control flow of a program much more complex. Like goto statements, they violate the principle of structured programming. Further to this is the issue of what happens when an exception occurs. Certifiable programs cannot have unexpected exceptions being thrown: for example, software responsible for controlling part of an aircraft cannot simply throw an exception and exit. This implies that all exceptions must be handled using exception handlers. However, exception handlers lead to deactivated code, which is problematic for certification.
In the absence of exceptions, return values from sub-programs should instead be used to signal exceptional cases, which can then be handled safely.

Why no generics?

Generics are a shorthand for repeated instantiations, so they do not provide any fundamental expressive power. But proving properties of generic code is more complex, because we have to quantify over types. That is, we must prove that the properties we want verified hold for all instantiations, rather than just one.

If we have to prove the separate instantiations for certifiable software anyway, it is generally faster and easier to just write all of the instantiations than it is to use generics. This creates a more lengthy task at the programming level, but the cost is clawed back by making verification more straightforward.

Why no access types?

Access types only make sense in connection with dynamic storage allocation, but dynamic allocation is forbidden.

Why no recursion?

Recursive programs are difficult to reason about. All non-recursive programs can be mapped to regular expressions, while some recursive programs cannot; this is arguably a violation of the principles of structured programming. More importantly, in the absence of recursion, the maximum stack depth required for a program can be statically determined at design time. Once recursion is introduced, variables can be declared at each call, meaning that memory management must be dynamic: something that is not permitted in SPARK.

Other things left out

Some other language features that are omitted in SPARK:

1. Operation overloading: the number of possible combinations of bindings when operations are overloaded/overridden can be challenging to reason about, and can cause unexpected behaviour.

2. Functions with side effects: any sub-program that needs to modify its inputs or a global must be a procedure. Functions can only read global variables or parameters.

3. Implicit subtypes; e.g.

   X : Integer range 1..10;

is not allowed. We must instead declare a separate subtype:

   subtype XRange is Integer range 1..10;
   X : XRange;

This is to prevent designers/programmers from specifying the same implicit subtype for multiple variables. It is difficult to determine whether such implicit subtypes are intended to represent the same logical concept or different concepts. Allowing implicit subtypes also presents a maintenance issue: if the range changes, which implicit subtypes need to be changed?

5.3.3 Why bother with SPARK or other safe programming language subsets?

A question to ask is: why bother with safe programming language subsets? Doesn't SPARK make programmers "less productive" by forcing them to declare new subtype ranges, etc.? The answer is: "yes, it probably does, but ...".

It may be true that it takes a bit longer to write one feature of a system, but that time is more than gained back in the verification, and even debugging, of the program. Using the SPARK toolset (which we will use in workshops), we can eliminate a number of possible faults without even having to execute the software.

Testing typically takes about 50% of the total effort in a software product. For a certifiable system, verification (including testing) takes up even more. By taking some extra time during programming, we can cut down the verification time significantly. Further to this, we can certifiably verify our system.

If we look at the features of SPARK and compare them to the C programming standard used on NASA's Mars rover Curiosity, we can see that many of the Ada features not present in SPARK are the same features not permitted in that coding standard. Thus, SPARK is a formal definition of a language that forbids many programming language features considered unsafe or problematic, with a set of mature tools that are able to prove that many other problematic faults are not in a program.
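The error-signalling style described under "Why no exceptions?" above can be sketched outside Ada as well. The following Java fragment (Java is used purely for illustration here; the class and method names are hypothetical, not from the subject materials) signals an exceptional case through its return value rather than by throwing:

```java
import java.util.OptionalInt;

// Illustrative sketch: signalling an exceptional case through the
// return value instead of throwing an exception, as code without
// exception handlers must do.
public class NoExceptions {

    // Returns an empty OptionalInt to signal division by zero,
    // rather than letting an ArithmeticException propagate.
    static OptionalInt safeDivide(int numerator, int denominator) {
        if (denominator == 0) {
            return OptionalInt.empty();  // exceptional case signalled in-band
        }
        return OptionalInt.of(numerator / denominator);
    }

    public static void main(String[] args) {
        System.out.println(safeDivide(10, 2).orElse(-1));  // prints 5
        System.out.println(safeDivide(10, 0).orElse(-1));  // prints -1
    }
}
```

The caller is forced to consider the exceptional case explicitly at the call site, which keeps the control flow structured.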
5.4 SPARK Examiner

In this section, we present the SPARK Examiner tool, which uses annotations to prove that certain properties are (or are not) present in a program.

5.4.1 Annotations

Annotations are Ada comments, starting with --#. Like comments, they specify some form of expected/intended behaviour of a program; however, they are formal comments, and are strict elements of the SPARK syntax. By "formal", we mean that the comments have a strict syntax and semantics. The syntax and semantics are designed to be both human and machine readable.

As well as making code clearer at the specification level, annotations help with verification by:

1. introducing redundancy (annotations + code), which can be checked for consistency using static analysis; and

2. allowing error checks to be made.

In this chapter, the two types of annotation we will look at concern variable declaration and information flow. These are the only two annotations that are compulsory in a SPARK program. They are attached to sub-programs, and each uses its own reserved keyword:

• global: specifies which global variables (those declared outside of the sub-program) are used by it.

• derives: specifies which variables are used to derive the values of other variables.

The annotations describe information about the program. From the program code, we can infer the same information about which variables change and how their values are determined, and cross-check this against the annotations. SPARK Examiner is a tool that supports this for SPARK.

5.4.2 SPARK Examiner

SPARK Examiner is a tool for statically analysing SPARK programs. It does the following:

• checks the syntactic validity of a program;

• checks annotations for consistency and correctness;

• generates useful warnings for debugging; and

• generates auxiliary information helpful in analysis and certification.
One of its primary purposes is to check that a program is a valid SPARK program (as opposed to just an Ada program). SPARK Examiner is run prior to compilation. If it produces no warnings or errors, then the program can be compiled by an Ada compiler. We will investigate SPARK Examiner in greater detail both in later chapters and in workshops. The following provides some examples of how SPARK Examiner can check the internal consistency of a SPARK program.

Example 5.2 The following code listing shows a second attempt at implementing the Swap procedure. In this example, a local variable T is used as the temporary value holder instead of a global variable.

Listing 5.1: incorrectassignment.ads

procedure Swap (X, Y : in out Float);

Listing 5.2: incorrectassignment.adb

procedure Swap (X, Y : in out Float) is
   T : Float;
begin
   T := X;
   X := Y;
   Y := X;
end Swap;

There is a fault in this program: the final assignment assigns X to Y, instead of assigning T to Y, so the values do not get swapped correctly. Running this through SPARK Examiner would result in the following five problems being identified:

1. T := X is an "ineffective statement": the assigned value of T is never used.

2. Importation of the initial value of X is ineffective: the initial value of X is never used, so declaring it as an in parameter is not consistent with the program.

3. The variable T is neither referenced nor exported: T is declared and given a value, but that value is never used, and T is not an output variable.

4. The imported value of X is not used in deriving the value of Y: although the final assignment appears to do exactly this, SPARK Examiner is sophisticated enough to reason that X is reassigned on the line before, so the imported value of X is not used to derive Y, which is inconsistent with the derives annotation.

5. The imported value of Y may be used in the derivation of Y: as a result of the fault, the value of Y is derived from itself, which is inconsistent with the derives annotation. 2

The above example illustrates several checks that are performed by SPARK Examiner. All five errors can be attributed to the same single fault in the final assignment statement.

Exercise 5.1 The following code listing presents a nonsensical example of the "swap" program, in which the assignment of the value of X to the variable T only occurs if X < Y. This is not sensible behaviour, but is included just to demonstrate the features of SPARK Examiner.

Listing 5.3: weirdswap.ads

procedure Swap (X, Y : in out Float);

Listing 5.4: weirdswap.adb

procedure Swap (X, Y : in out Float) is
   T : Float;
begin
   if X < Y then
      T := X;
   end if;
   X := Y;
   Y := T;
end Swap;

What is the issue with this procedure? 2

Exercise 5.2 The following example of the Swap procedure is more sensible than the last, but the additional code (initialising the variable T) will generate a warning from SPARK Examiner.

Listing 5.5: initialiselocal.ads

procedure Swap (X, Y : in out Float);

Listing 5.6: initialiselocal.adb

procedure Swap (X, Y : in out Float) is
   T : Float := 0.0;
begin
   T := X;
   X := Y;
   Y := T;
end Swap;

What is the issue with this procedure? 2

Exercise 5.3 The two exercises above and Example 5.2 each contain specific (and different) types of faults. Do the three issues in these last three programs seem familiar? HINT: You may need to think back to SWEN90006. 2

Bibliography

[1] JPL institutional coding standard for the C programming language. JPL DOCID D-60411, March 2009. http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf.

Chapter 6

Design by contract

In this chapter, we present the concept of Design by Contract (DbC) – a paradigm for designing software.
The paradigm prescribes that designers should specify the interfaces between software components using formal, mathematical statements, such as preconditions, postconditions, and invariants. These statements serve both as documentation and as statements that can be reasoned about automatically to provide verification assistance. SPARK supports DbC using its annotation system, and the tool support for SPARK contracts is particularly mature.

The learning outcomes of this part of the subject are:

1. To understand the motivation, advantages, and disadvantages of design by contract.

2. To specify formal contracts for small- and medium-scale packages in SPARK.

6.1 Introduction to Design by Contract

Design by Contract (DbC) [2] is a design paradigm specifying how sub-systems or components of a system interact with each other. The paradigm is based on the metaphor of "clients" and "suppliers", who have obligations under, and receive benefits from, a mutual contract.

6.1.1 History

The DbC paradigm was developed by Bertrand Meyer, while a professor at the University of California, Santa Barbara. In concert with DbC, Meyer and colleagues developed Eiffel, a programming language that embodies the very spirit of DbC. For his efforts on Eiffel and DbC, Meyer was presented with the 2006 ACM Software System Award:

"For designing and developing the Eiffel programming language, method and environment, embodying the Design by Contract approach to software development and other features that facilitate the construction of reliable, extendable and efficient software." — 2006 ACM Software System Award Citation.

The Eiffel language and tools to support DbC reasoning are provided by the Eiffel Software company (http://www.eiffel.com).

6.1.2 Some Definitions

Before we go further, we cover some standard software engineering definitions that are especially important for DbC:

1. Information hiding is a design principle specifying that data types and implementation should be abstracted and hidden behind a publicly available interface.

2. Encapsulation is a strategy for information hiding that bundles related entities together behind an interface.

3. An abstract data type (ADT) is a package/module/class that encapsulates data and operations on that data, and hides the data types and implementation behind an interface.

Exercise 6.1 Explain why these concepts are important in software engineering. 2

6.2 Specifying ADT interfaces

In the above definitions, interfaces play a key role. In modern programming languages, interfaces are an important part of providing abstraction via information hiding. Most software libraries, and many components and modules in larger packages, document their interfaces so that a user of the interface can understand what services it provides, and what inputs it must be given to produce correct outputs.

6.2.1 ADT interfaces as natural-language comments

Many ADT interface specifications are documented using comments at the top of modules and operations. These comments, if crafted well, provide users (the developers using the code) with information as to which parts to use and how to use them. They also provide the developers of the module itself with a design against which to implement and verify their code.

Example 6.1 Consider the following example of a Java class interface that implements the behaviour of a bounded stack.

//Implements a bounded stack data type.
//The stack must also be less than a specified size.
public class StackImpl implements IStack
{
    //Create a new empty stack, with a maximum size, which
    // must be greater than 0
    public StackImpl(int size);

    //Add Object o to the top of the stack
    // if the stack is not full.
    public void push(Object o);

    //Return and remove the object at the top of the stack
    // if the stack is not empty.
    public Object pop();

    //Return the size of the stack.
    public int size();
}
2

6.2.2 ADT interfaces as invariants, preconditions, and postconditions

While specifying the behaviour of an interface using natural-language comments provides users with information to use the class/module, a more structured manner of doing so is using invariants, preconditions, and postconditions. Recall from Chapter 3 the following definitions:

1. A precondition is a statement that must hold for an operation to execute. If the precondition does not hold, the behaviour of the operation is undefined.

2. A postcondition is a statement that must hold after an operation executes, provided that the precondition was true before the operation was executed.

3. A state invariant (or just invariant) is a statement that must hold at all times. It must hold before any operation is executed, and must also hold after that operation is executed. That is, all operations must preserve the invariant. An invariant can be violated during the execution of an operation, provided that the invariant is re-established by the end of the operation.

Preconditions, postconditions, and invariants say what must hold, but not how this must be achieved. Example 6.1 specifies the behaviour of a Java class in which the data types and implementation are hidden. However, is it clearer to specify this as preconditions and postconditions?

Example 6.2 The following example specifies the behaviour of the Java class from Example 6.1 using preconditions and postconditions. In this example, the keyword requires is used to denote the precondition, and the keyword ensures is used to denote the postcondition.

//Implements a bounded stack data type.
public class StackImpl implements IStack
{
    //invariant: The stack must be less than a specified size.

    //requires: size is greater than 0
    //ensures: Creates a new empty stack, with maximum size size.
    public StackImpl(int size);

    //requires: the stack is not full.
    //ensures: Object o is added to the top of the stack.
    public void push(Object o);

    //requires: the stack is not empty
    //ensures: return and remove the object at
    // the top of the stack
    public Object pop();

    //requires: none (or true)
    //ensures: return the size of the stack.
    public int size();
}
2

The above example does not provide any more information than the interface specification from Example 6.1. However, the structured comments are preferable for two reasons:

1. The separation of precondition and postcondition makes it clearer to both developers and users of the module what is required from the calling code (that the precondition holds before a call is made to the operation) and what is required from the implementation of the operation (that the postcondition holds after the call returns). Establishing the former is the obligation of the user, while establishing the latter is the obligation of the developer.

2. The requirement to provide an invariant, precondition, and postcondition (especially the first two) forces the designer to consider these issues: what is invariant about the module, and what are the preconditions of its operations? As such, we are more likely to end up with a more complete interface specification.

6.2.3 ADT interfaces as contracts

Software systems are viewed as a set of interacting components. Interaction occurs via calls to operations defined in an interface; in many cases, an ADT interface. When a call is made, the caller is expected to ensure that the input is of the required format, and the callee is expected to ensure that it provides a correct output with respect to the input. DbC leverages this by using the metaphor of a contract.
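The caller/callee obligations just described can be sketched in Java; the class and method names below are hypothetical, not from the subject materials. The client establishes the precondition before each call, and the supplier assumes it rather than re-checking it defensively:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal bounded stack illustrating client/supplier obligations.
// The supplier (BoundedStack) assumes its preconditions hold; the
// client is obliged to establish them before each call.
class BoundedStack {
    private final Deque<Integer> elements = new ArrayDeque<>();
    private final int maxSize;

    BoundedStack(int maxSize) {          // requires: maxSize > 0
        this.maxSize = maxSize;
    }

    boolean isFull()  { return elements.size() == maxSize; }
    boolean isEmpty() { return elements.isEmpty(); }

    void push(int x) {                   // requires: !isFull()
        elements.push(x);                // ensures: x is on top of the stack
    }

    int pop() {                          // requires: !isEmpty()
        return elements.pop();           // ensures: top element returned and removed
    }
}

public class ContractDemo {
    public static void main(String[] args) {
        BoundedStack s = new BoundedStack(2);
        // Client obligation: establish the precondition before calling.
        if (!s.isFull()) s.push(1);
        if (!s.isFull()) s.push(2);
        if (!s.isFull()) s.push(3);      // precondition false: the call is not made
        // The supplier's pop may assume the stack is non-empty here.
        if (!s.isEmpty()) System.out.println(s.pop());   // prints 2
    }
}
```

If the client calls push on a full stack anyway, it has violated its side of the contract, and the resulting behaviour is the client's fault rather than the supplier's.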
Contracts In business, a contract is an agreement between two or more parties with mutual obligations and benefits. For example, a contract between a client and supplier of a product has the following properties: Obligation Benefit Client Pay for product Receives product Supplier Provide product Receives income 114 The DbC paradigm uses this metaphor in software: a module is the supplier and the calling code is the client. The supplying module must perform some computation, and in return the calling code must ensure that the input values sent to the supplying module are correct. Contracts and software Using contracts on interfaces ensures that interaction between components is governed by contracts. In total, a contract for an interface should consist of at least the three following items: 1. preconditions for all operations — what the contract expects; 2. postconditions for all operations — what the contract guarantees; and 3. an invariant — what the contract maintains. Using the metaphor of a contract in software design, we can reflect the above table as follows: Obligation Benefit Client Establish precondition Receives result of computation Supplier Provide result of computation Has precondition established Expanding the above: 1. The precondition is the obligation of the client: before calling any operation, the client must ensure that the precondition of that operation holds. For example, if calling a binary search operation, it must ensure that the input list is sorted in ascending order. 2. The postcondition is the obligation of the supplier: the supplier must change the ADT state and/or return the requested result to the client, as specified by the postcondition. The supplier must also establish the invariant. 3. The client’s benefit is it has the operation executed, as requested. 4. 
The supplier's benefit is that it can assume that the precondition of any executed operation, and the invariant, hold before the operation is executed, thereby simplifying the implementation. In practice, the supplier may sometimes throw exceptions or handle violations gracefully; however, this is not always possible or desirable.

Contracts are bi-directional and, assuming all of the contracts are correct, violation of a contract by either party is a fault. Further, in the case of a fault, we can assign the responsibility for the fault to a particular module: either the calling module or the called module.

The example of the stack in Example 6.2 demonstrates informal contracts via standard comments. However, can we be more precise?

6.2.4 Formal contracts

So far, we have specified interfaces for ADTs using comments, and in one case, comments structured as invariants, preconditions, and postconditions. However, the DbC paradigm advocates that software designers should define formal, precise, and verifiable contracts.

Many languages provide constructs, such as assertions, to establish formal properties of their implementations. The DbC paradigm goes further, stating that such properties are so valuable in establishing software correctness that they should be a routine part of software design, and that all modules should have contracts.

Example 6.3
The following example provides a formal contract for the running stack example using Java Modelling Language [1]. To distinguish JML contract statements from normal Java comments, the @ symbol is used. Method bodies are elided, as only the specification is of interest here.

//Implements a bounded stack data type.
public class StackImpl implements IStack
{
    private /*@ spec_public @*/ Object [] elements;
    private /*@ spec_public @*/ int top;
    private /*@ spec_public @*/ int maxsize;

    //@ public invariant top >= -1 && top < maxsize;

    //@ requires size > 0;
    //@ ensures maxsize == size && top == -1 && elements != null;
    public StackImpl(int size);

    //@ requires top < maxsize - 1;
    //@ ensures top == \old(top) + 1 && elements[top] == o;
    public void push(Object o);

    //@ requires top >= 0;
    //@ ensures top == \old(top) - 1 && \result == elements[\old(top)];
    public Object pop();

    //@ ensures \result == top + 1;
    public int size();
}

The spec_public keyword denotes that the variable is public for specification purposes. That is, it is seen in any generated documentation, but the corresponding variable remains private/protected to other code. □

The example above is a formalised version of the same interface in Example 6.2, in which JML is used instead of natural language. The same preconditions, postconditions, and invariants are provided, except in a precise manner.

6.2.5 Advantages and disadvantages of formal contracts

Formal contracts, in particular those offered as native support in a programming language, offer several advantages over natural language comments. These include:

1. They document a precise (unambiguous) specification of behaviour. Further, if contracts are enforced as part of a language (or by project/organisational methods), the formal nature offers many of the same benefits offered by formal specification, as outlined in Section 3.3. Or, to put it better: "In practice, it is amazing to see how far just stating what a module should do goes towards helping to ensure that it does it." — Bertrand Meyer.

2. They can be used as runtime assertions.
That is, during implementation and testing of the system, when an operation is executed, we can check that the precondition, postcondition, and invariant all hold, and raise a runtime error if they do not. The failure of a contract to hold implies that either the code or the contract is wrong.

3. A corollary of the above point is that contracts can be used as a test oracle. If we are unit testing a module, we can run large numbers of tests, and use the runtime checking to ensure that the contract is satisfied.

4. Contracts can be used for static analysis. That is, using software tools, we can specify certain properties as part of the contract and check statically that they are satisfied.

5. One way to implement operations with contracts is to assume that the precondition of an operation is satisfied; that is, do not use defensive programming. This has the benefit that there is no "double-up" checking of preconditions. That is, in many cases, the calling code will ensure that the precondition holds before execution, and then the called operation will also check this. This results in duplicated effort and execution.

The main disadvantage of contracts is that introducing them into a project/organisation requires some learning overhead, but empirical evidence demonstrates that this overhead is recouped quickly.

6.3 Contracts in SPARK

In Chapter 5, the SPARK programming language was introduced, but SPARK annotations were not discussed in detail. SPARK contains several types of annotation, such as variable dependencies and global variable declarations. In this subject, we'll concern ourselves mostly with module invariants, operation preconditions, and operation postconditions, which can be used to specify SPARK contracts. While the SPARK documentation does not use the term "Design by Contract", these contracts are well within the spirit of DbC.
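Before turning to SPARK's syntax, the runtime-assertion view of contracts from Section 6.2.5 can be made concrete in plain Java. The following is only a sketch (it is not the JML tooling; the class name and messages are illustrative), using Java's assert statement, which must be enabled with java -ea:

```java
// A hand-rolled runtime check of the stack contract from Example 6.3.
// Run with assertions enabled: java -ea ...
public class CheckedStack {
    private final Object[] elements;
    private int top = -1;

    // Invariant from the JML spec: -1 <= top < maxsize.
    private boolean invariant() {
        return top >= -1 && top < elements.length;
    }

    public CheckedStack(int size) {
        assert size > 0 : "precondition: size > 0";          // requires
        elements = new Object[size];
        assert top == -1 && invariant() : "postcondition";   // ensures
    }

    public void push(Object o) {
        assert top < elements.length - 1 : "precondition: stack not full";
        assert invariant() : "invariant on entry";
        int oldTop = top;                                    // \old(top)
        elements[++top] = o;
        assert top == oldTop + 1 && elements[top] == o : "postcondition";
        assert invariant() : "invariant on exit";
    }

    public Object pop() {
        assert top >= 0 : "precondition: stack not empty";
        assert invariant() : "invariant on entry";
        int oldTop = top;
        Object result = elements[top--];
        assert top == oldTop - 1 && result == elements[oldTop] : "postcondition";
        assert invariant() : "invariant on exit";
        return result;
    }

    public int size() {
        return top + 1;                  // ensures \result == top + 1
    }
}
```

A failed precondition assertion assigns blame to the client; a failed postcondition or invariant assertion assigns blame to the supplier, matching the fault-assignment story above.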
6.3.1 Syntax

There are several elements that are required in a SPARK contract for a module, with the most important for this subject being:

1. The Initializes list: the variables initialised by the module.
2. The Pre (for each operation): the precondition.
3. The Post (for each operation): the postcondition.

The SPARK annotation language for preconditions and postconditions contains some additional syntax on top of SPARK expressions to specify contracts.

Arrays and records

x in A'Range          --range membership
A'Update(x => y)      --array update (where 'x' is an array index)
A'Update(x => y)      --record update (where 'x' is a field in the record)

For arrays, there are two main types of operation. The first is "range membership", in which x is a variable and A is a type mark, and x in A'Range denotes that the value of x is in the type mark A. For arrays, as well as records, the operator A'Update(x => y), in which A is an array/record, x is an index/name, and y is an expression, denotes the same array/record as A, except with the index/name x overridden with the new value y. This is the same as function override in set theory.

Logic

SPARK offers standard propositional operators and quantification over type marks (i.e. not over arrays).

P and Q                      --conjunction
P or Q                       --disjunction
if P then Q                  --implication
not(P)                       --negation
for some x in X => (P(x))    --existential quantification
for all x in X => (P(x))     --universal quantification

Arithmetic

+, -, /, *       --plus, minus, division, multiplication
x mod y          --modulus
x**y             --exponent
<, >=, >, <=     --less than (or equal to), greater than (or equal to)
=, /=            --equals, not equals

Return

The Return operator denotes the value that a function will return:

FunctionName'Result      --the return value of function 'FunctionName'
return V => P(V)         --implicit return: V is constrained by predicate P

The first of these is a standard return: just return an expression.
The second, called an implicit return, allows the declaration of a new variable (V), and then allows the value of that variable to be constrained. The syntax return V => P(V) can be read: "return the value of variable V, such that predicate P holds on V."

Some things to note

1. In a postcondition, V'Old refers to the value of the variable before execution, and V to the value after. Thus, the syntax is reversed from Alloy's syntax of labelling the post-state variable.
2. The => and -> operators are different: the former comes after a quantification (e.g. for all), the latter is logical implication.
3. Any valid SPARK expression is a valid expression in a precondition or postcondition. That is, programming-level expressions can be referenced as part of SPARK predicates.

Example 6.4
This example is taken from the SPARK manual (http://docs.adacore.com/spark2014-docs/html/lrm/mapping-spec.html). The function simply adds two numbers together, but in addition to taking care of the addition, it also prevents integer overflows by specifying, as a precondition, that the inputs cannot lead to an overflow:

function Add (X, Y : Integer) return Integer
  with Pre => (if X >= 0 and Y >= 0 then X <= Integer'Last - Y
               elsif X < 0 and Y < 0 then X >= Integer'First - Y),
       Post => Add'Result = X + Y;

The contract specifies that this function must return (that is, the result of the expression Add'Result) an integer that adds X and Y. The precondition prevents the overflow. □

Example 6.5
The following is also taken from the SPARK Proof Manual. Consider a more complex procedure that swaps two elements in an array:

procedure Swap_Array_Elements(I, J : Index; A: in out Array_Type)
  with Post => A = A'Old'Update (I => A'Old (J),
                                 J => A'Old (I));

The postcondition states that the new value of array A should be the old value (A'Old), but with the elements at indices I and J replaced by the values at indices J and I respectively in the old array. □
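SPARK's 'Old and 'Update attributes have no direct Java counterpart, but the same postcondition can be checked at runtime by copying the array on entry. A sketch (the class and method names are illustrative, not from the SPARK manual; run with java -ea):

```java
import java.util.Arrays;

public class SwapCheck {
    // Swap two elements, checking the SPARK-style postcondition
    //   A = A'Old'Update(I => A'Old(J), J => A'Old(I))
    // at runtime.
    static void swapArrayElements(int i, int j, int[] a) {
        int[] old = a.clone();           // plays the role of A'Old

        int t = a[i];
        a[i] = a[j];
        a[j] = t;

        // Build A'Old'Update(I => A'Old(J), J => A'Old(I)) explicitly.
        int[] expected = old.clone();
        expected[i] = old[j];
        expected[j] = old[i];
        assert Arrays.equals(a, expected) : "postcondition violated";
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30};
        swapArrayElements(0, 2, a);
        System.out.println(Arrays.toString(a)); // [30, 20, 10]
    }
}
```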
Example 6.6
The following illustrates linear search as a contract:

package Linear_Search
  with SPARK_Mode
is
   type Opt_Index is new Natural;

   subtype Index is Opt_Index range 1 .. Opt_Index'Last - 1;

   No_Index : constant Opt_Index := 0;

   type Ar is array (Index range <>) of Integer;

   function Search (A : Ar; I : Integer) return Opt_Index with
     Post => (if Search'Result in A'Range then A (Search'Result) = I
              else (for all K in A'Range => A (K) /= I));

end Linear_Search;

In this example, types are used to restrict much of the behaviour (e.g. the index of the array), and the postcondition specifies the functional behaviour: if the result is in the range of the array A, then the element at the result must be equal to I (the element being searched for); otherwise, it must be that the element is not in the array. □

Example 6.7
As a final example, consider the following contract interface, which includes (among others) the functions for adding and swapping presented earlier:

package Swap_Add_Max
  with SPARK_Mode
is
   subtype Index is Integer range 1..100;
   type Array_Type is array (Index) of Integer;

   procedure Swap (X, Y : in out Integer)
     with Post => (X = Y'Old and Y = X'Old);

   function Add (X, Y : Integer) return Integer
     with Pre => (if X >= 0 and Y >= 0 then X <= Integer'Last - Y
                  elsif X < 0 and Y < 0 then X >= Integer'First - Y),
          Post => Add'Result = X + Y;

   function Max (X, Y : Integer) return Integer
     with Post => Max'Result = (if X >= Y then X else Y);

   function Divide (X, Y : Integer) return Integer
     with Pre => Y /= 0 and (if X = Integer'First then Y /= -1),
          Post => Divide'Result = (X / Y);

   procedure Swap_Array_Elements(I, J : Index; A: in out Array_Type)
     with Post => A = A'Old'Update (I => A'Old (J),
                                    J => A'Old (I));
end Swap_Add_Max;

□

6.3.2 Similarity to Alloy

As a brief note, it is important to note the similarities between the Alloy specification language and
the SPARK contract language. These similarities are no coincidence: the combination of languages such as Alloy and SPARK is common (although to date, most applied work has used the Z language instead of Alloy), and the translation from Alloy or Z to SPARK can be quite straightforward for standard preconditions/postconditions. That is, systems engineers specify behaviour in Z/Alloy, prove properties about this using tools based on first-order logic, and then translate the specifications into SPARK contracts. The contracts are then implemented in SPARK code, and this code is verified against the contracts. The similarities between languages like Z/Alloy and SPARK make the translation straightforward in many cases.

6.3.3 Tools

The SPARK toolset comes with tools for supporting contracts in SPARK, including the SPARK Examiner, which can be used to check that a contract is a valid SPARK contract. In the workshops, we will look at a powerful tool chain (as part of the SPARK tools) that can be used for verifying programs against their contracts.

6.4 Additional reading

1. The following offers an excellent tutorial on SPARK contracts, using a real system developed by Altran-Praxis: http://www.adacore.com/home/products/sparkpro/tokeneer/discovery/. The notation used in this tutorial is for SPARK versions prior to 2014, but the notation for precondition and postcondition expressions is the same.
2. The following tutorial on SPARK 2014 is a great place to start: http://docs.adacore.com/spark2014-docs/html/ug/getting_started.html
3. The following tutorial on SPARK contracts is another good reference for people who have played with the SPARK Examiner: http://docs.adacore.com/spark2014-docs/html/ug/tutorial.html
4. Bertrand Meyer's textbook, Object-Oriented Software Construction [2], is a detailed presentation of the Eiffel programming language, including Design by Contract.

Bibliography

[1] G. Leavens, A. Baker, and C. Ruby. JML: A notation for detailed design.
In Behavioral Specifications of Businesses and Systems, pages 175–188. Springer, 1999.

[2] B. Meyer. Object-Oriented Software Construction. Prentice Hall, New York, 2nd edition, 2002.

Chapter 7

Reasoning about program correctness

"How do we convince people that in programming simplicity and clarity — in short: what mathematicians call 'elegance' — are not a dispensable luxury, but a crucial matter that decides between success and failure?" — Edsger W. Dijkstra [3].

Content to be heavily revised in 2022 (please email
[email protected]
for more info): This chapter will be revised in 2022. Stay tuned.

In this chapter, we will look at a very powerful piece of work, known as Hoare logic, named after its creator, Tony Hoare, who is currently a principal researcher at Microsoft. This work dates back as far as 1969, when Hoare proposed the first axiomatic basis for computer programming [4]. Taking only a small set of primitive programming constructs (now considered to be the basis of structured programming), Hoare demonstrated a simple logic that enabled programs to be proved correct. Since that time, much research has gone into building on Hoare's initial work, with considerable success.

We will look at Hoare logic and show how to use it to reason about the correctness of programs, allowing us to prove that a program meets its contract/specification. This is much more powerful than running tests, which constitute nothing more than a handful of observations about our program, and also more powerful than using review to check a larger set of cases. For some systems, testing and review are not enough.

By the mid-1970s, there was a general belief that this type of reasoning would pervade software engineering within a decade or two. However, three things happened:

1. The speed and power of computers continued to increase so quickly that the types of systems that could be constructed using software outgrew the techniques for such complex reasoning.
2. Software began to pervade homes and businesses as useful tools, and the correctness of these software systems was not as important as it was for the systems in which software had previously been applied, such as digital components. Further, time to market became more important than it had previously been, so there was an acceptance of lower quality for a faster turnaround.
3.
The spread of the Internet has meant that fixing many software applications can be done by users downloading patches or, in the case of server-side problems, by developers simply uploading a new version to the server. This is not sufficient for many safety-critical systems, in which the software resides in a non-PC and non-server environment (e.g. on a car ECU).

It is my belief that the techniques presented in this chapter are the most important in the subject — yet they are also the techniques that I can confidently state none of you will ever apply as they are applied in these notes. The importance of this chapter is to change the way that you think about programs, the science of programming, and software design and verification, in order to make you better programmers and software engineers.

Learning outcomes

The learning outcomes of this chapter are:

1. To critique the difference between program proof and other forms of verification, such as testing.
2. To be able to prove the correctness of simple programs against their contracts.
3. To think about programming in a more systematic and mathematical manner, with the result of improving our programming skills.

7.1 Introduction to reasoning about programs

The first thing to ask is why we would want to reason about programs at all. In previous chapters, we have looked at how to formally specify and design (by contract) our system, and to prove that the specification and design satisfy certain properties. Proofs of formal statements are possible because we are operating in the space of logic and set theory, which have well-founded proof theories.

However, for high integrity systems, we want to ensure that our designs are implemented correctly. That is, we want to show that our code corresponds to our design. Quality assurance techniques such as peer review and testing seem to offer very good reliability for software.
However, with respect to high-integrity constraints, they do not offer the level of assurance that is required. What if we have missed a few cases that can cause catastrophic consequences?

This chapter covers the theory of how we can prove that a program meets its contract. This is a powerful ability that provides us with assurance far above what is possible with testing and review.

7.1.1 The correctness of binary search

Let's motivate the idea of reasoning about programs using the following story from Jon Bentley:

"Binary search solves the problem [of searching within a pre-sorted array] by keeping track of a range within the array in which T [i.e. the sought value] must be if it is anywhere in the array. Initially, the range is the entire array. The range is shrunk by comparing its middle element to T and discarding half the range. The process continues until T is discovered in the array, or until the range in which it must lie is known to be empty. In an N-element table, the search uses roughly log2 N comparisons.

Most programmers think that with the above description in hand, writing the code is easy; they're wrong. The only way you'll believe this is by putting down this column right now and writing the code yourself. Try it.

I've assigned this problem in courses at Bell Labs and IBM. Professional programmers had a couple of hours to convert the above description into a program in the language of their choice; a high-level pseudocode was fine. At the end of the specified time, almost all the programmers reported that they had correct code for the task. We would then take thirty minutes to examine their code, which the programmers did with test cases. In several classes and with over a hundred programmers, the results varied little: ninety percent of the programmers found bugs in their programs (and I wasn't always convinced of the correctness of the code in which no bugs were found).
I was amazed: given ample time, only about ten percent of professional programmers were able to get this small program right. But they aren't the only ones to find this task difficult: in the history in Section 6.2.1 of his Sorting and Searching, Knuth points out that while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962." — Jon Bentley, Programming Pearls (1st edition), pp. 35–36 [1].

Donald Knuth mentions in The Art of Computer Programming [6] that the first version of the binary search algorithm appeared in 1946, but it was not until 1962 that the first correct version appeared. All previous versions contained faults. Further, the most well-known implementation, written and "proved" by Jon Bentley, which has been adapted many times, contains a fault. The truth is, very few correct versions of the algorithm exist.

Exercise 7.1
There is a fault in the following widely-tested and widely-used version of binary search. Find this fault. TIP: Converting to SPARK and using the SPARK tools to verify the program will reveal the fault.

int binarySearch(int[] list, int target)
{
    int low = 0;
    int high = list.length - 1;
    int mid;

    while (low <= high) {
        mid = (low + high)/2;
        if (list[mid] < target) {
            low = mid + 1;
        }
        else if (list[mid] > target) {
            high = mid - 1;
        }
        else {
            return mid;
        }
    }
    return -1;
}

□

The question then is: if some of the best and brightest minds in computer science cannot get a seemingly simple algorithm correct using testing, and dozens of versions in textbooks can be published with faults that remain undetected for decades, what hope do we have for producing correct software the first time? In this chapter, we will look at Hoare logic as one approach for proving a program correct with respect to its specification. In workshops, we will look at the powerful support provided by the SPARK tools that can assist us in this goal.
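One class of fault worth keeping in mind in this setting is overflow in index arithmetic, which the SPARK tools are particularly good at exposing statically; the following Java snippet demonstrates the general phenomenon (and is a strong hint for Exercise 7.1):

```java
public class MidpointOverflow {
    public static void main(String[] args) {
        // With ranges around two billion elements, low + high exceeds
        // Integer.MAX_VALUE (2147483647) and wraps to a negative value.
        int low = 2_000_000_000;
        int high = 2_100_000_000;

        int naive = (low + high) / 2;       // overflows before dividing
        int safe = low + (high - low) / 2;  // stays within int range

        System.out.println(naive);  // a negative, nonsensical index
        System.out.println(safe);   // 2050000000, the true midpoint
    }
}
```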
7.2 A small programming language

Before we look at Hoare logic, we'll define a small structured programming language that is Turing complete, and which will form the basis of the mathematical objects in Hoare's logic. The language consists of procedures, variables, expressions (numerical expressions and arrays of expressions), variable assignment, sequencing, conditionals, loops, and procedure calls:

Procedures:
    P := procedure p(v1, . . . , vn) =̂ S
    where v1, . . . , vn are variable names, p is a procedure name, and S is a program statement (defined below).

Expressions:
    E := NE | a[NE]
    NE := n | v | E1 + E2 | E1 − E2 | E1 × E2 | E1/E2 | etc. . .
    where n is a number, v is a variable, a is an array variable, NE is a numerical expression, and E an expression.

Booleans:
    B := true | false | ¬B | E1 = E2 | E1 ≤ E2 | etc. . .

Statements:
    S := skip | v := E | a[NE] := E | p(E1, . . . , En) | S1; S2
       | if B then S1 else S2 endif | while B do S done
    where v is a variable, a is an array variable, and p is a procedure name.

The skip statement is a statement that has no effect and always executes successfully, the same as null in Ada.

A program in this language consists of a (possibly empty) sequence of procedure definitions, and a final "main" program; e.g.:

    procedure p1(v1, . . . , vm) =̂ S1
    . . .
    procedure pn(v1, . . . , vn) =̂ Sn
    S

where S is the main program, which may reference procedures p1, . . . , pn. Using this simple language, we can construct any program that is possible with more expressive languages, such as those with references, but the program may be more long-winded.
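The grammar above can be turned into a small interpreter, which is a useful way to check one's reading of the language. A sketch in Java (the class and method names are illustrative; each syntactic category becomes a functional interface over a store mapping variable names to values):

```java
import java.util.HashMap;
import java.util.Map;

public class SmallLang {
    // Expressions evaluate to integers in a variable store.
    interface Expr { int eval(Map<String, Integer> s); }
    static Expr num(int n)          { return s -> n; }
    static Expr var(String v)       { return s -> s.get(v); }
    static Expr add(Expr a, Expr b) { return s -> a.eval(s) + b.eval(s); }
    static Expr mul(Expr a, Expr b) { return s -> a.eval(s) * b.eval(s); }

    // Boolean expressions.
    interface BExpr { boolean eval(Map<String, Integer> s); }
    static BExpr neq(Expr a, Expr b) { return s -> a.eval(s) != b.eval(s); }

    // Statements transform the store.
    interface Stmt { void exec(Map<String, Integer> s); }
    static Stmt skip()                   { return s -> {}; }
    static Stmt assign(String v, Expr e) { return s -> s.put(v, e.eval(s)); }
    static Stmt seq(Stmt a, Stmt b)      { return s -> { a.exec(s); b.exec(s); }; }
    static Stmt whileDo(BExpr b, Stmt body) {
        return s -> { while (b.eval(s)) body.exec(s); };
    }

    public static void main(String[] args) {
        // The factorial program from Example 7.1, as an AST:
        // f := 1; i := 0; while i != n do i := i + 1; f := f * i done
        Stmt fact =
            seq(assign("f", num(1)),
            seq(assign("i", num(0)),
                whileDo(neq(var("i"), var("n")),
                    seq(assign("i", add(var("i"), num(1))),
                        assign("f", mul(var("f"), var("i")))))));
        Map<String, Integer> store = new HashMap<>();
        store.put("n", 10);
        fact.exec(store);
        System.out.println(store.get("f")); // 3628800 = 10!
    }
}
```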
Example 7.1
Using this language, we can construct a program that calculates the factorial of 10 and assigns it to a variable f (where n is an input variable and f is an output variable):

Algorithm 1 A program for calculating the factorial of a number

    procedure FACTORIAL(in n, out f)
        f := 1
        i := 0
        while i ≠ n do
            i := i + 1
            f := f × i
        end while
    end procedure

    FACTORIAL(10, f)

□

7.3 Hoare logic

Hoare logic [4] is an axiomatisation of programming. Hoare's aim was to treat computer programs as mathematical objects, and to provide a sound and complete set of rules for proving things about these objects.

7.3.1 Introduction to Hoare logic

Hoare logic is a set of rules for reasoning about programs. The rules are organised around the program statements in the simple programming language defined above. Any program in the language is a collection of statements, as defined by the grammar S, with a (possibly empty) set of procedures that it references. The rules are applied to the program in a top-down manner, until the final result is the application of rules to atomic statements: skip or variable assignments (v := E). The rules for compound program statements, such as sequencing and branching, require us to prove properties about the statements that make up the compound statement.

Statements about programs are written using Hoare triples, which take the following form:

    {P} S {Q}

This states that, given a program S with precondition P, if P is true when S is executed, then S will establish the postcondition Q. In other words, a Hoare triple describes how executing a program changes the variables in the program.

Hoare logic is a collection of axioms and rules of this form that can be used to reason about the correctness of programs; similar to using Peano's axioms to reason about numbers. To prove correctness of our program, we need to show that {P} S {Q} is true, using Hoare's axioms and rules.

The rules are written as inference rules, of the form:

    H1, . . . , Hn
    --------------
          C

where H1, . . . , Hn are a collection of hypotheses, and C is the conclusion. An inference rule should be read as: if hypotheses H1, . . . , Hn can all be proved, then the conclusion C is true. Thus, to prove C, we must prove H1, . . . , Hn.

For example, consider the most famous inference rule of all — modus ponens:

    A,  A ⇒ B
    ---------
        B

This states that, if A is true and A ⇒ B is true, then we can infer that B must be true.

An inference rule with no hypotheses is called an axiom, and is often just written as the conclusion with no line. For example, the following is an axiom of propositional logic:

    A ⇒ (B ⇒ A)

That is, this proposition is always true. This could be written as:

        true
    -----------
    A ⇒ (B ⇒ A)

but the line and premise are often omitted for readability.

In Hoare logic, a hypothesis can be either a predicate or a Hoare triple, and the conclusion is always a Hoare triple.

7.3.2 An introductory example — Incrementing an integer

Consider the following SPARK program used to increment a variable:

procedure Inc (X: in out Integer)
  with
    Pre  => (X >= 0),
    Post => (X > 0);
is begin
  X := X + 1;
end Inc;

Can we prove that this program establishes its postcondition for all X that satisfy the precondition? It may seem quite easy for us to reason about this in our head to convince ourselves that the program is correct; however, how could we go about automating or mechanising this?

7.4 The rules of Hoare logic

In this section, we present the fundamental rules of Hoare logic, and present several examples.

7.4.1 Assignment axiom

The assignment axiom is specified as a triple:

    {P[E/x]} x := E {P}

where P[E/x] represents substituting all occurrences of the variable x with expression E in P. Note that this rule has no hypotheses, just a conclusion, so it is an axiom rather than an inference rule. This rule says that, if we want to prove that P is true after executing the statement x := E, then we need to prove that P[E/x] is true before executing x := E.
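The substitution P[E/x] can be mechanised. The following naive Java sketch substitutes over the text of a predicate (real verification tools substitute over parse trees rather than strings; the method name is illustrative):

```java
public class WeakestPre {
    // wp(x := e, post) = post[e/x]: replace whole-word occurrences of
    // the assigned variable with the (parenthesised) expression.
    // A naive textual sketch only; it assumes 'var' is a plain
    // identifier with no regex metacharacters.
    static String wpAssign(String var, String expr, String post) {
        return post.replaceAll("\\b" + var + "\\b", "(" + expr + ")");
    }

    public static void main(String[] args) {
        // The Inc example: {?} x := x + 1 {x > 0}
        System.out.println(wpAssign("x", "x + 1", "x > 0"));
        // prints "(x + 1) > 0", the weakest precondition
    }
}
```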
When using this rule in a program proof, we derive P[E/x] from the program and the postcondition. P[E/x] is called the weakest precondition of S that establishes P. That is, it is the weakest, or most general, predicate for which program S establishes the postcondition P.

Example 7.2
The following are two simple Hoare triples that are true:

    {y = 0} x := y {x = 0}
    {5 + y ≥ 0} x := 5 {x + y ≥ 0}

In both triples above, the variable x has been substituted by the assigned expression to form the precondition. □

Example 7.3
So, using the assignment axiom, can we prove that our program Inc establishes its postcondition? To do this, we need to prove the following:

    {x ≥ 0} x := x + 1 {x > 0}

From Hoare's assignment axiom, we derive the weakest precondition of the program x := x + 1 from the postcondition x > 0. If we substitute x + 1 for all occurrences of x in x > 0, we are left with x + 1 > 0, and the resulting proof obligation we need to show is:

    {x + 1 > 0} x := x + 1 {x > 0}

However, the precondition of this statement is not the same (syntactically) as the precondition of the original program. To finish the proof, we need to introduce another rule, which we introduce now. □

7.4.2 Consequence rule

Hoare's consequence rule is as follows:

    P′ ⇒ P,  Q ⇒ Q′,  {P} S {Q}
    ---------------------------
           {P′} S {Q′}

This states that, if we can establish that {P} S {Q} is true, then any precondition that is at least as strong as P, and any postcondition that is at least as weak as Q, are also valid preconditions/postconditions for the program S.

Example 7.4
Some examples:

    {x ≥ 0} x := x + 1 {x > 0}              (because x ≥ 0 ⇒ x + 1 > 0)
    {5 + y ≥ 0 ∧ z = 0} x := 5 {x + y ≥ 0}  (a strengthened precondition)
    {5 + y ≥ 0} x := 5 {true}               (a weakened postcondition)

□

Proving assignment statements

To prove that an assignment statement establishes its postcondition for some precondition other than the weakest precondition, we need to prove that Q ⇒ P[E/x], where Q is the precondition.
For example, to prove the following triple:

    {true} x := 5 {x ≥ 5}

apply the assignment axiom: {5 ≥ 5} x := 5 {x ≥ 5}, and then prove true ⇒ 5 ≥ 5, which is trivially true.

Example 7.5
Let's return to our example of the Inc procedure. We need to prove the following:

    {x ≥ 0} x := x + 1 {x > 0}

Applying the assignment axiom, we get:

    {x + 1 > 0} x := x + 1 {x > 0}

Now, we need to prove that the precondition of the original program implies the weakest precondition above. That is, we know that the program establishes its postcondition if the precondition is x + 1 > 0, but does it do so if the precondition is x ≥ 0? We need to prove:

    x ≥ 0 ⇒ x + 1 > 0

which holds trivially: if x itself is greater than or equal to 0, then x + 1 must be strictly greater than 0. Therefore, we have proved that our program establishes its postcondition when the precondition holds. □

7.4.3 Sequential composition rule

Consider the following SPARK program for swapping two numbers:

procedure Swap (X, Y : in out Float)
  with
    Post => (X = Y'Old and Y = X'Old);
is
  T : Float;
begin
  T := X;
  X := Y;
  Y := T;
end Swap;

Using Hoare logic, we need to prove the triple:

    {true} t := x; x := y; y := t {x = Y ∧ y = X}

Auxiliary variables

Before we tackle this proof, consider using a programming language that allows simultaneous assignment; e.g. x, y := y, x, which means that x gets the value of y and y gets the value of x simultaneously, eliminating our need for the "temp" variable. If our swap program is implemented using that, we have to prove the triple:

    {true} x, y := y, x {x = Y ∧ y = X}

in which X and Y are the values of x and y before the program is executed — analogous to X'Old in SPARK. Applying the assignment axiom:

    {y = Y ∧ x = X} x, y := y, x {x = Y ∧ y = X}

How do we prove true ⇒ y = Y ∧ x = X? The trick here is that we don't have to! The variables X and Y cannot occur in the program: they are auxiliary variables, and represent the value of the corresponding variable before the program is executed.
All free variables in a program have an auxiliary variable that specifies their pre-execution value. At any point in a program, x represents the current value of the variable x, where "current" means immediately after the previous statement is executed, and X represents the pre-execution value. Therefore, at the start of the program, before any statement has been executed, it must be that x = X. That is, a variable is always equal to its pre-execution value before any part of the program has executed. As a result, x = X is an implicit part of every precondition, so we do not need to prove propositions of the form x = X in the precondition — they always hold.

Reasoning about swap using the assignment axiom — incorrectly

Now, let's return to the original definition of the swap program. The assignment axiom on its own is not sufficient to prove this program correct. If we simply substitute in the assignment expressions for the variables, we end up with the following:

    {y = Y ∧ t = X} t := x; x := y; y := t {x = Y ∧ y = X}

Here, we cannot prove t = X. And in fact, this does not need to be true for the program to establish its postcondition. The problem here is that we attempted to apply the assignment rule to a sequence of (non-simultaneous) assignment statements, which does not work. Instead, we have to "work backwards" through the sequential composition used in the program.

Sequential composition rule

To solve this problem, we use Hoare's sequential composition rule:

    {P} S1 {R},  {R} S2 {Q}
    -----------------------
        {P} S1; S2 {Q}

This states that, to prove that the program S1; S2 will achieve the postcondition Q from precondition P, we need to prove that the subprogram S1 can get to some intermediate state R, and from R, subprogram S2 will establish Q. But how do we determine what R should be? The weakest precondition of S2!

Example 7.6
Finally, we can correctly reason about the "swap" program. First, we apply the sequential composition rule:

    {true}
        t := x; x := y
    {R}
        y := t
    {x = Y ∧ y = X}

Note here that we are concatenating two Hoare triples as shorthand, such that {P} S1 {R}, {R} S2 {Q} is equivalent to {P} S1 {R} S2 {Q}. Then, we find the weakest precondition, R, using the assignment axiom:

    {true}
        t := x; x := y
    {x = Y ∧ t = X}
        y := t
    {x = Y ∧ y = X}

And then apply the sequential composition rule to the top triple:

    {true}
        t := x
    {y = Y ∧ t = X}
        x := y
    {x = Y ∧ t = X}
        y := t
    {x = Y ∧ y = X}

The bottom two triples are trivially true from the assignment axiom, so we need to just prove the top one. Again, we apply the assignment axiom to get:

    {y = Y ∧ x = X} t := x {y = Y ∧ t = X}

The above holds trivially from the assignment axiom, so all that is left is to prove that the weakest precondition above is implied by the precondition (true) of the swap program (the consequence rule). For this, we need to prove:

    true ⇒ y = Y ∧ x = X

which holds trivially. In fact, the proof is equivalent to proving true ⇒ true, because both y = Y and x = X hold trivially at the start of the program. The three triples are all true, so our program is now proved to establish its postcondition. □

7.4.4 Empty statement axiom

The empty statement axiom asserts that the skip statement changes nothing, so whatever holds before skip is executed will hold afterwards:

    {P} skip {P}

This axiom is trivial, but is important for completeness of the axioms.

7.4.5 Conditional rule

The next rule we introduce is for proving programs containing conditional statements. We assume that if-then-else statements are the only conditionals, but other conditionals are just shorthand for if-then-else, and even then, the reasoning rules are much the same.
The conditional rule is:

  {P ∧ B} S1 {Q}    {P ∧ ¬B} S2 {Q}
  ----------------------------------
  {P} if B then S1 else S2 endif {Q}

This states that, to prove that the program if B then S1 else S2 endif establishes the postcondition Q from precondition P, we need to prove that each branch establishes the same. However, we assume that B holds for the true branch, and that ¬B holds for the false branch.

Example 7.7 Conditional swapping

To illustrate this, consider our swap program again, except that it only swaps the values of x and y if x < y. In other words, it establishes the postcondition x ≥ y:

{true}
if x < y then
  t := x; x := y; y := t
else
  skip
end if
{x ≥ y}

As a side note, the statement if B then S else skip endif is equivalent to just if B then S endif (that is, removing the else branch — a one-armed branch statement), so there is no need for a rule covering this case.

For our conditional swap program, we must prove the following two triples:

{x < y} t := x; x := y; y := t {x ≥ y}
{x ≥ y} skip {x ≥ y}

Going backwards through the first triple, we can reason as follows:

{y ≥ x} t := x {y ≥ t} x := y {x ≥ t} y := t {x ≥ y}

All that remains is to apply the consequence rule to show that the weakest precondition is implied by the precondition of the first Hoare triple; that is, x < y ⇒ y ≥ x. This holds trivially; therefore, the first triple is proved. What remains is to prove the second triple: {x ≥ y} skip {x ≥ y}, which holds trivially from the empty statement axiom. We have proved that both branches of the statement establish the postcondition, and therefore the entire program establishes its postcondition. □

7.4.6 Iteration rule

Now we get to the most difficult-to-apply rule in Hoare logic: the iteration rule. As its name says, it is used to reason about iteration in programs. As with conditionals, we assume only one basic iteration operator, the while loop; other types of loops, such as for and do-while loops, are straightforward extensions of this.
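The conditional swap can likewise be checked by running it. A Python sketch (ours) that executes both branches over a grid of initial values and asserts the postcondition x ≥ y on every path:

```python
# Executes: if x < y then t := x; x := y; y := t else skip endif
def conditional_swap(x, y):
    if x < y:
        t = x
        x = y
        y = t
    # else: skip — the empty statement changes nothing
    return x, y

# The postcondition x >= y holds on both the "swap" and the "skip" paths.
for X in range(-5, 6):
    for Y in range(-5, 6):
        x, y = conditional_swap(X, Y)
        assert x >= y
```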
The iteration rule is:

       {P ∧ B} S {P}
  ------------------------------
  {P} while B do S done {¬B ∧ P}

In this rule, the predicate P is the loop invariant. This rule states that if the loop invariant holds before the loop body S executes, and (assuming B holds on entry to the body) holds again after S executes, then the while loop as a whole preserves the loop invariant. In addition, the while loop will establish the condition ¬B — the branch condition B must be false for the loop to exit. Therefore, a loop invariant is a property that holds before the loop, at the end of each iteration, and after the loop.

Reasoning about loops

To reason about a while loop, we must actually calculate the loop invariant. It turns out that this is by far the hardest part of applying Hoare logic. The loop body changes the variables of the program. Our job is to find a relationship between the values of the variables that is preserved by the execution of the loop body. That is, the individual values of the variables may change, but the relationship between them does not.

If we want to prove a postcondition that is different to the loop invariant, e.g.:

{P} while B do S done {Q}

we use the iteration rule, and then prove that (¬B ∧ P) ⇒ Q using the consequence rule.

Example 7.8 To demonstrate the rule as well as the concept of loop invariants, we’ll consider the following SPARK program for calculating the factorial of a number:

procedure Factorial (N : in Integer;
                     F : out Integer)
  with
    Pre  => (N >= 0),
    Post => (F = Fact(N))
is
   I : Integer;
begin
   F := 1;
   I := 0;
   while I /= N loop
      I := I + 1;
      F := F * I;
   end loop;
end Factorial;

In this procedure, the expression Fact(N) refers to the mathematical factorial function. To prove this program establishes its postcondition under its precondition, we need to prove the following triple:

{n ≥ 0}
f := 1;
i := 0;
while i ≠ n do
  i := i + 1;
  f := f ∗ i;
end while
{f = n!}

where n! is defined as the factorial function:

0! = 1
n! = n × (n − 1)!
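Before working through the formal proof, the invariant can be checked empirically. A Python sketch (ours; math.factorial stands in for the Fact function of the contract) that asserts the invariant f = i! at exactly the points the iteration rule requires:

```python
import math

def factorial(n):
    assert n >= 0                      # precondition {n >= 0}
    f = 1
    i = 0
    assert f == math.factorial(i)      # invariant f = i! holds before the loop
    while i != n:
        i = i + 1
        f = f * i
        assert f == math.factorial(i)  # invariant re-established each iteration
    assert f == math.factorial(n)      # postcondition {f = n!}, using i = n on exit
    return f

for n in range(8):
    factorial(n)   # no assertion fires for any of these inputs
```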
where n > 0.

First, we need to establish the loop invariant.

The factorial loop invariant

The loop invariant for the factorial program is:

f = i!

That is, f is the factorial of i before the loop, and this is re-established after each iteration, including the final iteration. In this program, as with others, the loop invariant is generally based on the postcondition that the loop needs to establish.

Back to the proof

The first step is to apply the sequential composition rule to split the program into its constituent parts:

1: {n ≥ 0}
2: f := 1;
3: i := 0;
4: {R}                  ← new annotation (from composition rule)
5: while i ≠ n do
6:   i := i + 1;
7:   f := f ∗ i;
8: end while
9: {f = n!}

To find R, we need to calculate the weakest precondition of the while loop using the iteration rule. From the iteration rule, we know that the final result will be ¬B ∧ P — the negation of the branch condition (the loop has exited) and the loop invariant. We annotate our proof such that after the loop, the branch condition is false and the invariant holds:

1: {n ≥ 0}
2: f := 1;
3: i := 0;
4: {R}
5: while i ≠ n do
6:   i := i + 1;
7:   f := f ∗ i;
8: end while
9: {i = n ∧ f = i!}     ← new annotation
10: {f = n!}

We need to prove that this new annotation, (f = i! ∧ i = n), implies the postcondition. That is, we need to show: (f = i! ∧ i = n) ⇒ f = n!. We can substitute n for i in the premise using the one-point rule, leaving f = n! ⇒ f = n!, which is true. Essentially, we can now eliminate the postcondition from the proof (although it will remain for completeness in these notes).

The other thing we know from the iteration rule is that the invariant must hold at both the start and the end of the loop body. Also, at the start of the loop body, the Boolean expression in the branch must be true. So, we annotate the start and end of the loop body with the invariant, and the start of the loop body additionally with the branch condition:

1: {n ≥ 0}
2: f := 1;
3: i := 0;
4: {R}
5: while i ≠ n do
6:   {f = i! ∧ i ≠ n}   ← new annotation (loop invariant and branch condition)
7:   i := i + 1;
8:   f := f ∗ i;
9:   {f = i!}           ← new annotation (loop invariant)
10: end while
11: {f = i! ∧ i = n}
12: {f = n!}

We have already established that the annotation at line 11 implies the postcondition at line 12. The next thing to prove is that the loop body preserves the loop invariant. This involves proving the following Hoare triple:

{f = i! ∧ i ≠ n}
i := i + 1;
f := f ∗ i;
{f = i!}

We apply the sequential composition rule and the assignment axiom to establish the weakest precondition of the sequential composition:

1: {f = i! ∧ i ≠ n}
2: {f × (i + 1) = (i + 1)!}   ← new annotation (weakest precondition of i := i + 1)
3: i := i + 1;
4: {f ∗ i = i!}               ← new annotation (weakest precondition of f := f ∗ i)
5: f := f ∗ i;
6: {f = i!}

We now need to prove that the loop invariant and the branch condition establish the weakest precondition of the loop body. For the factorial program, this means we must prove the following:

(f = i! ∧ i ≠ n) ⇒ f × (i + 1) = (i + 1)!

From the definition of the ! operator, we know that (i + 1)! = (i + 1) × i!. If we substitute this into the right-hand side of the implication, we get:

(f = i! ∧ i ≠ n) ⇒ f × (i + 1) = (i + 1) × i!

Dividing both sides of this equation by i + 1 leaves us with:

(f = i! ∧ i ≠ n) ⇒ f = i!

which holds. Therefore, we have established that the loop preserves the loop invariant. Further, we have successfully applied the iteration rule, which means that we now have its conclusion: {P} while B do S done {¬B ∧ P}. This means that we have established that the intermediate predicate R can be replaced with just P, leaving our proof in the following state:

1: {n ≥ 0}
2: f := 1;
3: i := 0;
4: {f = i!}
5: while i ≠ n do
6:   {f = i! ∧ i ≠ n}
7:   {f × (i + 1) = (i + 1)!}
8:   i := i + 1;
9:   {f ∗ i = i!}
10:  f := f ∗ i;
11:  {f = i!}
12: end while
13: {f = i! ∧ i = n}
14: {f = n!}

The iteration rule has been successfully applied, completing the proof of the second premise of the sequential composition rule. What remains is to prove the first part of the sequential composition:

{n ≥ 0}
f := 1;
i := 0;
{f = i!}

That is, we need to prove that the first part of the program (the assignment of initial values to the variables) establishes the loop invariant. We do this by applying the sequential composition rule, and two applications of the assignment axiom:

{n ≥ 0}
{1 = 0!}
f := 1;
{f = 0!}
i := 0;
{f = i!}

To complete the proof, we need to show that the precondition implies the weakest precondition above:

n ≥ 0 ⇒ 1 = 0!

1 = 0! is true from the definition of factorial, so this holds, which completes our proof of the first premise of the sequential composition rule, meaning that the entire sequential composition (the variable initialisation composed with the while loop — the factorial program) establishes its postcondition whenever its precondition is true.

A re-cap

To recap the proof of the factorial program, we did the following (in reverse order):

1. We proved that the loop invariant was established just prior to executing the loop; that is, the initial variable assignment established the loop invariant.

2. We proved that the loop maintains the loop invariant; that is, if the loop invariant holds at the start of the loop body, it holds again at the end of the loop body.

3. Using our loop invariant, we proved that the loop establishes the postcondition f = n! after the loop finishes executing.

These are the necessary conditions to prove the factorial program using the iteration rule, so this proves that the factorial program establishes its postcondition. □

Loops and termination

The iteration rule presented in this section is not valid for all while loops: it does not consider the case in which the loop does not terminate.
Example 7.9 Consider the following non-terminating while loop, with precondition and postcondition:

{y ≥ 0}
while y ≥ 0 do
  x := 0;
end while
{y ≥ 0}

Applying the iteration rule, we can prove the hypothesis: {y ≥ 0} x := 0 {y ≥ 0}, which allows us to conclude the following:

{y ≥ 0}
while y ≥ 0 do
  x := 0;
end while
{y ≥ 0 ∧ ¬(y ≥ 0)}

The concluded postcondition is a contradiction. Not only does our loop achieve no postcondition because it never terminates; even if it did terminate, it could not possibly achieve the postcondition y ≥ 0 ∧ ¬(y ≥ 0). □

This means that the iteration rule only allows us to prove partial correctness. The correctness is partial because we have not proved that the program terminates, and thus it may not even achieve its postcondition. What we have proved is that, if the program does terminate, then it will satisfy the postcondition. The solution is a modified rule that includes a proof that the loop terminates. The modified rule gives us the ability to prove total correctness. We can define total correctness as:

total correctness := partial correctness + termination

We will not treat total correctness in full in these notes; however, for the interested reader, the rule is as follows:

  [P ∧ B ∧ E = n] S [P ∧ E < n]    P ∧ B ⇒ E ≥ 0
  ----------------------------------------------
       [P] while B do S done [¬B ∧ P]

Note that the Hoare triples are now expressed with square brackets instead of curly brackets — this denotes a rule of total correctness. In the above rule, the expression E refers to a loop variant. The idea behind the rule is to demonstrate that the loop variant E decreases on each iteration of the loop while remaining non-negative, implying that the loop must terminate. We will not consider total correctness further in these notes. However, the distinction between partial and total correctness is an important one.

7.4.7 Establishing loop invariants

Establishing loop invariants automatically is an open and challenging research question.
This is good news for researchers in the area, but bad news for those of us who want to prove programs correct.

Loop invariants are important because knowing the loop invariant provides you with a stronger property to work with when showing that the loop establishes the required postcondition. There are three properties that we require of a loop invariant, I:

1. It should hold before the loop executes. That is, if the annotation before the loop is P, then P ⇒ I. The invariant must be weaker than the precondition.

2. With the negated branch condition, ¬B, it should establish the result that we want. That is, if the result we want is Q, then I ∧ ¬B ⇒ Q. The invariant (together with ¬B) must be stronger than the postcondition.

3. With the branch condition, B, the body must re-establish the invariant. That is, {B ∧ I} S {I}.

Heuristics for finding loop invariants

The above properties are not enough to find loop invariants. While finding loop invariants requires some creativity, the following heuristics can help us to infer them:

1. Loop invariants generally contain most of the variables from the loop condition, loop body, precondition, and postcondition (where we mean the precondition/postcondition of the loop, not necessarily the entire program).

2. They generally state a relationship between the variables.

3. They generally contain a predicate that is closely related to the postcondition, but is not the same.

4. They hold even if the loop condition is false.

Example 7.10 In the factorial program, we have the following:

1. Initially, f = 1 and i = 0.

2. After termination, we want f = n! and i = n.

3. On each iteration, i and f are both increased.

From this, we can infer that a good loop invariant is f = i!.
□

Example 7.11 As an example of loop invariants and the iteration rule, let’s consider the following program that sums the values in an array of length N:

procedure Summation (N : in Integer; A : in IntArray; Sum : out Integer)
  with
    Pre  => (N >= 0),
    Post => (Sum = ∑_{J=0}^{N−1} A(J))
is
begin
   I := 0;
   Sum := 0;
   while I /= N loop
      Sum := Sum + A(I);
      I := I + 1;
   end loop;
end Summation;

What is the loop invariant for this program? We can eyeball the program to see that the invariant will need to be consistent with the initial state i = 0 ∧ sum = 0, and with the postcondition sum = ∑_{j=0}^{N−1} a[j]. What we need to look for is a predicate that is close to the postcondition, but describes a relationship between sum and i. We can use what we know about the program to help us. The loop body is summing up elements from an array, adding each element, a[i], one at a time, and using the variable sum to record the result. We know that i is 0 initially and is incremented until it reaches N. The loop exits when i = N, so a good guess is to replace N with i in the postcondition to get:

sum = ∑_{j=0}^{i−1} a[j]

This becomes our loop invariant. Now, we can annotate our program with the loop invariant in the right places (before/after the loop, and before/after the loop body), and work backwards to get the following annotations:

{N ≥ 0}
{0 = ∑_{j=0}^{−1} a[j]}
i := 0;
{0 = ∑_{j=0}^{i−1} a[j]}
sum := 0;
{sum = ∑_{j=0}^{i−1} a[j]}
while i ≠ N do
  {i ≠ N ∧ sum = ∑_{j=0}^{i−1} a[j]}
  {sum + a[i] = ∑_{j=0}^{i} a[j]}
  sum := sum + a[i];
  {sum = ∑_{j=0}^{i} a[j]}
  i := i + 1;
  {sum = ∑_{j=0}^{i−1} a[j]}
end while
{i = N ∧ sum = ∑_{j=0}^{i−1} a[j]}
{sum = ∑_{j=0}^{N−1} a[j]}

Using these, we need to prove:

1. That the precondition entails 0 = ∑_{j=0}^{−1} a[j] (the consequence rule — strengthening the precondition): the predicate 0 = ∑_{j=0}^{−1} a[j] holds from the definition of ∑, because we are summing a list from index 0 to index −1 (an empty list), and the sum of an empty list is 0.

2.
That the loop invariant and the branch condition establish the weakest precondition of the loop body: i ≠ N ∧ sum = ∑_{j=0}^{i−1} a[j] ⇒ sum + a[i] = ∑_{j=0}^{i} a[j], which holds by reasoning that if sum is the sum of the array from 0 to i − 1, then the sum of the array from 0 to i must be sum + a[i].

3. That the invariant and the negation of the loop condition establish the postcondition: i = N ∧ sum = ∑_{j=0}^{i−1} a[j] ⇒ sum = ∑_{j=0}^{N−1} a[j], which holds by using the one-point rule on i = N and substituting N for the value of i in the premise.

Proving these three conditions means that our summation algorithm conforms to its contract. □

7.4.8 Array assignment rule

Our simple programming language permits arrays. The values of array variables and their indices can be treated as any other expression. That is, a[1] or a[x] are just expressions. However, array assignment must be treated separately from variable assignment. An array assignment is of the form:

a[NE] := E

in which a is an array variable, NE is a numerical expression, and E is an expression. For example, a[x] := a[x + 1] assigns the value of the (x + 1)-th element of a to the index x.

A first naïve attempt at an array assignment axiom would look similar to the assignment axiom:

{P[E/a[NE]]} a[NE] := E {P}

That is, replace all occurrences of a[NE] with E. However, this does not take into account that NE may be a variable, and there may be other expressions with the same value as NE that reference a in the precondition or postcondition. This is known as the aliasing problem.

Example 7.12 Consider the following counterexample for the rule above:

{x = y ∧ a[x] = 0} a[y] := 5 {a[x] = 0}

Because a[x] is not referenced in the program, we would assume that its value has not changed. However, this is not the case, because x = y, and therefore a[x] should be 5 after the assignment. □

The solution is to treat array assignment as an ordinary assignment statement, except that we consider the array in its entirety.
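This whole-array view can be sketched concretely in Python (our illustration, not the notes’ notation): an applicative update function plays the role of the array override, and the counterexample of Example 7.12 shows up directly.

```python
def update(a, i, v):
    """The array a{i -> v}: identical to a except that index i holds v."""
    b = list(a)
    b[i] = v
    return b

a = [0, 0, 0]
# The overridden index takes the new value; other indices are unchanged:
assert update(a, 1, 5)[1] == 5       # a{NE -> E}[NE] = E
assert update(a, 1, 5)[0] == a[0]    # a{NE1 -> E}[NE2] = a[NE2] when NE1 /= NE2

# The aliasing counterexample of Example 7.12: with x = y, the naive claim
# that a[x] is unchanged by a[y] := 5 is false.
x = y = 2
assert a[x] == 0
a2 = update(a, y, 5)                 # models the assignment a[y] := 5
assert a2[x] == 5                    # a[x] is now 5, not 0
```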
That is, we treat the assignment:

a[NE] := E

as the assignment:

a := a{NE → E}

in which a{NE → E} represents the array that is identical to a, except that the NE-th component has the value E — the same as array updates in the SPARK annotation language, discussed in Section 6.3.1. Therefore, the array assignment rule is simply:

{P[a{NE → E}/a]} a[NE] := E {P}

which is the same as the assignment rule. To reason fully about arrays, we need to provide axioms for the notation a{NE → E}. The following two axioms are the array axioms in Hoare logic:

a{NE → E}[NE] = E
a{NE1 → E}[NE2] = a[NE2]   where NE1 ≠ NE2

That is, the new value at the overridden index is E, and the values remain the same at all other indices.

Example 7.13 As a demonstration of applying this rule, we use the counterexample to the naïve attempt, which also tests the validity of the new rule. Under the naïve rule, the following Hoare triple is valid:

{x = y ∧ a[x] = 0} a[y] := 5 {a[x] = 0}

Using the updated array assignment axiom, we prove that the program does not achieve the postcondition a[x] = 0. Applying the array assignment axiom and working backwards, we get the triple:

{a{y → 5}[x] = 0} a[y] := 5 {a[x] = 0}

Now, we need to prove that the original precondition implies the weakest precondition above:

x = y ∧ a[x] = 0 ⇒ a{y → 5}[x] = 0

From the first array axiom and the premise x = y, we can infer that a{y → 5}[x] = a{y → 5}[y] = 5, so the conclusion requires 5 = 0, which is a contradiction. The proof fails, so the program does not establish its postcondition. □

7.4.9 Procedure call rule

In the first instance, we will consider only calls to non-recursive procedures — with no recursion, either direct or indirect. The rule for this is straightforward, but comes with some restrictions. The procedure call rule is:

  {P} S {Q}    where p(v1, . . . , vn) =̂ S
  ----------------------------------------------------------
  {P[E1/v1, . . . , En/vn]} p(E1, . . . , En) {Q[E1/v1, . . . , En/vn]}

This states that, if the body S of a procedure p establishes a postcondition Q under the precondition P, then a call to that procedure will establish the same postcondition, but with all parameters replaced by the supplied arguments. To prove a Hoare triple containing a procedure call, we first prove the correctness of the procedure body in terms of its formal parameters, and then replace the formal parameters with the actual arguments in the precondition and postcondition.

Unfortunately, this rule is not valid for all procedure calls. If our language only had procedures with no parameters, the simplified version of this rule (with no renaming) would be sufficient. However, the rule is not valid for cases in which the same variable is used more than once as an argument to the procedure. The following example, taken from Hoare [5], demonstrates a counterexample to the rule.

Example 7.14 Assume:

p(x, y) =̂ x := y + 1                                            (1)
{y + 1 = y + 1} x := y + 1 {x = y + 1}      [Assignment axiom]   (2)
true ⇒ y + 1 = y + 1                        [Logical theorem]    (3)
From 2, 3: {true} x := y + 1 {x = y + 1}    [Consequence rule]   (4)
From 1, 4: {true} p(x, y) {x = y + 1}       [Procedure rule]     (5)
From 5: {true} p(z, z) {z = z + 1}          [Procedure rule]     (6)

Clearly, the final line (6) contains a contradiction, because z = z + 1 is not satisfiable, showing that the procedure rule is not sound for all programs. □

Parameters and arguments

In many languages, procedures take two types of parameters: value parameters, which are values passed to the procedure that cannot be changed; and value-result parameters, which are variables passed to the procedure that can be changed. In Ada, these are labelled using the keywords in and out. From the counterexample above, it is clear that certain combinations of parameters and arguments can cause problems in some programs with respect to Hoare logic. To ensure that the procedure call rule is valid, the set of programs accepted by Hoare logic is restricted to the following:

1.
The number of arguments in a procedure call must be equal to the number of parameters in the procedure.

2. The formal parameter names must be distinct from each other.

3. The arguments supplied for value-result parameters must be distinct from each other. It is this rule that prevents the problem outlined in the counterexample above — the procedure call p(z, z) would be illegal.

4. Value parameters cannot be changed in the procedure body.

5. No recursion.

In languages such as SPARK, these rules apply, and the SPARK toolset will raise errors when typechecking programs that violate them.

Example 7.15 As an example, consider the factorial program from Example 7.8. In that example, we established that the factorial program conforms to its contract. Now, we want to prove the following call to the FACTORIAL procedure:

{N = 10} FACTORIAL(10, F) {F = 10!}

We have already proved the triple {N ≥ 0} FACTORIAL(N, F) {F = N!}, which is the hypothesis of the procedure call rule. So, applying the procedure call rule, we substitute 10 for N to get:

{10 ≥ 0} FACTORIAL(10, F) {F = 10!}

This gives us the desired postcondition, but not yet the desired precondition. We use the consequence rule to show that the precondition of our program implies the weakest precondition above: N = 10 ⇒ 10 ≥ 0. This is trivially true because 10 ≥ 0 is true. □

Recursive procedures — the induction rule

Reasoning about recursive procedures is not possible using the procedure call rule above. This is because the procedure call rule “unfolds” a procedure call and proves its body. If the procedure is recursive, then the body contains the same procedure call, so the unfolding will continue ad infinitum. Reasoning about recursive procedures is not so difficult, although it is not straightforward to automate. As it turns out, the solution is simple yet extremely powerful: in the proof of the procedure body, we assume that recursive calls to the procedure already satisfy the contract.
Formally, this can be stated:

  {P} p() {Q} ⊢ {P} S {Q}    where p() =̂ S
  -----------------------------------------
            {P} p() {Q}

in which we have omitted parameters to improve readability. In the above, the symbol ⊢ means “entails” or “proves”; P ⊢ Q means that Q is provable if we assume that P holds. This rule is called the induction rule. It says: if we can prove that the procedure body satisfies the precondition and postcondition under the assumption that all recursive calls to p also do so, then p satisfies the postcondition under the precondition.

How can this be?! Assuming the conclusion as part of the proof of the procedure body seems strange, but the following example provides a concrete case in which it holds.

Example 7.16 The following program shows a recursive procedure:

procedure p(in x, out y)
  if x ≤ 5 then
    y := 1
  else
    x := x − 1;
    p(x, y)
  end if
end procedure

It is easy to see that if x > 5, it will be decremented until it reaches 5, and the program will terminate. The program always terminates with y = 1, so the fact that the second branch contains a recursive call is inconsequential: all that matters are the terminating states. □

Recursive procedure calls can never be terminating statements themselves: the statements within the call must be executed. The terminating statements in the procedure are the only statements with end states, so we need only prove that these statements establish the postcondition. For the recursive calls, we assume that the triple holds for the purposes of establishing the proof. This is analogous to the use of induction in logic: we prove the base case, and then the inductive case. In the recursive procedure rule above, we offer only the inductive step, because the base case corresponds to the zero-th unfolding of the procedure, which is achieved by default: the zero-th unfolding is not executing the procedure at all, which results in a terminating state of false, and false implies any postcondition.
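The behaviour claimed for Example 7.16 is easy to confirm by execution. A Python sketch (ours) of the procedure, with the out parameter y modelled as the return value:

```python
def p(x):
    # Base case: the only terminating statement assigns y := 1.
    if x <= 5:
        return 1
    # Recursive case: decrement x and recurse. By the induction rule we may
    # assume the recursive call already satisfies the postcondition y = 1.
    return p(x - 1)

# The postcondition y = 1 holds for every input, as the example argues.
for x in range(0, 30):
    assert p(x) == 1
```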
7.4.10 Other rules

Many extensions of the basic Hoare logic have been proposed, including extensions that consider pointers and references, concurrency, goto statements, object-oriented programs, and real-time programs. However, the basic Hoare logic here is sufficient to reason about a wide range of programs — including all programs written in the SPARK language.

7.5 Mechanising Hoare logic

It is clear that proofs of Hoare triples are long and tedious, even for small programs like the factorial program, with many small details that need to be done correctly. There are many opportunities to make mistakes, which may lead us to “prove” that a correct program is incorrect, or even worse, that an incorrect program is correct. However, with the exception of finding loop invariants, many of the steps can be mechanised, improving both the speed at which we can prove programs and the correctness of the proofs themselves.

The procedure for mechanising proofs is straightforward, and it mimics how we tackle the proof manually. If we consider that the input to the process is a Hoare triple, then the following steps are undertaken:

1. Annotate the program with the necessary predicates; for example, the annotations between sequentially composed statements, and the loop invariants. Many of these can be derived automatically using the idea of weakest preconditions, but loop invariants are difficult. This is an active research area, even here at the University of Melbourne in the Department of Computing and Information Systems.

2. Use a program to generate verification conditions. The verification conditions can be generated automatically by applying the Hoare logic rules. The parts of the Hoare logic proofs that remain are the verification conditions; for example, that a program precondition implies the weakest precondition.

3. Prove as many verification conditions as possible using a theorem prover.

4.
Leave the remainder of the verification conditions to the user, who can try to prove them manually or, more likely, with a proof assistant tool.

5. If all verification conditions can be proved, the program is correct. If at least one is proved incorrect, the program is incorrect. If some can be neither proved nor disproved, we cannot be certain of the program’s correctness.

In workshops, we will look at the SPARK tools, which use logics quite similar to Hoare logic, and processes such as the one described above, to prove that SPARK programs conform to their contracts.

7.6 Dijkstra’s weakest precondition calculus

In this section, we briefly present a denotational semantics of programming languages, known as Dijkstra’s weakest precondition calculus, or Dijkstra’s predicate transformer semantics [2]. Unlike Hoare logic, which uses axioms to define the semantics of a programming language, the weakest precondition calculus assigns to each program a corresponding predicate transformer. These transformers are functions that map a postcondition of a program statement to the weakest precondition that achieves that postcondition. We have already seen weakest preconditions in Hoare logic, but the concept has first-class importance in Dijkstra’s definitions.

Definition 7.1 (Weakest precondition) Given a statement S and a postcondition R, the weakest precondition of S under R, written wp(S, R), is the weakest predicate that, as a precondition, establishes R if S is executed and terminates. □

Thus, the following holds: {wp(S, Q)} S {Q}. Dijkstra’s weakest precondition calculus is a set of predicate transformers for determining the weakest precondition of a program given a postcondition.
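This view of statements as predicate transformers can be made executable. In the following Python sketch (ours, not Dijkstra's notation), predicates are Boolean functions over a state dictionary, each loop-free statement is a function mapping a postcondition to its weakest precondition, and the swap program's computed wp is checked against sample states:

```python
# Each statement is a predicate transformer: a function from a postcondition
# (state -> bool) to the weakest precondition (state -> bool).

def skip():
    return lambda post: post                      # wp(skip, R) = R

def assign(var, expr):                            # wp(x := E, R) = R[E/x]
    return lambda post: (lambda s: post({**s, var: expr(s)}))

def seq(t1, t2):                                  # wp(S1; S2, R) = wp(S1, wp(S2, R))
    return lambda post: t1(t2(post))

def cond(b, t1, t2):                              # wp of if B then S1 else S2 endif
    return lambda post: (lambda s: t1(post)(s) if b(s) else t2(post)(s))

# The swap program: t := x; x := y; y := t
swap = seq(assign('t', lambda s: s['x']),
           seq(assign('x', lambda s: s['y']),
               assign('y', lambda s: s['t'])))

post = lambda s: s['x'] == s['Y'] and s['y'] == s['X']
pre = swap(post)   # the computed weakest precondition

# Any state in which x and y hold their pre-execution values X and Y
# satisfies the computed wp, so {true} swap {x = Y /\ y = X} holds.
for X in range(-3, 4):
    for Y in range(-3, 4):
        assert pre({'x': X, 'y': Y, 't': 0, 'X': X, 'Y': Y})
```

The while-loop transformer is omitted here, since it cannot be expressed as a finite function in this style.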
7.6.1 Transformers

The transformers are similar to the Hoare logic rules:

Rule         Input                                 Output
Skip         wp(skip, R)                           R
Assignment   wp(x := E, R)                         R[E/x]
Sequence     wp(S1; S2, R)                         wp(S1, wp(S2, R))
Conditional  wp(if B then S1 else S2 endif, R)     (B ⇒ wp(S1, R)) ∧ (¬B ⇒ wp(S2, R))
While loop   wp(while B do S done, R)              ∃ k. (k ≥ 0 ∧ Pk)
             where P0 ≡ ¬B ∧ R and Pk+1 ≡ B ∧ wp(S, Pk)

For the while loop, the rule states that the loop body S executes k times and terminates with R holding. Note that ∃ k. (k ≥ 0 ∧ Pk) is effectively an infinitely long assertion, for which induction is the most suitable proof approach in most cases.

7.6.2 Program proof

To prove that a terminating program S establishes a postcondition Q under a precondition P — that is, that the Hoare triple {P} S {Q} is provable for terminating S — we need only prove that P implies the weakest precondition: P ⇒ wp(S, Q).

7.7 Additional reading

1. Hoare’s original 1969 paper “An axiomatic basis for computer programming” [4] is a truly brilliant article that I highly recommend reading. It contains only a subset of the rules presented in this chapter. A PDF copy is available on the LMS and in the subject repository.

2. Mike Gordon from Cambridge University has written an excellent article on Hoare logic titled “Background reading on Hoare Logic”, which provides additional material and examples. This can be downloaded from http://www.cl.cam.ac.uk/~mjcg/Teaching/2011/Hoare/Notes/Notes.pdf

3. Dijkstra’s seminal work on predicate transformer semantics [2].

Bibliography

[1] J. Bentley. Programming Pearls. Addison-Wesley Professional, 2000.

[2] E. W. Dijkstra. Guarded commands, nondeterminacy, and formal derivation of programs. In Programming Methodology, pages 166–175. Springer, 1978.

[3] E. W. Dijkstra. “Why is software so expensive?”: An explanation to the hardware designer. In Selected Writings on Computing: A Personal Perspective, pages 338–348. Springer, 1982.

[4] C. A. R. Hoare. An axiomatic basis for computer programming.
Communications of the ACM, 12(10):576–580, 1969.

[5] C. A. R. Hoare. Procedures and parameters: An axiomatic approach. In Symposium on Semantics of Algorithmic Languages, pages 102–116. Springer, 1971.

[6] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 2nd edition, 1998.

Chapter 8

Advanced Verification

8.1 Pointers and Aliasing

Pointers are ubiquitous in programming. If you have programmed in Java, you have programmed with pointers (even if you haven’t been aware of it). Pointers are the source of many bugs in C programs. Ada also supports pointers, via access types (see section 4.9). Just as programming with pointers can often be tricky, reasoning about programs with pointers presents difficulties. To understand why, consider the following Ada program, which is a small variation on an Ada program we met when discussing pointers in section 4.9.

Listing 8.1: pass_by_reference.adb

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Pass_By_Reference is
   type Rec is record
      X : Integer;
   end record;

   type RecPtr is access all Rec;

   R : aliased Rec;

   procedure Print_R is
   begin
      Put("R.X: ");
      Put(R.X);
      New_Line;
   end Print_R;

   procedure CheckR(GivenR : in RecPtr) is
      X : Integer := R.X;
   begin
      Put("In CheckR"); New_Line;
      -- here we know that { X = R.X }
      GivenR.X := R.X + 1;
      -- but here, R.X has changed, even though we did not
      -- explicitly assign to R! So { X = R.X } is no longer true
      if GivenR = R'Access then
         Put("Equal");
      else
         Put("Not Equal");
      end if;
      New_Line;
      Print_R;
   end CheckR;

begin
   R.X := 5;
   CheckR(R'Access);
   Put("CheckR done"); New_Line;
   Print_R;
end Pass_By_Reference;

We have added a local variable X to the CheckR procedure, which is assigned the initial value of R.X.
Therefore, before we assign to givenR, we know that X = R.X, as reflected in the first comment. However, since givenR refers to the same record as R, when CheckR executes the assignment givenR.X := R.X + 1, this statement modifies the record R. Hence, after that assignment, X = R.X is no longer true, even though the assignment did not explicitly modify R! This highlights the fact that when two variables like R and givenR refer to the same region of memory, reasoning about programs becomes very difficult. The essence of the problem is that updates to one variable (e.g. givenR) can modify a seemingly separate variable (e.g. R). This problem is known as aliasing.

Definition 8.1 (Alias) We say that two variables alias when they both refer to the same region of memory. For instance, both might be pointers that refer to the same record or the same integer, or one might refer to a record and the other might refer to a field within that record. When this happens, we also say that one variable is an alias for the other. □

Because pointers make it very easy to (inadvertently) introduce aliasing into a program, reasoning about programs with pointers can be challenging.

8.2 Safe Programming with Pointers: Avoiding Unsafe Aliasing

How can we prove things about programs with pointers? Since aliasing is what makes reasoning about pointers difficult, one approach is to limit aliasing. This is the approach taken by recent versions of SPARK, which we explain in this section. Another approach is to develop extensions to Hoare logic that force one to prove, when reasoning about a program, that no unsafe aliasing occurs. We discuss this approach later in section 8.3.

8.2.1 Unsafe Aliasing

Aliasing is almost impossible to avoid. But luckily, not all kinds of aliasing are bad for reasoning about programs. Consider the following example Swap procedure.
Listing 8.2: A Procedure to Swap two Pointers

procedure Swap(X : in out IntPtr; Y : in out IntPtr)
  with Post => (X = Y'Old and Y = X'Old)
is
   Temp : IntPtr := X;  -- Temp and X alias each other
begin
   X := Y;
   Y := Temp;
end Swap;

This procedure intentionally creates an alias Temp for X. That is, after Temp is initialised with the assignment Temp := X, those two variables alias each other. However, this alias is not unsafe. The reason it is not a problem is that, unlike in pass_by_reference.adb in section 8.1, we never use both Temp and X at the same time to access the region of memory that they both refer to.

Definition 8.2 (Safe Aliasing) We say that the aliasing between two variables X and Y is safe if, at no point during the program while the two variables alias, do we ever use both variables at the same time. □

Under this definition, the aliasing in pass_by_reference.adb in section 8.1 is not safe, because in the assignment givenR.X := R.X + 1, we are using both givenR.X and R.X at the same time. Note that for two variables to be used "at the same time", the two uses don't need to be part of the same statement. As another example of potentially unsafe aliasing, suppose we decided to extend the Swap procedure above so that, before doing the swap, it printed out the two integers that X and Y refer to, as follows.

Listing 8.3: A Procedure to Swap two Integer Pointers while Printing Out the Integers

procedure Swap_And_Print(X : in out IntPtr; Y : in out IntPtr)
  with Pre  => (X /= null and Y /= null),
       Post => (X = Y'Old and Y = X'Old)
is
   Temp : IntPtr := X;  -- Temp and X alias each other
begin
   -- print out the integers that X and Y each point to
   Ada.Integer_Text_IO.Put(X.all);
   Ada.Integer_Text_IO.Put(Y.all);
   X := Y;
   Y := Temp;
end Swap_And_Print;

Now the alias between Temp and X is potentially unsafe, because while Temp and X alias, both variables are used.
We access X to print out the integer it refers to, and then later we also access Temp, to assign it to Y (on the final line).

8.2.2 Avoiding Unsafe Aliasing in SPARK

To avoid unsafe aliasing, we need to make sure that at all times in the program, even if two variables alias, we only ever use one of them. SPARK enforces this discipline via an anti-aliasing rule.

Definition 8.3 (Anti-Aliasing Rule) As soon as an alias is created, by assigning to some variable Y that now aliases an existing variable X, X is not allowed to be used anymore. □

Note that the Swap procedure above follows this rule because, as soon as we create the alias Temp for X, we don't use X anymore. Updating X's value in the assignment X := Y does not count, because after that assignment Temp and X no longer alias.

SPARK allows this rule to be weakened in some cases: in particular, when it can be sure that the aliasing is being used only to read the memory being referred to (but not to modify it). Therefore one can have limited forms of aliasing when using access-to-constant types. You can read more about SPARK's restrictions on pointer usage in the SPARK Reference Manual [1, Section 6.4.2, Anti-Aliasing].

Another situation in which SPARK allows aliasing is when passing arguments to functions. Recall that in SPARK, functions need to be free of side effects. This means that SPARK functions cannot modify memory that is referred to by a pointer. Therefore it is always safe to pass two pointers to a SPARK function, even if those two pointers alias.

8.2.3 Anti-Aliasing Assumptions in SPARK

Because SPARK enforces the anti-aliasing rule, it is free to assume that any two pointers that are both used at the same time cannot alias. To see how this works, consider the following procedure:

Listing 8.4: A procedure that writes to one pointer but not the other.
procedure Zero_First_Arg(X : in IntPtr; Y : in IntPtr)
  with Pre  => (X /= null and Y /= null),
       Post => (Y.all = Y.all'Old and X.all = 0)
is
begin
   X.all := 0;
   Ada.Text_IO.New_Line;
end Zero_First_Arg;

This procedure takes two integer pointer arguments X and Y and updates the value pointed to by X (making it 0) but not that of Y. The SPARK prover is able to prove that it satisfies its postcondition, which says that the value pointed to by Y is not changed. This is true, of course, only when X and Y do not alias. The SPARK prover is free to assume that they do not alias, since both are used (in particular, in the postcondition), because it enforces the anti-aliasing rule.

8.2.4 SPARK Pointer Checks

Just as the SPARK prover proves the absence of runtime errors like integer overflow and division by zero, it also proves that each pointer dereference is valid. A pointer is dereferenced when it is used to access the memory that it points to. A pointer is invalid when it is the null pointer (i.e. one that doesn't refer to valid memory). The SPARK prover will prove that all pointer dereferences are valid, and will give a warning when it cannot prove this. As an example, take the Zero_First_Arg procedure above, which accesses the memory that X refers to by storing 0 into that memory: X.all := 0. This dereference is valid only if X is not the null pointer. This is the reason that the procedure has the precondition requiring that X /= null. Since its postcondition also dereferences Y (in order to state that the memory referred to by Y does not change), it also requires the precondition that Y /= null.

8.2.5 Other Programming Languages: Rust

SPARK has only recently begun to support programs with pointers. Without its anti-aliasing rules, it would not be feasible for SPARK to reason about Ada programs with access (pointer) types. SPARK's anti-aliasing rules were heavily inspired by the Rust programming language, originally from Mozilla.
Rust is a programming language that is designed to be a safe alternative to "systems" programming languages like C. One of Rust's main differences from traditional languages is that it enforces strong anti-aliasing rules in all programs. Rust's rules are more permissive than the ones described above, and they exist not primarily to allow one to reason about Rust programs but instead to avoid common kinds of programming errors that can occur when using pointers. These include errors that might occur when two pieces of code share access to some common memory that is deallocated by one of them but subsequently used by the other. Rust prevents this by preventing both pieces of code from holding a (de-allocatable) reference to the memory at the same time. The same restrictions also allow one to write safe concurrent programs in Rust that are free of data race errors. You can learn more about Rust by consulting the various online resources available on that language.

8.3 Separation Logic

Another way to reason about programs with pointers, which avoids the restrictions imposed by SPARK, Rust and similar languages, is to extend Hoare logic to support explicit reasoning about pointers. The most widely used extension of Hoare logic for reasoning about pointers is called separation logic.

8.3.1 Pointers in C

Before describing separation logic, we give a brief reminder of how pointers work in the C programming language. We saw earlier how pointers can be used to implement pass-by-reference (i.e. how they can be used to avoid having to pass large data structures around in programs, to improve efficiency). Another common use for pointers is for dynamically allocated memory. The following C program gives a small example. It allocates a very small amount of memory dynamically (just enough to hold a single signed integer) using the malloc() function from the standard C library.
The malloc() function returns a pointer that refers to the newly allocated memory, if the allocation was successful. Otherwise, if allocation fails, malloc() returns NULL: the null pointer. The result from malloc() is stored into the local variable p; this variable is of type int *, which means that it is a pointer to a signed integer. The program checks whether allocation was successful. If so, it stores the integer 5 into the memory that was allocated. Specifically, in C, a pointer like the variable p in the program below stores the address of the memory that it refers to. The line of code *p = 5; means "write the number 5 into the memory location whose address is stored in the variable p". The program then loops until the value stored in the location that p points to is no longer greater than zero. It repeatedly decrements the value that p points to, by first reading this value into the local variable v and then storing back into this location the result of v - 1.

Listing 8.5: A tiny C pointer program.

#include <stdio.h>
#include <stdlib.h>
int main(void){
    printf("Hello, world!\n");
    int *p = malloc(sizeof(int));
    if (p != NULL){
        *p = 5;

        while(*p > 0){
            int v = *p;
            *p = v - 1;
        }
        printf("All done!\n");
    }else{
        printf("Allocation failed!\n");
    }
    return 0;
}

Recall that in C programs, local variables live in a region of memory called the stack. As explained above, pointers store memory addresses. These addresses often refer to memory in the heap, which is the region of memory used to hold dynamically allocated structures (like the dynamically allocated integer in the program above).

8.3.2 Separation Logic (for C)

Separation logic is an extension of Hoare logic. That is, it adds just a few extra ingredients to the Hoare logic that we learned about in chapter 7. Specifically, the memory of programs now contains not only local variables but also the heap. An assignment like x := x + 1 modifies the local variable x, by incrementing its value. In general we can write assignments x := E, where E is an expression that mentions local variables (but cannot refer to the heap). Local variables do not live in the heap, and modifying a local variable does not alter any of the memory in the heap. To access the heap, there are two new kinds of statements: the load and store statements.

The Load Statement

The load statement is used to read data from the heap and copy it into local variables. It looks like this:

x := ∗E

where E is an expression like x + y, z − 1, etc. The expression E is evaluated and its result is interpreted as a memory address. This memory address is looked up in the heap, and whatever value is stored at that heap location is copied into the local variable x. For example, suppose the variable p holds a pointer to an integer (which, necessarily, lives in the heap). For instance, p might store the address 120, and in the heap at location 120 there might be the number 5. Then after executing x := ∗p, x would hold the value 5.
The Store Statement

The store statement is used to write data into the heap. It looks like this:

∗E1 := E2

where E1 and E2 are both expressions that can mention local variables. The expression E2 is evaluated to produce a value v. The expression E1 is evaluated to produce a memory address a. Then the value v is written into the heap at the memory location a. For example, if p is a local variable holding the address 120 and x holds the number 5, then ∗p := x − 1 evaluates x − 1 to produce the value 4 and stores this value into the heap at location 120.

The Points-To Assertion: ↦

To reason about these two new commands, separation logic adds some new, special kinds of assertions that can be used in pre- and postconditions. The first of these is called the points-to assertion and is written as follows, for expressions Ep and Ev:

Ep ↦ Ev

It means that the heap, at the location identified by the expression Ep, holds the value Ev. For instance, in the example above in which the pointer p holds the address 120 and the heap at address 120 holds the value 4, we would write p ↦ 4. As another example, suppose the local variable a held the address of an array of integers allocated on the heap, and that this array lives at address 160. Suppose the second element of the array was the number 2 (i.e. we might normally write a[1] = 2, assuming the array is indexed from 0). In separation logic, we write a + 1 ↦ 2, which literally means that at address 161 in the heap is the number 2. If p is the null pointer, then it does not refer to any value in the heap; therefore there is no value v for which p ↦ v is true.

The Problem of Aliasing

The points-to assertion is separation logic's first weapon in the struggle to reason about pointers. Its second weapon we will introduce shortly. To understand why it is needed, however, consider the following scenario. Suppose we have two pointers x and y and we know that they both refer to the value 2 in the heap, i.e.
we know x ↦ 2 ∧ y ↦ 2. Then suppose we store 3 into the heap using the x pointer, i.e. the program does ∗x := 3. Is it the case that afterwards x ↦ 3 ∧ y ↦ 2? That is, is the following Hoare logic statement true?

{x ↦ 2 ∧ y ↦ 2} ∗x := 3 {x ↦ 3 ∧ y ↦ 2}

The store ∗x := 3 will definitely modify the part of the heap that the pointer x refers to, by writing 3 to that location. So afterwards, x ↦ 3 will definitely be true. But can we be sure that doing ∗x := 3 won't affect the part of the heap that y refers to? What if x and y alias, i.e. they both refer to the same location in the heap? In that case, y ↦ 2 would no longer be true afterwards: since y refers to the same location of the heap that x does, it would now be the case that y ↦ 3 instead. In general, the assertion x ↦ 2 ∧ y ↦ 2 gives us no information to say that x and y don't alias. As we saw earlier in this chapter, aliasing makes reasoning about pointer programs very difficult. Indeed, it is what invalidates the reasoning above. Separation logic's second weapon addresses this problem.

Separating Conjunction: ∗

To deal with the problem of aliasing, separation logic introduces a new "and" operator, in addition to the usual ∧ used above. The ordinary ∧ is called conjunction. The new "and" operator is called separating conjunction and is written as ∗. For predicates (e.g. pre- or postconditions) P and Q, we write P ∗ Q, and this means that P and Q both hold. However, it also says that the parts of the heap that P talks about and the parts of the heap that Q talks about must be separate from each other. Put another way, it means that P and Q both hold and, additionally, P and Q each talk about different (non-overlapping) parts of the heap. The part of the heap that a predicate refers to is called the predicate's footprint or domain. For a predicate P, we write dom(P) to refer to P's domain, which is just the set of heap locations that P talks about.
The domain of the predicate p ↦ v is just the location p, i.e. dom(p ↦ v) = {p}. The domain of a predicate P ∗ Q (for some P and Q) is the union of the domains of P and Q respectively: dom(P ∗ Q) = dom(P) ∪ dom(Q). Then P ∗ Q means that P and Q both hold and, additionally, that dom(P) ∩ dom(Q) = {}. Therefore, x ↦ 2 ∗ y ↦ 2 means not only that x and y both point to the value 2 but, additionally, that the two pointers do not alias, since we know that x and y must each refer to separate parts of the heap. Therefore, the following statement is true in separation logic:

{x ↦ 2 ∗ y ↦ 2} ∗x := 3 {x ↦ 3 ∗ y ↦ 2}

8.3.3 The Meaning of Separation Logic Statements

What does a statement {P} S {Q} mean in separation logic? Just as in Hoare logic, {P} S {Q} means that if the program S begins execution in a state satisfying the precondition P then, when S terminates, Q will be true. However, in separation logic, we also know something else: S will not fail. By "fail" we mean that S won't try to access an invalid pointer (e.g. a null pointer, or a pointer that hasn't yet been initialised and so might not refer to a valid memory address). For instance, suppose we changed the C program above so that it didn't check whether malloc() was successful before it dereferenced the pointer p. In that case, the program might fail if malloc() does not succeed. When we prove something about a program in separation logic, we also prove that the program will never fail, assuming it begins execution from a state satisfying its precondition. Therefore separation logic gives us guarantees that programs are memory safe, and so are free of null-pointer dereference vulnerabilities, buffer overflow vulnerabilities, use-after-free vulnerabilities and other memory errors.

8.3.4 The Rules of Separation Logic

Separation logic uses the same rules for skip, sequencing, if and while as normal Hoare logic. It also has the consequence rule, just as in ordinary Hoare logic.
(See chapter 7 if you need a refresher on those rules.) It does have different rules for reasoning about assignments to local variables x := E, plus, of course, new rules to reason about load and store statements. However, it also has a very special rule that normal Hoare logic does not, which makes separation logic more powerful than ordinary Hoare logic. This special rule is called the frame rule, and it harnesses the power of separating conjunction. We discuss the frame rule first, since it is so integral to separation logic, before discussing the other rules.

The Frame Rule

The frame rule allows one to reason locally about a statement, which means that we can reason about each statement just in terms of the parts of the heap that the statement interacts with, while ignoring the rest of the heap. This works because the frame rule allows us to conclude that the parts of the heap that a statement does not touch cannot change, and therefore anything that was true about those parts of the heap before the statement executed must still be true afterwards. The frame rule is written as follows:

    {P} S {Q}     modv(S) ∩ vars(R) = {}
    -------------------------------------
            {P ∗ R} S {Q ∗ R}

Here, S is any program statement. This rule says that if we can prove some statement {P} S {Q} about S, then we know that any other parts of the heap that S does not interact with do not change. Therefore, if we know separately that some property R holds before S, then R must still hold afterwards. The property R is called the frame. The use of separating conjunction ∗ in the pre- and postconditions is what tells us that the part of the heap that R talks about is separate from the parts of the heap that P and Q talk about. If we can prove {P} S {Q}, then we know that S only interacts with parts of the heap that are described by P (otherwise S would fail, and we wouldn't have been able to prove {P} S {Q}).
We also know that all of the parts of the heap described by P are separate from R (thanks to separating conjunction). Therefore, R isn't affected by S. Before discussing the side condition modv(S) ∩ vars(R) = {} in the rule, and why it is needed, first note that the frame rule is what allows us to conclude that the following separation logic statement is valid:

{x ↦ 2 ∗ y ↦ 2} ∗x := 3 {x ↦ 3 ∗ y ↦ 2}

In particular, y ↦ 2 is the frame R. Since the statement ∗x := 3 does not interact at all with the part of the heap referred to by y, we know that that part of the heap will not be affected by this statement. Hence, if y ↦ 2 holds beforehand, it will still hold afterwards.

The side condition modv(S) ∩ vars(R) = {} checks that the program doesn't change the meaning of the frame R. Suppose we changed the program above so that after doing ∗x := 3 it then did y := x. Would the following reasoning still be valid?

{x ↦ 2 ∗ y ↦ 2} ∗x := 3; y := x {x ↦ 3 ∗ y ↦ 2}

Clearly it would not, since after we run the program, x and y no longer refer to separate parts of the heap (they both refer to the same part of the heap, because we made y an alias for x). So y ↦ 2 is no longer true, and we can't even use separating conjunction anymore, because x and y are no longer separate! The problem here is that assigning to y changes the meaning of the frame y ↦ 2. This is why we need the side condition modv(S) ∩ vars(R) = {}. It says that the set of local variables modified by the statement S does not include any variables mentioned by the frame R. The set of variables modified by a statement S is written modv(S). The set of variables mentioned in an assertion R (a pre- or postcondition) is written vars(R). When a statement S modifies some variable mentioned by R, it can of course change what R means, as the example above demonstrates. So this side condition is needed to make sure that this doesn't happen, in order to conclude that the frame is not affected by S.
modv(S) can be defined very straightforwardly for all statements. Assignment statements x := E modify only the variable x, as do load statements x := ∗E. Store statements do not modify any local variables (they only modify the heap). For all other statements, the set of variables they modify is simply the set of variables modified by any sub-statement.

modv(skip) = {}
modv(∗Ep := Ev) = {}
modv(x := E) = {x}
modv(x := ∗E) = {x}
modv(S1; S2) = modv(S1) ∪ modv(S2)
modv(if B then S1 else S2 endif) = modv(S1) ∪ modv(S2)
modv(while B do S done) = modv(S)

The Load Rule

The rule for reasoning about load statements, also sometimes called the heap read rule, is as follows:

              x ∉ vars(E)
    ---------------------------------
    {E ↦ v} x := ∗E {x = v ∗ E ↦ v}

This rule says that if we know that the address E refers in the heap to a valid memory location that holds some value v, then if we load from that location and put the value into the variable x (i.e. we execute x := ∗E), then afterwards x = v must be true and E ↦ v will still be true. The condition x ∉ vars(E) is needed to ensure that modifying x does not change the meaning of E. Suppose we did the load statement x := ∗(x + 1) and, beforehand, we knew that x + 1 ↦ 4 was true. Then afterwards, x = 4 would be true, but (very likely) it would no longer be the case that x + 1 ↦ 4 was true (since that would now be true only if 5 ↦ 4 were true). This rule can be generalised to handle the case in which x ∈ vars(E), but for this course, the rule above is sufficient.

The Store Rule

The store rule is as follows:

    {Ep ↦ v} ∗Ep := Ev {Ep ↦ Ev}

It says that after executing ∗Ep := Ev, we know that Ep ↦ Ev must be true. However, this is valid only if Ep already refers to a valid memory location; if it didn't, the program might fail. So we need the precondition that there exists some value v for which Ep ↦ v is true. It doesn't matter what v is, but we do need to know that Ep ↦ something.
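To see how the load, store and frame rules combine, here is a small worked derivation of our own, assuming an initial heap in which p ↦ 3 and a program that increments the heap cell that p points to:

```latex
% Goal: {p \mapsto 3}\; x := {*}p;\; {*}p := x + 1\; \{x = 3 \ast p \mapsto 4\}
%
% Step 1: load rule (side condition x \notin vars(p) holds):
\{p \mapsto 3\}\ x := {*}p\ \{x = 3 \ast p \mapsto 3\}
%
% Step 2: store rule gives \{p \mapsto 3\}\ {*}p := x+1\ \{p \mapsto x+1\};
% framing on R = (x = 3) is allowed, since modv({*}p := x+1) = \{\}:
\{x = 3 \ast p \mapsto 3\}\ {*}p := x + 1\ \{x = 3 \ast p \mapsto x + 1\}
%
% Step 3: sequencing, then the consequence rule with x = 3:
\{p \mapsto 3\}\ x := {*}p;\ {*}p := x + 1\ \{x = 3 \ast p \mapsto 4\}
```

Note how the frame rule lets the pure fact x = 3 survive the store untouched, and the consequence rule then substitutes it into the points-to assertion.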
The Assignment Rule

Recall that in ordinary Hoare logic, we tend to reason backwards over programs. Backwards reasoning is also possible in separation logic; however, it is more common in separation logic to reason forwards over programs instead. Therefore, the assignment rule in separation logic is different to the assignment rule in Hoare logic. The assignment rule in separation logic is as follows:

          x ∉ vars(E)
    ----------------------
    {emp} x := E {x = E}

Here emp is a bit like the assertion "true": it talks about none of the heap (i.e. it talks about the empty heap, hence its name) and is true of every state. This rule says that after assigning E to x, x = E holds, as expected. Again we need the side condition x ∉ vars(E) to make sure that the meaning of E is not changed. Hence, this rule cannot be used to reason about assignment statements like x := x + 1. To reason about those kinds of statements, we need a more general rule:

    {x = n} x := E {x = (E[n/x])}

This rule says that after the assignment, x = E holds, except that we need to replace any occurrences of x in E with x's value before the assignment, n. Using this rule, we can reason forwards to prove {x = 3} x := x + 1 {x = 4}, since (x + 1)[3/x] = 3 + 1 = 4.

8.3.5 An Example Proof

As an example proof in separation logic, consider the following statement:

{x ↦ vx ∗ y ↦ vy}
t := ∗x;
u := ∗y;
∗x := u;
∗y := t
{x ↦ vy ∗ y ↦ vx}

This program starts with two valid pointers x and y. It then loads the values that x and y each point to into the variables t and u respectively. Finally, it stores u to x and t to y, so that x ends up pointing to the value that y was originally pointing at, and vice versa. To reason about this program, we first apply the Hoare logic sequencing rule. In separation logic we reason forwards, so the first statement we reason about is t := ∗x, with precondition {x ↦ vx ∗ y ↦ vy}. So we need to find some intermediate assertion ?Q such that:

{x ↦ vx ∗
y ↦ vy}
t := ∗x;
{?Q}

and

{?Q}
u := ∗y;
∗x := u;
∗y := t
{x ↦ vy ∗ y ↦ vx}

both hold. Working on the first goal, we apply the frame rule, taking the frame R to be y ↦ vy. Hence ?Q = ?P ∗ y ↦ vy, and we need to find ?P such that {x ↦ vx} t := ∗x {?P}. Using the load rule (since t ∉ vars(x)), we have {x ↦ vx} t := ∗x {t = vx ∗ x ↦ vx}; hence let ?P = (t = vx ∗ x ↦ vx), and so we know that:

{x ↦ vx ∗ y ↦ vy}
t := ∗x;
{(t = vx ∗ x ↦ vx) ∗ y ↦ vy}

This completes the reasoning about the first statement. We then have to reason about the three remaining statements. We of course use the Hoare sequencing rule to reason next about the statement u := ∗y, under the precondition {(t = vx ∗ x ↦ vx) ∗ y ↦ vy} that we just derived. This reasoning is analogous to the reasoning we did for the first load statement (using the frame and load rules). Afterwards we obtain

{(t = vx ∗ x ↦ vx) ∗ y ↦ vy}
u := ∗y;
{(t = vx ∗ x ↦ vx) ∗ (u = vy ∗ y ↦ vy)}

and we have to prove

{(t = vx ∗ x ↦ vx) ∗ (u = vy ∗ y ↦ vy)}
∗x := u;
∗y := t;
{x ↦ vy ∗ y ↦ vx}

Of course we apply the sequencing rule, so we have to find some intermediate assertion ?S such that

{(t = vx ∗ x ↦ vx) ∗ (u = vy ∗ y ↦ vy)}
∗x := u;
{?S}
∗y := t;
{x ↦ vy ∗ y ↦ vx}

We reason forwards over the first statement. As before, we first need to apply the frame rule to the precondition to focus on just the part needed by the store rule to reason about ∗x := u. Since ∗ is associative and commutative, we can rearrange the order and brackets in the different parts of the precondition as we like (technically, using the consequence rule). So we can take as the frame t = vx ∗ u = vy ∗ y ↦ vy, and via the store rule and the frame rule we can let ?S = x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ vy. Then we can apply similar reasoning to solve the final goal:

{x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ vy}
∗y := t;
{x ↦ vy ∗
y ↦ vx}

First we use the consequence rule, which says that this goal is true if we can find some final postcondition ?Z such that

{x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ vy}
∗y := t;
{?Z}

and ?Z ⇒ x ↦ vy ∗ y ↦ vx. We apply the frame rule, taking the frame to be x ↦ u ∗ t = vx ∗ u = vy, and then apply the store rule to obtain

{x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ vy}
∗y := t;
{x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ t}

All that remains to be shown is that x ↦ u ∗ t = vx ∗ u = vy ∗ y ↦ t ⇒ x ↦ vy ∗ y ↦ vx, which follows by substituting u = vy and t = vx. This completes the proof.

8.4 Security: Information Flow

Content to be added in 2022 (please email
[email protected]
for more info): To be added.

Bibliography

[1] AdaCore and Altran UK Ltd. SPARK 2014 Reference Manual, 2019. http://docs.adacore.com/spark2014-docs/html/lrm/index.html.

Chapter 9

Fault-tolerant design

In this chapter, we introduce the concept of fault-tolerant design: a design paradigm in which the system is designed to tolerate certain types of failure during operation. In this chapter, we discuss:

1. the notion of fault tolerance in systems, and its link to reliability and high-integrity systems; and

2. several design principles based on redundancy for designing a fault-tolerant system.

9.1 Introduction to fault-tolerance

Recall from the subject Software Engineering Methods that the reliability of a system is expressed as a probability: it is the probability that the system will operate without failure over a defined interval. Reliability is often measured as Mean Time To Failure (MTTF). Despite our best efforts, faults will still remain in software once it is in operation. Fault-tolerant design is a body of principles and methods for designing systems so that they can continue functioning correctly in the presence of faults. Fault-tolerant designs are used to achieve high reliability. By making a system more tolerant of faults, we increase the mean time to failure. Such an increase is important in many high-integrity systems: the higher the MTTF, the less chance there is of failure, making the system safer, more secure, or more reliable.

Definitions

Before we start to discuss fault tolerance, we introduce some terms relevant to the topic. Some of these you will already be familiar with, and some may be defined differently in other chapters of these notes (due to their being used in a different context/topic).

1. A failure is a point in the functioning of a system where the system behaviour deviates from its specification.

2. A fault is an incorrect step, process or data definition which will cause a system to fail for a given set of inputs.

3.
An error is a manifestation of a fault.

4. A hardware fault is a physical defect that can cause a system or component to produce an error.

5. A software fault is a defect in the source of the software that can cause a system or component to produce an error.

6. The reliability of a system is the probability that the system will not exhibit a failure over a specified interval (e.g. of time, or over a set of inputs).

Software and hardware failures

It is important to note that software and hardware fail in different ways. Hardware can fail as a result of parts wearing out, or of parts that were manufactured faulty. To repair the system, replacing the part with an equivalent part is generally sufficient. Further, hardware generally fails randomly. In this context, "randomly" means that, given the same inputs, a hardware component can pass on these inputs one time, and then fail the next time. Hardware does not often fail as a result of a design fault. Figure 9.1 shows a model failure curve for a hardware item. Hardware is likely to fail early due to a manufacturing error. If it does not, it tends to operate reliably for a period until the parts begin to wear out, at which time its likelihood of failure increases.

[Figure 9.1: A model failure curve for hardware. Failure rate is plotted against time: high during an initial "infant mortality" period, low through normal operation, and rising again during the "wear out" period.]

Software does not wear out. Software failures are generally the result of the design, so replacing the software with an identical copy will not restore the system. Further, software generally fails systematically, meaning that in the same state, with the same inputs, it will fail every time, or succeed every time.

What is fault tolerance?

A system is said to be fault tolerant if it can continue to function according to its specification in the presence of a finite number of faults. Thus, the field of fault tolerance accepts the fact that hardware and software will fail sometimes, but aims to produce a system that will be able to continue operating under a finite number of faults.
In a fault tolerant system, the system itself must be able to detect that a fault has occurred, and work around it.

Classifying Failures

There are generally accepted to be three main causes of failures:

1. Faults in the specification or design of the system. It is extremely difficult to assure the correctness of a specification. If the specification contains faults then everything that stems from it may also contain faults.

2. Faults in the components making up the system. Components may be defective because of the manufacturing process or because of wear and tear.

3. Faults due to environmental effects. Devices can be subjected to a large number of environmental stresses, e.g. radiation for space vehicles, temperature, or high G-forces.

Classifying Faults

Faults may be classified using the following set of classes:

1. Temporal behaviour: there are essentially three types of faults:
(a) Permanent faults do not die away with time but remain until they are repaired or the faulty component is replaced.
(b) Intermittent faults cycle between fault-active and fault-benign states.
(c) Transient faults die away after some time.

2. Output behaviour: there are two categories of output behaviour:
(a) A task generates non-malicious output if it is interpreted consistently by all processes receiving the output.
(b) A task generates malicious (or Byzantine) output if it cannot be interpreted consistently by all processes receiving the output.

Byzantine faults are much harder to neutralise in practice. We will discuss this in detail in Section 9.4.

3. Independence and correlation: failures may be categorised as follows:
(a) Independent failures do not cause other failures either directly or indirectly.
(b) Correlated failures are sets of failures that are related in some way.

Redundancy

The main means of ensuring fault-tolerance is through redundancy.
For a system to function despite the loss of some of its components, there must have been spare capacity to begin with. Fault-tolerant design techniques can be divided into four categories:

1. Hardware Redundancy: this involves multiple processors and duplicate computations.

2. Software Redundancy: this involves multiple developments of the software.

3. Information Redundancy: this involves adding additional bits for error detection and recovery.

4. Time Redundancy: this involves adding slack to task schedules and rerunning tasks if necessary, which may require rolling back to correctly working states.

The first three methods are the most commonly used in modern high-integrity systems, although the fourth is applied successfully in modern database systems. In this chapter, we will focus on the first three: hardware, software, and information redundancy.

Hardware and software redundancy are closely related, with the former inspiring the latter. However, due to the differences in how hardware and software fail (as discussed above), techniques from hardware redundancy must be modified before they can be applied to software.

Throughout the rest of this chapter, we will look at some methods for achieving hardware, software, and information redundancy, and will analyse their effectiveness and limitations.

9.2 Hardware Redundancy

The analysis of hardware is important in high-integrity systems. As discussed in Chapter 2, software on its own cannot cause physical harm: it operates hardware that causes harm.

The use of hardware redundancy is commonplace outside of high-integrity systems. For example, the Google search engine runs on multiple servers to decrease the likelihood of users being unable to connect. However, we are especially interested not just in switching servers when a server goes down, but in detecting and tolerating specific faults in a system.
9.2.1 Redundancy

The use of redundancy to increase reliability applies generally to electrical, mechanical, and software systems. Indeed, the use of redundancy for reliability has been around for centuries:

"The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if their computations are carried out by different methods." – D. Lardner, 1824.

The above quote is from D. Lardner's article in the Edinburgh Review from 1824. Lardner uses the term "computer" as it was used 200 years ago: to refer to people who do manual calculations, rather than programs running on hardware.

Lardner's idea still stands today. To achieve fault tolerance in high-integrity systems, multiple units (hardware and/or software) are used to do the same calculations. In 1824, if two or three "computers" were to calculate a result and find different answers, the calculation would be re-done by all of them, with the hope that the error would not be repeated. However, when using hardware/software to do calculations, repeating the calculation is unlikely to result in a different answer. Instead, different techniques are required to mitigate these problems.

9.2.2 Static pairs

A simple way to implement redundancy of hardware is to hard-wire two processors together to form a static pair. The pair runs identical software using identical inputs and compares the output of each processor. If the outputs are identical then the pair is functional; otherwise it is deemed faulty and the interface switches the pair off from the network. This is a straightforward way to detect faulty processors/components.

Figure 9.2: Static pairing with a monitor (processors P1 and P2 connected via an interface and monitor to the network)

There are two assumptions in this: (a) that the interface does not fail; and (b) that both processors do not fail identically and at around the same time.
Assumption (a) is not reasonable, but assumption (b) is reasonable for hardware failures; that is, given reliable components, it is unlikely that both will fail identically at the same time.

The interface is a critical (single) point of failure of the pair. One way to mitigate this problem is to introduce an interface monitor that checks the behaviour of the interface; conversely, the interface can check the functioning of the monitor. Figure 9.2 presents a model of this. The monitor checks that, if two different values are sent to the interface, the interface correctly switches the pair off the network. The interface checks the behaviour of the monitor as well. This introduces redundancy for the interface, so this cross-checking fails only if the monitor and interface fail at the same time.

9.2.3 Redundancy and voting

Static pairs can be used to detect a faulty component; however, they cannot continue operating reliably if one component fails, as there is no way to determine what the correct operating value should be. In the case of two redundant components, the best we can do when they disagree is to take some action such as shutting down to a fail-safe mode, because it is not straightforward (and often not possible) to determine which component has failed. As such, fault-tolerant designs tend to use more than two redundant components.

Once more than two components are used, it must be determined which components are correct, and which are faulty. To do this, we can use voting. The basic method is to use multiple units (hardware and/or software) and to vote on the result. The designer must decide whether to use exact agreement or approximate agreement.

Voting and consensus-based reliability

Consider three processors A, B and C such that A and B produce the value x, while C produces the value x + α. The problem in voting-based solutions is: what value of α is acceptable?
If the same types of processors are executing the same code and receiving the same interrupts at the same time, then it is easy to detect the faulty processor: if many components are in exact agreement, and one or two are not, then we can be confident that those not in agreement are faulty.

However, this is often not the case. Approximate agreement arises more often. For example, fluctuations in sensor readings, or using different processors and algorithms to eliminate design faults, may produce slightly different readings, with no way to determine which is correct. To mitigate this problem, we require a metric d(x1, x2) to be defined on outputs x1 and x2. The metric is specific to the system being designed. Essentially, two values are said to be sufficiently equal (SE) if and only if the difference d(x1, x2) between the two outputs is less than or equal to some specified error margin, ε. That is:

SE(x1, x2) ⇔ d(x1, x2) ≤ ε

However, the metric only provides us with a way to determine whether two values are sufficiently equal. In voting, we must instead determine that many values (more than two) agree, to arrive at an output value of which we are confident.

Example 9.1
Consider a system with two sensors used to read the temperature of a chemical storage tank. The reading is used to determine what action to take, such as lowering cooling rods into the tank. The system designers decide on an error margin of 0.2 degrees Celsius. The sensors then produce the readings 35.6 degrees and 58.0 degrees. There is no way to decide which reading is accurate, so no action can be taken.

However, if we have three sensors, producing the readings 35.6, 35.7, and 58.0 degrees respectively, then we can be quite confident (but not certain) that the reading of 58.0 is incorrect, so we can discard this value. □

In the case of more than two components, we need a voting algorithm to determine which readings are correct/incorrect.
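The SE relation can be sketched in a few lines of Python. This is an illustration only, assuming a simple absolute-difference metric and the 0.2-degree margin from Example 9.1; the names d, sufficiently_equal and EPSILON are ours, not from the notes.

```python
EPSILON = 0.2  # error margin chosen by the system designers (as in Example 9.1)

def d(x1: float, x2: float) -> float:
    """The metric on two outputs; here, plain absolute difference."""
    return abs(x1 - x2)

def sufficiently_equal(x1: float, x2: float, eps: float = EPSILON) -> bool:
    """SE(x1, x2) holds iff d(x1, x2) <= eps."""
    return d(x1, x2) <= eps

# The three readings from Example 9.1: 35.6 and 35.7 agree,
# but 58.0 agrees with neither.
print(sufficiently_equal(35.6, 35.7))  # prints True
print(sufficiently_equal(35.7, 58.0))  # prints False
```

Note that SE defined this way is not transitive: 30.1 and 30.3 agree, and 30.3 and 30.5 agree, but 30.1 and 30.5 do not. This is why the voting algorithms below must build classes of mutually agreeing values rather than simply chaining pairwise comparisons.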
There are three common voting algorithms used in high-integrity systems:

1. majority voting;
2. generalised k-plurality voting; and
3. generalised median voting.

Each voting algorithm takes a set of values as input, representing the outputs from the redundant components, and returns a single value based on those inputs, which can then be used by the system to make decisions.

Majority Voting

The concept behind the majority voting algorithm is quite straightforward: the output is the value on which more than half of the components agree. However, this is not as straightforward to decide as it initially seems. For example, given the set of five sensor readings {30.1, 30.2, 30.3, 30.4, 30.6}, with an error margin of 0.2, what should the output be? The first two elements do not agree with the final element, and the last two elements do not agree with the first. The algorithm needs to provide a single value: which should be chosen?

Assume that there are N outputs to be voted on and that N is odd. From an analysis of the system, we have chosen a metric d and an error margin ε such that d(x1, x2) ≤ ε if x1 and x2 are sufficiently equal for practical purposes (note that the SE relation is not transitive). The majority voting algorithm then proceeds as follows:

1. From the set of inputs, construct classes of inputs, P1 . . . Pn, such that:
(a) x, y ∈ Pi iff d(x, y) ≤ ε; and
(b) Pi is maximal; that is, if z ∉ Pi then there exists a w ∈ Pi such that d(w, z) > ε.

Each class of values should therefore be a set of values that agree with each other, and any input that agrees with that set should be in that set. Values can appear in more than one class.

2. Take the largest class generated, Pk. This represents the largest maximal set of inputs that agree with each other. If there are multiple largest classes of the same size, break the tie arbitrarily.

3. If Pk contains more than N/2 elements (that is, it forms a majority), choose any element in Pk as the output of the voting algorithm.
Alternatively, one could take the average value.

4. If Pk contains fewer than N/2 elements, the voting algorithm reports a failure: no consensus could be reached. In such a case, the greater system must determine what to do, such as entering a fail-safe mode or raising an alert.

So, the majority algorithm finds the largest set of values that agree with each other. If this forms a majority, then any value in that set is representative of the majority, and can be used as the voting algorithm's output.

Example 9.2
Consider an array of five sensors measuring the temperature of a chemical storage tank. The readings of the five sensors, and therefore the input to the voting algorithm, are:

30.1, 30.2, 30.3, 30.5, 30.7

The error margin is 0.2. None of the values can be immediately rejected, because each value agrees with at least one other value. Using the majority voting algorithm, we would allocate values to the following classes:

P1 = {30.1, 30.2, 30.3}
P2 = {30.3, 30.5}
P3 = {30.5, 30.7}

The class P1 is the largest class and contains more than 5/2 elements, so any of its values will suffice. We select the middle value, 30.2, as the output. □

Generalised k-plurality voting

A generalised k-plurality voter is the same as a majority voter, except that the chosen Pi must contain at least k elements, where k is determined by the system designer, rather than N/2 elements. A majority vote is simply a generalised k-plurality vote with k = N/2.

A k-plurality voter can be used to raise (increase k) or lower (decrease k) the risk tolerance of a system. That is, with a higher value of k, meaning that a larger set must agree, the system designers are specifying that they will be less tolerant of failure, and therefore more risk averse. In most real applications, k is specified to be less than N/2, meaning that a majority is not required.
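The four steps of the majority voting algorithm can be sketched as follows. This is an illustrative sketch only, assuming an absolute-difference metric; the function names are ours, and a brute-force search over subsets is used (acceptable only because N is small in practice).

```python
# A sketch of majority voting, assuming d(x, y) = |x - y|.
from itertools import combinations

def majority_vote(values, eps):
    """Return a value agreed on by more than N/2 inputs, or None if no majority."""
    if not values:
        return None
    n = len(values)

    def agree(subset):
        # A class is valid if its members are pairwise sufficiently equal.
        return all(abs(a - b) <= eps for a, b in combinations(subset, 2))

    # Steps 1 and 2: find the largest class of pairwise-agreeing inputs,
    # searching from the largest subset size down; ties broken arbitrarily.
    for size in range(n, 0, -1):
        cliques = [list(s) for s in combinations(values, size) if agree(s)]
        if cliques:
            largest = cliques[0]
            break

    # Steps 3 and 4: output any element of the class if it forms a majority;
    # otherwise report failure so the wider system can fail safe.
    return largest[0] if len(largest) > n / 2 else None

# Example 9.2: the class {30.1, 30.2, 30.3} forms a majority of the 5 readings.
print(majority_vote([30.1, 30.2, 30.3, 30.5, 30.7], 0.2))  # prints 30.1
```

A generalised k-plurality voter is obtained by replacing the test len(largest) > n / 2 with len(largest) >= k.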
Generalised median voting

The generalised median voter is a straightforward mechanism that simply chooses the median value, by repeatedly removing the pair of elements n and m such that, for all i, j ∈ P, d(n, m) ≥ d(i, j), until one element remains. Thus, it requires inputs with a total order relation between them, such as ≤.

An averaging voter is not sufficient because even a single faulty unit can skew the results terribly. For example, consider a faulty unit that returns 0 for all measurements, alongside two correctly functioning units. They may return 0, 100, and 102. The average is about 67.3, while the correct value is more likely to be around 101.

Question: Couldn't the generalised median voting algorithm be even more influenced by faulty units: what if the median value is generated by a unit that is faulty?

Comparison of voters

Table 9.1 presents a comparison of the three voters in various situations. For example, if the outputs from all components are correct and sufficiently equal (the first row of the table), then all three voters will produce a correct output. However, if all of the outputs are incorrect and none are sufficiently equal (fifth row), then the majority and k-plurality voters will each produce no output, while the median voter will produce an incorrect output.

From this table, we can see some interesting aspects of the voters. Median voters always produce an output, so are useful in systems in which a decision must be made (e.g. there is no option to enter a fail-safe mode). Majority and k-plurality voters will only produce incorrect output in the unlikely event that there is a majority (or a k-plurality) that are all incorrect and sufficiently equal. Thus, these are used in systems in which making the right decision is more important, and when the voter has no output, some alternative action (such as entering a fail-safe mode) is preferred.
Case                          | Majority voter | k-plurality voter | Median voter
All outputs correct and SE∗   | Correct        | Correct           | Correct
Majority correct and SE       | Correct        | Correct           | Correct
k† outputs correct and SE     | No output      | Correct           | Possibly correct
All correct but none SE       | No output      | No output         | Correct
All incorrect and none SE     | No output      | No output         | Incorrect
k† outputs incorrect and SE   | No output      | Incorrect         | Possibly correct
Majority incorrect and SE     | Incorrect      | Incorrect         | Incorrect
All incorrect and SE          | Incorrect      | Incorrect         | Incorrect

∗ SE = sufficiently equal
† where k < N/2

Table 9.1: Comparison of voter types.

9.2.4 N-Modular Redundancy (NMR)

The N-modular redundancy scheme is a method for masking faulty components in a system. It works by using multiple (N) redundant components (where N is usually odd) and voting on their output. There are two typical designs, outlined in Figure 9.3, which shows the popular triplex design. The first design has one voter, while the second has three voters, which mitigates the risk of a voter failing.

Figure 9.3: Structure of an N-modular sub-system: (a) triplex scheme with 1 output; (b) triplex scheme with 3 outputs

A system of N = 2m + 1 components can tolerate m failed units. The most popular is the triplex or triple redundancy scheme, which can tolerate the failure of a single unit. For example, if a failure occurs in a triplex then the faulty unit can be detected at the next vote and isolated. This then forms a duplex that can detect a faulty unit, but cannot mask it, because it is now similar to a static pair.

Exercise 9.1
With N units, how many failed units can N-modular redundancy tolerate if it need only detect a faulty unit, rather than mask it? □

Purging

When a faulty unit is detected, it should be purged, which means its results should no longer be considered, unless there is some reason to believe that the fault is transient.
Purging can be done in two ways:

• self-purging, in which a unit compares its result to the voted output, and purges itself if it is sufficiently different; or

• sift-out redundancy, in which all pairs of units are compared with each other (producing a total of N(N-1)/2 comparisons), and a controller can disconnect units that disagree with the majority.

While having multiple voters enables redundancy in the voters, there are other pragmatic reasons for having multiple voters. Many applications may require only one output, but for others, multiple outputs may be sensible. For example, a braking system with four sensors (one on each wheel) to determine the current speed must also send four values to the four brake callipers. If one of the voters sends an incorrect value, the other three are still able to produce enough force for the car to brake at close to the correct speed. However, if only one value is sent, and it happens to be incorrect, then the force applied may not brake the car.

9.3 Software Redundancy

Fault tolerance for software is harder than for hardware. While we can run software on multiple units of hardware, simply installing multiple copies of the software on redundant hardware does not mitigate software failures, because in general the N copies of the software will all fail at the same time, caused by the same design fault. These are called common-mode failures.

There are two general approaches used for software redundancy: N-version programming and recovery blocks. In this section, we present both in detail.

9.3.1 N-Version Programming

N-version programming is a software-based version of hardware redundancy. The catch is that we cannot simply create N copies of the same software, as these will all fail on the same inputs. Instead, N-version programming requires that N independent versions of the code are written. All N independent versions are executed and the output is voted on.
Voting algorithms used in hardware redundancy can be used (e.g. majority voting). However, careful planning and thought is required for this:

• First, it will be considerably more expensive. For hardware items, the design is a big cost, but producing new hardware components is often relatively inexpensive (compared to the design process). For software, however, the effort in creating new copies is essentially zero: copying some bits is so cheap as to not be worth worrying about. The cost in software is all in the design. Thus, the effort to produce two independent software components is approximately double that of producing one.

• Second, it will not be effective if the same design and coding faults are replicated. It is well documented that if multiple separate software teams design, implement, and test the same system, they are likely to make mistakes in the same places as each other. Thus, we see common-mode failures in independent implementations of the same specification.

Common-mode failures

Common-mode failures in engineering refer to those failures that are not statistically independent of one another. For example, the back two tyres on a car are likely to fail (run out of tread) at around the same time, because they will have driven the same distance. In software, common-mode failures typically result because programmers tend to make the same mistakes as each other; e.g. misinterpreting requirements in the same way, making small boundary shifts, or using the same (faulty) algorithm from a textbook.

Causes of common-mode failures in software

In software, common-mode failures can occur in multiple different implementations of the same system, regardless of operating environment or how long they have been in place. Common-mode failures can result in N-version systems due to the following:

1. Faults in the original system specification.
If two versions are implemented from the same specification, and that specification contains a fault, then the two versions are likely to contain the same fault. In software redundancy, versions are often written from the same specification.

2. Development environments. Similar languages and environments encourage similar styles and similarity in programs; using the same programming tools may result in the same faults.

3. The use of similar algorithms. For example, different teams implementing the same numerical algorithm may implement the same instabilities, especially if they both use well-known algorithms for solving the problem.

4. Similar training of the independent teams. This is common within organisations.

Eliminating common-mode failures

Eliminating common-mode failures in N-version programming is done by introducing design diversity. Essentially, this means implementing the same specification using diverse methods, tools, and teams. In response to the four points in the previous section, common-mode failures can be mitigated using the following:

1. Eliminate as many faults as possible in the specification (easier said than done!).

2. Have a controlling individual who specifies the programming environments, languages, and tools that are used by the separate teams. That is, one person (or group of people) is responsible for ensuring that the separate teams use diverse means for designing, implementing, and testing the software.

3. Keep development teams as separate as possible to avoid the use of common algorithms and data structures. This includes keeping them geographically separate to avoid them discussing their solutions with each other.

4. Use teams that have been educated/trained by different organisations. It is not uncommon to outsource to multiple organisations to introduce diversity.

Design diversity through independence?

Does using design diversity work?
Unfortunately, statistics on these methods are hard to come by, because performing controlled experiments on large-scale systems is expensive.

One academic study, performed by Knight and Leveson [1], attempts to assess the impact of independence on common-mode failures. This study compared common-mode failures of student assignments implemented by individuals who were geographically separated. Overall, 27 different versions were produced from the same specification at two universities: 9 versions from the University of Virginia (UVA) and 18 from the University of California, Irvine (UCI). The system was a simple missile defence system that took radar data as input and had to decide whether the signals represented incoming missiles. Neither set of students was told that the other university was participating in the experiment, so it is unlikely that there was communication between individuals at different universities.

All 27 versions were tested on the same 1 million tests, using the original version (produced by NASA) as the test oracle. Thus, the only faults that would not be noticed were those that failed on all 28 versions of the system.

Knight and Leveson compared the 9 UVA versions against the 18 UCI versions for common failures: that is, those that failed on the same test. The results are shown in Table 9.2. For example, versions 3 and 11 both failed on the same 58 inputs (out of a total of 1 million inputs).

The results indicate that being part of independent teams is perhaps not enough to provide reliable results. For example, versions 3 and 20 both failed on 325 of the 1 million inputs, which is too high for a missile defence system. Of course, from the results we can see that versions 3 and 8 are both particularly error prone, as are versions 20 and 26. These versions are probably less reliable than what we would expect from a team at NASA implementing a missile defence system.
In addition, all versions used the same programming language. Nonetheless, the results indicate that independence alone may not be sufficient to achieve fault tolerance.

Table 9.2: Results of the Knight and Leveson study on correlated failures

9.3.2 Recovery blocks

The recovery block approach is similar to N-version programming in that it also uses N versions. However, only one version is executed at any one time. If the output does not pass an acceptance test, then the next version is tried, and so on. Figure 9.4 shows a schematic for this design pattern. When an acceptance test fails and another version is invoked, all changes to the state of the system must be reversed.

Figure 9.4: A schematic diagram for recovery blocks

Acceptance tests

Recovery blocks require acceptance tests to determine whether the output is correct or not. This is the same as the oracle problem in testing.

Some acceptance tests can be straightforward to define; for example, a program that solves for the values of variables in an equation can be checked by substituting in the answer. Others are more difficult, and require something similar in complexity to the original program. For some problems, the complexity of checking a solution is considerably less than that of solving the problem; however, in many other cases, a sanity check is the best we can do.

Example 9.3
As an example of an acceptance test, consider an algorithm that calculates the GPS coordinates of an unmanned vehicle to determine its location. Determining the position from GPS data is complex; however, estimating whether the location is correct based on other data can be reliable. For example, if the code performing the acceptance test records the position from a few seconds prior, and knows the approximate speed and direction in which the vehicle is headed, then it can estimate the location using a simple "dead reckoning" algorithm.
Using such an algorithm, a small range of possible positions can be determined, and the output passes the acceptance test if and only if it falls inside that range. □

The challenge with acceptance tests is to produce a tight but accurate range of possible values. If the designer makes the range too tight, this may result in a lot of false negatives; that is, labelling an output as incorrect when it is in fact correct. If the designer sets the range too loosely, this may result in a lot of false positives; that is, labelling an output as correct when it is incorrect. The reliability of a recovery block is only as good as its acceptance test, so getting this right is of high importance. However, writing acceptance tests is more art than science.

Alternative versions in recovery blocks

As with N-version programming, recovery blocks require diverse designs to mitigate common-mode failures, and so implementing the different versions is expensive. The question is then: why use recovery blocks instead of N-version programming?

One reason is efficiency. The first version will generally be correct, so only one version will be executed. In addition, voting is not required. However, a more pragmatic reason is development cost. Because the first version is likely to be correct, it will be executed the most. Similarly, if the first version fails, then the second version is likely to be correct as well, so the third version will hardly (if ever) be executed. As a result, one common strategy is to make each successive version simpler than the previous one. If the system can tolerate some outputs that are less accurate, or are produced in a longer time, then an approximation algorithm, or an algorithm with a higher execution time, may be much easier to develop. This has three benefits:

1. It reduces development costs, because each successive version should be easier and cheaper to implement than the last. For N-version programming, each version must be a "full" version.

2.
It introduces more design diversity between the versions. If the different algorithms are actually expected to produce different results, and may even take different inputs, then the diversity between the algorithms will be greater.

3. The successive versions are simpler, which reduces the probability that they contain faults. This increases the reliability of the entire block.

Example 9.4
Consider the GPS algorithm for the unmanned vehicle discussed in Example 9.3. We could implement two versions in a recovery block, with one acceptance test. The first version uses the GPS data to calculate the coordinates, and, as in Example 9.3, a "dead reckoning" algorithm is used to determine a range of values for the acceptance test. If the acceptance test fails, the dead reckoning algorithm is used to determine the most likely location based on the previous location, the direction, and the speed. This is then used as the output.

Most GPS devices use such a recovery-block approach to determining GPS location when they cannot receive a signal. Just check your device the next time you are in a tunnel with no GPS signal: most devices will still attempt to track your current location. □

9.4 Byzantine Failures

Recall from Section 9.1 that whenever a fault causes multiple users of a unit to see different behaviour from that unit, the unit is said to be behaving maliciously. The idea of malicious output is a strange one, so let's illustrate it with an example. Consider a sensor sending out temperature readings to multiple components. If the communication mechanism between the sensor and the components is failing, then, for each reading, each component may receive a different value from the sensor. This is classed as a malicious or Byzantine failure.

The term "Byzantine" comes from the Byzantine Generals' Problem, proposed by Lamport et al. [3].
We will follow the description in Sections 17.1, 17.2 and 17.4 of dcg.ethz.ch/lectures/podc_allstars/lecture/podc.pdf.

9.5 Information Redundancy

Information redundancy is one method for increasing the fault tolerance of a system. The basic idea behind information redundancy is to store/send more information than is necessary, and to use that extra information to check for errors, and in some cases, to correct them. Coding schemes are typically used to implement information redundancy. Network communications research and microprocessor research have resulted in some efficient methods that use information redundancy to detect corrupted data.

9.5.1 Duplication

Duplication is the most basic form of information redundancy: all information is duplicated. Errors can be detected by simply comparing each bit with its duplicate. However, this requires twice as much storage/communication, requiring double the resources, so it is too inefficient for most systems, especially real-time systems or resource-constrained systems such as remotely operated vehicles.

Duplication can detect errors in information, but if the information is duplicated only once, then it cannot correct errors, because we cannot know whether it is the original or the duplicate that is in error. To correct information errors, more than two copies are required, which is a considerable overhead for many systems.

9.5.2 Parity coding

Error detection using parity coding

Parity coding is a simple method for detecting 1-bit errors in information; that is, where 1 bit is incorrect. It works by counting whether there is an odd or even number of 1s in an n-bit word, and then appending 0 (odd) or 1 (even) to the end of the word. When this information is read, a check can be done by counting the 1s in the first n bits of the word, and comparing the result to the (n+1)th bit. If these are different, then an error has occurred.
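The scheme can be sketched in a few lines. This is an illustration only, using the notes' convention of appending 1 when the n-bit word contains an even number of 1s and 0 when odd; words are handled as plain '0'/'1' strings, and the function names are ours.

```python
def add_parity(word: str) -> str:
    """Append the parity bit: '1' for an even count of 1s, '0' for odd."""
    return word + ('1' if word.count('1') % 2 == 0 else '0')

def check_parity(coded: str) -> bool:
    """Recompute the parity of the first n bits and compare it to the (n+1)th."""
    word, stored = coded[:-1], coded[-1]
    return add_parity(word)[-1] == stored

# 00001111 has four 1s (even), so a 1 is appended.
print(add_parity('00001111'))     # prints 000011111
# Corrupting the first bit makes the count odd, so the check fails.
print(check_parity('100011111'))  # prints False
```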
Example 9.5
The 8-bit word 00001111 is transferred over a communication channel between two nodes. There is an even number of 1s, so 9 bits are transferred: 000011111, with the final 1 indicating that there is an even number of 1s in the first 8 bits. If one of the bits is corrupted, for example the first bit (100011111), then the number of 1s is now odd. The receiving node counts the number of 1s in the first 8 bits, notices that this count is odd, compares this with the final bit, and thereby detects the error. 2

Clearly, parity coding works only if an odd number of bits have been corrupted. If two bits are corrupted, then the parity of the word is the same as before, and the corruption goes undetected. Despite this, the simplicity of the method and its small overhead make it popular for information redundancy.

Error correction using interlaced parity coding

Interlaced parity coding is a parity coding method that can correct as well as detect some bit errors. It works by dividing the word into overlapping portions and representing each portion with one parity bit, again using 0 (odd) or 1 (even) to record the number of 1s in that portion. Table 9.3, taken from Krishna and Shin [2], shows an example of this for an 8-bit word. Four parity bits are used, and depending on which parity bits are wrong, we can detect and correct one faulty bit, including the case in which a parity bit itself is in error. In Table 9.3, the parity bits are associated with the bits as follows:

• P0 covers bits w0, w2, w3, w6, w7, and P0
• P1 covers bits w0, w1, w3, w5, w6, and P1
• P2 covers bits w0, w1, w2, w4, w5, w7, and P2
• P3 covers bits w0, w1, w2, w3, w4, and P3

Table 9.3: Parity coding for an 8-bit word, taken from Krishna and Shin [2]

So, given an 8-bit word, the sender calculates the first parity bit (P0) by counting whether there is an odd or even number of 1s among w0, w2, w3, w6, and w7, and similarly for the other parity bits and their respective portions. It then sends a 12-bit word.
The receiver can use this information to determine which (if any) bit is faulty. Clearly there is an added overhead in this approach compared with the error detection technique discussed above, because additional parity bits are required. As such, this scheme is generally used only in situations in which the original word cannot be re-obtained, such as storing information in memory or on disk. In cases where the word can simply be re-sent, error detection is generally sufficient.

Example 9.6
Consider an example of storing information in 8-bit words, with the additional 4 parity bits. One part of our program reads the stored word and parity bits: 00000000-1100. It must check whether the extended word is faulty. It first calculates the expected value of the parity bits. The word is 00000000, so its parity bits are straightforward to calculate: they are all 1, because there is an even number (zero!) of 1s in the string. As a result, the 12-bit word should be 00000000-1111. There is no way of retrieving the original stored word, so the best we can do is try to correct this, hoping that only a single bit is in error. Comparing the expected and actual parity bits, we see that bits P2 and P3 (the last two) are incorrect. From Table 9.3, bit w4 must be the faulty bit (assuming exactly one bit has been corrupted), because this corresponds to the row of the table in which only P2 and P3 would be in error. Because we know that this is the faulty bit, and because a bit can take only one other value, we can correct the word to 00001000-1100. 2

Example 9.7
Consider the stored bits 00000000-1101. Again, we know that the parity bits should all be 1, so bit P2 is incorrect. Looking at Table 9.3, we can see that the parity bit P2 itself is the faulty bit. With the coding scheme shown in Table 9.3, it is always the case that if exactly one parity bit is incorrect, then that parity bit is the faulty bit. 2
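The correction procedure of Examples 9.6 and 9.7 can be sketched in code, using the coverage sets listed above for Table 9.3 and the same parity convention (1 for an even number of 1s, 0 for odd). This is an illustrative sketch only, and it assumes at most one of the 12 bits is corrupted.

```python
# Which data bits each parity bit covers (the coverage sets of Table 9.3).
COVERAGE = [
    (0, 2, 3, 6, 7),     # P0
    (0, 1, 3, 5, 6),     # P1
    (0, 1, 2, 4, 5, 7),  # P2
    (0, 1, 2, 3, 4),     # P3
]

def parity_bits(word):
    """word: list of 8 bits. A parity bit is 1 if its covered data bits
    contain an even number of 1s, 0 if odd."""
    return [1 - sum(word[i] for i in cov) % 2 for cov in COVERAGE]

def correct(word, stored_parity):
    """Correct a stored 8-bit word plus 4 parity bits, assuming at most
    one of the 12 bits is faulty. Returns the corrected (word, parity)."""
    expected = parity_bits(word)
    syndrome = {p for p in range(4) if expected[p] != stored_parity[p]}
    if not syndrome:
        return word, stored_parity            # no error detected
    for bit in range(8):                      # which data bit matches?
        if {p for p in range(4) if bit in COVERAGE[p]} == syndrome:
            fixed = word[:bit] + [1 - word[bit]] + word[bit + 1:]
            return fixed, parity_bits(fixed)  # a data bit was faulty
    return word, parity_bits(word)            # a parity bit itself was faulty
```

Running this on Example 9.6's stored value 00000000-1100 flips w4, yielding 00001000-1100; on Example 9.7's 00000000-1101 it leaves the word unchanged and repairs P2.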
9.5.3 Checksums

The final information redundancy technique that we will look at is checksums. Most readers are probably already familiar with the use of checksums in network messaging. Checksums are a useful error detection mechanism, but they cannot correct errors.

Checksums are used to detect errors in blocks of data, especially when that data is being transferred between nodes. The approach is simple. Given a block of data, break it up into words, e.g. 32-bit words, and add these words together to obtain the checksum. The checksum is then appended to the data. The receiver of the data strips the checksum upon receipt, re-calculates it in the same way, and compares the re-calculated checksum with the stripped checksum. If they differ, the data has been corrupted.

There are a few different methods for adding the words together. The simplest is known as the longitudinal parity check, which simply computes the exclusive-or of all words. To check the data, the receiver computes the exclusive-or of all words including the checksum. If any bit of the result is non-zero, then an error has been detected. This can detect errors in multiple bits, but not if an even number of errors occur in the same bit position across multiple words. The larger the block, the more likely an undetected error becomes. Checksums are popular because the overhead is small: for a block of 8-bit words, only an additional 8 bits are required.

9.6 Fault tolerance in practice

9.6.1 Airbus A310–A380

The Airbus series of aircraft designs relies heavily on hardware, software, and information redundancy to achieve fault-tolerant behaviour.

1. The A310 (circa 1983) had ten separate digital flight control computers.

2. The A320 (circa 1988) introduced fly-by-wire, in which four computers teamed in command/monitor pairs for redundancy; this became the standard approach for subsequent Airbus flight control computers.

3.
The A340 (circa 1992) had one command/monitor pair forming a "fail fast" module, failing over to another command/monitor "hot spare" pair. Error detection was achieved through mismatched commands, sequence checking, and self-tests when the aircraft was energised. A second command/monitor pair using a different microprocessor and different software provided design diversity, to tolerate common-mode design and manufacturing faults.

4. The A380 (circa 2005) employs dual-redundant Ethernet for non-critical functions, and electrical and hydraulic flight control diversity. Design diversity in the A380 project was achieved by using dissimilar computers, physical separation of teams, multiple software bases, different software development tools, and data diversity.

9.6.2 Boeing 777

The Boeing 777 is another aircraft that relies heavily on hardware, software, and information redundancy.

1. The goals were a mean time between actions of 25,000 operating hours, and a probability of degrading below minimum capability of less than 10^-10.

2. It was designed to tolerate Byzantine faults, object impact, component failure, power failure, electromagnetic interference, and cloud environment.

3. Byzantine fault tolerance is achieved with data synchronisation and median voting.

4. The architecture of the flight control computer has three independent channels, each composed of three redundant computing lanes (command, monitor, standby). The standby lane allows dispatch of the aircraft even with a lane or data channel failure.

5. Design diversity was achieved using different microprocessor hardware and different compilers for a fault-intolerant single source code.

9.6.3 Boeing 737 MAX Flight Control Redesign

Content to be added in 2022 (please email
[email protected]
for more info): In subsection 2.4.3 we discussed the case study of the Boeing 737 MAX aircraft. In this section we will discuss the testing regime that the Boeing 737 MAX underwent as part of its re-certification in 2019–2020, as widely reported, and the resulting changes to the architecture of the 737's flight control system.

9.7 Additional reading

1. These notes were based mostly on Krishna and Shin's chapter "Fault-tolerance techniques" [2], which is Chapter 7 of their book Real-time Systems. A PDF copy of the chapter is available on the LMS.

2. Moulding's chapter "Designing for high integrity: The software fault tolerance approach", which is Chapter 3 in C. T. Sennett, ed., High-integrity Software, Plenum Press, 1989, is a good resource on fault tolerance in high integrity systems. The book is available in the library.

3. Laprie et al.'s article [4], "Definition and analysis of hardware- and software-fault-tolerant architectures", is an old but excellent overview of fault tolerance in systems engineering. It describes different techniques for achieving fault tolerance, including those outlined in these notes, and discusses their properties, such as cost, benefits, and limitations. A PDF copy of the article is available on the LMS.

Bibliography

[1] J. Knight and N. Leveson. An experimental evaluation of the assumption of independence in multi-version programming. IEEE Transactions on Software Engineering, SE-12(1):96–109, 1986.
[2] C. Krishna and K. Shin. Real-time Systems, chapter 7. Addison-Wesley, 1999. (Available on the LMS).
[3] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals' problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
[4] J-C. Laprie, J. Arlat, C. Beounes, and K. Kanoun. Definition and analysis of hardware- and software-fault-tolerant architectures. Computer, 23(7):39–51, 1990.