This is a draft - comments and criticisms are very welcome.
Forensic Software Engineering and the Need for New Approaches to Accident Investigation
Chris Johnson, Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK. Tel: +44 (0141) 330 6053 Fax: +44 (0141) 330 4913 http://www.dcs.gla.ac.uk/~johnson, EMail: firstname.lastname@example.org
Abstract. Accident reports are intended to explain the causes of human error, system failure and managerial weakness. There is, however, a growing realization that existing investigation techniques fail to meet the challenges created by accidents that involve software failures. This paper argues that existing software development techniques cannot easily be used to provide retrospective information about the complex and systemic causes of major accidents. In consequence, we must develop specific techniques to support forensic software engineering.
The Rand report into the "personnel and parties" in NTSB aviation accident investigations argues that existing techniques fail to meet the challenges created by modern systems:
"As complexity grows, hidden design or equipment defects are problems of increasing concern. More and more, aircraft functions rely on software, and electronic systems are replacing many mechanical components. Accidents involving complex events multiply the number of potential failure scenarios and present investigators with new failure modes. The NTSB must be prepared to meet the challenges that the rapid growth in systems complexity poses by developing new investigative practices." 
The Rand report reveals how little we know about how to effectively investigate and report upon the growing catalogue of software induced failures. By software "induced" accidents we include incidents that stem from software that fails to perform an intended function. We also include failures in which those intended functions were themselves incorrectly elicited and specified. The following pages support this argument with evidence from accident investigations in several different safety-related industries. These case studies have been chosen to illustrate failures at many different stages of the software development lifecycle. As we shall see, the recent NTSB investigation into the Guam crash has identified a number of problems in requirements capture for Air Traffic Management systems. Software implementation failures have been identified as one of the causal factors behind the well-publicised Therac-25 incidents. The Lyons report found that testing failures were a primary cause of the Ariane-5 accident. The South-West Thames Regional Health Authority identified software procurement problems as contributory factors in the failure of the London Ambulance Computer Aided Dispatch system. A further motivation for choosing these case studies is that all of these incidents stem from more complex systemic failures that cross many different stages of the software lifecycle.
2. Problem of Supporting Systemic Approaches to Software Failure
It can be argued that there is no need to develop specific forensic techniques to represent and reason about software "induced" accidents. Many existing techniques, from formal methods through to UML, can be used to analyze the technical causes of software failure. For instance, theorem proving can be used to establish that an accident can occur given a formal model of the software being examined and a set of pre-conditions/assumptions about the environment in which it will execute. If an accident cannot be proven to have occurred using the formal model then either the specification is wrong or the environmental observations are incorrect or there are weaknesses in the theorem proving techniques that are being applied. Unfortunately, there are many special characteristics of accidents that prevent such techniques from being effectively applied. For example, there are often several different ways in which software might have contributed to an accident. Finding one failure path, using formal proof, symbolic execution or control flow analysis, will not be sufficient to identify all possible causes of failure. There are some well-known technical solutions to these problems. For instance, model checking can be used to increase an analyst's assurance that they have identified multiple routes to a hazardous state. These techniques have been applied to support the development of a number of complex software systems. However, they have not so far been used to support the analysis of complex, software-induced accidents.
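The difference between finding one failure path and enumerating several can be illustrated with a small sketch. The Python fragment below is not taken from any of the accident reports discussed in this paper. The transition system, its state names and its event names are hypothetical, and the exhaustive search is only a toy stand-in for the explicit-state exploration that a real model checker performs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical transition system: state -> list of (event, next_state).
# None of these states or events come from a real system.
TRANSITIONS: Dict[str, List[Tuple[str, str]]] = {
    "nominal": [("config_change", "reconfiguring"), ("enable", "active")],
    "reconfiguring": [("change_done", "active"), ("interrupted", "stale_data")],
    "active": [("sensor_fault", "stale_data"), ("command", "safe_output")],
    "stale_data": [("command", "hazard")],
    "safe_output": [],
    "hazard": [],
}


def paths_to(target: str, start: str = "nominal") -> List[List[str]]:
    """Enumerate every acyclic event sequence leading from start to target."""
    found: List[List[str]] = []
    # Each stack entry holds: current state, events so far, states visited.
    stack: List[Tuple[str, List[str], Set[str]]] = [(start, [], {start})]
    while stack:
        state, events, visited = stack.pop()
        if state == target:
            found.append(events)
            continue
        for event, nxt in TRANSITIONS[state]:
            if nxt not in visited:  # skip loops so the search terminates
                stack.append((nxt, events + [event], visited | {nxt}))
    return found


hazard_paths = sorted(paths_to("hazard"))
for path in hazard_paths:
    print(" -> ".join(path))
```

In this toy model the search reports three distinct routes to the hazardous state. An analysis that stopped at the first route would miss the other two, which is the limitation of single-path techniques noted above.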
There are a number of more theoretical problems that must be addressed before standard software engineering techniques can be applied to support accident investigation. Many development tools address the problems of software complexity by focussing on particular properties of sub-components. As a result, they provide relatively little support for the analysis of what has been termed "systemic" failure. The nature of such failures is illustrated by the NTSB's preliminary report into the Guam accident:
"The National Transportation Safety Board determines that the probable cause of this accident was the captain's failure to adequately brief and execute the non-precision approach and the first officer's and flight engineer's failure to effectively monitor and cross-check the captain's execution of the approach. Contributing to these failures were the captain's fatigue and Korean Air's inadequate flight crew training. Contributing to the accident was the Federal Aviation Administration's intentional inhibition of the minimum safe altitude warning system and the agency's failure to adequately manage the system." (Probable Causes, Page 3, ).
It is unclear how existing software engineering techniques might represent and reason about the Captain's fatigue and the inadequate briefings that left the crew vulnerable to the failure of air traffic control software. Such analyses depend upon the integration of software engineering techniques into other complementary forms of analysis that consider human factors as well as organizational and systems engineering issues. There are a number of requirements engineering techniques that come close to considering the impact that these diverse systemic factors have upon systems development. Finkelstein, Kramer and Nuseibeh's viewpoint-oriented approaches are a notable example. However, existing requirement analysis techniques tend to focus on the generic impact of management and organizational structures on future software systems. They provide little or no support for situated analysis of the reasons why a specific piece of software failed on a particular day under specific operating conditions.
3. Problems of Framing Any Analysis of Software Failure
The problems of identifying multiple systemic causes of failure are exacerbated by the lack of any clear "stopping rule" for accident investigations that involve software failures. This problem is particularly acute because many different causal factors contribute to software "induced" accidents. For example, at one level a failure can be caused because error-handling routines failed to deal with a particular condition. At another level, however, analysts might argue that the fault lay with the code that initially generated the exception. Both of these problems might, in turn, be associated with poor testing or flawed requirements capture. Questions can also be asked about the quality of training that programmers and designers receive. These different levels of causal analysis stretch back to operational management and to the contractors and sub-contractors who develop and maintain software systems. Beyond that, investigators can focus on the advice that regulatory agencies provide for suitable development practices in safety related systems. This multi-level analysis of the causes of software failure has a number of important consequences for accident analysis. The first is that existing software engineering techniques are heavily biased towards a small section of this spectrum. For example, Software Fault Trees provide good support for the analysis of coding failures. Requirements analysis techniques can help trace software failures back to problems in the initial stages of development. However, there has been little work into how different management practices contribute to, or compound, failures at more than one of these levels [12, 13].
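The way in which a fault tree combines causes from different levels can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions. The tree, its gate structure and its event names are hypothetical and are not taken from the Software Fault Tree literature or from any accident report. It computes the minimal cut sets of a small AND/OR tree, that is, the smallest combinations of basic events that are sufficient for the top-level failure.

```python
from itertools import product
from typing import FrozenSet, Set

# Hypothetical fault tree. A node is either a basic event (a string)
# or a tuple (gate, children) where gate is "AND" or "OR".
TREE = ("OR", [
    ("AND", ["exception_raised", "handler_missing"]),
    ("AND", ["requirement_omitted", "test_case_absent"]),
    ("AND", ["exception_raised", "handler_missing", "test_case_absent"]),
])


def cut_sets(node) -> Set[FrozenSet[str]]:
    """Return every set of basic events that triggers this node."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child is sufficient.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child is needed; merge each combination.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal(sets: Set[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Discard any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


minimal_cut_sets = minimal(cut_sets(TREE))
for events in sorted(sorted(s) for s in minimal_cut_sets):
    print(events)
```

In this example the coding-level events and the requirements and testing events appear as separate minimal cut sets. Nothing in the tree records the management practices that might underlie both, which reflects the bias towards one section of the causal spectrum described above.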
The Therac-25 incidents provide one of the best-known examples of the problems that arise when attempting to frame any analysis of software failure. Leveson and Turner's accounts provide detailed analyses of the technical reasons for the software bugs. They also emphasize the point that iterative bug fixes are unlikely to yield a reliable system because they address the symptoms rather than the causes of software failures. It is instructive, however, that many software engineers remember this incident purely for the initial scheduling problems rather than the subsequent inadequacies of the bug fixes:
"in general, it is a mistake to patch just one causal factor (such as the software) and assume that future accidents will be eliminated. Accidents are unlikely to occur in exactly the same way again. If we patch only the symptoms and ignore the deeper underlying cause of one accident, we are unlikely to have much effect on future accidents. The series of accidents involving the Therac-25 is a good example of exactly this problem: Fixing each individual software flaw as it was found did not solve the safety problems of the device" (page 551, ).
A range of different approaches might, therefore, be recruited to identify the many different causal factors that contribute to major software failures. Such an approach builds on the way in which standards, such as IEC61508 and DO-178B, advocate the use of different techniques to address different development issues. There are, however, several objections to this ad hoc approach to the investigation of software induced accidents. The most persuasive is Lekberg's analysis of the biases amongst incident investigators. Analysts select those tools with which they are most familiar. They are also most likely to find the causal factors that are best identified using those tools. In the case of software engineering, this might result in analysts identifying those causal factors that are most easily identified using formal methods irrespective of whether or not those causal factors played a significant role in the course of the accident. A more cynical interpretation might observe that particular techniques might be selectively deployed to arrive at particular conclusions. In either case, the lack of national and international guidance on the analysis of software failures creates the opportunity for individual and corporate bias to influence the investigation of major accidents.
4. Problems of Assessing Intention in Software Development
It is not enough for analysts simply to document the requirements failures or the erroneous instructions or the inadequate test procedures that contribute to software "induced" accidents. They must also determine the reasons WHY software failed. Why was a necessary requirement omitted? Why was an incorrect instruction introduced? Why was testing inadequate? For instance, the Lyons report spends several pages considering the reasons why the inertial reference system (SRI) was not fully tested before Ariane flight 501:
"When the project test philosophy was defined, the importance of having the SRI's in the loop was recognized and a decision was made (to incorporate them in the test). At a later stage of the programme (in 1992), this decision was changed. It was decided not to have the actual SRI's in the loop for the following reasons: the SRIs should be considered to be fully qualified at equipment level; the precision of the navigation software in the on-board computer depends critically on the precision of the SRI measurements. In the Functional Simulation Facility (ISF), this precision could not be achieved by electronics creating test signals; the simulation of failure modes is not possible with real equipment, but only with a model; the base period of the SRI is 1 millisecond whilst that of the simulation at the ISF is 6 milliseconds. This adds to the complexity of the interfacing electronics and may further reduce the precision of the simulation" (page 9, ).
Leveson's recent work on intent specifications provides significant support for these forensic investigations of software failure. She argues that there will be significant long-term benefits for team-based development if specifications supported wider questions about the reasons why certain approaches were adopted. For instance, programmers joining a team or maintaining software can not only see what was done, they can also see why it was done. This approach is an extension of safety case techniques. Rather than supporting external certification, intent specifications directly support software development within an organization. Accident investigators might also use these intent specifications to understand the reasons why software failures contribute to particular incidents. Any forensic application of Leveson's ideas would depend upon companies adopting intent specifications throughout their software lifecycle. For example, maintenance is often a contributory factor in software induced accidents. Intent specifications would have to explain the reasons why any changes were made. This would entail significant overheads in addition to the costs associated with maintaining safety cases for external certification. However, it is equally important not to underestimate the benefits that might accrue from these activities. Not only might they help accident investigators understand the justifications for particular development decisions, they can also help to establish a closer relationship between the implemented software and the documented design. The report into the failure of the London Ambulance Computer-Aided Dispatch System emphasizes the problems that can arise without these more formal documentation practices:
"Strong project management might also have minimised another difficulty experienced by the development. SO, in their eagerness to please users, often put through software changes "on the fly" thus circumventing the official Project Issue Report (PIR) procedures whereby all such changes should be controlled. These "on the fly" changes also reduced the effectiveness of the testing procedures as previously tested software would be amended without the knowledge of the project group. Such changes could, and did, introduce further bugs." (paragraph 3082, ).
Many industries already have certification procedures for software maintenance. These help to avoid the ad hoc procedures described in the previous quotation. Safety cases go part of the way towards the intent specifications that are proposed by Leveson. However, there is little room for complacency. Kelly and McDermid argue that many companies experience great difficulties in maintaining their software safety cases in the face of new requirements or changing environmental circumstances. As a result there is no documented justification for many of the decisions and actions that lead to software failure. These have to be inferred by investigators in the aftermath of major accidents when a mass of ethical and legal factors make it particularly difficult to assess the motivations that lie behind key development decisions.
5. Problems of Assessing Human and Environmental Factors
Simulation is an important tool in many accident investigations. For example, several hypotheses about the sinking of the MV Estonia were dismissed through testing models in a specially adapted tank. Unfortunately, accident investigators must often account for software behaviors in circumstances that cannot easily be recreated. The same physical laws that convinced the sub-contractors not to test the Ariane 5's inertial reference systems in the Functional Simulation Facility also frustrate attempts to simulate the accident. The difficulty of recreating the conditions that lead to software failures has important implications for the reporting of software induced accidents. Readers must often rely upon the interpretation and analysis of domain experts. Unfortunately, given the lack of agreed techniques in this area, there are few objective techniques that can be used to assess the work of these experts. Given the complexity of the coding involved and the proprietary nature of many applications, accident reports often provide insufficient details about the technical causes of software failure. As a result, readers must trust the interpretation of the board of inquiry. This contrasts strongly with the technical documentation that often accompanies reports into other forms of engineering failure. It also has important implications for teaching and training where students are expected to follow vague criticisms about the "dangers of re-use" rather than the more detailed expositions that are provided for metallurgic failures and unanticipated chemical reactions.
The interactive nature of many safety-critical applications also complicates the simulation of software "induced" accidents. It can be difficult to recreate the personal and group factors that lead individuals to act in particular ways. It can also be difficult to recreate the ways in which user interface problems exacerbate flaws in the underlying software engineering of safety-critical applications. For example, the London Ambulance system required "almost perfect location information". As the demands on the system rose, the location information became increasingly out of date and a number of error messages were generated. These error messages are termed "exceptions" in the following quotation. The rising number of error messages increased the users' frustration with the software. As a result, the operators became less and less inclined to update essential location and status information. This in turn led to more error messages and a "vicious cycle" developed. Accident analysts must, therefore, account not only for the technical flaws in any software system but also for emergent properties that stem from the users' interaction with their system:
"The situation was made worse as unrectified exception messages generated more exception messages. With the increasing number of "awaiting attention" and exception messages it became increasingly easy to fail to attend to messages that had scrolled off the top of the screen. Failing to attend to these messages arguably would have been less likely in a "paper-based" environment." (Paragraph 4023, )
It is not always so easy to understand the ways in which human behavior contributes to the failure of computer based systems. This is a complex topic in its own right. Behavioral observations of interaction provide relatively little information about WHY individuals use software in particular ways. It is also notoriously difficult to apply existing human error modeling techniques to represent and reason about the mass of contextual factors that affect operator performance during a major accident . The London Ambulance report provides a further illustration of these problems. There were persistent rumors and allegations about sabotage contributing to the failure of the software. Accident investigators could never prove these allegations because it was difficult to distinguish instances of deliberate "neglect" from more general installation problems.
6. Problems of Making Adequate Recommendations
Previous paragraphs have argued that accident investigators must address the systemic factors that contribute to and combine with software failures during major accidents. They must also consider the scope of their analysis; software failures are often a symptom of poor training and management. It can also be difficult to identify the motivations and intentions that lead to inadequate requirements, "erroneous" coding and poor testing. Finally, we have argued that it can be difficult to determine the ways in which human factors and environmental influences compound the problems created by software failures in major accidents. There are also a number of further problems. In particular, it can be difficult for accident investigators to identify suitable recommendations for the design and operation of future software systems. This is, in part, a natural consequence of an increasing emphasis being placed upon process improvement as a determinant of software quality. Once an accident occurs, this throws doubt not only on the code that led to the failure but also on the entire development process that produced that code. At best, the entire program may be untrustworthy. At worst, all of the other code cut by that team or by any other teams practicing the same development techniques may be under suspicion. Readers can obtain a flavor of this in the closing pages of the Lyons report into the Ariane 5 failure. The development teams must:
"Review all flight software (including embedded software), and in particular:
Identify all implicit assumptions made by the code and its justification documents on the values of quantities provided by the equipment. Check these assumptions against the restrictions on use of the equipment."
(Paragraph R5, ).
This citation re-iterates the importance of justification and of intent, mentioned in previous paragraphs. It also contains the recommendation that the development teams must identify "all implicit assumptions made by the code". Unfortunately, it does not suggest any tools or techniques that might be used to support this difficult task. In preparing this paper, I have also been struck by comments that reveal how little many investigators appreciate about the problems involved in software development. This is illustrated by a citation from the report into the London Ambulance Computer Aided Dispatch system.
"A critical system such as this, as pointed out earlier, amongst other prerequisites must have totally reliable software. This implies that quality assurance procedures must be formalised and extensive. Although Systems Options Ltd (SO) had a part-time QA resource it was clearly not fully effective and, more importantly, not independent. QA in a project such as this must have considerable power including the ability to extend project time-scales if quality standards are not being met. This formalised QA did not exist at any time during the Computer Aided Despatch development." (Paragraph 3083, ).
It is impossible by any objective measure to achieve total software reliability, contrary to what is suggested in the previous quotation. It may be politically expedient to propose this as a valid objective. However, to suggest that this is possible is to completely misrepresent the state of the art in safety-critical software engineering.
7. Conclusion and Further Work
A number of agencies have argued that existing techniques cannot easily be used to investigate accidents that involve the failure of software systems [1, 2]. This paper has, therefore, gone beyond the high level analysis presented in previous studies to focus on the challenges that must be addressed by forensic software engineering:
It should be stressed that this is a partial list. There are almost certainly additional factors that complicate the analysis of software induced failures. This paper, therefore, only provides a first step towards the development of tools and techniques that will support next generation accident reporting.
[1] C.C. Lebow, L.P. Sarsfield, W.L. Stanley, E. Ettedgui and G. Henning, Safety in the Skies: Personnel and Parties in NTSB Accident Investigations. Rand Institute, Santa Monica, USA, 1999.
[2] US Department of Health and Human Services, Food and Drug Administration, Guidance for the Content of Premarket Submissions for Software Contained in Medical Devices. Report Number 337, May 1998.
[3] National Transportation Safety Board, Controlled Flight Into Terrain Korean Air Flight 801 Boeing 747-300, HL7468 Nimitz Hill, Guam August 6, 1997. Aircraft Accident Report NTSB/AAR-99/02, 2000.
[4] N.G. Leveson, Safeware: System Safety and Computers, Addison Wesley, Reading Mass. 1995.
[5] J.L. Lyons, Report of the Inquiry Board into the Failure of Flight 501 of the Ariane 5 Rocket. European Space Agency Report, Paris, July 1996.
[6] South-West Thames Regional Health Authority. Report of the Inquiry Into The London Ambulance Service Computer-Assisted Despatch System (February 1993) Original ISBN No: 0 905133 70 6.
[7] C.W. Johnson, A First Step Toward the Integration of Accident Reports and Constructive Design Documents. In M. Felici, K. Kanoun and A. Pasquini (eds), Proc. of SAFECOMP'99, 286-296, Springer Verlag, 1999.
[8] C.W. Johnson, Proving Properties of Accidents, Reliability Engineering and Systems Safety, (67)2:175-191, 2000.
[9] J. Rushby, Using Model Checking to Help Discover Mode Confusions and Other Automation Surprises. In D. Javaux and V. de Keyser (eds.) Proc. of the 3rd Workshop on Human Error, Safety, and System Development, Liege, Belgium, 7-8 June 1999.
[10] A. Finkelstein, J. Kramer and B. Nuseibeh, Viewpoint Oriented Development: applications in composite systems. In F. Redmill and T. Anderson (eds.) Safety Critical Systems: Current Issues, Techniques and Standards, Chapman & Hall, 1993, 90-101.
[11] N.G. Leveson, S.S. Cha and T.J. Shimeall, Safety Verification of Ada Programs using Software Fault Trees, IEEE Software, 8(7):48-59, July 1991.
[12] P. Benyon-Davies, Human Error and Information Systems Failure: the Case of the London Ambulance Service Computer-Aided Despatch System Project. Interacting with Computers, (11)6:699-720.
[13] J. Reason, Managing the Risks of Organizational Accidents, Ashgate, 1998.
[14] A.K. Lekberg, Different Approaches to Incident Investigation - How the Analyst Makes a Difference. In S. Smith and B. Lewis (eds), Proc. of the 15th International Systems Safety Conference, Washington DC, USA, August 1997.
[15] N.G. Leveson, Intent Specifications: An Approach to Building Human-Centered Specifications. Accepted for IEEE Trans. on Software Engineering (2000).
[16] T.P. Kelly and J.A. McDermid, A Systematic Approach to Safety-Case Maintenance, M. Felici, K. Kanoun and A. Pasquini (eds.) SAFECOMP'99, LNCS 1698, Springer Verlag, 1998.
[17] C.W. Johnson, Why Human Error Analysis Fails to Support Systems Development, Interacting with Computers, (11)5:517-524, 1999.