1. a) Briefly explain why the UK Defence Standard 00-55 requires that Worst Case Execution Times and the amount of memory required by safety-critical software should be statically determined.
General-purpose software is typically concerned to ensure good average case performance while accepting that there may be occasional situations in which this level of performance cannot be achieved. In contrast, safety-critical software must demonstrate that it meets a range of non-functional requirements including both memory and time constraints. If either execution time or memory requirements cannot be statically determined before run-time then it will be difficult to demonstrate that the system will remain within these constraints.
b) Most modern processors provide cache memory that a memory manager can exploit for maximum throughput. Briefly explain why this creates particular problems for the safety-cases associated with software projects and describe two possible solutions to this problem.
The time taken to transfer data between main memory and any processor can act as a bottleneck. Caches address this problem and provide important optimisations for general-purpose processors. Data is temporarily stored in locations that provide faster access times for the processing units. Unfortunately, this can complicate the calculation of WCET values because analysts must decide whether or not to take account of any optimisations that might be supported through the use of a cache. In practice there are two approaches to this problem. First, the analyst can always assume that data is never held in a cache. This will result in WCET values that are consistently greater than the execution times actually observed on the processor. Alternatively, they may decide to switch off the cache altogether. This simplifies the analysis but imposes significant performance overheads.
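The first approach can be illustrated with a small calculation. This is only a sketch: the cycle counts and the function names are hypothetical values chosen for illustration, not figures for any real processor.

```python
# Hypothetical timing figures, chosen only for illustration.
HIT_CYCLES = 1     # assumed cost of an access served by the cache
MISS_CYCLES = 20   # assumed cost of an access served by main memory


def wcet_all_miss(compute_cycles: int, memory_accesses: int) -> int:
    """Pessimistic bound: treat every memory access as a cache miss."""
    return compute_cycles + memory_accesses * MISS_CYCLES


def cycles_with_hits(compute_cycles: int, memory_accesses: int, hits: int) -> int:
    """Cycles consumed on a run where `hits` of the accesses hit the cache."""
    misses = memory_accesses - hits
    return compute_cycles + hits * HIT_CYCLES + misses * MISS_CYCLES


bound = wcet_all_miss(1000, 200)            # 1000 + 200 * 20
typical = cycles_with_hits(1000, 200, 180)  # 1000 + 180 * 1 + 20 * 20
print(bound, typical)
```

The bound is safe because no run can be slower than the all-miss case under these assumptions, but the gap between the two figures is the price paid in over-provisioned hardware.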
First class answers might refer to the additional complexities created by modern two-level cache architectures and by cache pre-emption.
c) The Boeing 777 Primary Flight Control System uses different types of processors in each of three computing channels with cross-lane monitoring between each channel. Explain why this architecture can be used to mitigate processor design errors and attempt to ensure liveness.
Less able students may stick more narrowly to the issues of design redundancy because this is what we've covered in the class. More able students should go beyond this and look at the concept of diversity, which is the central theme of the question. Answers should explain that many modern processors have known design flaws. DEF STAN 00-54 argues that analysts must ensure the software does not trigger those flaws. Unfortunately, this is technically difficult to achieve. For instance, different types of the same processor can have subtle variations on a known fault. For new processors, it is unlikely that there will be adequate documentation even about those flaws that some other organisations have reported. It is for this reason that some safety-critical systems rely upon design diversity. The 777 PFCS uses different types of processor to support three computing channels. It is hoped that design flaws are not replicated by the different design and manufacturing processes that support the development of these processors. It can be assumed that these channels are redundant and that the results of any computations are cross-checked between each channel. Some solutions may also observe that monitoring may go beyond result comparison to ensure liveness during key stages of a computation. This follows the general pattern established by Triple Modular Redundancy. If the processors were not diverse then there is a danger that the voting or comparison routines used in this architecture would be undermined by a common fault across each of the channels.
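The voting pattern behind this argument can be sketched as follows. This is an illustrative sketch of a generic majority voter and not the 777 PFCS logic; the function name `vote` and the use of `None` to represent a lane that has failed to respond are assumptions made for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(lane_outputs: Sequence[Optional[int]]) -> Optional[int]:
    """Return the value agreed by a strict majority of lanes, else None.

    A lane output of None models a liveness failure: the lane did not
    deliver a result in time, so it cannot contribute to the vote.
    """
    live = [value for value in lane_outputs if value is not None]
    if not live:
        return None
    value, count = Counter(live).most_common(1)[0]
    # The majority is taken over all lanes, not just the live ones.
    return value if count * 2 > len(lane_outputs) else None


print(vote([5, 5, 7]))     # one lane disagrees and is outvoted
print(vote([5, 5, None]))  # one lane is silent, the other two still agree
print(vote([7, 7, 7]))     # a common fault in all lanes passes the vote
```

The last call shows why diversity matters: if every lane shares the same design flaw and produces the same wrong value, the voter has no way to detect it.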
2. a) NASA use Failure Modes, Effects and Criticality analysis to guide their risk assessments. Level 1 hazards are associated with the highest level of criticality. They are sufficient to cause overall shuttle failure defined by loss of vehicle and crew (LOVC). Early Shuttle risk assessments focussed on the probability of a level 1 failure occurring to individual component subsystems. Briefly explain why this systematically under-estimates the likelihood of LOVC incidents.
By focussing on the likelihood of level 1 failures for individual components, the NASA risk assessment procedures failed to adequately consider the ways in which Level 2, 3 and 4 failures might combine to cause a Level 1 incident. Similarly, the form of FMECA analysis described in the question does not explicitly address the problem of dependent failures in which a lower criticality event substantially increases the likelihood of a LOVC event being triggered in another area of the system.
b) Since the loss of the Challenger mission, Shuttle risk assessments have assumed that all level 1 failures will lead to LOVC incidents. Describe the problems that this might create for the subsequent engineering of Shuttle subsystems.
It is unlikely that every level 1 hazard will actually lead to an LOVC event. A range of manual and automated protection systems may intervene to mitigate the consequences of a level 1 hazard. Hence it might be argued that PRA should consider the contingent probability of a LOVC event given that a level 1 hazard has occurred. By not doing this, it can be argued that NASA engineers will deliberately over-estimate the likelihood of LOVC events. This may lead to the over-engineering of safety related systems. Additional protection mechanisms may be introduced and expense incurred when this is not strictly justified by the risk assessment. An important issue here is that the introduction of these additional features may itself introduce new failure modes into the system. Hence there is a paradoxical effect that the deliberate over-estimation of failure rates may actually threaten the safety of the system.
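The difference between the two assumptions is easy to see numerically. The sketch below uses hypothetical probabilities; neither figure comes from NASA data, and the function name is an assumption made for the example.

```python
def p_lovc(p_level1: float, p_lovc_given_level1: float) -> float:
    """P(LOVC) = P(level 1 hazard) * P(LOVC | level 1 hazard)."""
    return p_level1 * p_lovc_given_level1


P_LEVEL1 = 0.01  # hypothetical probability of a level 1 hazard per mission

# Post-Challenger assumption: every level 1 hazard leads to LOVC.
conservative = p_lovc(P_LEVEL1, 1.0)

# Contingent view: protection systems intervene in most cases.
# The value 0.25 is purely illustrative.
contingent = p_lovc(P_LEVEL1, 0.25)

print(conservative, contingent)
```

Under these illustrative numbers the conservative figure is four times the contingent one, which is the kind of gap that can drive the over-engineering discussed above.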
c) In 1995, a NASA FMECA investigation found that the probability of a LOVC was between 1 in 76 and 1 in 230 missions. These estimates considered a range of hazards, including the loss of a tile from the thermal protection system on the outer skin of the Shuttle. The analysis shows that 15% of the tiles are the source of 85% of the risk to the heat shield. Possible failure modes include a failure to centre the tile in its cavity during maintenance operations and letting the bond dry before applying pressure. Explain the problems of calculating Risk Priority Numbers given this data. Comment on the accuracy of any risk assessment for LOVC incidents that rely on likelihood estimates for these contributory factors.
Again this is an open-ended question. I have given a number of relatively simple FMECA formulae for calculating risk priority numbers (e.g., RPN = Severity x Occurrence x Detection). I have not provided the data necessary for such calculations. Instead, I'm hoping that they will relate the information that I have given them to the information that might be required to calculate a RPN. For instance, detection factors might be estimated from the observation that 15% of the tiles create 85% of the hazard. Answers should focus on the problem of obtaining reliable failure rate estimates for maintenance activities. A failure to centre a tile or to correctly apply tile pressure can be observed through quality control mechanisms and process inspections. These activities will provide some estimate of the likelihood of future problems. As we have seen, however, LOVC events are often characterised by contingent probabilities. From this it follows that there may well have been instances in which these hazards did occur, were not detected during quality control and did not lead to an LOVC event. This creates two problems. Firstly, our estimates of the likelihood of these hazards from monitoring activities may be a considerable under-estimate. Secondly, do we refine our risk analysis to consider contingent probabilities with all of the problems they introduce (see section b)? First class solutions could go on to consider the wider problems of assessing the risks associated with known hazards. In the case of the tiling system we even know which tiles pose the greatest hazard. We can, therefore, introduce more stringent quality control mechanisms to focus on these areas. How then do we calculate the new probability of failure in the maintenance of these tiles? Unless we can justify such a subjunctive risk assessment, it will be difficult to place any confidence in the overall likelihood of LOVC events for the Shuttle at a system level.
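The sensitivity of the RPN to a poorly known occurrence score can be shown with a short sketch. The 1 to 10 scoring scale is a common FMECA convention assumed here, and all of the scores are hypothetical; none of them is derived from the tile data in the question.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: Severity x Occurrence x Detection.

    Each factor is assumed to be scored on a 1 to 10 scale.
    """
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside 1..10")
    return severity * occurrence * detection


# Hypothetical scores for a mis-centred tile: the severity and detection
# scores are held fixed while the occurrence score is varied to reflect
# an under-estimate from quality control records.
observed = rpn(severity=10, occurrence=2, detection=6)
if_underestimated = rpn(severity=10, occurrence=5, detection=6)
print(observed, if_underestimated)
```

Because the factors are multiplied, an error in any one of them scales the whole RPN, so an unreliable occurrence estimate for maintenance failures undermines the ranking as a whole.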
3. a) The Dornier/CMIL Surgical Programmable Urological Device is a robotic system that supports a range of surgical procedures including laser based incisions and the insertion of radioactive sources (seeds) into a patient. The software architecture of this system is composed of four layers. These can be summarised as follows:
In broad terms, risk assessment techniques can be used to focus attention on those failure modes that are likely to occur most frequently or with the highest consequences for the safety of the system. This implies some knowledge of the internal architecture of the system being considered. Hence, it can be argued that white box testing is more appropriately linked to this risk analysis than black box methods. There are further benefits from this combination of techniques. If risk assessments are made for the failure modes at a device level then this can provide a 'systemic analysis' that is often missing when testing focusses on the behaviour of individual components. However, the complexity of systems such as the Urological Device can prevent any single evaluator from possessing a detailed knowledge of the implementation of each of the component layers. Therefore, even if white box techniques are applied to some components there may still be an element of black box testing. Alternatively, teams of analysts can test different aspects of the system. This can create managerial problems in co-ordinating the activities of the team.
b) Both white box and black box testing provide examples of dynamic verification techniques. What are the limitations of this approach compared to static analysis techniques? What additional issues must be considered when using these techniques to support the development of a safety-critical system rather than mass-market software?
There are many different answers to this question. For instance, dynamic testing relies upon a full or partial implementation being available. Static analysis can in contrast be performed on a model or abstraction of the intended system before it is built. Similarly, dynamic tests must either be performed in a real or simulated operating environment. There can be ethical objections and cost barriers that limit the use of these techniques. In contrast, static analysis can be conducted without the need for such simulations. Other arguments might include the problems of ensuring adequate coverage using dynamic techniques in contrast to model checking and other similar approaches. This argument can become difficult to sustain: it might be argued that model checking is a hybrid form with elements of both static and dynamic testing. A number of additional issues complicate the use of such testing techniques for safety-critical applications. Many of these issues stem from the need to satisfy external regulatory organisations of the sufficiency of the development process. This can impose additional documentation overheads not simply to explain what was tested but also why those tests were sufficient. This argument is particularly important in the case of white box testing. Answers might also refer to the need to justify the competence of key personnel, especially in signing-off verification results. A further set of issues surround the need to achieve particular integrity levels, for instance in IEC 61508. This can imply the use of particular design and verification techniques.
c) The software for the Dornier/CMIL Surgical Programmable Urological Device is implemented in ANSI C++ and runs under Linux (Red Hat 6.1). Write a brief technical report that explains the benefits of these implementation platforms and then summarises any concerns that you might have about producing such a safety-critical application using these environments.
There is a lot to talk about here. Starting with the operating system, the answer to question 1 described some of the problems associated with dynamic memory allocation and caching in safety-critical systems. Many of these arguments can be applied to the facilities offered by general-purpose operating systems such as Linux. Dynamic scheduling makes it difficult to ensure that any safety-critical implementation meets particular timing requirements. This would clearly be important for the real-time control of a surgical robot and contrasts strongly with more 'tailored' operating systems, especially the static scheduling algorithms used in MARS variants. On the other hand, there are some benefits to the use of such general-purpose operating systems. For instance, the level of support, development knowledge, management and diagnostic tools will be high in comparison to more dedicated operating systems. Costs are likely to be considerably lower. An important consideration for Linux variants is that it is possible to obtain significant sections of the source code to support any safety case. This forms a strong contrast with NT etc. The choice of ANSI C++ reflects similar development decisions. The language is relatively mature, there is a mass of good quality support tools and teaching materials are easily available. The fact that an ANSI variant is used indicates that programmers will have access to some form of agreed semantics for particular instructions. There are also particular language features that make C++ a reasonable choice. The way in which programmers are 'protected' from some of the less desirable features of C and are given access to a form of object orientation can be argued to support the development of more reliable code. On the other hand, there are weaknesses in ANSI C++. I've provided references to specific areas in which the standards are ambiguous about the language semantics. It can also be argued that the language inherits too many of its features from C for it to be acceptable as a 'language of choice' in safety-critical applications.
4. Both Perrow and Sagan have identified tight coupling as both a key strength and a major weakness in modern safety-critical systems. How can this concept be applied to explain the software failures that occurred during either NASA's Mars Surveyor'98 missions or the loss of Ariane 5's flight 501?
The nature of the solution will depend on which accident is considered. Access to both accident reports is available from the course web site. I have compared Perrow's 'Normal Accidents' theory with Sagan's work on 'High Reliability Organisations' in the lectures. I have also described the role of coupling in these theories. However, as with any essay question, I expect there to be considerable variations in the answers.
Tightly coupled systems are distinguished by their inability to recover from an initial failure before those failures escalate or are propagated throughout an application. Sagan has argued that they are also characterised by plans for very rapid reactions to any adverse event. He notes at the end of his book that even in 'high reliability' organisations, these plans can be difficult to implement. Such systems have time-dependent, invariant processes, little slack and little opportunity for improvisation once problems occur. Perrow argues that these are essential properties of high-technology systems. We push innovation to a point where efficiency is obtained by the absence of 'slack'. Sagan is more ambivalent and argues that we can choose to design such opportunities for intervention into our systems providing that we adopt a correct 'organisation for safety'.
In my view, it is far harder to apply these ideas to the Ariane 5 incident. In summary, approximately 40 seconds after 'lift off' (H0) there was an extreme swivelling of the nozzles of the two solid boosters. The launcher veered abruptly and self-destruct was correctly triggered by rupture of the links between the solid boosters and the core stage. Nozzle deflections were triggered by the On-Board Computers (OBC), which had incorrectly interpreted a diagnostic bit pattern as valid flight data. This bit pattern was generated by a software exception in the Inertial Reference System (IRS). The computer could not call upon the backup reference system because it had also suffered the same exception 72 milliseconds before. The exception was caused by the conversion of a 64-bit floating point number to a 16-bit signed integer. The floating point number was greater than that which could be represented by the integer. The Ada data conversion instructions were not protected from causing an Operand Error, although other conversions in the code were protected. The Operand Error occurred due to an unexpectedly high value related to the horizontal velocity sensed by the platform. This, in turn, occurred because the early part of the trajectory of Ariane 5 differs from that of Ariane 4 and results in considerably higher horizontal velocity values. On the one hand it can be argued that the Ariane 5 rocket failed because it was tightly coupled. An initial failure in the IRS quickly escalated into a failure of the entire mission. On the other hand, this is an inherent property of such real-time applications. The performance demands of such applications require the close coupling of system components in order for basic control functions to be maintained. There was certainly no room for improvisation once the initial failure had occurred! At another level, it might be argued that coupling relates more to the quality and safety control within the development and operation teams.
In this sense, it might be argued that Ariane 5 was too tightly coupled to Ariane 4. The exacting performance requirements of these applications meant that the new system was extremely vulnerable to small differences in the performance characteristics of both vehicles. The interdependence and coupling of application processes meant that the system was vulnerable to minor differences in the performance of legacy code that had no functional role at the point at which it failed.
Possibly a stronger argument might be made around Perrow's idea of complexity. I have introduced this and described the relationship to coupling in the lectures. This is appropriate in the Ariane 5 case study because the Lions report identified clear problems in determining whether or not the IRS legacy code could be safely taken out during the transition between Ariane 4 and Ariane 5.
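The unprotected conversion at the heart of the Ariane 5 failure can be sketched as follows. This is a Python illustration of the general hazard and not the Ada flight code; the function names, the `OperandError` class and the input values are assumptions made for the example, and saturation is shown only as one possible form of protection.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Stand-in for the exception raised by an out-of-range conversion."""


def to_int16_unprotected(x: float) -> int:
    """Convert to a 16-bit signed range, raising if the value does not fit."""
    value = int(x)  # truncates towards zero
    if not INT16_MIN <= value <= INT16_MAX:
        raise OperandError(f"{x} does not fit in a 16-bit signed integer")
    return value


def to_int16_protected(x: float) -> int:
    """One possible protection: saturate at the limits instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


# Hypothetical values: one within an 'Ariane 4 like' range, one beyond it.
print(to_int16_unprotected(1000.7))
print(to_int16_protected(40000.0))
try:
    to_int16_unprotected(40000.0)
except OperandError as err:
    print("exception:", err)
```

The coupling argument is that, once the exception is raised, nothing downstream distinguishes the resulting diagnostic output from valid flight data, so a local conversion failure propagates directly into the control loop.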
The software failures related to the Mars Surveyor'98 missions primarily concern the Mars Climate Orbiter, the Polar Lander and the Deep Space 2 missions. Unlike the Ariane 5 accident, there was almost no direct telemetry during the loss of these missions. In consequence, it is difficult to be so certain about the precise failure modes. It is assumed that the Climate Orbiter was lost because the performance characteristics of the engines were computed in pound-force seconds and not in newton-seconds as required by the Software Interface Specification. The Polar Lander is assumed to have failed because programmers forgot to clear a global variable indicating the craft had 'touched-down'. This variable was incorrectly set to TRUE when the legs initially deployed, several hundred feet above the surface of the planet. When the Doppler radar confirmed that the craft was close to landing, the jets were cut while it was still too far from the surface to prevent severe damage to the craft. Finally, the Deep Space 2 missions were lost from a failure in the RF/battery assembly, which had not been adequately impact tested prior to deployment. I have described these failures in detail during the course and would be quite happy to award full marks if students decided to focus on one of these component failures.
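The assumed Polar Lander failure mode can be sketched as a latched flag that is never cleared. This is a simplified illustration and not the flight software; the class, method and parameter names are hypothetical.

```python
class TouchdownMonitor:
    """Latches a touchdown indication and decides when to cut the engines."""

    def __init__(self, clear_on_enable: bool) -> None:
        self.touchdown_flag = False   # plays the role of the global variable
        self.sensing_enabled = False
        self.clear_on_enable = clear_on_enable

    def leg_sensor_signal(self) -> None:
        # Any signal is latched, including a transient at leg deployment.
        self.touchdown_flag = True

    def enable_sensing(self) -> None:
        # Called once the radar reports that the craft is close to landing.
        if self.clear_on_enable:
            self.touchdown_flag = False   # discard stale indications
        self.sensing_enabled = True

    def should_cut_engines(self) -> bool:
        return self.sensing_enabled and self.touchdown_flag


def descend(monitor: TouchdownMonitor) -> bool:
    """Replay the assumed sequence: leg transient, then sensing enabled."""
    monitor.leg_sensor_signal()   # spurious signal high above the surface
    monitor.enable_sensing()      # craft is close to, but not on, the surface
    return monitor.should_cut_engines()


print(descend(TouchdownMonitor(clear_on_enable=False)))  # engines cut early
print(descend(TouchdownMonitor(clear_on_enable=True)))   # engines keep running
```

A single missing assignment separates the two outcomes, which is the sense in which a small software omission was tightly coupled to the loss of the whole craft.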
As before, it is possible to argue that these missions failed because they were tightly coupled. For example the Climate Orbiter was lost because a single constant had been incorrectly recorded for a software component that the programmer did not consider to be mission critical. Similarly, the failure of the Polar Lander occurred in spite of the use of Doppler radar as a back-up check on the descent of the craft. These examples are interesting because it might be argued that they stemmed from loosely coupled organisations. The problems of co-ordinating Lockheed Martin, the Jet Propulsion Lab and NASA HQ led to a situation in which budgets were cut by management who had little idea of the real implications of this on engineering decisions. A more tightly coupled chain of command might have mitigated the consequences of these decisions. I think such an argument is a misuse of the coupling concept because the organisational background to this failure was one in which there was little slack even though the organisations were not tightly integrated.
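The 'single constant' in the Climate Orbiter case can be illustrated with the conversion between pound-force seconds and newton-seconds. The function name and the impulse value are hypothetical; only the conversion factor is a defined physical constant.

```python
LBF_TO_N = 4.4482216152605  # one pound-force expressed in newtons


def lbf_s_to_n_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_N


reported = 10.0                      # hypothetical value, produced in lbf s
misread = reported                   # consumer treats the number as N s
correct = lbf_s_to_n_s(reported)     # what the consumer should have used
print(misread, correct, correct / misread)
```

Every value passed across the interface without this factor is understated by the same ratio of roughly 4.45, so the error accumulates silently rather than appearing as an obvious fault.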
A first class answer might go on to argue that the functionality demanded by both of these applications required a tight coupling and that it is unrealistic to expect otherwise. Sagan, of course, might then argue that we should not accept technologies which rely upon such inherently risky architectures.