Copyright Chris Johnson, 1999.
Xday, XX May 200X.

9.30 am - 11.15 am



University of Glasgow





DEGREES OF BEng, BSc, MA, MA (SOCIAL SCIENCES).





COMPUTING SCIENCE - SINGLE AND COMBINED HONOURS
ELECTRONIC AND SOFTWARE ENGINEERING - HONOURS
SOFTWARE ENGINEERING - HONOURS





SAFETY-CRITICAL SYSTEMS DEVELOPMENT





Answer 3 of the 4 questions.

1.

a) Briefly describe the main stages involved in a Failure Modes, Effects and Criticality Analysis (FMECA).

[4 marks]

The main steps in FMECA can be summarised as follows:

  1. Construct functional block diagram.
  2. Use diagram to identify any associated failure modes.
  3. Identify effects of failure and assess criticality.
  4. Repeat 2 and 3 for potential consequences.
  5. Identify causes and occurrence rates.
  6. Determine detection factors.
  7. Calculate Risk Priority Numbers.
  8. Finalise hazard assessment.
The key points to note are that step 4 will identify any knock-on effects throughout a system. In contrast, step 5 can be used to look for precursors to a failure.
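
Steps 5 to 7 can be illustrated with a minimal Python sketch. It takes the Risk Priority Number to be the product of the severity, occurrence and detection indices. The 1 to 10 scale for each index is an assumption (a common convention, not something fixed by this answer), and the failure modes and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (remote) to 10 (almost certain)
    detection: int   # assumed scale: 1 (almost always detected) to 10 (undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for value in (self.severity, self.occurrence, self.detection):
            if not 1 <= value <= 10:
                raise ValueError("index outside the assumed 1-10 scale")
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes and scores, for illustration only.
modes = [
    FailureMode("display freezes", 4, 5, 2),             # RPN 40
    FailureMode("sensor reports stale value", 7, 4, 6),  # RPN 168
    FailureMode("power supply drop-out", 9, 2, 3),       # RPN 54
]

ranked = rank_by_rpn(modes)
for mode in ranked:
    print(mode.name, mode.rpn())
```

Note that the most severe mode in this example (the power supply drop-out) is not the one with the highest RPN, because the ranking depends equally on all three indices.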

b) FMECA can be used to calculate Risk Priority Numbers (RPNs), which are given as the product of the severity index, the occurrence index and the detection index. What are the main limitations and dangers in using this approach to hazard analysis?

[6 marks]

A number of problems limit the utility of risk priority numbers. For instance, there are many unknowns early in the development process. It can be difficult to make assumptions about what components will be used and what the associated values should be for the severity index, the occurrence index and the detection index. This is problematic because it is precisely during the early stages of development that resources devoted to risk reduction can have the greatest impact.

Many of the values that can be associated with these indices are subjective and lack empirical support. For example, criticality is assessed using key words such as `Moderate'. The more detailed descriptions associated with these different levels are industry dependent but are still difficult to apply consistently. There is no guarantee that two independent assessors would agree that the same failure mode would result in `Product is operable, but comfort or convenience item(s) are inoperable'.
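
The effect of this subjectivity can be shown with a small Python sketch. The two assessors, the failure modes and all of the scores are hypothetical, and a 1 to 10 scale is assumed for each index. The assessors disagree by a single point on the severity of one mode, which is enough to reverse the priority order.

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number as the product of the three indices."""
    return severity * occurrence * detection


# Hypothetical (severity, occurrence, detection) scores from two assessors.
# They differ only in the severity assigned to mode X (6 against 7).
assessor_1 = {"mode X": (6, 4, 5), "mode Y": (5, 5, 5)}
assessor_2 = {"mode X": (7, 4, 5), "mode Y": (5, 5, 5)}


def top_priority(scores):
    """Return the name of the failure mode with the highest RPN."""
    return max(scores, key=lambda name: rpn(*scores[name]))


print(top_priority(assessor_1))  # mode Y (125 against 120)
print(top_priority(assessor_2))  # mode X (140 against 125)
```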

There are further problems. Given the low frequency of high-criticality failures in most industries, there may only be near-miss data on which to base any estimates. There may be no actual experience of the potential severity of particular modes and so we must rely upon heuristics such as `worst plausible outcome'. Similarly, the values associated with the detection of a failure mode rely upon the management and operation of maintenance and inspection practices. These can change over time and hence may invalidate any previous assumptions in the calculation of an RPN.

If the RPN is too low then there is a danger that the risk associated with particular failures will be under-estimated and insufficient design resources will be allocated to correct any potential problems. Conversely, if the RPN is too high then some aspects may be `over-engineered' and finite development resources will be wasted.

See also FMECA.COM.


c) You have been asked to devise a functional block diagram for a replacement to the London Ambulance Service dispatch system. Briefly sketch a high-level view of this diagram and make an initial list of possible failure modes for at least three of the components.

[10 marks]

[CWJ: there are many possible solutions - submit one and I'll provide feedback].



2.

a) Compare and contrast at least three different definitions of workload.

[4 marks]

There are numerous definitions of workload:

  1. Colloquial or everyday definitions focus on the level of stress that an individual feels when performing particular tasks. It often relates to the number of things that they have to do but can also be affected by the number of interruptions that they experience.
  2. Physical workload is a more limited concept. It can be measured and defined in terms of energy expenditure, expressed as kilocalories and oxygen consumption.
  3. Psychological workload picks up on concepts from the everyday definition that are not considered in most physiological studies. In particular, it focusses on the mental processing that is required by different tasks. Within this definition there are several distinct approaches. For instance, some researchers focus on the way in which the need to simultaneously attend to different visual and auditory sources can affect individuals. Others look at the way in which different tasks reduce an individual's ability to perform problem solving activities.
Other solutions might focus on situated definitions of workload - that look beyond laboratory studies and consider both shift patterns and fatigue. It is also possible to focus on workload in relation to function allocation and team working.

b) Briefly explain why Crew Resource Management techniques must explicitly consider the problems created by poor situation awareness.

[4 marks]

Crew Resource Management describes a series of techniques that are intended to help operators make the most of their finite cognitive and physical resources under conditions of high workload. These techniques include training about underlying error mechanisms. They also include the use of simulated operating scenarios so that teams can practise their performance in everyday and extreme situations. The term `situation awareness' has various definitions. In general terms, however, it refers to an individual's ability to detect important information in their environment and then use that information to anticipate the course of future interaction so that they can plan their subsequent intervention. Wickens and Flach provide a model in which cues from the environment are filtered according to a salience bias. A similar effect helps to identify critical information from the mass of effects that might be observed after any operator intervention.

Crew Resource Management must consider situation awareness because it is clearly important that each individual member of a team is aware of their current environment or situation so that they can respond appropriately. In particular, it is important that they monitor the activities of their colleagues and communicate any information that is necessary to help them follow their planned interventions. Problems often arise when the salience bias noted by Wickens and Flach leads operators to make incorrect assumptions about their colleagues' behaviour. They simply do not notice the cues that might help them to realise their colleagues are not performing in the way that they might have anticipated.


c) Explain why Crew Resource Management techniques are likely to be insufficient guarantors of situation awareness and optimal workload during the day to day operation of any of the safety-critical systems that have been introduced during this course.

[12 marks]

[CWJ: again several solutions are possible here. Submit an example and I'll provide feedback].

The key issue here is that many CRM techniques focus on the simulation of extreme situations. Pilots are trained under situations in which major aircraft systems fail or there are adverse weather conditions. Surgeons are trained in situations involving relatively rare complications. More recent work by the FAA and Seamster has focussed more on the development of CRM practices to ensure team coordination under more typical operating conditions. Individuals are encouraged to develop the communication and coordination practices that help people to monitor their interaction in everyday situations. The idea is that these practices will then be transferred more readily to emergency conditions.

Further doubts can be expressed about the utility of CRM training.



3.

a) Briefly explain why an appropriate and effective hardware management plan can have a significant impact on the acquisition of a safety-critical software system.

[5 marks]

The safety of a system is often measured in terms of service availability. This typically assumes that software meets its requirements, assuming that the requirements are `correct'. It also assumes that potential hardware failures are addressed, for example by preventative maintenance, before they affect the system as a whole.

Hardware management plans help to ensure that system components meet or exceed the integrity levels that are specified during any risk assessment. For example, the use of preferred suppliers can help to ensure that the computational infrastructure is based upon components of a known quality. They also help to ensure that parts are available when needed. The use of unproven or obsolete components can compromise these objectives.

In an ideal world, we might assume that software engineers will consider all hardware to be inherently unreliable. In such circumstances, considerable development resources should be allocated to detecting and resolving any hardware failures. In practice, however, risk assessments often direct the allocation of finite development resources. One consequence is that additional attention will be paid to software components that interact with potentially unreliable devices. However, if hardware management fails to ensure the assumed reliability levels then those software development resources will be misdirected.

Some additional notes


b) How does multilevel, triple modular redundancy remove a single point of failure from a standard TMR safety-critical system architecture?

[5 marks]

The key point to note is that the question asks about multi-level TMR, as illustrated in the diagram above. As can be seen, there are three voting elements at each stage. All of these propagate their results to the next level. In a single level TMR system, the voting element is not replicated and therefore acts as a single point of failure. In practice, however, the additional cost and complexity of implementing multi-level TMR can outweigh many of the potential benefits.
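
The difference can be sketched in Python. This is a simplified model under stated assumptions, not a description of any particular system: the modules are plain functions, the hypothetical `faulty_voter` and `voter_faulty` parameters inject a fault by corrupting a voter's output, and majority voting over three values stands in for the voting element.

```python
from collections import Counter


def majority(values):
    """Return the value reported by at least two of the three inputs."""
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three inputs")
    return value


def single_level(inputs, module, voter_faulty=False):
    """Single-level TMR: three modules feed ONE voter (a single point of failure)."""
    voted = majority([module(x) for x in inputs])
    return voted + 999 if voter_faulty else voted  # injected fault corrupts the output


def stage(inputs, module, faulty_voter=None):
    """One multi-level TMR stage: three modules feed THREE voters.

    Each voter passes its result to one module of the next stage.
    faulty_voter is the index of a voter whose output is corrupted.
    """
    results = [module(x) for x in inputs]
    outputs = []
    for i in range(3):
        voted = majority(results)
        outputs.append(voted + 999 if i == faulty_voter else voted)
    return outputs


def double(x):
    return 2 * x


def increment(x):
    return x + 1


# A faulty voter in stage 1 corrupts only one of the three values passed on;
# the voters in stage 2 out-vote the corrupted value.
after_stage_1 = stage([5, 5, 5], double, faulty_voter=1)  # [10, 1009, 10]
after_stage_2 = stage(after_stage_1, increment)           # [11, 11, 11]

# With a single voter, the same fault reaches the output directly.
single_output = single_level([5, 5, 5], double, voter_faulty=True)  # 1009
```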


c) NASA's technical summary of the orbiter guidance and control systems contains the following passage:

"Each computer in a redundant set operates in synchronized steps and cross-checks results of processing about 440 times per second. Synchronization refers to the software scheme used to ensure simultaneous intercomputer communications of necessary GPC status information among the primary avionics computers. If a GPC operating in a redundant set fails to meet two redundant synchronization codes in a row, the remaining computers will vote it out of the redundant set. Or if a GPC has a problem with its multiplexer interface adapter receiver during two successive reads of response data and does not receive any data while the other members of the redundant set do not receive the data, they in turn will vote the GPC out of the set. A failed GPC is halted as soon as possible"

Explain why this approach provides a high degree of assurance in such a safety-critical application.

[10 marks]

There are several concepts that each contribute to the assurance level associated with this system:

  1. redundancy - this can be seen through the explicit mention of a redundant set. If one of the GPCs fails then the system can continue. The mention of a multiplexer interface may also suggest some form of communications redundancy although it is difficult to be sure from this snippet.
  2. synchronisation - the architecture uses a form of hot redundancy with parallel computation and comparison of results. In order for this to work the independent computations must synchronise the exchange of results. A failure to meet synchronisation requirements also indicates failure, hence the reference to two missed deadlines.
  3. voting - if there is disagreement about the results of any redundant computation then there is an algorithm for resolving the conflict.
  4. fault detection and reconfiguration - as mentioned, synchronisation failures and status exchange can be used to ensure that non-faulty GPCs will continue in the presence of a fault in one of the redundant set.
There are several different ways in which these themes might be brought out in an individual solution to this question.
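
The fault detection and reconfiguration rule in the passage can be sketched in Python. This is only an illustration under stated assumptions: the `RedundantSet` class, the GPC names and the set size of four are hypothetical, and a single object stands in for the distributed vote that the remaining computers would take. The threshold of two consecutive misses is taken from the passage.

```python
class RedundantSet:
    """Toy model of a redundant set that removes members after repeated sync misses."""

    MAX_CONSECUTIVE_MISSES = 2  # "two redundant synchronization codes in a row"

    def __init__(self, names):
        self.members = set(names)
        self.misses = {name: 0 for name in names}

    def sync_point(self, responded):
        """Record one synchronisation point.

        responded is the set of GPCs that met the synchronisation code.
        Returns the set of GPCs voted out at this point.
        """
        for name in self.members:
            if name in responded:
                self.misses[name] = 0   # a successful sync resets the count
            else:
                self.misses[name] += 1
        voted_out = {
            name for name in self.members
            if self.misses[name] >= self.MAX_CONSECUTIVE_MISSES
        }
        self.members -= voted_out       # the failed GPC leaves the set
        return voted_out


# Hypothetical set of four computers; GPC3 misses intermittently, then twice in a row.
gpcs = RedundantSet(["GPC1", "GPC2", "GPC3", "GPC4"])
healthy = {"GPC1", "GPC2", "GPC4"}

first = gpcs.sync_point(healthy)              # GPC3 misses once: nobody removed
second = gpcs.sync_point(healthy | {"GPC3"})  # GPC3 recovers: count reset
third = gpcs.sync_point(healthy)              # GPC3 misses once again
fourth = gpcs.sync_point(healthy)             # second miss in a row: GPC3 voted out

print(sorted(fourth))        # ['GPC3']
print(sorted(gpcs.members))  # ['GPC1', 'GPC2', 'GPC4']
```

A single missed synchronisation does not remove a computer, which avoids reconfiguring the set on a transient fault, while two misses in a row do.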

4.

Explain the reasons for AND against the introduction of COTS products into safety-critical systems. What techniques can be used to improve the safety of systems that make use of such products?

[20 marks]

[CWJ: again several solutions are possible here. Submit an example and I'll provide feedback].


END