\documentclass[12pt]{article}

Using Reason's Model of Organisational Accidents in Formalising Accident Reports

C.W. Johnson * and R.M. Botting **

* Department of Computing Science, University of Glasgow, Scotland, G12 8QQ.

johnson@dcs.gla.ac.uk

**Graham Technologies, 41 Carlyle Avenue, Hillington, Glasgow, Scotland, G52 4XX.

This paper proposes a method for analysing what are called organisational accidents. The first step of the method involves using Reason's (1997) model of organisational failures. This provides heuristic guidance in identifying both the active and latent conditions that lead to major failures. The second step involves applying formal methods to support a detailed analysis of each latent and active condition. The method is demonstrated on a case-study: the railway accident at Watford Junction in the United Kingdom. Analysis of the formal model helps to identify organisational factors that might have prevented the accident. It also helps to identify weaknesses in the report itself. In particular we argue that a signalling standard was misunderstood, the consequences of which could lead to another serious accident.

Keywords: formal methods, organisational accidents, accident reports, LOTOS, simulation, latent factors.

1. Introduction.

This paper proposes a method for analysing what are called organisational accidents. The first step involves the use of a high level model to guide the identification of active and latent failures (Reason,1997). The second step involves applying formal methods to represent and reason about these failures at a level of detail that supports subsequent redesign.

1.1 Organisational Accidents

The central message of this paper is that accidents are complex. Modern systems are, typically, protected by many different defence mechanisms. These range from monitoring devices, such as integrated control units, through managerial and regulatory structures, including safety committees, to human factors techniques, including operator training and selection. When an accident does occur, it is usually because some trigger or catalyst has acted on more long-term failures in the design of a system. Figure 1 (taken from Reason, 1997) shows how these active and latent conditions combine to provide a path along which the various defences may be breached.

Figure 1: Latent and Active Failures that Contribute to an Accident Trajectory.

This paper focusses on one particular type of accident. We are less concerned with individual accidents in which a specific person or group is often both the agent and the victim of the accident. In contrast, this paper focusses on organisational accidents. These occur when no one person's failure was a sufficient cause for an accident. Their origins can, typically, be traced far back into many parts of the organisation. Reason (1997) describes three stages in the development of such accidents: the organisational factors stage, the local workplace factors stage and the unsafe acts stage. Organisational factors include strategic decisions, managing and auditing. Local workplace factors include insufficient training, poor communications, and unworkable procedures. Figure 2 extends Figure 1 to represent these various stages. In this paper organisational factors will include both the organisational factors and local workplace factors.

Figure 2: The Stages in the Development of an Organisational Accident.

1.2 The Watford Junction Case Study

Organisational accidents are complex, they have multiple causes involving many people operating at different levels in their respective companies. In consequence, Reason's model provides a heuristic guide for a detailed level of analysis. Formal methods support an in-depth investigation of both active and latent failures within organisational accidents. To illustrate this argument, we will look at a case-study: the Health and Safety Executive's (HSE) report into the Watford Junction Railway Accident (Robertson,1998). This is appropriate because it typifies reports into complex accidents that focus on organisational factors.

The text given below is taken from the Executive Overview of the report. (Figure 3 provides a graphical overview of the accident).

"The circumstances of the accident were that at 1724 on Thursday 8 August 1996, a collision occurred between two trains approximately 700m south of Watford Junction station on the West Coast Main Line (WCML). A northbound passenger train that had left London Euston station at 1704 and was travelling along the Down Slow line was struck by a southbound empty coaching stock (ECS) train that was progressing along the connections linking the Up Slow to the Up Fast lines.

A team from the HSE's HM Railway Inspectorate (HMRI) investigated the accident and was satisfied from the evidence that the driver of the northbound passenger train travelling towards Watford, failed to react correctly to two signals displaying caution aspects when approaching the red signal that protected the junction at Watford South. When that signal came into his view, his train was travelling at about 110 km/h (68 mile/h) and he made a full brake application. The distance between the point of the brake application and the red signal was insufficient to permit the train to stop at the signal and it eventually came to a stand 203 m (222 yds) beyond the signal, foul of the route set for the ECS train.

No defect was found with the braking system of the rolling stock or with the permanent way and there was no evidence of vandalism. However, a number of contributory and mitigating factors were revealed by the HMRI investigation.

Two main factors emerge. Firstly, there is no doubt that had Automatic Train Protection (ATP) been fitted to both track and trains, the collision would have been avoided. Secondly, the inappropriate application of a signalling standard in respect to the positioning of a permanent speed restriction (PSR) combined with the absence of a normal overlap (safety margin) at the signal that was overrun was a major element in the subsequent series of events(Robertson,1998).

Figure 3: Signalling Plan - Watford Junction Area (not to scale).

1.3 The Formal Analysis of Accident Reports

In recent work we have shown how formal methods can be used to improve accident reports. For example, Johnson (1997) and Telford and Johnson (1996) showed how formal methods could be used to identify weaknesses with the conclusions of a report. Botting and Johnson (1998) showed how a formally described task analysis model could assist our understanding of an accident. The previous focus was on individual behaviour during system failures. This paper demonstrates that formal methods in partnership with Reason's model of organisational accidents can also greatly enhance our understanding of an organisational accident.

The formal description technique used to analyse this report will be LOTOS. In LOTOS a system is seen as a set of processes which interact and exchange data with each other and with their environment. Each process may exhibit a range of internal behaviours. For instance, the LOTOS syntax G; P describes a process that first performs some action G and then behaves like process P. This can be used to describe sequences of behaviour. Similarly, P1>>P2 denotes the sequential composition of processes; P2 can start when P1 has finished. Conversely, P1 [] P2 can be used to denote a process that behaves either like P1 or like P2 depending on other aspects of the specification. This can be used to represent alternative traces of behaviour. LOTOS is, therefore, similar to other process algebras because it supports the representation of sequences of events. Unlike CSP, however, LOTOS supports the description of data types and the passing of values from one process to another (Abowd,1990). For example, inp(?x); outp(x!) represents a process that expects some input, which is bound to x, and then passes x as output from that process.

Later sections will describe how the Caesar Aldebaran Development Package (CADP) toolbox provides a suite of methods to analyse a LOTOS specification. The following sections focus on the application of the LOTOS notation to analyse the latent and active failures that can be identified using Reason's model of organisational accidents. A more complete introduction is available in Bolognesi and Brinksma (1989).

2. Weaknesses of Accident Reports.

Accident reports tend to be long documents containing a mass of data of varying degrees of relevance to the actual incident under consideration. Typically, an accident report will contain a synopsis, a conclusion and a body. The purpose of the conclusion is to explain the major causes of the incident and to provide recommendations that might avoid similar incidents in the future. The conclusions are based on evidence and analysis that is presented in the body of the report. However, there are several problems with this approach. The body consists of sections that tend to be written by different teams of experts. This often leads to inconsistencies. To identify all of the available details about a particular event or state of the system, it may be necessary to cross reference many different sections scattered throughout the report. For example, paragraphs 143/4 of the Watford Junction report state that:

"It is of considerable concern to note that SPAD incidents occurred at signal WJ759 in February and June 1994, August 1995 and July 1996 and yet no signal sighting committee had been convened...A signal sighting committee was convened ten days after the accident. It was noted that removal of trees on the left hand side of the left curving line would increase the sighting distance of signal WJ759 to about 735 metres."

Whereas much later, paragraph 171 states:

"…it is a matter of fact that train 2A17 did not take 730m to come to a stand from 110 km/h on the day of the collision."

Even if the report possesses adequate referencing, it can still be an arduous task to glean the needed information. This makes it very difficult to get an accurate view of the accident. This is especially true of incidents that involve many different aspects of an organisation.

It has long been known that, in the field of Software Engineering, specifications of systems described using natural language often contain ambiguities and inconsistencies. They can also be incomplete. In an analogous way this is also true of accident reports. These problems stem from the flexible and expressive nature of natural language. Such strengths can, however, create problems for the development of safety critical systems. They can also weaken accident reports if the conclusions are erroneous or misleading.

Recent work has shown that formal notations can be used in conjunction with traditional analysis to provide better quality reports (Telford and Johnson 1996, Johnson 1997, Botting and Johnson 1998). Building formal models of an accident can be used in helping to verify that:

1. the main body of the report is unambiguous, consistent, and complete.

2. the conclusions logically follow from the main body of the report.

It is not claimed that any particular formal notation can model all the events of an incident. There are many issues which are difficult to model formally, such as the movement and density of smoke in the Channel Tunnel fire incident (R. Allison et al.,1996). Another example concerns the Standard Signalling Principle, SSP 20, of the Watford Junction accident. We cannot easily write a procedure in a formal model that decides whether or not SSP 20 provides enough information to guarantee such an abstract property as ``safety''. Also, in producing a model, too much faith can be put in the assumptions that are made (Leveson,1986).

It is, however, impossible for accident reports to record all of the possible information that might be relevant to a particular failure. Judgements must be made about the frame or scope of the investigation. It can be impossible to identify precise information about all aspects of a failure in the aftermath of a major accident. This can lead to considerable differences in the level of detail provided by the analysis of different experts.

Formal models of accident reports can help to clarify salient issues and to identify ambiguities. We can formalise complex aspects so as to gain a better understanding of their detail. Different scenarios can be analysed, and properties about the system can be proved. However, in constructing a formal model, care and judgement must be exercised and any assumptions must be clearly stated.

2.1 Comments about the Executive Overview.

This paper will use formal methods to analyse the signalling standard that is cited as a mitigating factor in the executive overview. It will be shown that this standard is presented in a potentially misleading way within the report. Similarly comments can be made about another mitigating factor mentioned in the executive overview; the reduced overlap plays no part in the Watford accident. Train 2A17 came to rest 203 m beyond the signal and foul of the route set for the ECS train. This was beyond a normal overlap of 183 m.

The executive overview suggests Automatic Train Protection (ATP) as a technical solution to preventing a recurrence of such accidents. However, this measure has been rejected as too expensive for installation throughout the UK rail system. In consequence, it is important that we understand the details of this incident so that we can identify alternative safeguards.

"The main conclusion was that ATP was too expensive to be adopted generally...

Railtrack undertook, among other things, to consider the inclusion of ATP in

future major re-signalling schemes" (Robertson, 1998).

2.2 Organisational Aspects of the Watford Junction Railway Accident.

At first glance the Watford Junction railway accident may be considered an individual accident. The driver failed to react appropriately to two sets of signals. A more thorough analysis of the report reveals that there were many organisational factors, not just those mentioned in the Executive Overview, that led to the accident.

The driver's action of ignoring the lights was a violation: a deviation from safe operating procedures, standards or safety rules (Reason,1997). However, the error the driver committed, to overrun a signal at danger, (referred to in the report as an SPAD - Signal Passed at Danger), is a relatively common occurrence. It is for this reason that a proper understanding must be gained of what bearing organisational factors had on the accident:

"The introduction of improvements from Hazard Ranking process investigations and what is claimed to be a greater awareness resulting from briefing of 'Red Alert' has coincided with a small but welcome reduction in SPAD incidents from 879 in 1993/94 to 832 in 1994/95 and to 771 in 1995/96 of which 676 (88\%) were attributed to driver error" (Robertson,1998).

Numerous organisational factors combined to make the probability of the accident more likely. A summary of these is shown in Table \ref{table1} where a list of active failures and latent conditions are given (Reason,1997). Active failures are errors and violations committed by front-line personnel, they usually have immediate effect. Latent conditions are the consequences of top-level decisions, they may lie dormant for a time doing no particular harm until they interact with local circumstances to defeat the system's defences. Latent conditions are associated with organisational factors, whilst active failures are associated with the unsafe acts of Figure 2. The list of latent conditions in Table 1 is not exhaustive.

Active failures

Latent conditions

1. Driver violates signals and horns.

Speed boards placed in wrong positions.
Drivers never given information on why speed boards are there
There had been several SPADs but a signal sighting committee was not convened.

4. An inspection of the signalling system at Watford Junction was never carried out.

5. The driver did not know of the reduced overlap.

6. The driver had committed SPADs recently and so should have been in the `at risk' category.

Table 1: Active failures and latent conditions.

The first step of our method is to produce this list. This list was not available in the Watford Junction accident report. The report arguably puts too much stress on the driver's violations and not enough on the organisational failings. As Reason says `we cannot change the human condition, but we can change the conditions under which people work' (Reason, 1997).

Listing the latent conditions informs us that organisational factors may have been important in explaining why particular events occurred on the day of the accident. We cannot, however, use this list to explain the detailed ways in which these factors combined to provide the circumstances for an accident to take place. Nor can the table be used to identify what changes might have prevented the accident from occurring. Given only the table, we still need to rely on the natural language report to identify these details. As has been said, however, the reasoning in accident reports is often poorly structured and incomplete. For a much better analysis and representation of the accident, we turn, in step 2 of the method, to formal methods.

3. Modelling Organisational Factors of the Railway SSystem.

Step 1 of the method helps to identify areas for further analysis. Step 2 involves modelling the latent conditions recognised in step 1. Once the model is constructed we can explore the consequences of the different active failures.

In this section we will demonstrate how LOTOS facilitates the modelling of organisational factors such as poor quality control, outdated procedures and communications failure. Section 4 goes on to demonstrate that formal methods can be used to analyse a situation in more detail, so as to clarify our understanding. This section will start by introducing a simple process. Then we will show how simple processes can be combined into more complex descriptions. In this way a complex model can be constructed from simpler components.

3.1 Lack of quality control.

We start by describing a simple process that models the decision to carry out an inspection of the Watford Junction signalling system (number 4 in the list of latent conditions).

" In practice, it would have been normal for works such as the Watford re-signalling to have been inspected.... In October 1993 the Director of Inter-City West Coast wrote to HMRI giving notification that the Watford Junction Resignalling scheme had been completed and commissioned in May/June 1993 and included a certificate of compliance. The certificate of compliance stated that the overlap beyond signal WJ759 was less than the preferred norm of 200 yds, but that it complied with Standard Signalling Principle (SSP) 20 because a permanent speed restriction of 60 mile/h had been imposed on the approach to signal WJ759.... Between November 1994 and the time of the accident, HMRI made a number of attempts to arrange an inspection without success" (Robertson,1998)

So an inspection was not carried out. If there had been an inspection, it might have been realised that the positioning of the Permanent Speed Restrictions (PSRs) did not satisfy SSP 20 and so could have led to an accident. With this realisation the PSRs would have been repositioned.

The LOTOS process inspection models the situation above. The process has a parameter state that is a record storing information about the state of the Watford Junction signalling system. There is a choice between whether or not the system is inspected. If it is inspected the PSRs are moved so that their positioning agrees with the HMRI's recommendation, given by modifying the state field spbs to HSE_speedboards. We will say much more about the standard SSP 20 and the placing of PSRs in Section 4.

process inspection[inspect](state:state): exit(state) :=

(* report para. 155 *)

inspect !accepted_without_inspection;

exit(Set\_inspected(accepted_without_inspection,state))

(* certificate of compliance given without inspection *)

[]

inspect !accepted_with_inspection;

exit(Set_inspected(accepted_with_inspection, Set_spbs(HSE_speedboards,state)))

(* certificate of compliance given after inspection *)

endproc

In Section 3.2 we demonstrate how simple processes can be combined to form larger processes.

3.2 Several failings in driver management.

There were several failings in driver management. If these had been addressed then the accident might have been averted. The first was a communications failure (number 2 in the list of latent conditions), the second involved not updating a procedure (number 6 in the list of latent conditions).

3.2.1 A communications failure.

"North London Railways (NLR) have provided a written statement to the effect that there is no record of any information provided by what was, at the time, the InterCity West Coast Operations Department regarding special provisions as part of the Watford resignalling scheme. NLR drivers and traction inspectors who were interviewed by a HMRI Inspector were firmly of the opinion that the 60 mile/h PSR board at the 15 milepost south of Bushey station was either due to condition of track or because of reduced platform clearances through the curve at Bushey station".(Robertson,1998).

The process literature_available is similar in structure to the process inspection. It provides a choice between whether or not literature is available to the driver. As well as the parameter state, it has a parameter driver_status which is modified

process literature_available [lit_avail] (state:state,driver_status:driver_status):

exit(state,driver_status) :=

(* report para. 166 *)

lit_avail !lit_no; exit(state,Set_litavail(lit_no,driver_status))

(* drivers did not receive any literature *)

[]

lit_avail !lit_yes; exit(state,Set_litavail(lit_yes,driver_status))

(* drivers received literature *)

endproc

3.2.2 Not updating a procedure.

It has already been stated that drivers often passed a signal on red, Signal Passed at Danger (SPAD). To reduce the number of SPADs, drivers were assessed to determine the likelihood of their committing a SPAD.

"Guidelines were set out to assist in the identification of incident prone drivers and to provide a basis for necessary action to improve safety performance".(Robertson,1998).

NLR set up their own Driver Performance Profile (DPP) for drivers in 1994 and amended it to comply with Railtrack's Standard GO/RT 3251 `Safety requirements for train drivers' in the same year. However:

"While the above work was going on at NLR, the safety Directorate of British Railways Board (BRB) issued in December 1994 a revised and more detailed code of practice BR/BCT 508 entitled `At risk drivers' ... The points system was such that a driver in the `at risk' category was likely to remain `at risk' particularly if he was involved in another serious incident such as a SPAD within 6 years. This process was designed to ensure that drivers who were involved in repetitive incidents were constantly monitored" (Robertson,1998).

These more stringent principles were not adopted by NLR.

"In the case of the driver of the passenger train involved in the Watford South collision, he was identified at the start of 1995 as being `probably incident prone' but at his 1996 assessment was reduced to `non-incident prone' based partly on the fact that there had been no incidents in the year under review and partly on a discretionary management decision" (Robertson,1998).

This situation is modelled by the two processes given below:

process driver_assessment95[category](state:state,driver_status:driver_status):

exit(state,driver_status) :=

(* report para. 133/4/5 *)

category !driver1 !category_A;

exit(state,Set_category(category_A,driver_status))

(* driver1 is considered 'probably incident prone' *)

endproc

process driver_assessment96[category](state:state,driver_status:driver_status):

exit(state,driver_status) :=

(* report para. 133/4/5 *)

category !driver1 !category_C;

exit(state,Set_category(category_C,driver_status))

(* in 1 year driver1 jumps to 'non incident prone' *)

endproc

The driver of 2A17 had skipped a category, something that the code of practice BR/BCT 508 would not have permitted. If NLR had adopted BR/BCT 508 then the driver of 2A17 would have still been monitored and thus much less likely to ignore warning aspects. The process updated_category_reduction models the driver moving one category:

process updated_category_reduction[category](state:state,driver_status:driver_status):

exit(state,driver_status) :=

(* report para. 133/4/5 *)

category !driver1 !category_B;

exit(state,Set_category(category_B,driver_status))

(* driver's risk is reduced by one category *)

endproc

The process driver_assessment defines the complete situation. The driver is assessed under the old NLR procedure in 1995. Then by the same procedure in 1996 unless the procedure is updated. The final process Is_driver_monitored is omitted for brevity. It states whether or not the driver is monitored, based on their category.

process driver_assessment[category](state:state,driver_status:driver_status):

exit(state,driver_status) :=

(* report para. 133/4/5 *)

(

driver_assessment95{[}category{]}(state,driver_status) >>

accept state:state,driver_status:driver_status in

(driver_assessment96[category](state,driver_status) [>

updated_category_reduction[category](state,driver_status))

) >>

accept state:state,driver_status:driver_status in

Is_driver_monitored[monitor_or_not](state,driver_status)

endproc

The process driver_assessment is now put in parallel with the process literature_available of Section 3.2.1 to form the process Driver_mgmt:

process Driver_mgmt[lit_avail,category](state:state,driver_status:driver_status):

exit(state,driver_status) :=

literature_available[lit_avail](state,driver_status)

|||

driver_assessment[category](state,driver_status)

endproc

This section has demonstrated how LOTOS facilitates the modelling of organisational factors within an accident. From the description of simple processes we can build up a complex model. There is insufficient space here to describe the whole LOTOS model but given these examples it is not difficult to see how a model of all the salient organisational points of the accident could be constructed. The full LOTOS model will be discussed in Section 5. Building such a formal model will allow us to perform analysis which would be difficult to carry out using the report alone, whilst using an informal model would be prone to error. Section 6 will show how the analysis is carried out. The next section shows how formal methods are used to analyse the complex details that characterise many latent failures.

4. The Application of Formal Methods to Understand a Problematic Procedure

Paragraph 174 of the report includes the text given below:

"The overlap to Signal WJ759 was 148m (162 yds). At the time that the resignalling scheme at Watford Junction was designed, the rules relating to the lengths of overlaps were given in Standard Signalling Procedure (SSP) 20. The basic requirement of SSP 20 is that the overlap to a signal shall normally be 183 metres (200 yds), but clause 3 of SSP 20 states \emph{`when it is considered essential to avoid restrictions to traffic movements the overlap may be reduced in length. The length `of the overlap shall be based on the maximum train speed (with no signal restrictions) at a distance of 440 yards on the approach to the signal.'} There then follows a table that shows that for an overlap distance of 150 yards, the maximum train speed allowed is 60 mile/h" (Robertson,1998).

SSP 20 explains what should be done but not how it is to be achieved. At the time that the Watford Junction system was being put in place there was some confusion as to where speed boards, or permanent Speed Restrictions (PSRs), should be placed to satisfy the regulations (number 1 in the list of latent conditions).

"Eventually, it was agreed that the start of the 60 mile/h PSR should be placed 440 yds in the rear of the point where a train with the worst braking characteristics would have to start braking from 60 mile/h in order to stop at Signal WJ759 and the end of the PSR should be at the point where a driver would first see Signal WJ755. The reasoning behind the decision seemed quite sound, as if the Signal WJ755 was at yellow, the driver would continue to brake. [The speed boards were therefore placed as given in Figure 3.] With the benefit of hindsight, this decision was wrong, as it gave drivers the message that they may accelerate, as did driver of train 2A17, up to 90 mile/h while still approaching Signal WJ759" (Robertson,1998).

To model this situation a 3-field record, Speed_boards, was defined in LOTOS as an abstract data type. The three fields were the relative distances, in metres, of the 60 mile/h speed board, the 80 mile/h speed board and the 90 mile/h speed board from the signal WJ759. Thus the set-up on the day of the accident was (see Figure 3):

Actual_speedboards = (1823,1378,1024)

and the set-up in the report which seems to be advised by the HMRI:

"Had the 60 mile/h PSR sign been placed 440 yards ( 402 m) short of Signal WJ759 instead of Signal WJ755 and given the assumption that the driver observed the speed limit in the same way as he observed the PSR south of Bushby station, the accident would, in all probability, have been avoided..." (Robertson,1998).

was:

HSE_speedboards = (402,null,null)

The HMRI state that SSP 20 is ambiguous. Formal methods provide a means to identify further weaknesses in the implementation of this standard. These weaknesses are not recognised in the report. There are two properties of instances of Speed_boards that concern us. Firstly we would like to know whether they satisfy SSP 20. For instance, does Actual_speedboards satisfy SSP 20? Secondly, whether they satisfy SSP 20 or not, we would like to know if they are safe, i.e if under all reasonable circumstances their implementation will never lead to an accident. The first of these properties is considered below. The second will be considered in Section 7.

The process Speed_boards_process, consists of two processes in parallel: Place_speed_boards and Do_they_satisfy_SSP20.

process Speed_boards_process[ssp20](state:state): exit(state) :=

(* report para. 174/5/6 *)

Place_speed_boards[ssp20](state)

|[ssp20]|

Do_they_satisfy_SSP20[ssp20](state) >>

accept state:state,spbs:speed_boards,info:stateInfo in

exit(Set_spbs(spbs,Set_state_info(Add(info,state_info(state)),state)))

endproc

The process Place_speed_boards provides a choice between various instances of the type Speed_boards, such as Actual_speedboards and HSE_speedboards. We will discuss the other speed board set-ups in Section 7.

process Place_speed_boards[ssp20](state:state):

exit(state, speed_boards,stateInfo) :=

(* report para. 174/5/6 *)

ssp20 !HSE_speedboards; exit(state,HSE_speedboards,any stateInfo)

(* 60 mile/h PSR put at 402 m on approach to WJ759 *)

[]

ssp20 !Actual_speedboards; exit(state,Actual_speedboards,any stateInfo)

(* speed boards positioned as in the accident *)

[]

ssp20 !Other_setup_1; exit(state,Other_setup_1,any stateInfo)

(* 60 mile/h speed board put at 1500m before WJ759 *)

[]

ssp20 !Other_setup_2; exit(state,Other_setup_2,any stateInfo)

(* speed boards put at 1500m,360m and 100m from WJ759 *)

[]

ssp20 !Other_setup_3; exit(state,Other_setup_3,any stateInfo)

(* speed boards put at 1500m,400m and 100m from WJ759 *)

endproc

The process Do_they_satisfy_SSP20 checks whether or not the instance of type Speed_boards input, spb_positions, satisfies SSP 20. The result then forms part of the modified state. SSP 20 states that the train must slow down to 60m/h by the time it is 440yards (402m) away from the signal, in this case WJ759. It states that this should occur when all preceding signals are on green. The worst case scenario would be when a train with the worst braking characteristics, is approaching the signal at the maximum allowed speed, on a slow line, of 90 mile/h. The information available from the report concerning braking is restricted to the train 2A17. Thus the figures in Do_they_satisfy_SSP20 are based on train 2A17 on the day of the accident, which in fact broke in relatively quick time and in good conditions. The process checks to see if the 60mile/h speed board is put at least (725 + 402) metres from signal WJ759. 725 metres is the estimated distance for train 2A17 to brake from 90 mile/h to 60 mile/h. If this is not the case then SSP 20 is broken. If however, an 80 mile/h or 90 mile/h speed board is placed after the 60 mile/h speed board and more than 440 yds (402m) before WJ759, the train will again accelerate and SSP 20 will be broken.

process Do_they_satisfy_SSP20[ssp20](state:state): exit(state, speed_boards, stateInfo) :=

(* report para. Appendix 3, 3 *)

ssp20 ?spb_positions:speed_boards; (

(* The 60m/h speed board is put atleast 725m + 402m from WJ759, *)

(* and there is no 80m/h or 90m/h speed board before WJ759. *)

[(spb_60(spb_positions) - p402) ge p725] →

[spb_80(spb_positions) eq null] →

[spb_90(spb_positions) eq null] →

exit(state,any speed\_boards,satisfies\_SSP20)

[]

(* The 60m/h speed board is put atleast 725m + 402m from WJ759, *)

(* and the 80m/h and 90m/h speed boards are put less than *)

(* 402m from WJ759. *)

[(spb_60(spb_positions) - p402) ge p725] →

[spb_80(spb_positions) lt p402] →

[spb_90(spb_positions) lt p402] →

exit(state,any speed_boards,satisfies_SSP20)

[]

(* The 60m/h speed board is not put at a sufficient distance *)

(* from WJ759, so it will not slow to 60m/h by 402m *)

(* before WJ759 *)

[(spb_60(spb_positions) - p402) lt p725] →

exit(state,any speed_boards,breaks_SSP20)

[]

(* The 60m/h speed board is put at a sufficient distance from *)

(* WJ759, but the 80m/h speed board is seen before getting to *)

(* the 402m mark. *)

[(spb_60(spb_positions) - p402) ge p725] →

[spb_80(spb_positions) ge p402] →

exit(state,any speed_boards,breaks_SSP20)

)

endproc

It can be seen that the speed board positions of Actual_speedboards do not satisfy SSP 20 and neither do HSE_speedboards that the HMRI put forward as the correct interpretation of the signalling standard. This is very worrying. Even more worrying, as will be shown, is that the accepted interpretation of SSP 20 could easily have led to an accident similar to that involving 2A17.

5. The Full LOTOS Model.

The full LOTOS model is shown in Figure 4. The top part of this model describes the organisational decisions, whilst the bottom part describes the consequences of these decisions.

It can be seen that there are many more processes involved in describing the organisational part than there are in, what we will call, the scenario part. Thus the `shape' of the model matches the `shape' of Table 1. The accident is top-heavy with organisational aspects.

5.1 The full organisational model.

The organisational model instantiates the records state and driver_status before these values are passed to the scenario model. The first process build_wj_system initialises the record state. This process includes the process Speed_boards_process, see Section 4. The processes inspection, from Section 3.1, and spad_mgmt can modify the record state. The process spad_mgmt can alter the viewing distance of the signal WJ759. The process Driver_mgmt, see Section 3.2, sets up the record driver_status. We believe that this model describes the salient organisational issues of the accident.

(* The organisational model. *)

build_wj_system[atp,overrun_at_s759,spb,views](initial_state) >>

accept state:state in

inspection[inspect](state) >>

accept state:state in

spad_mgmt[spads,committee](state) >>

accept state:state in

Driver_mgmt[lit_avail,category](state,initial_driver_status) >>

accept state:state,driver_status:driver_status in

(* The scenario model. *)

(train[signal,speed_board,brake]

(state,driver_status,mktrain(train_2A17,eighty,370,green))

|[signal,speed_board ]|

down_slow[signal,speed_board](state))

Figure 4: The full LOTOS model.

5.2 The scenario model.

Figure 4 includes the scenario model. The down_slow process represents the state of the line on which the train 2A17 was running on the day of the accident. The position of the speed boards and the driver's view of the signal WJ759 depend on what values the record state contains. This is determined by the organisational model. The process train is in parallel with down_slow. How the driver reacts to the speed boards and signals depends on whether he has been informed of their reason for being there and whether he is being monitored.

6. Analysis of the Model: How Organisational Factors could have Prevented the Accident.

There are various ways of analysing the full LOTOS model using the Caesar Aldebaran Development Package (CADP) toolbox (Fernandez et al.,1996). One particular tool XSIMULATOR allows the interactive step-by-step simulation of a specification. Investigators can use this tool to drive simulations of accident scenarios directly from LOTOS descriptions, such as that given in Figure 4. At each step of a simulation, a choice is selected from those on offer. For example, a choice is offered on the gate inspect between !accepted_without_inspection and !accepted_with_inspection. The user of the toolkit can select which path the simulation will follow by using a mouse to select the appropriate CADP option.

The simulator will then take the path chosen before providing the next choice. All the observable events in a scenario are recorded. This enables accident investigators to explore alternative scenarios during the lead up to an accident. Thus a full simulation of the actual accident is given by:

"OVERRUN_AT_S759 !148M"

"SPB !MKSPEED_BOARDS (1823, 1378, 1024)"

"VIEW_OF_WJ759 !368"

"INSPECT !ACCEPTED_WITHOUT_INSPECTION"

"SPADS !FOUR_IN_2YRS"

"COMMITTEE !NOT_HELD"

"LIT\_AVAIL !LIT_NO"

"CATEGORY !DRIVER1 !CATEGORY_A"

"CATEGORY !DRIVER1 !CATEGORY_C"

"MONITOR_OR_NOT !DRIVER1"

"SIGNAL !SN751 !DOUBLE_YELLOW !370"

"SPEED_BOARD !SIXTY"

"SPEED_BOARD !EIGHTY"

"SPEED_BOARD !NINETY"

"SIGNAL !SN755 !YELLOW !350"

"SIGNAL !SN759 !RED !368"

"exit !COLLISION_POSITION

!ADD (BREAKS_SSP20, {}) !ADD (DRIVER_NOT_MONITORED, {})"

To create the scenario above, the user of the toolkit initially had a choice between OVERRUN_AT_S759 !148M and OVERRUN_AT_S759 !183M. The user selected the reduced overlap given by OVERRUN_AT_S759 !148M. He/she would then have been given a choice of which speed board set-up they wished to select, etc.

There are many alternative scenarios to the one above, all involving organisational factors. Firstly the speed boards could have been positioned differently. This will be considered in more detail in Section 7. Even with the speed boards placed as above, there are many organisational ‘safeguards’ which might have prevented the accident. An inspection should have been carried out, this is also mentioned in 7.1, a signal sighting committee should have been held, train drivers should have been aware of the reasons why the speed boards were placed at Bushey station, the driver should have been monitored. All of these alternative accident scenarios can be explored using the LOTOS tools. For instance, below is a scenario where the driver was assessed with the updated procedure of Section 3.2.2.

"OVERRUN_AT_S759 !148M"

"SPB !MKSPEED_BOARDS (1823, 1378, 1024)"

"VIEW_OF_WJ759 !368"

"INSPECT !ACCEPTED_WITHOUT_INSPECTION"

"SPADS !FOUR_IN_2YRS"

"COMMITTEE !NOT_HELD"

"LIT_AVAIL !LIT_NO"

"CATEGORY !DRIVER1 !CATEGORY_A"

"CATEGORY !DRIVER1 !CATEGORY_B"

"MONITOR_OR_NOT !DRIVER1"

"SIGNAL !SN751 !DOUBLE_YELLOW !370"

"SPEED_BOARD !SIXTY"

"SPEED_BOARD !EIGHTY"

"SPEED_BOARD !NINETY"

"SIGNAL !SN755 !YELLOW !350"

"SIGNAL !SN759 !RED !368"

"exit !STOP_AT_SIGNAL

!ADD (BREAKS_SSP20, {}) !ADD (DRIVER_MONITORED, {})"

In this scenario, the updated procedure ensures that the driver was monitored. This leads to a safe outcome. Under other scenarios the accident might still have occurred.

7. An Investigation of SSP 20

In this section we use the CADP simulator to investigate the SSP 20 standard, and how it was interpreted in the accident report.

7.1 Using the Speed Board Positions Suggested by HMRI.

If an inspection had been carried out, the model would give:

"OVERRUN_AT_S759 !148M"

"SPB !MKSPEED_BOARDS (1823, 1378, 1024)"

"VIEW_OF_WJ759 !368"

"INSPECT !ACCEPTED_WITH_INSPECTION"

"SPADS !FOUR_IN_2YRS"

"COMMITTEE !NOT_HELD"

"LIT_AVAIL !LIT_NO"

"CATEGORY !DRIVER1 !CATEGORY_A"

"CATEGORY !DRIVER1 !CATEGORY_C"

"MONITOR_OR_NOT !DRIVER1"

"SIGNAL !SN751 !DOUBLE_YELLOW !370"

"SIGNAL !SN755 !YELLOW !350"

"SPEED_BOARD !SIXTY"

"SIGNAL !SN759 !RED !368"

"exit !COLLISION_POSITION

!ADD (BREAKS_SSP20, {}) !ADD (DRIVER_NOT_MONITORED, {})"

This produces the same result as the actual accident - a potential collision! Why is this? Remember that it is assumed that if an inspection is held then the speed boards would have been positioned as suggested by the HMRI, i.e. the 60 mile/h PSR would be placed 402 m (440 yds) before the signal WJ759. Now the train, at signal WJ167, the signal before signal WJ751, was travelling at 80 mile/h. So assuming it continues to travel at this speed until the 60 mile/h PSR it would take an estimated 878 m to come to rest. Inspection of Figure 3 shows that this would have led to a potential collision situation. Due to its importance we quote in full paragraph 176 of the report.

"Had the 60 mile/h PSR sign been placed 440 yards ( 402 m) short of Signal WJ759 instead of Signal WJ755, and given the assumption that the driver observed the speed limit in the same way as he observed the PSR south of Bushby station, the accident would, in all probability, have been avoided as the on board data recorder shows that the train 2A17 took 338 m (370 yds) to come to a stand from 96km/h (60 mile/h). Allowing for a maximum of three seconds for the brake cylinders to be apply full brake pressure to the wheels during which time the train would have travelled 80m (88 yds), the overall stopping distance would have been about 418 m (458 yds) plus an allowance for driver reaction time, thus making the train stopping point only a short distance past Signal WJ759 and well within the overlap safety margin" (Robertson,1998).

It is evident that the HMRI assume the train is already travelling at 60 mile/h when it is 440 yds from Signal WJ759. This does not address the need to slow trains to that speed before reaching this position. Recommendation 12 of the report is:

"Railtrack, following discussion with HMRI, to amend the wording of SSP 20 in order to eliminate any ambiguity and to ensure that all persons implement the Standard in the same manner." (Paras. 174 to 177) (Robertson,1998)

If Railtrack amend SSP 20, to say that the 60 mile/h speed board should be 440 yds from the signal with the reduced overlap then there would be a great risk of a similar accident occurring.

7.2 Does Satisfaction of SSP 20 Preserve Safety?

Through informal analysis of the SSP 20 it was assumed that its implementation implied that there could never be an accident if the driver observed the PSRs. However, after construction of the LOTOS specification it was realised that there was a scenario that satisfied SSP 20, and yet did not ensure safety. Consider the following three set-ups, remembering that each figure refers to the position of a speed board relative to the Signal WJ759:

Other_setup_1 = (1500,null,null) ;

Other_setup_2 = (1500,360,100) ;

Other_setup_3 = (1500,400,100)

In all three of these set-ups the 60 mile/h PSR is at 1500 m from the Signal WJ759, well in time for the train to slow to 60 mile/h by the time it is 402 m (440 yds) from the signal. As there are no other PSRs before the 402 m mark all three set-ups satisfy SSP 20. Now assume the viewing distance of Signal WJ759 is 368 m, as it is on the day of the accident. The set-ups Other_setup_1 and Other_setup_2 can easily be seen to be safe, as the positioning of the 80 mile/h and 90 mile/h speed boards is less than the viewing distance to the Signal WJ759. However, in the third set-up the driver would see the 80 mile/h PSR before he saw the red aspect of the Signal WJ759. Thus the train may well accelerate for the short but potentially important distance of 30 m, before the brakes are re-applied. Simulating this gives:

"OVERRUN_AT_S759 !148M"

"SPB !MKSPEED_BOARDS (1500, 400, 100)"

"VIEW_OF_WJ759 !368"

"INSPECT !ACCEPTED_WITHOUT_INSPECTION"

"SPADS !FOUR_IN_2YRS"

"COMMITTEE !NOT_HELD"

"LIT_AVAIL !LIT_NO"

"CATEGORY !DRIVER1 !CATEGORY_A"

"CATEGORY !DRIVER1 !CATEGORY_C"

"MONITOR_OR_NOT !DRIVER1"

"SIGNAL !SN751 !DOUBLE_YELLOW !370"

"SPEED_BOARD !SIXTY"

"SIGNAL !SN755 !YELLOW !350"

"SPEED_BOARD !EIGHTY"

"SIGNAL !SN759 !RED !368"

"exit !OVERRUN_SIGNAL

!ADD (SATISFIES_SSP20, {}) !ADD (DRIVER_NOT_MONITORED, {})"

So there would not be potential for a collision in these circumstances. However, the braking estimations are based on the train 2A17 in good conditions, not on the worst braking train in bad conditions. The smallest allowed viewing distance for any signal on a slow line is 308 m, allowing for another 60 m of acceleration. So it seems there may well be danger in applying the standard SSP 20.

8. Further Work

The introduction argued that accidents are complex. It can be difficult to identify the many latent and active failures that contribute towards major incidents. We have addressed this complexity by using Reason's model of organisational failure to guides the initial analysis of our Watford Junction case study. Such general heuristics must, however, be backed-up by a more detailed investigation of the particular context for each latent and active failure. Mathematical modelling and simulation techniques provide tools for performing such complex investigations. They can represent the ways in which an accident develops over time. They can also be used to reason about the relationship between a particular incident and the normative regulations that are intended to safeguard operation.

We chose to use Reason's model because its organisational focus provides a deliberate challenge to our application of formal methods. These mathematical techniques have traditionally been used to analyse a narrow class of system failures. There are, however, a number of alternative models that might have been used to guide our initial heuristic analysis. Similarly, the insights that we gained from the application of mathematical reasoning tools might also have been gained by less formal analysis. The critical point is that any alternative tools must enable analysts to reason about complex latent and active failures at a level of detail that goes beyond that currently supported by most existing accident models.

Further work intends to establish whether less formal techniques might offer benefits similar to those described for mathematical specifications. We are also concerned to explore the relationship between rhetoric and the logical structure of argument in an accident report. Formal analysis can identify inconsistencies and omissions in the conclusions of an investigation. These inconsistencies can be masked through rhetorical techniques, such as the repetition of circumstantial evidence. In order for our analysis to be successful, it is critical that the analyst is immune to these effects during the construction of the formal model. We are currently developing a range of tools and techniques to support this analysis.

9. Conclusions

In this paper we have focused on organisational factors of an accident, by way of two levels of analysis. At the higher level of analysis we used Reason's model of organisational accidents to provide an overview of the accident case study. At the lower level we used the LOTOS specification language to construct a formal model of the factors recognised by Reason's model. With the use of simulation to analyse the formal model this combination of techniques has been able to:

1. show how there were many organisational factors involved, factors which were not made immediately obvious in the report.

2. demonstrate how organisational factors can be represented in a unified formal model.

3. demonstrate how different scenarios of the formal model can be explored.

4. demonstrate the importance of investigating these organisational factors, by showing that they had an important bearing on the accident.

5. investigate in detail the standard SSP 20 and provide a clear understanding of its implications.

6. demonstrate that the writers of the report, HMRI, did not have a clear understanding of SSP 20 or what constituted a safe set of PSRs.

7. demonstrate that there were possible dangers in implementing the SSP 20.

Reason has provided a high-level model that guides an analysis of the contributory factors involved in an organisational accident. In contrast, formal methods provide tools that can be used during the more detailed investigation of latent and active failures. They can be used to build models that improve our understanding of these salient issues in combination. Finally, formal methods increase our confidence that the results of such analysis are correct.

We are currently seeking to integrate our method with Reason's theory of defences (Reason,1997). In the case of the Watford Junction accident the main defences, the PSRs and the signals, were not independent. The relation between the speed boards and signals is a complex one and should be considered carefully when designing signalling systems. Formal methods would seem the appropriate tool to analyse this complexity.

References

Abowd, G.D. (1990). Agents: Communicating interactive processes. In D.~Diaper et al.,(Eds), Human-Computer Interaction -INTERACT '90, Elsevier Science Publishers B. V.(North-Holland), pages 143--148.

Allison, R., and Lejuez, R. et al.(1996). Inquiry into the Fire on Heavy Goods Vehicle Shuttle 7539 on 18 November 1996. Technical report. Department of the Environment(U.K): Transport and the Regions.

Bolognesi, T. and Brinksma, E. (1989). Introduction to the ISO specification language {LOTOS}.

In P.~H.~J. van Eijk, et al.(Eds) The Formal Description Technique LOTOS, Elsevier Science

Publishers B. V.(North-Holland), pages 23--72.

Botting, R.M., and Johnson, C.W. (1998). A formal and structured approach to the use of task analysis in accident modelling. International Journal of Human Computer Studies 49(3):223--244.

Fernandez, J.C., Garavel, H., Kerbrat, A., Mounier, L., Mateescu, R. and Sigireanu, M. (1996).

CADP: A protocol validation and verification toolbox. In R.Alur and T.A. Henzinger,(Eds), Proceedings of the 8th Conference on Computer-Aided Verification.

Johnson, C.W. and Telford, A.~J.(1996). Using formal methods to analyse human error and system failure during accident investigations. Software Engineering Journal, 11(6):355--356.

Johnson, C.W.(1997). Proving properties of accidents. In C.M. Holloway and K.J. Hayhurst,(Eds), 4th NASA Langley Workshop on Formal Methods, Hampton, United States of America. NASA Langley Research Centre. pp. 21-34.

Leveson, N.G. (1986). Software safety: Why, what and how. ACM Computing Surveys 18(2):125--163.

Reason, J.(1997). Managing the Risks of Organizational Accidents. Ashgate Publishing Company,Aldershot,U.K.

Robertson, S.S.J.(1998). Railway accident at Watford. Technical report, HM Railway Inspectorate(U.K).