Software Safety Precepts
Clifton A. Ericson II
The Boeing Company
ABSTRACT
The development of computer controlled systems and software is a complex process, and software safety is itself a complex and controversial issue. The issue is compounded by the many software safety myths that exist, such as "software partitions make safe software" and "correct code is safe code". The following are some basic rules intended to help resolve some of the common software safety problems. These precepts are not intended as absolutes, but as thought-provoking ideas to further define the process and debunk some common software safety myths. A methodology is also proposed for the software safety process that is rigorous and utilizes existing tools and techniques in a cost-effective manner, but requires a change in focus.
Precept #1 -- Accidents don't just happen.
An accident is defined as an unintended, undesired event or mishap that results in death, injury, or significant property loss.
Accidents are the result of a unique combination of events and conditions. Any unique combination of events/conditions leading to an accident is known as a hazard. A system can have many sets of unique hazards, all of which could result in an accident. When the unique events/conditions of a hazard are fulfilled, the hazard occurs, which in turn generates the concomitant accident.
Hazards are the result of a unique system design. The components, functions, timing, materials and processes of a system are combined in such a way that unique hazards are built into the system. Designers often do not realize the hazards exist, or they do not fully understand the true risk involved. Often the hazards cannot be avoided; however, they can be controlled.
Accidents don't just happen randomly -- they are designed to happen! Accidents are essentially pre-planned events. They are pre-planned in the sense that once the system exists, the system's unique hazards exist also; nothing can be done except to live with the resulting risk or control it. The risk can be controlled by controlling the probability of the hazard, or the effect of the hazard.
Many systems have the potential for an accident, but are considered relatively safe because the risk is very small. Therefore, the accident potential for a system is a function of the inherent hazards, the unique design and the level of risk, plus environmental factors. Thus, a somewhat generalized concept for accidents might be:
Accidents = function( hazards + design + risk + environment ).
Safety is the prevention of accidents through the control of conditions and/or events that can lead to an accident (i.e., hazards). A safe state exists when the probability of an accident is acceptably small and/or the effect of the accident is acceptably minimal. The risk is then considered acceptable.
Accidents are controllable pre-planned events.
Precept #2 -- Accidents are the result of hazards.
An accident, or mishap, can be broken down into hazards, where a hazard is a unique set of conditions existing in a system design, which can lead to an accident/mishap. This unique set of conditions can include failures, errors, defects, timing problems, inadvertent release of an energy source, normal operation, environmental conditions, etc.
The probability of occurrence of the hazard, and the severity of the potential mishap, define the level of risk. A hazard is considered safe if it is controlled and the resulting risk is acceptable (i.e., low probability and/or low severity).
A system can have many hazards inherent in the design that can potentially result in an accident. Generally, potential hazards are associated with items that involve lethal energy sources. For example, a nuclear power plant, a missile system, and a laser system all have inherently hazardous energy sources. Other hazards are created by the need for safe operation of critical equipment, such as an aircraft flight control system or medical equipment. It is often impossible to eliminate all hazards due to the very nature of the system. For example, regardless of our technical improvements there is the ever-present hazard of an automobile steering failure causing an auto accident. Therefore, hazards must be controlled to an acceptable level of risk, where risk is the probability of an accident times the magnitude of the accident. We have accepted the level of risk associated with automobile steering system designs.
Making software safe is more than just removing errors or failure modes. Hazards can exist in normal operation without the occurrence of failures or errors. For example, a correctly built aircraft can enter a hazardous state when wind shear is encountered. In this situation errors or failure modes do not contribute to the hazard; the hazard is caused by the inherent need for controlled flight and the unplanned condition of unexpected wind shear at a critical time. Therefore, in making software safe one must look at all system aspects, including errors, failures, normal operation, unexpected conditions, abnormal conditions, interfaces, timing and hardware combinations.
In every system there are known hazards and unknown hazards. The goal is to find and document all of the hazards and their associated risk. Although it is more difficult, special emphasis must be placed on ferreting out unknown hazards. Once a hazard is known, it is usually a simple process to find some design means to eliminate or control the hazard (risk), and thereby control the occurrence of accidents. However, if an unknown hazard exists, the system risk level is inaccurate.
Hazards are the root cause of accidents, and the key to implementing safety. An accident can only occur when a hazard exists and is not adequately controlled. Therefore, the safety goal is to identify all potential hazards unique to a system design, determine their causing factors, and eliminate or control them. Safety is the elimination or control of potential hazards to an acceptable risk level.
Hazard control is the key to accident prevention.
Precept #3 -- Hazards are unique, not random.
Webster defines random as having no specific pattern or purpose; haphazard; each member having an equal chance of being chosen. Hazards do not display this characteristic: they have a specific pattern or design. Hazards are an unintended design that exists alongside the intended design. A sneak circuit is a classic example of a typical hazard: a hidden path of events built into the design unintentionally, just waiting to happen. A hazard will definitely occur once all of its ingredients are fulfilled; it is just a matter of time. A hazard is like a prophecy waiting to be fulfilled.
Reliability theory treats component failures as random events with an equal chance of occurring over a given period of time. Hazards, on the other hand, do not occur randomly; they occur when all of the necessary conditions are fulfilled. Some of the necessary conditions may be dependent on random component failures, others may not be.
Hazards are a result of system composition (i.e., hazardous elements, hazardous functions), how the system is tied together, critical timing requirements, and external events. Thus, the unique design of a system is what really creates the unique and inherent hazards of a system. Hazards are generally unintended functions built into the system.
Unique designs (systems) cause hazards.
Precept #4 -- Hazards will always exist, but risk is controllable.
Almost everything we do in life involves some inherent risk. Often we do things involving low risk, and therefore take the risk for granted. For example, the risk of an auto accident is far greater than the risk of an aircraft accident, yet we fear flying much more than driving. Technology seems to bring us more risk along with greater benefits.
Risk is a combination of the likelihood of an accident AND the severity of the potential consequences. Risk increases if either the likelihood or the magnitude of loss increases; conversely, risk decreases if either decreases. Different factors can affect these two components.
Some of the problems with risk management include: 1) imposing risks on the public of which it is not aware, 2) selecting risk levels for the benefit of a few at the exposure of many, and 3) conflicting goals in risk management.
Risk Management is an approach that has been successfully applied to other disciplines, such as construction and economic projects, to reduce different types of risk. Safety, accidents and hazards are very interrelated. They are also related to risk in the areas of exposure and probability. Therefore, it makes sense to rigorously apply the risk management approach to software safety to reduce the risk of software hazards.
Hazard control is achieved via risk management.
Precept #5 -- Risk Management applied to hazards is the most viable approach to software safety.
Risk is the exposure to the possibility of loss or damage. Risk involves events, uncertainty, potential gain/loss and choices. There are many different types of risk, such as project completion risk, safety risk, insurance risk, product failure risk, etc.
Risk exposure is the product of the probability of a loss occurring and the magnitude of the expected loss. Risk is reduced in two ways: by reducing the magnitude of the loss, or by reducing the probability of occurrence of the loss (or a combination of both).
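As a minimal illustration of this definition (the probabilities and loss figures below are hypothetical, chosen only to show the arithmetic), the risk exposure calculation and the two reduction strategies can be sketched as follows:

    # Minimal sketch: risk exposure = probability of loss x magnitude of loss.
    # All numbers are hypothetical, for illustration only.
    def risk_exposure(probability, loss_magnitude):
        return probability * loss_magnitude

    baseline    = risk_exposure(1e-4, 1_000_000)   # expected loss of 100
    less_likely = risk_exposure(1e-6, 1_000_000)   # reduce the probability
    less_severe = risk_exposure(1e-4, 10_000)      # reduce the magnitude
    print(baseline, less_likely, less_severe)      # 100.0 1.0 1.0

Either reduction (or a combination of both) lowers the exposure; which is practical depends on the particular hazard.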
Risk Management is the systematic process for the identification, evaluation and control of risks. Much literature exists today on Risk Management theory [references 2 & 3]. The generic Risk Management model, modified for software safety, is shown in Figure 1. The purpose of Risk Management is to prevent problems before they occur. The overall Risk Management process is composed of two major elements: Risk Assessment and Risk Control. Risk Assessment involves risk identification, risk evaluation and risk prioritization. Risk Control involves risk control planning, risk resolution and risk monitoring.
A Hazard Risk Management (HRM) program is the application of Risk Management to hazards (i.e., to control hazards and prevent mishaps before they occur). An HRM program will utilize all the tools necessary to identify and control system risks (hazards). These are the tools discussed in this paper: hazard analysis, reliability analysis, defect analysis, proof of correctness, design methods and testing. In the Risk Assessment step, hazards are identified, evaluated in terms of probability, loss, exposure and criticality, and prioritized in terms of risk levels. In the Risk Control step, the prioritized hazards are resolved using the appropriate tools, and the results are then monitored for effectiveness.
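One hedged sketch of how a Risk Assessment / Risk Control cycle might be recorded is shown below; the record fields, the numeric ranking scheme and the sample hazards are illustrative assumptions, not part of any defined HRM standard:

    # Illustrative hazard record for an HRM program (field names and the
    # severity/probability ranking scheme are assumptions, not a standard).
    from dataclasses import dataclass, field

    @dataclass
    class Hazard:
        identifier: str
        description: str
        severity: int                 # 1 = catastrophic ... 4 = negligible
        probability: int              # 1 = frequent ... 5 = improbable
        controls: list = field(default_factory=list)
        status: str = "open"          # monitored until closed

        def risk_index(self) -> int:
            # Lower index = higher priority for resolution.
            return self.severity * self.probability

    # Risk Assessment: identify, evaluate, prioritize.
    register = [
        Hazard("H-1", "Inadvertent actuator command during maintenance", 1, 4),
        Hazard("H-2", "Stale sensor data used by the control loop", 2, 3),
    ]
    # Risk Control: plan a resolution for each hazard, then monitor it.
    for hazard in sorted(register, key=Hazard.risk_index):
        hazard.controls.append("planned design feature, device, or procedure")
        print(hazard.identifier, hazard.risk_index(), hazard.status)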
Hazard Risk Management integrates the tools of software engineering and safety engineering to achieve safe software. HRM provides for hazard identification, risk quantification and assessment, and prioritization of risk reduction action. It also provides for feedback to verify the actual level of safety achieved. No single methodology is sufficient by itself. Each risk must be identified, evaluated and resolved individually using the particular tools necessary. The right tool and the right resolution should be applied in the appropriate place. The HRM process is the glue that binds the software discipline into safe software. It provides for the application of all tools necessary to identify and resolve hazards. Instead of a silver bullet approach, it provides an integrated arsenal.
Some of the major benefits of a HRM program include the following:
· focuses on prevention rather than detection.
· promotes an understanding of problems.
· leverages resources.
· can employ project indicators.
· applies the right tool to the right problem.
· focuses on hazards.
· adds value to the process.
HRM is a new paradigm in software safety. It is a proactive process for identifying and resolving problems before they occur, rather than treating problem symptoms. It provides considerable leverage by utilizing the right tools in the right places (i.e., using proof of correctness only when it is truly applicable). It provides a method for focusing on the real concern -- hazards.
Hazard Risk Management is the key to software safety.
Precept #6 -- Hazard Analysis is the key to controlling hazards.
Accident prevention and safety risk are based on hazards. All of the hazards must be included in the risk management model, or the result will not be accurate. Just because a potential hazard is unknown does not mean that it does not exist in the system design. Therefore, the first step in the risk management process is hazard analysis, which converts unknown hazards into known hazards; there is no other way to achieve this. Thus, software safety and risk management success hinge upon hazard analysis.
Once a hazard is identified, it can be resolved based on its level of risk. The major problem in many of today's systems is that not all hazards are identified before they occur. This means that more focus and emphasis needs to be placed on hazards and hazard analysis. Early hazard identification and control is vital.
Hazard analysis is the problem identification phase in the hazard risk management model.
Precept #7 -- Software design safety features are used to prevent/control hazards.
Software design safety features are used to avert hazards by eliminating or controlling them. Certain specific features are often used for specific types of safety problems. However, the use of these design features can also cause problems, and they can be stubborn in achieving the desired results. It is like giving penicillin to a patient to prevent any possible disease the patient might be exposed to. Software design safety features should be used thoughtfully and carefully, not just applied as a catchall. They can effectively eliminate or control hazards, but they are often difficult to analyze and verify.
Some of the major software design safety tools include the following:
A. Partitions
Partitioning is a design feature for avoiding or controlling hazards by isolating safety critical functions. The idea is to isolate safety critical functions so that failures outside the partition boundary cannot impact the safety critical function, while the software inside the isolation boundary is designed and proven to be highly reliable. Partitioning does not simplify software design; it tends to make it more complex. Partitions must be proven to isolate safety critical functions, and blind faith in partitions is a trap in itself. Functions that are physically isolated (by hardware) can generally be proven to be truly isolated with relative ease. For functions that are partitioned by software it may be difficult, or even impossible, to truly prove that complete isolation exists.
Software partitions must be fully proven before the software is assumed safe.
B. Fault Tolerance
Fault tolerance relies on fault detection monitoring, a technique used to monitor a function, or set of functions, to determine whether they are erroneous or out of tolerance. When an erroneous condition is detected, some corrective action is taken. Different levels of response or corrective action include fault detection and warning, fault detection and avoidance, fault detection and containment, or any combination of these. Fault detection monitoring is a useful tool, but its use can also cause additional problems, including added complexity, tolerance boundary selection, voting schemes, and monitors that fail to provide dependable warnings.
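A minimal sketch of the idea follows; the tolerance boundary and the graded responses are hypothetical, and a real monitor must itself be analyzed for the problems noted above:

    # Sketch of fault detection monitoring with graded responses
    # (tolerance values and responses are hypothetical).
    SAFE_LOW, SAFE_HIGH = 10.0, 90.0      # tolerance boundary

    def monitor(reading: float) -> str:
        if SAFE_LOW <= reading <= SAFE_HIGH:
            return "continue"                          # in tolerance
        if reading < 0.5 * SAFE_LOW or reading > 1.5 * SAFE_HIGH:
            return "contain: shut down the channel"    # detection and containment
        return "warn: switch to the backup source"     # detection and warning/avoidance

    for value in (42.0, 95.0, 200.0):
        print(value, "->", monitor(value))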
C. Dissimilarity
This design feature implements redundant software modules that are coded by separate design teams and in different languages. The philosophy is that any design errors in one module will not be present in an independently developed module. However, some research results have shown that even independent design teams tend to make similar mistakes given the same set of starting requirements. In addition, this scheme also introduces added complexity and voting problems.
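A hedged sketch of the voting step that dissimilar modules require is shown below; the example functions and the agreement tolerance are hypothetical, and the voter itself adds exactly the kind of complexity noted above:

    # Sketch of a majority voter over dissimilar redundant modules
    # (module implementations and tolerance are hypothetical).
    def version_a(x): return x * 9.80665   # team A's implementation
    def version_b(x): return x * 9.80665   # team B, different language in practice
    def version_c(x): return x * 9.81      # slightly divergent implementation

    def vote(x, tolerance=1e-3):
        results = [version_a(x), version_b(x), version_c(x)]
        for candidate in results:
            agreeing = [r for r in results if abs(r - candidate) <= tolerance]
            if len(agreeing) >= 2:          # majority agreement
                return sum(agreeing) / len(agreeing)
        raise RuntimeError("no majority -- fall back to a safe state, not a guess")

    print(vote(10.0))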
D. Fail Safe Design
This design safety philosophy implements a design that ensures a module, or system, remains safe should any errors or failures occur. Like partitions, this design feature requires significant analysis to verify that a module will always be safe when failures occur, and that all possible failure combinations have been anticipated.
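As a hedged illustration (the controller, its inputs, and the chosen safe state are hypothetical), a fail safe wrapper might look like this:

    # Sketch of a fail safe design: any unanticipated error forces a
    # predefined safe state (names and the safe state are hypothetical).
    SAFE_STATE = {"valve": "closed", "heater": "off"}

    def fail_safe(controller):
        def wrapped(sensor_input):
            try:
                return controller(sensor_input)
            except Exception:
                return SAFE_STATE          # known-safe output on any failure
        return wrapped

    @fail_safe
    def control_step(sensor_input):
        return {"valve": "open" if sensor_input["pressure"] < 50 else "closed",
                "heater": "on"}

    print(control_step({"pressure": 30}))   # normal operation
    print(control_step({}))                 # missing data -> commanded to the safe state

The sketch shows only the fallback mechanism; the analysis burden described above, proving that the chosen safe state really is safe for every failure combination, remains.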
E. Safety Coding Standards
This design safety philosophy implements only a safe subset of a language, rather than the entire language. The purpose is to avoid dangerous, problem-causing constructs and non-deterministic constructs. For example, Ada tasking does not yield the same timing results every time, given the same set of inputs (per the Ada reference manual). By using a proven safe language subset, with strict coding guidelines and standards, potential errors and safety problems can be avoided in safety critical modules.
Coding standards for safety critical systems are a must.
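A minimal sketch of how such a standard might be enforced mechanically is shown below; the banned constructs listed are illustrative examples, not a published safety subset:

    # Sketch of a coding-standard check for safety critical modules
    # (the banned-construct list is illustrative, not a real standard).
    import re

    BANNED = {
        r"\bgoto\b":      "unstructured control flow",
        r"\bmalloc\s*\(": "dynamic memory allocation after initialization",
        r"\bsetjmp\b":    "non-local jumps",
    }

    def check_source(lines):
        findings = []
        for number, line in enumerate(lines, start=1):
            for pattern, reason in BANNED.items():
                if re.search(pattern, line):
                    findings.append((number, reason, line.strip()))
        return findings

    sample = ["x = compute();", "goto retry;", "buf = malloc(64);"]
    for finding in check_source(sample):
        print(finding)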
This is not a complete list of hazard control design measures, just a few of the more controversial ones. There are other options available also, such as hardware controls, safety devices, procedures, etc. Refer to reference [1] for more detail.
Design features are the fix in the hazard risk model, but care must be taken.
Precept #8 -- Software is only hazardous in a system.
Software is a computer program consisting of instructions, algorithms, and data. It is written in a human readable computer language, and then translated into digital binary bits (1s and 0s) for the computer's interpretation. It is merely a set of bit states that can do nothing until executed by a computer.
Software by itself is rather benign and intrinsically safe. Software contains no energy sources (electrical, mechanical, nuclear, etc.) that can cause damage, and it is precisely because it contains no energy sources that it is inherently safe. If not combined with hazardous elements or hazardous functions, it has no potential for causing an accident.
Software can only be hazardous when used in a computer system context and given the responsibility to control and/or monitor hardware, because it is the hardware that contains potentially hazardous sources. Although it is the hardware that is directly involved in the hazard, the software may have placed the hardware in a hazardous state. For this reason, software safety is "systems" oriented; that is, it involves hardware elements as well as software elements. And, it is global in nature, rather than local.
A software hazard exists when the software operation is directly responsible for an accident (i.e., hazard in the software controlling the hardware). Hazards are the result of building complicated integrated systems, combining both software and hardware. Sometimes hazards are quite obvious, other times they are very subtle and difficult to identify. Hazards are ubiquitous little creatures like mosquitoes or no-see-ums.
The following are the major basic causes of software hazards:
· requirements error or deficiency
· design error
· coding error
· hardware failure inducing software error
· unanticipated timing problems
· sneak software paths
· unintended functions
To be hazardous, software must control hazardous elements/functions.
Precept #9 -- Hazard analysis (prevention) requires a systems approach.
All hazards are really system hazards, which can be sub-divided into hardware caused hazards and software caused hazards. Software safety is the process of identifying potential system hazards, tracing them back to the causing software factor(s), and implementing software design measures to eliminate or control the risk of the hazard.
As discussed previously, hazards arise from unique system designs. Often hazards can be traced to contributing factors from several different places in a system. Hazards that arise from inter-related system components can become very complicated, and it takes a systems viewpoint to identify these types of hazards. Refer to reference 1 for more detail on this.
The three S's of safety -- System, System, System!
Precept #10 -- Classifying software as safety critical is not enough.
Classifying software, and hardware, as either safety critical (SC) or not safety critical (NSC) is essential for two reasons. First, it helps narrow the scope of work, and second, it provides an identification of areas where safety must be vigilantly applied and maintained. Hazards, and potential accidents, will reside in the SC software, and not the NSC software. SC software contains the potential hazards and the accident risk. This makes it cost effective to concentrate safety effort primarily on SC software. Detailed software safety analysis and software design solutions can be applied directly to the potential safety problem areas (i.e., SC software).
Safety critical code is the code containing the high risk safety hazards, or the safety critical functions that must be performed. SC software should receive the major safety emphasis in the design, test and verification process. It should be remembered that even in a SC system, not all of the code is actually SC. NSC code does not require as much analysis and verification effort. By placing emphasis on safety critical code the safety job can be contained and reduced.
However, software code needs to be viewed at another level, that of hazard rated software. Hazard rated software can be measured, tracked and managed. Hazards can be quantified in terms of quantity, probability and severity. This means that software modules need to be given a hazard rating, so that when a module is developed or modified, it must meet certain criteria based upon its rating.
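A hedged sketch of what such module-level criteria might look like follows; the rating levels and the criteria tied to them are illustrative assumptions:

    # Sketch of hazard rated modules: a change must meet criteria tied to the
    # module's rating (ratings and criteria here are illustrative assumptions).
    CRITERIA = {
        "high":   {"independent_review": True,  "min_test_coverage": 0.95},
        "medium": {"independent_review": True,  "min_test_coverage": 0.85},
        "low":    {"independent_review": False, "min_test_coverage": 0.70},
    }

    def change_approved(rating, independently_reviewed, test_coverage):
        required = CRITERIA[rating]
        review_ok = independently_reviewed or not required["independent_review"]
        return review_ok and test_coverage >= required["min_test_coverage"]

    print(change_approved("high", independently_reviewed=True, test_coverage=0.97))   # True
    print(change_approved("high", independently_reviewed=True, test_coverage=0.80))   # False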
Hazard rated software provides a measurable dimension for risk management.
Precept #11 -- Software does not have bugs!
The term software bug has been used in the software industry to denote a software problem. The term implies that a software error has innocently cropped up in the code. It is used very loosely and tends to grant unwarranted forgiveness for poor design, implying that no one was at fault and the errors just happened. Blame is not the goal here; professionalism is. It must be recognized that software problems, or bugs, are the result of design errors somewhere in the overall process. When the software process deals only with errors and hazards it will become more technical and professional. Using the term software bug breeds design contempt.
A software bug could more aptly be called a no-see-um mosquito. It's there, it's bothersome and it stings -- it just has not been seen or identified yet. This is where design rigor and hazard analysis are critical: to identify all potential hazards in the system before they are unexpectedly found through usage.
Software may have design errors -- but not bugs!
Precept #12 -- More lines of code do not necessarily mean less safe software.
Some software developers feel that the fewer lines of code involved the safer the module. These developers feel that code written in C is safer than code written in Ada, because it will generate fewer lines of code. Actually, safe code is based on preventing hazards, not on anything else. Thus, software size is really irrelevant. Whichever language will generate fewer hazards is the real issue. In this case, it can be said for Ada that it has many more built-in safety features than does C.
Generally speaking, the real culprits in software safety problems are complexity and criticality, not software size. Design complexity and criticality must be evaluated in terms of hazards in order to determine the relative safety level. The problem is that as code size increases, designers tend to lose control and thereby reduce their focus on hazards and safety.
Safe code is based upon hazard free design, not size.
Precept #13 -- Computer language is a safety critical issue.
Some software developers think that the best software language is one that provides much flexibility, capability and freedom. Again, it must be understood that safety can only be evaluated in terms of hazards, not language size or flexibility. In reality, the best language for software safety is one that is limited and constrained specifically for purposes of safety, such as a special safety subset of Ada. Languages that allow unsafe, ambiguous designs are undesirable. And, languages with non-deterministic functions that do not produce consistently repeatable and predictable results every time the same module is run are unacceptable.
Limiting a designer's scope is safer than providing maximum flexibility.
Precept #14 -- Defect free software is not necessarily safe software.
A software defect is defined as a software product anomaly, which equates to the software not performing to specification. A defect can be in the specifications, requirements, code, test procedures, etc. A defect can be an error or a failure. A failure is the inability of the system or a system component to perform a required function within specified limits.
There are three possible safety outcomes from a defect:
· no effect on system operation (safe)
· minor effect on system operation (nuisance, safe)
· major effect on system operation (safe or unsafe)
Are all defects hazardous? If a defect has a major effect on system operation, then it is more than likely also a safety issue. If analysis shows this to be true, the defect can be considered a contributing factor in a hazard. If the defect has no effect or only a minor system effect, it is not considered a hazard. Each major defect must be individually evaluated to determine whether it contributes to an unsafe condition and is therefore a hazardous factor. Thus, software defects may or may not be hazardous.
The natural software quality goal is to deliver a software product that is defect free. A defect free product is generally assumed to be safe and reliable. However, this is not necessarily the case. Even when all software defects or errors have been removed, it is still possible for hazards to exist. How is this possible? Unforeseen hazards can arise from unique combinations of tasks, timing, hazardous elements or hazardous functions. These hazards can exist in the system design even if the system meets specifications and performs without any defects being encountered. Such circumstances are unforeseen by the design specifications and are therefore not necessarily considered product defects.
Software does not have to be defect free to be safe. There may be defects in the software that are minor, or cause system failure and shutdown, but if no hazards are created, the software is considered safe. Not all defects are hazardous, and not all hazards are the result of defects.
If a hazardous defect exists but has not been encountered, that does not mean the hazard does not exist. It merely means the hazard (and defect) have not yet met the conditions that cause their occurrence. It is still a potential hazard, and a potential defect; the probability of occurrence in this case may simply be small.
In order for a system to be safe it must be "hazard free", not just defect free.
Precept #15 -- Reliable software is not necessarily safe software.
Reliability is aimed at making a system free of failure; to keep the system operating for a specified length of time. Safety is concerned with safe operation. A system can operate reliably, and still be either safe or unsafe. An unreliable system may, or may not, be safe. The bottom line is, reliability does not equate directly to safety; reliability is not sufficient for safety.
Reliability is the probability that an item will perform a function for a specified length of time. Software errors, defects and faults result in software failure, which in turn results in unreliability. A software failure is the inability of the system or a system component to perform a required function within specified limits.
Software can be built to perform reliably, yet still have inherent hazards. This concept has been well understood and developed in the hardware arena. There are many design situations where reliability and safety are at odds with each other. For example, two sets of code could be implemented redundantly for reliability, yet they could both have the same hazards, even if they are developed independently. In addition, the added complexity may lead to potential safety problems.
One of the problems associated with software is that there is currently no accurate method to calculate software failure rates. This makes it difficult to quantitatively assess software reliability and safety problems.
Since software can fail and not always result in a hazard, not all failures are of safety concern. However, software failure modes should be evaluated for safety impact, and be designed to fail-safe when necessary. Software must be reliable, fail-safe and hazard free.
In order for a system to be safe it must be "hazard free", not just reliable.
Precept #16 -- Proof of correctness will not guarantee safe software.
Correctness is adherence to a given standard or specification. Proof of correctness is formal mathematical verification that the requirements are consistent and correctly implemented in the code. However, hazards can exist regardless of whether the requirements are complete, consistent and correct. Hazards are not necessarily in the requirements, and hazards may exist regardless of how good the requirements are. This is a consequence of system development complexity. Therefore, software safety involves eliminating and/or controlling hazards, as well as making correct software. This involves identifying hazards, which in turn involves the use of many different tools, including analysis, test, formal methods, etc.
One of the major problems is that software can be correct when compared to specifications, and specifications can be correct to the best of everyone's ability, yet hazards can still exist. The software may be correct, but is it the right software (i.e., the hazard free design)? For example, what if a compiler error occurs and generates erroneous code?
Not too surprisingly, this concept also applies to system requirements. Requirements can be correct and complete, but hazards can still be present. Correct requirements do not preclude hazards. There might be an inclination to say that if the requirements were correct and complete, no hazards could possibly exist. However, requirements cannot take into consideration all possible contingencies, failure combinations, and unforeseen events of nature.
Take, for example, the Patriot missile which, due to a timing problem, failed to intercept and destroy an incoming SCUD missile; the SCUD struck a barracks, killing 28 people. This mishap was due to timing errors built up over a long period of operation; the Patriot was originally not intended to be used for more than 24 hours without having the timing reset [Ref. 4]. Timing accuracy was the problem in this case. Should this incident be considered a failure, a defect, or an inherent design hazard? A failure analysis or a proof of correctness would not have detected this hazard, since the design met specifications. However, an HRM hazard analysis could possibly have predicted the designed-in hazard, which resulted from using the system beyond its original intent.
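The arithmetic behind the mishap, as described in public accounts of the incident, shows how a specification-compliant design can still carry a hazard; the figures below are approximate and are not taken from this paper:

    # Back-of-the-envelope sketch of the Patriot clock drift, based on public
    # accounts of the incident (all figures are approximate).
    TRUNCATION_ERROR_PER_TICK = 9.5e-8   # error representing 0.1 s in a 24-bit register
    TICKS_PER_SECOND = 10                # the clock counted tenths of a second
    SCUD_VELOCITY = 1676.0               # approximate closing speed, m/s

    def tracking_error(hours_of_continuous_operation):
        seconds = hours_of_continuous_operation * 3600
        drift = seconds * TICKS_PER_SECOND * TRUNCATION_ERROR_PER_TICK
        return drift, drift * SCUD_VELOCITY

    for hours in (8, 24, 100):
        drift, miss = tracking_error(hours)
        print(f"{hours:>3} h: clock drift {drift:.3f} s, tracking offset {miss:.0f} m")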
In order for a system to be safe it must be "hazard free", not just correct.
Precept #17 -- Terms like "dependable software" and "trustworthy software" confuse the issue.
The need for dependable software stems from safety critical systems that need to operate dependably because their failure has dire consequences (i.e., loss of life). For this reason, some in the industry feel that safety critical software must always perform as desired and should never fail. Safety critical systems need to be dependable, but more importantly, they need to be safe. Requiring that something never fail is not being in touch with reality. Dependability is really a disguised reliability objective.
Software trustworthiness has been defined as an acceptably low probability of a catastrophic flaw in the software after testing and review. Software that is trustworthy (i.e., no flaws) is still susceptible to potential hazards in the overall system. Trustworthiness has more to do with defects and reliability than it does with safety. Safety deals primarily with hazards, hazards, hazards. Defects, flaws and reliability are only a part of the safety equation. Safety is the effort involved in identifying and eliminating catastrophic problems in the software and the overall system.
The objective of system safety is that a system operate safely, and if it should fail, that it fail safely. This translates into acceptable risk. Risk can be controlled by designing to limit the probability of failure, or it can be controlled by changing the design such that the consequence of failure is limited. For example, say that the requirement is that the probability of an auto gas tank failing and resulting in an injurious explosion be no greater than 1 x 10^-10. If analysis shows that this probability cannot be met, then the design must be changed to contain the explosion such that no injury can occur.
Dependability and trustworthiness are not necessarily congruent with safety.
Precept #18 -- Testing will not guarantee safe software.
It is practically impossible to test all code under all possible conditions, because the number of combinations is too great. Most testing is done against design requirements. Since hazards are generally not in a requirement, it becomes difficult to test strictly for hazards. Hazards usually occur during operation or testing when unplanned (or unforeseen) events and conditions are encountered. This is not to say testing will not uncover hazards; it will uncover some, but not all, and perhaps not the critical ones.
Testing has major objectives that are not necessarily in line with safety, such as showing that the system works and that all requirements are met. Testing does not show or verify the software's level of safety. Only analysis can show the level of safety. Special safety tests can be used to verify that safety systems do indeed work when certain events or failure modes are induced.
Test cases must be designed and planned. Generally the goal of test designers is to show that the software operates as specified, and that it does not contain unused branches, endless loops, incorrect logic, etc. Test designers seldom test for unintended functions or hazards. These are normally identified only through hazard analysis.
Testing may show the presence, or absence, of some hazards -- but not all hazards.
CONCLUSION
The final message should now be clear and simple. Accidents are the result of hazards. Hazards are the result of unique system designs. Hazards are unique and have unique properties and dimensions. Therefore, the key to software safety is hazard control, through the Hazard Risk Management process.
Making software safe is not just a matter of removing errors, defects and failures. Hazards must be removed or controlled. Hazards can exist in normal operation without the occurrence of failures or errors. For example, an aircraft can enter a hazardous state when wind shear is encountered, without any failure of the aircraft.
Perhaps disciplines and definitions limit our thinking. Disciplines tend to have a silver bullet mentality, trying to make specific tools solve the entire safety problem. Definitions borrowed from different disciplines are used to define software safety. Software safety cannot be limited by a single tool, or by overlapping and misleading definitions. To achieve software safety, project focus should be placed on hazards and hazard control. Focusing on failures, defects, bugs, trustworthiness, etc. misdirects attention and responsibility in regard to safety.
By now it should have become apparent that one of the major themes of software safety is that it is not enough for software to be reliable, correct and defect free -- it must be hazard free. And, there is only one way to achieve hazard free software: by implementing a hazard based HRM process. The HRM process is the glue that binds analyses, tools, techniques and risk. This also means that more focus and accountability need to be placed on hazards, similar to reliability engineering and its emphasis on failure modes.
In reference [5], Henry Petroski is quoted "Success depends upon a constant awareness of all possible failure modes, and whenever a designer is either ignorant of, uninterested in, or disinclined to think in terms of failure he can inadvertently invite it". Replacing failure with hazard describes the software safety situation -- success depends upon a constant awareness of all possible hazards, and whenever a designer is either ignorant of, uninterested in, or disinclined to think in terms of hazards he can inadvertently invite them.
REFERENCES
1. Ericson II, C. A., "Software And Systems Safety", Fifth International System Safety Conference, 1981.
2. Boehm, B. W., "Tutorial: Software Risk Management", IEEE Computer Society Press, 1991.
3. Charette, R. N., "Software Engineering Risk Analysis And Management", McGraw-Hill, 1989.
4. Littlewood, B. and Strigini, L., "The Risks Of Software", Scientific American, Nov. 1992.
5. Peterson, I., "Fatal Defect: Chasing Killer Computer Bugs", Times Books, 1995.
BIOGRAPHY
Clifton A. Ericson II
The Boeing Company
18247 150th Ave SE
Renton, WA 98058 USA
Currently Mr. Ericson works in system safety on the 767 AWACS program. He has 30 years of experience in system safety and software design with the Boeing Company. He first began research in software safety in 1976 with a software hazard analysis of the B-1A avionics software. He has taught software safety at the University of Washington. Mr. Ericson holds a BSEE from the University of Washington and an MBA in quantitative methods from Seattle University. He has also developed an interactive fault tree graphics software program for the IBM PC.