NSD's Y2K ASSESSMENT PRINCIPLES
Introduction
The Year 2000 computer problem, or Millennium Bug, is a well-rehearsed topic and will not be dealt with here (see ref 1 for discussion of the topic). The purpose of this paper is to develop the assessment principles that will be used by assessors in their reviews of the licensees' submissions relating to this problem. These principles do not, of course, replace HSE's Safety Assessment Principles (SAPs) for Nuclear Plants (ref 2); the appropriate sections of the SAPs still apply.
Critical Dates
The critical dates are generally regarded as:
Additionally, 21-22 August 1999 might cause a problem to systems which depend upon the Global Positioning System (GPS); for example, the transporting of nuclear fuel where knowledge of its location is important.
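The GPS date arises because the GPS signal carries its week number in a 10-bit field, which wraps to zero after 1024 weeks. The following is a minimal sketch of that arithmetic. The GPS epoch of 6 January 1980 is an assumption not stated in this paper, and naive_receiver_date models a hypothetical receiver that ignores the wrap, not any particular product.

```python
from datetime import date, timedelta

# Assumption (not stated in this paper): GPS week 0 began on 6 January 1980
# and the broadcast week number is a 10-bit counter (0..1023).
GPS_EPOCH = date(1980, 1, 6)
WEEKS_PER_CYCLE = 2 ** 10


def first_rollover() -> date:
    """Date on which the 10-bit week counter first wraps back to zero."""
    return GPS_EPOCH + timedelta(weeks=WEEKS_PER_CYCLE)


def naive_receiver_date(true_date: date) -> date:
    """Date computed by a hypothetical receiver that ignores the wrap."""
    days_since_epoch = (true_date - GPS_EPOCH).days
    true_week, day_in_week = divmod(days_since_epoch, 7)
    broadcast_week = true_week % WEEKS_PER_CYCLE  # what the 10-bit field carries
    return GPS_EPOCH + timedelta(weeks=broadcast_week, days=day_in_week)
```

Under these assumptions a receiver of this kind would report the correct date up to 21 August 1999 and a date in January 1980 from 22 August 1999 onwards, which is why systems that rely on GPS-derived dates need to be included in the review.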
PRINCIPLES
Other Dates
mP1 Since there may be critical dates additional to those above which are associated with licensees' particular systems, evidence of a review of critical dates should be available (see ref 3 Appendix B for some additional problematic dates).
Nature of the Issue
mP2 The millennium bug does not pose difficult computer problems. The main issue from NSD's point of view is the identification of all systems (which impact safety in any way) that might be affected, followed by their investigation to establish those actually affected. These problematic systems then have to be fixed or replaced, or work-arounds developed for them, in time for the critical dates.
mP3 The issue should be the safety of the system(s) and not necessarily its (or their) millennium compliance. In fact, it should be recognised that making a sub-system millennium compliant may cause the overall system to behave unsafely because no other sub-system recognises the new date format.
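The interface hazard described in mP3 can be illustrated with a minimal sketch. Both functions are hypothetical: one stands for a sub-system that has been modified to emit four-digit years, the other for an unmodified sub-system that still reads a fixed six-character DDMMYY field.

```python
from typing import Tuple


def upgraded_format(day: int, month: int, year: int) -> str:
    """Hypothetical 'compliant' sub-system: now emits DDMMYYYY."""
    return f"{day:02d}{month:02d}{year:04d}"


def legacy_parse_ddmmyy(stamp: str) -> Tuple[int, int, int]:
    """Hypothetical unmodified sub-system: reads a fixed 6-character DDMMYY field."""
    field = stamp[:6]  # fixed-width read; extra characters are silently dropped
    return int(field[0:2]), int(field[2:4]), int(field[4:6])


# The receiver takes the century digits "20" as the two-digit year.
received = legacy_parse_ddmmyy(upgraded_format(1, 1, 2000))
```

In this sketch the receiving sub-system interprets 1 January 2000 as day 1, month 1, year 20, and would misread pre-2000 dates in the same way. The modification is correct in isolation but the combined system is not, which is the reason for assessing safety at the level of the overall system.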
Project Management and Scope
mP4 A Y2K project is one of resource and record management; these should be seen to be properly and systematically managed. This, of course, means providing an auditable trail enabling all systems to be unambiguously traced to their eventual outturns. There should be a documented strategy, project plan and Quality Assurance (QA) plan. All activities should be covered by documented procedures and guidance to ensure completeness and consistency. The emphasis of all guidance should be that of positive demonstration with safety as the central focus.
mP5 Licensees should demonstrate that their projects have addressed not only all on-line plant systems but also all relevant off-line systems, including those at their headquarters, contractors' premises and elsewhere. For example, configuration management and software development systems may need to be considered since incorrect versions of the software might be incorporated into a new system build following the millennium change.
mP6 The project scope should include, in addition, the safety of all non-nuclear equipment on a nuclear licensed site, i.e. equipment that does not pose a radiological hazard but which might otherwise pose a risk to health and safety due to a computer-related date problem. For example, machine tools and other workshop facilities, and plant producing or handling hazardous chemicals, need to be included. Evidence of an appropriate review process should be provided (see HSE's current guidance in refs 1, 4 & 6).
Justification for Continued Operation (JfCO)
mP7 Prior to each of the critical dates, the licensee should produce a justification for continued operation (JfCO) beyond each of these critical dates. This justification should show that the inventory was properly established; the investigation was comprehensive and thorough; the solutions are appropriate (and safe) and properly tested; and that the contingency plans (including supply chain management) are appropriate.
mP8 The JfCO should cover not only the continuous operation of the plant but all its modes of operation, including shutting down and starting up after a critical date. Where the licensee opts to shut down a plant prior to a critical date with a view to restarting it following that critical date, the JfCO(s) should demonstrate that the plant will be safe in the shutdown mode through the critical date and that it will be able to be operated safely in all proposed modes following the critical date. This equally applies to any operations which are not of a continuous nature, irrespective of the periodicities of their use.
mP9 The final solutions to the problem systems (close-outs) must be demonstrably safe, taking due account of any interactions between, or otherwise involving, proposed 'work-arounds' (both in normal operation and during fault conditions).
mP10 Where licensees wish to continue operation with a number of degraded safety-related systems, then the synergistic effect should be demonstrated to be safe. Any information, obtained from other sources and used in support of the plant's JfCO, should be sufficiently detailed and authenticated to enable the safety arguments to be evaluated without the need to seek further information held by others.
mP11 New equipment purchased prior to and during the periods of the critical dates should be subject to the JfCO process.
Strategy Paper
mP12 Licensees should demonstrate by means of a suitable document that their approach to the Y2K problem is properly controlled through the application of a strategy which broadly matches the following phases.
mP13 All licensed nuclear sites, and any other locations associated with these sites which hold safety-significant computer systems, should be covered by appropriate strategy documents which should include QA plans and project programmes covering the critical dates. These strategy documents should clearly state that safety is paramount.
Project Programme
mP14 The project programme should become more detailed once the inventory is developed and the problem-systems are identified. The updated programme should show when the tests will be carried out and should include the identification of any plant outages required. Because the dates are immutable, the project programme should be supported by an analysis demonstrating that there is adequate resourcing to meet the programme's key dates.
Quality Assurance Plan/Programme
mP15 Quality Assurance plans/programmes should show all project responsibilities and demonstrate, by means of the status and competencies of the personnel involved, that the organisation is committed to resolving the issues prior to the critical dates. Regular project reviews should also be included. There should be other evidence of effective quality control such as a system of peer reviews/checking and approval with appropriate signing off of all activities.
Prioritised Inventory
mP16 There should be appropriate inventory development and prioritisation procedures linked to suitable guidance. The procedures and guidance should ensure that the approach is sufficiently comprehensive to ensure that the inventory is complete and correct. The inventory should include all systems and these should be uniquely identified and their configurations recorded. Each system should be categorised according to its safety significance.
For the production of the inventory, a top-down/bottom-up approach is recommended. The top-down approach should include a review of the safety case documents, the maintenance schedules, SOIs etc. and the bottom-up approach would require operations and maintenance staff to be consulted (with reference being made to operational and maintenance manuals) linked to a plant and office walk-round with these same members of staff. Cross checks with the inventories of similar sites could also usefully be performed.
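The cross-check between the top-down and bottom-up reviews can be sketched as a simple reconciliation of identifiers. This is an illustrative sketch only: the record fields, the numeric safety category scale and the function names are assumptions and are not taken from this paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class InventoryItem:
    system_id: str        # unique identifier for the system
    description: str
    configuration: str    # installed hardware/software versions
    safety_category: int  # hypothetical scale: 1 = highest safety significance


def duplicate_ids(items: Iterable[InventoryItem]) -> Set[str]:
    """Identifiers that appear more than once, i.e. are not unique."""
    seen: Set[str] = set()
    dupes: Set[str] = set()
    for item in items:
        if item.system_id in seen:
            dupes.add(item.system_id)
        seen.add(item.system_id)
    return dupes


def reconcile(top_down_ids: Iterable[str],
              bottom_up_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the identifiers found by the two review routes."""
    top, bottom = set(top_down_ids), set(bottom_up_ids)
    return {
        "confirmed": sorted(top & bottom),
        "only_top_down": sorted(top - bottom),    # in documents, not seen on plant
        "only_bottom_up": sorted(bottom - top),   # seen on plant, not in documents
    }
```

Items found by only one route are the ones of interest: each indicates either a gap in the documentation or a gap in the walk-round, and would need to be resolved before the inventory could be regarded as complete.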
mP17 The completeness of the inventory should be kept under constant review to ensure that any date-dependent systems identified elsewhere, or subsequently, are included, as appropriate, in the inventory.
mP18 System aspects that the inventory needs to address include:
mP19 Embedded systems are of particular concern because it is less obvious that equipment and plant contain such devices. Examples of plant items and equipment which may contain embedded systems are: cranes, circuit breakers and associated supply system protection equipment, smart instruments (gas detectors, etc), smart valves, lifts and road transport vehicles. The inventory development process should be such as to ensure that such systems are captured - this may require questioning of the manufacturer or supplier.
mP20 Licensees' systems important to safety, and their systems which support safety (such as maintenance database systems and off-site systems), should be considered since degradation of any of these systems may have direct safety impacts on the plants involved, especially if there is a synergism between more than one failed system. The systems to be considered include (but are not limited to):
mP21 Prioritisation should be in terms of safety significance and required plant outages. Evidence should be available which demonstrates that systems are being investigated and solutions found, according to this prioritisation, so as to ensure that safety is being optimally secured prior to the critical dates.
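One way of turning this prioritisation into an ordered work list is sketched below. The ordering rule is an assumption made for illustration: outstanding systems are ranked by safety category first and, within a category, those needing a plant outage are placed first because outage windows are limited. The paper itself does not prescribe how the two factors are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    system_id: str
    safety_category: int      # hypothetical scale: 1 = most safety-significant
    needs_outage: bool        # True if investigation or fix requires a plant outage
    closed_out: bool = False  # True once the system has been closed out


def work_queue(entries: List[Entry]) -> List[str]:
    """Outstanding systems in the order they would be addressed."""
    open_entries = [e for e in entries if not e.closed_out]
    # False sorts before True, so 'not needs_outage' puts outage items first.
    open_entries.sort(key=lambda e: (e.safety_category, not e.needs_outage, e.system_id))
    return [e.system_id for e in open_entries]
```

Comparing such a list with the order in which systems are actually being closed out gives a simple indication of whether work is following the stated prioritisation.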
Investigation
mP22 The licensee should have adequate procedures and guidance for controlling the investigation. The guidance should describe how to identify systems with potential date dependency (see ref 4, Appendix A(ii) for guidance), and how to test these systems for date-related problems (see ref 6 for guidance). The guidance should show how the safety issues which might arise during testing should be addressed.
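The kind of date-related test referred to in mP22 can be sketched as a comparison between a routine under test and a reference calculation across a date boundary. The legacy routine, the reference routine and the probe dates below are all hypothetical illustrations; they are not drawn from refs 4 or 6.

```python
from datetime import date
from typing import List, Tuple


def elapsed_years_two_digit(start_yy: int, now_yy: int) -> int:
    """Hypothetical legacy routine: subtracts two-digit years directly."""
    return now_yy - start_yy


def elapsed_years_reference(start: date, now: date) -> int:
    """Reference calculation using full four-digit years."""
    return now.year - start.year


def rollover_failures(start: date,
                      probe_dates: List[date]) -> List[Tuple[date, int, int]]:
    """Probe dates at which the legacy result differs from the reference."""
    failures = []
    for now in probe_dates:
        legacy = elapsed_years_two_digit(start.year % 100, now.year % 100)
        expected = elapsed_years_reference(start, now)
        if legacy != expected:
            failures.append((now, legacy, expected))
    return failures
```

For a start date in 1995, probing 31 December 1999 and 1 January 2000 shows agreement on the first date and a discrepancy on the second (minus 95 years against an expected 5). A real test programme would need to cover every critical date identified for the system and would be subject to the safety controls described in mP23.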
mP23 During any investigation, plant safety must be paramount. Where on-line testing is envisaged, plant investigations must be covered by the existing procedures (in line with Licence Condition LC22's requirements). This may require safety submissions/risk assessments to be reviewed by the Nuclear Safety Committee. Additionally, the testing should be covered by an appropriate permissioning regime. The potential for the system not to be able to recover from the test because of software and/or data corruption should be investigated and recorded as part of the documented demonstration of a safe testing regime. This should include consideration of the achievement of a safe plant state, or the ability to implement a recovery programme, following a test.
mP24 For systems important to safety, licensees have a duty to conduct their own investigations, including testing. It is not considered sufficient in this respect to rely solely on a supplier's statement of Year 2000 compliance (see ref 4, Appendix C).
mP25 Where supporting use is made of suppliers' compliance statements, there should be a clear demonstration that these statements refer to the installed version of the software with the specific hardware and software configuration of the system under investigation. The statements of millennium compliance and other advice from manufacturers should be carefully reviewed.
mP26 During the investigation phase, an overarching principle should be one of prudence. The licensee should assume that systems and processes with date dependencies will fail. They should not use probabilistic arguments to justify any lack of investigation.
mP27 Proposed desk-top software audits, and system tests, should be adequate (see ref 3, Appendix E for suggestions for date strings to be used in searches of source code; also see ref 7, Appendix A, which provides a more general discussion). The configurations of any systems used for off-line testing should be demonstrated to be sufficiently representative of the installed system so as to give a high degree of confidence that the test results accurately mimic the behaviour of the installed system. The coverage of any tests should be sufficient to detect any potentially adverse effects on the system's functionality due to the system's date dependency.
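A desk-top audit of source code usually starts with a text search for date-related names and literals. The sketch below shows the general approach. The search patterns are illustrative assumptions only and are not the date strings listed in ref 3, Appendix E.

```python
import re
from typing import List, Tuple

# Assumption: illustrative patterns only, NOT the strings from ref 3, Appendix E.
# No word boundaries are used, so fragments inside identifiers
# (e.g. "due_date", "cur_yy") are also caught.
DATE_PATTERNS = [r"date", r"year", r"yy", r"century", r"19\d\d"]
DATE_REGEX = re.compile("|".join(DATE_PATTERNS), re.IGNORECASE)


def scan_source(text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, matched text, stripped line) for each flagged line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = DATE_REGEX.search(line)
        if match:
            hits.append((number, match.group(0), line.strip()))
    return hits
```

A search of this kind is deliberately over-inclusive and will flag lines that have nothing to do with dates. It also cannot find date handling hidden behind unrelated names, so it supports, rather than replaces, the review and testing described above.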
mP28 There should be evidence of a systematic approach to the investigation with justification for the recorded outturn, i.e. why a solution is required or, alternatively, the reasons why one is not (e.g. no computer in system, no use of date, use of date causes no safety concerns). Decisions should be peer reviewed.
System Close-out
mP29 There should be procedures and guidance covering the full implementation of the close-out activity, including that of providing a documented demonstration of safety in relation to the critical date(s). Where a change is required (either because of a software modification or replacement of the equipment), the site change control procedures should be used in the normal way (LC 22 arrangements apply) - including appropriate re-testing of the system and its interfaces following the change. Of particular importance is the maintenance of the plant's safety case. Hence, a comprehensive impact analysis should be undertaken for any such changes to ensure that all interactions are addressed - making one system safe may make another unsafe.
mP30 The solutions proposed to close-out a concern should be fully documented and demonstrated to be safe. For example: turning the clock back should be shown to be safe in the overall plant operating context; in particular, the impact on date-related records/activities should be systematically investigated and, where problems are identified, appropriate remedies implemented. Where a plant is to be restarted following a pre-critical-date shutdown, that startup must be shown to be safe.
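The effect of turning the clock back on date-related records can be illustrated with a minimal sketch. The 28-year offset is an assumption chosen for illustration (between 1901 and 2099 the days of the week and the leap years repeat on a 28-year cycle), and the inspection-interval function is hypothetical.

```python
from datetime import date

OFFSET_YEARS = 28  # assumption: illustrative set-back, not a recommended value


def set_back(real: date) -> date:
    """Date shown by a system clock that has been turned back."""
    return real.replace(year=real.year - OFFSET_YEARS)


def days_since_last_inspection(last_record: date, clock_today: date) -> int:
    """Hypothetical interval check comparing a stored record with the system clock."""
    return (clock_today - last_record).days


# A record stamped in real time before the clock was turned back.
last_inspection = date(1999, 12, 1)
interval = days_since_last_inspection(last_inspection, set_back(date(2000, 1, 1)))
```

In this sketch the day of the week is preserved, but the interval since the last inspection comes out negative because the stored record now appears to lie in the future. Any check that depends on such intervals (for example, overdue maintenance alarms) could therefore be silently defeated, which is why the impact on records needs systematic investigation.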
mP31 Any software tools used to detect date information in source code and/or implement corrections must be demonstrated to be fit for purpose, and the solution offered must be subjected to all the site change control procedures.
mP32 Operations which are not of a continuous nature may be shut down over the date-critical periods. Operators of such plant have a duty to carefully review, and where necessary rectify, date-discontinuity problems: taking no investigative/corrective action and assuming that shutting down over the critical periods constitutes an appropriate work-around should be regarded as unacceptable.
mP33 Work-arounds should be demonstrated to be safe. This should include the human factors aspects and their impact on the safety of other activities and work-arounds upon which the site depends for continued operation. Actual and potential operators' loading should be considered (as necessary) and demonstrated to be manageable.
Contingency Plans
mP34 Licensees should demonstrate that they have contingency plans appropriate to the consequences of major plant failure. These should recognise the possibility of widespread disruption of a licensee's own internal infrastructure caused by multiple failures in seemingly non-safety related systems, or the possible disruption of the industrial infrastructure of the UK. Both of these events will place very high demands on staff in licensees' organisations, with indirect detriment to safety. There should be adequate procedures and guidelines in place covering the production of contingency plans.
mP35 Licensees should demonstrate that their staffing levels, and staff competencies and levels of authority will be appropriate for the potential risk and consequences over each critical date associated with the millennium change. In each case this should be reviewed and the proposed arrangements shown to be adequate. Staff should be adequately trained in all the plant work-arounds (and changes) prior to the critical dates to which they apply. Staff should also be advised to be alert to potential system malfunction following each of the critical dates and should be aware of, and adequately trained in, the actions that should be taken in the event of the failure of any system.
mP36 Evidence should be provided that all necessary external supplies have been secured prior to each critical date such that the need to re-order does not occur during the associated critical period. This may include the licensees establishing that their suppliers of safety significant items have made the appropriate securing provisions themselves.
mP37 There should be confirmation by the licensees that there are no plans to perform non-essential intrusive activities (such as re-fuelling) through the critical dates.
mP38 Licensees should demonstrate that the emergency arrangements for the critical dates have been reviewed, including the availability of the communication systems. In particular, both on-site and off-site equipment involved in the emergency arrangements should be checked and contingency plans laid. The review should include the need for specific manning of the licensee's emergency facilities over the critical dates.
References
1. "Safety and the year 2000", HSE Books 1998, ISBN 0 7176 1491 3.
2. "Safety Assessment Principles for Nuclear Plant", Health and Safety Executive, 1992, ISBN 0 11 882043 5.
3. "Embedded Systems and the Year 2000 Problem, Guidance Notes", IEE Technical Guidelines 9:1997, ISBN 0 85296 930 9.
4. "Health and safety and the year 2000 problem: guidance on the year 2000 issues as they affect safety-related control systems", Health and Safety Executive, INDG267 C1000 5/98.
5. "A definition of Year 2000 Conformity Requirements", DISC PD2000-1, British Standards Institution.
6. "Testing safety-related control systems for year 2000 compliance", Health and Safety Executive, 1998, ISBN 0 7176 1596 0.
7. "Managing year 2000 conformity: a code of practice for small and medium enterprises", BSI, DISC PD2000-2, 1997, ISBN 0 5802 7445 4.
Bibliography
1. "The year 2000: a practical guide for professionals and business managers", British Computer Society, 1997, ISBN 0 90 1865 97 4.
2. "The year 2000: a practical guide for professionals and business managers - volume 2", British Computer Society, 1997, ISBN 0 90 1865 98 2.