“Many who came before you have tried and failed,” said someone famous, somewhere, once. The road towards the retrospective application of safety standards is fraught with danger; beware the temptresses and sirens lurking within the standards, who seek to lure you into the crushing depths of frustration with the mirage of simplicity.
Which Safety Standards?
With the metaphors now suitably expunged, to the matter in hand. The specific standards I refer to for the purposes of this article are the ARP 4754A suite of publications (which includes DO-178C and DO-254) and BS EN/IEC 61508 (aerospace, and ‘vague’, respectively). I’m discussing these for two main reasons.
The first reason is that they are widely adopted and, although they have a specific industry focus, their use is not confined to the aerospace and generic industries: they are also used as (purported) exemplars of recognised good practice for the functional safety assurance of software and complex electronic hardware.
The second reason, expanded on below, is that they are recognised as a quasi-acceptable means of compliance by UK Defence Standard 00-055 (the third standard alluded to by the Defence Standard is ISO 26262, which is a derivative and child of 61508 anyway).
What is Retrospective Application?
As elements become smaller, more complex, and/or more expensive to produce, many are mass-manufactured for a wide range of customers. Many developers of large-scale and/or complex systems no longer rely on, or have the capacity to create, such elements in a bespoke manner and choose instead to integrate a varying proportion of existing ones (many of which will be a form of Off The Shelf, or pre-existing, elements).
There are two main procurement types here: an entire system which has already been developed, or the selection of existing components for integration into a bespoke development (at varying ratios of bespoke to existing). There are many reasons and rationales behind the selection of each type of procurement.
Regardless of the scale or ratio of existing elements being procured, many safety-related industries require compliance with an Open Standard. Indeed, the standards discussed in this article are indicative of those required by the (UK) Defence Standard 00-055 for use across the UK Armed Forces.
This introduces complexities and challenges for organisations supplying materiel for use by the UK Armed Forces where that materiel has elements which are already designed – often designed by the Original Equipment Manufacturer (OEM) for an application which is not safety-related in nature. Retrospective application is therefore the art form (which no one has perfected yet) of trying to argue that the manner and means by which an element or system was designed aligns (by chance) with the objectives and requirements of a standard to which it was not designed – and perhaps for use in a context and/or environment that was never intended.
How do the Standards Deal With the Safety Assurance of Existing Elements?
There are differences across the standards as to what level of assurance (and argument) is required. ARP 4754A provides very little room for negotiation over existing systems which have not already been certified by a regulatory body as safe and fit for purpose (airworthy). Further, it does not provide any ‘entry and exit’ points into parts of the lifecycle for any pre-existing software or hardware elements. So, unless the existing system is already certified for use on another aircraft type (which will bring its own complications for arguing applicability and read-across for any different aircraft types, operating envelopes, and environments), the only hope would be to find a friendly certifier and regulator (which is highly unlikely – to put it mildly).
BS EN 61508 provides 2 routes for arguing over existing elements. In the context of ‘Hardware Safety Integrity Architectural Constraints’ (Part 2 of the Standard), general requirements are given as to the hardware fault tolerance of any (sub) system implementing a safety function. The standard sets out different rules dependent on whether the elements that comprise the (sub) system are Type A or Type B.
Type A elements are those (required to achieve the specified safety function) whose failure modes are well defined, whose behaviour under fault conditions are well defined, and which have sufficient dependable failure data to show that the claimed failure rates for detected and undetected dangerous failures are met.
Can you detect the aroma of oil from a serpent yet? No? Consider a typical OTS element: how much data will you have about it? Will you have a complete understanding of all of its failure modes? Will you know its explicit behaviour under all perceivable fault conditions? Worse still, how could the OEM (even if they wanted to) understand what constitutes a failure condition for the safety function you are instantiating after the OEM design process has completed?
Now for Type B elements. The standard takes 6 lines to state that these are those elements where the above doesn’t apply.
Type A elements get the easier ride by the way.
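To make the Part 2 architectural constraints concrete, here is a rough sketch of the ‘maximum claimable SIL’ lookup. The band boundaries and SIL values are paraphrased from my recollection of the Route 1H tables, so verify them against the current edition of the standard before relying on them:

```python
# Sketch of IEC 61508-2 'Route 1H' architectural constraints:
# the maximum SIL claimable, given the element type, Safe Failure
# Fraction (SFF) band, and Hardware Fault Tolerance (HFT).
# Table values paraphrased from memory -- check the standard itself.

# MAX_SIL[element_type][sff_band] -> (HFT 0, HFT 1, HFT 2); 0 = not allowed
MAX_SIL = {
    "A": {
        "<60%":    (1, 2, 3),
        "60-<90%": (2, 3, 4),
        "90-<99%": (3, 4, 4),
        ">=99%":   (3, 4, 4),
    },
    "B": {
        "<60%":    (0, 1, 2),   # HFT 0 not allowed at all
        "60-<90%": (1, 2, 3),
        "90-<99%": (2, 3, 4),
        ">=99%":   (3, 4, 4),
    },
}

def sff_band(sff: float) -> str:
    """Map a safe failure fraction (0..1) to the standard's bands."""
    if sff < 0.60:
        return "<60%"
    if sff < 0.90:
        return "60-<90%"
    if sff < 0.99:
        return "90-<99%"
    return ">=99%"

def max_claimable_sil(element_type: str, sff: float, hft: int) -> int:
    return MAX_SIL[element_type][sff_band(sff)][min(hft, 2)]

# A typical OTS (Type B) element with poor failure data (SFF < 60%)
# and no redundancy (HFT 0) supports no SIL claim whatsoever:
print(max_claimable_sil("B", 0.55, 0))   # 0 -> not allowed
print(max_claimable_sil("B", 0.92, 1))   # 3
```

Note the asymmetry the sketch makes obvious: the same SFF and HFT buy a Type A element one SIL more than a Type B element, and a Type B element with no redundancy and poor coverage buys nothing at all.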
The lure of simplicity becomes even less appealing should the existing system comprise Type B elements, however. The standard requires (inter alia):
- A confidence greater than 90% that the target failure measures will be achieved
- A minimum diagnostic coverage of not less than 60%
- Reliability data that is based on field feedback (which must be for the same elements in a similar application and environment).
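For those unfamiliar with where diagnostic coverage and safe failure fraction come from, the arithmetic is FMEDA-style bookkeeping over per-failure-mode rates. A minimal sketch, with failure rates invented purely for illustration:

```python
# Computing Diagnostic Coverage (DC) and Safe Failure Fraction (SFF)
# from per-mode failure rates, FMEDA-style. All rates are per hour
# and entirely hypothetical.
#   DC  = lambda_DD / (lambda_DD + lambda_DU)
#   SFF = (lambda_S + lambda_DD) / (lambda_S + lambda_DD + lambda_DU)

def dc_and_sff(lam_safe: float, lam_dd: float, lam_du: float):
    """lam_safe: safe failures; lam_dd: dangerous detected;
    lam_du: dangerous undetected. Returns (DC, SFF)."""
    lam_dangerous = lam_dd + lam_du
    dc = lam_dd / lam_dangerous
    sff = (lam_safe + lam_dd) / (lam_safe + lam_dangerous)
    return dc, sff

# Hypothetical element: 2e-6/h safe, 1.5e-6/h dangerous detected,
# 0.5e-6/h dangerous undetected.
dc, sff = dc_and_sff(2e-6, 1.5e-6, 0.5e-6)
print(f"DC  = {dc:.1%}")    # 75.0% -> clears the 60% floor
print(f"SFF = {sff:.1%}")   # 87.5%
```

The catch, of course, is the input data: without the OEM’s failure-mode analysis, you have nothing defensible to put into either formula.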
Many have tried, and spectacularly failed, with this approach… but these are just the requirements for existing hardware and the architecture. At the software boundary, attention turns to assuring the Systematic Capability of software. BS EN 61508 Part 3 presents 3 routes to compliance:
- Compliant Development (meet the standard as described)
- Proven in Use
- Assessment of non-compliant development.
Anecdotally, and from experience, the ‘proven in use’ approach isn’t attempted. It requires that any ‘proving’ evidence:
- Is based on an analysis of operational experience of a specific configuration of the element
- Is based on previous use which is the same, or sufficiently close to that expected for its ‘new application’ (which includes the environment, modes of use, functions performed, configuration, interfaces to other systems, operating system, translators, and human factors by the way)
- Shows that the dangerous failure rate (for your intended application) was not exceeded in previous use
- Demonstrates equivalence (or at least has impact analysis that argues it is not an issue).
What the authors of BS EN 61508 are really very good at is hiding some of the most important data in notes of font size 2 (which most people would skim past). Hidden (in plain sight, honest guv) in the section on ‘proven in use’ is a little gem that refers you out to an Annex of Part 7 for guidance on using the probabilistic approach for determining Systematic Capability. Here you will note that, in addition to the above criteria, for a SIL 2 safety function operating at high demand (so greater than 1 demand per annum), the necessary history is between 3e7 and 4.6e7 hours of operation, AND a probability of dangerous failure between 1e-6 and 1e-7 per hour!
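For the curious, operating-history figures of that shape fall out of the standard zero-failure exponential bound: demonstrating a dangerous failure rate no worse than λ at single-sided confidence C requires T = −ln(1 − C)/λ failure-free hours. A minimal sketch – the 95% and 99% confidence levels here are my assumption to reproduce the figures, so check the Annex itself:

```python
# Zero-failure demonstration time for a constant failure rate:
# to claim lambda <= target at single-sided confidence C, with no
# observed dangerous failures, you need T = -ln(1 - C) / target hours.
# Confidence levels below are assumed, not quoted from the standard.
import math

def required_hours(target_rate: float, confidence: float) -> float:
    """Failure-free operating hours needed to claim the target
    dangerous failure rate (per hour) at the given confidence."""
    return -math.log(1.0 - confidence) / target_rate

# SIL 2 high-demand band edge: 1e-7 dangerous failures per hour.
print(f"{required_hours(1e-7, 0.95):.2g} h")  # 3e+07 h
print(f"{required_hours(1e-7, 0.99):.2g} h")  # 4.6e+07 h
```

That is of the order of 3,500 continuous years of relevant, same-configuration, same-environment operation – which rather answers the next question for you.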
Can you foresee many (any) applications for this route?
Which brings us to the route for non-compliant (with the standard) development. Now this does look tempting…9 steps, one of which is planning, and one is for a manual. Result! This is the route for me! And off the intrepid developer trots, content in the simplicity of this route…
And then, they become aware (ordinarily because some annoying auditor or academic points it out) that:
- Step 1 requires compliance with another clause, and a table of techniques and measures for how the safety requirements must be specified AND traced through design (of the existing elements)
- Step 2 requires compliance with another clause, 10 sub-clauses and an entire annex
- Step 3 requires compliance with another 3 sub-clauses and 2 Tables of techniques and measures
- Step 4 requires compliance with another 3 sub-clauses and a Table of techniques and measures
- Step 5 requires compliance with another 5 sub-clauses, an entire annex, and 4 Tables of techniques and measures from another
… so practically compliant development after all.
Oh, and you’ll still have to meet the rest of the objectives and requirements of all 3 normative parts of the standard.
Is There a Better Way?
Perhaps – and the specific answer will, of course, be sector-, application-, and technology-specific. Here is where designers, developers, integrators, and safety specialists need to pause and reflect on what the end game is.
Systems that cannot demonstrate compliance with an Open Standard are not necessarily unsafe. Of equal importance, systems that can demonstrate compliance with an Open Standard are not necessarily safe.
There is no empirical evidence that compliance with an Open Standard improves safety. To be frank, there is no empirical evidence that safety cases improve safety (but that’s for another article). For the sake of discussion, let us consider the mitigation of systematic failures. Whereas we can largely test our way to an argument against random failures (think of the Dreamliner wings being flexed every hour of every day for years before, and years after, the aircraft entered service), systematic failures are argued to have been mitigated by the selection and adoption of appropriate design techniques and methods.
So How Can I Know my System is Safe?
Whilst random failures can be claimed to be managed through something approaching exhaustive testing, one cannot embark on a series of exhaustive tests to argue the mitigation of systematic failures: testing only proves the presence of bugs, never their absence. Rather than the blind, retrospective application of a standard to an existing design, consider instead what assurance cannot be claimed (assurance deficits) and what aspects of the deployed technology are ‘lacking’ (technical debt), and seek to remedy these through design changes and/or operating limitations.
Accepting that the ‘recovery’ options available will depend on the specifics of the deficit and debt, assume that ‘failure x’ will occur (perhaps in a manner that is not immediately detectable) and seek to instantiate mechanisms that will detect and remove the fault before it can lead to a failure, instigate appropriate alerting mechanisms, and facilitate recovery away from a hazardous state. Alternatively (and additionally), apply software design features that prevent errors from being introduced (or argue fault tolerance through product assurance) – and there are ways to argue the goodness of design processes other than those deemed worthy by committee.
Having a form of retrospective argument will NOT qualify residual risk; it will only serve to argue what cannot be claimed. More safety benefit will be realised by qualifying the residual risk (owing to the assurance deficits and technical debt) and applying finite resources to architectural changes such as monitors, watchdogs, voters, wrappers, or even component changes (considering the cost of doing so in proportion to the residual risk and the perceived benefit of the changes).
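As an illustration of one such architectural mitigation, here is a minimal 2-out-of-3 (2oo3) voter sketch. It is entirely hypothetical – the tolerance check and the behaviour on disagreement are invented for illustration, not lifted from any standard:

```python
# Minimal 2oo3 voter over three redundant channel readings: a single
# faulty channel is masked; if no two channels agree, the voter
# refuses to pass data downstream so the system can revert to a safe
# state. Tolerance and behaviour on disagreement are illustrative.

def vote_2oo3(a: float, b: float, c: float, tolerance: float) -> float:
    """Return the median of three redundant readings, or raise if no
    two channels agree within tolerance (majority lost)."""
    low, mid, high = sorted([a, b, c])
    if (mid - low) > tolerance and (high - mid) > tolerance:
        # No two channels agree: annunciate rather than guess.
        raise ValueError("2oo3 disagreement - revert to safe state")
    return mid

# One wildly faulty channel (250.0) is outvoted by the agreeing pair:
print(vote_2oo3(100.1, 99.8, 250.0, tolerance=1.0))  # 100.1
```

The point is not the ten lines of Python; it is that a mechanism like this, wrapped around an unassured OTS element, buys a detection-and-recovery argument that no amount of retrospective paperwork will.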
If we cannot tolerate any residual risk after such measures are exhausted (and the above are exemplars only), we can seek to limit or change operational aspects. After that, we can at least express the residual risk to the Duty Holder, who MAY decide to accept it.
Standards like BS EN 61508 are often decried, and perhaps rightly so. However, their principles are sound and, to crudely paraphrase for simplicity’s sake: they provide objectives that are fulfilled through meeting requirements, and those requirements can be met by adopting certain techniques and measures. Very few of these techniques and measures are mandatory, though – and alternative means of conformance may be argued instead with reasoned argument. Intelligent thought needs to be applied. You may hear me say that again from time to time…