As we continue to research the safety of autonomous systems, we constantly evaluate the efficacy of existing safety engineering techniques. So, what’s wrong with traditional safety engineering techniques such as HAZOP, FHA, FMECA, and even SHARD and STPA? The short answer is probably nothing. There is probably a lot wrong with their application, however.
Conversations with safety engineering practitioners and researchers alike suggest that many analysis techniques are not as effective as we would hope or expect them to be, and as we apply them to Autonomous Systems (AS) their potential limitations are laid bare. For the sake of argument, I’m going to lump the aforementioned techniques into a single basket. Although they are designed for differing levels of design granularity, and some (such as STPA) consider the difference between the belief state and the ground truth of a system model, as well as the effect of the environment, they are remarkably similar in design. At a basic level, they apply guidewords or consider failures in order to identify hazards.
The issue lies not with their intent or design, but with their interpretation, or more accurately, the lack of it. It is far too easy to read a standard, or a published book that shows how such techniques are to be applied, and simply ‘copy and paste’ the examples verbatim and without thought.
To illustrate the potential problem, let’s consider a HAZOP. If we were to apply it blindly, ‘straight out of the box’, we might use the following guidewords:
- More/Too much
- Less/Too Little
- Other Than/Instead of
- Early/Too Soon
- Late/Too Late.
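To make ‘straight out of the box’ application concrete, the minimal Python sketch below mechanically pairs every function with every textbook guideword to produce deviation prompts. The helper `deviation_prompts` and the example function name are hypothetical illustrations, not part of any HAZOP standard or tool.

```python
from itertools import product

# The textbook guidewords listed above, copied verbatim.
TEXTBOOK_GUIDEWORDS = [
    "More/Too much",
    "Less/Too Little",
    "Other Than/Instead of",
    "Early/Too Soon",
    "Late/Too Late",
]


def deviation_prompts(functions, guidewords):
    """Pair every function with every guideword (hypothetical helper).

    Each pair is only a prompt for the team to discuss; nothing here
    decides whether a deviation is credible or hazardous.
    """
    return [f"{fn}: {gw}" for fn, gw in product(functions, guidewords)]


for prompt in deviation_prompts(["move"], TEXTBOOK_GUIDEWORDS):
    print(prompt)  # e.g. "move: More/Too much"
```

Every prompt it produces is well-formed, which is precisely the danger: the output looks complete whether or not the functions, the level of abstraction and the words themselves were well chosen.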
I propose that there are five types of risk associated with such blind application:
- Application at the wrong level of abstraction
- Mis/poor interpretation of guidewords
- Coupling of guidewords
- Failure to consider threats external to the system
- Failure to consider additional guidewords/failure conditions.
Whilst I acknowledge that not all of these risks will manifest in every technique I have brazenly grouped together for brevity and simplicity of argument, I’ll take each risk in turn and illustrate it with an example.
Application at the Wrong Level of Abstraction
This risk manifests when the technique is applied at the wrong level of design abstraction (e.g. sub-system instead of component) or of function. Consider, if you will, the analysis of an autonomous robot in a domestic setting. In this example we are considering the impact of the function ‘move’ (from one point to another). We apply the guideword ‘More/Too Much’ and elicit either no valid failure condition or, perhaps, an overspeed. But now apply the same guidewords to the sub-functions of ‘move’ (e.g. move left, move right, move forward): we could now elicit a failure condition in which too much speed or turning force is applied and the robot topples over.
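The effect of abstraction level can be sketched in a few lines: the same guidewords applied one level further down the functional hierarchy yield more, and more specific, prompts. This is a minimal Python illustration; the decomposition in `FUNCTION_TREE` and the helper names are hypothetical. Whether a prompt such as ‘move left: More/Too much’ leads to a toppling hazard remains a judgement for the analysts.

```python
# Hypothetical functional decomposition of the domestic robot's 'move'
# function, using the sub-functions named in the text.
FUNCTION_TREE = {
    "move": ["move forward", "move left", "move right"],
}

GUIDEWORDS = ["More/Too much", "Less/Too Little"]


def functions_at_level(tree, root, depth):
    """Return the functions `depth` levels below `root` in the tree.

    depth=0 returns the root itself; a function with no recorded
    sub-functions is returned unchanged, however deep we ask to go.
    """
    if depth == 0 or root not in tree:
        return [root]
    found = []
    for child in tree[root]:
        found.extend(functions_at_level(tree, child, depth - 1))
    return found


def prompts(functions, guidewords):
    """Pair each function with each guideword."""
    return [f"{fn}: {gw}" for fn in functions for gw in guidewords]


print(prompts(functions_at_level(FUNCTION_TREE, "move", 0), GUIDEWORDS))
# → ['move: More/Too much', 'move: Less/Too Little']
print(len(prompts(functions_at_level(FUNCTION_TREE, "move", 1), GUIDEWORDS)))
# → 6
```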
Mis/Poor Interpretation of Guidewords
Although I use the term ‘guidewords’ here, this could equally apply to failure modes. The English language is wonderfully rich and diverse, which is great for prose and poetry but not so good for engineering. Great care must be taken with the use of natural language in any engineering application. As a tongue-in-cheek illustration of how different stakeholders can interpret (guide) words or short phrases (failure modes) differently, let us consider a couple of stark, yet fabricated, examples.
The first is ‘time flies’. How many interpretations can you make of these two words? Here’s my attempt:
- An uttering which elicits the response ‘it sure does’
- An instruction to grab a stopwatch and time that fly from point A to point B
- A mere statement on the ‘time’ species of flies.
Or how about ‘shoes must be worn’? Is this a requirement to cover stockinged feet more robustly, or a requirement that the tread of the shoes be worn down?
You get my point.
Consider also a simple prompt or failure mode such as ‘more/too much’: how many different ways can it be interpreted? Let us consider the prompt ‘more’ as applied to data receipt or transmission (see below about erroneously coupling this with ‘too much’). It could mean more data (in lines of code or in size) transmitted than expected, a higher speed of transmission or receipt, repeated sending of the same data packets, and so on. It could feasibly involve all of these.
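The ambiguity becomes obvious if we imagine turning the single prompt into checks. The minimal Python sketch below treats three readings of ‘more’ given above as separate conditions on a received packet stream; the `Packet` fields, the thresholds and the function name are all hypothetical. The point is not the code itself, but that a team agreeing only on the word ‘more’ could each have a different one of these checks in mind.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int         # sequence number assigned by the sender
    size_bytes: int  # payload size
    t: float         # receive time in seconds


def more_deviations(packets, max_size_bytes, max_rate_hz):
    """Check three distinct readings of the guideword 'More'.

    The thresholds are hypothetical parameters; a real analysis would
    take them from the interface specification.
    """
    findings = []
    # Reading 1: more data than expected in a single packet.
    if any(p.size_bytes > max_size_bytes for p in packets):
        findings.append("more data per packet")
    # Reading 2: data arriving faster than expected.
    times = sorted(p.t for p in packets)
    for earlier, later in zip(times, times[1:]):
        if later - earlier < 1.0 / max_rate_hz:
            findings.append("higher transmission rate")
            break
    # Reading 3: the same packet delivered more than once.
    seqs = [p.seq for p in packets]
    if len(seqs) != len(set(seqs)):
        findings.append("repeated packets")
    return findings


stream = [Packet(1, 100, 0.0), Packet(1, 100, 0.5), Packet(2, 900, 1.5)]
print(more_deviations(stream, max_size_bytes=512, max_rate_hz=10))
# → ['more data per packet', 'repeated packets']
```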
Coupling of Guidewords
As well as ensuring that every stakeholder involved in the analysis (including any absentee reviewers) has a common understanding of the phrases or words in our uncommon language, we must also avoid erroneously coupling guidewords. The most common example I see is the coupling of guidewords such as Early/Too Soon or Late/Too Late into the same prompt. They are distinct:
EARLY: in relation to clock/calendar time
TOO SOON: in relation to a chronological sequence.
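The distinction can be made explicit by writing the two conditions down separately. In this minimal Python sketch (the sequencing rule in `PREREQUISITES` and both function names are hypothetical), ‘early’ is a comparison against a scheduled time, while ‘too soon’ is a comparison against a required order of events. An action can satisfy one condition and violate the other, which is why a single coupled prompt risks losing one of them.

```python
from datetime import datetime, timedelta

# Hypothetical sequencing rule: 'move forward' must follow an obstacle check.
PREREQUISITES = {"move forward": ["obstacle check"]}


def is_early(actual: datetime, scheduled: datetime, tolerance: timedelta) -> bool:
    """EARLY: the action occurs before its scheduled clock/calendar time."""
    return actual < scheduled - tolerance


def is_too_soon(action: str, completed) -> bool:
    """TOO SOON: the action occurs before a step it depends on has completed."""
    return any(step not in completed for step in PREREQUISITES.get(action, []))


# Exactly on schedule by the clock, yet out of sequence.
scheduled = datetime(2024, 1, 1, 9, 0)
print(is_early(scheduled, scheduled, timedelta(minutes=1)))  # → False
print(is_too_soon("move forward", completed=[]))  # → True
```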
Failure to Consider Threats External to the System
This risk is naturally becoming more prevalent as autonomous systems become pervasive (since actors external to the system can influence safety), but it remains a relevant, yet often overlooked, feature of deductive and inductive safety analyses of ‘traditional’ systems too. Often, external factors are not strictly external either. Consider the analysis of a system whose feedback or control loop involves a socio-technical interaction with a human operator. More often than not, the human is incorrectly classified as NOT being part of the system (a whole other article coming up on that one!), but let us leave that to one side for the sake of an example.
In such instances, it is insufficient to consider only technical failures; one must also apply the prompts to the human. Does your analysis consider the impact of a human pressing a button too many or not enough times? Could external factors such as weather or sea state affect the outcome?
Failure to Consider Additional Guidewords/Failure Conditions
All too often, guidewords and/or failure modes are lifted straight out of the proverbial textbook and applied blindly and rigidly, without question. It must be remembered that the standards and books out there only give examples to make the point. It is the role of the analyst/chair to establish a complete and defensible set of terms to be applied. It is also vital to have a clear understanding of their interpretation that is shared by all stakeholders.
Let’s imagine an analysis where weather conditions may be a contributing factor. It may seem sufficient to elicit a set of terms that prompt stakeholders to consider the impact of poor- or fair-weather conditions but, as with the level-of-abstraction issue above, additional lower-level guidewords may need to be applied should ‘glare’ or low-level sunshine be a contributory factor.
Further, we may need to consider entirely new terms to argue completeness. Perhaps ‘intermittent’ (which is not considered by more/less/erroneous) needs to be applied? The key is to stop copying and pasting terms from a book and apply rigorous, intelligent thought to these long-practiced techniques (which was always the intention behind them).
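The tailoring argued for above (splitting coupled terms, adding terms such as ‘intermittent’, and refusing to proceed with terms nobody has defined) can be expressed as a small, reviewable procedure. This Python sketch is illustrative only: `tailor_guidewords`, `undefined_terms` and the example term lists are hypothetical, and no tool can decide completeness on the analysts’ behalf.

```python
def tailor_guidewords(textbook, split, additions):
    """Build a project-specific guideword list (hypothetical helper).

    split:     maps a coupled textbook entry to the distinct terms it hides.
    additions: terms the team has agreed are needed to argue completeness.
    """
    tailored = []
    for term in textbook:
        tailored.extend(split.get(term, [term]))
    # Append agreed new terms, skipping any already present.
    for term in additions:
        if term not in tailored:
            tailored.append(term)
    return tailored


def undefined_terms(terms, definitions):
    """Return the terms that still lack an agreed, shared definition."""
    return [t for t in terms if not definitions.get(t, "").strip()]


terms = tailor_guidewords(
    ["More/Too much", "Less/Too Little", "Early/Too Soon", "Late/Too Late"],
    split={"Early/Too Soon": ["Early", "Too Soon"],
           "Late/Too Late": ["Late", "Too Late"]},
    additions=["Intermittent"],
)
definitions = {
    "Early": "before the scheduled clock/calendar time",
    "Too Soon": "before a preceding step in the sequence",
}
print(undefined_terms(terms, definitions))
# → ['More/Too much', 'Less/Too Little', 'Late', 'Too Late', 'Intermittent']
```

A non-empty result is a prompt to go back and agree definitions before the analysis proceeds, not a measure of completeness.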
We could always carry on regardless (is that a film?) and rely on additional, catch-all techniques in mitigation. The aviation industry has a nasty habit of doing so through Particular Risk Analysis (PRA), for example by considering ‘Birdstrike’. The mere existence of PRAs should be a warning to us all that the analyses that came before them were inadequate. They are considering yesterday’s accident, not tomorrow’s.
So please continue to apply these techniques, but do so in the manner they were intended:
- Establish the granularity of function
- Establish what ‘the system’ is and what extraneous forces may impact it
- Establish the granularity of terms
- Agree the need for additional terms
- Agree the definition of terms
- Apply intelligent thought.