FMECA – Why We Need to Rethink Its Application
Failure Mode, Effects and Criticality Analyses (FMECA) are a form of ‘FMEA Plus’, insofar as they add detail to a Failure Mode and Effects Analysis (FMEA) by considering the failure paths at a finer granularity and by focusing on asserting the criticality of failure. I deliberately avoid terms like ‘determining’ or ‘establishing’ criticality, as such claims may not be supported by a robust data set at the time the analysis is undertaken (discussions on that to come).
Recent conversations with design organisations have centred on ‘who should carry out a FMECA’ and ‘can a single FMECA satisfy functional safety and ‘mission’ needs?’ Again – for clarity I use the term ‘mission’ so as not to exclude ‘availability and reliability’ considerations from safety. Naturally we need highly available and reliable safety functions, but we can have highly reliable and highly available UNSAFE functions…which is part of my motivation for this article.
Having discussed which ‘transversal speciality’ should/could/does/doesn’t normally perform a FMECA, and how this differs across industry sectors and organisations, the conversation morphed into whether the output of a single FMECA could serve two mistresses: safety and ‘mission’.
The answer to this conundrum is a resounding ‘no’ in the vast majority of cases. In fact, try though I may, I cannot envisage a scenario where the response could even be theoretically ‘yes’. Let me explain.
Firstly, FMECA is an inductive (bottom-up) technique that is often deployed to consider the effect of ‘failure’ of a component or function, to establish the causal/failure path, and to assert its criticality. But what do we mean by ‘failure’? Failure is no longer limited to an end effect that is a smouldering, overheated ‘box’ with no potential for any further functioning. ‘Failure’ has many facets, including (but not limited to) the following hardware failure modes:
• Complete failure
• Partial failure, or
• Intermittent failure.
For systematic (software) failures, we may consider:
• Function fails
• Function runs, but provides incorrect results
• Function occurs early or prematurely
• Data isn’t sent
• Data is sent too early/late
• Data is corrupted
• Software stops/crashes
• Software hangs
• Memory Capacity is exceeded
These lists of potential failure modes are by no means exhaustive, and they must be elicited and agreed in a manner that is predicated on the system design. When we think about HOW a system can fail, we instinctively consider failure modes in terms of the quality attribute we wish to preserve (to determine the criticality of said failures). As such, considering functional failures from a safety perspective may elicit different or further failure modes than considering the reliability and availability of mission-related attributes of a system. This is before we even begin to consider what the evaluation of autonomous systems means for FMECAs!
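To make the point about perspective-dependent elicitation concrete, here is a minimal sketch in Python. It captures the two lists above as enumerations and compares which software failure modes each perspective has chosen to analyse for one function. The `scoping` assignments are invented for illustration only; they are an assumption, not a recommendation of what either team should consider.

```python
from enum import Enum


class HardwareFailureMode(Enum):
    COMPLETE = "complete failure"
    PARTIAL = "partial failure"
    INTERMITTENT = "intermittent failure"


class SoftwareFailureMode(Enum):
    FUNCTION_FAILS = "function fails"
    INCORRECT_RESULT = "function runs, but provides incorrect results"
    PREMATURE = "function occurs early or prematurely"
    DATA_NOT_SENT = "data isn't sent"
    DATA_MISTIMED = "data is sent too early/late"
    DATA_CORRUPTED = "data is corrupted"
    CRASH = "software stops/crashes"
    HANG = "software hangs"
    MEMORY_EXCEEDED = "memory capacity is exceeded"


# Hypothetical scoping: the modes each perspective elicited for one function.
scoping = {
    "safety": {
        SoftwareFailureMode.INCORRECT_RESULT,
        SoftwareFailureMode.DATA_CORRUPTED,
        SoftwareFailureMode.HANG,
    },
    "mission": {
        SoftwareFailureMode.FUNCTION_FAILS,
        SoftwareFailureMode.CRASH,
        SoftwareFailureMode.HANG,
    },
}

# Set arithmetic shows where the two analyses overlap and where they diverge.
shared = scoping["safety"] & scoping["mission"]
safety_only = scoping["safety"] - scoping["mission"]
mission_only = scoping["mission"] - scoping["safety"]

print("Analysed by both:", sorted(m.name for m in shared))
print("Safety only:", sorted(m.name for m in safety_only))
print("Mission only:", sorted(m.name for m in mission_only))
```

Even in this toy case only one mode is common to both analyses, so a worksheet built by either team alone would omit modes the other cares about.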
Indeed, different transversal perspectives may also consider different elements of the system. In extremis (and to try to make the point) this could have safety engineers only considering safety functions (and the safety-related systems that implement them), and reliability engineers in the mission team focusing on only data sources, perhaps.
Secondly, what is ‘critical’ from a safety perspective may not be critical from a mission perspective, and vice versa. This can elicit conflicting requirements for actions in the event of the same failure. Safety may require a function to be shut down in the presence of certain failures, whereas ‘mission’ may require it to keep functioning, even in a degraded state. It is a role of the design organisation to manage these conflicts in a systematic manner from an informed position, which cannot be done if a single FMECA is carried out for more than one transversal specialisation.
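One way to picture this is to keep the safety and mission worksheets separate and then compare them, so that conflicting required actions are surfaced explicitly. The sketch below assumes a simplified worksheet row (`FmecaEntry`) and a helper (`find_conflicts`); both names, the fields, and the example rows are hypothetical and are not drawn from any standard worksheet format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FmecaEntry:
    perspective: str      # e.g. "safety" or "mission"
    item: str             # system element under analysis
    failure_mode: str
    criticality: str      # asserted by the analyst, not measured
    required_action: str  # what the perspective needs on failure


def find_conflicts(entries):
    """Return (item, failure_mode) pairs whose required actions disagree."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[(entry.item, entry.failure_mode)].append(entry)

    conflicts = {}
    for key, group in grouped.items():
        actions = {entry.required_action for entry in group}
        if len(actions) > 1:
            conflicts[key] = {e.perspective: e.required_action for e in group}
    return conflicts


# Invented example rows from two separate analyses of the same system.
entries = [
    FmecaEntry("safety", "sensor channel A", "intermittent failure",
               "high", "shut down function"),
    FmecaEntry("mission", "sensor channel A", "intermittent failure",
               "low", "continue in degraded state"),
    FmecaEntry("safety", "data logger", "complete failure",
               "low", "raise maintenance alert"),
    FmecaEntry("mission", "data logger", "complete failure",
               "high", "raise maintenance alert"),
]

conflicts = find_conflicts(entries)
for (item, mode), actions in conflicts.items():
    print(f"{item} / {mode}: {actions}")
```

The comparison only works because each perspective recorded its own criticality and action. Had the two been merged into one row at the outset, the disagreement over the sensor channel would never have been visible to the design organisation.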
Thirdly, the failure modes that ‘mission’ transversals identify as critical may lie on a different causal path from the one on which ‘safety’ has identified an equally ‘critical’ but different failure mode, even within the same system architecture. This probably requires a focus on entirely different elements of the same system. Further, risk (from a safety perspective) is not linear, and is influenced by many socio-technical factors which are themselves influenced by the state(s) of the operating environment. It is therefore time to admit that treating risk as a simple combination of severity and likelihood is perhaps a fallacy, as we have never been clear about what the severity refers to, nor about which event we are attributing a likelihood to.
Finally, the last difference I suggest is more philosophical, perhaps – but as important just the same. Yes, we must have reliable systems and functions. Yes, we must have available functions. But, it is entirely possible to have a highly available, highly reliable, yet wholly unsafe system. It is not clear how a single FMECA could hope to differentiate between the two.
I also have reservations about the publicly available reliability data that is often used to define ‘failure’ rates (which kind?) and to infer a mean time to repair or restoration (MTTR), as the time to repair will be affected by where and how an element is instantiated in a specific design.
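The sensitivity to repair time can be illustrated with the widely used steady-state availability relationship, A = MTBF / (MTBF + MTTR). The sketch below assumes that simple model (constant failure and repair rates, a repairable item) and uses invented figures for the same component installed in two different places; none of the numbers come from a real data source.

```python
HOURS_PER_YEAR = 8760.0


def steady_state_availability(mtbf_hours, mttr_hours):
    """Steady-state availability under the simple MTBF / (MTBF + MTTR) model."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(availability):
    """Expected unavailable hours in one year of continuous operation."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Same component, same assumed MTBF, two hypothetical installations.
mtbf = 10_000.0
accessible = steady_state_availability(mtbf, mttr_hours=2.0)   # easy to reach
buried = steady_state_availability(mtbf, mttr_hours=48.0)      # needs strip-down

for label, a in (("accessible", accessible), ("buried", buried)):
    downtime = expected_downtime_hours_per_year(a)
    print(f"{label}: availability {a:.5f}, about {downtime:.1f} h/year down")
```

With an identical failure rate, the availability figure differs purely because of where the element sits in the design, which is exactly the information a generic data source cannot supply.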
So, can we undertake a single FMECA and use it in the assurance of multiple quality attributes within the same system? Yes, but only if the following hold:
• The system elements are the same
• The system architecture that provides all functions is the same
• The causal paths are the same
• The failure paths are the same
• Reliability considers the reliability of a safety function
• Availability considers the availability of a safe system
• Each failure mode is as critical to safety as it is to ‘mission’ (or other transversals)
• Publicly available reliability and availability data considers more than just ‘plain ol’ loss’ – and its ‘failure’ matches the mode of failure we care about
• Mean Time to Repair aligns with the ability to repair the component(s) in the specific design solution under analysis that fulfils ‘safety’ AND ‘mission’ functions.
In all likelihood, therefore, the answer is in fact ‘probably not’.