Human factors engineering/usability engineering (HFE/UE) for medical devices has matured as an engineering discipline. Use error probability remains controversial, as it appears not to fit in the modern HFE/UE paradigm. However, in this two-part series, I argue that there is room for probability assessments in HFE/UE for medical devices and suggest a means to incorporate them into the paradigm.
An influential work in the history of science, Thomas Kuhn’s “The Structure of Scientific Revolutions,” suggests that science evolves from periods of “scientific revolution” (paradigm-busting shifts in scientific theory or practice that redefine a discipline) to “normal science” (mature, convergent problem-solving under an accepted paradigm with accepted tools).1 With some poetic license, I suggest that engineering disciplines may be viewed similarly. At least, we can characterize an engineering discipline as either stable, mature and convergent, or fluctuating, evolving and divergent.
Leaving aside the historical account, current human factors engineering/usability engineering (HFE/UE) practices for medical devices (including combination products with a device constituent part) may have settled on a stable paradigm and stable tools. Acceptable HFE/UE tools include simulated use studies, task analysis, heuristic reviews, contextual inquiry, etc. The presence of industry standards, regulatory guidelines, and textbook publications further suggests standardization, normality and convergence.2,3,4
However, within modern HFE/UE practices, one element that continues to garner attention is use error probability (a.k.a. likelihood, probability of occurrence, or rate). Here, I argue that there is a need and a place for probability within the HFE/UE paradigm. The hope is that by fitting probability into the current paradigm, fears over its misuse and potentially dangerous effects can be mitigated. Probability works within current HFE/UE practices. No revolution needed.
So what’s wrong with probability?
FDA explains the probability problem as (my emphasis):
“ANSI/AAMI/ISO 14971…defines risk as the combination of the probability of occurrence of harm and the severity of the potential harm. However, because probability is very difficult to determine for use errors, and in fact many use errors cannot be anticipated until device use is simulated and observed, the severity of the potential harm is more meaningful for determining the need to eliminate (design out) or reduce resulting harm.” – FDA
This sets the stage for industry’s approach to usability risk management. Difficulties associated with measuring use error rates lead to the suggestion that they be de-emphasized. Of course, methodological difficulties measuring a phenomenon do not entail that the underlying phenomenon is not important. However, if use error probabilities are difficult to measure, then attempts to measure them may be poor. Poor measurements may be “worse than useless” in risk management, resulting in distinctly poor decisions.* Strochlic and Wiklund further explicate this position, suggesting that focusing on severity of harm can eliminate the “detrimental effects of making wildly incorrect estimates of likelihood.”5
The emerging consensus is echoed in an equally authoritative industry document, IEC/TR 62366-2:2016, which concurs with the FDA, stating, “Because of the difficulty of determining the probability of occurrence of use errors, manufacturers should focus primarily on the severity of the potential harm rather than on the risk derived from the combination of severity and use error probability.” In the end, the de-emphasis of probability is derived from methodological difficulties in measuring it. As a result, we are shifting focus to assessing the severity of harm of use errors and use-related hazards. Such an approach frames the modern emphasis on critical tasks.
Why Are Use Error Probabilities so Difficult to Estimate?
Evaluating medical device usability often involves conducting simulated use studies. During human factors validation testing, FDA recommends a minimum of 15 users per user group.3 As is often stated of the simulated use test paradigm, small sample sizes result in statistically underpowered usability tests.2,3,4,6
In fact, the primary goal of usability testing throughout development is to identify, describe, and assign root causes to use errors.4 Methodologically, such an approach is distinctly qualitative. Quite explicitly according to FDA, “use errors are recorded but the purpose is not to quantify the frequency of any particular use error…”3 IEC/TR 62366-2:2016 further emphasizes the qualitative nature of simulated use studies by stating, “usability studies for summative evaluations are qualitative investigations…” Overall, FDA claims that the 15-participant minimum sample size is reasonable, limits the amount of resources needed for conducting a usability test, and can detect most usability issues.3 Further, IEC/TR 62366-2:2016 suggests that the additional cumulative probability of detecting a use error exponentially decreases when the sample size exceeds 10 users.
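The diminishing return from each added participant can be made concrete with a short derivation. This is a sketch under a simple binomial assumption (independent participants, each committing a given use error with the same probability p); it is an illustrative model, not necessarily the one used in IEC/TR 62366-2:2016. The symbols P(n) and Δ(n) are introduced here for illustration only.

```latex
% Probability that a use error with per-participant probability p
% is observed at least once among n independent participants:
P(n) = 1 - (1 - p)^{n}

% Additional detection probability contributed by the n-th participant:
\Delta(n) = P(n) - P(n-1) = p\,(1 - p)^{\,n-1}
```

Under this model, Δ(n) shrinks geometrically as n grows, so each additional participant contributes less than the one before. How quickly it shrinks depends entirely on p: for likely use errors the curve flattens early, while for rare ones P(n) is still far from 1 at 10 or 15 participants.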
Regardless of the qualitative test paradigm, the ability to detect use errors for further analysis is still rooted in their probability. So, are 10 to 15 participants enough to detect the problems on critical tasks (high severity of harm), regardless of whether one wishes to quantify use error probabilities or qualitatively assess the user interface? FDA bases the 15-participant minimum in part on Faulkner’s empirical study.3,7 In this study, 15 participants interacting with a software user interface detected a minimum of 90% and a mean of 97% of all software user interface issues.
However, Faulkner does not prove that 15 participants can uncover 90-97% of user interface issues, for any usability issue, of any underlying probability of occurrence. Instead, the Faulkner study implies that the user interface issues associated with the studied software interface have a high enough underlying probability of occurrence that they are detectable in a small sample size.7 For a 15-participant study to detect 97% of user interface issues, the underlying probability of occurrence for each issue must be approximately 21% (assuming each issue has the same probability of occurrence). Furthermore, it does not mean that a 15-participant human factors validation test can detect the user interface issues associated with the most serious harms, which may have lower probabilities of occurrence. Faulkner states, “…a glaring problem has a high probability of being found, but a subtle problem has a lower one…Unfortunately, the subtle problem may have more serious implications…”7 Theoretically, given that the severity of a risk and its probability of occurrence are often inversely related, it is likely that the current simulated use test paradigm is best suited for higher likelihood, lower severity use errors.8 At the very least, the only problems on critical tasks that are likely to be detected are the “glaring” ones. There may be several more “subtle” ones that go undetected and hence unanalyzed. Furthermore, the claim that a 15-participant sample size can detect “most” use errors may only be true if most use errors have a high likelihood.3 If a device has many low likelihood use errors, then most of them will go undetected in a 15-participant human factors validation test.
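The 21% figure can be checked with a few lines of arithmetic. The sketch below assumes a simple binomial model: participants are independent, every issue has the same per-participant probability of occurrence, and "detecting 97% of issues" is read as each issue having a 97% chance of appearing at least once. The function names are hypothetical and chosen only for this illustration.

```python
def detection_probability(p: float, n: int) -> float:
    """Chance that an issue with per-participant probability p
    shows up at least once among n independent participants."""
    return 1.0 - (1.0 - p) ** n


def required_probability(n: int, target: float) -> float:
    """Per-participant probability an issue must have so that it is
    seen at least once among n participants with probability `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


if __name__ == "__main__":
    n = 15  # FDA's recommended minimum per user group

    # Underlying probability implied by Faulkner's 90% minimum and 97% mean
    for target in (0.90, 0.97):
        p = required_probability(n, target)
        print(f"detect {target:.0%} with n={n}: p must be about {p:.1%}")

    # Chance of seeing progressively rarer issues in a 15-participant study
    for p in (0.21, 0.10, 0.01, 0.001):
        print(f"p={p:.3f}: seen at least once with probability "
              f"{detection_probability(p, n):.1%}")
```

Under these assumptions, an issue with a 1% per-participant probability would be seen at least once in only about 14% of 15-participant studies, which illustrates how quickly detectability falls off for the "subtle" problems Faulkner describes.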
Prior to the modern HFE/UE paradigm, some medical device developers would conclude that a user interface is safe based on quantitative acceptance criteria. For example, 95% of users completed a task successfully. The modern HFE/UE paradigm was in part born out of the following question from regulators: “What about the 5% that failed?” The question correctly points out the somewhat arbitrary nature of a sponsor’s chosen acceptance criteria. In the modern paradigm, the need for mitigation is based on a root cause assessment of any detected use errors or close calls on critical tasks, regardless of their probability. However, I ask the following of the qualitative paradigm:
- What about the 3-10% of user interface issues that the 15 participants could not uncover and that thus offer no basis for a qualitative investigation?
- What about the rare, but severe use errors that you are unlikely to see in small samples (which may account for more than 3% of all use errors)?
Therefore, I argue that there is a shortcoming in the current paradigm. If we see a use error on a critical task in a 15-participant study, attributable to the design of the device, then the underlying probability of occurrence is likely high (otherwise, it is unlikely that we would have detected it). Fix those issues if possible, regardless of how often they appear in the study. What is problematic are the potential low likelihood use errors on critical tasks. Even if we test the critical tasks with representative users, using representative devices, in representative environments, with 15 participants, we may not see the use error. The error likelihood is too low to detect in small sample sizes. Then, we have limited substrate on which to base a qualitative root cause investigation. Regardless of whether we have the keenest observers and the best interview questions, how do we do a deep qualitative dive into the root cause of use errors we never saw? It’s not that probabilities allow us to hide behind artificial 95% pass rates. It’s that a lack of probabilistic thinking means we never see the 5% that fail. The issue is not that sponsors can try to claim that an observed use error would occur at a low rate. It’s that the need for mitigation, even in the qualitative paradigm, is based on probability by virtue of being based on detectability.
Strochlic and Wiklund sum up the current industry paradigm as follows:
“Because of these inherent conditions in usability testing, error likelihood should be disregarded as a driving factor determining the need for risk mitigation…The updated (i.e. modern) way to approach use-related risk analysis is to base decisions about the need for risk control measures on the severity of the harm that can result from a use error, regardless of whether the likelihood is 1 in 100, 1 in 1,000, or 1 in 10,000.”5
The problem is not whether risk mitigation should be based on likelihood (though in a sense it always is, but that’s for another day). The fact of the matter is that the ability to detect use errors under the current usability test paradigm is based on likelihood, even if explicit decisions are not. In fairness, Strochlic and Wiklund allow that probability estimates may be useful in a final estimate of residual risk. This is precisely what I suggest be developed further in the HFE/UE paradigm.5
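To put the likelihoods quoted above (1 in 100, 1 in 1,000, 1 in 10,000) in perspective, one can ask how many participants a simulated use study would need before such a use error is likely to be observed even once. The sketch below again assumes independent participants with an identical per-participant error probability; the 90% threshold and the function name are assumptions made for illustration, not values taken from FDA guidance or the cited authors.

```python
import math


def participants_needed(p: float, confidence: float = 0.90) -> int:
    """Smallest number of independent participants for which a use error
    with per-participant probability p is observed at least once with
    probability >= confidence (simple binomial model)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)**n >= confidence for n
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for label, p in (("1 in 100", 0.01),
                     ("1 in 1,000", 0.001),
                     ("1 in 10,000", 0.0001)):
        print(f"{label}: about {participants_needed(p)} participants "
              f"for a 90% chance of seeing the error at least once")
```

Under these assumptions, a 1 in 100 use error would require roughly 230 participants for a 90% chance of being observed once, and rarer errors require proportionally more. A 15-participant study is therefore very unlikely to surface such errors, whatever their severity.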
This work is derived in part from the author’s doctorate of engineering dissertation at George Washington University. Read Part II of this series, “So What Do We Do?”.
*The concept of worse than useless is borrowed from Cox (2008). In his example, poorly constructed risk matrices can be “worse than useless” resulting in performance worse than random (i.e. distinctly detrimental).
- Kuhn, T.S. (1970). The Structure of Scientific Revolutions. Second Edition. Chicago: The University of Chicago Press.
- International Electrotechnical Commission. Medical devices – Part 2: Guidance on the application of usability engineering to medical devices. IEC/TR 62366-2:2016. IEC, approved April 2016.
- FDA. (2016). “Applying Human Factors and Usability Engineering to Medical Devices”. Guidance for Industry and Food and Drug Administration Staff. Retrieved from https://www.fda.gov/regulatory-information/search-fda-guidance-documents/applying-human-factors-and-usability-engineering-medical-devices
- Wiklund, M., Kendler, J., and Strochlic, A.Y. (2011). Usability Testing of Medical Devices. Boca Raton: CRC Press.
- Strochlic, A. and Wiklund, M. (November 10, 2016). “Medical Device Use Error: Focus on the Severity of Harm.” MedTech Intelligence. Retrieved from https://www.medtechintelligence.com/feature_article/medical-device-use-error-focus-severity-harm/
- Wiklund, M., Dwyer, A., and Davis, E. (2016). Medical Device Use Error Root Cause Analysis. Boca Raton: CRC Press.
- Faulkner, L. (2003). “Beyond the five-user assumption: Benefits of increased sample sizes in usability testing.” Behavior Research Methods, Instruments, & Computers. 35 (3): 379-383. https://doi.org/10.3758/BF03195514
- Cox, L.A. (2008). “What’s Wrong with Risk Matrices?” Risk Analysis 28(2): 491-512. https://doi.org/10.1111/j.1539-6924.2008.01030.x