(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Going beyond the means: Exploring the role of bias from digital determinants of health in technologies [1]

Marie-Laure Charpignon, Massachusetts Institute of Technology, Institute for Data, Systems, and Society; Laboratory for Information and Decision Systems, Boston, Massachusetts, United States of America

Date: 2023-10

Many of the DDoH mechanisms encountered in medical technologies and formulae result in lower accuracy or lower validity when applied to patients outside the initial scope of development or validation. Our clinical recommendations caution clinical users against fully trusting the validity of results and suggest corroborating them with other measurement modalities that are robust to the DDoH mechanism (e.g., arterial blood gas for pulse oximetry, core temperature for NCIT). Our research recommendations suggest not only increasing diversity in development and validation, but also raising awareness of the types of diversity required (e.g., skin pigmentation for pulse oximetry but skin pigmentation and sex/hormonal variation for NCIT). By increasing diversity so that it better reflects patients in all scenarios of use, we can mitigate DDoH mechanisms and increase trust and validity in clinical practice and research.

In light of recent retrospective studies revealing evidence of disparities in access to medical technology and of bias in measurements, this narrative review assesses digital determinants of health (DDoH) in both technologies and medical formulae that demonstrate either evidence of bias or suboptimal performance, identifies potential mechanisms behind such bias, and proposes potential methods or avenues that can guide future efforts to address these disparities.

Funding: AIW is supported by the Duke Clinical and Translational Science Institute by National Center for Advancing Translational Sciences of the NIH under UL1TR002553. GD is funded by the Institute for Data Valorization (Grant CF00137433), Montréal, and the Fonds de Recherche du Québec (Grant 285289 & 295291). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Moreover, we reference Alba and colleagues in definitions of calibration and discrimination [5]. Calibration refers to the accuracy of absolute estimates, effectively comparing empirical observations with predicted or estimated measurements (e.g., arterial blood gas oxygen saturation versus pulse oximetry) and improving their correlation by adjusting device settings. Discrimination refers to how well a model can differentiate between groups.
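To make the distinction between calibration and discrimination concrete, the following minimal Python sketch summarizes both on a small set of paired readings. The readings, the 88% hypoxemia cutoff, and the function names are illustrative assumptions, not values or methods taken from this review or from Alba and colleagues.

```python
# Hypothetical paired readings (percent saturation); not data from the review.
sao2 = [97.0, 95.0, 92.0, 90.0, 87.0, 86.0, 84.0, 80.0]  # arterial blood gas (reference)
spo2 = [98.0, 96.0, 94.0, 91.0, 90.0, 87.0, 88.0, 83.0]  # pulse oximeter (estimate)


def mean_bias(estimates, references):
    """Calibration summary: average signed error of the estimate."""
    return sum(e - r for e, r in zip(estimates, references)) / len(references)


def auc(scores, labels):
    """Discrimination summary: probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


HYPOXEMIA_THRESHOLD = 88.0  # assumed cutoff, for illustration only
hypoxemic = [r < HYPOXEMIA_THRESHOLD for r in sao2]
risk_score = [-s for s in spo2]  # a lower SpO2 reading means a higher risk score

bias = mean_bias(spo2, sao2)              # how far off the absolute estimates are
discrimination = auc(risk_score, hypoxemic)  # how well readings rank the two groups
```

In this toy dataset the device ranks hypoxemic and non-hypoxemic patients perfectly while still overestimating saturation on average, which shows that good discrimination does not imply good calibration.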

This paper uses the terminology skin tone instead of skin color to describe the color of the skin, including concepts of melanin concentration along with jaundice and other skin pigments. We believe this framing choice is essential as the term is more nuanced and inclusive, including all gradients of skin pigmentation.

Based upon the framework by Kadambi and colleagues [4], we organize this review as follows. First, we describe biases based on characteristics that patients were born with and that are immutable without active intervention, e.g., biological sex at birth or skin tone (Physical and biological bias section). Then, we transition to discuss biases that can result from the confluence of medical technology and community-dependent cultural or social norms, e.g., the quality of electroencephalography (EEG) signals may vary with patient hairstyles (Interaction of human factors and cultural practices section). Finally, we consider biases resulting from design choice or interpretation (e.g., formulae for pulmonary function tests (PFTs) including race as a factor, although valid alternatives excluding it have been proposed) (Interpretation bias section).

This review does not cover the impact of medical technologies relying on artificial intelligence and machine learning, such as algorithms for clinical decision-making. Indeed, insufficient diversity in patient sampling (commonly due to selection bias, inequitable decision-making, or systemic racism) has already been well documented as influencing the performance of these models [3]. This review also excludes the direct impacts of social determinants of health, such as the underdetection of diabetes among patients of color due to factors affecting their access to medical care, which is also a topic well documented in the literature.

While previous articles have focused on social and economic determinants of health, this narrative review investigates digital determinants of health, as defined earlier in this collection [1]. Specifically, it identifies digital technologies and medical formulae that demonstrate evidence of bias or suboptimal performance. Such pitfalls generally arise from insufficient consideration of patient diversity. Herein, we describe some known physical or biological mechanisms underpinning differences among patients, including those based on sex (either current or at birth), race, and ethnicity, and identify ways in which these characteristics affect the accuracy of digital medical technology for some populations. One such example is pulse oximetry: disparities in its performance among racial groups are thought to result from a lack of patient diversity in clinical trials [2]. Another example is body temperature measurement: differential thermoregulation among females affects the estimates provided by some thermometers. Further, we explain possible repercussions of these biases on digital determinants of health, formulate potential reasons why inadequate patient sampling has resulted in such impacts, and derive implications for clinical care. Finally, when applicable, we present existing solutions to mitigate these biases or suggest ways that corrections may be developed.

Novel medical technologies have arisen to assist clinical teams and facilitate diagnosis by physicians, especially under budget constraints: Since 2010, 523 new medical devices have been approved for commercialization by the Food and Drug Administration (FDA). In parallel with this development, retrospective studies have revealed evidence of disparities in access to medical technology and of bias in the measurements resulting from such devices.

Physical and biological bias

The characteristics of a patient’s skin can influence the performance of medical devices, as illustrated by BioMetric Monitoring Technologies (BioMeTs). Pulse oximetry and non-contact infrared thermometry (NCIT) provide 2 such examples.

Pulse oximetry

Pulse oximetry is a common technique for measuring oxygen saturation (SpO2). Physiologically, pulse oximeters compare absorption at 2 wavelengths of light to estimate the ratio between deoxyhemoglobin and oxyhemoglobin in arterial blood, thereby performing a simplified version of spectrometry [7]. Differential performance of pulse oximetry across patient subpopulations has been known for over 4 decades [8–12] but has recently been brought back to the forefront by large-scale data analyses from Sjoding and colleagues, Wong and colleagues, and Henry and colleagues [13–15], which suggest persisting racial-ethnic disparities in oxygen readings. Although the matter is debated, evidence suggests that pulse oximeters overestimate actual oxygen levels in hospitalized and intensive care unit (ICU) patients, especially at lower oxygen saturations [16]. Oxygen saturation measurements may be influenced by melanin (Fig 1), a chromophore of the skin that is present in higher concentrations in patients of darker skin tone and that affects light absorption, a key element underlying this technology [7]. This artifact appears to be the likely mechanism underpinning disparities in the performance of pulse oximeters among racial-ethnic subgroups [8,9].

Fig 1. Differential light absorption as a function of melanin levels and skin thickness. Alone or in combination, these factors can alter the performance of medical devices relying on red and/or infrared light. NCIT, non-contact infrared thermometry. https://doi.org/10.1371/journal.pdig.0000244.g001

In the United States, the FDA, which regulates medical devices and ensures their safety and effectiveness, requires at least 2 individuals or >15% of patients participating in trials evaluating new pulse oximeters to have “darkly pigmented skin.” Yet, clear guidance on how skin pigmentation should be quantified or measured is still lacking [17]. Results from the large studies by Sjoding, Wong, and Henry [13–15] demonstrate persisting limitations, with heterogeneous mean absolute percentage errors in the estimation of blood oxygen saturation across racial-ethnic subgroups, reemphasizing that FDA requirements were insufficient to ensure appropriate calibration of pulse oximeters. In response to the publication by Sjoding and colleagues, the FDA released a warning to raise awareness among clinicians about the potential lower accuracy of pulse oximeters for patients with darker skin. However, the agency currently provides no specific recommendations to counteract such measurement biases in device evaluation or in clinical practice [18].

Several approaches might address the lack of systematic device evaluation. First, as with other medical devices, the evaluation of pulse oximeters can be improved by enrolling more patients in clinical trials and by ensuring a more diverse set of patients, i.e., with a variety of skin tones. Determining the optimal overall sample size and sociodemographic composition of a trial can be challenging. This reality warrants more methodological work to guide power analysis by anticipating effect sizes and calculating adequate population size and distribution across patient strata.
Meanwhile, case studies of a few thousand patients would be an acceptable option. Beyond enhancements in study design and data collection, better representation of patients from different racial and ethnic origins should be sought. Despite guidelines from the National Institutes of Health (NIH) advocating for better representation of communities of color in clinical research [19,20] and similar commitments by the FDA Office of Minority Health and Health Equity [21], barriers to their enrollment remain at multiple levels: (a) systemic (e.g., community hospitals lacking the infrastructure to support clinical trials, despite serving a more diverse population); (b) individual (e.g., the reluctance of healthcare professionals to enroll patients from underrepresented racial-ethnic communities because of an implicit bias that assumes lower adherence to assigned treatment); and (c) interpersonal (e.g., the doctor–patient relationship and the trust that must be built before a patient agrees to join a trial). Moreover, some patients may have a historically motivated mistrust of a research enterprise associated with past violations of their human rights [22,23]. In the United States, the All of Us program launched by the NIH in 2018 was the first step toward improved patient representation. Expected to continue for at least a decade, this study aims to collect data from over 1 million people of different racial-ethnic origins, ages, and backgrounds who live in all parts of the country [24]. Since measurement inaccuracies are thought to be more prevalent among individuals with darker skin [8,9,13–15], it may be relevant to oversample patients across a gradient of darker skin tones when recruiting for studies. Going forward, computational modeling will be key to enhancing study population design by predicting likely effect size ranges via simulations.
For example, tissue-mimicking phantoms that closely reproduce the properties of human tissue can be leveraged to elaborate on existing medical devices or propose new treatment options [25]. These bench-top methods are already used in optics to understand the optical characteristics of biological tissues, standardize bio-optical techniques, and calibrate metrics on human-like tissues before initiating a clinical trial [26]. Similarly, quantitative biology and pharmacology studies increasingly rely on interconnected microphysiological systems or organs-on-chips [27]. Using a feedback loop process, these in vitro studies can be tested experimentally and the results of in vivo tests integrated into the simulation pipeline through iterative updates.

Second, by learning from the discrepancies observed among patients of a given skin tone using paired arterial blood oxygen saturation (SaO2) and pulse oximetry (SpO2) measurements, statistical solutions could potentially be developed to de-bias raw measurements from the pulse oximeter. One way to address these first 2 approaches could be to estimate weights for adjusting measured values as a function of skin tone, age, and other patient characteristics, and then to apply them as part of a correction formula. To our knowledge, existing devices do not currently implement such a strategy.

Third, new pulse oximeter architectures are being developed to address the numerous noise sources in the photoplethysmography (PPG) waveform, which forms the basis for SpO2 measurement. The PPG signal is affected by individual user variations (e.g., skin tone, skin thickness, body mass index (BMI), age, temperature, perfusion index, and sex) and environmental perturbations (e.g., motion artifact, sensor placement). Several studies have shown the use of polarized imaging-based techniques to discriminate between light components reflected from various penetration depths to suppress skin effects and improve SpO2 accuracy [28,29].
Such architectures may reduce inaccuracies in oxygen saturation measurements and ensure similar calibration across subpopulations. Further, these 3 solutions (from revised clinical trial cohort composition principles to the estimation of statistical learning-based correction terms and improved device design) could be combined to allow their respective effects to compound. All recommendations for pulse oximetry have been summarized in Table 1.

Table 1. Summary for pulse oximetry recommendations. https://doi.org/10.1371/journal.pdig.0000244.t001
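The second approach discussed above, a statistical correction learned from paired SaO2 and SpO2 measurements, can be sketched as a separate least-squares line per skin tone group. This is a minimal illustration under strong assumptions: the paired values are invented, the two group labels are hypothetical, and skin tone is the only covariate, whereas the text also mentions age and other patient characteristics. As noted above, existing devices are not known to implement such a strategy.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical paired data: (skin tone group, SpO2 reading, SaO2 reference).
pairs = [
    ("lighter", 98.0, 97.5), ("lighter", 94.0, 93.5),
    ("lighter", 90.0, 89.5), ("lighter", 86.0, 85.5),
    ("darker", 98.0, 96.0), ("darker", 94.0, 91.5),
    ("darker", 90.0, 87.0), ("darker", 86.0, 82.5),
]

# Fit one correction line per group, mapping the device reading to the reference.
models = {}
for group in sorted({g for g, _, _ in pairs}):
    spo2 = [s for g, s, _ in pairs if g == group]
    sao2 = [a for g, _, a in pairs if g == group]
    models[group] = fit_line(spo2, sao2)


def corrected_saturation(group, spo2_reading):
    """Apply the group-specific correction to a raw pulse oximeter reading."""
    intercept, slope = models[group]
    return intercept + slope * spo2_reading


estimate = corrected_saturation("darker", 92.0)
```

In the invented data the overestimation grows at lower saturations for the "darker" group, so its fitted slope is steeper than 1. A real correction would need validation on independent paired measurements before any clinical use.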

Non-contact infrared thermometers (NCITs) and temporal artery thermometers (TATs)

As a vital sign, body temperature is routinely monitored in hospital settings. It is generally used to assess health status, facilitate diagnosis, and target treatments [30,31]. Given the widespread use of non-contact infrared thermometers (NCITs) during the Coronavirus Disease 2019 (COVID-19) pandemic to detect fever associated with Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection [32], the evaluation of potential racial and ethnic biases in the performance of such devices has reemerged. There is a precedent for using NCITs in emergency settings. For example, NCITs have already been used to screen for fever during past epidemics, including SARS in 2003 and H1N1 in 2009 [33–35]; they are also currently recommended as a useful screening device for prevention in the FDA’s COVID-19 pandemic guidelines [36]. Despite widespread adoption of NCITs in response to the COVID-19 pandemic, evidence comparing the performance of NCITs with that of devices commonly used for temperature measurement in adults is lacking. This prompted Australian researchers in May 2021 to study the difference between temperature measurements taken by NCITs and temporal artery thermometers (TATs), considered a gold standard device for inpatient care in Australian hospitals [31]. Both devices use infrared sensors and estimate body temperature from skin temperature measurements. However, patient characteristics such as skin tone and biological sex can affect the accuracy of temperature measurements [31]. Overall, NCITs were less precise than reference TATs, as measured by the absolute mean difference between measurement types [31] (Fig 2). Specifically, according to the Australian study, patients with light skin tone had a larger difference between body temperature estimates resulting from the 2 devices (0.27°C) than those with medium-dark skin tone (0.12°C).
Additionally, NCITs demonstrated a larger difference in females (0.32°C) than in males (0.21°C). In contrast with other medical devices, where inaccuracies mostly arise in darker-skinned individuals, the imprecision of NCITs, as estimated by the absolute mean difference with the reference measurement, was greater for light-skinned individuals.
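The stratified comparison described above can be sketched in a few lines of Python. The paired readings below are invented for illustration and are not data from the Australian study. The sketch also assumes that "absolute mean difference" means the absolute value of the mean signed difference between NCIT and TAT readings; the review does not spell out the exact formula, so this is only one possible reading.

```python
from collections import defaultdict

# Hypothetical paired readings in degrees Celsius: (sex, NCIT, TAT).
readings = [
    ("female", 36.4, 36.8),
    ("female", 36.5, 36.8),
    ("female", 36.9, 37.1),
    ("male", 36.6, 36.8),
    ("male", 36.7, 36.9),
    ("male", 36.8, 37.0),
]


def abs_mean_difference_by_group(rows):
    """Absolute value of the mean (NCIT - TAT) difference within each group."""
    diffs = defaultdict(list)
    for group, ncit, tat in rows:
        diffs[group].append(ncit - tat)
    return {group: abs(sum(d) / len(d)) for group, d in diffs.items()}


by_sex = abs_mean_difference_by_group(readings)
```

The same function could be applied to a skin tone label, or to a combined (sex, skin tone) label, provided each stratum contains enough paired readings to give a stable estimate.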

Fig 2. Discrepancy between NCIT and reference TAT measurements. Source: Data from [31] Khan and colleagues, Comparative accuracy testing of non-contact infrared thermometers and temporal artery thermometers in an adult hospital setting. Am J Infect Control. 2021. NCIT, non-contact infrared thermometry; TAT, temporal artery thermometer. https://doi.org/10.1371/journal.pdig.0000244.g002

Of clinical importance, the difference in body temperature estimates derived from the 2 thermometer types was larger when the actual body temperature was higher than 37.5°C (99.5°F). In these circumstances, using an NCIT rather than a TAT could lead to an incorrect diagnosis since most healthcare providers consider a patient to have a fever when their body temperature exceeds 38°C (100.4°F) [37]. Although these deviations may seem small, the normal body temperature only ranges from 36.16°C to 37.02°C (97.1 to 98.6°F); therefore, the observed differences based on skin tone and sex represent up to 37% of the overall healthy range of body temperatures [38]. Given this tight interval of temperature values, a deviation of 0.5°C spans more than half of this range. Similarly, in another recent retrospective study [39], the use of temporal rather than oral temperature measurements consistently yielded a lower likelihood of identifying fever in Black patients, irrespective of the considered temperature cutoff, while no such difference was found in White patients. In women, temperature fluctuations due to hormone cycles can prevent reliable comparison of temperature measurements over time and thus further complicate patient evaluation and subsequent treatment decision-making. Indeed, the luteal phase of the menstrual cycle (and high-hormone phases in women using oral contraceptives) is associated with an increase in body temperature of 0.5°C [40,41].
In parallel, prior research has shown that females show greater thermal responses to exogenous and endogenous heat loss than males, a likely cause of mismeasurements in body temperature [42]. This influence could affect the infrared energy measured by these devices. Therefore, the inaccuracy of temperature measurements for a given individual may be subject to time-varying perturbations, e.g., during different phases of the hormonal cycle. This reality could have important implications in medical practice. For example, a clinician monitoring a female patient with COVID-19 may not be able to reliably track the status of their patient from daily changes in estimated body temperature [43]. This difficulty can result from underlying device inaccuracy, hormone-induced fluctuations, or a combination of both factors, making causal interpretation challenging. In summary, a negative difference in estimated basal body temperature measurements could lead to underdiagnosis (false negative). Conversely, a positive difference could potentially lead to overdiagnosis (false positive). Existing research suggests that the design of thermometers may have contributed to such discrepancies: the intrinsic properties of a patient’s skin (e.g., tone, thickness, perspiration) [31], which are likely to affect estimated body temperature measurements, may not have been considered by manufacturers while calibrating the device. Going forward, improving the sensitivity-specificity trade-off of NCITs will thus require elaborating patient-specific adjustment factors for temperature. In particular, further studies are needed to determine the influence of other patient-related factors (such as age, blood flow under the skin, metabolic rate, cardiac output, and hormonal levels) on the accuracy and reliability of temperature measurements. This step will be crucial to mitigating the thermometer’s inaccuracy in women and individuals with varying skin tones.
Moreover, more research should be dedicated to characterizing intersectional sources of bias, e.g., biological sex and skin tone, associated with this medical technology to remediate improper device evaluation. To date, no dataset has a sufficiently large sample size to quantify differences between temperature measurements emanating from NCITs and TATs, stratified by biological sex and skin tone. However, multiple sources of bias could compound in practice: although there is currently no evidence for this, device discrepancies associated with more pronounced thermal responses could add to those related to skin tone, yielding higher rates of inaccurate body temperature estimates, for example, among females with fever and with a lighter skin tone. Adequate documentation and reporting of the characteristics of patients enrolled in prospective studies and trials evaluating thermometers should be emphasized. Without such documentation, quantifying digital determinants of health, monitoring their temporal evolution, and correcting for addressable biases in measurements of both body temperature and other vital signs will be infeasible. All recommendations for NCIT have been summarized in Table 2.

Table 2. Summary for NCIT recommendations. https://doi.org/10.1371/journal.pdig.0000244.t002
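The arithmetic behind the NCIT discussion can be checked with a short Python sketch. It uses only the figures quoted in the text (a healthy range of 36.16°C to 37.02°C and a fever cutoff of 38°C). The function names and the example temperatures passed to `fever_call` are illustrative assumptions rather than anything reported in the cited studies.

```python
HEALTHY_LOW_C = 36.16      # lower bound of the normal range quoted in the text
HEALTHY_HIGH_C = 37.02     # upper bound of the normal range quoted in the text
FEVER_THRESHOLD_C = 38.0   # fever is called when temperature exceeds this value


def fraction_of_healthy_range(deviation_c):
    """Express a device deviation as a share of the healthy range width (0.86 C)."""
    return deviation_c / (HEALTHY_HIGH_C - HEALTHY_LOW_C)


def fever_call(true_temp_c, device_bias_c):
    """Compare the fever decision on the true temperature with the decision
    on a reading shifted by a constant device bias."""
    truly_febrile = true_temp_c > FEVER_THRESHOLD_C
    reads_febrile = (true_temp_c + device_bias_c) > FEVER_THRESHOLD_C
    if truly_febrile and not reads_febrile:
        return "false negative"
    if reads_febrile and not truly_febrile:
        return "false positive"
    return "agree"


share_female_gap = fraction_of_healthy_range(0.32)  # largest NCIT-TAT gap in the text
share_half_degree = fraction_of_healthy_range(0.5)

missed = fever_call(38.2, -0.3)   # hypothetical patient, device reads low
spurious = fever_call(37.8, 0.3)  # hypothetical patient, device reads high
```

Under these inputs, the 0.32°C gap corresponds to about 37% of the healthy range and a 0.5°C deviation to about 58%, while a bias of a few tenths of a degree is enough to flip the fever decision for a patient close to the cutoff.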

[END]
---
[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000244

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
