(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Development and validation of a deep learning model for detecting signs of tuberculosis on chest radiographs among US-bound immigrants and refugees [1]

Scott H. Lee (National Center for Emerging and Zoonotic Infectious Diseases, US Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America), Shannon Fox, Raheem Smith

Date: 2024-10

Immigrants and refugees seeking admission to the United States must first undergo an overseas medical exam, overseen by the US Centers for Disease Control and Prevention (CDC), during which all persons ≥15 years old receive a chest x-ray to look for signs of tuberculosis. Although individual screening sites often implement quality control (QC) programs to ensure radiographs are interpreted correctly, the CDC does not currently have a method for conducting similar QC reviews at scale. We obtained digitized chest radiographs collected as part of the overseas immigration medical exam. Using radiographs from applicants 15 years old and older, we trained deep learning models to perform three tasks: identifying abnormal radiographs; identifying abnormal radiographs suggestive of tuberculosis; and identifying the specific findings (e.g., cavities or infiltrates) in abnormal radiographs. We then evaluated the models on both internal and external testing datasets, focusing on two classes of performance metrics: individual-level metrics, like sensitivity and specificity, and sample-level metrics, like accuracy in predicting the prevalence of abnormal radiographs. A total of 152,012 images (one image per applicant; mean applicant age 39 years) were used for model training. On our internal test dataset, our models performed well both in identifying abnormalities suggestive of TB (area under the curve [AUC] of 0.97; 95% confidence interval [CI]: 0.95, 0.98) and in estimating sample-level counts of the same (-2% absolute percentage error; 95% CI: -8%, 6%). On the external test datasets, our models performed similarly well in identifying both generic abnormalities (AUCs ranging from 0.89 to 0.92) and those suggestive of TB (AUCs from 0.94 to 0.99). This performance was consistent across metrics, including those based on thresholded class predictions, like sensitivity, specificity, and F1 score. Strong performance relative to high-quality radiological reference standards across a variety of datasets suggests our models may make reliable tools for supporting chest radiography QC activities at CDC.

Tuberculosis is the second leading cause of death from infectious disease in the world, after COVID-19. The U.S. has relatively low rates of the disease (about 2.5 cases per 100,000 population in 2022), but immigrants, refugees, and other migrants seeking entry into the U.S. often come from areas where background rates are much higher. To help treat these populations and prevent disease from being imported into the U.S., the Division of Global Migration Health (DGMH) at the Centers for Disease Control and Prevention (CDC) oversees a health screening program for applicants seeking entry. To help detect tuberculosis, most of these applicants receive a chest x-ray, which is then checked for signs of the disease. DGMH receives around half a million of these x-rays each year and conducts ad hoc quality control assessments to make sure the panel physicians overseas are interpreting the x-rays according to the screening program's standards. To make these assessments more efficient, we developed a machine learning algorithm that can reliably detect signs of tuberculosis in the x-rays. In testing, the algorithm worked well on a variety of datasets, suggesting it will be a good tool for supporting these important quality control efforts.

Data Availability: Neither trained model weights nor raw images will be made publicly available to protect applicant privacy. Requests for data may be made through CDC's Migration Health Information Nexus at [email protected].

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Introduction

Tuberculosis is an infectious disease caused by Mycobacterium tuberculosis (MTB) that typically affects the lungs [1]. Those who are infected but do not show symptoms have latent tuberculosis infection (LTBI) and may never develop tuberculosis disease. LTBI is not infectious but still needs to be treated to prevent progression to tuberculosis disease. Tuberculosis disease causes coughing, chest pain, fatigue, weight loss, fever, and many other symptoms, and is contagious [2]. It is the 13th leading cause of death in the world and the second leading infectious killer after COVID-19 [1]. In the United States, tuberculosis rates have been declining; the incidence rate for 2021 was 2.4 cases per 100,000 persons, with the majority of reported cases occurring among non-US-born persons (71.4%). Non-US-born persons had an incidence rate 15.8 times higher (12.5 cases per 100,000) than US-born persons (0.8 cases per 100,000) [3].

Every year, approximately 550,000 immigrants and refugees apply to enter the United States. The Division of Global Migration Health (DGMH) within the Centers for Disease Control and Prevention (CDC) has regulatory responsibility to oversee the medical examinations of these applicants. The examinations are conducted overseas in accordance with CDC DGMH's Technical Instructions for panel physicians. All panel physicians are licensed local medical doctors under an agreement with the US Department of State to perform these examinations, and many are affiliated with the International Organization for Migration (IOM), an intergovernmental agency under the United Nations system that supports migrants. IOM works closely with the US Department of State and CDC to ensure the healthy migration of US-bound immigrants and refugees.

DGMH's Technical Instructions for tuberculosis seek to prevent disease importation by detecting and treating infectious tuberculosis before arrival, and to reduce tuberculosis-related morbidity and mortality in these populations. Requirements include a medical history and physical examination. All applicants 15 years and older receive chest x-rays, and anyone with a chest x-ray suggestive of tuberculosis, signs or symptoms suggestive of tuberculosis, or known HIV then has three sputum specimens collected for smears and cultures [4–5]. In September 2018, DGMH began receiving digital copies of chest x-ray images from panel sites, following the rollout of the eMedical system, an electronic health processing system that collects data from the required overseas immigrant examinations. In 2018 alone, 124,551 images for 521,270 applicants were collected, raising the possibility of using machine learning methods to complement DGMH's already effective oversight of the radiologic components of tuberculosis screening for US-bound immigrants and refugees [6].

Artificial intelligence (AI), especially as enabled by deep learning algorithms, has been widely studied for applications in medical imaging. Examples include diabetic retinopathy [7], cardiovascular risk prediction [8], cancer histopathology [9–11], and imaging for musculoskeletal [12–13], cardiac [14], and pulmonary [15] conditions. Models are typically designed for diagnostic tasks, like segmenting anatomical structures or indicating the presence of disease, but they have also been designed for prognostic tasks, like predicting survival time for patients from histopathology whole-slide images [16].

In chest imaging, applications have generally focused on identifying abnormalities associated with specific diseases, like pneumonia [17–18], COVID-19 [19], lung cancer [20–21], and tuberculosis [15, 17, 22]. Recent work [23–24] has broadened the scope to include abnormalities in general. Studies focusing on tuberculosis have ranged from the narrow evaluation of specific models (typically commercial) on relatively small test sets [25–26] to the development of original algorithms from custom large-scale training sets [23, 27–28]. The reference standards for these studies are often mixed, comprising radiological findings, clinical diagnoses, microbiological testing, and nucleic acid amplification testing (NAAT).

Of special note, when laboratory tests are used as reference standards, model performance tends to drop relative to performance against a radiological standard; however, a small number of models have met the World Health Organization's (WHO) Target Product Profile (TPP) for tuberculosis triage tests of at least 90% sensitivity and at least 70% specificity [26, 29] relative to NAAT or culture, even when testing does not rely on initial radiographic interpretation to identify images with abnormalities (see e.g. Qin et al. [26] and Khan et al. [30], where all study participants received both a chest x-ray and either a GeneXpert MTB/RIF test or a sputum culture upon enrollment).

The primary use-cases of models in the literature have mostly been clinical decision support and workflow improvement, with special emphasis on individual-level classification performance (often as measured by AUC), interpretability, and usability. With respect to TB, emphasis has also been placed on the potential for models to bolster TB screening and diagnosis in low-resource settings, e.g., by rank-ordering radiographs in batches by their probability of disease to guide manual review. For this project, we evaluated our models' ability to meet these goals, and we also sought to evaluate their performance in estimating sample-level prevalence, i.e., in predicting the number of abnormal x-rays in a given batch. These measures mirror two important operational use-cases of the model in the overseas screening program: supporting panel physicians in providing high-quality initial reads during the exams (an unplanned but potentially impactful application), and enabling DGMH to conduct quality control (QC) with the radiographs once they have been collected (the primary focus of our current project).

To achieve our goals, we trained and validated models for performing three tasks: classifying images as abnormal (Task 1), classifying images as abnormal and suggestive of tuberculosis (Task 2), and identifying the specific abnormalities in the images (Task 3) (we use the same numbering scheme to identify the corresponding models). To meet the two use-cases above, we tested our models on a variety of datasets, both internal and external, and we measured their performance using two operating points, one chosen to optimize individual-level classification performance and one chosen to optimize accuracy in predicting prevalence. Although we did not formally test abnormality localization methods, e.g., via object detection models, we implemented a number of common saliency methods for visualizing suspected abnormalities on the input images to improve model interpretability and pilot interactive methods for manual review.

Methods

Internal dataset curation and description

For our internal datasets (hereafter HaMLET, from our project title, Harnessing Machine Learning to Eliminate Tuberculosis), we obtained an initial convenience sample of 327,650 digitized radiographs from four sources: eMedical, the US Department of State's immigrant health data system, a web-based application for recording and transmitting immigrant medical cases between panel physicians, the US Department of State, and CDC [31]; the Migrant Management Operational System Application (MiMOSA), the International Organization for Migration's (IOM) refugee health data system; IOM's Global Teleradiology and Quality Control Centre (GTQCC); and a small number of individual US immigrant panel sites that screen a relatively high number of applicants with tuberculosis each year (site names are provided in the Acknowledgments). Importantly, all these sites have experienced radiologists, and most conduct either double or triple readings of all chest x-ray images as a measure of quality control. Regardless of source, all radiographs were stored as Digital Imaging and Communications in Medicine (DICOM) files, and all radiographic findings were extracted directly from the structured entries in the DS-3030 Tuberculosis Worksheet [32] rather than from free-text radiology reports by way of natural language processing (NLP). The set assembled for this project was taken from screenings conducted during a ten-year period from October 2011 to October 2021, and not exclusively from the digitized radiographs routinely received by DGMH since 2018 (S1 Table shows the distribution of exams by region and year). Routine receipt of digitized radiographs began in 2018 with the eMedical rollout, but we also received screenings directly from private immigrant panel sites and from IOM that predated it. We excluded radiographs from applicants less than 15 years of age (n = 52,523), as well as those stored in DICOM files whose pixel arrays were missing, corrupt, or otherwise unreadable by the software we used for extraction (n = 107,115) (Fig 1 shows a flow diagram providing a detailed numerical accounting of these two exclusion steps). The remaining 168,012 radiographs constituted our final dataset, which we split into training, validation, and testing portions following the procedure described below.

Fig 1. Flow diagram detailing the number of radiographs by data source and abnormality status before exclusion for age (All Ages), after exclusion for age under 15 years, and after exclusion based on whether the DICOM pixel arrays were Python-readable (Valid Images). Radiographs were collected between 2011 and 2021 and constitute a convenience sample of the total applicant population for that period. The number of images excluded at each step is shown in the last column as N excl. https://doi.org/10.1371/journal.pdig.0000612.g001

Radiologist annotations

Chest radiograph abnormalities for the immigration exam fall into one of two groups: those suggestive of tuberculosis and those not. Abnormalities suggestive of tuberculosis include: infiltrates or consolidations; reticular markings suggestive of fibrosis; cavitary lesions; nodules or masses with poorly defined margins; pleural effusions; hilar or mediastinal adenopathy; miliary findings; discrete linear opacities; discrete nodule(s) without calcification; volume loss or retraction; and irregular thick pleural reaction. Abnormalities not suggestive of tuberculosis include cardiac, musculoskeletal, or other abnormalities; smooth pleural thickening; diaphragmatic tenting; calcified pulmonary nodules; and calcified lymph nodes. Most abnormal images in our internal validation and test sets were suggestive of tuberculosis. Although we did benchmark our generic model against two open datasets with a wider range of abnormalities (described below), we focused primarily on the tuberculosis classification task for our analysis. Importantly, however, because only a small number of the abnormal images (1,551) were from applicants with active tuberculosis at the time of screening (the vast majority were from applicants who had previously been screened, diagnosed with tuberculosis, and treated, or whose tuberculosis sputum testing results were negative), we chose not to benchmark our models against a microbiological or bacteriologic reference standard, focusing instead on a purely radiological reference standard.
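For readers who want a concrete picture of how these two groups translate into training targets, the sketch below shows one way structured DS-3030-style findings could be mapped to the binary labels for Tasks 1 and 2. The finding names are illustrative placeholders, not the worksheet's actual field names.

```python
# Illustrative mapping from DS-3030-style findings to binary labels.
# Finding names are placeholders, not the worksheet's actual field names.
TB_SUGGESTIVE = {
    "infiltrate_or_consolidation", "reticular_markings_fibrosis",
    "cavitary_lesion", "nodule_or_mass_poorly_defined_margins",
    "pleural_effusion", "hilar_or_mediastinal_adenopathy",
    "miliary_findings", "discrete_linear_opacity",
    "discrete_noncalcified_nodule", "volume_loss_or_retraction",
    "irregular_thick_pleural_reaction",
}

NOT_TB_SUGGESTIVE = {
    "cardiac_musculoskeletal_or_other_abnormality",
    "smooth_pleural_thickening", "diaphragmatic_tenting",
    "calcified_pulmonary_nodule", "calcified_lymph_node",
}

def binary_labels(findings: set[str]) -> tuple[int, int]:
    """Return (abnormal, abnormal_tb) labels for Tasks 1 and 2."""
    abnormal = int(bool(findings & (TB_SUGGESTIVE | NOT_TB_SUGGESTIVE)))
    abnormal_tb = int(bool(findings & TB_SUGGESTIVE))
    return abnormal, abnormal_tb
```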

External test sets

To supplement our internal testing data, we benchmarked our binary models 1 and 2 on four external datasets: ChestX-ray8 [33]; the Montgomery County, USA (MCU) and Shenzhen, China (SHN) tuberculosis datasets [34]; and VinDr-CXR [35]. VinDr-CXR was the largest, with 2,971 images in total, 161 of which were suggestive of tuberculosis; the others ranged in size from 138 images (MCU) to 810 images (ChestX-ray8). For all datasets, we used the testing splits specified in their original publications and the original labels, with the exception of ChestX-ray8, for which we used the refined test labels provided by Google [23, 27]. MCU, SHN, and VinDr-CXR had labels indicating the suggested presence of tuberculosis (reference standards varied by dataset and included radiographic, clinical, and laboratory evidence of disease), but only ChestX-ray8 and VinDr-CXR also had labels indicating a variety of other abnormalities. Because of this imperfect overlap between our classification tasks and the labels in the datasets, VinDr-CXR is the only dataset on which we tested both binary models (1 and 2); for the other three, we tested only Model 1 (ChestX-ray8) or Model 2 (MCU and SHN).

Dataset splitting

For our internal data, we began with 168,012 images in total, which we then randomly split into training (152,012; 15% abnormal), validation (8,000; 50% abnormal), and testing (8,000; 50% abnormal) sets, following a sample size calculation we used to determine the number of images needed to achieve a 5% margin of error in estimating sensitivity (technical details on the procedure are provided in S1 Text). Training images were single-read images randomly drawn from all sites. Testing and validation images for Task 2 had either been double-read as part of the IOM Teleradiology QA/QC program or single-read at a handful of panel sites in areas with high TB burden. For ChestX-ray8, we reserved an additional 8,000 images from the original training data to serve as validation data for Task 1 (in our internal validation dataset, abnormalities not suggestive of tuberculosis were underrepresented, as the abnormal images were almost always abnormal and suggestive of tuberculosis).
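As a rough illustration of the kind of calculation involved (the actual procedure is described in S1 Text, not reproduced here), the textbook Wald-style sample size for estimating a proportion such as sensitivity within a fixed margin of error can be computed as follows:

```python
import math

def n_for_margin(p_expected: float = 0.5, margin: float = 0.05, z: float = 1.96) -> int:
    """Textbook sample size for estimating a proportion (e.g., sensitivity)
    to within +/- `margin` at ~95% confidence; an illustration only,
    not the S1 Text procedure."""
    return math.ceil(z**2 * p_expected * (1 - p_expected) / margin**2)

# Worst case (p = 0.5) requires about 385 positive images, which the 4,000
# abnormal images in each 8,000-image validation and test split easily cover.
print(n_for_margin())  # 385
```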

Operating point selection

When validation data was available, we used it to select two operating points for thresholding the models' predictions on the corresponding test sets: one that maximized Youden's J index (all tasks), and one that minimized the relative error in predicted prevalence (Tasks 2 and 3 only). We named these the "J" and "count" operating points, respectively. Because the proportion of abnormal images in our internal test set differed from the corresponding proportion in the training set, the latter being generally representative of the screening program's data distribution over a multiyear period, the count-based operating points were selected using a reweighting scheme that minimized error in predicting the proportion from the training set using the model's performance characteristics (sensitivity and specificity) on the validation set; this procedure is described in full in S1 Text. Finally, when validation data was not available, as was the case for all external datasets except ChestX-ray8, we selected a single operating point that maximized Youden's J index on the test sets. We provide all operating points in S2 Table.
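The sketch below illustrates the general idea behind the two operating points, assuming a validation set with labels y_val and predicted probabilities p_val; the count-based selection shown here is a simplification of the reweighting scheme described in S1 Text, not the published procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve

def select_operating_points(y_val, p_val, train_prevalence):
    """Simplified sketch: the "J" point maximizes Youden's J on validation
    data; the "count" point is chosen so that the expected proportion of
    positive calls, computed from validation sensitivity/specificity,
    best matches the training-set prevalence."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    sens, spec = tpr, 1 - fpr

    # "J" operating point: maximize sensitivity + specificity - 1.
    j_point = thresholds[np.argmax(sens + spec - 1)]

    # "count" operating point: at prevalence pi, the expected positive-call
    # rate is sens * pi + (1 - spec) * (1 - pi); pick the threshold that
    # brings it closest to pi itself (i.e., toward zero relative error).
    pi = train_prevalence
    expected_rate = sens * pi + (1 - spec) * (1 - pi)
    count_point = thresholds[np.argmin(np.abs(expected_rate - pi))]

    return j_point, count_point
```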

Image preprocessing

After discarding DICOM files with corrupt pixel arrays, we extracted the pixel arrays and saved them as 1024x1024-pixel PNG files. We then used optical character recognition (OCR) software to identify images with evidence of burned-in patient metadata and removed them from the dataset. We describe both of these procedures more fully in S1 Text.
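A minimal sketch of the DICOM-to-PNG step is shown below; the OCR screening for burned-in metadata is omitted, and the rescaling and inversion choices here are assumptions for illustration rather than the pipeline published in the project repository.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dicom_path: str, png_path: str, size: int = 1024) -> None:
    """Extract a DICOM pixel array, rescale it to 8-bit grayscale,
    and save it as a size x size PNG."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)

    # Min-max rescale to [0, 255].
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    pixels = (pixels * 255).astype(np.uint8)

    # Some films store inverted intensities (MONOCHROME1); flip them back.
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        pixels = 255 - pixels

    Image.fromarray(pixels).resize((size, size)).save(png_path)
```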

Model architecture and training procedures

To improve the models' ability to generalize to unseen data, we used a custom image augmentation layer as the input layer, randomly perturbing the brightness, contrast, saturation, and other characteristics of the radiographs during training; value ranges for these perturbations were taken from Majkowska et al. 2020 and remained fixed during training [27]. For the feature extractor, we used EfficientNetV2M [36], pretrained on ImageNet [37]. The final layers in our model were a dropout layer (probability = 0.5, held fixed) and a dense layer with a sigmoid activation and binary cross-entropy loss. We trained all models in minibatches of 12 images (4 per GPU) with the Adam optimizer [38] and a fixed learning rate of 1e-4. For all tasks, we allowed training to continue until AUC began to decrease on the validation data, at which point we saved the model weights and proceeded to testing.
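A minimal Keras sketch of this setup is given below. The augmentation ranges, input shape, and early-stopping patience are placeholders (the published augmentation values follow Majkowska et al. 2020), and the actual implementation is in the project repository linked under Software and hardware.

```python
import tensorflow as tf

def build_model(image_size: int = 1024) -> tf.keras.Model:
    """EfficientNetV2M backbone with dropout and a sigmoid output head."""
    augment = tf.keras.Sequential([
        tf.keras.layers.RandomBrightness(0.2),  # placeholder ranges
        tf.keras.layers.RandomContrast(0.2),
    ])
    backbone = tf.keras.applications.EfficientNetV2M(
        include_top=False, weights="imagenet", pooling="avg")

    inputs = tf.keras.Input(shape=(image_size, image_size, 3))
    x = augment(inputs)
    x = backbone(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# Stop training once validation AUC stops improving, keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_auc", mode="max", patience=1, restore_best_weights=True)
```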

Performance metrics and statistical inference

We calculated common classification performance metrics for all models and test sets, including AUC, sensitivity, specificity, and F1. For tuberculosis-specific datasets, we also calculated specificity at 90% sensitivity and sensitivity at 70% specificity, in line with the WHO's TPP for tuberculosis triage tests for use in community settings. For the HaMLET test set, we calculated the model's relative error in predicting prevalence (i.e., the true number of abnormal-TB images), mirroring our primary operational use-case for the model as a tool for internal QC activities. For all metrics, we calculated bias-corrected and accelerated (BCa) bootstrap confidence intervals [39], down-sampling abnormal images in the bootstrap replicates so that the percentage of abnormal images in each was equal to the percentage in the training data (target percentages for each task are provided in Table 1; see S1 Text for more details). We did not adjust the intervals for multiplicity. Finally, we note that model performance may depend not only on the reference standard used for classification (i.e., radiography vs. bacteriologic or molecular testing) but also on the prevalence and severity of abnormalities present in the population in which the model is intended to be deployed. Applicants who attend immigration-related health screenings tend to be healthier than, say, patients in inpatient settings or those who present with severe clinical symptoms of TB, so we expect some of these metrics to be lower in our study than they would be in a study of patients with more severe disease.
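The snippet below illustrates how a few of these quantities can be computed from model probabilities and a chosen threshold; the BCa bootstrap intervals and the down-sampling of abnormal images are not reproduced here.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

def summary_metrics(y_true, p_pred, threshold):
    """Individual-level metrics plus the sample-level relative error in
    predicted prevalence (illustrative; see S1 Text for the full procedure)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)

    tp = np.sum((y_hat == 1) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))

    # Specificity at 90% sensitivity, in the spirit of the WHO TPP.
    fpr, tpr, _ = roc_curve(y_true, p_pred)
    spec_at_90_sens = 1 - fpr[np.argmax(tpr >= 0.90)]

    return {
        "auc": roc_auc_score(y_true, p_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": f1_score(y_true, y_hat),
        "spec_at_90_sens": spec_at_90_sens,
        "prevalence_relative_error": (y_hat.mean() - y_true.mean()) / y_true.mean(),
    }
```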

Table 1. Distributions of age and sex for the applicants in our training, validation, and testing datasets, along with the geographic distribution of their corresponding health screening exam sites. https://doi.org/10.1371/journal.pdig.0000612.t001

Abnormality localization

We used two saliency methods, Grad-CAM [40] and XRAI [41], to generate abnormality heatmaps for the images. We examined a small selection of the heatmaps for true-positive and false-positive images (abnormal and normal images, respectively, with high model-based probabilities of abnormality) to explore their use as approximate abnormality localization methods. Because we did not have ground-truth bounding box annotations for the images, this step was primarily exploratory.
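For reference, a generic Grad-CAM sketch (not the project's published implementation) is shown below; the convolutional layer name is an assumption and would need to match the trained backbone.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Generic Grad-CAM: weight the chosen layer's feature maps by the
    gradient of the abnormality score, then combine and rescale to [0, 1].
    `conv_layer_name` is a placeholder and must exist in `model`."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]  # sigmoid probability of abnormality

    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # per-channel weights
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # heatmap in [0, 1]
```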

Software and hardware

Our code is publicly available at https://github.com/cdcai/hamlet.git. Complete information on the software and hardware used is available in S1 Text.

[END]
---
[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000612

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
