Heart murmur detection from phonocardiogram recordings: The George B. Moody PhysioNet Challenge 2022 [1]
Matthew A. Reyna (Department of Biomedical Informatics, Emory University, Atlanta, Georgia, United States of America), Yashar Kiarashi, Andoni Elola (Department of Electronic Technology, University of the Basque Country UPV/EHU), et al.
Date: 2023-11
For categorical variables, the entries of the table denote the fraction of the dataset with each possible value. For numerical variables, the entries of the table denote the median and first and third quartiles, respectively, of the values in the dataset, i.e., median [Q1, Q3].
Table 1 summarizes the variables provided in the training, validation, and test sets of the Challenge data. Table 2 summarizes the distributions of the variables provided with the data.
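As an illustration of the summary format described above, the following Python sketch computes per-value fractions for categorical variables and median [Q1, Q3] summaries for numerical variables; the column names and values are hypothetical placeholders rather than the actual Challenge variables.

```python
import pandas as pd

# Hypothetical demographic table; the column names and values are
# illustrative placeholders, not the actual Challenge variables.
df = pd.DataFrame({
    "age_group": ["Child", "Child", "Infant", "Adolescent"],
    "sex": ["Female", "Male", "Male", "Female"],
    "height_cm": [110.0, 121.5, 68.0, 158.0],
    "weight_kg": [18.2, 23.4, 7.9, 45.1],
})

# Categorical variables: fraction of the dataset with each possible value.
for column in ["age_group", "sex"]:
    fractions = df[column].value_counts(normalize=True).round(3).to_dict()
    print(column, fractions)

# Numerical variables: median [Q1, Q3].
for column in ["height_cm", "weight_kg"]:
    q1, median, q3 = df[column].quantile([0.25, 0.5, 0.75])
    print(f"{column}: {median:.1f} [{q1:.1f}, {q3:.1f}]")
```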
In total, the Challenge dataset consisted of 5272 annotated PCG recordings from 1568 patient encounters with 1452 patients. We released 60% of the recordings in a public training set and retained 10% of the recordings in a private validation set and 30% of the recordings in a private test set. The training, validation, and test sets were matched to approximately preserve the univariate distributions of the variables in the data. Data from patients who participated in multiple screening campaigns belonged to only one of the training, validation, or test sets to prevent data leakage. We shared the training set at the beginning of the Challenge to allow the participants to develop their algorithms, and we sequestered the validation and test sets during the Challenge to evaluate the submitted algorithms.
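A patient-grouped split in the spirit of this description can be sketched with scikit-learn's GroupShuffleSplit, which keeps all recordings from a patient in a single partition; note that, unlike the Challenge split, this simple sketch does not match the univariate distributions of the variables, and the patient identifiers are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-ins for recording indices and their patient identifiers.
rng = np.random.default_rng(0)
n_recordings = 1000
patient_ids = rng.integers(0, 300, size=n_recordings)
recordings = np.arange(n_recordings)

# Roughly 60% of recordings (grouped by patient) go to the training set.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=0)
train_idx, rest_idx = next(splitter.split(recordings, groups=patient_ids))

# The remaining 40% is split into validation (10% overall) and test (30% overall).
splitter = GroupShuffleSplit(n_splits=1, train_size=0.25, random_state=0)
val_rel, test_rel = next(splitter.split(rest_idx, groups=patient_ids[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]

# No patient contributes recordings to more than one partition.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
assert set(patient_ids[val_idx]).isdisjoint(patient_ids[test_idx])
```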
The clinical outcome annotations were determined by cardiac physiologists using all available clinical notes, including the socio-demographic questionnaire, clinical examination, nursing assessment, and cardiac investigations. In particular, these notes included reports from an echocardiogram, which is a standard diagnostic tool for characterizing cardiac function. The clinical outcome annotations indicated whether the expert annotator identified normal or abnormal cardiac function. The clinical outcome annotations were performed by multiple experts, none of whom was the expert who performed the murmur annotations.
The murmur annotations and characteristics (location, timing, shape, pitch, quality, and grade) were manually identified by a single cardiac physiologist independently of the available clinical notes and PCG segmentation. The cardiac physiologist annotated the PCGs by listening to the audio recordings and by visually inspecting the corresponding waveforms. The murmur annotations indicated whether the annotator detected a murmur in a patient's PCG recordings, did not detect a murmur, or was unsure about the presence or absence of a murmur. The murmur annotations did not indicate whether a murmur was pathological or innocent.
Each patient’s PCGs and clinical notes were also annotated for murmurs and abnormal cardiac function (described below). These annotations served as the labels for the Challenge.
The choice of locations, the number of recordings at each location, and the duration of the PCGs varied between patients. The PCGs were recorded by multiple operators, but the PCGs for each patient encounter were recorded by a single operator, and they were recorded in a sequential manner, i.e., not simultaneously. The PCGs were also inspected for signal quality and semi-automatically segmented using the three algorithms proposed in [7], [8], and [9] and then corrected, as deemed necessary, by a cardiac physiologist.
Fig 1. Auscultation locations for the CirCor DigiScope dataset [6], which was used for the Challenge: Pulmonary valve (PV), aortic valve (AV), mitral valve (MV), and tricuspid valve (TV).
The PCGs were recorded using an electronic auscultation device, the Littmann 3200 stethoscope, from up to four auscultation locations on the body; see Fig 1.
During the data collection sessions, each participant answered a sociodemographic questionnaire, followed by a clinical examination (anamnesis and physical examination), a nursing assessment (physiological measurements), and cardiac investigations (cardiac auscultation, chest radiography, electrocardiogram, and echocardiogram). The collected data were then analyzed by an expert pediatric cardiologist. The expert could re-auscultate the participant and/or request complementary tests. At the end of a session, the pediatric cardiologist either directed the participant for a follow-up appointment, referred the participant to cardiac catheterization or heart surgery, or discharged the participant.
The CirCor DigiScope dataset [6] was used for the George B. Moody PhysioNet Challenge 2022. This dataset consists of 5268 PCG recordings from one or more auscultation locations during 1568 patient encounters with 1452 distinct patients. The patient population was primarily pediatric, ranging from neonates to adolescents, but it also included pregnant adults; no recordings were collected from fetuses. The dataset was collected during two screening campaigns in Paraíba, Brazil from July 2014 to August 2014 and from June 2015 to July 2015. The study protocol was approved by the 5192-Complexo Hospitalar HUOC/PROCAPE Institutional Review Board under the request of the Real Hospital Português de Beneficiência em Pernambuco. A detailed description of the dataset can be found in [6].
Challenge objective
We designed the Challenge to explore the potential for algorithmic pre-screening of heart murmurs and abnormal heart function, especially in resource-constrained environments [18]. We asked the Challenge participants to design working, open-source algorithms for detecting heart murmurs and abnormal cardiac function from PCG recordings. For each patient encounter, each algorithm interpreted the PCG recordings and/or demographic data and returned murmur and clinical outcome labels.
Challenge timeline. The George B. Moody PhysioNet Challenge 2022 was the 23rd George B. Moody PhysioNet Challenge [11, 18]. As with previous Challenges, the 2022 Challenge had an unofficial phase and an official phase. The unofficial phase (February 1, 2022 to April 8, 2022) introduced the teams to the Challenge. We publicly shared the Challenge objective, training data, example algorithms, and evaluation metrics at the beginning of the unofficial phase. At this time, we only had access to the patients' murmur annotations, so we only asked the teams to detect murmurs. We invited the teams to submit entries with the code for their algorithms for evaluation, and we scored at most 5 entries from each team on the hidden validation set during the unofficial phase. Between the unofficial phase and the official phase, we took a hiatus (April 9, 2022 to April 30, 2022) to improve the Challenge in response to feedback from the teams, the broader community, and our collaborators. During this time, we added the patients' clinical outcomes for abnormal cardiac function to the CirCor DigiScope dataset [6]. The official phase (May 1, 2022 to August 15, 2022) allowed the teams to refine their approaches for the Challenge. We updated the Challenge objectives, data, example algorithms, and evaluation metrics at the beginning of the official phase. At this time, we had access to both the patients' murmur annotations and clinical outcomes, so we asked the teams to detect murmurs and abnormal cardiac function. We again invited the teams to submit their entries for evaluation, and we scored at most 10 entries from each team on the hidden validation set during the official phase. After the end of the official phase, we asked each team to choose a single entry for evaluation on the test set. We allowed the teams to choose any successful model from the official phase, but most teams chose their best-scoring entries. We only evaluated one entry from each team to prevent sequential training on the test set. The winners of the Challenge were the teams with the best scores on the test set. We announced the winners at the end of the Computing in Cardiology (CinC) 2022 conference. The teams presented and defended their work at CinC 2022, which was held in Tampere, Finland. As described in the next section, the teams wrote four-page conference proceedings papers describing their work, which we reviewed for accuracy and coherence. The code for the algorithms will be publicly released on the Challenge website after the end of the Challenge and the publication of the papers.
Challenge rules and expectations. We encouraged the teams to ask questions, pose concerns, and discuss the Challenge in a public forum, but we prohibited them from discussing or sharing their work during the unofficial phase, hiatus, or official phase of the Challenge to preserve the diversity and uniqueness of their approaches. For both phases of the Challenge, we required the teams to submit the complete code for their algorithms, including their preprocessing, training, and inference steps. We first ran each team's training code on the public training set to train the models. We then ran the trained models on the hidden validation and test sets to label the data; we ran the trained models on each patient sequentially to reflect the sequential nature of the screening process. We then scored the outputs from the models using the expert annotations on the hidden validation and test sets. We allowed the teams to submit either MATLAB or Python code; other programming languages were considered upon request, but there were no requests for other languages. The participants containerized their code in Docker and submitted it by sharing private GitHub or GitLab repositories containing their code. We downloaded their code and ran it in containerized environments on Google Cloud. We described the computational architecture of these environments more fully in [12]. Each entry had access to 8 virtual CPUs, 52 GB RAM, 50 GB local storage, and an optional NVIDIA T4 Tensor Core GPU (driver version 470.82.01) with 16 GB VRAM. We imposed a 72-hour time limit for training each entry on the training set without a GPU, a 48-hour time limit for training each entry on the training set with a GPU, and a 24-hour time limit for running each trained entry on either the validation or test set, with or without a GPU. To aid the teams, we shared example MATLAB and Python entries. These examples used random forest classifiers with the age group, sex, height, weight, and pregnancy status of the patient as well as the presence, mean, variance, and skewness of the numerical values in each PCG recording as features. We did not design these example entries to perform well. Instead, we designed them to provide minimal working examples of how to read the Challenge data and write the model outputs.
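The description of the example entries suggests the following minimal sketch of a baseline classifier; the feature encodings, synthetic data, and function names are assumptions for illustration and are not the official example code.

```python
import numpy as np
from scipy.stats import skew
from sklearn.ensemble import RandomForestClassifier

def recording_features(signal):
    """Presence flag plus mean, variance, and skewness of one PCG recording."""
    if signal is None or len(signal) == 0:
        return [0.0, 0.0, 0.0, 0.0]
    signal = np.asarray(signal, dtype=float)
    return [1.0, signal.mean(), signal.var(), float(skew(signal))]

def patient_features(age_group, sex, height, weight, is_pregnant, recordings):
    """Demographic features followed by statistics for each of the (up to four)
    auscultation locations; the numeric encodings here are illustrative."""
    features = [age_group, sex, height, weight, is_pregnant]
    for signal in recordings:
        features.extend(recording_features(signal))
    return np.asarray(features, dtype=float)

# Train on synthetic data purely to show the shape of the pipeline; a real
# entry would read the Challenge training set and its murmur/outcome labels.
rng = np.random.default_rng(0)
X = np.vstack([
    patient_features(rng.integers(0, 5), rng.integers(0, 2), rng.uniform(50, 180),
                     rng.uniform(3, 90), 0, [rng.normal(size=4000) for _ in range(4)])
    for _ in range(100)
])
y = rng.integers(0, 3, size=100)  # e.g., murmur present / unknown / absent
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```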
Challenge evaluation. To capture the focus of the 2022 Challenge on algorithmic screening for heart murmurs and abnormal cardiac function, we developed novel scoring metrics for each of the two Challenge tasks: detecting heart murmurs and identifying clinical outcomes for abnormal or normal heart function. As described above, the murmurs were directly observable from the PCG recordings, but the clinical outcomes were determined by a more comprehensive diagnostic screening, including the interpretation of an echocardiogram. However, despite these differences, we asked the teams to perform both tasks using only the PCGs and routine demographic data, which allowed us to explore the diagnostic potential of algorithmic approaches for the interpretation of relatively easily accessible PCGs. The algorithms for both of these tasks effectively pre-screened patients for expert referral. Under this paradigm, if an algorithm inferred potentially abnormal cardiac function, i.e., the model outputs were murmur present, murmur unknown, or outcome abnormal, then the algorithm would refer the patient to a human expert for a confirmatory diagnosis and potential treatment. If the algorithm inferred normal cardiac function, i.e., if the model outputs were murmur absent or outcome normal, then the algorithm would not refer the patient to an expert and the patient would not receive treatment. Fig 2 illustrates this algorithmic pre-screening process as part of a larger diagnostic pipeline.
Fig 2. Screening and diagnosis pipeline for the Challenge. All patients would receive algorithmic pre-screening, and patients with positive results from algorithmic pre-screening would receive confirmatory expert screening and diagnosis. (i) Patients with positive results from algorithmic pre-screening and expert annotators would receive treatment; they are true positive cases. (ii) Patients with positive results from algorithmic pre-screening and negative results from expert annotators would not receive treatment; they are false positive cases or false alarms. (iii) Patients with negative results from algorithmic pre-screening who would have received positive results from the expert annotators would have missed or delayed treatment; they are false negative cases. (iv) Patients with negative results from algorithmic pre-screening who would have also received negative results from expert annotators would not receive treatment; they are true negative cases. https://doi.org/10.1371/journal.pdig.0000324.g002
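The referral logic described above and depicted in Fig 2 reduces to a simple decision rule over the model outputs; a minimal sketch follows, in which the label strings are assumptions about how the outputs might be encoded.

```python
from typing import Optional

def refer_for_expert_screening(murmur: Optional[str] = None,
                               outcome: Optional[str] = None) -> bool:
    """True if algorithmic pre-screening is positive, i.e., the model output is
    murmur 'Present' or 'Unknown', or outcome 'Abnormal'. The label strings are
    illustrative assumptions."""
    return murmur in ("Present", "Unknown") or outcome == "Abnormal"

def screening_category(referred: bool, expert_positive: bool) -> str:
    """Map the pre-screening decision and the expert annotation to the four
    cases described in the Fig 2 caption."""
    if referred and expert_positive:
        return "true positive"   # referred and treated
    if referred:
        return "false positive"  # false alarm; no treatment needed
    if expert_positive:
        return "false negative"  # missed or delayed treatment
    return "true negative"       # correctly not referred
```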
For the murmur detection task, we introduced a weighted accuracy metric that assessed the ability of an algorithm to reproduce the results of a skilled human annotator. For the clinical outcome identification task, we introduced a cost-based scoring metric that reflected the cost of expert diagnostic screening as well as the costs of timely, missed, and delayed treatment for abnormal cardiac function. The team with the highest weighted accuracy metric won the murmur detection task, and the team with the lowest cost-based evaluation metric won the clinical outcome identification task. We formulated versions of both of these evaluation metrics for both tasks to allow for more direct comparisons; see S1 Appendix for the additional metrics. We also calculated several traditional evaluation metrics to provide additional context to the performance of the models. Cost-based scoring is controversial, in part, because healthcare costs are an imperfect proxy for health needs [13, 14]; we reflect on this important issue in the Discussion section. However, screening costs necessarily limit the ability to perform screening, especially in more resource-constrained environments, so we considered costs as an imperfect proxy for improving access to cardiac screening.
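As a rough sketch of how a class-weighted accuracy of this kind can be computed from a confusion matrix, consider the function below; the weights shown are illustrative assumptions that emphasize the murmur-present and murmur-unknown classes, and the official weights and the cost-based metric are defined in the paper and its S1 Appendix.

```python
import numpy as np

def weighted_accuracy(confusion: np.ndarray, weights: np.ndarray) -> float:
    """Class-weighted accuracy for a confusion matrix whose rows are the expert
    labels and whose columns are the model outputs; correct (diagonal) entries
    are rewarded in proportion to the weight of the true class."""
    correct = weights @ np.diag(confusion)
    total = weights @ confusion.sum(axis=1)
    return float(correct / total)

# Illustrative three-class example (murmur present, unknown, absent).
confusion = np.array([[40,  5,  5],
                      [ 3, 10,  7],
                      [10, 10, 80]])
weights = np.array([5.0, 3.0, 1.0])  # assumed weights for demonstration only
print(round(weighted_accuracy(confusion, weights), 3))  # -> 0.756
```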
[END]
---
[1] Url:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000324
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.