(C) PLOS One
This story was originally published by PLOS One and is unaltered.
A method for rapid machine learning development for data mining with doctor-in-the-loop [1]
Authors: Neva J. Bull (School of Psychological Sciences, University of Newcastle, Callaghan, NSW; Hunter Medical Research Institute, John Hunter Hospital, New Lambton Heights), Bridget Honan (Alice Springs Hospital)
Date: 2023-07
Abstract Classifying free-text from historical databases into research-compatible formats is a barrier for clinicians undertaking audit and research projects. The aim of this study was to (a) develop an interactive, active machine-learning model training methodology using readily available software that was (b) easily adaptable to a wide range of natural language databases and allowed customised researcher-defined categories, and then (c) evaluate the accuracy and speed of this model for classifying free text from two unique and unrelated sets of clinical notes into coded data. A user interface for medical experts to train and evaluate the algorithm was created. Data requiring coding took the form of two independent databases of free-text clinical notes, each of unique natural language structure. Medical experts defined categories relevant to research projects and performed ‘label-train-evaluate’ loops on the training data set. A separate dataset was used for validation, with the medical experts blinded to the label given by the algorithm. The first dataset was 32,034 death certificate records from Northern Territory Births Deaths and Marriages, which were coded into 3 categories: haemorrhagic stroke, ischaemic stroke or no stroke. The second dataset was 12,039 recorded episodes of aeromedical retrieval from two prehospital and retrieval services in the Northern Territory, Australia, which were coded into 5 categories: medical, surgical, trauma, obstetric or psychiatric. For the first dataset, macro-accuracy of the algorithm was 94.7%. For the second dataset, macro-accuracy was 92.4%. The time taken to develop and train the algorithm was 124 minutes for the death certificate coding, and 144 minutes for the aeromedical retrieval coding. This machine-learning training method was able to classify free-text clinical notes quickly and accurately from two different health datasets into categories of relevance to clinicians undertaking health service research.
Citation: Bull NJ, Honan B, Spratt NJ, Quilty S (2023) A method for rapid machine learning development for data mining with doctor-in-the-loop. PLoS ONE 18(5): e0284965.
https://doi.org/10.1371/journal.pone.0284965

Editor: Ernesto Iadanza, University of Siena: Universita degli Studi di Siena, ITALY

Received: October 31, 2022; Accepted: April 13, 2023; Published: May 10, 2023

Copyright: © 2023 Bull et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The study uses two human health-related data sets, both of which require Human Research Ethics Committee (HREC) approval for access and analysis, and both of which are third-party and require direct application to data custodians. In order to access these data, approval is first required through the NT Department of Health/Menzies School of Health Research HREC (email: [email protected]), referencing this study (reference 2021-4056). Once granted HREC approval, data custodians are as follows: first, mortality data for the Northern Territory from 1980-2019 is available through Australian Births Deaths and Marriages (email: [email protected]). The second set, all aeromedical retrievals in the NT from 2018-2019, is available through a combination of the Careflight NT clinical retrieval database (email: [email protected] (Top End)) and the Alice Springs Hospital Medical Retrieval and Consultation Centre (MRaCC) clinical retrieval database (email: [email protected]).

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.
Introduction Electronic health records represent a wealth of information for health service researchers, but their utility is limited by the challenges of extracting meaningful information from vast datasets of unstructured text [1]. Machine learning (ML) tools can be used to classify free text from clinical notes into categories, with useful clinical and research applications. ML has been used to classify cases into a medical speciality type based on clinical notes from inpatient and outpatient healthcare encounters [2]. Similarly, ML has been used to categorise healthcare encounters as either related or not related to falls, to inform clinical and health service interventions [3]. In these examples, ML tools have utilised natural language processing techniques that require significant computer programming and data science expertise, which may be beyond the scope of clinician researchers. Furthermore, the ML tool must first learn from a labelled training data set, which may be difficult, time-consuming, or expensive to obtain [4–6]. There have been recent software developments that make ML readily available to a broader range of low-resource research challenges [7], although such tools have yet to become mainstream in research. The Human-in-the-Loop (HITL) approach incorporates human skills and expertise into ML processes, including the creation of labelled datasets [8, 9]. Compared to an automated approach, this interactive approach may achieve greater accuracy with fewer training labels because of the human expert's capacity to identify patterns from relatively few samples [10]. Investigators requiring bespoke classification of large-scale datasets must either create their own training dataset or use previously labelled datasets.
For a research project investigating the impact of heat waves on specific health outcomes, there were no published examples of ML tools being used to classify prehospital aeromedical triage notes into medical specialty type. Furthermore, previous attempts to classify death certificate records into cause-of-death categories have achieved limited accuracy due to the high number of categories and the use of automated approaches without clinician input. The objective of this study was to assess the accuracy and speed of the HITL ML methodological approach for clinician researchers to classify clinical notes into customised categories for future health services research purposes.
Methods Study design This is a validation study of an ML algorithm to extract clinical categorical data from unstructured text. Two distinct data sources and categorisation tasks from residents of the Northern Territory (NT) of Australia were used. The success of the method was measured in terms of overall accuracy, sensitivity, specificity, and time taken to achieve accuracy over 90% in both datasets. Data sources Access to the data was approved by the Human Research Ethics Committee of the Northern Territory Department of Health and Menzies School of Health Research. All data had been fully anonymized before access and the ethics committee waived the requirement for informed consent. Dataset 1 was a 40-year mortality database from 1980 to 2019 from the Northern Territory Births, Deaths and Marriages registry. The variables included age, sex, Indigenous status (self-reported), location at death, usual residential address, and cause of death as verbatim hand-recorded on the death certificate and then transcribed electronically into the database. The outcome of interest was stroke category (ischaemic stroke, haemorrhagic stroke or no stroke). This text was era-dependent, with medical language and diagnostics changing over the 40-year period. There were two language structures: clinical (when recorded by the certifying doctor) and legal (when reported by the coroner). It was unstructured natural language and had numerous spelling and typographical errors. Examples of the cause of death text are shown in Table 1.
Table 1. Examples of natural language in (a) mortality dataset, (b) aeromedical retrieval dataset.
https://doi.org/10.1371/journal.pone.0284965.t001 Dataset 2 was a database of recorded episodes of aeromedical retrieval from two prehospital and retrieval services in the NT from 2018 to 2019. The variables included were date of retrieval tasking, age, triage priority (1–6 in order of urgency, with 1 representing the highest level of emergency and 6 representing non-urgent), whether a doctor was tasked to accompany the retrieval, and the clinical reason for retrieval recorded in shorthand by the doctor receiving the initial emergency phone call triggering the retrieval. Examples of the reason for retrieval are shown in Table 1. Classification outcomes The coding task chosen for cause-of-death text was classification into 3 mutually exclusive categories: Ischaemic, Haemorrhagic or Not (no mention of stroke). This was not restricted to the word “Stroke” and included anything that could be reasonably interpreted by a clinical expert as meaning stroke, such as CVA, cerebrovascular accident, middle cerebral artery infarct or intracerebral haemorrhage. The coding task for the aeromedical retrieval dataset was coding reason-for-retrieval text into a 5-category classification of in-hospital receiving specialty destination for each retrieval. The original research objective required coding of this data into broad clinical categorisation of hospital speciality destinations, as the pre-hospital ‘reason for retrieval’ was recorded pre-diagnostically. The specialty destination categories were Medical, Surgical, Trauma, Obstetric and Psychiatric (where trauma is a sub-specialty of surgery but a discrete entity in terms of the aetiology of injuries sustained and subsequent retrieval services requested). Creation of validation datasets Two medical specialists developed a set of rules to define coding for each dataset. For instance, where there was mention of both ischaemic and haemorrhagic stroke in any given case, ‘Ischaemic’ was labelled, unless there was a three-month period between events where the haemorrhagic stroke was clearly recorded as the final event.
Subdural and subarachnoid haemorrhages were labelled as ‘Not’, as was intracranial haemorrhage not further specified (as these events were often associated with a fall or trauma); however, intracerebral haemorrhage (which was presumed to be an intra-parenchymal event) was labelled as ‘Haemorrhagic’. Septic emboli to the brain were labelled ‘Not’. Each medical specialist independently labelled 300 cases for each validation dataset. If there was disagreement between the specialists, they attempted to resolve it by consensus or deferred to a third independent medical specialist for a final decision. Preprocessing and selection of ML algorithms The ML implementation was written in C# using .NET Core. The database used was SQL Server and the web application was written in PHP and JavaScript. Data were analysed using the R programming language. An open-source machine learning framework with nine potential candidate algorithms for multiclass classification of text data was used (ML.NET). The preferred algorithm for each task was selected during a process of semi-automated experimentation with train/test sets labelled by a single clinical expert. To minimise over-fitting, comparisons of the accuracy of all algorithms were repeated with systematic exclusion of one or more of the available supplementary features (such as age or gender) until a minimal feature set was found that did not compromise accuracy. Cases with missing supplementary features were included. Parameter experimentation was also performed for each algorithm. The parameters tested were: inclusion of stop words, punctuation and numbers, and varying the length of n-grams and char-grams. Each experiment was performed with an 80–20 train-test split using a consistent seed. The best algorithm for each task was chosen based on micro- and macro-accuracy.
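The micro- and macro-accuracy used here for model selection weight cases differently: micro-accuracy is the overall fraction of correct predictions, while macro-accuracy averages per-class accuracy so that rare categories count equally. A minimal Python sketch of both metrics and of a seeded 80–20 split is shown below (the authors used ML.NET's built-in evaluators; the function names and example labels are purely illustrative):

```python
import random
from collections import defaultdict

def train_test_split(rows, test_frac=0.2, seed=42):
    """Seeded 80-20 split, as in the algorithm-selection experiments."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def micro_macro_accuracy(y_true, y_pred):
    """Micro-accuracy: fraction of all cases predicted correctly.
    Macro-accuracy: mean of per-class accuracies, weighting each
    class equally regardless of how many cases it contains."""
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][0] += (t == p)
        per_class[t][1] += 1
    macro = sum(c / n for c, n in per_class.values()) / len(per_class)
    return micro, macro
```

Because macro-accuracy averages over classes, a model that performs poorly on a rare category (such as haemorrhagic stroke) is penalised even when overall accuracy remains high.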
Label-train-evaluate loop A custom user interface (UI) in the form of a web application was developed that allowed the investigators to complete the workflow remotely. The method maintained 3 separate splits of the whole dataset: a test set that did not change throughout the study, was not included in training and was not visible to the labellers; a training set; and a prediction set. The prediction set comprised the remainder of the dataset after exclusion of the training and test sets. Training only occurred on the training set. Predictions were made against both the test set (to calculate interim accuracy metrics) and the remaining dataset, after exclusion of the training set. Balanced test sets were developed by initially labelling cases at random and then using text search on keywords to add additional cases in the rare categories. The test sets were created by two medical experts who labelled each case independently, and disagreements were resolved by consensus. If consensus could not be reached, a third independent medical expert made the final decision. The method provided a mechanism to select predictions for expert labelling. These newly labelled data were then moved to the training set and excluded from further prediction. Each of the datasets was trained and evaluated by two medical experts in a “label-train-evaluate” loop (Fig 1).
Fig 1. Label-train-evaluate loop.
https://doi.org/10.1371/journal.pone.0284965.g001 After each round of labelling training data, the medical experts triggered retraining of the model on the server by pressing a button on the UI. The UI then displayed new predictions and accuracy metrics in about 12 seconds. The UI allowed predictions to be evaluated at random, by text search or by sorting on confidence scores, and could be filtered based on predicted category. The medical experts could choose to ignore correct predictions with high confidence scores, confirm correct predictions with low confidence scores or relabel incorrect predictions. Confirmed and relabelled predictions were moved to the training set and were not included in future predictions. When consistent misclassifications were detected, keywords were used to find and label additional similar cases to be added to the training set. This workflow was designed with the aim of producing rapid and maximal improvement of the model by focusing labelling efforts on major misclassifications. When high-confidence errors were minimal, cases in each category were listed in ascending order of confidence, allowing more difficult cases to be added to the training set. Validation set After the ‘label-train-evaluate’ loop was concluded, a validation set was selected for labelling by two medical experts who were blinded to the category predicted by the ML model. The validation set was balanced between groups by randomly selecting equal numbers of cases based on the category inferred by the ML model.
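The label-train-evaluate loop described above can be sketched in a few lines of Python. This is a hypothetical illustration only: the authors' implementation was a C#/ML.NET server behind a web UI, and every name below (pool, expert_label, train_model) is an assumption made for the sketch. The core idea is uncertainty-style selection: surface the least confident predictions for expert labelling, then move confirmed or relabelled cases from the prediction set into the training set before retraining.

```python
import random

def label_train_evaluate(pool, expert_label, train_model, rounds=5, batch=20, seed=0):
    """Sketch of a 'label-train-evaluate' loop (hypothetical API).

    pool         -- mutable list of unlabelled cases (the prediction set)
    expert_label -- callable standing in for the medical expert
    train_model  -- callable(training_set) -> model, where
                    model.predict(case) -> (label, confidence)
    """
    rng = random.Random(seed)
    training_set = []
    # Seed the model with a small randomly chosen batch labelled by the expert.
    for _ in range(min(batch, len(pool))):
        case = pool.pop(rng.randrange(len(pool)))
        training_set.append((case, expert_label(case)))
    for _ in range(rounds):
        model = train_model(training_set)
        # Surface the least confident predictions for expert review,
        # mirroring the UI's sort-by-confidence-score view.
        scored = sorted(pool, key=lambda case: model.predict(case)[1])
        for case in scored[:batch]:
            # Confirmed or relabelled cases leave the prediction set
            # and join the training set for the next retraining round.
            training_set.append((case, expert_label(case)))
            pool.remove(case)
        if not pool:
            break
    return train_model(training_set)
```

In the paper's workflow the expert could also skip high-confidence correct predictions and use keyword search to batch-label consistent misclassifications; those refinements are omitted here for brevity.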
Discussion In this study, historically collected free-text data from healthcare records was classified into customised categories for use in health services research projects using a human-in-the-loop interactive ML methodology. The HITL labelling process took 124 minutes and was 94.7% accurate in classifying death certificate data into 3 categories related to stroke diagnosis. For aeromedical retrieval triage data, the HITL labelling process took 144 minutes and was 92.4% accurate in classifying cases into 5 categories related to medical specialty type. As our study demonstrates, the challenge of coding complex natural language databases can have simple solutions using well-developed frameworks for automating ML algorithms (we used ML.NET). This allowed us to use our own understanding of the datasets to which we were applying our research question, and then iteratively train the model until we were satisfied it met our accuracy requirements. Prior to the availability of off-the-shelf software, implementing ML to manipulate data in such a way required high-level software engineering. With the advent of AutoML, in combination with HITL, our study demonstrates a democratisation of ML that will make such techniques available to laboratories that may not have had sufficient funding to employ ML specialists. Previous studies that classify death certificate records into International Classification of Disease (ICD-10) diagnostic codes have not achieved the accuracy of this study [11–13], and thus were not suitable for the task of identifying stroke-related cases from a jurisdictional death registry database. Challenges to accuracy in this field include the high number of categories, unbalanced class frequencies, non-conventional language and abbreviations, and cross-over between categories. The interactive HITL approach allows the clinician to define and customise a small number of categories that are relevant to clinical, health services or research needs.
Furthermore, allowing the content expert to select cases in the training set for labelling may achieve better accuracy with smaller samples, due to the efficiency of human pattern recognition compared to automated processes. ML techniques have previously been applied to prehospital triage data, but these have utilised numerical or categorical variables, such as heart rate or temperature, rather than free text in clinical notes [14]. A supervised ML tool that classified clinical notes into medical specialty types achieved a high degree of accuracy. In that study, two expert clinicians labelled 431 clinical notes by medical specialty category. The authors did not report the time taken for the clinical experts to complete the labelling, nor the time for computer programmers and data scientists to compare and select the best-performing classifier technique. Furthermore, it is unknown how that ML algorithm would perform when applied to novel datasets. Given the variations in language and abbreviations in different health care settings, machine learning tools will perform best when applied to a dataset that is derived from the same source as the training dataset [15]. For this reason, researchers may prefer to create a custom ML tool for each new project, rather than utilising an ML algorithm that was trained on a dissimilar dataset.
Conclusion An interactive approach to machine learning that uses readily available off-the-shelf ML software and medical experts as the “human-in-the-loop” for training data sets was used to rapidly and accurately classify free text from healthcare records into customised categories.
[END]
---
[1] URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284965
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.