BERT based natural language processing for triage of adverse drug reaction reports shows close to human-level performance [1]
Erik Bergman (Swedish Medical Products Agency, Uppsala); Luise Dürlich (Department of Computer Science, RISE Research Institutes of Sweden, Kista; Department of Linguistics and Philology, Uppsala University)
Date: 2024-02
Abstract

Post-marketing reports of suspected adverse drug reactions are important for establishing the safety profile of a medicinal product. However, a high influx of reports poses a challenge for regulatory authorities, as a delay in identifying previously unknown adverse drug reactions can potentially be harmful to patients. In this study, we use natural language processing (NLP) to predict whether a report is of a serious nature based solely on the free-text fields and adverse event terms in the report, potentially allowing reports mislabelled at the time of reporting to be detected and prioritized for assessment. We consider four different NLP models at various levels of complexity, bootstrap their train-validation data split to eliminate random effects in the performance estimates, and conduct prospective testing to avoid the risk of data leakage. Using a Swedish BERT based language model, continued language pre-training and final classification training, we achieve close to human-level performance in this task. Model architectures built on less complex technical foundations, such as bag-of-words approaches and LSTM neural networks trained with random initialization of weights, appear to perform less well, likely because they lack the robustness that a base of general language training provides.
Author summary

Reports of suspected adverse drug reactions are important for drug safety. However, a high influx of reports poses a challenge for regulatory authorities, as a delay in identifying previously unknown adverse drug reactions can potentially be harmful to patients. In this study, we show that such reports can be automatically classified into serious and non-serious cases through the use of natural language processing. We compare four different system architectures for this task and show that an approach based on a pre-trained Swedish language model (BERT) achieves close to human-level performance. Prospective testing suggests that the model performance estimates are relevant for real-world deployment.
Citation: Bergman E, Dürlich L, Arthurson V, Sundström A, Larsson M, Bhuiyan S, et al. (2023) BERT based natural language processing for triage of adverse drug reaction reports shows close to human-level performance. PLOS Digit Health 2(12): e0000409.
https://doi.org/10.1371/journal.pdig.0000409

Editor: Man Luo, Mayo Clinic Scottsdale, UNITED STATES

Received: June 5, 2023; Accepted: November 9, 2023; Published: December 6, 2023

Copyright: © 2023 Bergman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Access to data is restricted by the Swedish Public Access to Information and Secrecy Act. Data access requests should be addressed to
[email protected]. We cannot prejudge the outcome of a data access request, but it is likely that researchers interested in accessing data would need to establish a data protection agreement with the MPA.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.
Introduction

At the time of authorisation, the safety profile of a new medicinal product is limited to the adverse drug reactions (ADRs) frequent enough to be captured within the clinical development programme, while knowledge of rarer side effects can be limited by the size and inclusion criteria of the pivotal trials. Hence, systematic collection of safety data post authorisation is of great importance to further develop the safety profile.

In Sweden, reports of suspected ADRs are received by the Swedish Medical Products Agency (MPA) in electronic form or via paper forms. Reporters can be both healthcare professionals and patients/consumers. Along with a description of the suspected reaction, the reports also include data on whether the reaction led to death, was life threatening, or caused a congenital malformation, hospitalization or prolonged hospitalization, permanent disability or damage, or other important medical events, in line with the definition of a serious ADR [1]. All incoming reports of suspected ADRs are triaged and processed in order of priority by assessors at the MPA, which allows the serious reports to be assessed first. One challenge is that a report describing a potentially serious event will not be prioritised for assessment if it is not labelled as such at the time of reporting.

In this study, we propose to use natural language processing (NLP) to predict whether a report is of a serious nature based solely on the free-text fields and adverse event terms in the report, potentially allowing reports mislabelled at the time of reporting to be detected and prioritized for assessment. We consider four different Swedish NLP models at various levels of complexity, bootstrap their train-validation data split to eliminate random effects in the performance estimates, and conduct prospective testing to avoid the risk of data leakage.

Recently, NLP has undergone a paradigm shift: from developing dedicated feature sets and designing task-specific architectures for tasks such as sentiment analysis, syntactic parsing or part-of-speech tagging, to exploiting large pre-trained models such as BERT [2] for their general language capabilities and fine-tuning them for many different tasks. In the wake of even larger generative models like GPT-3, this pre-training and fine-tuning setup is now being replaced by prompting, which requires no or very few examples to obtain promising performance [3]. However, results on ScandEval, a recently published benchmark of NLP tasks in several Scandinavian languages [4], show that multiple Swedish and Norwegian BERT models on average perform better on Swedish tasks than generative models including GPT-4 [5], and that large Swedish BERT models, along with GPT-4 and GPT-3.5 turbo, are among the top performers for classification tasks in Swedish such as sentiment analysis and linguistic acceptability [6]. In addition, large generative models are typically quite demanding in terms of memory and computational requirements, and we cannot process our reports on outside servers due to privacy concerns, which is why we consider a BERT model as the largest architecture in our experiments. Specifically, we consider two transformer-based architectures using a Swedish BERT language model, one bag-of-words approach, and one approach based on LSTM modelling.
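To make the BERT based setup concrete, the following is a minimal sketch of fine-tuning a pre-trained Swedish BERT model for binary seriousness classification, assuming the HuggingFace transformers and PyTorch APIs. The checkpoint name, hyperparameters, and toy examples are illustrative assumptions, not the exact configuration used in the study.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "KB/bert-base-swedish-cased"  # illustrative Swedish BERT checkpoint

class ReportDataset(Dataset):
    """Pairs each report text (free text plus adverse event terms) with a label."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy examples; real inputs are report narratives plus reported event terms.
train_texts = ["Patienten avled till följd av reaktionen.",
               "Lindrig hudrodnad som gick över inom en dag."]
train_labels = [1, 0]  # 1 = serious, 0 = non-serious

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="seriousness-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ReportDataset(train_texts, train_labels, tokenizer),
)
trainer.train()  # updates the full encoder plus the classification head
```

In this variant the entire encoder is updated during training together with the classification head, corresponding to the architecture with an internal classification layer discussed later.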
Related work

While the idea of automatically classifying incoming reports is not new, previous work has considered slightly different definitions of the classification outcome (importance or seriousness) to meet different needs. Muñoz et al. [7] predict the pharmacovigilance utility of individual case safety reports (ICSRs), i.e., whether a report is likely to be included in a pharmacovigilance review, based on a range of report meta-data and some language-based features such as the length of the narrative and the presence of a set of curated narrative terms. Lieber et al. [8] train a bagging classifier of decision trees for the triage of Dutch ICSRs that require thorough clinical review. Their final model uses a set of 175 features, including general information about patient and case (such as age, gender, weight, drug names, and seriousness information), binary word-occurrence features in the free-text fields for a selection of words deemed relevant by pharmacovigilance experts, and the length of the text fields. In contrast to both these approaches, our approach predicts the seriousness of the report using only textual features and is agnostic of the medicinal product beyond any information present in the free-text field.

Most closely related to our objective, Routray et al. [9] use LSTMs initialized with GloVe embeddings pre-trained on the PubMed corpus to automate binary seriousness classification, assign specific seriousness categories, and identify terms in the reports that support the seriousness category, using only the free-text narrative, the reported adverse events and MedDRA preferred terms. Létinier et al. [10] propose a pipeline for automatic identification and seriousness classification of ADRs in French free-text reports from a single French pharmacovigilance centre. They test a range of machine learning models, both conventional models such as logistic regression, support vector machines and random forests, and deep learning models, and obtain promising results for ADR identification using gradient boosting trees, whereas the performance of all models was much lower for seriousness classification, likely due to the very limited amount of training data. The amount of training data is also cited as a possible reason why the deep models performed worse than boosting and, along with the lack of a French BERT model pre-trained on biomedical text, as one of the reasons against trying a French BERT model.
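As a rough illustration of this family of conventional text classifiers, the sketch below combines bag-of-words TF-IDF features (as used in the follow-up work discussed next) with a gradient boosting classifier in scikit-learn. The feature settings and toy data are assumptions for illustration; this is not a reconstruction of any of the cited systems.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# TF-IDF bag-of-words features feeding a gradient boosting classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2),
                              max_features=50_000)),
    ("clf", GradientBoostingClassifier(n_estimators=300, learning_rate=0.1)),
])

# Toy data; real systems train on thousands of annotated reports.
texts = ["Anafylaktisk reaktion efter injektionen, sjukhusvård krävdes.",
         "Mild huvudvärk under två dagar efter dosjustering."]
labels = [1, 0]  # 1 = serious, 0 = non-serious

pipeline.fit(texts, labels)
print(pipeline.predict_proba(["Patienten fick kramper och lades in akut."]))
```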
On a larger corpus of annotated French ICSRs, Martin et al. [11] present a continuation of the work by Létinier et al. [10], this time comparing gradient boosting trees with general-domain transformer models: XLM [12] for ADR identification, and gradient boosting trees on CamemBERT [13] embeddings for seriousness classification. The gradient boosting models use TF-IDF word vectors and structured features as input for the first task, and FastText word embeddings trained on French medical text for the second. They observe that the two model families perform nearly identically on both tasks, on internal as well as external evaluation data, and report an improvement on both tasks compared to their previous work (albeit on different evaluation data). They argue that both approaches are balanced in terms of their strengths and weaknesses: the gradient boosting trees receive additional structured features in the identification task and domain-specific embeddings in the classification task, whereas the transformer models are more powerful but not adapted to the biomedical domain. The less computationally demanding models, using boosting with TF-IDF and FastText features, are in use by the French national health authorities.

Applications in domains whose language differs strongly from that represented in general pre-trained language models tend to benefit from language models adapted to the domain. For the English language, a range of such domain-specific or mixed-domain models exists for biomedical and clinical language. These include PubMedBERT [14], a BERT model pre-trained from scratch on PubMed abstracts; BioBERT [15], initialized from the original BERT-base model [2] and further pre-trained on PubMed abstracts and PubMed Central full-text articles; BlueBERT [16], also initialized from BERT-base and then pre-trained on PubMed abstracts and de-identified clinical text; ClinicalBERT [17], based on BioBERT and further pre-trained on clinical text; SciBERT [18], trained from scratch on biomedical and computer science publications; and PharmBERT [19], specifically trained on drug labels. These models have been shown to outperform general-domain pre-trained language models such as the original BERT-base model [2] on in-domain NLP tasks such as named-entity recognition, relation extraction, question answering and document classification [14]. For the purposes of our study, however, we did not have access to an existing domain-specific model for Swedish biomedical text. Instead, we chose a BERT model for which at least part of the pre-training data consisted of medicinal product information in Swedish.
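In the same spirit, continued language pre-training of an existing checkpoint on in-domain text can be sketched as follows, assuming the HuggingFace transformers masked-language-modelling utilities. The checkpoint name, corpus, and hyperparameters are illustrative assumptions, not the study's exact setup.

```python
from torch.utils.data import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "KB/bert-base-swedish-cased"  # illustrative base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

class DomainCorpus(Dataset):
    """In-domain text (e.g. historic report narratives) tokenized for MLM."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, max_length=512)

    def __len__(self):
        return len(self.enc["input_ids"])

    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

corpus = ["Patienten rapporterade yrsel och illamående efter dosökning."]  # toy

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1),
    train_dataset=DomainCorpus(corpus),
    # Randomly masks 15% of tokens; the model learns to reconstruct them.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
model.save_pretrained("bert-adr-adapted")  # reused for classification training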
Discussion

In this study, we show that transformer-based language models can provide close to human-level performance on the task of triaging reports of suspected ADRs. Given the increasing number of reports globally and the need for preparedness for future surges in the number of reports, e.g., in relation to urgent public health interventions, we believe our results support the implementation of AI in the field of regulatory pharmacovigilance, augmenting human assessors by automatically detecting reports likely to be upgraded in seriousness during assessment. Such reports can then be prioritised for earlier human assessment, allowing for faster signal detection in the post-authorisation phase of medicinal products. However, as the full process includes additional steps of assessment and annotation, and sometimes collection of additional data, there is still a need for a human in the loop for each case, both to correct model misclassifications and to provide ground-truth annotation for downstream re-training of classification models.

The two BERT based models surpass both classical NLP techniques such as the bag-of-words approach and LSTM neural networks at the current scale of training data, with the model using an internal classification layer slightly outperforming the XGBoost classification from BERT embeddings. The most likely reason why pre-trained models such as BERT outperform models without language pre-training is that BERT's general language understanding capabilities, built on a self-attention mechanism, provide a robustness that cannot be achieved when training simpler model architectures from scratch on our relatively small training data set. In summary, our findings are in line with previous results in the field [2].

The effect of continued language pre-training of the BERT model on performance is slightly more noticeable with the XGBoost classifier than with the BERT classifier. This is likely because, when training the BERT classifier, the entire model is updated, which includes enhancing the domain language knowledge from the training dataset. In contrast, when training the XGBoost classifier on BERT embeddings, the BERT model itself is not updated, and hence the BERT language pre-training on historic reports is more important.
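To illustrate the distinction, below is a minimal sketch of the embeddings-plus-XGBoost variant, in which the BERT encoder is kept frozen and only the gradient-boosted classifier is trained. The checkpoint name, hyperparameters, and toy data are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

MODEL_NAME = "bert-adr-adapted"  # e.g. a domain-adapted checkpoint as above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()  # frozen: the language model receives no gradient updates

@torch.no_grad()
def embed(texts):
    """Return one fixed-size [CLS] vector per report text."""
    enc = tokenizer(texts, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    out = encoder(**enc)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding

# Toy data; only the boosted trees are fitted, not the encoder.
texts = ["Patienten sjukhusvårdades efter allvarlig blödning.",
         "Övergående illamående utan åtgärd."]
labels = [1, 0]  # 1 = serious, 0 = non-serious

clf = XGBClassifier(n_estimators=200, max_depth=6)
clf.fit(embed(texts), labels)
```

Because the encoder is never updated in this variant, any domain knowledge must already be present in the embeddings, which is consistent with continued pre-training mattering more here.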
One challenge posed by datasets of suspected ADR reports is that an event reported as serious is considered true from a regulatory perspective and is not downgraded even if an assessor deems it non-serious; in such cases, the information that the assessor disagrees with the original seriousness label is not captured in our current report processing workflow. A non-serious report, on the other hand, is upgraded if any of the seriousness criteria are met during assessment. This means that the dataset contains a proportion of mislabelled reports that can confuse the models during training and impair an exact measurement of model performance. In summary, to further improve on the current approach, improvements in the input data workstream are needed.

Increased data quality could potentially allow further increases in performance, and performance may also be improved through the use of larger language models such as GPT-3, which may capture the input language semantics in even greater detail, although performance in the Nordic languages does not always surpass locally trained BERT models [6]. Furthermore, full-size GPT-3 models with 175 billion parameters require highly specialized hardware to run locally. Hence, they are currently most often accessed through a non-EU third-party provider, which limits their use to cases where data can be transferred freely.

Looking to the future, the field of NLP and machine learning holds promise for supporting additional steps of the process, such as named entity recognition for matching reported terms to the MedDRA standard and for performing automatic signal detection in databases of suspected ADRs.
Acknowledgments

The authors would like to thank Professor Hercules Dalianis at the Department of Computer and Systems Sciences, Stockholm University, for his contributions as an advisor during the project.
---
[1] URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000409