(C) PLOS One

This story was originally published by PLOS One and is unaltered.

Imbalanced class distribution and performance evaluation metrics: A systematic review of prediction accuracy for determining model performance in healthcare systems [1]

Michael Owusu-Adjei (Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi), James Ben Hayfron-Acquah, Twum Frimpong, Gaddafi Abdul-Salaam

Date: 2024-02

A key component of disease treatment is estimating the outcome after treatment is initiated. An outcome is driven mainly by two critical issues: patient response and efficient treatment strategies on the part of healthcare givers. Developing effective and efficient strategies [1] for managing severely ill patients remains a major challenge for healthcare providers. Increasing morbidity and mortality are undesirable consequences of insufficient care practices for uncontrolled blood pressure by individuals. This is an important justification for adopting predictive learning techniques capable of identifying important correlated factors associated with the incidence of hypertension. Predictive learning techniques assist in providing real-time solutions to low detection rates among many segments of society. Increasing data generation capacity, together with the availability of tools for data collection, has contributed to the adoption of predictive modeling in healthcare systems. Automated systems such as the Internet of Things (IoT), an emerging paradigm [2,3] involving human interactions and the interconnection of devices, have contributed to the large volumes of data available today. Characteristically, healthcare systems generate large volumes of data through the use of connected medical devices such as remote patient monitoring and virtual assistant devices for blood pressure, pulse, heart rate, diabetic monitoring etc. Other connected devices include connected contact lenses, glucose monitors, wearables, fitness tracking devices, virtual healthcare assistants, virtual dispensing assistants etc. Data generated from these applications have been explored in many research works to identify patterns of change using different predictive machine learning (ML) approaches, including non-clinical ones [4], to enhance disease diagnosis for improved treatment outcomes.
Assessing predictive modeling performance has become a focus of many research works, including a review of feature selection methods and predictive model use in lung cancer radiomics [5]. That review found random forest and support vector machine useful in classification tasks across the studies it investigated. Additionally, a study that used environmental parameters to improve deep learning model performance for the prediction of COVID-19 daily cases in 9 cities across three countries in different climatic zones, using a variety of recurrent neural networks (LSTM), concluded that the inclusion of environmental parameters resulted in improved model performance [6]. Diabetes prediction with applied data mining techniques such as random forest, support vector machines, logistic regression and naïve Bayes showed that logistic regression achieved the highest prediction accuracy score, 82.46%, compared to the others [7]. A comparative study of model performance in predictive modeling of cardiac arrest in smokers using heart rate variability parameters showed that random forest achieved the best prediction accuracy score of 93.61%, against 88.50% for logistic regression and 92.59% for the decision tree classifier [8].

Evaluation in general involves three important qualities: being systematic, assessment, and the determination of value, worth and significance. Systematic connotes an interpretation that is structured to give meaning. Different predictive techniques may use different or the same evaluation metrics [9]. For example, the evaluation metrics for ML techniques in classification analysis may be the same as or differ from those used in regression analysis, depending on the problem under consideration. The challenge is deciding which metric to use, when, for what reason and to what benefit. Identifying the appropriate domain of use and the purpose, such as evaluating performance for optimization or estimating the number of correctly classified patients for treatment default, the number of patients with certain types of diseases etc., could lead to better use of predictive models. In this review, we offer a thorough discussion of various performance evaluation metrics in line with the key research question: what are the effects of using the prediction accuracy score, as compared to balanced accuracy, to determine the appropriate machine learning model for predictive performance in datasets with unequal class distributions (imbalanced datasets), which are predominant in real-world applications?
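
To make the research question concrete, the following minimal sketch contrasts the two metrics on an imbalanced dataset. The patient counts are hypothetical and are not taken from any reviewed study; the function names are illustrative. Balanced accuracy is computed here as the mean of sensitivity and specificity, which is its usual definition for binary classification.

```python
# Hypothetical illustration: accuracy vs balanced accuracy on imbalanced data.
# The counts below are invented for the example, not taken from the review.

def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Assumed scenario: 1000 patients, 50 with the disease, and a classifier
# that labels every patient as healthy (it never predicts the positive class).
tp, fn = 0, 50
tn, fp = 950, 0

acc = accuracy(tp, tn, fp, fn)
bal_acc = balanced_accuracy(tp, tn, fp, fn)

print(f"accuracy          = {acc:.2f}")      # 0.95
print(f"balanced accuracy = {bal_acc:.2f}")  # 0.50
```

In this assumed scenario the classifier detects no diseased patient, yet its accuracy is 95% because the majority class dominates the count. Balanced accuracy weights both classes equally and falls to 50%, the level of a trivial classifier.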

1.1 Related works

It is important that the development and evaluation of ML techniques are made transparent and interpretable to allay any doubt about their usability in healthcare systems. Predictive model evaluation, especially in healthcare and other real-world application systems with class distribution inequality, must take into account the peculiarity of the dataset when assessing predictive model performance [10]. The prediction accuracy score is computed from both observed and predicted values. It is predominantly used in classification problems where there is no dataset class imbalance and no skewed class examples. However, one of the challenges identified in many research works is its use as the main performance metric to estimate the best or most appropriate machine learning technique in real-world applications such as healthcare systems, where dataset class distribution inequality is prevalent. The challenge of using prediction accuracy as a measure of model performance is mentioned in a related review that examined the prospects of machine learning use in clinical outcomes [11]. Concerns regarding the use of the prediction accuracy score are shared in a study of disease diagnosis with 20 machine learning techniques, including Naïve Bayes, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Perceptron, Light Gradient Boosting Machine and extreme Gradient Boosting, which addressed this challenge with the f1-score evaluation metric [12]. The prediction accuracy scores obtained ranged between 49% and 77% across the various techniques, whereas the f1-scores ranged between 47% and 82%.
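
As a hedged illustration of why the f1-score can tell a different story from accuracy, the sketch below computes both from lists of observed and predicted labels. The labels are invented for the example and do not reproduce the data of study [12]; the helper names are hypothetical.

```python
# Hypothetical illustration: accuracy and f1-score from observed/predicted labels.
# The label vectors are invented; they do not come from any reviewed study.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy_score(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Assumed scenario: 10 positive and 90 negative cases. The model finds 4 of the
# 10 positives and raises 5 false alarms among the negatives.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 4 + [0] * 6 + [1] * 5 + [0] * 85

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy = {acc:.2f}, f1-score = {f1:.2f}")
```

Under these assumptions the accuracy is 0.89 while the f1-score is about 0.42, because the f1-score ignores the many true negatives and reflects only how well the minority (positive) class is handled.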

Review study of artificial intelligence in disease diagnosis mentioned prediction accuracy as one of the evaluation parameters of interest [13]. Similarly, comparative study of disease prediction with supervised ML techniques also identified prediction accuracy score as performance metric [14]. Similar use of prediction accuracy [15] in assessing best ML technique for breast cancer prediction recorded an accuracy score of 98.7% for techniques such as decision trees and other ensemble techniques. ML principles and applications in real world systems have also been explored [16]. Automatic prediction system for diabetic patients with several ML techniques for explainable artificial intelligence [17] concluded with prediction accuracy score of 81% and auc score of 84%. Additional studies to predict pressure ulcer nursing adverse event [18] using four ML techniques; decision trees, Support Vector Machines, Random Forest and Artificial Neural Networks achieved prediction accuracy score of 94.94% for Support vector machine, 97.93% for Decision trees, 99.88% for Random Forests and 79.02% for Artificial Neural Networks. Determination of appropriate ML algorithms to identify mental health problems [19] in its early stage with techniques such as Logistic Regression, Gradient Boosting, Neural Networks, K-Nearest Neighbor, Support Vector Machine and ensemble techniques showed overall prediction accuracy score of 88.80% achieved by Gradient Boosting. Additional studies to predict heart disease with ML algorithms such as K-Nearest Neighbors (KNN), Naive Bayes and Random Forest singled out Random Forest as the best performing classifier with prediction accuracy score of 95.63% [20]. Further studies for ML use in cardiovascular disease prediction with learning techniques such as support vector machine, convolutional neural networks and boosting classifiers produced prediction roc_auc score of range 81%-97% [21]. 
Diagnosis of breast cancer with learning techniques such as linear discriminant analysis (LDA) and support vector machine (SVM) in various roles had prediction accuracy readings of 99.2% and 79.5% [22]. However, the prediction of breast cancer with Decision tree and Random forest techniques [23] showed prediction accuracy scores of 91.18% and 95.72% respectively. An additional ML application as decision support [24] for the detection of breast cancer through feature selection, with the ML techniques K-Nearest Neighbor, linear discriminant analysis and probabilistic neural network, yielded an accuracy score of 99.17%. Furthermore [25], prediction of breast cancer with an ML-based framework using the techniques Random Forest, Gradient Boosting, Support Vector Machine, Artificial Neural Network and Multilayer Perceptron, aiming at better classification accuracy through correlation-based feature selection together with recursive feature elimination, resulted in a prediction accuracy score of 99.12%. Similarly, with weighting feature and backward elimination feature selection approaches [26], application of the Random forest ML technique to create a computer-aided diagnostic system that distinguishes malignant from benign breast cancer tumors yielded prediction accuracy scores of 99.7% and 99.82% respectively. A study aiming at higher precision and prediction accuracy [27] compared four models: model 1 with all features and no validation, model 2 with all features and K-fold cross-validation, model 3 with feature selection, and model 4 with feature selection together with cross-validation. Using the ML techniques logistic regression, support vector machines, Naive Bayes, decision trees and k-nearest neighbor, it produced a different prediction accuracy score at each stage. The highest accuracy scores recorded were 98.83% for support vector machine, 97.17% for K-Nearest Neighbor and 97.88% for Logistic regression.
Similarly, ML based model for early stage heart disease prediction with techniques support vector machine, K-nearest neighbor, random forest, Naive Bayes and decision tree using feature selection techniques (chi-square, ANOVA, and mutual information) to determine best fit model concluded that Random forest had the highest prediction accuracy score of 94.51% [28].

A related study on the choice of the best ML model for the prediction of breast cancer [29] also had prediction accuracy scores of 98% for Artificial Neural Network, 98% for Decision tree classifier, 99% for K-Nearest Neighbor, 98% for Logistic regression and 100% for Support vector machine. Risk prediction and diagnosis [30] of breast cancer through a comparative analysis of ML techniques, assessing model efficiency and effectiveness with respect to prediction accuracy, precision, sensitivity and specificity, showed that support vector machine had the highest prediction accuracy of 97.13% with the least error rate. A related study [31] to predict and diagnose breast cancer using ML techniques, and to determine the best model with evaluation metrics such as confusion matrix, accuracy and precision, showed that Support Vector Machine achieved the greatest prediction accuracy score of 97.2% among the ML techniques compared (Random Forest, Logistic Regression, Decision tree (C4.5) and K-Nearest Neighbors). The continued use of models such as Support vector machines, Logistic regression, Random forest and Clustering in classification problems such as chronic disease diagnosis is emphasized in a related study that found them to be useful [32]. Similarly, the prediction of treatment trends for patients suffering from hypothyroidism using sodium levothyroxine with ML techniques showed that extra-trees achieved the better prediction accuracy of 84% [33]. Following from this is a predictive study of chronic kidney disease [34] with three ML techniques, namely Random forest, Support Vector machine and Decision tree, together with the recursive feature elimination technique. This study showed different prediction accuracy scores depending on whether feature selection was used.
The prediction accuracy scores recorded with feature selection were as follows: 99.8% for Random forest, 95.5% for Support vector machine and 98.6% for Decision tree. An additional study on predictive modeling of chronic diseases, such as sclerosis progression over 6 and 10 year periods, using ML techniques [35] such as K-nearest neighbor, Support vector machine, Decision tree and Logistic regression reported the performance evaluation metrics area under the curve (auc), sensitivity, specificity, geometric mean and f1-score for each period. The auc scores for disease severity in the 6th year were KNN 74%, Decision tree 74%, Linear regression 80% and Support vector machine 80%. The auc scores for disease severity in the 10th year were KNN 67%, Decision tree 57%, Linear regression 67% and Support vector machine 73%.

Furthermore, a study [36] on the detection of chronic kidney disease, intended to show important correlations or predictive attributes using ML techniques (k-nearest neighbors, random forest and neural networks) and 24 features, used accuracy, root mean squared error (rmse) and f1-score as evaluation parameters. A prediction accuracy score of 99.3% for the Random forest classifier was achieved. Additional research to identify advanced chronic kidney disease with the ML techniques generalized linear model network, random forest, artificial neural network and natural language processing [37] showed improved prediction performance in accuracy score as reported. The prediction accuracy scores for the ML techniques used, on training data and testing data respectively, were: Logistic regression 81.8% and 81.9%, Random forest 91.3% and 82.1%, Decision tree 86.0% and 82.1%. Its conclusion recommends improving on the achieved prediction accuracy scores. Application of deep learning techniques for the prediction and classification of hypertension with related variables [38] showed the following prediction accuracy scores: Deep neural network (75%, 73.9%, 74.3%, 74.3%) and Decision tree (67.6%, 68.4%, 69%, 68%). A related study [39] on the prediction of hypertension using features such as patient demographics, past and current patient health condition and medical records for the determination of risk factors using an artificial neural network showed a prediction accuracy score of 82%.

Understanding disease symptoms is one sure way of effectively controlling and managing its treatment outcome. Predictive modeling [40] of heart disease risks and its symptoms using ML techniques will ensure effective patient care. Implementation of heart disease risk prediction using six ML techniques (support vector machine, Gaussian Naive Bayes, Logistic regression, light gradient boosting model, extreme gradient boosting and Random forest) showed the following predicted accuracy score; 80.23%, 78.68%, 80.32%, 77.04%, 73.77% and 88.5% respectively.

A population level-based approach [41] for predicting hypertension using ML techniques (extreme Gradient Boosting, Gradient Boosting Machine, Logistic Regression, Random forest, Decision tree and Linear Discriminant Analysis) had predicted accuracy score of 90% for (extreme Gradient Boosting, Gradient Boosting Machine, Logistic Regression and Linear Discriminant Analysis) as compared to 89% for Random forest and 83% for Decision tree.

1.1.0 Accuracy score in non-health settings. Related research perspectives in other real-world applications such as spam message detection, fraud detection and risk estimation/forecasting are explored in this section. The risks of spam messaging and its impact on business operations are far reaching; they include hacked systems and ransom demand payments, destruction of critical data and infrastructure, and many others. Applying effective and efficient ML modeling techniques that identify important characteristics for the detection and subsequent prevention or destruction of the threats posed continues to engage research attention. A study to detect spam threats [42] in emails and IoT platforms using Naïve Bayes, decision trees, neural networks and random forest together with other techniques reported prediction accuracy and precision scores as follows: for Support Vector Machine and Naive Bayes, accuracy 96.9% and precision 93.12%; and for Naive Bayes, accuracy 99.46% and precision 99.66%. Similarly, transformer-based embedding with ensemble learning techniques for spam detection showed a prediction accuracy score of 99.91% [43]. Furthermore, application [44] of a hybrid algorithm for the detection of malicious spam messaging in email with the ML techniques Naive Bayes, Support vector machines, Logistic Regression and Random Forest showed prediction accuracy scores of 96.15% for Naive Bayes, 96.15% for support vector machine, 98.08% for Logistic regression and 95.38% for Random forest. Evaluation of automatic short message service performance [45] using Naive Bayes, BayesNet, C4.5, J48, Self-organizing map and Decision tree showed prediction accuracy scores of 89.64%, 91.11%, 80.24%, 79.2%, 88.24% and 75.76% respectively.
A comparative performance evaluation to improve prediction accuracy [46] of two ML models, support vector machine and random forest, for the detection of junk mail spam showed prediction accuracies of 93.52% for Support vector machine and 91.41% for Random forest. Related to improving prediction accuracy is the issue of improving training time and reducing the prediction error rate. An ML-based hybrid bagging technique [47] using random forest and decision tree (J48) for the analysis of email spam detection showed a 98% prediction accuracy score. Other performance metrics evaluated include true negative rate, false positive rate, false negative rate, precision, recall and f-measure (f1-score). The increase in online transactions, including online payments, has also increased the risk of credit card fraud. An ML-based credit card fraud detection system [48] using a genetic algorithm with the learning techniques Decision Tree, Random Forest, Logistic Regression, Artificial Neural Network and Naive Bayes showed that genetic algorithm feature selection led to a prediction accuracy score of 100% for both Decision tree and Artificial neural network. Related to study [48] is a financial fraud detection system in healthcare that uses ML techniques such as deep learning to address the challenge of credit card fraud monitoring [49]. Applying the ML techniques Naive Bayes, Logistic Regression, K-Nearest Neighbor, Random Forest and Sequential Convolutional Neural Network resulted in prediction accuracy scores of 96.1%, 94.8%, 95.89%, 97.58% and 92.3% respectively. Various organizations have adapted and adopted strategies to deal with the challenge of fraud detection. One such solution is provided by [50], which implemented an ML-based self-analyzing system to flag potentially fraudulent activities for review.
A case study approach [51] to reviewing ML techniques (logistic regression, decision tree, random forest, K-Nearest Neighbor and extreme Gradient Boosting) in credit card fraud detection evaluated best model prediction performance using accuracy, recall, precision and f1-score metrics. The study identified Logistic regression and K-Nearest Neighbor as the best performing classifiers. Implementation of fraud detection tools [52] to identify anomalies in financial applications using outlier detection techniques such as Local outlier factor, Isolation forest and Elliptic envelope, together with ML techniques (Random forest, Adaptive boosting and extreme gradient boosting), showed a prediction accuracy score of 99.95%. Modeling [53] of medical visits by patients suffering from diabetes with the ML techniques logistic regression, support vector machine, linear discriminant analysis, quadratic discriminant analysis, extreme gradient boosting, neural networks and deep neural network obtained a balanced accuracy score of 65.7%. Similarly, predicting length of stay [54] from admission to clinical ward with the ML techniques random forest, decision trees, support vector machine, multi-layer perceptron, adaboost and gradient boost concluded with random forest as the best performing technique, with a balanced accuracy score of 72% at the initial stage of admission and 75% in-admission. However, an up-sampling approach [55] for breast cancer prediction using k-nearest neighbor, decision tree, random forest, neural networks, support vector machine and extreme gradient boosting obtained a balanced accuracy score of 97.47%.
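
The up-sampling idea mentioned above can be sketched as simple random oversampling of the minority class. This is a minimal illustration under stated assumptions, not the procedure used in study [55]: the function name `random_oversample`, the toy data and the fixed seed are all hypothetical.

```python
# Hypothetical sketch of random oversampling (one simple form of up-sampling).
# It is not the method of any reviewed study; names and data are illustrative.
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    counts = Counter(labels)
    target = max(counts.values())      # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        indices = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - n):    # number of extra copies this class needs
            j = rng.choice(indices)    # sample with replacement
            out_rows.append(rows[j])
            out_labels.append(cls)
    return out_rows, out_labels

# Toy training data: 8 negative cases and 2 positive cases.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(labels))      # class counts before: 8 vs 2
print(Counter(new_labels))  # class counts after: 8 vs 8
```

A design note on this sketch: resampling is normally applied to the training split only, so that duplicated minority examples do not leak into the data used to estimate accuracy or balanced accuracy.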

1.1.1 Related works summary. The systematic review of related research works had several key objectives, among them the search for literature with the following characteristics: a focus on the current state of knowledge with respect to ML techniques, applications and evaluations; research works with the prediction accuracy score as an evaluation metric; and research works in real-world contexts with unequal class distributions using relevant methodologies. No specific search timeline was defined for this review; the motivation for not specifying a search period was to include as many important related works as possible irrespective of their date of publication. Of particular interest was work on healthcare systems and other real-world applications (spam detection, fraud prediction, risk prediction etc). A summary of the characteristics identified among the selected reviewed literature, with emphasis on the prediction accuracy score as a performance metric, is presented in Table 1. Literature search sources were Google Scholar and other online journal databases such as IEEE, PubMed, Hindawi journals, BioMed Central, PMC, Elsevier, ScienceDirect, organizational websites, online libraries and many other journals. A total of 80 articles were screened for relevancy, and the inclusion criterion was related works in healthcare practice that had used predictive machine learning in disease diagnosis, prediction, risk or treatment assessment. Literature on ML applications in other relevant settings, such as spam detection in mails and sms spamming, was also considered. No time frame exclusion criterion was used, but about 80% of the selected materials were works published between 2016–2022, with a handful from 2023. Observations from the related literature indicate extensive use of ML techniques in real-world applications for various reasons, including serving as decision support systems.
Predominantly used techniques include Random forest, Support vector machine, Logistic regression, K-Nearest Neighbor, Decision trees, Gradient boosting classifier and a few ensemble techniques. The use of performance evaluation metrics such as precision, recall, f1-score, prediction accuracy and, in some instances, predicted positive and predicted negative values is observed. Of interest is the use of prediction accuracy as the predominant metric for assessing model performance, found among all the related literature reviewed.

Table 1. Reviewed literature descriptions. https://doi.org/10.1371/journal.pdig.0000290.t001

[END]
---
[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000290

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/