(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.

Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

Predicting diarrhoea outbreaks with climate change

Authors: Tassallah Abdullahi, Geoff Nitschke, Neville Sweijd
Affiliations: Department of Computer Science, University of Cape Town, Cape Town, Western Cape, South Africa; Applied Centre for Climate and Earth Systems Science

Date: 2022-05

Abstract Background Climate change is expected to exacerbate diarrhoea outbreaks across the developing world, most notably in Sub-Saharan countries such as South Africa. In South Africa, diarrhoea-related diseases are a leading cause of morbidity and mortality. In this study, we modelled the impacts of climate change on diarrhoea with various machine learning (ML) methods to predict daily diarrhoea cases in nine South African provinces.
Methods We applied two deep learning (DL) techniques, Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), and a Support Vector Machine (SVM) to predict daily diarrhoea cases over the different South African provinces by incorporating climate information. Generative Adversarial Networks (GANs) were used to generate synthetic data, which were used to augment the available data-set. Furthermore, Relevance Estimation and Value Calibration (REVAC) was used to tune the parameters of the ML methods to optimize the accuracy of their predictions. Sensitivity analysis was also performed to investigate the contribution of the different climate factors to the diarrhoea prediction method.
Results Our results showed that all three ML methods were appropriate for predicting daily diarrhoea cases with respect to the selected climate variables in each South African province. However, the level of accuracy for each method varied across different experiments, with the deep learning methods outperforming the SVM method. Among the deep learning techniques, the CNN method performed best when only the real-world data-set was used, while the LSTM method outperformed the other methods when the real-world data-set was augmented with synthetic data. Across the provinces, the accuracy of all three ML methods improved by at least 30 percent when data augmentation was implemented. In addition, REVAC improved the accuracy of the CNN method by about 2.5 percent in each province. Our parameter sensitivity analysis revealed that the most influential climate variables to be considered when predicting diarrhoea outbreaks in South Africa were precipitation, humidity, evaporation and temperature conditions.
Conclusions Overall, our experiments indicated that the prediction capacity of our DL methods (Convolutional Neural Networks) was superior (with statistical significance) in terms of prediction accuracy across most provinces. This study’s results have important implications for the development of automated early warning systems for diarrhoea (and related disease) outbreaks across the globe.

Citation: Abdullahi T, Nitschke G, Sweijd N (2022) Predicting diarrhoea outbreaks with climate change. PLoS ONE 17(4): e0262008. https://doi.org/10.1371/journal.pone.0262008
Editor: Jie Zhang, Newcastle University, UNITED KINGDOM
Received: March 8, 2021; Accepted: December 15, 2021; Published: April 19, 2022
Copyright: © 2022 Abdullahi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The real-world diarrhoea case data used in this study contain protected health information and cannot be published for reasons of data protection. These data are available from Clicks Group Limited, South Africa, but restrictions apply to their availability: they were used under license for the current study and so are not publicly available. The data are, however, available from Dr. Neville Sweijd ([email protected]) for the Council for Scientific and Industrial Research (CSIR) or from [email protected] for Clicks Group, after receipt of permission from Clicks Group Limited, South Africa (https://www.clicksgroup.co.za/) or the CSIR (https://www.csir.co.za/). Furthermore, I confirm that others would be able to access these data by request and permission from Clicks Group Limited, South Africa. I also confirm that no special privileges were involved when the data were secured. All other datasets (real-world climate data and synthetic datasets for all variables including diarrhoea) used in this study can be accessed from our GitHub repository https://github.com/aminalawal/Predicting-Diarrhoea-Outbreak-with-Climate-Change.
Funding: This study was supported by the South African National Research Foundation (NRF): 403 Human and Social Dynamics in Development Grant (Grant No. 118557).
Competing interests: The authors have declared that no competing interests exist.

Introduction Diarrhoea is a major health concern and remains among the leading causes of global morbidity and mortality across all ages [1, 2]. Annually, over 2.5 million deaths attributed to diarrhoea are recorded worldwide [3]. The World Health Organization reported that the Sub-Saharan Africa (SSA) and South Asia regions account for more than 80 percent of the global total [1, 3]. Within the SSA region, South Africa is one of the most affected countries. In 2010 and 2015, diarrhoea was reported to be among the top ten leading causes of years of life lost among South African residents [4]. Diarrhoea also accounts for three percent of total deaths among individuals of all ages in the country [5]. Some studies, such as [6, 7], have shown that diarrhoea infections in South Africa are attributable to nosocomial or community-acquired infections resulting from food and water contaminated by a range of pathogens. However, studies by [8, 9] reported that climate factors and weather variability influence the abundance and seasonality of the pathogens present in the environment; thus, the prevalence of diarrhoea can be linked to extreme weather events. South Africa is a region that experiences significant temperature and precipitation anomalies, which are factors that play a vital role in the long-term trends of diarrhoea [10, 11]. For example, in the Western Cape province of South Africa, the rate of diarrhoea hospitalizations was strongly linked to increases in minimum and maximum temperature [7]. A study in Limpopo province showed that seasons when the precipitation rate was below normal coincided with a high number of diarrhoea cases [9]. Thus, the development of a model with the ability to capture complex relationships and long-term dependencies between climate factors and diarrhoea may be effective for diarrhoea predictive analysis.
A diarrhoea predictive model could be used for public health surveillance, as it would offer timely detection and prompt notification for the control of diarrhoea outbreaks. Several studies have developed models for investigating diarrhoea outbreaks in various communities. The vast majority were developed with statistical models such as the Auto-regressive Integrated Moving Average Model (ARIMA) [12], Poisson Regression [7], the Auto-regressive Analysis of Covariance Model (ANCOVA) [13] and Time-series Log Linear Regression [8]. For instance, a study by [12] used the influence of climate variables to develop an ARIMA model that predicts the daily incidence of diarrhoea in Beijing. The Poisson Regression model was also used by [7] to assess the relationship between diarrhoea cases and temperature variability in South Africa. Although these studies have proven useful, other studies such as [14, 15] have shown that traditional statistical models and frameworks are often limited for the analysis of high-dimensional, imbalanced, and non-linear data. In addition, these studies [14, 15] reported that the limitations of statistical models can be addressed using Machine Learning (ML) methods. ML methods are known for their ability to handle high-dimensional data and model complex predictive problems. Several supervised learning-based ML techniques such as Support Vector Machines (SVMs) [16] and deep learning techniques such as Convolutional Neural Networks (CNNs) [17] and Long Short-Term Memory Networks (LSTMs) [18] have been applied in medical research for developing predictive and diagnostic models for various diseases [14, 15]. For example, CNNs have been used for the detection of malaria parasites [19] and tuberculosis [20] in individuals. LSTMs have also been used to predict outbreaks of diseases such as typhoid, chicken pox and scarlet fever [14]. SVMs have also been used for hepatitis detection [21].
These ML methods are widely used for modelling infectious diseases because of the numerous advantages they possess. For instance, CNNs are popular for their powerful feature extraction capabilities [17]. LSTMs are commonly used to handle sequential tasks such as time series forecasting because of their ability to capture long-term dependencies [14]. SVMs are widely accepted for their ability to solve nonlinear regression estimation problems; their non-parametric nature enables them to represent complex and nonlinear functions easily [16]. Despite advances in a range of health-care applications using such predictive ML [14, 15, 21], there is a lack of research and data on the efficacy of such predictive ML methods for diarrhoea outbreak prediction in Sub-Saharan Africa. Additionally, the overall task performance of ML algorithms, applied to many health-care applications and more broadly to any predictive classification task, largely depends on manual tuning and calibration of methodological parameters by algorithm designers and experimenters over the course of several experimental trials [22, 23]. Such manual tuning is often ineffective and significantly limits the task performance achieved by the ML method, especially for high-dimensional, partially observable, noisy and complex task domains [22], as typified by the data-sets in many health-care applications, including diarrhoea outbreak prediction. Task performance also largely depends on the amount of available training data [24], which is a significant challenge for most predictive ML in health-care applications due to the sensitive and controlled nature of health-care data-sets [25]. The inaccessibility of data adds to the difficulty of method comparison, accuracy, and the advancement of ML as a whole [24, 26].
The overall aim of this study is to ascertain the suitability of various ML methods, given various climate factors and synthetic (generative) training data, for accurately predicting diarrhoea outbreaks. Specifically, the study aims to elucidate what type of ML method is most appropriate when coupled with specific training and test data-sets (that is, specific climate variables, data-sparseness, data-noise and synthetic data complement), in order to optimise prediction efficacy. Thus, we compared the task performance of three ML methods (CNNs, LSTMs and SVMs) to ascertain the most suitable method for predicting the future number of daily diarrhoea cases in nine South African provinces. The average predictive accuracy of each method was compared across multiple datasets and experiment replications. Given the sparse and noisy nature of the data-sets used for method training and testing, we necessarily augmented the available data (real-world data) with synthetic data generated using Generative Adversarial Networks (GANs). GANs were selected as they have previously been demonstrated to be effective for generating different types of realistic data [24, 25]. Also, since there was a lack of previous research to guide parameter tuning and calibration for optimising such ML methods applied to diarrhoea outbreak prediction, we used the Relevance Estimation and Value Calibration (REVAC) method [27]. REVAC is an evolutionary algorithm designed for meta-heuristic parameter tuning, and as such was applied to optimise methodological parameters of the ML methods used in this study. Previous work has demonstrated the effectiveness of REVAC for parameter tuning and attaining optimal algorithm performance across a range of complex, noisy and high-dimensional search spaces [28, 29].
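
To make the idea of evolutionary parameter tuning concrete, the following is a minimal sketch of a REVAC-style loop. It is not the authors' implementation: it assumes a simplified variant in which the best individuals act as parents, each child parameter is copied from a randomly chosen parent, and mutation redraws the value uniformly from the interval spanned by its h nearest neighbours among the parents. The function names (revac_like_tune, toy_error), the toy objective and all numeric settings are hypothetical.

```python
import random


def revac_like_tune(objective, bounds, pop_size=30, n_parents=15, h=3,
                    evaluations=300, seed=0):
    """Simplified REVAC-style tuner (illustrative only); lower objective is better."""
    rng = random.Random(seed)
    dims = len(bounds)
    # Initial population: parameter vectors drawn uniformly inside the bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    scores = [objective(p) for p in population]
    best_idx = min(range(pop_size), key=lambda i: scores[i])
    best, best_score = list(population[best_idx]), scores[best_idx]
    oldest = 0  # index of the individual to be replaced next

    for _ in range(evaluations - pop_size):
        # Select the n_parents best individuals as parents.
        order = sorted(range(pop_size), key=lambda i: scores[i])[:n_parents]
        parents = [population[i] for i in order]
        child = []
        for d in range(dims):
            values = sorted(p[d] for p in parents)
            # Multi-parent uniform crossover: copy from a random parent.
            picked = rng.choice(parents)[d]
            pos = values.index(picked)
            # Mutation: redraw within the interval of the h-th neighbours.
            low = values[max(0, pos - h)]
            high = values[min(n_parents - 1, pos + h)]
            child.append(rng.uniform(low, high))
        child_score = objective(child)
        # Replace the oldest individual and remember the best vector seen so far.
        population[oldest], scores[oldest] = child, child_score
        oldest = (oldest + 1) % pop_size
        if child_score < best_score:
            best, best_score = list(child), child_score
    return best, best_score


def toy_error(params):
    """Hypothetical stand-in for a model's validation error."""
    x, y = params
    return (x - 0.3) ** 2 + (y - 0.7) ** 2


best_params, best_error = revac_like_tune(toy_error, [(0.0, 1.0), (0.0, 1.0)])
```

In practice the objective would train a model with the candidate parameters and return its validation error, which makes each evaluation expensive; the evaluation budget is therefore the main cost to control.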

Discussion The results of our experiments revealed that although the Deep Learning (DL) methods (Configuration of ML methods section) outperformed the SVM (SVM method section) in most tasks, there was no clear best ML method overall. The ML methods showed different levels of skill based on the availability of training data and the type of parameter tuning method used during training.
Performance based on dataset type The CNN method (CNN method section) was able to generalize well and select important features to yield the most satisfactory performance when only real-world data was used for making predictions, regardless of its limited training set size. Based on different metrics, some studies [19, 20] have shown CNNs to be more accurate than several other methods for infectious disease prediction. We theorize this to be a result of CNNs being effective universal approximators capable of automatic feature engineering [17]. Our findings also agree with previous research which showed that deep neural networks outperform traditional ML algorithms for most disease prediction tasks [19, 20]. The prediction performance of all ML methods improved when the augmented data-sets were used for training, with the LSTM (LSTM method section) giving the overall best performance. This implies that a large training set size boosts the performance of most ML algorithms. We also surmise that the LSTM method performs better when the size of the training data is large, which may explain its relatively poor performance in the first experiment, where only real-world data with a limited training set was used. A study conducted by [39] has shown that LSTMs benefit from a large training set size. In addition, another study by [14] reported that LSTMs are state of the art for capturing the long-term dependencies specific to a given data-set, hence their ability to learn patterns in sequential data given sufficient training data, regardless of its noisy nature.
Performance based on parameter tuning method With respect to parameter tuning as a factor for task performance with the augmented data, we found that with the given grid-search parameters (Table 1), the average percentage increase in task performance of the CNN method was the lowest when compared to the other methods across individual provinces. The provincial instances, such as Gauteng, Eastern Cape, and Mpumalanga in Fig 8a (provincial results of the ML methods with the augmented data-sets in Experiments II and III), where the SVM outperformed the CNN method are likely due to the parameter settings of the CNN used during training. Therefore, we deduce that the choice of parameters greatly affects the performance of deep learning models, especially when applied to noisy and augmented data-sets. Thus, we set up a different experiment with the REVAC tuning strategy. With the REVAC parameter tuning implementation, the CNN method gave the highest percentage increase in performance across each province. However, the LSTM method’s prediction performance was still better than the other methods for most provinces. The SVM, in contrast, demonstrated the lowest average percentage increase and the highest average percentage decrease across the provinces. Therefore, we can infer from these results that REVAC parameter tuning is not ideal for the SVM method; rather, it is more suited to deep learning methods. A possible explanation may be the low-dimensional search space of parameters for the SVM method, considering that the major parameters of an SVM (with RBF kernel) are gamma and C only. A study by [40] has found that predefining a search space, especially for few parameters, can be difficult. However, [22] reported that grid-search is better suited to low-dimensional search spaces, which may explain the SVM’s satisfactory performance with grid-search tuning.
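
The point about low-dimensional search spaces can be illustrated with a minimal grid-search sketch over the two RBF-kernel SVM parameters named above, C and gamma. The grid values and the scoring function below are assumptions for illustration only; they are not the values in Table 1, and toy_validation_error is a hypothetical stand-in for training an SVM and measuring its validation error.

```python
import itertools
import math


def grid_search(score, grid):
    """Evaluate every combination in the grid and return the lowest-scoring one."""
    names = list(grid)
    best_params, best_score = None, math.inf
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        value = score(**params)
        if value < best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_validation_error(C, gamma):
    """Hypothetical error surface with its minimum at C=10, gamma=0.01."""
    return (math.log10(C) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2


# Assumed log-spaced grid: 4 x 4 = 16 evaluations cover the whole space.
grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.001, 0.01, 0.1, 1.0]}
best_params, best_score = grid_search(toy_validation_error, grid)
```

With only two parameters the grid stays small, so exhaustive evaluation is cheap. The number of combinations grows multiplicatively with each added parameter, which is one reason exhaustive grids become impractical for models with many tunable parameters.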
In Table 5, we compared the results obtained when REVAC parameter tuning was used on the upward augmented data with the results of some existing models for diarrhoea outbreak prediction on different datasets [14, 15, 41]. Although our RMSE values appear lower, we note that the difference in the error values may be due to the type and size of the dataset used in the different studies, as well as the unit and scale of each dataset.
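
The caveat about unit and scale follows directly from how RMSE is defined: it is expressed in the same units as the target variable, so rescaling the data rescales the error by the same factor. The short sketch below shows this with made-up numbers; the series are illustrative assumptions and are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up daily counts and forecasts, for illustration only.
daily_cases = [12.0, 15.0, 9.0, 20.0]
forecast = [10.0, 15.0, 12.0, 18.0]

rmse_raw = rmse(daily_cases, forecast)

# The same series expressed in units ten times larger (values divided by 10).
rmse_scaled = rmse([x / 10 for x in daily_cases], [x / 10 for x in forecast])
```

Here rmse_scaled is one tenth of rmse_raw even though the forecasts are equally good, which is why RMSE values from studies with different units or case volumes are not directly comparable.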


Table 5. Root Mean Square Error (RMSE) performance comparison with the existing diarrhoea prediction studies. https://doi.org/10.1371/journal.pone.0262008.t005
Sensitivity analysis Our parameter sensitivity analysis (Experiments setup section) demonstrated that the prediction of diarrhoea outbreaks by the given ML methods is influenced by specific climate factors. The most influential factors are precipitation, humidity, evaporation and temperature, although their levels of influence differ across South African provinces. Our findings are in agreement with studies such as [7, 8] that have shown that diarrhoea cases increase for every 1°C increase in temperature. In addition, related work by [42] reported that evaporation rate is strongly linked to high temperature. Since increases in diarrhoea cases have been associated with high temperature, perhaps diarrhoea can also be linked to evaporation rate. Other studies [9, 15] have also demonstrated that precipitation rate and humidity are strongly related to reported increases in diarrhoea-related hospitalizations.
Study contributions A key contribution of this research is the first comprehensive study and application of pertinent ML methods to real-world health-care data sourced from various South African medical institutions in order to formalise an effective predictive machine learning methodology for Sub-Saharan Africa (currently one of the areas most adversely affected, globally, by diarrhoea outbreaks [1, 3]). A second key contribution of this research is the use of evolutionary optimisation for automating parameter tuning for a given ML method and associated training data-set, as well as the demonstration of data augmentation techniques, such as the use of generative models to generate artificial data [24, 25], to complement training data deficiencies.
While our study has demonstrated that ML can be used for diarrhoea outbreak prediction with climate factors, the results can be improved in some ways. For example, taking into consideration other human and environmental factors that cause the spread of infectious diseases may improve the accuracy of future diarrhoea prediction models. Given the different strengths of each ML algorithm, developing a hybrid method that combines the advantages of at least two ML algorithms may result in a methodology that yields consistently high predictive task performance regardless of the conditions set in an experiment.

Conclusion The global burden of diarrhoea is a major public health problem that causes both personal and widespread harm. This study ascertained the applicability of various Machine Learning (ML) methods in the development of an automated early warning system for predicting diarrhoea outbreaks in South Africa given specific climate variables. We compared the predictive task performance of various ML methods, including Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs) and Convolutional Neural Networks (CNNs), for predicting daily diarrhoea cases over nine South African provinces. Prediction comparisons were with respect to a specific set of climate variables and varying proportional combinations of real-world and synthetic (data augmentation) training and testing data. Results indicated that overall (for all real-world data-sets), our CNN yielded the most accurate predictions, supporting the well-established predictive capacity and efficacy of deep-learning systems. However, given synthetic training and testing data augmentation, our LSTM yielded the most accurate predictions overall. This study also elucidated that the climate variables precipitation, humidity, evaporation, and temperature had the greatest impact on daily diarrhoea cases across South Africa, and were thus the data-set variables integral to the predictive success of our tested methods. Thus, a key contribution of this study is the guidance it provides researchers in selecting a suitable ML method for disease outbreak prediction (diarrhoea case prediction in this study), given real-world and augmented training and testing data-sets containing specific types of climate variables. Current research is applying further predictive machine learning methods in an ongoing effort to develop automated early-warning systems for broad-spectrum disease outbreak prediction across various developing nations with deficient public health systems.

Supporting information S1 Appendix. https://doi.org/10.1371/journal.pone.0262008.s001 (PDF)
S1 Fig. Violin plots showing the distribution of the upward augmented data for loperamide (diarrhoea) and climate variables across the provinces. EC = Eastern Cape, FS = Free State, GA = Gauteng, KZ = KwaZulu Natal, LP = Limpopo, MP = Mpumalanga, NC = Northern Cape, NW = North West, WC = Western Cape. https://doi.org/10.1371/journal.pone.0262008.s002 (TIF)
S2 Fig. Violin plots showing the distribution of the downward augmented data for loperamide (diarrhoea) and climate variables across the provinces. EC = Eastern Cape, FS = Free State, GA = Gauteng, KZ = KwaZulu Natal, LP = Limpopo, MP = Mpumalanga, NC = Northern Cape, NW = North West, WC = Western Cape. https://doi.org/10.1371/journal.pone.0262008.s003 (TIF)

Acknowledgments The authors would like to extend their gratitude to the Applied Centre for Climate and Earth Systems Science (ACCESS) under the Council for Scientific and Industrial Research (CSIR), South Africa, and Clicks Pharmaceuticals, South Africa, for providing data relevant to this study.

[END]

[1] Url: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0262008

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/
