(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.

Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

Machine learning to predict the source of campylobacteriosis using whole genome data

['Nicolas Arning', 'Big Data Institute', 'Nuffield Department Of Population Health', 'University Of Oxford', 'Li Ka Shing Centre For Health Information', 'Discovery', 'Old Road Campus', 'Oxford', 'United Kingdom', 'Samuel K. Sheppard']

Date: 2021-12

Abstract

Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source, which is challenging as transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, allowing isolates from infected individuals to be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using the classifier we named aiSource. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). By combining generalizable, computationally efficient methods based upon machine learning and population genetics, we provide a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.

Author summary

C. jejuni are the most common cause of food-borne bacterial gastroenteritis but the relative contribution of different sources is incompletely understood. We traced the origin of human C. jejuni infections using machine learning algorithms that compare the DNA sequences of bacteria sampled from infected people, contaminated chickens, cattle, sheep, wild birds, and the environment. This approach improved the accuracy of source attribution by 33% over existing methods that use only a subset of genes within the genome and provided evidence for the relative contribution of different infection sources. Sometimes even very similar bacteria showed differences, demonstrating the value of basing analyses on the entire genome when developing this algorithm, which can be used for understanding the global epidemiology of this and other important bacterial infections.

Citation: Arning N, Sheppard SK, Bayliss S, Clifton DA, Wilson DJ (2021) Machine learning to predict the source of campylobacteriosis using whole genome data. PLoS Genet 17(10): e1009436. https://doi.org/10.1371/journal.pgen.1009436

Editor: Diarmaid Hughes, Uppsala University, SWEDEN

Received: February 20, 2021; Accepted: August 26, 2021; Published: October 18, 2021

Copyright: © 2021 Arning et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data used in this study can be found on PubMLST using the accession numbers provided in S1 Table. The data has also been uploaded as a public dataset here: https://pubmlst.org/bigsdb?db=pubmlst_campylobacter_isolates&page=query&project_list=102&submit=1.

Funding: N. A. is a recipient of a BBSRC scholarship and thus supported by funding from the Biotechnology and Biological Sciences Research Council (BBSRC) (grant number BB/M011224/1). SKS was supported by Wellcome Trust (088786/C/09/Z) and Medical Research Council (MR/M501608/1 and MR/L015080/1) grants. D. J. W. is supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant number: 101237/Z/13/B) and by the Robertson Foundation. DAC’s research is supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: DAC declares grants from GlaxoSmithKline and personal fees from Oxford University Innovation, Biobeats, and Sensyne Health, in areas unrelated to this work.

Introduction

Campylobacter jejuni and Campylobacter coli are among the most common causes of gastroenteritis globally and are responsible for approximately nine million annual cases in the European Union [1,2]. These zoonotic bacteria are a common commensal constituent of the gut microbiota of bird and animal species [3,4] but cause serious infections in humans. Symptoms include nausea, fever, abdominal pain, and severe diarrhoea, with potential for the development of debilitating, and sometimes fatal, sequelae [5,6]. Various infection sources have been identified including animal faeces, contaminated drinking water and especially raw or under-cooked poultry and other meats [7]. However, effectively combating disease requires a detailed understanding of the relative contribution of different sources to human infection.

As in many other bacterial species, Campylobacter populations represent diverse assemblages of strains [3,8–10]. Within this structured population, some lineages are more commonly observed in particular host species [3,4,11]. Because of this host association, DNA sequence comparisons of bacteria from human gastroenteritis and potential reservoir populations can reveal the infection source. This approach has identified contaminated poultry as a major source of human infection [12,13]. Based on the body of evidence including DNA sequence analysis [14], targeted interventions have been implemented, including improved biosecurity measures on poultry farms, which have halved recorded campylobacteriosis cases in New Zealand [15,16].

Extending the principle of linking source-sink populations using genotype data, methods have been developed to attribute C. jejuni to the likely source based on bacterial gene frequencies in potential reservoir populations [17,18]. Among the most common genotyping approaches for C. jejuni has been multi-locus sequence typing (MLST), which catalogues DNA sequence variation across seven housekeeping genes that are common to all strains [19,20]. Isolates with identical alleles at all loci are assigned to the same sequence type (ST) and those with identical sequences at most or all loci are grouped within the same clonal complex (CC). Using these data, and allele frequencies, it has been possible to probabilistically assign clinical isolates (STs and CCs) to host source using source attribution models such as the asymmetric island model implemented in iSource [17] and the Bayesian population assignment model STRUCTURE [18,21]. Both methods have been instructive in estimating the relative contribution of a range of domestic and wild animal hosts to human infection, with poultry often identified as the principal source of human campylobacteriosis across different regions and countries [17,18,22–25].

There are two main limitations when using genotype data for bacterial source attribution. The first is that the ability to attribute is only as good as the degree of genotype segregation. For example, in C. jejuni there are host restricted genotypes [3,26] that can be readily attributed to a given host source when observed in human infections, as well as ecological generalists [27,28] that have relatively recently transitioned between hosts and cannot therefore be attributed with confidence [29]. While host switching potentially imposes a biological constraint on quantitative attribution models, the second limitation is far more tractable. Specifically, most current source attribution methods are subject to limitations imposed by the underlying data. Reflecting the technology of the time, MLST-based source attribution is based only on a small fraction of the genome (approximately 0.2% for C. jejuni [25]) and there is considerable potential for better strain differentiation using current techniques.

The increasing availability of large whole genome sequence (WGS) datasets has greatly enhanced analyses of bacterial population structure and diversity [30]. However, exploiting the full information can be challenging due to variable gene content and the complexity of interpreting the short reads produced by next generation sequencing. Notwithstanding this, some studies have attempted to overcome the limited discriminatory power of MLST in attribution studies by screening WGS data to identify elements (SNPs and genes) that segregate by host [31–34]. Using these host segregating markers as input data has improved the resolution of existing attribution models, including STRUCTURE, and provided information about potential infection reservoirs in the UK and France. However, using bespoke marker selection approaches with software designed for MLST data does not maximize the potential of WGS data for source attribution.

Here, we present a machine learning approach using WGS data to predict the source of human C. jejuni infection. This has two principal advantages over existing techniques. First, building on WGS-based machine learning source attribution approaches applied to Salmonella enterica and Escherichia coli [35,36], we take an agnostic approach to identify which machine learning tool performs best from a broad range of available algorithms. Second, we use a WGS input capture approach based on data types conveniently available in public databases (such as PubMLST [37]), allowing the analysis of existing MLST, core-genome MLST and WGS datasets and the reuse of data for continuously updatable monitoring in a generalizable framework. Thus, we aimed to overcome limitations of the currently available methods and use the output to investigate the infective potential of C. jejuni strains.

Methods

Dataset acquisition

A total of 5,799 C. jejuni and C. coli genomes isolated from various sources and host species were available on the public database for molecular typing and microbial genome diversity, PubMLST (https://pubmlst.org/), with the following source distribution: chicken, 4,147; cattle, 716; sheep, 584; bird, 212; environment, 140. WGS data corresponded to MLST ST and CC designations as well as core genome (cg) MLST classes. The dataset was divided into training (75%) and testing (25%) sets, but we diverged from the more common independent random drawing of individual samples. Instead we used phylogeny-aware sorting, wherein all members of one ST were sorted entirely into either training or testing sets (S1 Table). The ST based sorting accounts for the phylogenetic non-independence of samples [38]. To allow for sufficient sample sizes per reservoir population (hereafter “class”), only the five most prevalent classes for MLST and cgMLST were used (chicken, cattle, sheep, wild bird and environment). For farm animals the classes “chicken” and “chicken offal or meat” were combined to “chicken” (likewise for sheep and cattle), whilst “environment”, “sand” and “river water” were combined into “environment”, consistent with previous studies [18,39].

Feature engineering

The allelic profiles of MLST and cgMLST were used directly. MLST samples that had missing alleles at any locus and cgMLST samples with more than 10% missing loci were discarded, with the remaining missing alleles in cgMLST encoded as -1. To potentially exploit the gradient of separation encoded in the sequences underlying the MLST allelic profiles, we downloaded the underlying allele sequences for every locus of the MLST scheme and encoded the nucleotides as dummy variables and k-mers (k = 21) using DSK [40]. DSK was also used for encoding the WGS as k-mers, as k-mers have previously been used successfully in C. jejuni WGS analysis, namely for determining the genetic basis of C. jejuni host affinity [41] and survival [42]. Using k = 21 led to a prohibitively large input vector due to the number of unique k-mers found in all genomes (109,675,176). We reduced the number of k-mers by applying a variance threshold under which k-mers that were present or absent in more than 99% of the samples were discarded, reducing the number of unique k-mers to 7,285,583. Furthermore, we performed feature selection by testing the dependence of the source labels on every individual k-mer using the Chi-Square statistic. To avoid data leakage we performed the feature selection using only the training data and labels, selecting the 100,000 k-mers with the highest score.

Algorithm training

All machine learning and deep learning was performed in Python (for a list of all algorithms see Fig 1). The xgboost library [43] was used for the gradient boosting classifiers with all other machine learners implemented in scikit-learn [44]. The hyper-parameters for each classifier were chosen using Cartesian grid search on five-fold cross-validation of the training set. The Keras library (https://keras.io/) was used to construct deep learning algorithms aimed at supplying a wide range of commonly used architectures. We found this to work best, empirically, given that there is no principled means of architecture selection for such models. Specifically:
(i) A recurrent neural network (RNN) consisting of a layer with 64 gated recurrent units, a 50% dropout layer and a Rectified Linear Unit (ReLU) activation layer;
(ii) A 1-dimensional convolutional network with two convolutional layers of kernel size 3 and 5 respectively and 30 filters, both followed by 50% dropout layers and a ReLU layer;
(iii) A long short-term memory (LSTM) network consisting of one LSTM layer with 64 units and a 50% dropout layer;
(iv) A shallow dense network with one dense layer with 64 units followed by a 50% dropout layer and a ReLU activation layer;
(v) A deep dense network with 6 dense layers starting with 128 units and halving units with each successive layer. All individual dense layers are followed by a 50% dropout layer and a ReLU layer.
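The two-stage k-mer reduction described under Feature engineering (a 99% presence/absence threshold followed by Chi-Square ranking on the training data only) can be illustrated with a minimal sketch. This is not the code used in the study: the function names and the dictionary-of-columns data layout are hypothetical, and the statistic below is a plain Pearson Chi-Square on the presence/absence by class contingency table, which may differ from the implementation the authors used.

```python
from collections import Counter


def variance_filter(presence, max_fraction=0.99):
    """Drop k-mers present or absent in more than max_fraction of samples.

    presence maps each k-mer to a list of 0/1 flags, one per training sample
    (hypothetical layout chosen for this sketch).
    """
    kept = {}
    for kmer, column in presence.items():
        frac_present = sum(column) / len(column)
        if frac_present <= max_fraction and (1 - frac_present) <= max_fraction:
            kept[kmer] = column
    return kept


def chi_square(column, labels):
    """Pearson Chi-Square of one presence/absence column against class labels."""
    n = len(labels)
    class_totals = Counter(labels)
    n_present = sum(column)
    present_by_class = Counter(lab for bit, lab in zip(column, labels) if bit)
    stat = 0.0
    for cls, total in class_totals.items():
        cells = (
            (present_by_class[cls], n_present),              # k-mer present
            (total - present_by_class[cls], n - n_present),  # k-mer absent
        )
        for observed, row_total in cells:
            expected = row_total * total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def select_top_kmers(presence, labels, top_n):
    """Rank k-mers by Chi-Square score (training data only) and keep top_n."""
    scores = {kmer: chi_square(col, labels) for kmer, col in presence.items()}
    return sorted(scores, key=lambda kmer: (-scores[kmer], kmer))[:top_n]
```

In this sketch the selected k-mer list would be computed once from the training samples and then reused unchanged to build the test-set feature matrix, which is what keeps test labels out of the selection step.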


Fig 1. A heatmap showing classifier performance on the class balanced (A) and imbalanced (B) test set. The individual cells are coloured according to the average accuracy on 200 rounds of resampling with replacement with one standard error noted next to the average accuracy. The averages of accuracy per classifier are shown in the rightmost column, whereas the bottom row shows the averages per data type. https://doi.org/10.1371/journal.pgen.1009436.g001

To all deep learning architectures, we added an output layer comprising a dense layer with soft-max activation with one unit for every class. We encoded the labels as dummy variables and used categorical cross-entropy as a loss function together with the Adam optimiser [45]. Cyclical learning rates were used with a maximum learning rate of 0.1 and a minimum learning rate of 0.0001 to overcome local minima. The accuracy on the test set was measured at every epoch and the overall best performing weights were stored as a checkpoint. The data were fed in batches of 128 samples with every batch randomly undersampled so that each class was represented in equal proportions. The training was run for 500 generations with early stopping after 50 generations.

Algorithm testing

Both machine learning and deep learning were tested on the same 25% test set. The original data were skewed in source composition by ratios which did not necessarily reflect source origin of infection. We therefore used two methods to rebalance the classes in testing. The first test set featured an even distribution of classes, whereas the second undersampled the over-abundant chicken-origin genomes to emulate relative contribution to human disease. We used the ratios predicted by Wilson et al. [12], where Campylobacter genomes from chickens were 1.61 times more common than those from cattle.
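As a rough illustration of the two test-set compositions just described, the sketch below draws a single undersampled subset whose class counts follow a set of relative weights. It is only a sketch under stated assumptions: the function name is hypothetical, equal weights stand in for the balanced set, and the only ratio taken from the text is chicken to cattle at 1.61 to 1 (weights for the other classes are not given here and would have to come from reference [12]).

```python
import random
from collections import defaultdict


def undersample(labels, weights, rng):
    """Pick sample indices, without replacement within one draw, so that
    class counts follow the relative (positive) weights."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Largest scale at which every class still has enough samples.
    scale = min(len(by_class[cls]) / w for cls, w in weights.items())
    chosen = []
    for cls, w in weights.items():
        k = min(len(by_class[cls]), round(scale * w))
        chosen.extend(rng.sample(by_class[cls], k))
    return sorted(chosen)


# Toy labels, not the study data.
toy_labels = ["chicken"] * 100 + ["cattle"] * 10
rng = random.Random(0)
balanced = undersample(toy_labels, {"chicken": 1.0, "cattle": 1.0}, rng)
skewed = undersample(toy_labels, {"chicken": 1.61, "cattle": 1.0}, rng)
print(len(balanced), len(skewed))
```

The minority class fixes the size of each draw, so repeating the draw with different random states is what lets every sample of the larger classes appear in testing at some point.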
In both methods, rebalancing the classes was achieved by undersampling, which we repeated 200 times with replacement, averaging the accuracy over all iterations whilst also recording one standard error. As our balanced test set is limited by the number of available samples from the minority source (35 environment samples), the repeated undersampling allows us to use all available samples of the residual classes in testing. For performance metrics we recorded accuracy, precision (positive predictive value), recall (sensitivity), F1, negative predictive value, specificity and speed. Speed was measured relative to other classifiers on a scale where 0 was the slowest classifier and 1 the quickest, with all intermediate values normalised within these limits. For comparison to previous methods, iSource was applied to the test dataset [17]. Having established that XGBoost on cgMLST was the best performing source attribution method, we retrained the classifier with both training and testing data and applied it to all 15,988 human cgMLST samples available on the PubMLST database. The prediction took 892 milliseconds on a Dell OptiPlex 7060 desktop using ten threads on an Intel Core i7-8700 CPU and 16 GB RAM. Our algorithm, named aiSource, is available at https://github.com/narning1992/aiSource

Phylogenetic analysis

We defined the generalist index as the number of sources the ST was found in across all isolates in the dataset, which included additional samples for which only MLST data was available (S1 Table). We built a phylogeny of CC21 genomes from both source-associated and human isolates using Neighbour Joining, based on pairwise Hamming distances of k-mer presence/absence in the WGS dataset, as described by Hedge and Wilson [46]. We used TreeBreaker [47] to infer the evolution of phenotypes across the phylogenetic tree of ST-21 and the most closely related sequence types. The known labels of the source-associated samples were used as phenotypic information for input into TreeBreaker together with the phylogeny of CC21. TreeBreaker was run for 5,500,000 iterations with 500,000 iterations as burn-in and 1,000 iterations between sampling. The phylogenetic trees were visualised with Microreact [48] and arranged alongside the results of TreeBreaker in Inkscape.
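The pairwise Hamming distance on k-mer presence/absence used as input to Neighbour Joining can be sketched as follows. This is an illustrative sketch, not the pipeline of reference [46]: the helper names are hypothetical, k-mers are extracted naively from a single strand rather than with DSK, and the toy sequences use k = 3 instead of k = 21. For binary presence/absence vectors the Hamming distance equals the number of k-mers found in exactly one of the two genomes.

```python
from itertools import combinations


def kmer_set(sequence, k):
    """All distinct k-mers of a sequence (single strand only, a simplification)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hamming_distance(kmers_a, kmers_b):
    """Number of k-mers present in exactly one of the two genomes."""
    return len(kmers_a ^ kmers_b)


def distance_matrix(genomes):
    """Symmetric matrix of pairwise distances as a nested dictionary.

    genomes maps an isolate name to its set of k-mers.
    """
    names = sorted(genomes)
    dist = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        d = hamming_distance(genomes[a], genomes[b])
        dist[a][b] = d
        dist[b][a] = d
    return dist


# Toy sequences, not real isolates.
toy = {
    "isolate_a": kmer_set("ACGTAC", 3),
    "isolate_b": kmer_set("ACGTAA", 3),
    "isolate_c": kmer_set("TTTTTT", 3),
}
matrix = distance_matrix(toy)
print(matrix["isolate_a"]["isolate_b"])
```

A matrix of this form would then be passed to a Neighbour Joining implementation to build the tree; that step is not shown here.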

Outlook and conclusions

The increasing availability of large pathogen genome datasets, together with algorithms and resources for analysing them, has created possibilities for investigating the transmission of zoonotic diseases that are incompletely understood. It is clear from the data presented here that tree-based ensemble methods for machine learning classification using bacterial genomic data provide considerable utility for improving the accuracy of host source attribution for human campylobacteriosis. Key to the effectiveness of this approach is leveraging the full gradient of genomic differentiation afforded by WGS or cgMLST analysis. Host associated genetic variation can be observed in both core and accessory genes [41] but using these data presents practical considerations. With more computational resources available, it may be possible to analyse all k-mers present in the WGS samples (here 109,675,176 unique k-mers) with multiple algorithms accompanied by cross-validation and bootstrap replication.

Beyond simple attribution to host source, resolving the fine-grained structure of genomic signatures of association has considerable potential to account for differences in the relative frequency of sub-lineages in samples taken from reservoir hosts and human disease. This can provide important clues about the propensity of strains to survive outside of the host for long enough to transmit to humans as well as the capacity to colonize the human gut given the opportunity [42,57]. This of course leads to questions about the genomic basis of bacterial adaptation, specifically the extent to which ‘associated’ genetic elements represent adaptations and whether the same genes and alleles enable colonisation of different host animals.

Improving on the approaches described here through better sampling and incremental training of aiSource, which is available at https://github.com/narning1992/aiSource, has considerable potential. The low computational requirements of aiSource and its high prediction speed make it an excellent tool for analysing large genome datasets. Furthermore, by using phylogeny-aware train/test splitting for measuring performance, prediction remains accurate when new genetic variants are introduced because the algorithm can be incrementally trained with new data. This has considerable potential for developing automated and continuous disease surveillance systems to reduce campylobacteriosis, which remains one of the most common food-borne illnesses in the world.
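The phylogeny-aware train/test splitting referred to above, in which every member of a sequence type (ST) is placed entirely in either the training or the testing set, can be sketched as below. This is a minimal sketch under assumptions, not the procedure recorded in S1 Table: the function name, the shuffling strategy and the seed are hypothetical, and because whole STs are moved at once the test set only approximates the 25% target.

```python
import random
from collections import defaultdict


def split_by_sequence_type(sample_sts, test_fraction=0.25, seed=0):
    """Assign whole sequence types to either the training or the testing set.

    sample_sts maps a sample identifier to its ST label (all labels of one
    comparable type, e.g. all strings). Returns (train_ids, test_ids).
    """
    by_st = defaultdict(list)
    for sample_id, st in sample_sts.items():
        by_st[st].append(sample_id)
    sts = sorted(by_st)
    random.Random(seed).shuffle(sts)
    target = test_fraction * len(sample_sts)
    train, test = [], []
    for st in sts:
        # Fill the test set with complete STs until it reaches the target size;
        # every remaining ST goes to training.
        bucket = test if len(test) < target else train
        bucket.extend(by_st[st])
    return train, test


# Toy data: 70 hypothetical isolates spread over 7 hypothetical STs.
toy_sts = {f"iso{i}": f"ST-{i % 7}" for i in range(70)}
train_ids, test_ids = split_by_sequence_type(toy_sts)
print(len(train_ids), len(test_ids))
```

Keeping each ST on one side of the split means test accuracy is measured on genotypes the classifier has not seen, which is the property the performance estimate relies on.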

Supporting information

S1 Table. Table containing all samples used in this study and their corresponding PubMLST accession IDs, sequence types, clonal complexes, source labels, predicted labels, generalist index, country of isolation, year of sampling, Campylobacter species and whether they have been used in either training or testing the machine learner. https://doi.org/10.1371/journal.pgen.1009436.s001 (TSV)

Acknowledgments

N.A. would like to thank David Eyre, Christophe Fraser and Alexandra Casey for insightful comments. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

[END]

[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009436

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/
