(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
A deep learning approach to real-time HIV outbreak detection using genetic data [1]
['Michael D. Kupperman', 'Theoretical Biology', 'Biophysics', 'Los Alamos National Laboratory', 'Los Alamos', 'New Mexico', 'United States Of America', 'Department Of Applied Mathematics', 'University Of Washington', 'Seattle']
Date: 2022-10
Here, we propose an alternative strategy to outbreak detection using genomic data based on deep learning methods developed for image classification. The key idea is to use a pairwise genetic distance matrix calculated from viral sequences as an image, and develop convolutional neutral network (CNN) models to classify areas of the images that show signatures of active outbreak, leading to identification of subsets of sequences taken from an active outbreak. We showed that our method is efficient in finding HIV-1 outbreaks with R 0 ≥ 2.5, and overall a specificity exceeding 98% and sensitivity better than 92%. We validated our approach using data from HIV-1 CRF01 in Europe, containing both endemic sequences and a well-known dual outbreak in intravenous drug users. Our model accurately identified known outbreak sequences in the background of slower spreading HIV. Importantly, we detected both outbreaks early on, before they were over, implying that had this method been applied in real-time as data became available, one would have been able to intervene and possibly prevent the extent of these outbreaks. This approach is scalable to processing hundreds of thousands of sequences, making it useful for current and future real-time epidemiological investigations, including public health monitoring using large databases and especially for rapid outbreak identification.
Pathogen genomic sequence data are increasingly made available for epidemiological monitoring. A main interest is to identify and assess the potential of infectious disease outbreaks. While popular methods to analyze sequence data often involve phylogenetic tree inference, they are vulnerable to errors from recombination and impose a high computational cost, making it difficult to obtain real-time results when the number of sequences is in or above the thousands.
The analysis of pathogen genomic data to analyze epidemics at scale is constrained by the computational cost associated with phylogenetic tree reconstruction. As a fast and efficient alternative, we employed convolutional neural networks to analyze evolutionary pairwise distance matrices as images to perform classifications of the current epidemiological situation of a growing public health sequence database. We used simulated data to train and test our model, and as validation we accurately mapped the start and end of two linked well-documented HIV-1 outbreaks in the backdrop of ongoing slower HIV spread. Thus, our new approach is efficient, accurate, scalable, and can analyze data in real time.
Funding: This study was supported by NIH/NIAID (
https://www.niaid.nih.gov/ ) grants R01-AI087520, R01-AI135946 (to T.L.) and R01-AI152703 (to R.K.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2022 Kupperman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Here,we propose an alternative to phylogenetic reconstruction and HIV-TRACE. We based our approach on the pairwise distance matrix of a sample of genetic sequences, but used a deep learning approach to analyze the matrix (instead of using a distance threshold as in HIV-TRACE). Deep learning approaches, such as the convolutional neural network (CNN) [ 18 ], has been well developed and widely used for image identification over the past decade [ 19 – 21 ]. By using many parameters in a highly nonlinear model, a deep learning model can efficiently learn complex relationships within the data to form highly accurate predictions [ 22 ]. Our rationale here is that outbreaks would lead to distinctive signatures in the pairwise distance matrix (similar to its impact on the topology of the phylogenetic tree [ 23 ]). We leveraged the advance in deep learning models by treating the matrix as an image, and developed deep learning models to identify these signatures from the pairwise distance matrix and thus detect outbreaks from a sequence database. Using simulated data, we show that our CNN model can accurately identify viral sequences belonging to an outbreak. It performed better than HIV-TRACE. In addition, we show that our CNN models made accurate predictions against historical HIV sequence data with known epidemiological history, and that they can handle many thousands of sequences within a very short time frame using a laptop computer.
One popular alternative to phylogenetic inference is HIV-TRACE [ 12 ], which aims to identify clusters of connected components by applying a threshold distance to a genetic distance matrix for a set of sequences. HIV-TRACE has been used in scientific investigations [ 13 – 16 ], as well as in public health settings in the USA [ 17 ].
Given sequence data, state-of-the-art, flexible phylogenetic methods have been developed for analysis of general evolutionary questions [ 2 – 4 ], applicable to pathogen evolution, as well as faster but less precise algorithms for large data [ 5 ]. More focused phylodynamic methods have also been developed for specific tasks, e.g., taking incidence data into account [ 6 ], including multiple evolutionary scales [ 7 ], inferring underlying transmission networks on several levels [ 8 , 9 ], and using large next generation sequencing data [ 10 ]. Motivated by pathogen evolution, advanced methods for inference of past demographic history with population size dynamics and migration that can reconstruct outbreaks have also been developed, e.g., the modular framework of BEAST [ 11 ], which takes a Bayesian approach to account for uncertainties in the tree reconstruction. However, phylogenetic tree reconstruction, interpretation, and subsequent outbreak identification requires extensive expert knowledge, and thus typically can be reliably done only by highly trained scientists.
The human population is increasingly exposed to threats of infectious disease outbreaks due to population growth, increased frequency of traveling, changing patterns of land use etc. This is exemplified by the ever-growing HIV epidemic [ 1 ] as well as recent emerging outbreaks such as the SARS-CoV-2 pandemic. A key to outbreak control is early detection when the number of infected individuals is small and the disease spread is local. One growing resource for disease control is to utilize pathogen genomic sequence data to assess epidemiological conditions and threats.
Methods
Overview of the framework Here we describe the main workflow of our approach. We first developed a forward stochastic simulator of HIV transmission to generate synthetic datasets for training and testing of our CNN models. In this simulator, the number of infected individuals initially expand exponentially, and subsequently establish a constant population size (Fig 1A). The entire transmission history was recorded; n individual samples (n = 15, 20, 30, 40, or 50 in our model) were taken either during the exponentially increasing phase (labeled as ‘epidemic’) or the constant population phase (labeled as ‘endemic’). Nucleotide substitutions were then simulated on the transmission tree/genealogy of the n sampled individuals, and a n × n pairwise distance matrix was derived from the simulated substitutions on the transmission tree (Fig 1B). We repeated the stochastic simulation many times to derive a rich synthetic dataset (i.e. a collection of matrices with labels). PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 1. (A) Cartoon representation of the biphasic forward stochastic simulation. The red line indicates the exponential phase, the blue line indicates the constant population phase. Grey boxes represent an infected individual, pink arrows indicate transmission. The reconstructed transmission dendrogram is overlaid, green nodes are internal, orange (lettered) nodes indicate sampled infections. Triple chevrons indicate different possible sampling times (0, 2, 10 years). (B) The resulting image representation of the sorted pairwise distance matrix sampled at Year 10.
https://doi.org/10.1371/journal.pcbi.1010598.g001 We then developed CNN models, trained and tested these CNN models using the synthetic dataset to identify matrices that are labeled as ‘epidemic’. These CNN models handle a small number of sequences (n = 15, 20, 30, 40, or 50) at a time; however, a pathogen database may contain tens of thousands or even hundreds of thousands of sequences. We thus developed a ‘sliding-window’ approach to utilize the CNN model to identify subsets of sequences that belong to an active outbreak, i.e. the ‘epidemic’ label, from a collection of a large number of sequences. More specifically, we first constructed a pairwise distance matrix for all the sequences in the database (m sequences in total), and reordered the matrix using a fast clustering algorithm, such as hierarchical clustering [24]. This ensures that the sequences belonging to an active outbreak are grouped together. We then used a window of size n × n, and move this window along the diagonal of the m × m matrix from the top left corner to the bottom right corner. At each position of the sliding window, we used the trained CNN model to predict the label (‘epidemic’ or ‘endemic’) for the n × n submatrix of the sliding window. This sliding window approach is similar to the well-established image identification algorithm, such as R-CNN, where a subarea/window of the entire image is sampled to identify objects of interests [25]. This approach should allow the analysis of tens of thousands of sequences very efficiently.
The HIV-1 stochastic transmission simulator The HIV-1 stochastic forward transmission model was adapted from [26]. It has two phases: 1) an exponential growth phase and, 2) a constant population phase (Fig 1A). The simulation began with one infected individual. Secondary infections were generated stochastically according to a predefined basic reproductive number R 0 (ranging between 1.5 and 5 in our simulations). It has been shown that the transmission potential is much higher during the acute infection phase in an infected individual [27, 28]. Thus, we assumed that during the first three months the rate of new infections is 20-fold higher than the remaining infectious period. In addition, we assumed that an individual will be non-infectious when on successful antiviral treatment. The time of diagnosis and antiviral treatment is assumed to be uniformly distributed between 13 to 36 months after infection [29, 30]. We also tested the robustness of model predictions by changing the shape of the distribution. Once the population size reaches the maximum number of infected individuals, assumed to be log-uniformly distributed between 103 and 104 across different simulation runs, we set the population size to be constant over time. In the constant population phase, the simulation switched from generating newly infected individuals to only replacing infected individuals who are diagnosed and treated and thus no longer infectious (Fig 1A). The transmission history of all individuals during the simulation was recorded, which allows for the reconstruction of the transmission history/tree for the entire population or any subset thereof. Samples (with a size between 15 and 50) were taken at 3 different time points. The first time point, defined as year 0, is when the population size reaches the maximum number of infected individuals, i.e., the transition time when the population changes from exponential growth to a constant size. The genealogical relationship between the samples taken at this time point reflects populations undergoing exponential growth, and therefore, we labeled the samples as ‘epidemic’. The 2nd and 3rd time points are 2 and 10 years after the first time point. The genealogical relationship between these samples therefore reflects populations that have stopped expanding, and we labelled them as ‘endemic’. Construction of pairwise distance matrices. To construct an evolutionary pairwise distance matrix, D, we first calculated the temporal distance between each pair of samples based on the transmission history/tree. We assumed that the genetic sequence data is approximately 300 nucleotides (nt) to be consistent with the real HIV-1 data we used below. We then calculated the pairwise genetic distance separating two samples by drawing the number of substitutions from a Poisson distribution with the expected number of substitutions as the mean parameter, here 0.0067 substitutions nt−1 year−1 times 300 nt [31], times the temporal distance. This was iterated for each pair of the samples to form the pairwise distance matrix. Examples of matrices sampled from each time point (Year 0, 2, or 10) are shown in Fig 2. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 2. A collection of example images used for training from each sampling year. Each pictured training example was generated with R 0 > 4. Three example images are shown for each sampling time (Fig 1A).
https://doi.org/10.1371/journal.pcbi.1010598.g002
The Convolutional Neural Network (CNN) model We developed a deep learning model using a Convolutional Neural Network (CNN) to solve a classification task predicting the label (0, 2, or 10 years) for a given pairwise distance matrix. The pairwise distance matrix is similar to a grayscale image (Fig 2). Thus, in the context of machine learning models, we also refer to a pairwise distance matrix as an image. We constructed a CNN using Tensorflow [32] with a sequential architecture comprised of 2D convolutions, batch normalization, ReLu activations, and spatial maximum pooling. The layer structure is described in Table 1. The inputs to the first and third dense layers were regularized using dropout with probability p = 0.25. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Table 1. The layer connectivity proceeding from the input layer (top) down to the output layer. Our representation of image data within the convolutional stack is ‘channels last’. The batch size is implicitly 1 for each image and is dropped. If k > 20, an additional max-pool operation is performed before layer 10. No bias weight is used in the final dense layer.
https://doi.org/10.1371/journal.pcbi.1010598.t001 We generated five variants of this model architecture to accept different input square matrices with side lengths of 15, 20, 30, 40, or 50 (corresponding to the number of sampled individuals in the pairwise distance matrix). We refer to this parameter as the input shape or the window size interchangeably. The number and size of all convolution kernels, size of max-pool filters, and dense layer output neurons were held constant across all models. Model architecture specifications are reported in Table 1. Five batches of synthetic training data of 60,000 pairwise distance matrices were generated for each number of sampled individuals (for a total of 25 training sets), each with 20,000 examples for each label. Each dataset was used independently to train a single neural network. Each neural network was trained within under an hour using two P100 GPUs with two Power8NVL CPUs. Five validation sets of 30,000 pairwise distance matrices (10,000 matrices per label) was used to compute and compare model performance, one set for each input size. Each model was trained with a batch size of 32 images using the Adam optimizer [33] for 100 epochs with a learning rate of 10−4. The first and second dense layers were regularized with an ℓ 2 penalty with weights of 0.05 and 0.01 respectively. Models were evaluated independently at training time and as an ensemble in deployment using a majority voting system.
The sliding-window approach To handle a large number of sequences, we employed a sliding-window approach. We first constructed a pairwise distance matrix D of size m × m for sequences in a large database. We then used a window of size n × n and moved this window from the top left corner of the diagonal to the bottom right corner of the diagonal of D. This method evaluates one n × n principal submatrix block in the sliding window at a time by assigning a label to the submatrix using the CNN model. For each individual represented in the pairwise distance matrix, we collected a list of the labels provided by the CNN model for each block they are represented in. The most frequently identified label within the list of predictions was assigned to an individual. To validate the performance of the sliding window approach, we generated pairwise distance matrices of dimension 1000 × 1000 using our stochastic simulator. To generate each pairwise distance matrix, we performed 10 independent simulations and joined the resulting matrices. Each simulation generated a cluster of 100 individuals, with half of the clusters being sampled at Year 0 of the simulation, i.e. labeled ‘epidemic’, and the other half sampled at Year 2 or 10, labeled ‘endemic’. We then joined these 10 clusters into a single matrix. To ensure the 10 simulations represent distinct outbreak scenarios in the matrix, we assumed that the genetic distance between the initial infections of all 10 clusters was 15 mutations. The order of individuals in the pairwise distance matrix was then randomized. A total of 60 independent 1000 × 1000 pairwise distance matrices were generated with this process to form a validation set. Accuracy was computed on a per-individual basis using the sliding-window method.
Robustness of model prediction against reordering and non-random sampling We tested the robustness of model predictions against slight reordering of elements of an image using a permutation test. We asked the question: what is the probability that randomly reordering k individuals represented in a pairwise distance matrix results in an incorrect model prediction? To generate reordered images, we first selected k elements (corresponding to k sequences and k = 2, 3, or 4) in a matrix and then reordered the k elements through permutations. We randomly sampled with replacement 1000 possible reorderings for each image in a subset of 1000 images that were sampled from each validation set, and calculated the probability that random reordering results in the same model prediction. As sampling effort typically follows active infections, we also tested the robustness of model predictions against non-random sampling by considering outbreaks where detection efforts are focused around a cluster of epidemiologically closely-related individuals (e.g. arising from contact tracing). To do this, we sampled a large portion of infected individuals (10%, 30%, or 50%) at the end of a simulation run. We then randomly selected a single individual and collected the closest 99 sampled infections (determined by the recorded evolutionary distance) in our larger sample. This group of 100 sub-sampled individuals is collected and represented in a pairwise distance matrix. We then joined matrices generated under this approach to obtain 60 pairwise distance matrices, and tested our model on this multiple-cluster data using the sliding window approach. We repeat this procedure at each level of sampling intensity (i.e. 10%, 30%, or 50%).
Predicting outbreak sequences using HIV-TRACE HIV-TRACE identifies clusters of sequences using a distance threshold [12]. These clusters are often categorized as belonging to an active outbreak. We used the simulated multiple cluster data (where each simulated cluster has 100 individuals) as described above to compare the performance of HIV-TRACE [12] with our model in identifying sequence clusters (and thus sequences belonging to outbreaks). We first normalized the simulated number of mutations by the sequence length (into substitutions/site), and set the distance threshold in HIV-TRACE at 0.015 substitutions/site. Other threshold values can give other clusters, typically higher threshold values give fewer clusters [12], a more detailed fitting procedure may be appropriate for some study populations [34]. HIV-TRACE provided a list of clusters of varying sizes. The choice of the minimal cluster size to categorize sequences belonging to an outbreak is arbitrary in general. Here we used different minimal cluster sizes, 2, 10, 25, 50, or 75 sequences (out of outbreaks with 100 sequences), and computed the sensitivity (i.e. correctly identify sequences belonging to an outbreak) and the specificity (i.e. correctly identify sequences that does not belong to an outbreak) for predictions of both the HIV-TRACE and our approach.
Outbreak identification from real data Applying the sliding-window approach to real HIV sequences led to predictions of sequences belonging to either an ‘epidemic’ or an ‘endemic’ phase. The prediction of a sequence collected at an earlier time may change when newly sampled sequences are added to the database, and this change may indicate an epidemiological link between an older sequence and newly sampled sequences. Therefore, for the sequences labeled as ‘epidemic’, we further considered three possible, inferred epidemiological situations based on the time a sequence was sampled relative to the time of the most recent samples (e.g. the current year). For sequences sampled within 2 years of the current year, they were labeled as part of an ‘active epidemic’. For sequences sampled more than 2 years before the current year, we categorized them according to whether they were in the same sliding window as (i.e. close to) any sequences sampled in the current year that are labeled as ‘epidemic’. If so, we labeled these sequences ‘reactivated epidemic’; otherwise, we labeled them ‘inactive epidemic’. ‘Reactivated epidemic’ indicates a situation where the newly identified outbreak may be linked to previously sampled sequences (i.e. individuals) in the database. ‘Inactive epidemic’ indicates that we predict that the sequence used to belong to an outbreak at the time when the sequence is added to the database; however, it may not be relevant to current outbreaks. The overall procedures of model prediction on an expanding database is represented in Fig 3. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 3. Cartoon representation of the evaluation of HIV-1 genetic data in a database in real time. As the outbreak progresses and more sequences are collected (left), sequences are added to the database. Nodes in shaded areas represent infected individuals from whom HIV-1 sequences are sampled at each time in a growing transmission network. At each time point, we construct a pairwise distance representation of the sampled sequence data and apply a sorting algorithm (middle). Collecting the predictions from each matrix gives a representation of the history and progression of an outbreak (right).
https://doi.org/10.1371/journal.pcbi.1010598.g003
[END]
---
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010598
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/