(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.

Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

Identifying and prioritizing potential human-infecting viruses from their genome sequences

Nardus Mollentze (Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom; Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences)

Date: 2021-10

Determining which animal viruses may be capable of infecting humans is currently intractable at the time of their discovery, precluding prioritization of high-risk viruses for early investigation and outbreak preparedness. Given the increasing use of genomics in virus discovery and the otherwise sparse knowledge of the biology of newly discovered viruses, we developed machine learning models that identify candidate zoonoses solely using signatures of host range encoded in viral genomes. Within a dataset of 861 viral species with known zoonotic status, our approach outperformed models based on the phylogenetic relatedness of viruses to known human-infecting viruses (area under the receiver operating characteristic curve [AUC] = 0.773), distinguishing high-risk viruses within families that contain a minority of human-infecting species and identifying putatively undetected or so far unrealized zoonoses. Analyses of the underpinnings of model predictions suggested the existence of generalizable features of viral genomes that are independent of virus taxonomic relationships and that may preadapt viruses to infect humans. Our model reduced a second set of 645 animal-associated viruses that were excluded from training to 272 high and 41 very high-risk candidate zoonoses and showed significantly elevated predicted zoonotic risk in viruses from nonhuman primates, but not other mammalian or avian host groups. A second application showed that our models could have identified Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) as a relatively high-risk coronavirus strain and that this prediction required no prior knowledge of zoonotic Severe Acute Respiratory Syndrome (SARS)-related coronaviruses. Genome-based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven virus surveillance and increases the feasibility of downstream biological and ecological characterization of viruses.

Most emerging infectious diseases of humans are caused by viruses that originate from other animal species. Identifying these zoonotic threats prior to emergence is a major challenge since only a small minority of the estimated 1.67 million animal viruses may infect humans [1–3]. Existing models of human infection risk rely on viral phenotypic information that is unknown for newly discovered viruses (e.g., the diversity of species a virus can infect) or that varies insufficiently to discriminate risk at the virus species or strain level (e.g., replication in the cytoplasm), limiting their predictive value before the virus in question has been characterized [4–6]. Since most viruses are now discovered using untargeted genomic sequencing, often involving many simultaneous discoveries with limited phenotypic data, an ideal approach would quantify the relative risk of human infectivity upon relevant exposure from sequence data alone. By identifying high-risk viruses warranting further investigation, such predictions could alleviate the growing imbalance between the rapid pace of virus discovery and lower throughput field and laboratory research needed to comprehensively evaluate risk.

Current models can identify well-characterized human-infecting viruses from genomic sequences [7,8]. However, by training algorithms on very closely related viruses (i.e., strains of the same species) and potentially omitting secondary characteristics of viral genomes linked to infection capability, such models are less likely to find signals of zoonotic status that generalize across viruses. Consequently, predictions may be highly sensitive to substantial biases in current knowledge of viral diversity [3,9].

Empirical and theoretical evidence suggests that generalizable signals of human infectivity might exist within viral genomes. Viruses associated with broad taxonomic groups of animal reservoirs (e.g., primates versus rodents) can be distinguished using aspects of their genome composition, including dinucleotide, codon, and amino acid biases [10]. Whether such measures of viral genome composition are specific enough to distinguish host range at the species level remains unclear, but their specificity might arise through several commonly hypothesized mechanisms. First, aspects of antiviral immunity that target nucleotide motifs in viral genomes might select for common mutations in diverse human-associated viruses [11,12]. For example, the depletion of CpG dinucleotides in vertebrate-infecting RNA virus genomes may have arisen to evade zinc-finger antiviral protein (ZAP), an interferon-stimulated gene (ISG) that initiates the degradation of CpG-rich RNA molecules [12]. While ZAP occurs widely among vertebrates, increasingly recognized lineage specificity in vertebrate antiviral defenses opens the possibility that analogous, undescribed nucleic acid targeting defenses might be human (or primate) specific [13]. Second, the frequencies of specific codons in virus genomes often resemble those of their reservoir hosts, possibly owing to increased efficiency and/or accuracy of mRNA translation [14]. By driving genome compositional similarity to human-adapted viruses or to the human genome, such processes may preadapt viruses for human infection [15,16]. Finally, even in the absence of mechanisms that exert common selective pressures on divergent viral genomes, the phylogenetic relatedness of viruses could allow prediction of the potential for human infectivity since closely related viruses are generally assumed to share common phenotypes and host range. However, despite being a common rule of thumb for virus risk assessment, to our knowledge, whether evolutionary proximity to viruses with known human infection ability predicts zoonotic status remains untested.

We aimed to develop machine learning models that use features engineered from viral and human genome sequences to predict the probability that any animal-infecting virus will infect humans given biologically relevant exposure (here, zoonotic potential). Using a large dataset of viruses that had previously been assessed for human infection ability based on published reports, we first build machine learning models that assign a probability of human infection based on virus taxonomy and/or phylogenetic relatedness to known human-infecting viruses and contrast these models to alternatives based on hypothesized selective pressures on viral genome composition that favor human infectivity. We then apply the best performing model to explore patterns in the predicted zoonotic potential of additional virus genomes sampled from a range of species.

Results

We collected a single representative genome sequence from 861 RNA and DNA virus species spanning 36 viral families that contain animal-infecting species (S1 Fig). We labeled each virus as being capable of infecting humans or not using published reports as ground truth and trained models to classify viruses accordingly. These classifications of human infectivity were obtained by merging 3 previously published datasets that reported data at the virus species level and therefore did not consider potential for variation in host range within virus species [5,9,17]. Importantly, given diagnostic limitations and the likelihood that not all viruses capable of human infection have had opportunities to emerge and be detected, viruses not reported to infect humans may represent unrealized, undocumented, or genuinely nonzoonotic species. Identifying potential or undocumented zoonoses within our data was an a priori goal of our analysis.

We first evaluated whether phylogenetic proximity to human-infecting viruses elevates zoonotic potential. Gradient boosted machine (GBM) classifiers trained on virus taxonomy or the frequency of human-infecting viruses among close relatives identified by sequence similarity searches (“phylogenetic neighborhood,” defined using nucleotide BLAST [10]) outperformed chance (median area under the receiver operating characteristic curve [AUC_m] = 0.604 and 0.558, respectively), but were no better than manually ranking novel viruses by the proportion of human-infecting viruses in each family (“taxonomy-based heuristic,” AUC_m = 0.596, Fig 1A). This indicates that relatedness-based models were not only unable to identify novel zoonoses that are not close relatives of known human-infecting viruses, but were also largely unable to accurately distinguish risk among closely related viruses (S2 Fig). Moreover, the performance of these models depended on the data available for model training, sometimes performing worse than chance, making them highly sensitive to current knowledge of viral diversity.
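As a rough illustration of the taxonomy-based heuristic and of how an AUC summarizes ranking quality, the following minimal sketch scores each test virus by the proportion of human-infecting species in its family and computes the AUC as the probability that a randomly chosen human-infecting virus is ranked above a randomly chosen other virus. The function names, the toy data, and the fallback to the overall proportion for unseen families are all assumptions made for this example, not details taken from the study.

```python
from collections import defaultdict


def family_proportions(train):
    """Proportion of human-infecting species per viral family in the training set."""
    counts = defaultdict(lambda: [0, 0])  # family -> [human-infecting, total]
    for family, infects_humans in train:
        counts[family][0] += int(infects_humans)
        counts[family][1] += 1
    return {fam: pos / total for fam, (pos, total) in counts.items()}


def auc(scores, labels):
    """AUC: chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data: (family, infects_humans). Family names are placeholders.
train = [("A", True), ("A", True), ("A", False),
         ("B", False), ("B", False), ("B", True), ("B", False),
         ("C", False), ("C", False)]
test = [("A", True), ("A", False), ("B", True), ("B", False), ("C", False)]

props = family_proportions(train)
# Assumption: families unseen in training fall back to the overall proportion.
overall = sum(y for _, y in train) / len(train)
scores = [props.get(fam, overall) for fam, _ in test]
labels = [y for _, y in test]
heuristic_auc = auc(scores, labels)
```

Because every virus in a family receives the same score, this kind of heuristic cannot separate risk within a family, which is the limitation noted in the paragraph above.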

Fig 1. Machine learning prediction of human infectivity from viral genomes. (A) Violins and boxplots show the distribution of AUC scores across 100 replicate test sets. (B) Receiver operating characteristic curves showing the performance of the model trained on all genome composition feature sets across 1,000 iterations (gray) and performance of the bagged model derived from the top 10% of iterations (green). Points indicate discrete probability cutoffs for categorizing viruses as human infecting. (C and D) show binary predictions and discrete zoonotic potential categories from the bagged model, using the cutoff that balanced sensitivity and specificity (0.293). (C) Heatmap showing the proportion of predicted viruses in each category. (D) Cumulative discovery of human-infecting species when viruses are prioritized for downstream confirmation in the order suggested by the bagged model. Dotted lines highlight the proportion of all viruses in the training and evaluation data that need to be screened to detect a given proportion of known human-infecting viruses. Background color highlights the assigned zoonotic potential categories of individual viruses encountered (red: very high, orange: high, yellow: medium, and green: low). Numerical data underlying this figure can be found at https://github.com/nardus/zoonotic_rank/tree/main/FigureData (doi: 10.5281/zenodo.4271479). AUC, area under the receiver operating characteristic curve. https://doi.org/10.1371/journal.pbio.3001390.g001

We next quantified the performance of GBMs trained on genome composition (i.e., codon usage biases, amino acid biases, and dinucleotide biases), calculated either directly from viral genomes (“viral genomic features”) or based on the similarity of viral genome composition to that of 3 distinct sets of human gene transcripts (“human similarity features”): ISGs, housekeeping genes, and all other genes. We hypothesized that if viruses need to adapt to either evade innate immune surveillance for foreign nucleic acids or to optimize gene expression in humans, they should resemble ISGs since both tend to be expressed concomitantly in virus-infected cells. We selected 2 additional sets comprising non-ISG housekeeping genes and all remaining genes to explore whether signals were specific to ISGs. GBMs trained using genome composition feature sets performed similarly when tested separately (AUC_m = 0.688 to 0.701) and consistently outperformed models based on relatedness alone (both the taxonomy-based heuristic and machine learning models trained on virus taxonomy or phylogenetic neighborhood, Fig 1A). Combining all 4 genome composition feature sets further improved performance and reduced its variance (AUC_m = 0.740, Fig 1A), suggesting that measures of similarity to human transcripts contained information unavailable from viral genomic features alone. In contrast, adding relatedness features to this combined model reduced accuracy (AUC_m = 0.726) and increased variance (Fig 1A). Averaging output probabilities over the best 100 out of 1,000 iterations of training on random test/train splits of the data (a process akin to bagging, using ranking performance on nontarget viruses to select high performing models) further improved the combined genome feature–based model (AUC = 0.773, Fig 1B).
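To make the idea of a dinucleotide bias concrete, the sketch below computes one common formulation: the observed frequency of each dinucleotide divided by the frequency expected from the two single-nucleotide frequencies. This is an illustrative assumption about how such a feature can be defined; the exact feature definitions used in the study are not given in this text, and the function name `dinucleotide_bias` is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides in a sequence.

    Expected frequency is the product of the two single-nucleotide frequencies.
    Dinucleotides whose expected frequency is zero are reported as NaN.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alike
    bases = "ACGT"
    mono = Counter(b for b in seq if b in bases)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                    if seq[i] in bases and seq[i + 1] in bases)
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    bias = {}
    for x, y in product(bases, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = pairs[x + y] / n_pairs if n_pairs else 0.0
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias


# Made-up toy sequence, heavily enriched for CpG (ratio well above 1).
example = dinucleotide_bias("CGCGCGCG")
```

A ratio below 1 for `"CG"` would indicate CpG depletion of the kind discussed for vertebrate-infecting RNA viruses. A "human similarity" version of the feature could then compare such ratios with those of a set of human transcripts, though how that comparison is made is again an assumption here.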

To estimate model sensitivity and specificity, we converted the mean of predicted probabilities of human infection from the bagged model into binary classifications (i.e., human infecting or not), predicting viruses with predicted probabilities >0.293 as human infecting. This cutoff balanced sensitivity and specificity (both 0.705, Fig 1C), although in principle, higher or lower cutoffs could be selected to prioritize reduction of false positives or false negatives, respectively (Fig 1B). These binary predictions correctly identified 71.9% of viruses that predominately or exclusively infect humans and 69.7% of zoonotic viruses as human infecting, although performance varied among viral families (Fig 1C, S3 Fig). Since binary classifications ignore both the variability between iterations and the rank of viruses relative to each other, we further converted predicted probabilities of zoonotic potential into 4 zoonotic potential categories, describing the overlap of confidence intervals (CIs) with the 0.293 cutoff from above (low: entire 95% CI of predicted probability ≤ cutoff; medium: mean prediction ≤ cutoff, but CI crosses it; high: mean prediction > cutoff, but CI crosses it; very high: entire CI > cutoff). Under this scheme, the majority (92%) of known human-infecting viruses were predicted to have either medium (21.5%), high (47.1%), or very high (23.4%) zoonotic potential, while only 8% (N = 21) had low zoonotic potential (S4 Fig, S1 Table). A total of 18 viruses not currently considered to infect humans by our criteria were predicted to have very high zoonotic potential (S5 Fig), although at least 3 of these (Aura virus, Ndumu virus, and Uganda S virus) have serological evidence of human infection [5,17], suggesting that they may be valid zoonoses rather than model misclassifications. Across the full dataset, 77.2% of viruses predicted to have very high zoonotic potential were known to infect humans (S1 Table). 
Consequently, studies aimed at confirming human infectivity (e.g., by attempting to infect human-derived cell lines or by serological testing of humans in high-risk populations) while screening viruses in the order suggested by our ranking would have found 23.4% of all known human-infecting viruses in this dataset after screening just the very high zoonotic potential viruses (9.2% of all viruses). More generally, 50% of known human-infecting viruses would have been found after screening the top-ranked 23.3% of viruses and 75% after screening the top 48% of viruses (Fig 1D). In contrast, if relying only on relatedness to known zoonoses, confirming the first 50% of currently known zoonoses would have required screening either 40.2% (taxonomy-based model) or 41.5% (phylogenetic neighborhood–based model) of viruses, a 1.7- to 1.8-fold increase in effort compared to our best model (S6 Fig).
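The bagging and categorization steps described above can be sketched as follows. The sketch keeps the predictions from the best-scoring fraction of iterations, then assigns a category from the mean and a central 95% interval of those predictions relative to the 0.293 cutoff. The use of an empirical percentile interval with linear interpolation, and all function names, are assumptions made for illustration; only the category definitions and the cutoff come from the text.

```python
from statistics import mean


def bag_top_iterations(iteration_probs, iteration_scores, keep_fraction=0.1):
    """Keep one virus's predicted probabilities from the best-scoring iterations."""
    n_keep = max(1, round(len(iteration_scores) * keep_fraction))
    ranked = sorted(range(len(iteration_scores)),
                    key=lambda i: iteration_scores[i], reverse=True)
    return [iteration_probs[i] for i in ranked[:n_keep]]


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 1]) of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def zoonotic_category(probs, cutoff=0.293):
    """Category from the mean and central 95% interval of bagged probabilities."""
    vals = sorted(probs)
    avg = mean(vals)
    lower, upper = percentile(vals, 0.025), percentile(vals, 0.975)
    if upper <= cutoff:   # entire interval at or below the cutoff
        return "low"
    if lower > cutoff:    # entire interval above the cutoff
        return "very high"
    # Interval crosses the cutoff: the mean decides between medium and high.
    return "high" if avg > cutoff else "medium"


# Made-up probabilities for four hypothetical viruses.
categories = [zoonotic_category(p) for p in (
    [0.10, 0.15, 0.20],
    [0.10, 0.20, 0.35],
    [0.20, 0.25, 0.50],
    [0.40, 0.50, 0.60],
)]
```

Sorting viruses by their mean bagged probability, highest first, then gives the screening order behind the cumulative discovery comparison in the preceding paragraph.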

Since genome composition features partly track viral evolutionary history [10], it is conceivable that our models made predictions by reconstructing taxonomy more accurately than the phylogenetic neighborhood estimator or in more detail than available to the taxonomy-based model. We therefore compared dendrograms that clustered viruses by either taxonomy, raw genomic features, or the relative influence of each genomic feature on the model prediction for each virus. The relative influence of each genomic feature on prediction outcomes was measured using the SHapley Additive exPlanations (SHAP) algorithm, which computes the Shapley value for each feature and is increasingly used to improve the interpretability of the decisions made by machine learning models [18]. Shapley values derive from game theory and represent the average marginal contribution of a feature to a prediction across all possible combinations of features [19]. SHAP thus represents complex models as a more interpretable linear combination of values that add up to the final model prediction. As such, SHAP values give a model agnostic measure of how important features are relative to each other when predicting the human infection-ability of a given virus. Here, high levels of similarity in SHAP values between viruses would indicate that they were predicted to have the same human infection status because of the same patterns in their genomic features [20]. Our analysis therefore asked to what extent such similar uses of the same genomic features followed established taxonomic relationships among viruses. While dendrograms using raw feature values closely correlated with virus taxonomy for both human-infecting and other viruses (Baker’s [21] γ = 0.617 and 0.492, respectively, p < 0.001), dendrograms of SHAP similarity had 10.28- and 2.07-fold reduced correlations with virus taxonomy (γ = 0.060 and 0.238, although this was still more correlated than expected by chance, p ≤ 0.008; S7 Fig). 
Among human-infecting viruses, correlations between SHAP similarity-based clustering and virus taxonomy weakened at deeper taxonomic levels, even though the input genomic features provided sufficient information to partially reconstruct virus taxonomy at the realm, kingdom, and phylum levels (S8 Fig). These results indicate that more taxonomic information was available than was utilized by the trained model to predict human infection ability. Interestingly, dendrograms of SHAP similarity showed that even viruses with different genome types—indicating ancient evolutionary divergence or separate origins—clustered together (Fig 2A). Alongside earlier observations on classifier performance (Fig 1A), this suggests that the genome composition-based model outperformed relatedness-based approaches because it found common viral genome features that increase the capacity for human infection across diverse viruses.
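For readers unfamiliar with Shapley values, the following sketch computes them exactly for a tiny toy model by enumerating every combination of features. It only illustrates the definition given above (the average marginal contribution of a feature across all feature combinations); practical SHAP implementations for tree ensembles use far more efficient algorithms, and the toy model, baseline of zero, and function names here are assumptions for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    features = range(n_features)
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


x = (1.0, 2.0, 3.0)  # made-up feature values for one hypothetical virus


def toy_value(present):
    """Toy model output when only features in `present` are known (others at 0)."""
    x0 = x[0] if 0 in present else 0.0
    x1 = x[1] if 1 in present else 0.0
    x2 = x[2] if 2 in present else 0.0
    return 2.0 * x0 + x1 * x2  # one additive term plus one interaction term


phi = shapley_values(toy_value, 3)
```

In this toy case the contributions sum to the difference between the full prediction and the baseline prediction, and the interaction term is split equally between the two features involved. Vectors of such per-feature contributions, one per virus, are the kind of quantity that the SHAP similarity dendrograms cluster.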

Fig 2. Genomic determinants of human-infecting viruses. (A) SHAP value clustering of viruses known to infect humans (primarily human associated, dark purple, and zoonotic, pink) and those with no known history of human infection (blue) shows that similar features predicted human infection across viruses with different genome types (rows). A second set of panels shows the predicted probability of infecting humans for each virus, with the dashed line indicating the cutoff that balances sensitivity and specificity. (B) Relative importance of individual features in shaping predictions, determined by ranking features by the mean of absolute SHAP values across all viruses. Gray lines represent individual features; boxplots show the median, 25th/75th percentiles, and range of ranks for each feature set. (C) Difference in ranks of features when both unreferenced (“Unref.”) and similarity to human genomes (“Sim.”) forms were retained in the final model. Lines are colored according to the highest ranked representation in each pairwise comparison; colors as in B. (D) Composition of the top 25 most important clusters of correlated features shaping predictions. Discrete clusters of correlated features were identified by affinity propagation clustering. Clusters are shown ranked by the combined effect magnitude of constituent features, defined as the sum of mean absolute SHAP values for all features in the cluster, and the exemplar feature of each cluster is provided on the right axis. Bars represent means (± SEM) across 1,000 iterations and are shaded by the proportion of the cluster from each feature set; colors as in B. Numerical data underlying this figure can be found at https://github.com/nardus/zoonotic_rank/tree/main/FigureData (doi: 10.5281/zenodo.4271479). SHAP, SHapley Additive exPlanations. https://doi.org/10.1371/journal.pbio.3001390.g002

Although our analysis was not designed to conclusively identify biological mechanisms underlying genomic predictors of human infection, we nevertheless were able to explore emergent patterns relating to how specific genome composition features and groups of features relate to human infectivity. We first compared the relative influence of features from different genome composition categories (i.e., genomic features versus the 3 sets of human similarity features). Representatives of all genome composition categories were retained in the final model, although we found some evidence that compositional similarity to human housekeeping genes and ISGs influenced predictions more strongly than unreferenced viral genomic features (Fig 2B and 2C, S1 Text). We next explored the influence of individual features on model predictions in more detail. Unsurprisingly, given that GBMs are designed to make predictions from large numbers of weakly informative features [22], no single feature stood out as the driving force, and many features formed correlated clusters (Fig 2D, S9 Fig). More interestingly, many features had complex, nonlinear relationships with human infection (S10 Fig), such that increased similarity to human gene transcripts did not always increase the likelihood of infecting humans (S1 Text). We speculate that this might reflect trade-offs between different features within viral genomes or context dependencies whereby both mimicry of human transcripts (e.g., for improved translation efficiency) or divergence from human transcripts (e.g., for evasion of nucleotide motif-targeting defenses) may occur for different features (S1 Text).

Finally, we carried out 2 case studies to illustrate the utility of our prediction framework. First, we used the combined genome feature–based model to rank 758 virus species that were not present in our training data. We included all species in the most recent International Committee on Taxonomy of Viruses (ICTV) taxonomy release (#35, April 24, 2020) belonging to animal-infecting virus families and which were originally discovered or sequenced from mammals (including humans), birds, 2 insect orders containing common virus vectors (Diptera and Ixodida), or where the sampled host was not reported. This dataset contained representatives from 38 viral families, including 2 (Anelloviridae and Genomoviridae), which were not present in data used to train our model. In total, 70.8% of viruses sampled from humans were correctly identified as having either very high (N = 36) or high zoonotic potential (N = 44; Fig 3A). The remaining human-associated viruses were primarily classified as medium zoonotic potential (N = 30), with 3 species predicted to have low zoonotic potential (Mammalian orthoreovirus and Human associated gemykibivirus 2 and 3; Fig 3A). Within the viral families never previously seen by our model, the majority of human-associated anelloviruses (39/45, 86.6%) were correctly identified as having either very high or high zoonotic potential, consistent with the conclusion that viral genomic features that enhance human infectivity can generalize across viral families. In contrast, all 6 human-associated genomoviruses were classified as either medium or low zoonotic potential. The lower performance on genomoviruses may reflect the unusual genomic structure of this family (circular, single-stranded DNA), which was poorly represented in training (only 2 representatives from the Circoviridae family; S1 Fig) and may impose different selective forces. 
Further, the small genome sizes of genomoviruses (2.2 to 2.4 kb) may complicate calculation of genomic features due to the low number of nucleotides, dinucleotides, and codons available (cf. S3 Fig). Among the 645 viruses with unknown human infectivity that were sequenced from nonhuman animal or potential vector samples, 45.0% were predicted to have either very high (N = 41) or high zoonotic potential (N = 272; S11 Fig, S1 Table). The very high zoonotic potential category was dominated by Papillomaviridae (34.1%) and Peribunyaviridae (19.5%).

Fig 3. Probability of human infection predicted from holdout viral genomes. (A) Predicted probability of human infection for 758 virus species that were not in the training data. Colors show the assigned zoonotic potential categories, with an additional panel showing the host or vector group each virus genome was sampled from. Tick marks along the top edge of the first panel show the location of virus genomes sampled from humans, while a dashed line shows the cutoff that balanced sensitivity and specificity in the training data. The top 25 viruses that were not sampled from humans (contained within the gray box) are illustrated in more detail in (B). Bars show the 95% interquartile range of predicted probabilities across the best performing 10% of iterations (based on the training data), while a solid line (A) or circles (B) show the mean predicted probability from these iterations. Numerical data underlying this figure can be found in S1 Table and at https://github.com/nardus/zoonotic_rank/tree/main/FigureData (doi: 10.5281/zenodo.4271479). https://doi.org/10.1371/journal.pbio.3001390.g003

We next used a beta regression model to explore how predictions of zoonotic potential varied among host and viral groups. As expected given the performance on our training and evaluation data (Fig 1), the 113 virus species that were sequenced from human samples scored consistently higher than those detected in other hosts (p < 0.001; Figs 3A and 4D). Although viruses from putatively high-risk host groups including bats, rodents, and artiodactyls formed a large fraction of our holdout data (with viruses from bats outnumbering even those from humans, S11 Fig), they did not have elevated predicted probabilities of being zoonotic (Fig 4C), and no differences were detected at higher host taxonomic levels (Fig 4A and 4B). This highlights a potential disparity between current sampling efforts for virus discovery/reporting and the distribution of zoonotic risk. In contrast, viruses linked to primates had higher predicted probabilities of infecting humans, even after accounting for human-associated viruses and the effects of virus family (Figs 3 and 4, S11 Fig). That genome composition-based models predicted elevated zoonotic potential in nonhuman primate–associated viruses despite receiving no information on sampled host further supports host-mediated selective processes as a biological basis for our model’s predictions. In addition to relatively rare and small host effects, we observed more pervasive positive and negative effects of virus family on predicted zoonotic status (Fig 4E). Taken together, our results are consistent with the expectation that the relatively close phylogenetic proximity of nonhuman primates may facilitate virus sharing with humans and suggest that this may in part reflect common selective pressures on viral genome composition in both humans and nonhuman primates. However, broad differences among other animal groups appear to have less influence on zoonotic potential than virus characteristics [9].

Fig 4. Factors correlated with the probability of human infection predicted from holdout viral genomes. Partial effects plots are shown for a beta regression model attempting to explain the mean probability assigned by the bagged model to all viruses in Fig 3A, accounting for whether or not the genome predicted was sequenced from arthropods (as opposed to chordates), (A), random effects for the taxonomic class and order of sampled hosts (B and C), whether the sequence derived from a human sample (D), and a random effect for the virus family represented (E). Points indicate partial residuals, while lines and shaded areas respectively show the maximum likelihood and 95% CI of partial effects. CIs that do not include 0 are highlighted in blue. CI, confidence interval. https://doi.org/10.1371/journal.pbio.3001390.g004

Our second case study used coronaviruses to explore the ability of our combined genome feature–based model to distinguish different virus species within the same family and different genomes within a single virus species. Specifically, we predicted the zoonotic potential of all currently recognized coronavirus species, along with 62 human and animal-derived Sarbecovirus genomes all currently classified by the ICTV as Severe Acute Respiratory Syndrome (SARS)-related coronavirus [23]. All known human-infecting coronaviruses were classified as either medium or high zoonotic potential (Fig 5A). We also identified 2 additional animal-associated coronaviruses—Alphacoronavirus 1 and the recently described Sorex araneus coronavirus T14—as being at least as, or more likely to be capable of infecting humans than known, high-ranking, human-infecting coronaviruses; these should be considered high priority for further research. While this manuscript was in revision, a recombinant Alphacoronavirus 1 was detected in nasopharyngeal swabs from pneumonia patients, further strengthening the case that this species may be zoonotic [24]. We further observed variation in predicted zoonotic potential within coronavirus genera, which was consistent with our current understanding of these viruses. Alphacoronavirus and Betacoronavirus (the genera that contain known human-infecting species) also contained nonzoonotic species that were correctly predicted to have low zoonotic potential, while the majority of delta- and gammacoronaviruses received relatively low predictions (Fig 5A). These findings further illustrate the capacity of our models to discriminate risk below the virus family or genus levels.

Fig 5. Probability of human infection predicted from coronavirus genomes. (A) Predictions for currently recognized Coronaviridae species and for 3 variants of SARS-related coronavirus: SARS-CoV (isolate HSZ-Cc, sampled early in the 2003 pandemic), SARS-CoV-2 (isolate Wuhan-Hu-1, sampled early in the current pandemic), and the closely related RaTG13 (sampled from Rhinolophus affinis in 2013). A dendrogram illustrates taxonomic relationships, with abbreviated genus names annotated on the right. Arrows highlight known human-infecting species. Asterisks indicate species absent from the training data, also present in Fig 3A. (B) Predictions for different representatives of SARS-related coronavirus. The isolation source of animal-associated genomes is indicated in parentheses. A maximum likelihood phylogeny illustrates relationships and was created as described in [6]. The outgroup, BtKy72 (sampled in Kenya in 2007), is not shown. In both panels, bars show the 95% interquartile range of predicted probabilities across the best performing 10% of iterations excluding the species being predicted, while circles show the mean predicted probability from these iterations. Numerical data underlying this figure can be found in S1 Table (panel A) and at https://github.com/nardus/zoonotic_rank/tree/main/FigureData (panel B; doi: 10.5281/zenodo.4271479). MERS-CoV, Middle East Respiratory Syndrome–related Coronavirus; M. ricketti CoV Sax-2011, Myotis ricketti alphacoronavirus Sax-2011; NL63-related bat CoV, NL63-related bat coronavirus strain BtKYNL63-9b; N. velutinus CoV SC-2013, Nyctalus velutinus alphacoronavirus SC-2013; R. ferrumequinum CoV HuB-2013, Rhinolophus ferrumequinum alphacoronavirus HuB-2013; SARS, Severe Acute Respiratory Syndrome; SARS-CoV-2, Severe Acute Respiratory Syndrome Coronavirus 2. https://doi.org/10.1371/journal.pbio.3001390.g005

Among sarbecoviruses, most genomes (85.5%) were classified as having medium zoonotic potential, including the causal agent of the 2003 SARS outbreak (Fig 5B). Interestingly, however, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2; the causative agent of the current Coronavirus Disease 2019 [COVID-19] pandemic), the closely related RaTG13 from a rhinolophid bat, and all 5 closely related pangolin-associated isolates tested were predicted to have high zoonotic potential (although CIs between all sarbecoviruses tested overlapped, Fig 5B). Importantly, these predictions were made using iterations of our model that excluded the 2003 SARS-CoV genome or any other sarbecovirus from training. This finding, together with our observation that relatively few other animal-infecting, allegedly nonzoonotic coronaviruses had similarly high scores, suggests that the elevated risk of SARS-CoV-2 and closely related genomes discovered in animals could have been anticipated via sequencing-based surveillance and might have led to actionable research or surveillance prior to the zoonotic emergence of any sarbecovirus (Fig 5).


[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001390

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/
