Prediction of virus-host associations using protein language models and multiple instance learning [1]
Authors: Dan Liu (MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom), Francesca Young (School of Computing Science, University of Glasgow), Kieran D. Lamb, David L. Robertson, Ke Yuan
Date: 2024-12
Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and to discover whether newly identified viruses can infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts host species for viruses from viral sequences only. It also identifies the viral proteins that contribute most to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physiochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6%, and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve AUC values above 0.95 for all prokaryotic hosts, and from roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.
Predicting which viruses can infect which host species, and identifying the specific proteins involved in these interactions, are fundamental tasks in virology. Traditional methods for predicting these interactions rely on identifying common features among proteins, overlooking the structure of the protein "language" encoded in individual proteins. We have developed a novel method that combines a protein language model and multiple instance learning to allow host prediction directly from protein sequences, without the need to extract features manually. This method significantly improved prediction accuracy and revealed key proteins involved in virus-host interactions.
Funding: DL is funded by European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF). The authors also acknowledge support from the following grants: the Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, MR/V01157X/1) to DLR, a Doctoral Training Programme in Precision Medicine studentship for KDL, MR/N013166/1, the Biotechnology and Biological Sciences Research Council (BBSRC, BB/V016067/1) to DLR, FY and KY, and Engineering and Physical Sciences Research Council (EPSRC, EP/R018634/1) to KY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2024 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Fig 1. Overview of the EvoMIL workflow. (A) Protein sequences of viruses and virus-host associations are collected from the VHDB [10]. For each host, we collect the same number of positive and negative viruses; embeddings of the viral protein sequences are then obtained from the pre-trained transformer model [7] and used as features for host prediction with attention-based MIL. (B) Protein sequences of viruses are split into sub-sequences, which are used as input to the pre-trained transformer model to obtain the corresponding embeddings. (C) Each virus carries a host label for its set of protein sequences, and attention-based MIL is applied to train a model for each host dataset from the protein embeddings of its viruses. Finally, we predict the host label for each virus and assign an instance weight that represents the importance of each protein to the prediction.
In this paper, we introduce EvoMIL, a method for predicting virus-host associations that combines (Evo)lutionary Scale Modeling with (M)ultiple (I)nstance (L)earning (Fig 1). EvoMIL uses the ESM-1b model [7] to transform viral protein sequences into embeddings (i.e., numerical vectors) that are then used as features for virus-host classification. Multiple instance learning allows us to treat each virus as a "bag" of proteins. We demonstrate that the embeddings capture the host signal in viral sequences, achieving high prediction scores at the species level for both prokaryotic and eukaryotic hosts. Furthermore, attention-based MIL enables us to identify which proteins are most important in driving the prediction and, by implication, in determining virus-host specificity.
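To make the embedding step concrete, the sketch below shows one way to turn each viral protein into a fixed-size vector with ESM-1b. It assumes the fair-esm package and PyTorch; the sub-sequence length and mean-pooling are illustrative choices, not necessarily EvoMIL's exact settings.

```python
# A minimal sketch, assuming the fair-esm package (pip install fair-esm)
# and PyTorch; the chunking and mean-pooling choices are illustrative.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed_protein(name, seq, max_len=1022, layer=33):
    """Embed one protein; sequences beyond ESM-1b's 1022-residue input
    limit are split into sub-sequences whose embeddings are averaged."""
    chunk_vecs = []
    for start in range(0, len(seq), max_len):
        chunk = seq[start:start + max_len]
        _, _, tokens = batch_converter([(name, chunk)])
        with torch.no_grad():
            out = model(tokens, repr_layers=[layer])
        reps = out["representations"][layer]
        # Mean-pool over residue positions, dropping the BOS/EOS tokens.
        chunk_vecs.append(reps[0, 1:len(chunk) + 1].mean(dim=0))
    return torch.stack(chunk_vecs).mean(dim=0)  # 1280-dim vector

# A virus then becomes a "bag" of per-protein embeddings:
bag = torch.stack([embed_protein("P1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
                   embed_protein("P2", "MADELKHLNEIRLISGQA")])
```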
The combination of the two approaches is particularly suited to virus-host prediction, as viral proteins collectively contribute to the association with a host. Instead of relying on predefined features, protein language models provide automatically learned features, free from the design biases and limitations of previous approaches. The ability to measure similarities and differences between protein sequences further boosts prediction performance through multiple instance learning, in which the viral proteins enabling interaction with hosts are highlighted through unbiased weight estimation.
Here, we present a virus-host prediction model combining protein language models (PLMs) and multiple instance learning (MIL). Transformers are self-supervised deep learning models [6] that learn the relationships among words within a sentence, and now dominate the field of natural language processing. More recently, the same architecture has been applied in biology, where words are replaced by amino acids and sentences by protein sequences. These transformer-based protein language models generate protein embeddings that encode structural features inferred from amino acid sequences in large-scale protein databases [7]. Protein language models are trained on publicly available protein sequence archives and learn biological information ranging from the physiochemical properties of individual amino acids to the structural and functional properties of whole proteins. Multiple instance learning (MIL) is a form of supervised learning that was developed for image processing tasks [8]. Instead of using individually labelled instances for classification, multiple instances are arranged together in a bag with a single label and classified together. We use attention-based MIL [9], which has the additional advantage of weighting the instances in a bag, thereby indicating the importance of each instance to the prediction.
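The following is a minimal PyTorch sketch of attention-based MIL pooling in the style of Ilse et al. [9]; the layer sizes and single-logit classifier are illustrative assumptions, not the paper's exact architecture.

```python
# Attention-based MIL pooling: score each instance, softmax the scores
# into bag-level weights, and classify the weighted sum of instances.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=1280, attn_dim=128):
        super().__init__()
        # The attention network scores each instance (protein) in the bag.
        self.attention = nn.Sequential(
            nn.Linear(in_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(in_dim, 1)  # binary host label

    def forward(self, bag):
        # bag: (n_instances, in_dim) protein embeddings for one virus
        scores = self.attention(bag)              # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)    # sum to 1 over the bag
        pooled = (weights * bag).sum(dim=0)       # (in_dim,) bag embedding
        logit = self.classifier(pooled)           # bag-level prediction
        return logit, weights.squeeze(-1)         # weights rank proteins

model = AttentionMIL()
logit, protein_weights = model(torch.randn(45, 1280))  # a 45-protein virus
```

The learned weights are what make the method interpretable: after training, the highest-weighted instances in a bag indicate which proteins drove the host prediction.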
A number of computational approaches have been developed to predict unknown virus-host species associations. The coevolution of a virus and its host leaves signals in virus genomes arising from the virus-host interaction. These signals have been exploited for in silico prediction of virus-host associations from virus genomes alone, and the approaches fall into two broad types: 1) alignment-based approaches that search for homology, such as prophages [2] and CRISPR-Cas spacers [3, 4]; and 2) alignment-free methods that use features such as k-mer composition, codon usage, and CpG content to measure the similarity of viral sequences to host sequences or to other viruses with a known host [5]. To date, no computational approaches consider the structure of viral proteins for host species prediction.
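For illustration, the alignment-free k-mer composition feature mentioned above can be computed as in the minimal sketch below; the choice of k = 4 is arbitrary.

```python
# Normalised DNA k-mer frequency vector, a classic hand-crafted
# alignment-free feature that learned embeddings replace in EvoMIL.
from itertools import product

def kmer_frequencies(genome, k=4):
    """Return the frequency of every DNA k-mer in a genome sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]

features = kmer_frequencies("ATGCGATTACAGGT" * 10, k=4)  # 256-dim vector
```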
Advances in sequencing technologies, particularly metagenomics, have resulted in the identification of many new viruses. However, more than 90% of the virus sequences held in publicly available databases are not annotated with any host information [ 1 ]. Currently, there are no high-throughput experimental methods that can definitively assign a host to these uncultivated viruses. With a growing number of viruses being discovered, relying only on experiments to identify virus-host associations is a limiting step in this important challenge.
Results
Dataset for predicting virus-host association

Balanced binary datasets were generated from known virus-host associations documented in the Virus-Host Database (VHDB) [10] for all hosts with a minimum threshold number of associations. These datasets consist of either all prokaryotic or all eukaryotic viruses. 'Positive' viruses are those reported to be associated with the given host species; a matching number of 'negative' viruses are randomly sampled from all other prokaryotic or eukaryotic viruses. The prokaryote datasets consist of nearly all dsDNA (double-stranded DNA) viruses, which have 45 to 212 proteins encoded in their genomes (S1 Table), while the eukaryotic datasets include many RNA viruses that contain fewer proteins, ranging from 2 to 23 protein sequences (S2 Table).

The performance of MIL improves with a higher number of instances in each bag, so the eukaryotic training datasets require a higher threshold on the number of viruses to achieve similar performance. Accordingly, we set the minimum positive dataset size to 50 viruses for constructing prokaryotic binary datasets and 125 for eukaryotic ones; the aim of these thresholds is to generate a sufficient number of training samples for MIL in each case. This yielded 15 prokaryotic and 5 eukaryotic host datasets for the binary classification tasks.

To evaluate the performance of the binary models, we created balanced sets of positive and negative samples, using two different strategies to sample negative viruses from those with no known association with each host identified above. Given that the actual associations are unknown, this sampling is susceptible to false-negative labels. Strategy 1 was used to establish the concept of EvoMIL: we sampled negative viruses from viruses belonging to different genera than the positive viruses, with the aim of minimising false negatives in the dataset. Strategy 2 aimed to make the task progressively more challenging, exploiting the fact that, as a result of coevolution and co-speciation, similar viruses tend to infect similar hosts. Here we selected negative viruses from those that infect hosts sharing a taxonomic rank with the positive host, from phylum down to genus, so the classifier had to distinguish between increasingly similar viruses. Under Strategy 2 the negative and positive samples are more likely to share proteins exhibiting structural mimicry [11], making it challenging to train classifiers sensitive enough to capture the difference between positive and negative samples. (A sketch of both sampling strategies is given below.)

The number of viruses related to each host is shown in S1 Table. The largest prokaryote dataset is Mycolicibacterium smegmatis with 838 known viruses, followed by Escherichia coli with approximately half that number. For the eukaryotic datasets, Homo sapiens has by far the largest number of known virus species (1321), with the next highest being tomato (Solanum lycopersicum) at 277 (see S2 Table). The distribution of the top 10 virus families can be found in S1 Fig.
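As referenced above, the two negative-sampling strategies can be sketched as follows. The pandas table and its column names (virus_genus, host, host_phylum, and so on) are hypothetical placeholders for a VHDB-derived association table, not EvoMIL's actual data structures.

```python
# Hedged sketch of the two negative-sampling strategies; column names
# are hypothetical placeholders for a VHDB-derived association table.
import pandas as pd

def sample_negatives(vhdb: pd.DataFrame, host: str, strategy: int,
                     rank: str = "phylum", seed: int = 0) -> pd.DataFrame:
    pos = vhdb[vhdb["host"] == host]
    rest = vhdb[vhdb["host"] != host]
    if strategy == 1:
        # Strategy 1: draw negatives from viruses in different genera
        # than any positive virus, minimising false-negative labels.
        candidates = rest[~rest["virus_genus"].isin(pos["virus_genus"])]
    else:
        # Strategy 2: draw negatives from viruses whose hosts share the
        # given taxonomic rank with the positive host, so the task gets
        # harder as rank moves from phylum down to genus.
        candidates = rest[rest[f"host_{rank}"] == pos[f"host_{rank}"].iloc[0]]
    # Balance the dataset: one negative virus per positive virus.
    return candidates.sample(n=len(pos), random_state=seed)
```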
Approximately 60% of viruses associated with prokaryotes belong to the Siphoviridae family (see S1A Fig), whereas the Geminiviridae, Picornaviridae and Papillomaviridae families are the top three families in eukaryotic hosts, each accounting for roughly 18% of viruses associated with eukaryotic hosts (see S1B Fig). Note that viruses associated with eukaryotes are more diverse than those associated with prokaryotes.
EvoMIL achieves high performance for binary virus-host prediction

Embedding vectors for each protein of a virus, generated with the protein language model ESM-1b, were used as instances in a "virus bag" for MIL. These labelled bags were used to train the MIL model with 5-fold cross-validation on 80% of each dataset, and each fold's model was then evaluated on the remaining 20%. Each model is evaluated with a range of metrics: AUC, accuracy, F1 score, sensitivity, specificity, and precision (a sketch of this evaluation is given below). We evaluated the predictive performance of EvoMIL for binary classification using the datasets generated with both Strategy 1 and Strategy 2 above, training a prediction model for each host.

Prokaryotic and eukaryotic host performance. The heatmaps of evaluation metrics for the prokaryotic and eukaryotic host classifiers are presented in Fig 2A and 2C. Here, metrics are calculated for the best-performing model (highest AUC) from the 5-fold cross-validation. In Fig 2A, the accuracy is higher than 0.9 for all but two hosts, which reach 0.86. The ROC curves in Fig 2B show that all prokaryotic classifiers perform very strongly, with every host achieving an AUC greater than 0.95 and 8 achieving an AUC of 1.
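As a concrete illustration of the evaluation above, the sketch below computes the same metrics for one held-out fold with scikit-learn; y_true and y_score are hypothetical bag-level labels and predicted probabilities.

```python
# Compute AUC, accuracy, F1, sensitivity, specificity, and precision
# for one test fold; the 0.5 decision threshold is an assumption.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate_fold(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),  # recall on positives
        "specificity": recall_score(y_true, y_pred, pos_label=0),
        "precision": precision_score(y_true, y_pred),
    }

metrics = evaluate_fold([1, 0, 1, 1, 0], [0.9, 0.2, 0.8, 0.4, 0.1])
```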
Fig 2. Performance of binary classification tasks. Heatmaps show AUC, accuracy, F1 score, sensitivity, specificity, and precision for the 15 prokaryotic (A) and 5 eukaryotic (C) host binary classifiers, with negative samples selected by Strategy 1. ROC curves for the 15 prokaryotic hosts (B) and 5 eukaryotic hosts (D) correspond to heatmaps A and C. Box plots show AUC values at different taxonomic ranks for prokaryotic (E) and eukaryotic (F) hosts, with negative samples selected using Strategy 2.
https://doi.org/10.1371/journal.pcbi.1012597.g002

We also obtained the mean and standard deviation for each host by testing the 5-fold cross-validation models on the host test dataset (see S3 Table). EvoMIL shows good performance, with 14/15 hosts achieving a mean AUC greater than 0.9. Overall, our results demonstrate that EvoMIL performs impressively in the binary classification of viruses associated with prokaryotic hosts. Further evaluation metrics are included in S3 Table.

The accuracy of each eukaryotic host classifier is shown in Fig 2C: all hosts achieve an accuracy higher than 0.8 except for two, which are roughly 0.7. H. sapiens obtained the highest accuracy (0.84) and Mus musculus the lowest (0.69). The ROC curves of the 5 eukaryotic host classifiers are presented in Fig 2D. Although the eukaryote classifiers achieve good performance, with AUCs above 0.77, they perform less well than the prokaryote classifiers, with only 3/5 datasets scoring an AUC above 0.85. There may be several explanations for the lower performance. Firstly, the average number of proteins per virus is much lower, resulting in small "bags" for MIL. Secondly, there is a much higher diversity of virus types in the eukaryotic host datasets, which often contain viruses from multiple Baltimore classes. Viruses from these different classes are polyphyletic, meaning they have no common ancestor, share no genes, and interact with different host pathways. The mean and standard deviation for each host are obtained by testing the five trained cross-validation models on the test dataset (see S4 Table). Here, the mean AUC is higher than 0.85 except for two hosts that perform less well, with AUC scores of 0.761±0.01 for Sus scrofa and 0.762±0.02 for M. musculus. Overall, our results demonstrate that EvoMIL performs well in binary classification tasks for viruses associated with eukaryotic hosts.

Sampling negative samples from similar viruses makes binary host classification more challenging. Next, we test our model on more challenging tasks. Using the second strategy of selecting negative viruses, those associated with hosts sharing taxonomic ranks with the hosts of the positive viruses, we observe that the classification task becomes increasingly challenging as we move from the phylum to the genus level. The results show that our EvoMIL models achieve high AUC scores, but that distinguishing between viruses of similar hosts is more difficult, with a noticeable drop in performance at the family and genus levels. In Fig 2, the box plots show the AUC values of prokaryotic (E) and eukaryotic (F) hosts based on negative-selection Strategy 2 at five taxonomic ranks: genus, family, order, class, and phylum. The phylum level (lime colour) shows a marked improvement over the lower ranks, especially genus. Note that at the lower taxonomic ranks there are only sufficient negative viruses to meet our threshold of 50 for 4 hosts at the genus level, 8 hosts at the family level, and 13 hosts at the order level.

To quantify the difficulty of the task, we computed sequence similarity scores between all pairs of positive and negative sets under Strategy 2 using MMseqs2 [12] (see S2 and S3 Figs); a sketch of this step is given below. Most of the scores are above 0.6. Looking at the scores across taxonomic ranks, the phylum (purple) level tends to have lower similarities, while the family (orange) and order (green) levels tend to exhibit higher identity scores.
These results indicate a high degree of sequence similarity between positive and negative viruses, and therefore a challenging classification task.
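The similarity computation referenced above could be scripted as in the hedged sketch below. It assumes the mmseqs binary is installed and on the path; the FASTA file names are placeholders, and the identity value is read from the third column of easy-search's default tabular output.

```python
# A hedged sketch of scoring positive-vs-negative sequence similarity
# with MMseqs2; file names are placeholders for per-host FASTA sets.
import subprocess

def max_identity_to_negatives(positive_fasta, negative_fasta,
                              out="hits.tsv", tmp="tmp"):
    # Search every positive viral sequence against the negative set.
    subprocess.run(["mmseqs", "easy-search",
                    positive_fasta, negative_fasta, out, tmp], check=True)
    best = {}
    with open(out) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            query, identity = fields[0], float(fields[2])
            best[query] = max(best.get(query, 0.0), identity)
    # Highest identity of each positive sequence to any negative one.
    return best
```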
---
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012597