(C) PLOS One

(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .

Deep mining of the Sequence Read Archive reveals major genetic innovations in coronaviruses and other nidoviruses of aquatic vertebrates [1]

['Chris Lauber', 'Institute For Experimental Virology', 'Twincore Centre For Experimental', 'Clinical Infection Research', 'A Joint Venture Between The Hannover Medical School', 'Mhh', 'The Helmholtz Centre For Infection Research', 'Hzi', 'Hannover', 'Cluster Of Excellence Resist']

Date: 2024-05

Virus discovery by genomics and metagenomics empowered studies of viromes, facilitated characterization of pathogen epidemiology, and redefined our understanding of the natural genetic diversity of viruses with profound functional and structural implications. Here we employed a data-driven virus discovery approach that directly queries unprocessed sequencing data in a highly parallelized way and involves a targeted viral genome assembly strategy in a wide range of sequence similarity. By screening more than 269,000 datasets of numerous authors from the Sequence Read Archive and using two metrics that quantitatively assess assembly quality, we discovered 40 nidoviruses from six virus families whose members infect vertebrate hosts. They form 13 and 32 putative viral subfamilies and genera, respectively, and include 11 coronaviruses with bisegmented genomes from fishes and amphibians, a giant 36.1 kilobase coronavirus genome with a duplicated spike glycoprotein (S) gene, 11 tobaniviruses and 17 additional corona-, arteri-, cremega-, nanhypo- and nangoshaviruses. Genome segmentation emerged in a single evolutionary event in the monophyletic lineage encompassing the subfamily Pitovirinae. We recovered the bisegmented genome sequences of two coronaviruses from RNA samples of 69 infected fishes and validated the presence of poly(A) tails at both segments using 3’RACE PCR and subsequent Sanger sequencing. We report a genetic linkage between accessory and structural proteins whose phylogenetic relationships and evolutionary distances are incongruent with the phylogeny of replicase proteins. We rationalize these observations in a model of inter-family S recombination involving at least five ancestral corona- and tobaniviruses of aquatic hosts. In support of this model, we describe an individual fish co-infected with members from the families Coronaviridae and Tobaniviridae. Our results expand the scale of the known extraordinary evolutionary plasticity in nidoviral genome architecture and call for revisiting fundamentals of genome expression, virus particle biology, host range and ecology of vertebrate nidoviruses.

Research in virology is primarily motivated by human pathogens, such as SARS-CoV-2 in the case of the family Coronaviridae in the order Nidovirales. Studies of these and few model viruses describe virus-host interactions on the molecular level and are essential for developing virus control measures, but they must accommodate a vast range of viral natural diversity to allow generalizations. Here, we redefine our understanding of the genetic and genomic diversity in corona- and other nidoviruses of poorly sampled hosts. We mine more than 269,000 publicly accessible raw sequencing datasets for the presence of viral sequences using high-performance computing and discover 40 nidoviruses including 13 coronaviruses from a wide range of vertebrates. Some of the novel viruses from aquatic hosts have extraordinary features such as segmented genomes and recombinant genes coding for structural proteins. Our study suggests that gene exchange between diverse nidovirus species from different virus families might be more frequent than previously thought and can result in abrupt genomic innovations that in turn might facilitate host jumps even across vertebrate class borders. The growing list of newly discovered (corona)viruses enables an evolutionary perspective across virus divergency scales in different hosts on the wet lab-acquired knowledge about few viruses.

Funding: CL and TP are supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2155 - project number 390874280. CL, RB and SS acknowledge support of the Project “Virological and immunological determinants of COVID-19 pathogenesis – lessons to get prepared for future pandemics (KA1-Co-02 “CoViPa”)”, a grant from the Helmholtz Association’s Initiative and Network Fund. The coronavirus research in the lab of TP is supported by the Niedersächsisches Ministerium für Wissenschaft und Kultur (Ministry for Science and Culture of Lower Saxony) (grant 14-76103-184 CORONA-13/20). BWN acknowledges funding from the Texas A&M-Grants program OR is supported by the European Research Council (ERC) under the European Union’s Horizon research and innovation program (MALEPREG: eu-repo/grantAgreement/EC/H2020/855659) and the German Research Foundation (RO-4628/9-1; RO4628/3-2; RO 4628/3-3). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Here we introduce an original, highly parallelized computing workflow that has a sequence homology search with advanced sensitivity at its core and implements a targeted assembly approach to reconstruct full-length viral genome sequences. By applying this approach to more than 269,000 SRA datasets we reconstructed genome sequences of numerous vertebrate nidoviruses. A subset thereof are prototype members of 18 tentative novel genera of nidoviruses, including novel coronavirus genera as well as five tentative novel nidovirus subfamilies. The newly discovered viruses include 11 coronaviruses with bisegmented genomes that form a monophyletic lineage and are members of subfamily Pitovirinae, family Coronaviridae infecting aquatic hosts. We recover the sequences of both segments from newly generated RNA samples of 69 infected fishes. Moreover, we describe a new coronavirus with a giant genome of 36.1 kb that encodes two genes with significant sequence similarity to the S gene of other corona- and tobaniviruses. The identification of leader and body transcription regulatory sequences (TRSs) provide evidence for the production of subgenomic RNAs in two of the discovered nidoviruses. Our comparative genomics and phylogenetic analyses suggest a possible swapping of structural proteins between ancestors of subsets of corona- and tobaniviruses.

The presence of characteristic nidoviral protein domains, ZBD and NiRAN, and phylogenetic clustering using RdRp and HEL1 allow reliable identification of nidoviruses by comparative genomics. Following the emerging trend in virology, recently described nidoviruses have been discovered by bioinformatics analysis of next or third generation sequencing data from meta-genomic and -transcriptomics studies of diverse specimens [ 9 , 10 , 37 – 50 ]. These datasets are composed of overlapping sequence fragments of variable lengths (so-called reads) and various origins that can be assembled into contigs, some of which may represent full-length or partial viral genomes. Discrimination of viral from non-viral contigs is typically achieved by sequence-based comparisons involving known reference organisms, including viruses. Depending on the sensitivity of the method and the degree of divergence of the sequences in a sample, a fraction of sequences usually remains unclassified; this sequence space is often called ‘dark matter’ and may include undescribed highly divergent viruses [ 51 ]. Both assembled and unprocessed sequencing data are deposited in public databases, making them available for (re-)analysis by the scientific community. Examples include the Transcriptome Shotgun Assembly (TSA) database, the Whole Genome Shotgun (WGS) database and the Sequence Read Archive (SRA) whose sizes grow with a non-linear rate. The latter stores unprocessed, primary sequencing data along with often detailed metadata annotation, and it has been demonstrated that the SRA and similar data repositories are a rich source of hitherto unknown novel viral sequences [ 33 , 52 – 55 ].

There is accumulating evidence for homologous recombination involving various genomic regions including the S ORF within coronavirus species and between coronaviruses of closely related species [ 26 – 29 ]. We are not aware of comparable evidence for homologous recombination between members of different nidovirus families, although such studies may be complicated by the limited inter-family conservation. On the other hand, a pivotal role of heterologous recombination in the generation of nidovirus diversity is documented. It is evident in the duplication of protein domains and the restricted phyletic distribution of many conserved proteins to some nidovirus lineages and their phylogenetic links to non-nidovirus homologs [ 30 ]. In this respect, nidoviruses are no different than other RNA and DNA viruses [ 31 – 34 ]. The above characterization is dominated by studies involving comparative genomics which define our understanding about the type and scale of recombination. Quantifying incongruences of phylogenetic trees for different genome regions is a main approach to the characterization of homologous recombination in RNA viruses, including nidoviruses [ 27 , 29 , 35 , 36 ].

Coronaviruses express four structural proteins from their 3’ORFs encoded in the order: spike glycoprotein (S), envelope protein (E), matrix protein (M) and nucleocapsid phosphoprotein (N); this genome region may also encode non-essential accessory proteins [ 23 , 24 ]. The C-terminal half of S (S2) is well conserved within the family Coronaviridae and includes determinants of lipid association and infectivity; this conservation partly extends also to the sister family Tobaniviridae (formerly known as subfamily Torovirinae, family Coronaviridae) [ 25 ]. Otherwise, there is little to no similarity reported at the sequence level between coronaviral structural proteins and those of other nidoviruses.

Comparative genomics played a pivotal role in advancing our understanding of coronaviruses and other nidoviruses by assigning putative functions to many nidovirus proteins, which were subsequently confirmed and elaborated in experimental studies [ 14 – 19 ]. All nidoviruses express a conserved array of five protein domains in ORF1a/ORF1b that control genome expression and replication. These include i) the 3C-like main protease (3CLpro or Mpro) flanked by highly variable but ubiquitous transmembrane domains, ii) the nidovirus RNA-dependent RNA polymerase (RdRp)-Associated Nucleotidyltransferase (NiRAN), iii) the RdRp, iv) a zinc-binding domain (ZBD) and v) a superfamily 1 helicase (HEL1). NiRAN and ZBD have no known virus homologs outside the nidoviruses and are therefore considered to be genetic markers of nidoviruses [ 15 , 20 ]. Nidoviruses with genomes exceeding 20 kb additionally encode an exoribonuclease (ExoN) with proofreading activity that has been linked to genome expansion by improving the otherwise low fidelity of the RdRp [ 20 – 22 ].

A hallmark of corona- and other nidoviruses is the exceptionally large size range of their single-segment genomes that include at the upper end the 35.9 kb genome of Aplysia abyssovirus 1 (AAbV) [ 9 ] and the largest known RNA virus genome of 41.1 kb from the planarian secretory cell nidovirus (PSCNV) [ 10 ]. Most nidovirus genomes have the canonical architecture (from 5’ to 3’): 5’ untranslated region (5’UTR), open reading frame (ORF) 1a, ORF1b, 3’-proximal ORFs (3’ORFs) and 3’UTR. Products encoded in ORF1a/b are generated by translation of the genomic RNA, comprising a -1 ribosomal frameshift in the region of the ORF1a/b overlap [ 11 ]. The 3’ORFs are expressed via subgenomic RNAs whose numbers vary between nidovirus species [ 12 , 13 ].

Nidoviruses form 14 virus families in the order Nidovirales for which the International Committee on Taxonomy of Viruses (ICTV) currently recognizes 48 genera and 130 species in total [ 1 – 3 ]. The members of eight nidovirus families (Arteriviridae, Coronaviridae, Cremegaviridae, Gresnaviridae, Nangoshaviridae, Nanhypoviridae, Olifoviridae, Tobaniviridae) have vertebrate hosts while those of the remaining six families (Abyssoviridae, Euroniviridae, Medioniviridae, Mesoniviridae, Monidoviridae, Roniviridae) infect invertebrates. The nidovirus family Coronaviridae has attracted unparalleled scientific and public attention due to the emergence of the human pathogens severe acute respiratory syndrome coronavirus (SARS-CoV) in 2002, Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 and the SARS-CoV-2 in 2019 [ 4 – 7 ], although studies on coronaviruses have fueled research in virology and beyond for decades [ 8 ].

Results

A virus discovery approach targeting low sequence similarities in unprocessed SRA data Our three-stage computational approach involves the sensitive sequence similarity-based detection of viral sequence reads in a raw sequence dataset followed by the assembly of full-length viral genome sequence(s) and virus assignment. To achieve this with reasonable speed, we queried the in silico translated primary sequencing data (raw reads) in a highly parallelized fashion using profile Hidden Markov Models (pHMMs) of proteins characteristic of a virus group, often including the RdRp, and a targeted viral genome assembly method. At the first stage, called Virushunter, we identified most conserved sequences of viruses that may belong to a group of interest. We aimed for ensuring detection of divergent viral sequences with sequence identity to viral reference proteins well below 35% in the ‘twilight zone’ of protein sequence similarity [56]. The identified hits served as seeds at the next stage, called Virusgatherer. These seeds were gradually extended with overlapping reads to assemble a genomic sequence or its segment as complete as technically feasible, which was one of the main objectives of this study. At the third stage, Virusassignment, the assembled sequence was assigned to the group of interest or another related virus group. We exclusively utilized non-commercial high performance computing infrastructure that is free of charge for scientific purposes. With the aim to assure reliability of the viral genome assemblies and to enable future comparisons of assembly quality between different studies, we developed two novel metrics that quantitatively assess the per-base and contig-wide accuracy of the genome sequences (see Materials and Methods for details). These metrics are applicable across datasets and we foresee that this or similar approaches could be adopted as community standards. We considered to quantify genome completeness, defined here as coverage of the complete protein coding part of the genome, the complete 3’UTR including a poly(A) tail and a complete or partial 5’UTR. However, we found it unrealistic, because this metric would require comparisons with closely related reference sequences that are rarely available, in particular for novel divergent viruses, which dominated our dataset. To assess contig-wide sequence assembly accuracy, we computed the MInimal Coverage Depth (MICO) of a contig. First, we determined the position(s) of the contig to which the lowest number of sequencing reads align and then declaring this number of reads to be the contig mico (Fig 1A). To deduce MICO values, we then mapped the mico to deciles of the mico distribution that we computed for a reference assembly set consisting of 2350 RNA virus sequences, which were assembled from vertebrate and invertebrate datasets by others and verified by Sanger sequencing [37,57]. This resulted in possible MICO values for our nidovirus contigs in the range of 1 to 10, with MICO = 1 assigned to contigs with mico values in the lowest 10% of mico values of the reference set, while MICO = 10 is given to contigs in the highest 10% of reference mico values. To assess the per-base accuracy of our contigs, we computed the Mean Alignment Score (MEAS) of a contig by calculating the average alignment score across all sequencing reads overlapping with a sequence position and then averaging this value across all sequence positions (meas) (Fig 1A). We mapped the meas values to deciles of a reference set distribution to derive MEAS, as we did for MICO. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 1. Assembly quality assessment. (A) Toy example visualizing how meas (left) and mico (right) assembly quality metrics are calculated. Alignment scores used for meas were calculated using Bowtie2 and have a maximum value of zero corresponding to reads aligning full-length without mismatches. (B) Distribution of meas and mico values obtained for the nidoviral sequences discovered and assembled in this study (green) and for 2350 reference RNA virus sequences (gray) [37,57]. Both x-axes are in log 10 scale. https://doi.org/10.1371/journal.ppat.1012163.g001 The MICO and MEAS values are readily interpretable and comparable across datasets as a particular assembled sequence is assigned to one of ten possible quality categories defined by a reference population of assemblies. This reference population should be formed by assemblies of known and acceptable quality, and it could be continuously expanded in the future, further improving this quality assessment method.

Discovery of novel vertebrate nidoviruses With the aim to systematically screen unprocessed, primary sequencing data from the SRA databank we built on two earlier pilot studies [33,52] and analyzed 269,184 transcriptomic datasets. We included in our screen sequencing runs from vertebrates excluding those from highly over-represented model organisms like zebrafish, mouse and rat, as well as human (as of February 2022 for fishes and September 2020 for other vertebrates). For control purposes, we included several SRA transcriptome datasets of Aplysia californica, Schmidtea mediterranea and Microhyla fissipes from which three divergent nidoviruses had been discovered recently [9,10]. The total selection amounted to 428.6 terabyte of (compressed) data. We used our newly developed approach to scan the vertebrate SRA data (Virushunter stage) and obtained 1,924 significant hits (E-value < 1x10-4) in 0.6% of the analyzed sets that comprised a large variety of host taxa (Fig 2). We then conducted a targeted virus genome assembly for these SRA datasets (Virusgatherer stage) by starting with a seed formed by the respective sequencing reads identified in the first Virushunter stage and iteratively extending the contig sequence using additional reads that (partially) align to the contigs ends, until no further matching reads were found. The resulting contigs were subsequently filtered to remove any remaining non-viral sequences (blastx against non-viral portion of RefSeq-Protein with E-value cut-off of 1x10-4) and to retain sequences of at least 1000 nucleotides in length. Using a profile search specific for nidoviruses with nidovirus-wide NiRAN and RdRp pHMMs as query, we selected those contigs that showed highest sequence similarity; we considered the selected contigs as belonging to nidoviruses (Virusassignment stage; see Materials and Methods for further details). The majority of assembled viral sequences were found in samples from host orders Artiodactyla and Primates and often matched Porcine reproductive and respiratory syndrome virus (PRRSV) 1, PRRSV-2, MERS-CoV or SARS-CoV that were used in experimental infections of the respective laboratory animals (Fig 2). Likewise, we also reassembled genomes of two recently described, divergent nidoviruses with giant genome sizes from a sea slug (Aplysia californica) and a flatworm (Schmidtea mediterranea) as well as a novel corona-like virus from a frog (Microhyla fissipes) identified in our screen [9,10]. These confirmatory results could be considered a positive control of the novel viruses discovered in unrelated experiments from a wide range of hosts; they further validate our approach. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 2. Virus discovery in the SRA. All numbers in the different panels correspond to counts of SRA runs. (Left) Results of a profile hidden Markov model (pHMM) based sequence homology search in the raw read data (Virushunter). Significant hits (at least one sequencing read with E-value < 1x10-4) against one of three nidovirus pHMMs (see Materials and Methods for details) are shown if the corresponding sequences did not give better hits against other RNA viruses or against host sequences. Hits are grouped by order of the putative vertebrate host according to the annotation of the sequencing projects. Note that a detected sequence may not necessarily be from a member of the order Nidovirales but might also be from a virus of a related taxon for which no reference sequence was available by the time of analysis. (Right) Remaining hits after targeted viral genome assembly (Virusgatherer). Only contigs of at least 1000 nt in length were considered, and those with significant hits (covering at least 500 nt with E-value < 1x10-4) against nidoviruses were kept. Bars are colored according to four major groups of the putative hosts (see common legend at the bottom-right). https://doi.org/10.1371/journal.ppat.1012163.g002 Focusing our subsequent analyses on novel vertebrate nidoviruses related to members of the families Coronaviridae, Tobaniviridae, Arteriviridae, Cremegaviridae, Gresnaviridae, Olifoviridae, Nanhypoviridae and Nanghoshaviridae, we recognized novel vertebrate nidoviruses as belonging to a single population entity (putative virus species) if the respective genomes showed >90% nucleotide sequence identity. To account for the observation that a particular virus species could be identified in multiple SRA datasets, we merged the respective sequences to reduce the associated sequence redundancy and generated variant calling files for these cases (see Materials and Methods for details). If a viral contig clustered with a known reference virus under the operational virus species threshold it was assigned to be a variant of that species, otherwise it was designated to be the prototype of a novel species. This strategy resulted in the delineation of 40 tentative species of nidoviruses of which 21 were novel. The viral sequences were derived from various organs/tissues of putative vertebrate hosts (S1 Table). The 19 known nidoviruses putatively infecting vertebrates were independently discovered in another study using a different computational data mining pipeline [53], further supporting the validity of our approach and findings. When assessing viral genome assembly quality, we found that the MICO values of the novel vertebrate nidoviruses discovered in this study are in the same range as those of the reference set (Fig 1B), although slightly smaller on average (Wilcoxon rank-sum test, P = 0.020). The latter is expected as the reference sequences were from dedicated virus discovery projects favoring virus-enriched samples while our nidovirus sequences were not. At the level of an individual genome sequence, we observed almost the entire spectrum of MICO values for our novel nidoviruses (S1 Fig). Low MICO values concerned several partial fish coronavirus genomes with missing sequence. We estimated position and length of the missing pieces via comparison with the genome sequence of the most closely related virus known so far. We obtained reasonable to good coverage depth for the remaining viral contigs without internal missing sequence (S2 and S3 Figs). We again observed slightly lower MEAS values for the novel nidovirus contigs compared to the sequences of the reference set (Wilcoxon rank-sum test, P = 0.004) and a large spectrum of MEAS values (Figs 1B and S1).

Novel corona- and tobaniviruses Having described our computational approach and its accuracy, in the following we report the nidoviruses and their genomic properties that we discovered in our analysis. According to a RdRp+ZBD+HEL1-based phylogeny reconstruction and a genetics-based classification analysis by DEmARC [58], the 40 discovered putative virus species (species-like operational taxonomic units, sOTUs) can be grouped into 6 family-like operational taxonomic units (fOTUs), 13 subfamily-like OTUs (sfOTUs) and 32 genus-like OTUs (gOTUs) of which 5 sfOTUs and 18 gOTUs are novel (Fig 3A and S2 Table). All viruses with complete or nearly complete genome sequences encode a conserved array of nidovirus enzymes, 3CLpro-NiRAN-RdRp-ZBD-HEL1 and, additionally, a conserved O-Methyltransferase (OMT) domain was detected at the expected C-terminal position of ORF1b or its equivalent in viruses with large genomes (Figs 4 and S4) [59]. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 3. Maximum likelihood phylogenies of non-structural and structural proteins and tanglegram of vertebrate nidoviruses. The trees are based on protein alignments from which poorly conserved positions were manually removed. The phylogenies of non-structural proteins involving Coronaviridae and Tobaniviridae members (top) and Nangoshaviridae, Nanhypoviridae, Gresnaviridae, Olifoviridae and Arteriviridae members (bottom) are based on a concatenated alignment of RdRp, ZBD and HEL1 (A). The S protein phylogenies involving Coronaviridae and Tobaniviridae members (B) are based on conserved regions of the S2 part of the spike protein in coronaviruses or the homologous part in tobaniviruses. Two separate trees for A and three separate trees for B were constructed (see Materials and Methods for details). The branch lengths are in units of aa substitutions per site; scale bars are shown. White and black circles at internal nodes indicate branching support. Tips corresponding to reference viruses are shown as gray circles and those constituting lineages rediscovered or newly discovered from SRA data as blue and red circles, respectively. Family-like, subfamily-like and genus-like OTUs derived from a genetics-based classification using DEmARC are shown using dark gray, light gray and white rectangles, respectively; known or predicted host types are indicated by colored diamonds next to the virus names; viruses with bisegmented genomes, inferred recombinant S2 and those expressing a putative glycosidase domain are highlighted by colored squares (see legend at the bottom-right). Possible additional recombinant S2 cases are discussed in the text. https://doi.org/10.1371/journal.ppat.1012163.g003 PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 4. Genomic layout of novel coronaviruses and five reference viruses. Viruses that don’t start with an accession number in their name are discovered in this study. Predicted open reading frames (ORFs) of at least 300 nucleotides in length are shown as white rectangles; ORFs are defined to start and end at a stop codon. Protein domains predicted via profile HMM are indicated in color; transmembrane helix (TMh), macrodomain (Macro), 3C-like protease (3CLpro), RNA-dependent RNA polymerase (RdRp), RdRp-associated nucleotidyltransferase (NiRAN), zinc-binding domain (ZBD), superfamily 1 helicase (HEL1), O-methyltransferase (OMT), lamina-associated polypeptide 1C-like protein (LAP1C), family 18 glycosidase (GH18), spike/glycoprotein (S/gp), matrix protein (M), nucleocapsid protein (N), US22 protein (US22). Domain borders are drawn according to the corresponding profile search hit and the actual domains may extend beyond these borders. Black bars above a genome indicate missing sequence. https://doi.org/10.1371/journal.ppat.1012163.g004 Thirteen of the discovered viruses clustered with known viruses within the family Coronaviridae: 12 detected in experiments with fish and one in the axolotl amphibian (Ambystoma mexicanum). These viruses comprise twelve new tentative genera outside the two established subfamilies Orthocoronavirinae and Letovirinae, according to our genetics-based classification. One of the new fish coronaviruses from Hypomesus transpacificus (HTCV) had a genome size of over 36 kb, making it, to the best of our knowledge, the largest RNA virus with monopartite genome infecting vertebrates (Fig 4). The genome seems to be coding-complete and ends with a poly(A) tail indicative of the genuine 3’-terminal end. The 36kb fish virus genome encodes two consecutive ORFs immediately downstream of ORF1b that both show significant sequence similarity to coronavirus spike proteins (Fig 4) and which share around 21% local protein sequence identity when compared to each other. Eleven of the discovered coronaviruses have bisegmented genomes in which the first segment has coding regions for ORF1a and ORF1b, but no other ORFs, while the second segment encodes the structural proteins spike, matrix, and nucleocapsid as well as accessory proteins; they form an RdRp+ZBD+HEL1-based monophyletic cluster comprising subfamily Pitovirinae (Fig 4). Genome bisegmentation in these viruses was supported by the presence of a poly(A) tail at the 3’-end of many of the analyzed segments (Fig 4; see also below). Additionally, we did not identify any sequencing read pairs for which one mate aligns to the 3’-end of segment 1 and the other to the 5’-end of segment 2 in the order that would be expected if bipartition of the genome sequence was fortuitously due to sequencing or assembly artifacts (see Materials and Methods for details). The phylogenetic cluster of these bisegmented viruses includes an already described coronavirus (Pacific salmon nidovirus) from a fish species (Oncorhynchus tshawytscha) [47]. It was originally annotated as monopartite in GenBank (accession MK611985), but does employ a bisegmented genome as well, according to our analysis. This viral genome shows a unique insertion of a macrodomain locus in ORF1b between endonuclease (EndoU) and OMT that reside in nsp15 and nsp16 in experimentally characterized coronaviruses (Fig 4). We identified signal peptide cleavage sites for all in silico translated spike protein sequences of the unsegmented and bisegmented coronavirus genomes, with predicted signal peptide sizes in the range of 15 to 41 amino acids. Strikingly, for each coronavirus with bisegmented genome we observed an excess of sequencing reads mapping to segment 2 relative to segment 1 (S2 Fig), suggesting a much higher abundance of segments 2 compared to segment 1 with a fold ratio of 8.2 on average (range of 1.3 to 19.2). This difference was statistically significant (Wilcoxon signed-rank test, P = 0.008, n = 8) and may be rationalized as a means to regulate the expression of proteins encoded on segment 2 relative to those on segment 1, similar to the expression via subgenomic RNAs in the monopartite coronaviruses. Twelve of the nidovirus sOTUs identified in our screen grouped with members of the family Tobaniviridae (Figs 3A and S4). Ten of them were found in bony fish sequencing projects, one in a hagfish and a snake sample, respectively. Interestingly, in one of the samples (from Periophthalmodon schlosseri) from which we retrieved one of the fish coronaviruses with bisegmented genomes, we also found one of the novel tobani-like viruses, suggesting co-infection of that individual fish with two divergent nidoviruses at the time of sampling.

Independent validation of bisegmented coronavirus genomes To independently validate the existence of fish coronaviruses with bisegmented genomes we sequenced and analyzed 202 tissue samples of wild caught individuals of Syngnathus typhle. We identified genomic sequences of Syngnathus typhle coronavirus 1 (StyCoV-1) and Syngnathus typhle coronavirus 2 (StyCoV-2), originally discovered in our SRA screen, in, respectively, 22 and 47 of the newly generated sequencing datasets. None of the experiments was positive for both viruses. In all cases, the non-structural and structural proteins were encoded on separate contigs of segment 1 and 2, respectively, supporting genome bisegmentation in these viruses. For both genome segments of each of the two Syngnathus typhle coronaviruses we confirmed the presence of poly(A) tails by 3’RACE PCR and Sanger sequencing (Figs 5, S5A and S5B). Moreover, we performed a nested PCR that was designed to bridge the two segments of StyCoV-2 in the orientation that would be expected for an unsegmented coronavirus with canonical ORF1a-ORF1b-3’ORFs genomic organization (‘overgap’ PCR) using forward primers binding in proximity to the 3’ end of segment 1 and reverse primers binding in proximity to the putative 5’ end of segment 2. While the corresponding segment-specific control PCRs yielded amplification products of the expected sizes, we did not obtain an amplification product for the ‘overgap’ PCR, providing additional support for genome bisegmentation (S5C–S5F Fig). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 5. Molecular validation of the 3’-termini of both segments of two bisegmented fish coronaviruses. For each segment, a multiple nucleotide sequence alignment of the 3’-ends of the SRA-based contig (Original contig), selected additional strains from different fish specimens and the product of the 3’RACE PCR (red label) is shown. The corresponding Sanger sequencing chromatogram for the 3’RACE PCR is shown below each sequence alignment. https://doi.org/10.1371/journal.ppat.1012163.g005

Emerging genetic markers of bisegmented coronaviruses In our further dissection of the putative proteome of the newly discovered viruses, we found that the coronaviruses with bisegmented genomes, but none of the other discovered viruses, encode a lamina-associated polypeptide 1C (LAP1C)-like protein. The coronavirus LAP1C-like domain is likely part of a large multidomain protein (protein size from 788 to 2023 amino acids, depending on virus) that otherwise remains unannotated. It is encoded by an apomorphic 5’-proximal ORF on segment 2 upstream of the ORF coding for the spike protein. Another intriguing observation was that segment 2 of the novel Ambystoma mexicanum coronavirus (AmCoV) from axolotl encodes an additional ORF (between the S and M ORFs) that showed significant sequence similarity with the US22 protein family (HHpred Prob = 94.7%). US22 is known to counter antiviral defense in herpesviruses [60]. Moreover, we found that sequence stretches at or near the 5’-ends of the two AmCoV segments and, similarly, at the 3’-ends of the two AmCoV segments show striking sequence similarity: 97.4% sequence identity across a stretch of 272 nt (positions 431–701 of segment 1 vs. positions 1–272 of segment 2) and 99.8% across 821 nt (positions 19446–20265 of segment 1 vs. positions 12489–13309 of segment 2), respectively. Together, these findings suggest that the two AmCoV segments are packed together in the same virion.

Dissecting incongruences of corona- and tobanivirus trees for non-structural and structural proteins Sequence similarities between corona- and tobaniviruses in the S protein are limited to a conserved region in S2. Based on analysis of this similarity in our dataset (see Materials and Methods for details), we delineated three major lineages: one formed by alpha-, beta-, gamma-, deltacoronaviruses, three fish coronaviruses and a tobanivirus (denoted S2G1), a second one uniting 10 tobaniviruses and 15 coronaviruses, the latter including those with bisegmented genomes (S2G2), and a third group formed by members of the subfamilies Serpentovirinae and Remotovirinae of the family Tobaniviridae (S2G3) (Fig 3B). The phylogenetic relationship between these three groups is left unexplored due to limited intergroup sequence similarity of S proteins. We observed complete congruence between the S2 and RdRp+ZBD+HEL1 trees only for viruses of S2G3 (Fig 3), which indicates that no recombination involving the analyzed proteins may have contributed to the divergence of this group of viruses. In contrast, we noticed multiple incongruencies between viruses of S2G1 and especially S2G2 at the subfamily or family level. Thus, we treated S2G3 viruses as a de facto negative control for further analysis of recombination of other viruses. To clarify a possible contribution of recombination to the observed incongruencies, we analyzed the pairwise evolutionary distance (PED) of each corona-and tobanivirus to its most similar virus in the S2 trees relative to the PED of the same virus pair in the RdRp+ZBD+HEL1 tree via PED ratios, denoted PEDr1 (S3 Table and M&M). We reasoned that upon a model of continuous evolution by substitution and due to stronger conservation of RdRp+ZBD+HEL1 compared to S proteins, the former must have accumulated relatively fewer mutations upon divergence of a virus pair from a common ancestor (PEDr1 < 1.0). Indeed, this expectation is satisfied for the S2G3 viruses representing a negative control (PEDr1 in the range of 0.45 to 0.70). However, if a pair of analyzed viruses has diverged more profoundly in the RdRp+ZBD+HEL1 region than in S2 (PEDr1 > 1.0), this is most compatible with S2 being acquired through recombination. To identify which virus of a pair or both viruses may have an S2 of recombinant origin, the PED ratio analysis was extended for each virus to the second most similar virus in the S2 tree, denoted PEDr2 and interpreted in the same way as PEDr1 (S3 Table and M&M). Altogether, we identified five viruses that show considerably larger PED for the RdRp+ZBD+HEL1 region compared to S2 (PEDr1 and PEDr2 in the range of 2.26 to 4.26 and 2.28 to 3.36, respectively; S3 Table and Fig 3). These viruses included Misgurnus anguillicaudatus tobanivirus and four viruses of different (putative) subfamilies of the family Coronaviridae (HTCV, Silurus sp. Nidovirus, Microhyla letovirus and Ambystoma mexicanum coronavirus). Three and two of these viruses have, respectively, fishes and amphibians as (predicted) host and all of them cluster in the S phylogeny to other corona- or tobaniviruses which also were isolated from aquatic hosts (30 in total) but not from terrestrial hosts (14 in total) (Fig 3). The viruses with inferred recombinant S2 include both single- and bi-segmented coronaviruses. The S2 of the four identified coronaviruses cluster with that of tobaniviruses of two different S2G2 subgroups whose RdRp+ZBD+HEL1 and S2 trees are congruent (Fig 3). We detail these viruses in the three paragraphs below. For two of these viruses, HTCV and Silurus sp. nidovirus, the most similar tobaniviruses are Eptatretus burgeri tobanivirus and Fathead minnow nidovirus, respectively; their PEDr1 values are larger 2.0 as well. Three other tobaniviruses separate these two pairs in the S2G2 subgroup, and the analysis of PEDr2 values identified HTCV and Silurus sp. nidovirus but not Eptatretus burgeri tobanivirus and Fathead minnow nidovirus as having S2 of plausible recombinant origin (S3 Table). For HTCV, this involved one of the two spike-like proteins that is encoded by ORF3 (HTCV p3). In contrast, another spike-like protein of HTCV is encoded by ORF2 (HTCV p2) immediately downstream of ORF1b and its S2G1 lineage position is congruent with the position of HTCV in the RdRp+ZBD+HEL1 tree (Fig 3). Microhyla letovirus and Ambystoma mexicanum coronavirus belong to yet another S2G2-based subgroup that includes five other viruses with congruent phylogenies (Fig 3). Despite being sister to each other in S2G2 tree, Microhyla letovirus and Ambystoma mexicanum coronavirus have Tachysurus fulvidraco nidovirus and Tachysurus vachellii nidovirus as the most and second-most similar viruses; these viruses were used in PEDr1 and PEDr2 analyses, respectively, to support a recombinant S2 origin in Microhyla letovirus and Ambystoma mexicanum coronavirus (S3 Table). Misgurnus anguillicaudatus tobanivirus has an S2 that belongs to the S2G1 subgroup, which otherwise includes homologs encoded by coronaviruses (Fig 3). The S2 of this virus is most similar to that of Monopterus albus nidovirus, while Neolamprologus caudopunctatus coronavirus is the second most similar virus to the pair. Only Misgurnus anguillicaudatus tobanivirus but not Monopterus albus nidovirus showed an anomalously high RdRp+ZBD+HEL1 divergence relative to S2, according to both PEDr1 and PEDr2 analyses (4.09 and 2.28 versus 4.09 and 0.78 in S3 Table).

S2 is associated with a chitinase-like domain in a subset of S2G2 During comparative sequence analysis of S proteins, we also found that six of the coronaviruses and five of the 11 tobaniviruses, which form one of three monophyletic subclusters of S2G2, encode a divergent chitinase-like domain (HHpred Prob = 100%, sequence identity = 16%) in the N-terminal part of the S ORF upstream of the region homologous to the coronavirus S2 (Figs 4 and S4). We did not detect a chitinase-like domain in the other tobani- and coronaviruses (Fig 3B), indicating a lineage-specific linkage of this protein domain. The viral chitinase-like domain belongs to glycosidase family 18 (GH18) and forms a separate monophyletic lineage in the GH18 phylogeny (S6 Fig). GH18 was found associated with S2 of only two of five vertebrate nidoviruses that may have acquired S2 through recombination (Fig 3).

[END]
---
[1] Url: https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012163

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/