(C) PLOS One

(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .

OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees [1]

['Jacob L. Steenwyk', 'Vanderbilt University', 'Department Of Biological Sciences', 'Nashville', 'Tennessee', 'United States Of America', 'Vanderbilt Evolutionary Studies Initiative', 'Dayna C. Goltz', 'Independent Researcher', 'Thomas J. Buida Iii']

Date: 2022-10

Abstract Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species—a phenomenon observed among several important families of genes such as transporters and transcription factors—are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.

Citation: Steenwyk JL, Goltz DC, Buida TJ III, Li Y, Shen X-X, Rokas A (2022) OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees. PLoS Biol 20(10): e3001827. https://doi.org/10.1371/journal.pbio.3001827 Academic Editor: Andreas Hejnol, University of Bergen, NORWAY Received: November 4, 2021; Accepted: September 13, 2022; Published: October 13, 2022 Copyright: © 2022 Steenwyk et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All results and data presented in this study are available from figshare (doi: 10.6084/m9.figshare.16875904). Funding: J.L.S. and A.R. were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. Research in A.R.’s lab is supported by grants from the National Science Foundation (DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R56 AI146096 and R01 AI153356), and the Burroughs Wellcome Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: Antonis Rokas is a scientific consultant for LifeMine Therapeutics, Inc. Jacob L. Steenwyk is a scientific consultant for Latch AI Inc.

Introduction Molecular evolution studies, such as species tree inference, genome-wide surveys of selection, evolutionary rate estimation, measures of gene–gene coevolution, and others typically rely on single-copy orthologs (SC-OGs), a group of homologous genes that originated via speciation and are present in single copy among species of interest [1–6]. In contrast, paralogs—homologous genes that originated via duplication and are often members of large gene families—are typically absent from these studies (Fig 1). Gene families of orthologs and paralogs often encode functionally significant proteins such as transcription factors, transporters, and olfactory receptors [7–10]. The exclusion of SC-OGs from gene families has not only hindered our understanding of their evolution and phylogenetic informativeness but is also artificially reducing the number of gene markers available for molecular evolution studies. Furthermore, as the number of species and/or their evolutionary divergence increases in a dataset, the number of SC-OGs decreases [11,12]; case in point, no SC-OGs were identified in a dataset of 42 plants [11]. As the number of available genomes across the tree of life continues to increase, our ability to identify SC-OGs present in many taxa will become more challenging. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 1. Cartoon depiction of 3 classes of paralogs: outparalogs, inparalogs, and coorthologs. (A) Paralogs refer to related genes that have originated via gene duplication, such as genes M, N, and O. (B) Outparalogs and inparalogs refer to paralogs that are related to one another via a duplication event that took place prior to or after a speciation event, respectively. With respect to the speciation event that led to the split of taxa A, B, and C from D, genes M, N, and O are outparalogs because they arose prior to the speciation event; genes O1 and O2 in taxa A, B, and C are inparalogs because they arose after the speciation event. Species-specific inparalogs are paralogous genes observed only in 1 species, strain, or organism in a dataset, such as gene N1 and N2 in species A. Species-specific inparalogs N1 and N2 in species A are also coorthologs of gene N in taxa B, C, and D; the same is true for inparalogs O1 and O2 from species A, which are coorthologs of gene O from species D. (C) Cartoon depiction of SNAP-OGs identified by OrthoSNAP. https://doi.org/10.1371/journal.pbio.3001827.g001 In light of these issues, several methods have been developed to account for paralogs in specific types of molecular evolution studies—for example, in species tree reconstruction [13]. Methods such as SpeciesRax, STAG, ASTRAL-PRO, and DISCO can be used to infer a species tree from a set of SC-OGs and gene families composed of orthologs and paralogs [11,14–16]. Other methods such as PHYLDOG [17] and guenomu [18] jointly infer the species and gene trees but require abundant computational resources, which has hindered their use for large datasets. Other software, such as PhyloTreePruner, can conduct species-specific inparalog trimming [19]. Agalma, as part of a larger automated phylogenomic workflow, can prune gene trees into maximally inclusive subtrees wherein each species, strain, or organism is represented by 1 sequence [20]. Similarly, OMA identifies subgroups of SC-OGs using graph-based clustering of sequence similarity scores [21]. Although these methods have expanded the numbers of gene markers used in species tree reconstruction, they were not designed to facilitate the retrieval of as broad a set of SC-OGs as possible for downstream molecular evolution studies such as surveys of selection. Furthermore, the phylogenetic information content of these gene families remains unknown, calling into question their usefulness. To address this need and measure the information content of subgroups of single-copy orthologous genes, we developed OrthoSNAP, a novel algorithm that identifies SC-OGs nested within larger gene families via tree decomposition and species-specific inparalog pruning. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they were retrieved using a splitting and pruning procedure. The efficacy of OrthoSNAP and the information content of SNAP-OGs was examined across 7 eukaryotic datasets, which include species with complex evolutionary histories (e.g., whole-genome duplication) or complex gene sequence data (e.g., transcriptomes, which typically have multiple transcripts per protein-coding gene). These analyses revealed OrthoSNAP can substantially increase the number of orthologs for downstream analyses such as phylogenomics and surveys of selection. Furthermore, we found that the information content of SNAP-OGs was statistically indistinguishable from that of SC-OGs suggesting the inclusion of SNAP-OGs in downstream analyses is likely to be as informative. These analyses indicate that SNAP-OGs identified by OrthoSNAP hold promise for increasing the number of markers used in molecular evolution studies, which can, in turn, be used for constructing and interpreting the tree of life.

Discussion Molecular evolution studies typically rely on SC-OGs. Recently, developed methods can integrate gene families of orthologs and paralogs into species tree inference but are not designed to broadly facilitate the retrieval of gene markers for molecular evolution analyses. Furthermore, the phylogenetic information content of gene families of orthologs and paralogs remains unknown. This observation underscores the need for algorithms that can identify SC-OGs nested within larger gene families, which can, in turn, be incorporated into diverse molecular evolution analyses, and a comprehensive assessment of their phylogenetic properties. To address this need, we developed OrthoSNAP, a tree splitting and pruning algorithm that identifies SNAP-OGs, which refers to SC-OGs nested within larger gene families wherein species-specific inparalogs have also been pruned. Comprehensive examination of the phylogenetic information content of SNAP-OGs and SC-OGs from 7 empirical datasets of diverse eukaryotic species revealed that their content is similar. Inclusion of SNAP-OGs increased the size of all 7 datasets, sometimes substantially. We note that our results are qualitatively similar to those reported recently by Smith and colleagues [37], which retrieved SC-OGs nested within larger families from 26 primates and examined their performance in gene tree and species tree inference. Three noteworthy differences are that we also conduct species-specific inparalog trimming, provide a user-friendly command-line software for SNAP-OG identification, and evaluated the phylogenetic information content of SNAP-OGs and SC-OGs across 7 diverse phylogenomic datasets. We also note that our algorithm can account for diverse types of paralogy—outparalogs, inparalogs, and species-specific inparalogs—whereas other software like PhyloTreePruner, which only conducts species-specific inparalog trimming [19], and Agalma, which identifies single-copy outparalogs and inparalogs [20], can account for some, but not all, types of paralogs (S10 Table). Another difference between OrthoSNAP and other approaches is that Agalma and PhyloTreePruner both require rooted phylogenies. In contrast, OrthoSNAP will automatically midpoint root phylogenies or accept prerooted phylogenies as input. Furthermore, these algorithms are not designed to handle transcriptomic data wherein multiple transcripts per gene will be interpreted as multicopy orthologs. Thus, OrthoSNAP allows for greater user flexibility and accounts for more diverse scenarios, leading to, at least in some instances, the identification of more loci for downstream analyses (S8 Fig). Notably, these software are also different from sequence similarity graph-based inferences of subgroups of single-copy orthologous genes—such as the algorithm implemented in OMA [21]. In other words, OrthoSNAP identifies subgroups of single-copy orthologous genes by examining evolutionary histories, rather than sequence similarity values. Moreover, examination of evolutionary histories facilitates the identification of species-specific inparalogs. Finally, our results, together with other studies, demonstrate the utility of SC-OGs that are nested within larger families [15,20,37,38]. Despite the ability of OrthoSNAP to identify additional loci for molecular evolution analyses, there were instances wherein SNAP-OGs were not identified in multicopy orthologous groups of genes. We discuss 3 reasons that contribute to why SNAP-OGs could not be identified among some genes—specifically, gene families with sequence data from <50% of the taxa; gene families with complex evolutionary histories (for example, HGT and duplication/loss patterns); and gene families with evolutionary histories that differ from the species tree (for example, due to analytical factors, such as sampling and systematic error, or biological factors, such as lineage sorting or introgression/hybridization [39–41]). Notably, the first reason can, but does not always, result in inability to infer SNAP-OGs and can be, to a certain extent, addressed (e.g., by lowering the orthogroup occupancy threshold in OrthoSNAP), whereas the other 2 reasons are more challenging because they often result in a genuine absence of SC-OGs. Furthermore, the actual number of SC-OGs (either those nested within multicopy orthologs or not) for any given group of organisms is not known, making it difficult to determine how many SNAP-OGs and SC-OGs one should expect to recover. Notably, this issue has long challenged researchers, even when ortholog identification is performed by also taking genome synteny into account [27]. Next, we discuss some practical considerations when using OrthoSNAP. In the present study, we inferred orthology information using OrthoFinder [42], but several other approaches can be used upstream of OrthoSNAP. For example, other graph-based algorithms such as OrthoMCL and OMA [21,43] or sequence similarity-based algorithms such as orthofisher [44] can be used to infer gene families. Similarly, sequence similarity search algorithms like BLAST+ [45], USEARCH [46], and HMMER [47] can be used to retrieve homologous sets of sequences that are used as input for OrthoSNAP. Other considerations should also be taken during the multicopy tree inference step. For example, inferring phylogenies for all orthologous groups of genes may be a computationally expensive task. Rapid tree inference software—such as FastTree or IQTREE with the “-fast” parameter [48,49]—may expedite these steps (but users should be aware that this may result in a loss of accuracy in inference; [50]). We suggest employing “best practices” when inferring groups of putatively orthologous genes, including SNAP-OGs. Specifically, orthology information can be further scrutinized using phylogenetic methods. Orthology inference errors may occur upstream of OrthoSNAP; for example, SNAP-OGs may be susceptible to erroneous inference of orthology during upstream clustering of putatively orthologous genes. One method to identify putatively spurious orthology inference is by identifying long terminal branches [51]. Terminal branches of outlier length can be identified using the “spurious_sequence” function in PhyKIT [52]. Other tools, such as PhyloFisher, UPhO, and other orthology inference pipelines employ similar strategies to refine orthology inference [53–55]. Lastly, we acknowledge that future iterations of OrthoSNAP may benefit from incorporating additional layers of information, such as sequence similarity scores or synteny. Even though OrthoSNAP did identify SNAP-OGs in some complex datasets where synteny has previously been very helpful, such as the budding yeast dataset, other ancient and rapidly evolving lineages may benefit from synteny analysis to dissect complex relationships of orthology [51,56–58]. Taken together, we suggest that OrthoSNAP is useful for retrieving single-copy orthologous groups of genes from gene family data and that the identified SNAP-OGs have similar phylogenetic information content compared to SC-OGs. In combination with other phylogenomic toolkits, OrthoSNAP may be helpful for reconstructing the tree of life and expanding our understanding of the tempo and mode of evolution therein.

Methods OrthoSNAP availability and documentation OrthoSNAP is available under the MIT license from GitHub (https://github.com/JLSteenwyk/orthosnap), PyPi (https://pypi.org/project/orthosnap), and the Anaconda cloud (https://anaconda.org/JLSteenwyk/orthosnap). OrthoSNAP is also freely available to use via the LatchBio (https://latch.bio/) cloud-based console (dedicated interface link: https://console.latch.bio/explore/65606/info). Documentation describes the OrthoSNAP algorithm, parameters, and provides user tutorials (https://jlsteenwyk.com/orthosnap). OrthoSNAP algorithm description and usage We next describe how OrthoSNAP identifies SNAP-OGs. OrthoSNAP requires 2 files as input: one is a FASTA file that contains 2 or more homologous sequences in 1 or more species and the other the corresponding gene family phylogeny in Newick format. In both the FASTA and Newick files, users must follow a naming scheme—wherein species, strain, or organism identifiers and gene sequences identifiers are separated by a vertical bar (also known as a pipe character or “|”)—which allows OrthoSNAP to determine which sequences were encoded in the genome of each species, strain, or organism. After initiating OrthoSNAP, the gene family phylogeny is first midpoint rooted (unless the user specifies the inputted phylogeny is already rooted) and then SNAP-OGs are identified using a tree-traversal algorithm. To do so, OrthoSNAP will loop through the internal branches in the gene family phylogeny and evaluate the number of distinct taxa identifiers among children terminal branches. If the number of unique taxon identifiers is greater than or equal to the orthogroup occupancy threshold (default: 50% of total taxa in the inputted phylogeny; users can specify an integer threshold), then all children branches and termini are examined further; otherwise, OrthoSNAP will examine the next internal branch. Next, OrthoSNAP will collapse branches with low support (default: 80, which is motivated by using ultrafast bootstrap approximations [59] to evaluate bipartition support; users can specify an integer threshold) and conduct species-specific inparalog trimming wherein the longest sequence is maintained, a practice common in transcriptomics. However, users can specify whether the shortest sequence or the median sequence (in the case of 3 or more sequences) should be kept instead. Users can also pick which species-specific inparalog to keep based on branch lengths (the longest, shortest, or median branch length in the case of having 3 or more sequences). Species-specific inparalogs are defined as sequences encoded in the same genome that are sister to one another or belong to the same polytomy [19]. The resulting set of sequences is examined to determine if 1 species, strain, or organism is represented by 1 sequence and ensure these sequences have not yet been assigned to a SNAP-OG. If so, they are considered a SNAP-OG; if not, OrthoSNAP will examine the next internal branch. When SNAP-OGs are identified, FASTA files of SNAP-OG sequences are outputted. Users can also output the subtree of the SNAP-OG using an additional argument. The principles of the OrthoSNAP algorithm are also described using the following pseudocode: FOR internal branch in midpoint rooted gene family phylogeny: > IF orthogroup occupancy among children termini is greater than or equal to orthogroup occupancy threshold; >> Collapse poorly supported bipartitions and trim species-specific inparalogs; >> IF each species, strain, or organism among the trimmed set of species, strains, or organisms is represented by only one sequence and no sequences being examined have been assigned to a SNAP-OG yet; >>> Sequences represent a SNAP-OG and are outputted to a FASTA file >> ELSE >>> examine next internal branch > ELSE >> examine next internal branch ENDFOR To enhance the user experience, arguments or default values are printed to the standard output, a progress bar informs the user of how of the analysis has been completed, and the number of SNAP-OGs identified as well as the names of the outputted FASTA files are printed to the standard output. Development practices and design principles to ensure long-term software stability Archival instabilities among software threatens the reproducibility of bioinformatics research [60]. To ensure long-term stability of OrthoSNAP, we implemented previously established rigorous development practices and design principles [44,52,61,62]. For example, OrthoSNAP features a refactored codebase, which facilitates debugging, testing, and future development. We also implemented a continuous integration pipeline to automatically build, package, and install OrthoSNAP across Python versions 3.7, 3.8, and 3.9. The continuous integration pipeline also conducts 57 unit and integration tests, which span 95.90% of the codebase and ensure faithful function of OrthoSNAP. Dataset generation To generate a dataset for identifying SNAP-OGs and comparing them to SC-OGs, we first identified putative groups of orthologous genes across 4 empirical datasets. To do so, we first downloaded proteomes for each dataset, which were obtained from publicly available repositories on NCBI (S1 and S7 Tables) or figshare [51]. Each dataset varied in its sampling of sequence diversity and in the evolutionary divergence of the sampled taxa. The dataset of 24 budding yeasts spans approximately 275 million years of evolution [51]; the dataset of 36 filamentous fungi spans approximately 94 million years of evolution [63]; the dataset of 26 mammals spans approximately 160 million years of evolution [64]; and the dataset of 28 eutherian mammals—which was used to study the contentious deep evolutionary relationships among eutherian mammals—concerns an ancient divergence that occurred approximately 160 million years ago [65]. Putatively orthologous groups of genes were identified using OrthoFinder, v2.3.8 [42], with default parameters, which resulted in 46,645 orthologous groups of genes with at least 50% orthogroup occupancy (S8 Table). To infer the evolutionary history of each orthologous group of genes, we first individually aligned and trimmed each group of sequences using MAFFT, v7.402 [66], with the “auto” parameter and ClipKIT, v1.1.3 [61], with the “smart-gap” parameter, respectively. Thereafter, we inferred the best-fitting substitution model using Bayesian information criterion and evolutionary history of each orthologous group of genes using IQ-TREE2, v2.0.6 [49]. Bipartition support was examined using 1,000 ultrafast bootstrap approximations [59]. To identify SNAP-OGs, the FASTA file and associated phylogenetic tree for each gene family with multiple homologs in 1 or more species was used as input for OrthoSNAP, v0.0.1 (this study). Across 40,011 gene families with multiple homologs in 1 or more species in all datasets, we identified 6,630 SNAP-OGs with at least 50% orthogroup occupancy (S1 Fig and S8 Table). Unaligned sequences of SNAP-OGs were then individually aligned and trimmed using the same strategy as described above. To determine gene families that were SC-OGs, we identified orthologous groups of genes with at least 50% orthogroup occupancy and each species, strain, or organism was represented by only 1 sequence—6,634 orthologous groups of genes were SC-OGs. Measuring and comparing information content among SC-OGs and SNAP-OGs To compare the information content of SC-OGs and SNAP-OGs, we calculated 9 properties of multiple sequence alignments and phylogenetic trees associated with robust phylogenetic signal in the budding yeasts, filamentous fungi, and mammalian datasets (S4 Table). More specifically, we calculated information content from phylogenetic trees such as measures of tree certainty (average bootstrap support), accuracy (Robinson–Foulds distance; [67]), signal-to-noise ratios (treeness; [68]), and violation of clock-like evolution (degree of violation of a molecular clock or DVMC; [69]). Information content was also measured among multiple sequence alignments by examining alignment length and the number of parsimony-informative sites, which are associated with robust and accurate inferences of evolutionary histories [70] as well as biases in sequence composition (RCV; [68]). Lastly, information content was also evaluated using metrics that consider characteristics of phylogenetic trees and multiple sequence alignments such as the degree of saturation, which refers to multiple substitutions in multiple sequence alignments that underestimate the distance between 2 taxa [71], and treeness/RCV, a measure of signal-to-noise ratios in phylogenetic trees and sequence composition biases [68]. For tree accuracy, phylogenetic trees were compared to species trees reported in previous studies [51,63,64]. All properties were calculated using functions in PhyKIT, v1.1.2 [52]. The function used to calculate each metric and additional information are described in S4 Table. Principal component analysis across the 9 properties that summarize phylogenetic information content was used to qualitatively compare SC-OGs and SNAP-OGs in reduced dimensional space. Principal component analysis, visualization, and determination of property contribution to each principal component was conducted using factoextra, v1.0.7 [72], and FactoMineR, v2.4 [73], in the R, v4.0.2 (https://cran.r-project.org/), programming environment. Statistical analysis using a multifactor ANOVA was used to quantitatively compare SC-OGs and SNAP-OGs using the res.aov() function in R. Information theory-based approaches were used to evaluate incongruence among SC-OGs and SNAP-OGs phylogenetic trees. More specifically, we calculated tree certainty and tree certainty-all [74–76], which are conceptually similar to entropy values and are derived from examining support among a set of gene trees and the 2 most supported topologies or all topologies that occur with a frequency of ≥5%, respectively. More simply, tree certainty values range from 0 to 1 in which low values are indicative of low congruence among gene trees and high values are indicative of high congruence among gene trees. Tree certainty and tree certainty-all values were calculated using RAxML, v8.2.10 [77]. To examine patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals, we calculated gene support frequencies and ΔGLS. Gene support frequencies were calculated using the “polytomy_test” function in PhyKIT, v1.1.2 [52]. To account for uncertainty in gene tree topology, we also examined patterns of gene support frequencies after collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 using the “collapse” function in PhyKIT. To calculate gene-wise log likelihood values, partition log-likelihoods were calculated using the “wpl” parameter in IQ-TREE2 [49], which required as input a phylogeny in Newick format that represented either hypothesis 1, 2, or 3 (Fig 4A) and a concatenated alignment of SC-OGs and SNAP-OGs with partition information. Thereafter, the log likelihood values were used to assign genes to the topology they best supported. Inconclusive genes, defined as having a gene-wise log likelihood difference of less than 0.01, were removed. The same methodologies—orthology inference, multiple-sequence alignment, trimming, tree inference, SNAP-OG identification, and phylogenetic information content calculations—were also applied to 3 additional datasets that represent complex datasets. Specifically, 30 plants (with a history of extensive gene duplication and loss events), 30 budding yeast species (15 of which experienced whole-genome duplication), and 20 choanoflagellate transcriptomes (where typically multiple transcripts correspond to a single protein-coding gene) [31,32].

Acknowledgments We thank the Rokas lab for helpful discussion and feedback.

[END]
---
[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001827

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/