(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics [1]
['Alexander L. Lewanski', 'Department Of Integrative Biology', 'Michigan State University', 'East Lansing', 'Michigan', 'United States Of America', 'W.K. Kellogg Biological Station', 'Hickory Corners', 'Ecology', 'Evolution']
Date: 2024-01
In the following section, we will incrementally develop an intuition for what ARGs are by starting with the fundamentals of sexual reproduction and genealogical relatedness, which will help clarify how ARGs emerge from these first principles of biology. To simplify our discussion, we will focus on the nuclear genome of sexual, diploid organisms and meiotic recombination throughout the paper. However, the ideas covered here are relevant to any organism across the tree of life as well as viruses whose genomes undergo any type of recombination (e.g., gene conversion, bacterial conjugation). For more technical treatments of ARGs, we direct interested readers to [22–25].
The genetic perspective of relatedness is further complicated by another feature of meiosis: recombination. Meiotic recombination, the shuffling of genetic material in the genome during meiosis, occurs via 2 processes: (1) exchange of genetic material between homologous chromosomes via crossing over during prophase I; and (2) random assortment of homologous chromosomes during anaphase I. These recombinational processes can produce a mosaic of genetic ancestry across the haploid genome of the gamete so that a particular gametic genome potentially contains genetic material inherited from different parents both between non-homologous chromosomes and within chromosomes. Recombination therefore results in different histories of inheritance (and thus different genealogies) across the genome, with topological changes to the genealogy associated with recombination breakpoints and different chromosomes [30].
If we concentrate on a particular position in an individual’s genome, we see that each DNA copy traverses just one of the manifold possible paths (i.e., series of connected nodes and edges) in the pedigree. The specific pedigree paths through which copies at a particular position in contemporary individuals were transmitted from their ancestors represent the genetic genealogy at that position [28,29]. Similar to a pedigree, each edge in the genealogy represents a transmission event of genetic material from parent to offspring. However, in a pedigree, each node is a diploid individual, while in a genetic genealogy, each node represents 1 of 2 haploid sequences within a diploid individual: the specific genomic copy sampled to create a gamete that passes genetic material from a parent to the current individual. This genetic genealogy is embedded in the pedigree (Fig 1A; portions in dark gray and color). The sequence of relationships defined by the pedigree constrains the possible nodes and edges that can exist in the genealogy, but does not fully dictate the identities of these nodes and edges. The structure of a genetic genealogy is determined by both the pedigree structure and the outcome of the gametogenic genome sampling at each reproduction event in the pedigree.
This discussion of the pedigree highlights multiple key ideas in our build-up to ARGs. First, because each parent contributes only 1 DNA copy at a particular genomic position to its offspring, each copy experiences its own unique history of inheritance through the pedigree. Second, because a parent only contributes half of its genome to each offspring and not all individuals reproduce, only a subset of the genetic material possessed by historical individuals in the pedigree ends up in contemporary individuals. As you travel further back in the pedigree, despite the geometric increase in the number of expected genealogical ancestors (a maximum of 2^n ancestors, where n equals the number of generations back in time), an increasing proportion of these ancestors contributes no genetic material to their contemporary descendants [26,27].
By itself, the pedigree can provide coarse estimates of genetic ancestry, such as the expected genetic relatedness between individuals (e.g., 0.50 between full siblings; 0.125 between first cousins), or the expected proportion of the genome inherited from a particular genealogical ancestor. However, for any region of the genome, we are unable to ascertain from the pedigree alone whether it is the parent’s maternal or paternal copy that has been transmitted. Thus, we are restricted to calculating expected quantities. We could therefore gain more in-depth knowledge of ancestry in the genome by explicitly tracking the transmission of DNA sequences down the pedigree from specific parental to offspring chromosomes.
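The expected relatedness values quoted above can be recovered from a pedigree alone. The following is a minimal sketch, assuming a small hypothetical pedigree (the individuals `GM`, `GF`, `U1`, `U2`, `A`, `B`, `C`, and `D` are invented for illustration and are not taken from Fig 1) and assuming that no individual is inbred, so that expected relatedness equals twice the kinship coefficient. The recursive kinship calculation is a standard textbook approach, not a method described in this paper.

```python
from functools import lru_cache

# Hypothetical pedigree (not the one in Fig 1): individual -> (mother, father).
# Founders have unknown parents (None). Parents are listed before offspring.
PEDIGREE = {
    "GM": (None, None), "GF": (None, None),  # founding couple
    "U1": (None, None), "U2": (None, None),  # unrelated founders who marry in
    "A": ("GM", "GF"), "B": ("GM", "GF"),    # full siblings
    "C": ("A", "U1"), "D": ("U2", "B"),      # first cousins
}
ORDER = {name: rank for rank, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(x, y):
    """Probability that one allele drawn at random from x and one from y
    are identical by descent, given only the pedigree."""
    if x is None or y is None:
        return 0.0  # unknown parents of founders are treated as unrelated
    if x == y:
        mother, father = PEDIGREE[x]
        return 0.5 * (1.0 + kinship(mother, father))
    if ORDER[x] < ORDER[y]:
        # Recurse through the individual listed later, which cannot be an
        # ancestor of the other one.
        x, y = y, x
    mother, father = PEDIGREE[x]
    return 0.5 * (kinship(mother, y) + kinship(father, y))


def expected_relatedness(x, y):
    """Expected relatedness, assuming neither individual is inbred."""
    return 2.0 * kinship(x, y)


print(expected_relatedness("A", "B"))  # full siblings: 0.5
print(expected_relatedness("C", "D"))  # first cousins: 0.125
```

These are expectations only: the pedigree does not reveal which parental copy was actually transmitted at any given position, which is the gap that explicit tracking of DNA transmission fills.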
Fig 1. Overview of ARGs. In all ARG depictions (A, B, D), nodes are indicated by small circles, and each node represents a single set of one or more chromosomes (a haploid genome) of an individual. The node coloration indicates whether or not it is involved in recombination, and the specific pattern (shading and outline) of the node indicates its type: nonsample, unary (nonsample), sample. The genome is divided into 3 non-recombining regions (orange, blue, and green). (A) The relationships of multiple individuals can be organized into a pedigree (light gray portions). An ARG is embedded in a pedigree (portions in dark gray and color) and represents the set of pedigree paths through which genetic material is transmitted. (B) The graphical representation of an ARG. Edges (the connections between nodes) are colored and annotated with the non-recombining region(s) that they transmit. (C) A plot recording the lineage count through time in the ARG. Backward in time, coalescent events, which occur at the dark gray points, merge lineages and thus reduce the lineage count. The red points highlight the times at which recombination occurs, which splits lineages backward in time and therefore increases the lineage count. (D) An ARG can be formulated as a series of local trees that share nodes and edges. Each non-recombining region possesses its own local tree. The regions are separated by a recombination event, which, when moving between regions, prunes a portion of the tree and regrafts it to another node. This action means that nearby trees are generally quite similar in structure. The arrows in the left 2 trees show how recombination relocates a branch in the tree (reconnecting to the small, light gray node) to form the tree of the region immediately to the right. The dashed lines in the second and third trees highlight each tree’s shared structure with its leftward neighbor.
In sexual, diploid organisms, haploid gametes are generated by the sampling of a single DNA copy of every position in the genome during meiosis. During reproduction, the parents’ gametes fuse, which leads to a diploid offspring. The relationships between a set of individuals can be represented by a genealogical pedigree (Fig 1A; light gray portions), in which each individual has 2 parents, from each of whom it has inherited exactly half of its genome. The pedigree consists of nodes, which represent individual organisms, and edges, which connect a subset of the nodes and signify parent–offspring relationships.
Ancestral recombination graphs
The complex web of genetic genealogies across the genome is recorded in a graphical structure known as an ARG, which provides extensive information regarding the history of inheritance for a set of sampled genomes. Each node in an ARG represents a haploid genome (a haplotype) in a real individual that exists now or in the past [25]. Each diploid individual therefore contains 2 haploid genomes and is represented by 2 nodes. We refer to nodes corresponding to sampled genomes (often, though not necessarily [31–33], sampled in the present) as sample nodes and all other nodes as nonsample nodes. If sample nodes have no sampled descendants, they constitute the tips of an ARG. Sample nodes are particularly salient because ARGs are generally specified in terms of the genetic ancestry of these genomes. Edges in an ARG indicate paths of inheritance between nodes. ARGs are technically described as “directed graphs” because genetic material flows unidirectionally from ancestors to descendants.
Assuming that sample nodes are sourced from contemporary individuals, the present time in an ARG (the bottom of the vertical axes in Fig 1B and 1D) contains a lineage (i.e., a set of one or more edges connected by nodes forming a continuous path of inheritance) for each sample. Tracing the lineages back in time, some nodes have 2 edges enter on the future-facing side but only a single outbound edge on the past-facing side (e.g., node Ⓡ in Fig 1B). These nodes represent haplotypes in which 2 lineages find common ancestry and thus merge into a single lineage, which reduces the lineage count by one (the dark gray points in Fig 1C). Common ancestry events additionally represent coalescence when (backward in time) the 2 merging edges contain the same portion of the genome (note that all nodes corresponding to common ancestry events in Fig 1 (Ⓚ, Ⓟ, Ⓡ, Ⓦ, and Ⓧ) also correspond to coalescence). From an organismal perspective, nodes corresponding to coalescence represent an instance in which a parent provides the same (portion of a) haploid genome to multiple offspring and thus splits a lineage into multiple lineages forward in time.
Conversely, other nodes have a single edge enter on the future-facing side but 2 edges exit the past-facing side (e.g., node Ⓠ in Fig 1B), which represents the outcome of recombination [2]. Backward in time, the node with 2 outbound edges on the past-facing side is the recombinant offspring node whose genome is inherited from 2 parental nodes (e.g., node Ⓒ in Fig 1). The 2 nodes that each receive one of the outbound edges are the parental nodes whose genomes are recombined in the offspring node. For example, in Fig 1, Ⓖ and Ⓗ are the parental nodes of Ⓒ. From an organismal perspective, these nodes occur when an offspring receives one of its haploid genomes from a parent and that haploid genome represents the outcome of recombination between the parent’s 2 haploid genomes. Recombination splits the genome into separate lineages and thus each portion of the genome experiences a distinct history of inheritance between (traversing an ARG from present to past) the recombination event from which they split to the coalescence event in which they join back up. Consequently, each recombination event increases the number of lineages in an ARG by one (the red points in Fig 1C; [34]). From a forward-in-time perspective, recombination fuses portions of 2 parental genomes into a single haplotype (in the recombinant offspring), and thus unites separate lineages into a single lineage. Nodes through which genomic material that is eventually inherited by a sample node (hereafter ancestral material) is transmitted but are involved in neither common ancestry nor recombination for the ancestral material do not determine the topology of an ARG and thus are frequently omitted (we retain several of these nodes in Fig 1 to highlight the effects of recombination). More generally, nodes with only 1 descendant (unary nodes; e.g., node Ⓢ in Fig 1) do not directly influence genealogical relationships between the sample nodes. 
In simulations, unary nodes are often removed via a process called simplification [35].
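The bookkeeping described above (each merger of 2 lineages lowers the lineage count by one, each recombination event raises it by one, backward in time) can be sketched in a few lines. This is an illustrative sketch only: the event times below are hypothetical and are not read off Fig 1C, and the function name `lineage_counts` is our own.

```python
def lineage_counts(n_samples, events):
    """Track the number of lineages backward in time.

    events: iterable of (time, kind) pairs, where kind is "coalescence"
    (2 lineages merge into 1) or "recombination" (1 lineage splits into 2).
    Returns a list of (time, lineage_count) pairs, starting at the present.
    """
    count = n_samples
    trajectory = [(0.0, count)]
    for time, kind in sorted(events):
        if kind == "coalescence":
            count -= 1
        elif kind == "recombination":
            count += 1
        else:
            raise ValueError(f"unknown event type: {kind!r}")
        trajectory.append((time, count))
    return trajectory


# Hypothetical history: 4 samples, 2 recombination events, 5 mergers.
events = [
    (1.0, "coalescence"), (2.0, "recombination"), (3.0, "coalescence"),
    (4.0, "recombination"), (5.0, "coalescence"), (6.0, "coalescence"),
    (7.0, "coalescence"),
]
trajectory = lineage_counts(4, events)
print([count for _, count in trajectory])  # [4, 3, 4, 3, 4, 3, 2, 1]
```

With 4 samples, 2 recombination events, and 5 pairwise mergers, the count ends at 4 + 2 - 5 = 1, i.e., a single lineage remains.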
ARGs generally record the timing of each node and the ancestral material that each edge transmits between ancestors and descendants. To trace the genealogy for a particular position in the genome, you follow the edges through the ARG that contain the focal position [22]. For example, in Fig 1B, to extract the genealogy for a position in the orange region (between positions L_0 and L_1) of sample node Ⓑ, you would follow the edges that transmit the orange region between nodes (i.e., Ⓑ → Ⓚ → Ⓡ → Ⓦ → Ⓧ).
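The edge-following procedure can be sketched as below. The ARG here is a hypothetical toy (nodes `a`, `b`, `r`, `p`, `q`, `g` and a genome spanning positions 0 to 30 are invented for illustration and do not correspond to the nodes in Fig 1), and each edge is assumed to carry a half-open interval `[left, right)` of the genome that it transmits.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float   # start of the transmitted interval (inclusive)
    right: float  # end of the transmitted interval (exclusive)
    parent: str
    child: str


def trace_lineage(edges, start, position):
    """Follow, backward in time, the edges that transmit `position`."""
    path = [start]
    node = start
    while True:
        parents = [e.parent for e in edges
                   if e.child == node and e.left <= position < e.right]
        if not parents:
            return path  # no further ancestor recorded for this position
        if len(parents) > 1:
            raise ValueError("a position must be inherited from one parent")
        node = parents[0]
        path.append(node)


# Hypothetical toy ARG: node r is a recombinant that inherits [0, 10) from p
# and [10, 30) from q.
edges = [
    Edge(0, 30, "r", "a"),
    Edge(0, 10, "p", "r"),
    Edge(10, 30, "q", "r"),
    Edge(0, 30, "p", "b"),
    Edge(0, 30, "g", "p"),
    Edge(10, 30, "g", "q"),
]

print(trace_lineage(edges, "a", 5))   # ['a', 'r', 'p', 'g']
print(trace_lineage(edges, "a", 15))  # ['a', 'r', 'q', 'g']
```

The same sample node yields different paths for positions on either side of the breakpoint, which is exactly why each non-recombining region has its own genealogy.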
The fact that each genomic region bracketed by recombination breakpoints (hereafter non-recombining region) possesses its own genealogy and that a non-recombining region in a single sample node traces only 1 path back to the MRCA of the entire sample suggests an alternative representation of an ARG: an ordered set of genealogical trees along the genome with labeled sample and nonsample nodes to specify how nodes are shared between trees (Fig 1D; [22]). Considering this representation of an ARG, which we refer to as the tree representation, is worthwhile because ARGs are often formulated (see Box 3) and operationalized in inference (e.g., [36–38]) based on this representation. In the tree representation, each non-recombining region has its own local tree that represents the region’s evolutionary history. If each recombination breakpoint occurs at a unique position in the genome, as you shift from one local tree to the next (amounting to traversing one recombination breakpoint), the structure of the new tree is identical to its neighbor except for a single edge that is removed and then affixed to a (potentially) new node (Fig 1D). In computational parlance, this action is called a subtree-prune-and-regraft operation [39]. When all recombination events occur at unique locations and each event involves only 1 breakpoint, the total number of local trees will equal one more than the number of recombination events defining the evolutionary relationships in the genome. For example, in Figs 1 and 2, 2 recombination events generate 3 trees. If recombination events occur at the same location (a breakpoint represents >1 recombination event), then moving between adjacent trees will involve a corresponding number of subtree-prune-and-regraft operations (one representing each recombination event), and the tree count will be less than one plus the number of recombination events.
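The relationship between recombination events and the number of local trees can be checked with a small sketch. It assumes, as in the text, that each recombination event contributes a single breakpoint, and additionally that every breakpoint lies strictly inside the genome; the function name and the coordinates are hypothetical.

```python
def local_tree_intervals(genome_length, breakpoints):
    """Return the genomic interval covered by each local tree.

    breakpoints: one position per recombination event. Events that share a
    position define the same boundary, so they do not add an extra tree.
    """
    for b in breakpoints:
        if not 0 < b < genome_length:
            raise ValueError("breakpoints must fall strictly inside the genome")
    cuts = sorted(set(breakpoints))
    bounds = [0, *cuts, genome_length]
    return list(zip(bounds[:-1], bounds[1:]))


# 2 events at unique positions: 2 + 1 = 3 local trees.
unique = local_tree_intervals(30, [10, 20])
print(unique)  # [(0, 10), (10, 20), (20, 30)]

# 3 events, 2 of which share a breakpoint: still 3 trees, fewer than 3 + 1.
shared = local_tree_intervals(30, [10, 10, 20])
print(len(shared))  # 3
```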
Fig 2. The encoding of local trees and genotype data in the succinct tree sequence format. (A) Depiction of the local trees shown in Fig 1 with timing and location of mutation events mapped onto the branches and the location of each site shown on the genome. The black, dashed lines represent the invariant sites and the thicker, solid lines represent variant sites corresponding to each mutation. The trees are annotated with horizontal, dashed lines (labeled T_0 to T_IX) that denote either the timing of coalescence or mutation events. (B) The trees and genotype data in the succinct tree sequence format. The trees are specified with the nodes and edges tables. The nodes table contains an ID and age for each node. The edges table contains the left (Genome start) and right (Genome end) positions of the genome over which each edge persists, while the Parent column contains the nodes that transmit material to the nodes in the Child column. The genotypic information is included in the sites [genomic position of each site (Position), ancestral state (Ancestral)] and mutations [derived state (Derived), mutation timing (Age)] tables. (C) The equivalent genotype data for the 4 sample nodes stored in a more conventional matrix format with the rows representing each sample node and the columns representing each genomic site. Note that with small amounts of genetic data such as this simple example, the tree sequence may require more storage space than a standard genotype matrix format. However, when considering realistic genomes, the tree sequence rapidly becomes much more efficient at storing genetic data with growing sample sizes [41].
https://doi.org/10.1371/journal.pgen.1011110.g002
Box 3: The succinct tree sequence The correlated nature of an ARG’s local trees can be exploited to compactly encode the trees in a data structure termed the succinct tree sequence or tree sequence for short (Fig 2A and 2B; [35,40]). The tree sequence defines the trees using 2 tables. The node table contains an identifier and the timing of each node (first table in Fig 2B). The edge table documents the edges shared between adjoining trees by recording the parent and offspring nodes of each edge and the contiguous extent of the genome that each edge covers (second table in Fig 2B). The key innovation here is that the data structure eliminates substantial redundancy. Instead of storing each tree independently, which would necessitate duplication of shared nodes and edges, the tree sequence records each shared component just once. The basic tree sequence technically does not encode the full ARG, which includes all coalescent and recombination events. The basic tree sequence only explicitly contains information on the coalescent events and does not detail the timing and specific changes that differentiate adjacent trees; [41] explain this distinction as follows: the full ARG “encodes the events that occurred in the history of a sample” while the set of local trees recorded in the tree sequence “encodes the outcome of those events.” Nonetheless, the tree sequence can be elaborated with recombination information to more exhaustively document genetic ancestry (e.g., [47,91]). Several properties of the tree sequence have revolutionized ARG-based research. First, its concise nature means that an immensity of genealogical information can be stored in a highly compressed manner. The tree sequence is also a flexible format that can be augmented with additional tables to store other information such as location metadata and DNA data (e.g., third and fourth tables in Fig 2B; Fig 2A). 
Notably, relative to conventional genotype matrix formats (Fig 2C), DNA data can be represented much more efficiently using the tree sequence. For example, [41] estimated that the tree sequence format could store genetic variant data for 10 billion haploid human-like chromosomes in approximately 1 TB, which is many orders of magnitude smaller than the approximately 25 PB required to store these data in a VCF [67]. The efficiency of the tree sequence also permits significant speed-ups in computation (e.g., through the implementation of fast algorithms). These features have enabled advancements in the scale and scope of ARG-based analyses and are increasingly accessible given that the tree sequence underpins a growing ecosystem of methods and software including tsinfer [41], sc2ts [95], ARGinfer [91], msprime [47], and tskit [35] built to infer, simulate, and analyze ARGs. Further details on the tree sequence can be found in the papers introducing and expanding the tree sequence [35,40,91] and in the documentation of tskit [35].
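To make the node and edge tables concrete, the sketch below encodes 2 hypothetical local trees as plain Python tuples and recovers each tree from the shared edge table. It deliberately does not use tskit; the table contents, node IDs, times, and the helper `local_trees` are all invented for illustration and are not the tables shown in Fig 2B. Intervals are assumed to be half-open, `[left, right)`.

```python
# Node table: (id, time). Nodes 0-3 are samples; times are hypothetical.
nodes = [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0),
         (4, 1.0), (5, 2.0), (7, 3.0), (6, 4.0)]

# Edge table: (left, right, parent, child). An edge shared by adjacent trees
# is stored once, with an interval spanning both of them.
edges = [
    (0, 20, 5, 2), (0, 20, 5, 3),                  # present in both trees
    (0, 10, 4, 0), (0, 10, 4, 1),                  # first tree only
    (0, 10, 6, 4), (0, 10, 6, 5),
    (10, 20, 6, 0), (10, 20, 7, 1),                # second tree only
    (10, 20, 7, 5), (10, 20, 6, 7),
]


def local_trees(edges, genome_length):
    """Yield ((left, right), {child: parent}) for each local tree."""
    bounds = sorted({0, genome_length}
                    | {e[0] for e in edges} | {e[1] for e in edges})
    for left, right in zip(bounds[:-1], bounds[1:]):
        parent_of = {child: parent for (l, r, parent, child) in edges
                     if l <= left and right <= r}
        yield (left, right), parent_of


trees = list(local_trees(edges, 20))
for interval, parent_of in trees:
    print(interval, parent_of)

# Redundancy removed by sharing: edges stored once vs. once per tree.
stored = len(edges)
per_tree_total = sum(len(parent_of) for _, parent_of in trees)
print(stored, per_tree_total)  # 10 12
```

In this toy case, sharing saves only 2 edge records, but the saving grows as more trees, each differing from its neighbor by a single subtree-prune-and-regraft operation, are laid out along the genome.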
With inclusion of all nodes involved in recombination and coalescence relevant to the sample nodes, it is straightforward to switch between the 2 ARG representations. As previously discussed, the local tree for a particular non-recombining region can be extracted from the graphical representation by starting at each sample node and tracing the lineages that transmit the region through the ARG until all lineages meet in the MRCA. Conversely, you can recover the graphical representation from the local trees by starting with the tree at one end of the set and then sequentially working across the trees, combining the shared nodes and edges, adding the nodes and edges that are not yet included in the graphical structure, and annotating each edge with the non-recombining region(s) that it transmits. As a brief illustration, in Fig 1D, the first 2 trees both contain nodes Ⓢ and Ⓠ with a connecting edge. In the graphical representation, these shared components would be merged and the edge would be annotated with the transmission of the regions between positions L_0 and L_2 (as shown in Fig 1B).
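The tree-to-graph direction described above can be sketched as a single pass across the ordered local trees. The 3 trees below are hypothetical (node names and coordinates are invented and do not reproduce Fig 1D); each tree is assumed to be given as a half-open genomic interval plus a child-to-parent mapping.

```python
def merge_local_trees(trees):
    """Combine ordered local trees into one annotated edge set.

    trees: list of ((left, right), {child: parent}) ordered along the genome.
    Returns {(parent, child): [(left, right), ...]}, where the intervals of an
    edge that persists across adjacent trees are merged into one.
    """
    graph = {}
    for (left, right), parent_of in trees:
        for child, parent in parent_of.items():
            spans = graph.setdefault((parent, child), [])
            if spans and spans[-1][1] == left:
                spans[-1] = (spans[-1][0], right)  # extend the existing span
            else:
                spans.append((left, right))
    return graph


# Hypothetical local trees for regions [0, 10), [10, 20), and [20, 30).
trees = [
    ((0, 10), {"a": "u", "b": "u", "u": "w", "c": "w"}),
    ((10, 20), {"a": "u", "b": "u", "u": "x", "c": "x"}),
    ((20, 30), {"b": "v", "c": "v", "v": "x", "a": "x"}),
]

graph = merge_local_trees(trees)
for (parent, child), spans in graph.items():
    print(f"{child} -> {parent}: {spans}")
```

Here the edge from `a` to `u` appears in the first 2 trees and is therefore stored once with the annotation `[(0, 20)]`, mirroring how a shared edge in Fig 1D is annotated with both regions in Fig 1B.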
A recombination event can have several consequences for the structure of adjacent trees. First, it could alter the topology (i.e., the specific branching structure) if the new edge joins to a node on a different edge (e.g., the first and second trees in Fig 1D). However, if the new edge joins to a different node on the same edge, the topology will remain unchanged, and only the edge lengths (i.e., coalescent times) will be modified (e.g., the second and third trees in Fig 1D). It is also possible for the lineage to coalesce back into the same node, which would result in no change to the tree structure. Each local tree contains every sample node because all samples possess the entire genome (and thus every non-recombining region represented by each tree). However, the collection of nonsample nodes can differ across trees. If an ARG includes all nodes (i.e., every nonsample node is retained), the absence of a node in a local tree signals that it does not represent a genetic ancestor for that region. If an ARG has been simplified (unary nodes removed), the absence of a node either means that it is not a genetic ancestor or that the node does not represent a genome in which coalescence occurred that involved the sample nodes.
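One way to distinguish these outcomes computationally is to compare the sets of samples found below each internal node (the clades) in adjacent trees. The sketch below is an assumption-laden illustration rather than a method from the paper: the trees and node names are hypothetical, each tree is a child-to-parent mapping, and 2 rooted trees are treated as topologically identical when their clade sets match.

```python
def clades(parent_of, samples):
    """Return the set of sample groups that descend from each internal node."""
    below = {}
    for sample in samples:
        node = sample
        while node in parent_of:
            node = parent_of[node]
            below.setdefault(node, set()).add(sample)
    return {frozenset(members) for members in below.values()}


def compare_adjacent_trees(left_tree, right_tree, samples):
    """Classify how the tree to the right differs from its left neighbor."""
    if left_tree == right_tree:
        return "unchanged"
    if clades(left_tree, samples) == clades(right_tree, samples):
        return "same topology, different nodes or edge lengths"
    return "different topology"


samples = ["a", "b", "c"]
# Hypothetical local trees; node x is assumed to be older than node w.
tree_1 = {"b": "v", "c": "v", "v": "x", "a": "x"}
tree_2 = {"a": "u", "b": "u", "u": "x", "c": "x"}
tree_3 = {"a": "u", "b": "u", "u": "w", "c": "w"}

print(compare_adjacent_trees(tree_1, tree_2, samples))  # different topology
print(compare_adjacent_trees(tree_2, tree_3, samples))
# same topology, different nodes or edge lengths
print(compare_adjacent_trees(tree_3, tree_3, samples))  # unchanged
```

Between `tree_2` and `tree_3`, sample `c` joins the same ancestral lineage at a different node, so the branching order is preserved and only the coalescence time changes.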
There are several key characteristics of an ARG’s tree representation. First, the subtree-prune-and-regraft operations that differentiate adjacent trees highlight that nearby trees are generally quite similar and frequently share many nodes and edges [28,30]. A series of shared nodes and edges between trees indicates that the corresponding non-recombining regions were found in the same lineage in that portion of the ARG. The correlated nature of the trees can be exploited for highly efficient tree storage and computation (Fig 2A and 2B; see Box 3 for further details; [35,40]). Second, although local trees can overlap in structure, a tree can contain components that are not universally found across the entire set of trees (e.g., in Fig 1D, node Ⓢ in the first tree is not found in the third tree). One feature that can frequently differ between trees is the node in which all sample nodes first find common ancestry (i.e., all lineages coalesce into a single lineage), which represents the region’s most recent common ancestor (MRCA). When these local MRCAs exist at different times in the past, the trees will vary in height [28]. If all genomic regions trace their ancestry back to the same ancestor(s) in an ARG, the first node in which this occurs represents the Grand MRCA (GMRCA). It is possible for the same node to represent the GMRCA and one or more local MRCAs. For example, in Fig 1, node Ⓧ is the GMRCA and the local MRCA for the first 2 non-recombining regions. However, this is not always the case. In fact, the GMRCA frequently predates any of the local MRCAs, which would result in it being absent from all of the (simplified) local trees.
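The distinction between local MRCAs and the GMRCA can be illustrated with a short sketch. It assumes hypothetical local trees given as child-to-parent mappings that retain the ancestry above each local MRCA (i.e., the trees are not simplified), a non-empty list of samples, and invented node names and times; none of these values come from Fig 1.

```python
def path_to_root(parent_of, node):
    """List a node and its ancestors in one local tree, youngest first."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path


def common_ancestors(parent_of, samples):
    """Nodes ancestral to every sample in one local tree, youngest first."""
    paths = [path_to_root(parent_of, s) for s in samples]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    return [node for node in paths[0] if node in shared]


def local_mrca(parent_of, samples):
    return common_ancestors(parent_of, samples)[0]


def grand_mrca(trees, samples):
    """Youngest node that is a common ancestor of all samples in every tree."""
    per_tree = [common_ancestors(t, samples) for t in trees]
    shared = set(per_tree[0]).intersection(*(set(p) for p in per_tree[1:]))
    candidates = [node for node in per_tree[0] if node in shared]
    return candidates[0] if candidates else None


samples = ["a", "b", "c"]
time = {"a": 0.0, "b": 0.0, "c": 0.0, "u": 1.0, "w": 2.0, "g": 5.0}
tree_left = {"a": "u", "b": "u", "u": "w", "c": "w", "w": "g"}
tree_right = {"a": "u", "b": "u", "u": "g", "c": "g"}

mrcas = [local_mrca(t, samples) for t in (tree_left, tree_right)]
heights = [time[m] for m in mrcas]
print(mrcas, heights)                                # ['w', 'g'] [2.0, 5.0]
print(grand_mrca([tree_left, tree_right], samples))  # g
```

In this toy case the 2 local trees differ in height, and the GMRCA coincides with the local MRCA of the second tree. In a simplified tree, the edge from `w` to `g` would be dropped from the first tree, so `g` would no longer appear there.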
Although the information contained in the graphical and tree representations of an ARG is the same, many readers, especially those with a background in phylogenetics, may prefer to think about ARGs via their tree representations. Unlike the graphical representation, each local tree is a familiar object: it is strictly bi- or multi-furcating, meaning that each node has exactly 1 ancestor and 2 or more descendants, and that therefore the tree contains no loops (i.e., it is non-reticulate), and is the desired result of a phylogenetic analysis run on a multiple sequence alignment of the DNA in the tree’s non-recombining region. Building off this intuition, a phylogeneticist may draw on experience and imagine the set of local trees as analogous to a Bayesian posterior distribution of phylogenies. However, although this intuition may be initially useful, it is important to remember that each local tree is not independent of the others, both because each is generally separated from its neighbors by a small number of recombination events (so is therefore highly correlated), and because the same nodes and edges may appear across multiple local trees. The shared structure of trees imbues the nodes and edges with different properties relative to the analogous components in a standard phylogeny. For example, in a standard phylogeny, branches depict ancestor–descendant relationships through time and thus are one-dimensional. In contrast, edges in an ARG exist both through time and across the genome, and thus can be conceptualized as two-dimensional [42]. This two-dimensionality can be seen in Fig 1B where edges extend along the vertical, time dimension and also along different extents of the genome (edges contain different sets of genomic regions). Equivalently, the genome dimension of edges manifests in an ARG’s tree representation (Fig 1D) through edges persisting across different sets of local trees. 
The overlapping nature of local trees (i.e., shared nodes and edges) underlies much of an ARG’s utility and facilitates the power of ARG-based inference, which we discuss later in the review.
[END]
---
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011110
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/