(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
A spatially aware likelihood test to detect sweeps from haplotype distributions
['Michael Degiorgio', 'Department Of Electrical Engineering', 'Computer Science', 'Florida Atlantic University', 'Boca Raton', 'Florida', 'United States Of America', 'Zachary A. Szpiech', 'Department Of Biology', 'Pennsylvania State University']
Date: 2022-06
In this section we begin by developing a new likelihood ratio test statistic, termed Λ, that evaluates spatial patterns in the distortion of the HFS as evidence for sweeps. We then demonstrate that Λ has substantially higher power than competing single-population haplotype-based approaches, across a number of model parameters related to the underlying demographic and adaptive processes. Similar to the T statistic implemented in the LASSI framework of [13], we also show that Λ is capable of approximating the softness of a sweep by estimating the current number of high-frequency haplotypes . We then apply the Λ statistic to whole-genome sequencing data from two human populations from the 1000 Genomes Project [31] and a population of brown rats from NYC [32].
Assume that in window , there is a K-truncated vector of counts which are the observed counts of the K most-frequent haplotypes, with x i1 ≥ x i2 ≥ ⋯ ≥ x iK ≥ 0 and normalized such that , where n i is the total number of sampled haplotypes in window i. Following [ 33 ] and [ 13 ], we then compute the log composite likelihood ratios for null hypothesis of neutrality at target window i ⋆ as and for the alternative hypothesis of m sweeping haplotypes at target window i ⋆ as Using these log likelihoods, we follow [ 13 ] and construct a log likelihood ratio test statistic of a sweep at target window i ⋆ as where We note that this approach treats windows as independent in the null and alternative hypotheses, thus making it a composite likelihood method that ignores recombination.
Following a similar powerful framework introduced by [ 33 ] for modeling balancing selection, we employ a mixture model to model the K-haplotype truncated frequency spectrum in window i, with a proportion deriving from a sweep model and a proportion 1 − α i (A) deriving from the genome-wide background haplotype spectrum to represent neutrality. Here, A is a parameter that we optimize over, describing the rate of decay of the effect of the sweep at target window i ⋆ on the flanking windows a certain distance away. Specifically, we model the K-truncated haplotype spectrum in window i as the vector where for k = 1, 2, …, K and . Note here that for target window i ⋆ , , and hence —i.e., the target window is on top of the sweep, and so it is entirely determined by the distorted m-sweeping haplotype spectrum. However given a fixed A value, for windows i far enough away from the central window i ⋆ , we have the α i (A) = 0, and therefore —i.e., the expectation of a neutral window. Based on these trends, windows far from the putatively selected target window are modeled as neutral, and windows close to the target window are heavily distorted due to the sweep. Moreover, because α i (A) tends to zero for windows far enough away for the central window, the model of neutrality is nested within our proposed sweep model. The schematic in Fig 1B illustrates the saltiLASSI framework of generating the spatially-distorted haplotype spectra.
To incorporate the spatial distribution haplotypic variation into the LASSI framework, consider an index set of contiguous (potentially overlapping) windows such that window has position along a chromosome denoted z i . This position could be in physical units (such as bases), in genetic map units (such as centiMorgans), in number of polymorphic sites (such as employed by nS L in [ 7 ]), or in window number. We model the relative contribution of a sweep with m sweeping haplotypes at target window with index by a parameter α i ∈ [0, 1] on window and the relative contribution of neutrality by 1 − α i .
(A) Generation of distorted haplotype frequency spectra (HFS) for m = 1 (red), 2 (blue), and 4 (purple) sweeping haplotypes from a genome-wide (gray) neutral HFS under the LASSI framework of [ 13 ]. (B) Generation of spatially-distorted HFS under the saltiLASSI framework for a window i (white circles) with increasing distance from the sweep location (yellow star). When the window is on top of the sweep location, the HFS is identical to the distorted LASSI HFS, and α i (A) = 1. When a window is far from the sweep location, the HFS is identical to the genome-wide (neutral) HFS, and α i ⋆ (A) = 0. For windows at intermediate distances from the sweep location, the HFS is a mixture of the distorted and genome-wide HFS, with the distorted HFS contributing α i (A) and the genome-wide HFS contributing 1 − α i (A). We show example spectra at windows a, b, c, and d that are of increasing distances from the sweep location i ⋆ , with i ⋆ < a < b < c < d.
Here we extend the LASSI maximum likelihood framework for detecting sweeps based on haplotype data [ 13 ], by incorporating the spatial pattern of haplotype frequency distortion in a statistical model of a sweep. Recall that [ 13 ] defined a genome-wide background K-haplotype truncated frequency spectrum vector which they assume represents the neutral distribution of the K most-frequent haplotypes, with p 1 ≥ p 2 ≥ ⋯ ≥ p K ≥ 0 and normalization such that . [ 13 ] then define the vector with and . This represents a distorted K-haplotype truncated frequency spectrum vector in a particular genomic region with a distortion consistent with m sweeping haplotypes. To create the these distorted haplotype spectra, [ 13 ] used the equation where f k ≥ 0 for k ∈ {1, 2, …, m} and , defines the way by which mass is distributed to the m “sweeping” haplotypes from the K − m non-sweeping haplotypes with frequencies p m+1 , p m+2 , …, p K . The variables U and ε are associated with the amount of mass from non-sweeping haplotypes that are converted to the m sweeping haplotypes (see [ 13 ]). We choose to set U = p K , and then vary ε ≤ U during optimization. [ 13 ] propose several reasonable choices of f k , and for all computations here we use . The schematic in Fig 1A illustrates the LASSI framework of generating the distorted haplotype spectra.
To apply the saltiLASSI method, we compute Λ at each window in the genome, where each window is considered the target window i ⋆ in turn, and the likelihood is maximized independently for each target window. That is, all parameters (m, A, and ε) are optimized at each target window i ⋆ , thereby permitting the footprint size A of the sweep to vary across the genome, adjusting for initial linkage disequilibrium and local recombination rates that could impact sweep signals. Similar to the way SweepFinder [ 17 ], SweepFinder2 [ 21 ], and LASSI [ 13 ] approach maximization, we optimize the likelihood via a grid search across m ∈ {1, 2, …, K}, ε ∈ [1/(100K), U], and A ∈ {A min , …, A max }. Here, A min = −ln 0.99999/d min , representing a value of A with a slow decay with distance; A max = −ln 0.00001/d min , representing a value of A with a fast decay with distance; and d min is the smallest distance between any two windows genome-wide. We make 100 equally spaced (in log-space) steps between A min and A max . Furthermore, in order to reduce computational burden, we pre-compute values across this grid for all windows.
Power to detect sweeps
The power to detect sweeps will depend on a number of factors, including window size used to compute a statistic, whether phasing information for genotypes is used, the selection strength of the beneficial mutation s, the age of the sweep t (i.e., time at which the selected mutation became beneficial), the number of selected haplotypes ν, and the underlying demographic history. To explore the power of Λ, we evaluate its power to detect sweeps of varying strengths, softness, and ages. For sweep settings, we considered only simulations in which the beneficial mutation established by reaching a frequency of at least 0.1, but we did not condition on fixation. Under each setting, we interrogated its robustness to demographic history, both through idealized constant-size histories and histories with recent severe bottlenecks. Moreover we gauged whether Λ yields false sweep signals under settings of background selection. Furthermore, for each setting described, we investigated the power and robustness of using unphased multilocus genotypes as input to Λ instead of phased haplotypes. In addition, we evaluated the effect of sample size n, number of haplotypes K to truncate the HFS, and recombination rate variation on the power of Λ to detect sweeps. Finally, we compared Λ to competing contemporary methods that use the same type of input data, using the T statistic of [13] for phased and unphased input data, and also considered the H12 [8], nS L [7], and iHS [5] statistics for phased data and the G123 statistic [10] for unphased data. The simulation protocol for all settings is described in the Methods section.
To begin, we compare the performance of Λ to T, H12, nS L , and iHS under a constant-size demographic history with diploid effective size of N = 104 diploid individuals. The Λ, T, and H12 statistics were computed for different window sizes, consisting of 51, 101, or 201 SNPs per window. Fig 2A and S1 Fig show that across sweeps of varying degrees of softness (beneficial mutation on ν ∈ {1, 2, 4, 8, 16} distinct haplotypes) and for sweeps of varying per-site per-generation strengths of s ∈ {0.01, 0.1}, the method with highest power regardless of time of selection (t ∈ {500, 100, 1500, 2000, 2500, 3000} generations prior to sampling) is Λ, thereby outperforming the competing methods. Interestingly, Λ applied to 51 SNP windows has generally higher power than with 101 and 201 SNP windows. Furthermore, smaller window sizes enable Λ to achieve high power even for old sweeps—with this elevated power often substantially higher than the closest competing method. This result recapitulates a finding of [13], where they observed that if the spatial distribution of the T statistic was used within a machine learning framework, computing the T statistic in a greater number of small windows yielded higher power for ancient sweeps than when a smaller number of large windows was used. This is an intriguing result, because smaller windows have poorer estimates of the distortion of the HFS, yet it appears that for detecting ancient sweeps what matters is capturing the overall spatial trend of the distortion of the HFS. That is, when using too large of windows, Λ is averaging the HFS across too large of a region, which has likely been broken up over time due to recombination for ancient sweeps. Instead, smaller windows focus on genomic segments with less shuffling of haplotype variation due to recombination events, such that distortions in the HFS are due to the effect of a sweep at a nearby selected site.
PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 2. Performance of detecting and characterizing sweeps. Performance for applications of Λ, T, and H12 with windows of size 51, 101, and 201 SNPs, as well nS L and iHS under simulations of (A) a constant-size demographic history or (B) the human central European (CEU) demographic history of [34]. Results are based on a sample of n = 50 diploid individuals and the haplotype frequency spectra for the Λ and T statistics truncated at K = 10 haplotypes. (Top row) Power at a 1% false positive rate as a function of selection start time. (Middle row) Estimated sweep width illustrated by mean estimated genomic size influenced by the sweep ( ) as a function of selection start time. Gray solid, dashed, and dotted horizontal lines are the corresponding mean values for Λ applied to neutral simulations. (Bottom row) Estimated sweep softness illustrated by mean estimated number of sweeping haplotypes ( ) as a function of selection start time. Gray solid, dashed, and dotted horizontal lines are the corresponding mean values for Λ applied to neutral simulations, and the red solid horizontal lines correspond to the number of sweeping haplotypes ν ∈ {1, 2, 4} assumed in sweep simulations. Sweep scenarios consist of hard (ν = 1) and soft (ν ∈ {2, 4}) sweeps with per-generation selection coefficient of s = 0.1 that started at t ∈ {500, 1000, 1500, 2000, 2500, 3000} generations prior to sampling. Results expanded across wider range of simulation settings can be found in S1–S3 and S7–S9 Figs as well as results for application to unphased multilocus genotype data in S4–S6 and S10–S12 Figs.
https://doi.org/10.1371/journal.pgen.1010134.g002
S1 Fig also highlights a key distinction among sweeps of different strengths. Specifically, regardless of method considered, each achieves its highest power when sweeps of strength s = 0.1 are recent, whereas for sweeps of strength s = 0.01, highest power for each method is shifted farther in the past toward more ancient sweep. This pattern was also found previously for H12 [10] and T [13]. The likely reason for this result is that sweeps of strength s = 0.01 require more time for the beneficial allele to reach high frequency and leave a conspicuous genomic footprint, with this greater time to reach high frequency associated with increased chance that recombination and mutation act to break up high-frequency haplotypes. In contrast, sweeps of strength s = 0.1 create an immediate selection signature to appear in the genome due to the rapid rise in frequency of a beneficial mutation, but traces of this sweep pattern erode over time due to recombination, mutation, and drift. However, regardless, the Λ statistic paired with a small window size yields uniformly better or comparable sweep detection ability than the other approaches we examined. We also found that all methods performed poorly when selection strength was s = 0.001.
During a scan with Λ, the composite likelihood ratio is optimized over the number of high frequency (sweeping) haplotypes m and the footprint size of the sweep A, leading to respective estimates and . Therefore, at a genomic location with evidence for a sweep (high Λ value), we may better understand properties of the putative sweep by evaluating its softness through and its strength or age through . S2 Fig shows that for sweeps of strength s = 0.01, the estimated number of sweeping haplotypes is considerably different from the actual number of initially-selected haplotypes ν, regardless of window size used or age of the sweep. In contrast, Fig 2A and S2 Fig reveal that for hard sweeps (ν = 1) of strength s = 0.1, the estimate of the number of sweeping haplotypes when using 51 SNP windows is often consistent with hard sweeps ( ) provided that the sweep is recent enough (within the last 500 generations). Similarly, under these same settings but with soft sweeps of ν ∈ {2, 4, 8, 16} selected haplotypes (Fig 2A and S2 Fig), the estimated number of sweeping haplotypes tends to be underestimated ( ) but is still consistent with a soft sweep ( ). Therefore, provided that a sweep is recent enough, when using 51 SNP windows the value of the estimated number of sweeping haplotypes can be used to lend evidence of a hard ( ) or a soft ( ) sweep.
Similarly, the other parameter estimate may also help characterize identified sweeps. Specifically, Fig 2A and S3 Fig show that the footprint size of the sweep (measured as ) is substantially elevated compared to expectation for neutral simulations for sweep times at which there is high power to detect sweeps (Fig 2A and S1 Fig). Interestingly, the shape of the curves relating the mean sweep footprint size over time mirror the power of the Λ statistic with corresponding window size as a function of sweep initiation time (t), sweep softness (ν), and sweep strength (s). These results suggest that the estimate of the sweep footprint size ( ) can be used to learn about the age or strength of a candidate sweep (the signatures of which appear to be confounded between the two parameters). Coupled with an estimate of the sweep softness ( ), our saltiLASSI framework provides a means to not only detect sweeps with high power, but to also learn the underlying parameters that may have shaped the adaptive evolution of candidate sweep regions.
Obtaining phased haplotypes for input to Λ represents an error-prone step that, without sufficient reference panels or high-enough quality genotypes, may make identification of sweeps difficult or potentially impossible for a number of diverse study systems. It is therefore beneficial if the favorable performance of Λ transfers to datasets that have not been phased. Similar to prior studies (e.g., [10, 13, 29, 32], we sought to evaluate the power of Λ when applied to unphased multilocus genotype data, and to compare its performance with the T statistic and G123 (analogue of H12 for use with unphased data) [10], both of which are also applied to unphased multilocus genotypes. S4 Fig shows that Λ maintains high power to detect sweeps of differing ages, strengths, and softness. Consistent with the results on haplotype data (Fig 2A and S1 Fig), Λ generally displays higher power than, or comparable power to, T and G123, with the best performance deriving from Λ with a small window size of 51 SNPs, and with substantially higher power for old sweeps compared to other approaches. An exception is that for recent (t ≤ 1000 generations) and highly soft (ν = 16) sweeps, using a window size of 101 SNPs for Λ had substantially higher power than using the smaller 51 SNP window. Moreover, for highly soft (ν = 16) and ancient (t ≥ 2000) sweeps with strength s = 0.1, the power of Λ is much lower with unphased multilocus genotypes compared to phased haplotypes (compare S1 and S4 Figs). Interpretation of is more difficult for multilocus genotypes compared to haplotypes. However, consistent with the results for haplotypes (S2 and S5 Figs) shows that when using 51 SNP windows, Λ tends to estimate a small number of sweeping multilocus genotypes (smaller ) for harder sweeps (smaller ν) than for softer sweeps (larger ν).
While adaptive processes generally affect variation locally in the genome, neutral processes such as demographic history influence overall levels of genome diversity. Specifically, it is common to consider that demographic processes impact the mean value of genetic diversity, and numerous likelihood approaches for detecting sweeps [13, 16–24] and other forms of natural selection [33, 35, 36] have been created to specifically account for this average effect of demographic history on genome diversity. However, demographic processes, such as recent severe bottlenecks, not only alter mean diversity but also influence higher-order moments of diversity, potentially making it insufficient to account solely for the mean effect of diversity [37–39]. Given that Λ does not account for higher moments than the mean effect of demographic history on the HFS, we sought to evaluate its properties under recent strong bottlenecks—a setting that has proven challenging for other sweep statistics in the past.
The Λ statistic generally exhibits superior power to T, H12, nS L , and iHS when applied to haplotype data (Fig 2B and S7 Fig) or to T and G123 when applied to unphased multilocus genotype data (S10 Fig). Moreover, the general trends in method power as a function sweep strength, softness, and age observed for the constant-size history (Fig 2A, S1 and S4 Figs) hold for this complex demographic setting (Fig 2B, S7 and S10 Figs), with the caveat that, as expected, power for all methods is generally lower under the bottleneck compared to the constant-size history. A clear difference between these two demography settings is that, whereas Λ had exhibited uniformly superior or comparable power with smaller 51 SNP windows compared to larger 101 or 201 SNP windows (Fig 2A and S1 Fig), under the bottleneck model the best window size depends on age of the sweep (Fig 2B and S7 Fig). In particular, recent sweeps often had highest power with 201 SNP windows, sweeps of intermediate age with 101 SNPs, and ancient sweeps with 51 SNPs. Therefore, under complex demographic histories, choice of window size for Λ is more nuanced than with constant-size histories. This result is consistent with those of [13] who demonstrated that, when accounting for the spatial distribution of the T statistic in a machine learning framework (referred to as T-Trendsetter), power to detect recent sweeps is higher for larger windows and power to detect ancient sweeps is higher for smaller windows under the bottleneck history considered here.
In addition to demographic history, a pervasive force acting to reduce variation across the genome is background selection [40–43], which is the loss of genetic diversity at neutral sites due to negative selection at nearby loci [44–46]. Background selection has been demonstrated to alter the neutral SFS [44, 47–49], and masquerade as false signals of positive selection [19, 44–47, 50–54]. However, because this process does not generally lead to haplotypic variation consistent with sweeps [55–57], like prior studies developing haplotype approaches for detecting sweeps [10, 13] we sought to evaluate the robustness of Λ to background selection. We find that under both simple and complex demographic histories, using either phased haplotype or unphased multilocus genotype data, all methods considered here demonstrate robustness to background selection by not falsely attributing genomic regions evolving under background selection as sweeps (S19 Fig).
Throughout our experiments, we have considered a per-site per-generation recombination rate of r = 10−8 for each simulation replicate. However, recombination rate is known to vary across the genome [58], and it is therefore important to evaluate the performance of Λ compared to other methods when recombination rate varies across genomic regions. To evaluate the effect of recombination rate variation on method performance, we drew per-site per-generation recombination rate from an exponential distribution with mean 10−8 (see Methods) for reach replicate neutral and sweep simulation under the bottleneck demographic history [34]. S13 and S16 Figs indicate that the Λ statistic generally has greater power than T, H12 (or G123), nS L , and iHS under phased haplotypes and unphased multilocus genotypes settings. These results further highlight the robustness of the Λ statistic to realistic genomic characteristics often encountered in empirical studies.
Finally, the number n of sampled individuals as well the number K of haplotypes used to truncate the HFS should affect the resolution at which we can model the distortion of the HFS due to a sweep, and thus would likely result in alterations of power of Λ to detect sweeps. As expected, Fig 3 shows that increasing sample size generally increases power of Λ to detect sweeps, with highest power typically obtained with the largest n and smallest window size combination (i.e., n = 50 with 51-SNP windows) and the lowest power with the smallest n and largest window size combination (i.e., n = 10 with 201-SNP windows). Moreover, as sample size increases, Λ is better able to detect sweeps of older age, and for extremely small samples (i.e., n = 10), the estimates of the number ν of sweeping haplotypes are poor. In contrast to changing sample size n, changing the number of haplotypes K to truncate the HFS does not have a substantial effect on the power of Λ to detect sweeps (Fig 4, with the power curves for a specific window size mostly the same across K ∈ {5, 10, 20}. This result mirrors that in S5 Fig of [13] for the T statistic, whereby changing K had little effect on method power. Instead, choice of K seems to more strongly influence the estimates of the number ν of sweeping haplotypes, with larger values of K permitting a wider range of estimates of m. This result mimics those observed for the T statistic by [13], in that the choice of K has a larger effect on the resolution to classify sweeps as hard or soft than it did on the ability to detect sweeps.
PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 3. Performance of detecting and characterizing sweeps. Performance for applications of Λ with windows of size 51, 101, and 201 SNPs under simulations of (A) a constant-size demographic history or (B) the human central European (CEU) demographic history of [34] and sample size of n ∈ {10, 25, 50} diploid individuals. Results are based on the haplotype frequency spectra for the Λ statistic truncated at K = 10 haplotypes. (Top row) Power at a 1% false positive rate as a function of selection start time. (Middle row) Estimated sweep width illustrated by mean estimated genomic size influenced by the sweep ( ) as a function of selection start time. (Bottom row) Estimated sweep softness illustrated by mean estimated number of sweeping haplotypes ( ) as a function of selection start time. Sweep scenarios consist of hard (ν = 1) and soft (ν ∈ {2, 4}) sweeps with per-generation selection coefficient of s = 0.1 that started at t ∈ {500, 1000, 1500, 2000, 2500, 3000} generations prior to sampling. Results expanded across wider range of simulation settings can be found in S20–S22 and S26–S28 Figs as well as results for application to unphased multilocus genotype data in S23–S25 and S29–S31 Figs.
https://doi.org/10.1371/journal.pgen.1010134.g003
[END]
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010134
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/