(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Recommendations for improving statistical inference in population genomics

Parul Johri (School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America); Charles F. Aquadro (Department of Molecular Biology and Genetics, Cornell University)

Date: 2022-06

Constructing an appropriate baseline model for population genomic analysis

The somewhat disheartening exercise of fitting incorrect models to data (as depicted in Fig 1) naturally raises the questions of whether, and if so how, accurate evolutionary inferences can be extracted from DNA sequences sampled from a population. The first point of importance is that the starting point for any genomic analysis should be the construction of a biologically relevant baseline model, which includes the processes that must be occurring and shaping levels and patterns of variation and divergence across the genome. This model should include mutation, recombination, and gene conversion (each as applicable), purifying selection acting on functional regions and its effects on linked variants (i.e., background selection [21,68,69]), as well as genetic drift as modulated by, among other things, the demographic history and geographic structure of the population. Depending on the organism of interest, there may be other significant biological components to include, such as mating system, progeny distributions, ploidy, and so on (although, for certain questions of interest, some of these biological factors may simply be included in the resulting effective population size). It is thus helpful to view this baseline model as being built from the ground up for any new data analysis. Importantly, the point is not that these many parameters need to be fully understood in a given population in order to perform any evolutionary inference, but rather that they all require consideration, and that the effects of uncertainties in their underlying values on downstream inference can be quantified.

However, even prior to considering any biological processes, it is important to investigate the data themselves. First, there exists an evolutionary variance associated with the myriad of potential realizations of a stochastic process, as well as the statistical variance introduced by finite sampling. Second, it is not advisable to compare one’s empirical observations, which may include missing data, variant calling or genotyping uncertainty (e.g., effects of low coverage), masked regions (e.g., regions in which variants were omitted due to low mappability and/or callability) and so on, against either an analytical or simulated expectation that lacks those considerations and thus assumes optimal data resolution [70]. The dataset may also involve a certain ascertainment scheme, either for the variants surveyed [71], or given some predefined criteria for investigating specific genomic regions (e.g., regions representing genomic outliers with respect to a chosen summary statistic [72]). For the sake of illustration, Fig 2 follows the same format as Fig 1, but considers 2 scenarios: population growth with background selection and selective sweeps and the same scenario together with data ascertainment (in this case, an undercalling of the singleton class). As can be seen, due to the changing shape of the frequency spectra, neglecting to account for this ascertainment can greatly affect inference, considerably modifying the fit of both the incorrect demographic and incorrect recurrent selective sweep models to the data.

Fig 2. Ascertainment errors may amplify mis-inference, if not corrected. As in Fig 1, the scenarios are given in the first column, here population growth with background selection and recurrent selective sweeps (“Growth + BGS + Pos”), as well as the same scenario in which the imperfections of the variant-calling processes are taken into account; in this case, one-third of singletons are not called (“Growth + BGS + Pos + Ascertainment”). The middle columns present the resulting SFS and LD distributions, and the final columns provide the joint posterior distributions when the data are fit to 2 incorrect models: a demographic model that assumes strict neutrality and a recurrent selective sweep model that assumes a constant population size. All exonic (i.e., directly selected) sites were masked prior to analysis. Red crosses indicate the true values. As shown, unaccounted-for ascertainment errors may contribute to mis-inference. The scripts underlying this figure may be found at https://github.com/paruljohri/Perspective_Statistical_Inference/tree/main/SimulationsTestSet/Figure2. LD, linkage disequilibrium; SFS, site frequency spectrum. https://doi.org/10.1371/journal.pbio.3001669.g002

Hence, if sequencing coverage is such that rare mutations are being excluded from analysis, due to an inability to accurately differentiate genuine variants from sequencing errors, the model used for subsequent testing should also ignore these variants. Similarly, if multiple regions are masked in the empirical analysis due to problems such as alignment difficulties, the expected patterns of LD that are observable under any given model may be affected. Furthermore, while the added temporal dimension of time series data has recently been shown to be helpful for various aspects of population genetic inference [73–76], such data in no way sidestep the need for an appropriate baseline model, but simply require the development of a baseline that matches the temporal sampling. In sum, as these factors can greatly affect the power of planned analyses and may introduce biases, the precise details of the dataset (e.g., region length, extent and location of masked regions, the number of callable sites, and ascertainment) and study design (e.g., sample size and single time point versus time series data) should be directly matched in the baseline model construction.
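
As a minimal sketch of this matching principle, the Python fragment below applies the same singleton undercalling described for Fig 2 (one-third of singletons missed) to a model expectation before any summary statistics are computed. The standard neutral, constant-size expectation E[xi_i] = theta / i is used here only as an illustrative stand-in for whatever baseline is actually being simulated; the function names and parameter values are hypothetical and are not taken from the scripts underlying the figures.

```python
def neutral_sfs(theta, n):
    """Expected unfolded SFS for n sampled chromosomes under the standard
    neutral, constant-size model (an assumption made for illustration)."""
    return [theta / i for i in range(1, n)]


def undercall_singletons(sfs, missed_fraction):
    """Apply the empirical filter to the expectation: drop a fraction of singletons."""
    filtered = list(sfs)
    filtered[0] *= 1.0 - missed_fraction
    return filtered


def watterson_theta(sfs):
    """Watterson's estimator: segregating sites divided by sum_{i=1}^{n-1} 1/i."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def pairwise_diversity(sfs):
    """Mean number of pairwise differences computed from the unfolded SFS."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    return sum(i * (n - i) * count for i, count in enumerate(sfs, start=1)) / pairs


# Hypothetical values: theta = 10 for the region, 20 sampled chromosomes.
expected = neutral_sfs(theta=10.0, n=20)
matched = undercall_singletons(expected, missed_fraction=1.0 / 3.0)
print(watterson_theta(expected), pairwise_diversity(expected))
print(watterson_theta(matched), pairwise_diversity(matched))
```

Under this toy expectation the two estimators agree before filtering, whereas after filtering Watterson's estimator falls further than pairwise diversity. An unmatched comparison could therefore mistake a purely technical deficit of rare variants for a demographic or selective signal.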

Once these concerns have been satisfied, the first biological addition will logically be the mutation rate and mutational spectrum. For a handful of commonly studied species, both the mean of, and genomic heterogeneity in, mutation rates have been quantified via mutation accumulation lines and/or pedigree studies [77]. However, even for these species, ascertainment issues remain a complicating factor [78], variation among individuals may be substantial [79], and estimates only represent a temporal snapshot of rates and patterns that are probably changing over evolutionary timescales and may be affected by the environment [31,80]. In organisms lacking experimental information, often the best available estimates come either from a distantly related species or from molecular clock-based approaches. Apart from stressing the importance of implementing either of the experimental approaches in order to further refine mutation rate estimates for such a species of interest, it is noteworthy that this uncertainty can also be modeled. Namely, if proper estimation has been performed in a closely related species, one may quantify the expected effect on observed levels of variation and divergence of higher and lower rates. The variation in possible data observations induced by this uncertainty is thus now part of the underlying model.
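
One simple way to see how mutation rate uncertainty enters the model is to carry a range of plausible rates through to an expected observation. The sketch below does so for the expected number of segregating sites, assuming a diploid, neutral, constant-size population, for which E[S] = 4 * Ne * mu * L * a_n. That simplification is adopted purely for illustration; in practice the same range would be passed to simulations of the full baseline model. All numerical values and function names are hypothetical.

```python
def watterson_a(n):
    """Sum of 1/i for i = 1..n-1, where n is the number of sampled chromosomes."""
    return sum(1.0 / i for i in range(1, n))


def expected_segregating_sites(mu, ne, length, n):
    """E[S] under a diploid, neutral, constant-size model (illustrative assumption).

    mu: per-site, per-generation mutation rate
    ne: effective population size
    length: number of callable sites in the region
    n: number of sampled chromosomes
    """
    theta_region = 4.0 * ne * mu * length
    return theta_region * watterson_a(n)


def expectations_over_rates(mu_values, ne, length, n):
    """Map each candidate mutation rate to its expected number of segregating sites."""
    return {mu: expected_segregating_sites(mu, ne, length, n) for mu in mu_values}


# Hypothetical lower, central, and upper rates borrowed from a related species.
candidate_rates = [0.5e-8, 1.0e-8, 2.0e-8]
for mu, expected in expectations_over_rates(candidate_rates, ne=10_000,
                                            length=10_000, n=20).items():
    print(mu, expected)
```

Because the expectation is linear in the mutation rate under these assumptions, a twofold uncertainty in the rate translates into a twofold uncertainty in expected variation, which any downstream inference of population size would inherit.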

The same logic follows for the next parameter addition(s): crossing over/gene conversion, as applicable for the species in question. For example, for a subset of species, per-generation crossover rates in cM per Mb have been estimated by comparing genetic maps based on crosses or pedigrees with physical maps [81–83]. In addition, recombination rates scaled by the effective population size have also been estimated from patterns of LD (e.g., [84,85])—although this approach typically requires assumptions about evolutionary processes that may be violated (e.g., [42]). As with mutation, the effects on downstream inference arising from the variety of possible recombination rates—whether estimated for the species of interest or a closely related species—can be modeled.
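
The two kinds of estimate mentioned above are expressed on different scales, so a small conversion is usually needed before they can be compared or passed to a simulator. The sketch below converts a map-based rate in cM per Mb to a per-base-pair, per-generation rate, and then to the population-scaled rate 4 * Ne * r for a diploid population. It assumes intervals short enough that map distance is approximately linear in the recombination fraction, and all values and function names are hypothetical.

```python
def cm_per_mb_to_per_bp(rate_cm_per_mb):
    """Convert cM/Mb to a per-bp, per-generation crossover rate.

    1 cM corresponds to an expected 0.01 crossovers per generation, and
    1 Mb is 1e6 bp, so 1 cM/Mb is approximately 1e-8 per bp.
    """
    return rate_cm_per_mb * 0.01 / 1e6


def scaled_recombination_rate(r_per_bp, ne):
    """Population-scaled rate rho = 4 * Ne * r per bp (diploid assumption)."""
    return 4.0 * ne * r_per_bp


# Hypothetical range of map-based estimates and an assumed effective size.
for rate in [0.5, 1.0, 3.0]:
    r = cm_per_mb_to_per_bp(rate)
    print(rate, r, scaled_recombination_rate(r, ne=10_000))
```

Evaluating downstream statistics across such a range, rather than at a single point estimate, is one way of modeling the uncertainty in recombination rates.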

The next additions to the baseline model construction are generally associated with the greatest uncertainty—the demographic history of the population, and the effects of direct and linked purifying selection. This is a difficult task given the virtually infinite number of potential demographic hypotheses (e.g., [86]); furthermore, the interaction of selection with demography is inherently nontrivial and difficult to treat (e.g., [55,87,88]). This realization continues to motivate attempts to jointly estimate the parameters of population history together with the DFE of neutral, nearly neutral, weakly deleterious, and strongly deleterious mutations—a distribution that is often estimated in both continuous and discrete forms [89]. One of the first important advances in this area used putatively neutral synonymous sites to estimate changes in population size based on patterns in the SFS and conditioned on that demography to fit a DFE to nonsynonymous sites, which presumably experience considerable purifying selection [90–92]. This stepwise approach may become problematic, however, for organisms in which synonymous sites are not themselves neutral [93–95] or when the SFS of synonymous sites is affected by background selection, which is probably the case generally given their close linkage to directly selected nonsynonymous sites ([41] and see [96,97]).
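
To make the discrete form of a DFE concrete, the sketch below represents it as a set of proportions over classes of the scaled selection coefficient 2 * Ne * s and draws a coefficient for a new mutation. The class boundaries, the uniform distribution within each class, and all names are assumptions chosen for illustration; they are not taken from the cited studies.

```python
import random

# Hypothetical classes on the scaled deleterious effect 2*Ne*s: (label, lower, upper).
DFE_CLASSES = [
    ("effectively neutral", 0.0, 1.0),
    ("weakly deleterious", 1.0, 10.0),
    ("moderately deleterious", 10.0, 100.0),
    ("strongly deleterious", 100.0, 10000.0),
]


def validate_proportions(proportions):
    """Check that there is one non-negative proportion per class and that they sum to 1."""
    if len(proportions) != len(DFE_CLASSES):
        raise ValueError("expected one proportion per DFE class")
    if any(p < 0.0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")


def sample_scaled_selection(proportions, rng):
    """Draw a class according to its proportion, then a value uniformly within it."""
    validate_proportions(proportions)
    index = rng.choices(range(len(DFE_CLASSES)), weights=proportions, k=1)[0]
    label, lower, upper = DFE_CLASSES[index]
    return label, rng.uniform(lower, upper)


# Hypothetical DFE in which half of new mutations are effectively neutral.
rng = random.Random(42)
for _ in range(3):
    print(sample_scaled_selection([0.5, 0.2, 0.2, 0.1], rng))
```

In a joint estimation setting, the class proportions would be free parameters inferred alongside the demographic parameters, rather than fixed in advance as they are here.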

In an attempt to address some of these concerns, Johri and colleagues [44] recently developed an ABC approach that relaxes the assumption of synonymous site neutrality and corrects for background selection effects by simultaneously estimating parameters of the DFE alongside population history. The posterior distributions of the parameters estimated by this approach in any given data application (i.e., characterizing the uncertainty of inference) represent a logical treatment of population size change and purifying/background selection for the purposes of inclusion within this evolutionarily relevant baseline model. That said, the demographic model in this implementation is highly simplified, and extensions are needed to account for more complex population histories. In particular, estimation biases that may be expected owing to the neglect of cryptic population structure and migration, and indeed the feasibility of co-estimating population size change and the DFE together with population structure and migration within this framework, all remain in need of further investigation. While such simulation-based inference (see [98]), including ABC, provides one promising platform for joint estimation of demographic history and selection, progress on this front has been made using alternative frameworks as well [99,100], and developing analytical expectations under these complex models should remain as the ultimate, if distant, goal. Alternatively, in functionally sparse genomes with sufficiently high rates of recombination, such that assumptions of strict neutrality are viable for some genomic regions, multiple well-performing approaches have been developed for estimating the parameters of much more complex demographic models (e.g., [101–104]). 
In organisms for which such approaches are applicable (e.g., certain large, coding-sequence-sparse vertebrate and land plant genomes), this intergenic demographic estimation assuming strict neutrality may helpfully be compared to estimates derived from data in or near coding regions that account for the effects of direct and linked purifying selection [41,44,105]. For newly studied species lacking functional annotation and information about coding density, following the joint estimation procedure would remain the more satisfactory strategy in order to account for possible background selection effects.
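
For readers unfamiliar with the general logic of ABC, the following is a deliberately minimal rejection sampler. It is not the approach of [44], which jointly estimates demographic and DFE parameters from many summary statistics; instead it infers a single parameter (theta, for a non-recombining neutral locus of constant population size) from a single summary statistic (the number of segregating sites). The simulator, the uniform prior, the tolerance, and all names are assumptions made for illustration.

```python
import random


def simulate_segregating_sites(theta, n, rng):
    """Simulate S for n chromosomes under the standard neutral coalescent.

    Time is in units of 2N generations: while k lineages remain, the waiting
    time is exponential with rate k(k-1)/2, and mutations fall on the tree
    as a Poisson process with rate theta/2 per unit of branch length.
    """
    total_length = 0.0
    for k in range(2, n + 1):
        total_length += k * rng.expovariate(k * (k - 1) / 2.0)
    expected_mutations = theta * total_length / 2.0
    # Draw a Poisson count by counting unit-rate arrivals before expected_mutations.
    count = 0
    waiting = rng.expovariate(1.0)
    while waiting < expected_mutations:
        count += 1
        waiting += rng.expovariate(1.0)
    return count


def abc_rejection(observed, n, prior_low, prior_high, tolerance, draws, rng):
    """Keep prior draws of theta whose simulated S lies within tolerance of the observed S."""
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(prior_low, prior_high)
        simulated = simulate_segregating_sites(theta, n, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical observation: 20 segregating sites among 10 sampled chromosomes.
rng = random.Random(1)
posterior = abc_rejection(observed=20, n=10, prior_low=0.0, prior_high=30.0,
                          tolerance=2, draws=5000, rng=rng)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior distribution of theta, and their spread reflects the uncertainty of inference. In a realistic application the simulator would be replaced by one implementing the full baseline model, with the same masking and ascertainment as the empirical data.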

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001669
