(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
Interpreting population- and family-based genome-wide association studies in the presence of confounding [1]
['Carl Veller', 'Department Of Ecology', 'Evolution', 'University Of Chicago', 'Chicago', 'Illinois', 'United States Of America', 'Graham M. Coop', 'Department Of Evolution', 'Ecology']
Date: 2024-04
Abstract A central aim of genome-wide association studies (GWASs) is to estimate direct genetic effects: the causal effects on an individual’s phenotype of the alleles that they carry. However, estimates of direct effects can be subject to genetic and environmental confounding and can also absorb the “indirect” genetic effects of relatives’ genotypes. Recently, an important development in controlling for these confounds has been the use of within-family GWASs, which, because of the randomness of mendelian segregation within pedigrees, are often interpreted as producing unbiased estimates of direct effects. Here, we present a general theoretical analysis of the influence of confounding in standard population-based and within-family GWASs. We show that, contrary to common interpretation, family-based estimates of direct effects can be biased by genetic confounding. In humans, such biases will often be small per-locus, but can be compounded when effect-size estimates are used in polygenic scores (PGSs). We illustrate the influence of genetic confounding on population- and family-based estimates of direct effects using models of assortative mating, population stratification, and stabilizing selection on GWAS traits. We further show how family-based estimates of indirect genetic effects, based on comparisons of parentally transmitted and untransmitted alleles, can suffer substantial genetic confounding. We conclude that, while family-based studies have placed GWAS estimation on a more rigorous footing, they carry subtle issues of interpretation that arise from confounding.
Citation: Veller C, Coop GM (2024) Interpreting population- and family-based genome-wide association studies in the presence of confounding. PLoS Biol 22(4): e3002511.
https://doi.org/10.1371/journal.pbio.3002511 Academic Editor: Priya Moorjani, University of California Berkeley, UNITED STATES Received: March 22, 2023; Accepted: January 19, 2024; Published: April 11, 2024 Copyright: © 2024 Veller, Coop. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. Simulation code is available at
https://doi.org/10.5281/zenodo.10520811 and
https://github.com/cveller/confoundedGWAS. Funding: Funding was provided by the National Institutes of Health (NIH R35 GM136290 awarded to GC) and a Branco Weiss fellowship to CV. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.
1 Introduction Genome-wide association studies (GWASs) have identified thousands of genetic variants that are associated with a wide variety of traits in humans. In the standard “population-based” approach, the GWAS is conducted on a set of “unrelated” individuals. The associations that are detected can arise when a variant causally affects the trait or when it is in tight physical linkage with causal variants nearby. Central to the aims of GWASs is the estimation of variants’ effect sizes on traits of interest. These effect-size estimates are important for identifying and prioritizing variants and implicated genes for functional followup, and may be used to form statistical predictors of trait values or to understand the causal or mechanistic role of genetic variation in traits. Understanding sources of error and bias in GWAS effect-size estimates is therefore crucial. The interpretation of GWAS effect-size estimates is complicated by 4 broad factors [1,2]. First, the causal pathways from an allele to phenotypic variation need not reside in the individuals who enrolled in the GWAS, but can also reflect causal effects on the individual’s environment of the genotypes of their siblings, parents, other ancestors, and neighbors (indirect genetic effects or dynastic effects) [3]. Second, a phenotypic association can result from correlations between the allele and environmental causes of trait variation (environmental confounding) [4]. Third, a phenotypic association can be generated at a locus if it is genetically correlated with causal loci outside of its immediate genomic region (genetic confounding) [1]. Fourth, an allele’s effect on a trait might depend on the environment and the allele’s genetic background (gene–environment and gene–gene interactions, or G×E and G×G) [5–7]. Since our primary interest here will be genetic confounding, we briefly describe some potential sources of the long-range allelic associations that drive it: population structure, assortative mating, and selection on the GWAS trait. Population structure leads to genetic correlations across the genome when allele frequencies differ across populations or geographic regions: sampled individuals from particular populations are likely to carry, across their genomes, alleles that are common in those populations, which induces correlations among these alleles, potentially across large genomic distances. Such genetic correlations persist even after the populations mix, as alleles that were more common in a particular source population retain their association until uncoupled by recombination. Assortative mating brings alleles with the same directional effect on a trait (or on multiple traits, in the case of cross-trait assortative mating) together in mates, and therefore, bundles these alleles in offspring and subsequent generations. This bundling manifests as positive genetic correlations among alleles with the same directional effect [8,9], which can confound effect-size estimates in a GWAS on the trait. Finally, natural selection on a GWAS trait can result in genetic correlations by favoring certain combinations of trait-increasing and trait-decreasing alleles. A form of selection that is expected to be common for many traits of interest is stabilizing selection, which penalizes deviations from an optimal trait value. By favoring compensating combinations of trait-increasing and trait-decreasing alleles, stabilizing selection generates negative correlations among alleles with the same directional effect [10,11], and therefore can confound effect-size estimates in a GWAS performed on the trait under selection or on a genetically correlated trait. The potential for dynastic, environmental, and genetic confounds to bias GWAS effect-size estimates has long been recognized [4,12], and so a major focus of the literature has been to develop methods to control for these confounds [13,14]. Standard approaches include using estimates of genetic relatedness as covariates in GWAS regressions [15,16] or downstream analyses such as LD-Score regression [17–19]. Such methods aim to control for both environmental and genetic confounding, but do so imperfectly (e.g., [20,21]). Further, it is often unclear what features of genetic stratification are being addressed [1,2]: assortative mating in particular may not be well accounted for by these methods [22]. Additionally, dynastic confounding will be particularly difficult to control for using population-level covariates. One promising way forward is to estimate allelic effects within families, either by comparing the separate associations of parentally transmitted and untransmitted alleles with trait values in the offspring [23–27], or by associating differences in siblings’ trait values with differences in the alleles they inherited from their parents [28–30]. The idea is that, by controlling for parental genotypes, within-family association studies control for both environmental stratification and indirect/dynastic effects, while mendelian segregation randomizes alleles across genetic backgrounds. In principle, this allows the “direct genetic effect” of an allele—the causal effect of an allele carried by an individual on their trait value—to be estimated. Recognizing that a variant detected in a GWAS will usually not itself be causal for the trait variation but instead will only be correlated with true causal variants, the direct effect of a genotyped variant is usually interpreted as reflecting the direct causal effects of nearby loci that are genetically correlated with the focal locus [2]—but not the effects of more distant loci that might also be correlated with the focal genotyped locus (e.g., because of population structure or assortative mating). Consistent with both the presence of substantial confounds in some population-based GWASs and the mitigation of these confounds in within-family GWASs, family-based estimates of direct effect sizes and aggregate quantities based on these estimates (e.g., SNP-based heritabilities) are substantially smaller than population GWAS estimates for a number of traits, most notably social and behavioral traits [30–34]. Likewise, estimates of genetic correlations between traits are sometimes substantially reduced when calculated using direct effect estimates from within-family GWASs (e.g., [33]). While some of these findings could reflect the contribution of indirect genetic effects to population GWASs, it is also likely that, at least for some traits, standard controls for population stratification in population GWASs have been insufficient [20,21,34–37]. Our aim in this paper is to study a general model of confounding in GWASs, to generate clear intuition for its influence on estimates of effect sizes in both population- and family-based designs. A number of the issues that we analyze have previously been raised, particularly in the context of population-based GWASs (e.g., [1,2,38,39]); here, we analyze them in a common framework that allows for comparison of multiple sources of confounding in both population and family-based GWASs. There is a large literature on GWASs in nonhuman organisms (e.g., [40–43]). However, although the results and intuition that we derive here apply equally well to human and nonhuman GWASs, we shall interpret them primarily from the perspective of human GWASs, in which the inability to experimentally randomize environments, together with the small effects that investigators hope to detect, makes confounding a particular concern. To focus on confounding, we assume no G×E and G×G interactions (the effects of G×E and G×G on estimates produced by population- and family-based designs are studied in ref. [44]). We derive expressions for estimators of direct effects in both population and within-family GWASs, as functions of the true direct and indirect effects at a locus and the genetic confounds induced by other loci. In doing so, we find that family-based estimates of direct effects are in fact susceptible to genetic confounding, contrary to standard interpretation. Reassuringly, in many of the models we consider, the resulting biases are likely to be small in humans. We also address a related case: family-based GWAS designs that consider transmitted and untransmitted parental alleles and in which the indirect (or “dynastic”) effect of an allele is estimated from its association with the offspring’s phenotype when carried by the parent but not transmitted to the offspring. We show that this estimator of indirect effects can be substantially biased by genetic and environmental confounds, in a similar way to population estimates of direct effects. Next, we consider various sources of genetic confounding—assortative mating, population structure, and stabilizing selection on GWAS traits—and how they influence estimates of direct effects in both population and within-family GWASs. We then turn to sibling indirect effects, which are known to bias estimates of direct effects in sibling-based GWASs [2,34]. We characterize this bias in a simple model and contrast it to the bias caused by sibling indirect effects in a population GWAS.
4 Discussion It has long been recognized that population GWASs in humans can be biased by environmental and genetic confounding [1,4]. Currently, population GWASs attempt to control for these confounds by focusing on sets of individuals that are genetically more similar and by controlling for population stratification. However, these controls are imperfect and are not always well defined. For example, controlling for genome-wide patterns of population stratification based on common alleles does not control for the genetic and environmental confounding of rare variants [79]. Work on genetic confounding has uncovered increasing evidence that assortative mating may be leading to large biases in estimates of direct genetic effects and to large genetic correlations for a number of traits [22,37,57]; moreover, it can often be unclear whether genetic signals of assortative mating are due to trait-based mate choice or some other more general form of genetic confounding (e.g., [80]). Additionally, while we have focused primarily on genetic confounding, for a number of traits there are also signals of residual environmental confounding in GWAS signals [31,32,35,81]. Thus, subtle and often interwoven forms of genetic and environmental confounding remain a major issue in many GWASs [34], compromising the interpretation of GWAS effect-size estimates and downstream quantities such as SNP heritabilities and genetic correlations. Effect-size estimates from within-family GWASs are less affected by these various confounds. They are not subject to environmental confounding across families, because the environments of family members are effectively randomized with respect to within-family genetic transmission. As we have shown, family-based estimates should also suffer substantially less from genetic confounding, because genetic transmission at unlinked loci (but not linked loci) is randomized by independent assortment of chromosomes in meiosis. Nonetheless, family-based GWASs can suffer from residual genetic confounding as well as sibling indirect effects (in sibling-based designs), and causal interpretations of the estimates they produce are complicated by G×E and G×G interactions [44]; they also raise a number of further conceptual problems that we discuss below. Sources of genetic confounding Genetic confounding is caused by long-range LD between loci that affect the trait or traits under study. To illustrate the potential for genetic confounds to bias GWAS effect-size estimates, we have considered several sources of long-range LD. Some of these—assortative mating, selection on GWAS traits, and phenotype-biased migration—can cause systematic directional biases in GWAS effect-size estimates. Others, such as neutral population structure, cause random biases that influence the variance of effect-size estimates and related quantities. Assortative mating and neutral population structure have received considerable theoretical attention in the GWAS literature (e.g., [22,37,38,57]). Here, we have further outlined how both selection and phenotyped-biased migration can drive systematic genetic confounding that may not be well accounted for by current methods of controlling for stratification (see also ref. [81]). We wish to emphasize stabilizing selection in particular as a potential source of systematic confounding in GWASs. Stabilizing selection has been well studied in the quantitative genetics literature but less so in the context of GWASs, despite its expected ubiquity. By selecting for compensating combinations of trait-increasing and trait-decreasing alleles, stabilizing selection generates negative LD between alleles with the same directional effect on the trait [10,11], and can therefore bias GWAS effect-size estimates downwards. While the potential for stabilizing selection to confound effect-size estimation has been noted (e.g., [66,75,82]), the resulting biases have not, to our knowledge, been quantified. Our calculations suggest that these downward biases could, for some human traits, be as large as 5% to 10% systematically across all causal loci in population GWASs. While biases of this magnitude are unlikely to compromise some goals of GWASs, such as gene discovery, they could be quantitatively problematic for other GWAS aims, such as estimation of SNP heritabilities and the strength of assortative mating. Moreover, while our results pertain to (a particular model of) stabilizing selection, many kinds of selection generate LD between genetically distant loci—in fact, only multiplicative selection among loci does not ([83], pgs. 50 and 177). Therefore, the general result that selection can generate genetic confounding will hold more broadly. Even in the absence of natural and sexual selection, the ascertainment of samples for a population- or family-based GWAS is a form of selection. When participation in a GWAS sample is based partly on a genetically influenced phenotype [84], this sample-selection bias will generate LD between loci underlying participation. In family studies, this participation bias can violate the assumption of random assignment of genotypes, and thus potentially undermine the interpretation of these studies’ results. For a given genotyped locus in a GWAS, there is no bright line between local “tagged” LD and long-range confounding LD, and one reasonable objection to the approach taken here is that that we have used an arbitrary definition of the causal loci that are locally tagged by a genotyped locus (L local in Eq (2)). All of the sources of genetic confounding that we have considered generate LD among causal loci both within and across chromosomes. Under these models, the within-chromosome LD that is generated is, in a sense, a continuation of the LD generated across chromsomes (moving from a recombination rate = 0.5 to ≤ 0.5). Thus, while investigators may prefer some looser definition of “local” when thinking about genotyped GWAS loci as tag SNPs, to extend that definition to include all loci on the same chromosome as the SNP would, by reasonable interpretation, be to include confounding into the desired estimator. The extent to which the absorption of genetic confounding in estimated effect sizes is a problem depends on the application. In the case of polygenic prediction, the absorption of environmental effects, indirect effects, and the effects of untyped loci throughout the genome can help to improve prediction accuracy, although this does come at a cost to interpretability (and potentially also to portability across contexts). For GWAS applications focused on understanding genetic causes and mechanisms, the biases in effect-size estimates and spurious signals of pleiotropy among traits generated by genetic confounding will be more problematic. Indirect genetic effects Family GWASs are often interpreted as providing the opportunity to ask to what extent parental genotypes (or other family genotypes) causally affect a child’s phenotype (“genetic nurture” [27]). Viewed in this way, the association between untransmitted parental alleles and the child’s phenotype would seem, at first, a natural estimate of indirect genetic effects. In practice, however, if the population GWAS suffers from genetic and environmental confounds, then the estimated effects of untransmitted alleles will absorb that confounding in much the same way that estimates of direct genetic effects from a population GWAS do (Eqs (8) and (9)) [50]. For example, in the case of assortative mating, a given untransmitted allele is correlated with alleles that were transmitted both by this parent and by their mate, and these transmitted alleles can directly affect the offspring’s phenotype. Thus, while family-based estimates of direct genetic effects benefit from the randomization of meiosis and from controlling for the environment, family-based estimates of indirect genetic effects lack both of these features and should be interpreted with caution. Indeed, recent work using parental siblings to control for grandparental genotypes has shown that little of the estimated “indirect genetic effect” may be causally situated in parents [36]. With empirical estimates of indirect genetic effects potentially absorbing a broad set of confounds [34,85], and few current studies of indirect effects having designs that allow such confounding to be disentangled, it is premature—and potentially invalid—to interpret associations of untransmitted alleles causally in terms of indirect genetic effects [3]. Rather, they should be treated agnostically in terms of “non-direct” effects. Direct genetic effects Mendelian segregation provides a natural randomization experiment within families [86], and so crosses in experimental organisms and family designs have long been an indispensable tool to geneticists in exploring genetic effects and causation. Growing concerns about GWAS confounding and the increasing availability of genotyped family members have led to a return of family-based studies to the association study toolkit [2]. Family-based estimates of direct genetic effects are often interpreted as being unbiased and discussed in terms of the counterfactual effect of experimentally substituting one allele for another [34,87,88]. As we have shown, family-based GWASs are indeed less subject to confounding than population-based GWASs: in the presence of genetic and environmental confounding, the family-based estimate of the effect size at a given locus provides a much closer approximation to the true effects of tightly linked causal loci than a population-based estimate does. The family-based estimate is not biased by environmental variation across families and avoids the correlated effects of the many causal loci that lie on other chromosomes. Still, the family-based estimate does absorb the effects of non-local causal loci on the same chromosome, and so cannot truly be said to be free of genetic confounding. Rather than considering a single allele being substituted between individuals, a better experimental analogy for the effect-size estimate would be to say that we are measuring the mean effect of transmission of a large chunk of chromosome surrounding the focal locus, potentially carrying many causal loci. In addition, while within-family GWASs offer these advantages, in other ways, they move us further away from the questions about the sources and causes of variation among unrelated individuals that motivate population GWASs in the first place. Indeed, the presence of confounding and of G×E/G×G interactions introduces a number of conceptual issues in moving from within-family GWAS to the interpretation of differences among individuals from different families [44,89,90]. For example, in the presence of genetic confounding, the estimated effect of a causal allele of interest will depend on a set of weights: its LD to many other causal alleles. Family-based approaches weight these LD terms differently to population-based approaches, which, we argue, can complicate the interpretation of these estimates. For example, when previously isolated populations admix, same-ancestry alleles will be held together in long genomic blocks until these are broken up by recombination, which will happen very quickly for alleles on different chromosomes but more slowly for alleles on the same chromosome. A few generations after admixture, therefore, cross-chromosome ancestry LD will largely have dissipated, but contiguous ancestry tracts will still span substantial portions of chromosome lengths. Since both population and within-family GWASs are similarly confounded by the same-chromosome LD, their mean squared effect sizes will be similar in this case (Fig 5). Bearing in mind that the LD resulting from admixture is not present in the source populations, it becomes unclear which weighting of ancestry LD is appropriate if we want to interpret the resulting effect-size estimates as direct effects. As this example illustrates, while family-based GWASs are a useful device for dealing with confounding, it is not always obvious how to interpret the quantities that they measure. A number of additional complications arise when, to compensate for the small effect sizes of individual loci, researchers combine many SNPs into a PGS and study the effects of PGSs within families (or use them as instruments in mendelian randomization analyses). For one, SNPs are usually chosen for inclusion in the PGS on the basis of their statistical significance in a population GWAS. This approach prioritizes SNPs whose effect-size estimates are amplified (or even wholly generated) by confounding (for an example of how this leads to residual environmental confounding in applications of sibling-based effect-size estimates, see ref. [79]). Second, the weights given to SNPs that are included in the PGS absorb the effects of confounding, and this confounding is heterogeneous across SNPs. Thus, when we study the correlates of trait-A PGS differences between siblings in the presence of GWAS confounding, we are not observing the average phenotypic outcomes of varying the genetic component of trait A between siblings. Rather, we are varying a potentially strangely weighted set of genetic correlates of trait A. An observation that a population-GWAS PGS is predictive of phenotypic differences among siblings demonstrates that the PGS SNPs tag nearby causal loci, but beyond that, interpretation is difficult. Notably, if there is cross-trait assortative mating for traits A and B, but no pleiotropic link between the traits, then some of the SNPs identified as significant in a GWAS on trait A may be tightly linked to loci that causally affect trait B but not trait A. If these loci are included in the trait-A PGS, then when we study the effect of variation in the trait-A PGS on sibling differences, we are accidentally absorbing some components of the variation in trait B across siblings. Thus, we might observe a correlation between the trait-A PGS and differences in trait B between siblings, and this correlation may be lower than is observed at the population level, without there existing any pleiotropic (or causal) link between A and B. These effects can be exacerbated if the 2 traits have different genetic architectures (Fig 4). Instead of using a set of SNPs and weights from a population GWAS, genetic correlations between traits due to pleiotropy could be estimated from the correlation of effect sizes estimated within families [33]. Given current sample size constraints in family-based studies, the confidence intervals on these estimates are large. Moreover, significant family-based correlations need not reflect pure pleiotropy, since, as we have shown, they are not completely free of genetic confounding due to intra-chromosomal LD. Also complicating the interpretation of family-based effect-size estimates are various types of interactions. Indirect effects between siblings can bias family estimates of direct genetic effects (Eq (20); [2,34,52]) in ways that are conceptually different from the biases they introduce to population-based estimates (Eq (22)). These sibling effects can potentially be addressed with fuller family information (e.g., parental genotypes in addition to sibling genotypes [27,34]). In summary, family-based studies are a clear step forward towards quantifying genetic effects, with large-scale family studies carrying the potential to resolve long-standing issues in human genetics. However, these designs come with their own sets of caveats, which will be important to understand and acknowledge as family-based genetic studies become a key tool in the exploration of causal effects across disparate fields of study.
5 Methods All simulations were carried out in SLiM 4.0 [53]. Code is available at
https://doi.org/10.5281/zenodo.10520811. For the purpose of carrying out sibling association studies in our simulations, we assumed a simple, monogamous mating structure: each generation, each female, and each male is involved in a single mating pair, and each mating pair produces exactly 2 offspring (who are therefore full siblings). To maintain the precisely even sex ratio required by this scheme, we assumed that a quarter of mating pairs produce 2 daughters, a quarter produce 2 sons, and half produce a son and a daughter. Population sizes were chosen to ensure that these numbers of mating pairs were whole numbers, and mating pairs were permuted randomly each generation before assigning brood sex ratios (to ensure that no artifact was introduced by SLiM’s indexing of individuals). Each generation, per-locus effect size estimates were calculated for both population-wide and sibling GWASs. The former were calculated as the regression of trait values on per-locus genotypes, while the latter were calculated as the regression of sibling differences in trait values on sibling differences in per-locus genotypes. In all simulations, the total population size was N = 40,000. Assortative mating For our general cross-trait assortative mating setup, traits 1 and 2 are influenced by variation at sets of bi-allelic loci L 1 and L 2 , respectively. The effect sizes of the reference allele at locus l on traits 1 and 2 are α l and β l , respectively. An individual’s PGS is then for trait 1 and for trait 2. In all the scenarios we simulated, traits had heritability 1, so that individuals’ trait values are the same as their PGSs. Our aim is to simulate a scenario where assortative mating is based on females’ values for trait 1 and males’ values for trait 2, such that, across mating pairs, the correlation of the mother’s PGS for trait 1, , and the father’s PGS for trait 2, , is a constant value ρ (in all of our simulations, ρ = 0.2). To achieve this, we use an algorithm suggested by Zaitlen and colleagues [69]: At the outset, we choose an accuracy tolerance ε such that, if by some assignment of mates the correlation of their PGSs falls within ε of the target value ρ, we accept that assigment. Each generation in which assortative mating occurs, we rank females in order of their PGSs for trait 1, and males in order of their PGSs for trait 2. We then calculate the PGS correlation across mating pairs, ρ 0 , if females and males were matched according to this ranking. If this (maximal) correlation is smaller than the upper bound of our target window (ρ 0 <ρ+ε, which very seldom occurred in our simulations), then females and males mate precisely according to their PGS rankings and we move on to the next generation. If, instead, ρ 0 exceeds ρ+ε, then we follow the following iterative procedure until we have found a mating structure under which the correlation of PGSs falls within ε of the target value ρ. First, we choose initial “perturbation sizes” ξ 0 and ξ 1 = 2ξ 0 . Suppose that, in iteration k of the procedure, the perturbation size is ξ k and the chosen mating structure leads to a correlation among mates of ρ k . If |ρ k −ρ|<ε, we accept the mating structure and move on to the next generation. Otherwise, we choose a new perturbation size ξ k+1 : (i) if ρ k−1 , ρ k >ρ, then ξ k+1 = 2ξ k ; (ii) if ρ k−1 >ρ>ρ k or ρ k−1 <ρ<ρ k , then ξ k+1 = (ξ k−1 +ξ k )/2; (iii) if ρ k−1 , ρ k <ρ, then ξ k+1 = ξ k /2. Once we have chosen ξ k+1 , for each individual we perturb their PGS (trait 1 for females; trait 2 for males) by a value chosen from a normal distribution with mean 0 and standard deviation ξ k+1 , independently across individuals. We then rank females and males according to their perturbed PGSs, and calculate the correlation ρ k+1 of their true PGSs if they mate according to this ranking. (Since, in our experience, there can be substantial variance in the ρ k+1 values that result from this procedure, we repeat it 5 times and choose the mating structure that produces the value of ρ k+1 closest to the target value ρ.) We then decide if another iteration—i.e., another perturbation size ξ k+2 —is required. Fig 2. Cross-trait assortative mating for traits with the same genetic architecture In the simulations displayed in Fig 2, ρ = 0.2, and traits 1 and 2 have identical but non-overlapping genetic architectures: L 1 and L 2 are distinct sets of 500 loci each, with α l = 1 and β l = 0 for l∈L 1 , and α l = 0 and β l = 1 for l∈L 2 . Loci in L 1 and L 2 alternate in an even spacing along the physical (bp) genome. Fig 2A shows results for the “single chromosome” case where the recombination fraction between adjacent loci is c = 1/999 in both sexes (such that the single-chromosome genome receives, on average, one crossover per transmission). Fig 2B shows results for the case where recombination fractions between loci are calculated from the human female and male linkage maps generated by Kong and colleagues [54]. In both cases, we assumed no crossover interference. At each locus, the initial frequency of the reference allele was 1/2, with reference alleles assigned randomly across diploid individuals and independently across loci such that, in expectation, Hardy–Weinberg and linkage equilibrium initially prevail. The assortative mating algorithm above was run for 19 generations, with a target correlation ρ = 0.2, a tolerance parameter ε = ρ/100, and an initial perturbation size . Thereafter, assortative mating was switched off, with mating pairs (still monogamous) being chosen randomly. Fig 3. Same-trait assortative mating The algorithm we followed to ensure assortative mating of a given strength was the same as that for Fig 2 above, but here traits 1 and 2 are identical, and 1,000 loci underlie variation in the trait, and are evenly spread along the physical genome. The effect size of the reference allele at each locus was drawn from a normal distribution with mean 0 and standard deviation 1, independently across loci. The initial frequency of the reference allele at each locus was drawn, independently across loci, from a uniform distribution on [MAF, 1−MAF]; in our simulations, we chose a minimum minor allele frequency of MAF = 0.1. Since here we are interested in quantifying the mean squared effect size estimate, which is directionally affected by drift-based local LD that may not be present in our initial configuration, we allowed 150 generations of random mating before switching on assortative mating (only the final 20 generations of this random mating burn-in are displayed in Fig 3). Assortative mating occurred for 20 generations, after which random mating occurred for a further 20 generations. Fig 4. Cross-trait assortative mating for traits with different architectures For Fig 4A, we again followed a similar procedure to that for Fig 2 above, but now, while traits 1 and 2 have distinct genetic bases, the numbers of loci contributing variation to traits 1 and 2 are |L 1 | = 100 and |L 2 | = 1,000. Trait-1 loci are placed evenly along the physical genome, with trait-2 loci then evenly spaced among the trait-1 loci; we used the human linkage map for these simulations. At both trait-1 and trait-2 loci, the initial frequency of the focal allele was drawn from a uniform distribution on [MAF, 1−MAF], with MAF = 0.1. At trait-2 loci, true effect sizes were randomly drawn from a normal distribution with mean zero and standard deviation 1; at trait-1 loci, true effect sizes were randomly drawn from a normal distribution with mean zero and standard deviation , so that traits 1 and 2 have equal genic variances. After a burn-in of 150 generation of random mating, assortative mating was switched on. We performed a population GWAS at the end of the period of random mating and after 20 generations of assortative mating. These GWASs were performed across 1,000 replicate trials, with the effect size estimates then pooled across trials. From these, we estimated the densities of the absolute values of effect size estimates using Matlab’s kernel density estimator ksdensity, specifying that the support of the distributions be positive. For Fig 4B, as a control, we also carried out the same simulations with |L 1 | = |L 2 | = 1,000, drawing effect sizes from a standard normal distribution. Fig 5. Population structure and admixture We wished first to simulate a situation where 2 populations of size N/2 have been separated for a length of time such that the value of F ST between them is some predefined level (in our case, a mean F ST per locus of 0.1). To do so without having to run the full population dynamics of 2 allopatric populations for a prohibitively large number of generations, we simply assigned allele frequencies to achieve the desired level of F ST . We assumed 1,000 loci spread evenly over the physical genome. At each locus l, we chose an “ancestral” frequency for the reference allele independently from a uniform distribution on [MAF, 1−MAF], with MAF = 0.2. We then perturbed this allele frequency in populations 1 and 2 by independent draws from a normal distribution with mean 0 and variance ; if a perturbed allele frequency fell below 0 or above 1, we set it to 0 or 1, respectively. The population dynamics described above, with monogamous mating pairs chosen randomly, were then run for 50 generations. In generation 50, the 2 populations merge, forming an admixed population of size N. The same population dynamics, with monogamous mating pairs chosen randomly, were then run for a further 50 generations. Fig 6. Stabilizing selection To calculate the bias in GWAS effect size estimation caused by stabilizing selection, we must first calculate the harmonic mean recombination rate. We focus on humans and consider only the autosomal genome. The set of loci underlying variation in the trait is L, which we apportion among the 22 autosomes according to their physical (bp) lengths (as reported in GRCh38.p11 of the human reference genome;
https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh38.p11). For each chromosome, we spread its allotment of loci evenly over its sex-averaged genetic (cM) length, using the male and female linkage maps produced by Kong and colleagues [54]. (We use genetic lengths instead of physical lengths because, were we to spread loci evenly over the physical lengths of the chromosomes, some pairs of adjacent loci on some chromosomes might have a sex-averaged recombination fraction of 0, in which case the harmonic mean recombination rate would be undefined.) For each pair of linked loci, the recombination rate between them was estimated separately from the male and female genetic distance between them using Kosambi’s map function [91]. Pairs of loci on separate chromosomes have a recombination fraction of 1/2. With the sex-averaged recombination fraction c ll′ thus calculated for every pair of loci (l,l′), the harmonic mean recombination fraction was calculated as , where is the number of pairs of distinct loci in L. Performing this calculation with |L| = 1,000 loci, we obtain an estimate of for human autosomes. Substituting this estimate into S1 Text Eqs (S.106), (S.107), and (S.108) then defines the curves plotted in Figs 6A, 6B and 6C, respectively.
Acknowledgments We thank Jeremy Berg, Doc Edge, Arbel Harpak, Gibran Hemani, Hanbin Lee, Molly Przeworski, Alex Young, and members of the Coop lab for helpful discussions and comments on earlier drafts.
[END]
---
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002511
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/