(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.

(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

A spectral theory for Wright’s inbreeding coefficients and related quantities

['Olivier François', 'Université Grenoble-Alpes', 'Grenoble', 'Clément Gain']

Date: 2021-10

Wright’s inbreeding coefficient, F ST , is a fundamental measure in population genetics. Assuming a predefined population subdivision, this statistic is classically used to evaluate population structure at a given genomic locus. With large numbers of loci, unsupervised approaches such as principal component analysis (PCA) have, however, become prominent in recent analyses of population structure. In this study, we describe the relationships between Wright’s inbreeding coefficients and PCA for a model of K discrete populations. Our theory provides an equivalent definition of F ST based on the decomposition of the genotype matrix into between and within-population matrices. The average value of Wright’s F ST over all loci included in the genotype matrix can be obtained from the PCA of the between-population matrix. Assuming that a separation condition is fulfilled and for reasonably large data sets, this value of F ST approximates the proportion of genetic variation explained by the first (K − 1) principal components accurately. The new definition of F ST is useful for computing inbreeding coefficients from surrogate genotypes, for example, obtained after correction of experimental artifacts or after removing adaptive genetic variation associated with environmental variables. The relationships between inbreeding coefficients and the spectrum of the genotype matrix not only allow interpretations of PCA results in terms of population genetic concepts but extend those concepts to population genetic analyses accounting for temporal, geographical and environmental contexts.

Principal component analysis (PCA) is the most-frequently used approach to describe population genetic structure from large population genomic data sets. In this study, we show that PCA not only estimates ancestries of sampled individuals, but also computes the average value of Wright’s inbreeding coefficient over the loci included in the genotype matrix. Our result shows that inbreeding coefficients and PCA eigenvalues provide equivalent descriptions of population structure. As a consequence, PCA extends the definition of those coefficients beyond the framework of allelic frequencies. We give examples on how F ST can be computed from ancient DNA samples for which genotypes are corrected for coverage, and in an ecological genomic example where a proportion of genetic variation is explained by environmental variables.

Funding: The study was supported by the grant ANR-19-P3IA-0003 MIAI@Grenoble-Alpes – PEG: Predictive Ecological Genomics to OF from the Agence Nationale de la Recherche ( https://anr.fr/ ) The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: All relevant data are within the paper and its Supporting information files. All codes necessary to reproduce the simulations and data analyses presented in our study are in an R package available from the following link: https://github.com/bcm-uga/spectralfst .

Copyright: © 2021 François, Gain. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

A corollary of the theory is an alternative definition of inbreeding coefficients that allows us to extend F ST to adjusted or surrogate genotypes, such as genotype likelihoods and other modifications of allele counts [ 27 ]. To illustrate the new definition, we compute F ST for ancient human DNA samples after performing correction for genomic coverage and for distortions due to difference in sample ages [ 28 ]. In a second application, we compute F ST for Scandinavian samples of Arabidopsis thaliana after removing genetic variation associated with environmental variables taken from a climate database [ 29 , 30 ].

In this study, we develop a spectral theory of genotype matrices to investigate the relationships between PCA and Wright’s coefficients in discrete population models. Our theoretical framework assumes that the observed genotypes correspond to the sampling of K discrete populations. Decomposing the genotype matrix as a sum of between and within-population matrices, we extend the results obtained in [ 10 , 16 , 19 , 25 ]. Our main result states that the mean value of F ST over loci is equal to the squared Hilbert-Schmidt norm of the between-population matrix, which can be computed by a spectral analysis. Under a separation condition bearing on the between and within-population matrices, the sum of the first (K − 1) eigenvalues of scaled PCA approximates the mean value of F ST over loci. To describe residual variation not explained by the discrete population model, we rely on approximations of the eigenvalues of the within-population matrix from RMT [ 10 , 26 ].

Although the relationships between PCA projections and admixture estimates are well-understood, difficulties of interpreting PCA eigenvalues still remain. The main contributions in that direction were restricted to models of divergence between two populations. The arguments were based on random matrix theory (RMT) [ 10 , 19 ] and coalescent theory [ 16 ]. We note that connections between F ST and PCA are not only important for description of population structure, but also in genome scans for selection where the distribution of PCA loadings can be used to detect regions with signature of divergent selection [ 20 – 23 ]. Based on RMT, Ref. [ 10 ] proposed a threshold value of F ST for two populations with equal sample sizes. Below the threshold, there should be essentially no evidence of population structure. The coalescent approach relied on a relationship between F ST and coalescent time for a pair of genes from a single subpopulation and that of a pair of genes from the collection of subpopulations [ 6 ]. For a model of divergence between two populations, theoretical results for coalescent times were used to demonstrate a link between the leading eigenvalue of PCA and F ST [ 16 ]. Results in Ref. [ 16 ] might be extended to simple models of population structure with explicit formulas for coalescent times [ 24 ], but the results are not straightforward. While coalescent theory and RMT have provided relationships between F ST and PCA in simple cases, the general conditions under which they are valid and their extensions to more than two populations are unknown.

Assuming that the columns of the genotype matrix are either centered or scaled, PCA computes the eigenvalues and eigenvectors of the sample covariance matrix. The first eigenvectors—or axes—summarize the directions which account for most of the genetic variation, and the eigenvalues represent the variances of projected samples along the axes. Eigenvalues and eigenvectors can be computed efficiently by using the singular value decomposition (SVD) of the column-centered data matrix [ 13 ]. PCA has been considered very early in human biology, and has become a popular method to study the genetic structure of populations [ 14 , 15 ]. Inference from PCA is justified from the fact that, similar to STRUCTURE, the projections of individuals on principal axes reveal their degree of admixture with source populations when these sources are represented in the sample [ 10 , 16 – 18 ].

Defined by Sewall Wright and Gustave Malécot, the fixation index or inbreeding coefficient, F ST , measures the amount of genetic diversity found between populations relative to the amount within populations [ 1 , 2 ]. Used as a measure of population differentiation, F ST is among the most widely used descriptive statistics in population and evolutionary genetics [ 3 – 7 ]. Inbreeding coefficients were originally defined for the analysis of allelic frequencies at a single genetic locus. With the amount of data available to present-day or ancient population genomic studies, principal component analysis (PCA) and model-based estimation algorithms, such as STRUCTURE, have emerged as alternative ways to describe population structure from multilocus genotype matrices [ 8 – 12 ].

Results and discussion

Partitioning of genetic variation Consider a sample of n unrelated individuals for which a large number of loci are genotyped, resulting in a matrix, X = (x iℓ ), with n rows and L columns. For haploids, we set x iℓ = 0, 1, and for diploids x iℓ = 0, 1, 2 to count the number of derived alleles at locus ℓ for individual i. Dealing with autosomes, we simplify our presentation by considering a sample of diploids as being represented by a sample of haploids having twice the original sample size. For unphased data, we take a random phase. Although not a necessary condition, the loci are assumed to be unlinked, or obtained after a linkage disequilibrium (LD)-pruning algorithm applied to the genotype matrix [20, 31]. We use the term locus as a shorthand for single-nucleotide polymorphism (SNP), although most of our analyses could include non-polymorphic sites. Following Wright’s approach to the description of population structure, our main assumption is that individuals are sampled from K predefined discrete populations. Examples of discrete population models underlying our assumptions include Wright’s island models, coalescent models of divergence and F-models [6, 32, 33]. Application to F-models will be described afterwards. To analyze population structure, PCA can be performed after centering and sometimes after scaling the genotype matrix. The scaled matrix is denoted by Zsc and the unscaled centered matrix is denoted by Z. Scaled PCA computes the eigenvalues, , of the empirical correlation matrix. Unscaled PCA computes the eigenvalues, , of the empirical covariance matrix [9, 26]. The eigenvalues are ranked in decreasing order, and is usually interpreted as the proportion of variance explained by the kth axis of the PCA. PCA can be performed via the SVD algorithm. In this case, the eigenvalues of scaled (or unscaled) PCA correspond to the squared singular values of the scaled (or centered) matrix divided by [9, 26]. To establish relationships between PCA and inbreeding coefficients, we decompose the centered matrix into a sum of two matrices, Z = Z ST + Z S , corresponding to between and within-population components. The decomposition is performed as follows. Let i be an individual sampled from population k. At a particular locus, ℓ, the genotype, x iℓ , is equal 0 or 1 (derived allele), and p kℓ denotes the derived allele frequency in population k at this locus. The coefficient of the centered matrix, z iℓ , is equal to z iℓ = ∑ j≠k c j (p kℓ − p jℓ ) + (x iℓ − p kℓ ), where c k = n k /n represents the proportion of individuals sampled from population k. In this formulation, the between-population matrix, Z ST , has general term , repeated for all individuals in population k. By construction, the rank of Z ST is equal to (K − 1). The within-population matrix, Z S , has general term . A very similar decomposition holds for the scaled matrix, defined as , where is the derived allele frequency in the total sample at locus ℓ (see Box 1 for the notations). Box 1. Notations n Haploid sample size L Number of genomic loci F ST Wright’s fixation index, computed from Nei’s formula with correction for unequal sample sizes H S Within-population genetic diversity H T Genetic diversity in the total population D ST Among (or between) population genetic diversity X Matrix of SNP genotypes for n individuals at L loci P Vector of SNP frequency for the L loci Z Matrix of centered genotypes, X − P Zsc Matrix of scaled genotypes, Z ST An n × L matrix describing between-population data repeated for individuals from a same population Z S An n × L matrix describing within-population data Eigenvalues of the empirical covariance matrix (unscaled PCA) Eigenvalues of the empirical correlation matrix (scaled PCA), also equal to L times the proportions of variance explained by the principal axes

Spectral analysis of inbreeding coefficients Consider n samples from K discrete populations and define D ST and F ST according to Wright’s [1] and Nei’s [4, 34] equations, allowing for unequal population sample sizes. At a particular locus, set and H T = 2P(1 − P), then we have D ST = H T − H S . Wright’s inbreeding coefficient is defined as Our main result states that the mean value of F ST across loci can be computed from the singular values of the between-population matrix, . A similar relationship is also established for D ST and for the unscaled matrix, Z ST . The singular values of the between and within-population matrices can be evaluated from the SVD algorithm. The computational cost of those operations is equal to the computational cost of the PCA of the genotype matrix, of order O(n2 L). This cost could be reduced by using various methods, for example by computing the first K − 1 singular values only. All conclusions below are valid regardless of any population genetic model. The results also remain valid when genotypes are conditioned on having minor allele frequency greater than a given threshold, and when the loci are physically linked or when they exhibit LD. We use the notation to denote the average value of the quantity Q ℓ across loci. Theorem 1. Let K ≥ 2. Let Z and Zsc be the unscaled and scaled genotype matrix respectively. Define Z ST and as in the previous section. We have (1) and (2) The key arguments for those results involve matrix norms, and they can be found in S1 Text. By the Pythagorean theorem, we have ‖Z‖2 = ‖Z ST ‖2 + ‖Z S ‖2 (S1 Text). According to this result, Theorem 1 can be reformulated using within-population matrices as follows. Corollary. Let K ≥ 2. Let Z and Zsc be the unscaled and scaled genotype matrix respectively. Define Z S and as in the previous section. We have (3) and (4)

Inbreeding coefficients for surrogate genotypes Besides an interest in connecting population genetic theory to the spectrum of the genotype matrix, the results in Eqs (1) and (2) have important consequences for data analysis. First, the results support the definition of F ST for multilocus genotypes as an average of ratios rather than a ratio of averages [35, 36]. More importantly, Theorem 1 leads to alternative definitions of Nei’s D ST and Wright’s F ST from population genetic data. As will be demonstrated by applications to ancient DNA and to ecological genomics, the main interest in those new definitions is their straightforward extension to adjusted genotypic matrices, providing statistics analogous to F ST for modified genotypic values. For example, adjusted genotypic matrices arise when correcting for biases due to technical artifacts, including batch effects and genomic coverage in population genomic data [37, 38]. In general, F ST could be adjusted for any specific effect by considering the residuals of latent factor regression models [39–41]. More specifically, for Z (or Zsc) and for a set of measured covariates, Y, latent factor regression models estimate a matrix of surrogate genotypes, W, by adjusting a regression model of the form In this model, the B matrix contains effect sizes for each variable in Y, and ϵ is a matrix that represents centered errors. The latent matrix W has a specified rank, k, lower than n minus the number of covariates. The rank k corresponds to the number of latent factors incorporated in the model. The matrix Zadj = W + ϵ leads to a definition of an adjusted inbreeding coefficient, . The adjusted inbreeding coefficient can be computed as the squared norm of the between-population matrix, , after scaling. Note that this definition considers quantitative values observed at each locus, and is similar to a population genetic quantity called Q ST [42, 43]. Alternatively, can be computed from the average coefficient of determination, R2, obtained from the regression of each scaled surrogate genotype on the population labels. The definitions are equivalent, and we have

Inbreeding coefficients and PCA eigenvalues Having established that the mean value of F ST across loci can be computed from the leading eigenvalues of the between-population matrix, , the next question is to ask whether similar results hold for the leading eigenvalues of the PCA of the scaled matrix, (5) and for the PCA of the centered matrix, (6) Those results require that the ranked eigenvalues of sort into approximate eigenvalues of followed by approximate eigenvalues of . Said differently, the (K − 1)th singular value of must separate from the leading singular value of (7) We suppose that the ratio L/n is constant for large L and n, and make the following assumptions: 1) The separation condition Eq (7) is verified, 2) The leading eigenvalue of is of order (RMT hypothesis). Then, under those conditions, the accuracy of the approximation in Eqs (5) and (6) is of order O(K/L). More precisely, for any singular value, σ k (Z ST ), of , there exists a singular value, σ k (Z), of such that we have A similar result holds for the first K − 1 eigenvalues of scaled matrices, and . In other words, the mean value of F ST across loci can be approximated from the sum of the (K − 1) leading eigenvalues of the PCA with an accuracy proportional to the number of populations and to the inverse of the number of unlinked loci in the genotype matrix. Mathematical arguments for those results are detailed in S1 Text. Poor approximations may be caused by insufficient sample size, incorrect definition of populations, inclusion of individuals with mixed ancestry, spatial structure, etc. Poor approximations may also be accompanied by failure to verify the separation condition Eq (7), as our simulations will illustrate afterwards. In addition, the results show that F ST and PCA exhibit similar biases, for example, when the sampling design is uneven or when loci are filtered out of the genotype matrix [16, 31, 44]. We note that the RMT hypothesis for the residual matrix, Z S , is difficult to prove for population genetic models. Like [10], we will rely on simulations to show that RMT describes residual variation in population genetic models accurately. In empirical data analyses, checking the residual matrix for agreement with RMT will also provide an informal test for the number of components in PCA similar to Cattel’s elbow rule [45].

Brief example To illustrate the approximation of by the leading eigenvalues of the PCA, we present a short simulation example, in which a genotype matrix was generated according to a three-population F-model. Simulation studies of F-models and additional examples based on real data will be developed more extensively later on. In this first simulation example, the average ancestral frequency was equal to 20%, and the drift parameters for the three populations were equal to F 1 = 5%, F 2 = 10% and F 3 = 30%. Populations 1 and 2 were genetically closer to each other than to population 3, which was the most diverged population. Three hundred samples (n k = 100, for k = 1, 2, 3) were genotyped at 10,000 loci. PCA was conducted on L = 9740 SNP loci after monomorphic loci were removed. The average value of F ST across loci was equal to 9.52%, and approximated the sum of the leading eigenvalues of the PCA (9.54%) accurately. The first axes of the PCA explained 6.78%, 2.76%, and 0.47% of the total variation respectively (Fig 1). The non-null eigenvalues of , 6.77% and 2.75%, were close to the values obtained for the first two PCs. As stated in Theorem 1, their sum was equal to . Clearly, the smallest eigenvalue of separated from the leading value of (0.47%), which was close to the eigenvalue for the third PC and to its prediction from RMT (0.31%, Fig 1). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 1. Spectral analysis of a three-population model. PCA scree plots and PC plots for the scaled matrix, Zsc, for the between-population matrix, , and for the residual matrix, of simulated data. The simulation was performed for n = 300 individuals and an F-model (F 1 = 5%, F 2 = 10%, F 3 = 30%) with ancestral frequency drawn from a beta(1,4) distribution. https://doi.org/10.1371/journal.pgen.1009665.g001 To show the effect of having incorrectly labelled population samples, we considered the same genotype matrix, and replicated the analyses after grouping the “paraphyletic” samples from populations 2 and 3 against the least diverged sample from population 1 (Fig 2, n 1 = 200 and n 2 = 100, K = 2). The average F ST value was equal to 3.46%, and failed to approximate the first eigenvalue of the PCA (6.78%). The leading value of (6.06%) did not verify the separation condition, and it differed from its RMT prediction (0.32%). The PC plot for provided evidence of residual population structure within the paraphyletic population sample. Like for a regression analysis, those results outlined the usefulness of visualizing the residual matrix in order to evaluate the number of populations from the genotype matrix (Figs 1 and 2). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 2. Spectral analysis with incorrectly labelled population samples (K = 2). For the same genotype matrix as in Fig 1, samples from populations 2 and 3 (blue) were grouped against population 1 (green). PCA scree plots and PC plots for the scaled matrix, Zsc, for the between-population matrix, , and for the residual matrix, of simulated data. F ST was lower than the leading eigenvalue of the residual matrix, and it differed from the first PC eigenvalue. https://doi.org/10.1371/journal.pgen.1009665.g002

Single population models In a first series of simulations without population structure, we investigated whether RMT predictions were valid for F-models. For single population F-models, the results supported that the leading eigenvalues of PCA were accurately predicted by the Marchenko-Pastur distribution (S1 Fig). Then we investigated whether condition Eq (7) could be verified when there was no structure in the data, and two population samples were wrongly defined from a preliminary structure analysis. We ran two-hundred simulations of single population models (n = 100 and L ≈ 10, 000), and, for each data set, we partitioned the samples in two groups according to the sign of their first principal component. This procedure maximized the likelihood of detecting artificial groups, leading to an average F ST ≈ 1.1%. For those artificial groups, we computed the non-null singular value of the between-population matrix, , and the leading singular value of the within-population matrix, . For the simulations, the separation condition was never verified, rejecting population structure in all cases (S2(A) Fig). For smaller sample sizes (n = 10 and L ≈ 1, 000), the separation condition was erroneously checked in 21% simulations, indicating that we had less power to discriminate among artificial groups with small sample sizes (S2(B) Fig). Those results were also consistent with difficulties reported for between-group PCA [46].

More than three-population models For F-models, the eigenvalues of the theoretical covariance matrix were analysed formally for small numbers of populations (S1 Text). To achieve this goal, we considered the covariance matrix of the random vector z defined by , for all k = 1 to K. The K × K covariance matrix of the random vector z could be obtained from the drift statistics and defined in [48, 49]. The coefficients of this matrix are , for j ≠ k, and otherwise. For K = 3, the eigenvalues of the Λ matrix were computed exactly (S1 Text). We performed simulations of three-population F-models to check whether the data agreed with theoretical predictions for the leading eigenvalues, λ 1 , and for D ST and F ST . With random drift coefficients (n = 100, L = 20000), the separation condition was verified in all simulated data sets. An almost perfect agreement between λ 1 + λ 2 and the mean value of D ST /2 (unscaled PCA) or F ST (scaled PCA) was observed (S7(A)–S7(C) Fig). The leading eigenvalue of unscaled PCA exhibited a small but visible bias with respect to the value predicted for λ 1 (S7(B) Fig). The third eigenvalue of scaled PCA was close to the approximation provided by RMT (S7(D) Fig). To study cases in which the separation condition was not verified, we considered smaller number of genotypes (L ≤ 1000) and lower values of drift coefficients (F k ≤ 10%). For small values of n and L, a significant proportion of simulated data sets did not verify the separation condition (S8 Fig), although models were correctly specified. Those results provided additional evidence of biases in analyses of population structure with small data sets. Extending our simulation study to larger number of populations, we checked that the accuracy of the approximation of by the sum of the first (K − 1) PCA eigenvalues was proportional to K/L when the separation condition was verified (S9 Fig). Poorer accuracy was observed when the data failed to verify the separation condition.

Human data To provide evidence that the relationships between PCA eigenvalues and F ST are verified for real genotypes, we computed F ST , their approximation by PCA, and the leading eigenvalues of the residual matrix for pairs and triplets of human population samples from The 1000 Genomes Project [50] (Table 1 and S1 Table). In pairwise analyses excluding admixed samples, the separation condition was verified in all analyses at the exception of the pair CEU-IBS, formed of two closely related European samples. The leading eigenvalue of scaled PCA was accurately approximated by , and the leading eigenvalue of the residual matrix was accurately predicted by RMT. For triplet analyses without admixed samples, the separation condition was also verified and the sum of the first two eigenvalues of scaled PCA was accurately approximated by . RMT still predicted the leading eigenvalue of the residual matrix accurately. In pair and triplet analyses including admixed samples, the separation condition was fulfilled in all analyses at the exception of the pair ACB-ASW (S1 Table). The approximation of by the leading eigenvalues of scaled PCA was less accurate than in analyses without admixed samples. In the CEU-ASW analysis for example, (= 4.55%) was lower than the leading eigenvalue of PCA (= 4.87%). An explanation for these discrepancies may be that F ST is informative about the admixture proportion between admixed populations and their parental source populations [51]. With admixed samples, mismatches were also observed between the leading eigenvalue of the residual matrix and its prediction from RMT. The results suggest that the data do not agree with models of K discrete populations, and modified definitions of F ST could be more appropriate for describing population structure in the presence of admixed individuals [52, 53]. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Table 1. F ST estimates for populations from The 1,000 Genomes Project. https://doi.org/10.1371/journal.pgen.1009665.t001

[END]

[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009665

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/