(C) PLOS One

This story was originally published by PLOS One and is unaltered.

SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations [1]

Wenmin Zhang (Quantitative Life Sciences, McGill University, Montreal, Quebec); Hamed Najafabadi (Department of Human Genetics; Dahdaleh Institute of Genomic Medicine); Yue Li (School of Computer Science)

Date: 2024-02

Abstract Identifying causal variants from genome-wide association studies (GWAS) is challenging due to widespread linkage disequilibrium (LD) and the possible existence of multiple causal variants in the same genomic locus. Functional annotations of the genome may help to prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. Classical fine-mapping methods conducting an exhaustive search of variant-level causal configurations have a high computational cost, especially when the underlying genetic architecture and LD patterns are complex. SuSiE provided an iterative Bayesian stepwise selection algorithm for efficient fine-mapping. In this work, we build connections between SuSiE and a paired mean field variational inference algorithm through the implementation of a sparse projection, and propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. Moreover, we incorporate functional annotations into fine-mapping by jointly estimating enrichment weights to derive functionally-informed priors. We evaluate the performance of SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved improved power for fine-mapping with reduced computation time. We demonstrate the utility of SparsePro through fine-mapping of five functional biomarkers of clinically relevant phenotypes. In summary, we have developed an efficient fine-mapping method for integrating summary statistics and functional annotations. Our method can have wide utility in understanding the genetics of complex traits and increasing the yield of functional follow-up studies of GWAS. SparsePro software is available on GitHub at https://github.com/zhwm/SparsePro.

Author summary Accurately identifying causal variants from genome-wide association study summary statistics is important for understanding the genetic architecture of complex traits and identifying therapeutic targets. Functional annotations are commonly used as additional evidence for prioritizing causal variants. In this study, we present SparsePro to integrate summary statistics and functional annotations for accurate identification of causal variants. SparsePro extends the capabilities of a popular fine-mapping method, SuSiE, with important contributions in hyperparameter estimation, posterior summaries and integration of functional annotations. Through extensive simulations, we demonstrate that our proposed approach can effectively integrate summary statistics and functional annotations, leading to improved power for identifying causal variants. Furthermore, we evaluate the benefits of incorporating functional annotations through real data analyses of five functional biomarkers. In summary, by improving power and providing valuable insights into complex disease genetics, SparsePro will have wide utility in advancing our knowledge and facilitating follow-up discoveries.

Citation: Zhang W, Najafabadi H, Li Y (2023) SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations. PLoS Genet 19(12): e1011104. https://doi.org/10.1371/journal.pgen.1011104

Editor: Gao Wang, Columbia Presbyterian Medical Center: Columbia University Irving Medical Center, UNITED STATES

Received: January 17, 2023; Accepted: December 11, 2023; Published: December 28, 2023

Copyright: © 2023 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: SparsePro is open-access software publicly available at https://github.com/zhwm/SparsePro. The simulation scripts are deposited at https://github.com/zhwm/SparsePro_analysis. Individual-level phenotype and genotype data from the UK Biobank are available upon successful application at https://www.ukbiobank.ac.uk. GCTA was downloaded from https://cnsgenomics.com/software/gcta/bin/gcta_1.93.2beta.zip. FINEMAP was downloaded from http://www.christianbenner.com/finemap_v1.4_x86_64.tgz. SuSiE (version 0.12.16) was installed from CRAN. PolyFun was installed from https://github.com/omerwe/polyfun. UK Biobank LD information was downloaded from https://alkesgroup.broadinstitute.org/UKBB_LD/. Tissue-specific annotation was downloaded from https://alkesgroup.broadinstitute.org/LDSCORE/.

Funding: W.Z. has been supported by a doctoral training fellowship from the FRQNT (319188) and the Healthy Brains, Healthy Lives Program, funded by the Canada First Research Excellence Fund (CFREF), Quebec’s Ministère de l’Économie et de l’Innovation (MEI), and the Fonds de recherche du Québec (FRQS, FRQSC and FRQNT). H.N. holds a Canada Research Chair funded by the Canadian Institutes of Health Research. Y.L. is supported by Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN-2019-0621), Fonds de recherche Nature et technologies (FRQNT) New Career (NC-268592), and Canada First Research Excellence Fund Healthy Brains for Healthy Life (HBHL) initiative New Investigator start-up award (G249591). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction Establishment of large biobanks and advances in genotyping and sequencing technologies have enabled large-scale genome-wide association studies (GWAS) [1–3]. Although GWAS have revealed hundreds of thousands of associations between genetic variants and traits of interest, understanding the genetic architecture underlying these associations remains challenging [4–6], mainly because GWAS rely on univariate regression models that are not able to distinguish causal variants from other variants in linkage disequilibrium (LD) [5, 7, 8]. Several fine-mapping methods have been proposed to identify causal variants from GWAS. For instance, BIMBAM [9], CAVIAR [10] and CAVIARBF [11] estimate the posterior inclusion probabilities (PIP) for variants in a genomic locus by exhaustively evaluating likelihoods of all possible causal configurations. FINEMAP [12] accelerates such calculations with a stochastic shotgun search focusing on the most likely subset of causal configurations. However, the total number of variant-level causal configurations can grow exponentially with the number of causal variants, which can lead to a very high computational cost in classical fine-mapping methods. Starting from the motivation of quantifying uncertainty in selecting variants for constructing credible sets, SuSiE introduced a novel sum of single effects model and proposed an efficient iterative Bayesian stepwise selection (IBSS) algorithm [13, 14]. The IBSS algorithm points to a promising approach for improving fine-mapping efficiency. Additionally, functional annotations are commonly used as auxiliary evidence for prioritizing causal variants. PAINTOR [15] uses a probabilistic framework that integrates GWAS summary statistics with functional annotation data to improve the accuracy of fine-mapping. Similarly, TORUS [16] incorporates highly informative genomic annotations to help with quantitative trait loci discoveries. 
Recently, PolyFun [17] was developed to use genome-wide heritability estimates from LD score regression to set the functional priors for fine-mapping methods. Given the computational efficiency of SuSiE, integrating functional annotations into similar algorithms can be desirable. In this work, we present SparsePro for efficient fine-mapping with the ability to incorporate functional annotations. We connect the SuSiE IBSS algorithm with earlier work on a paired mean field variational inference algorithm [18] through the implementation of a sparse projection. We further propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. We assess the performance of our proposed approach via simulation studies and examine the utility of SparsePro by fine-mapping five functional biomarkers of clinically relevant phenotypes.
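
To make the cost of exhaustive search noted in the Introduction concrete, the short sketch below counts the causal configurations that an exhaustive method would have to score. The function name `n_configurations` and the locus size of 1,000 variants are illustrative assumptions, not values taken from the study.

```python
from math import comb


def n_configurations(n_variants, max_causal):
    """Count non-empty causal configurations with at most max_causal causal variants."""
    return sum(comb(n_variants, k) for k in range(1, max_causal + 1))


# Hypothetical locus with 1,000 variants.
for max_causal in (1, 2, 3, 5, 10):
    print(max_causal, n_configurations(1000, max_causal))
```

Allowing three causal variants instead of one already raises the count from 1,000 to more than 166 million, which is why stepwise or variational schemes that avoid enumerating configurations are attractive.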

Discussion Accurately identifying causal variants is fundamental to human genetics research and particularly important for interpreting GWAS results [5, 8]. In this work, we presented SparsePro to help prioritize causal variants by integrating GWAS summary statistics and functional information. We showcased the improved performance of our proposed approach through simulation studies. By fine-mapping genetic associations in five biomarkers of clinically relevant phenotypes, we demonstrated that functional annotations were useful in prioritizing biologically relevant variants. SparsePro builds upon SuSiE [13] and extends the capabilities of SuSiE with several important contributions. First, we proposed an effective strategy for estimating hyperparameters. Specifically, local heritability-based estimates can reduce the number of parameters to be estimated by the fine-mapping algorithm, resulting in improved power and efficiency. To showcase its utility, we also applied this strategy to SuSiE and observed substantial improvement of fine-mapping power (S2 Table) with calibrated PIP (S8 Table). Moreover, we provided an alternative attainable coverage-based approach for posterior summaries. Specifically, we calculated attainable coverage for each effect group and only effect groups with attainable coverage greater than ρ were summarized to ρ-level credible sets. We also applied this approach to SuSiE, which yielded improved set-level summaries compared to its original implementation with purity-based filtering (S9 Fig). As expected, if both strategies for estimating hyperparameters and summarizing posterior probabilities were incorporated in SuSiE, its performance could be comparable to that of SparsePro (S1 and S2 Tables). Importantly, we provided a framework to integrate GWAS summary statistics and functional annotations. 
Functional annotations are widely used as additional evidence to prioritize causal variants together with statistical associations, with the possibility to elucidate the causal mechanisms [15–17]. In this study, we proposed an integrated approach for functional fine-mapping by jointly estimating enrichment weights for functional annotations and subsequently incorporating enrichment weights to derive functionally-informed priors. Therefore, the obtained priors were adaptive to functional enrichment based on the data, which allows the use of functional annotations in a cautious manner. We additionally introduced a G-test to assess the relevance of annotations. This G-test evaluates whether the causal signals are significantly enriched in the annotation of interest. In simulations, the G-test accurately identified the enriched annotations (S3 and S5 Tables). However, in SparsePro, filtering annotations by the G-test does not impact the fine-mapping results dramatically (S2 and S8 Tables). This is because when irrelevant annotations are included in the joint estimates, their estimated enrichment weights are typically small (S6 Table), thus having a limited impact on the functionally-informed priors. Nonetheless, in SparsePro, screening annotations by the G-test leads to simpler, more interpretable models. While we used a p-value threshold of 1 × 10⁻⁵ in both simulations and fine-mapping of functional biomarkers, users can adjust this threshold based on their preference for a more complicated model or a sparser model. Additionally, for other functionally-informed methods that are sensitive to annotation specifications, particularly those deriving strong priors from annotations, incorporating our proposed G-test can be useful to mitigate the impact of annotation misspecification. 
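
The G-test discussed above is, in general, a likelihood-ratio test on a contingency table. The exact table SparsePro builds is specified in the S1 Text and is not reproduced here, so the sketch below assumes a simple 2 × 2 layout: the expected number of causal variants (summed PIP) inside and outside an annotation, set against the remaining variants. The function name `g_test_2x2` and all counts are hypothetical.

```python
import math


def g_test_2x2(a, b, c, d):
    """G-test of independence on the table [[a, b], [c, d]]; returns (G, p-value).

    Assumed layout (not taken from the paper):
      a = summed PIP of annotated variants,   b = annotated variants minus a
      c = summed PIP of unannotated variants, d = unannotated variants minus c
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            if o > 0:
                g += 2.0 * o * math.log(o / e)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


# Toy example: 100 annotated and 900 unannotated variants.
g_null, p_null = g_test_2x2(1.0, 99.0, 9.0, 891.0)    # PIP mass proportional to annotation size
g_enrich, p_enrich = g_test_2x2(8.0, 92.0, 2.0, 898.0)  # PIP mass concentrated in the annotation
print(p_null, p_enrich, p_enrich < 1e-5)
```

Under this assumed layout, an annotation would pass the screening threshold used in the paper only when the second kind of table arises, that is, when the summed PIP inside the annotation is far larger than its share of variants would predict.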
In real data analyses, the “non-synonymous” [20] annotation is highly relevant in fine-mapping (S7 Fig) and indeed, by using this annotation to prioritize variants, we were able to identify rs1260326 as a causal variant for pulse rate when statistical evidence alone was not able to distinguish it from other variants in high LD (S10 Fig). However, future investigations are still needed to elucidate the roles of many other functional annotations. Similar to existing fine-mapping algorithms, there are caveats in fine-mapping analysis using SparsePro. First, there are challenges related to allele flipping and LD rank deficiency when using summary statistics for fine-mapping. SparsePro, similar to Zou et al [14], does not require a full-rank LD matrix as it does not require matrix inversion throughout the algorithm. However, allele flipping can lead to algorithm convergence issues. To address this, it is recommended that users closely monitor the convergence of the algorithm and utilize scripts we provided to automate the formatting of GWAS summary statistics to match alleles in the LD reference panel. By taking these precautions, the potential convergence issues caused by allele flipping can be mitigated. Additionally, the identification of causal variants in fine-mapping relies on the rigorousness of GWAS study design, and may be biased if unmeasured confounding factors such as population stratification are not properly controlled for. In summary, SparsePro is an accurate and efficient fine-mapping method integrating statistical evidence and functional annotations. We envision its wide utility in understanding the genetic architecture of complex traits, identifying target genes, and increasing the yield of functional follow-up studies of GWAS.
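
The attainable coverage-based summary described in the Discussion can be illustrated with a small sketch. The precise definition is given in the S1 Text, which is not reproduced here; the code below assumes that each variant is credited to the effect group in which its posterior probability is largest, and that the attainable coverage of a group is the sum of the probabilities credited to it. The function name `credible_sets`, the matrix `gamma` and its values are hypothetical.

```python
def credible_sets(gamma, rho=0.95):
    """Summarize effect groups into rho-level credible sets.

    gamma[g][k] is the posterior probability that variant g is the causal
    variant of effect group k (assumed input; each column sums to at most 1).
    """
    n_variants, n_groups = len(gamma), len(gamma[0])
    # Credit each variant only to the effect group where its probability is largest.
    credited = [[0.0] * n_groups for _ in range(n_variants)]
    for g, row in enumerate(gamma):
        best = max(range(n_groups), key=lambda k: row[k])
        credited[g][best] = row[best]

    sets = {}
    for k in range(n_groups):
        attainable = sum(credited[g][k] for g in range(n_variants))
        if attainable <= rho:
            continue  # this group cannot reach the requested coverage; skip it
        ranked = sorted(range(n_variants), key=lambda g: -credited[g][k])
        members, covered = [], 0.0
        for g in ranked:
            members.append(g)
            covered += credited[g][k]
            if covered >= rho:
                break
        sets[k] = members
    return sets


# Toy posterior: group 0 is well resolved, group 1 is diffuse.
gamma = [
    [0.90, 0.01],
    [0.08, 0.02],
    [0.01, 0.50],
    [0.01, 0.40],
]
result = credible_sets(gamma, rho=0.95)
print(result)
```

In this toy input the second effect group has an attainable coverage of about 0.90, so no 95% credible set is reported for it, whereas purity-based filtering would instead judge the set by the LD among its members.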

Methods

SparsePro for efficient fine-mapping integrating summary statistics and functional annotations

In SparsePro, we use a generative model to integrate GWAS summary statistics and functional annotations (Fig 1 and S1 Text). First, we specify the prior inclusion probability $\pi_g$ for the gth variant:

$$\pi_g = \mathrm{softmax}(A_g^\top w) = \frac{\exp(A_g^\top w)}{\sum_{g'=1}^{G} \exp(A_{g'}^\top w)}$$

where $A_g$ is the M × 1 vector of M annotations for the gth variant and $w$ is an M × 1 vector of enrichment weights. Here, we use the softmax function to ensure the prior probabilities are normalized. If no functional information is provided, the prior inclusion probability is considered equal for all variants. Subsequently, we assume the high-dimensional genotype matrix $X_{N \times G}$ can be represented by altogether K effect groups via a sparse projection $S_{G \times K} = [s_1, \ldots, s_K]$ with

$$s_k \sim \mathrm{Multinomial}(1, \pi), \quad \pi = [\pi_1, \ldots, \pi_G]$$

Then the effect sizes for effect groups can be represented by $\beta = [\beta_1, \ldots, \beta_K]$ where

$$\beta_k \sim \mathcal{N}(0, \tau_\beta^{-1})$$

Finally, for a continuous trait $y_{N \times 1}$ over N individuals, we have:

$$y \sim \mathcal{N}(X S \beta, \tau_y^{-1} I_N)$$

For inference, we use an efficient paired mean field variational inference algorithm [18] adapted for GWAS summary statistics, which we show is equivalent to the SuSiE IBSS algorithm [13] (detailed in the S1 Text). We estimate hyperparameters for effect sizes $\tau_\beta$ and residual errors $\tau_y$ using local heritability-based estimates from HESS [27] (S1 Text) and propose an attainable coverage-based strategy for summarizing posterior probabilities (S1 Text). Additionally, we use joint estimates of enrichment weights to derive functionally-informed priors to further prioritize causal variants (S1 Text) and introduce a G-test to screen relevant functional annotations (S1 Text).

Locus simulation studies

We conducted locus simulations to evaluate the performance of fine-mapping methods under different settings. We randomly selected three 1-Mb regions, and obtained genotypes for 353,570 unrelated UK Biobank White British ancestry individuals [1]. 
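
As an illustration of the softmax prior specified at the start of the Methods, the sketch below turns an annotation matrix (rows A_g) and a weight vector w into normalized prior inclusion probabilities. The function name `softmax_prior` and the toy annotations are hypothetical; only the softmax form itself comes from the text.

```python
import math


def softmax_prior(annotations, weights):
    """Prior inclusion probabilities from annotations (one row A_g per variant) and weights w."""
    scores = [sum(a * w for a, w in zip(row, weights)) for row in annotations]
    top = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical locus: 4 variants, 2 binary annotations.
A = [[1, 0], [0, 0], [0, 1], [0, 0]]
enriched = softmax_prior(A, [2.0, 0.0])  # first annotation enriched with weight 2
flat = softmax_prior(A, [0.0, 0.0])      # no enrichment: equal priors
print(enriched, flat)
```

With all weights at zero the prior is uniform, matching the case where no functional information is provided, and a variant carrying an annotation with weight W has a prior exp(W) times that of an unannotated variant.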
For each locus, we generated 50 replicates for each combination of parameters: K ∈ {1, 2, 5, 10} (number of causal variants) and W ∈ {0, 1, 2} (enrichment intensity) among variants that were annotated as “conserved sequences” [19], “DNase I hypersensitive sites” (DHS) [28], “non-synonymous” [20], or overlapping with histone marks H3K27ac [29] or H3K4me3 [28]. In the simulated weight vector w, the entries that correspond to these enriched annotations had a value of W. Causal variants in each simulation replicate were randomly assigned. Then, we used the GCTA GWAS simulation pipeline [30] to simulate a continuous trait with a total heritability of K × 10⁻⁴. We performed an association test between each variant and the simulated trait, and obtained GWAS summary statistics using the fastGWA software [31]. Next, we ran the different fine-mapping programs with the GWAS summary statistics and in-sample LD as inputs. For methods using functional annotations, we provided the aforementioned five annotations with enrichment of causal variants as well as five additional annotations without enrichment: “actively transcribed regions” [32], “transcription start sites” [32], “promoter regions” [33], “5’-untranslated regions” [20], and “3’-untranslated regions” [20]. The statistical fine-mapping results obtained from SparsePro without annotation information were denoted as “SparsePro-”. Annotations with a G-test p-value < 1 × 10⁻⁵ were selected for functionally-informed fine-mapping, and the results were referred to as “SparsePro+”. Additionally, we performed functionally-informed fine-mapping by including all annotations (i.e., a G-test p-value < 1.0) without G-test screening, denoted as “SparsePro+1.0”. Moreover, we conducted statistical fine-mapping using the stochastic shotgun search mode of FINEMAP (V1.4) and the function “susie_rss” from SuSiE (V0.12.16). 
The mcmc mode for PAINTOR (V3.0) was used to obtain the baseline model results and the annotated model results, separately denoted as “PAINTOR-” and “PAINTOR+”. The largest K used for SparsePro, SuSiE and FINEMAP was 10. Due to the high computation cost, PAINTOR only allows up to 3 causal variants per locus. Computation time was recorded on a 2.1 GHz CPU for fine-mapping programs including all procedures. Furthermore, we investigated the benefits of our proposed strategies for estimating hyperparameters and summarizing posterior probabilities (detailed in the S1 Text) by incorporating them into SuSiE. Specifically, local heritability-based estimates for effect size variance and residual variance were provided to “scaled_prior_variance” and “residual_variance” respectively in both SuSiE+HESS and SuSiE+SparsePro, while the default empirical Bayes-based hyperparameter estimates were used in SuSiE. The posterior summaries obtained from SuSiE with heritability-based hyperparameters were denoted as “SuSiE+HESS” while the posterior summaries obtained using our proposed approach were denoted as “SuSiE+SparsePro” (S1 Text).

Genome-wide simulation studies

We conducted genome-wide simulations to compare SparsePro+ with other methods that require genome-wide GWAS summary statistics for functional fine-mapping. We obtained genotypes of 353,570 unrelated UK Biobank White British individuals on chromosome 22 and sampled 100 causal variants with W ∈ {0, 1, 2} (enrichment intensity) among variants that were annotated as “non-synonymous” [20]. We used the GCTA GWAS simulation pipeline [30] to simulate a continuous trait with a per-chromosome heritability of 0.01. We tested the association between each variant and the simulated trait, and obtained GWAS summary statistics using the fastGWA software [31]. This process was repeated 22 times to obtain genome-wide GWAS summary statistics. 
Additionally, we obtained LD information calculated using the UK Biobank participants from Weissbrod et al. [17]. These LD matrices were generated for genome-wide variants binned into sliding windows of 3 Mb with neighboring windows having a 2-Mb overlap. We applied SparsePro to the GWAS summary statistics with the aforementioned LD information, iterating over all sliding windows initially without any functional annotation. The fine-mapping results obtained were referred to as “SparsePro-”. Next, the 10 annotations used in locus simulations were used to derive functional priors. The fine-mapping results from SparsePro with a prior derived from PolyFun were denoted as “SparsePro+PolyFun”. Additionally, results from SparsePro with a functional prior estimated from annotations with a G-test p-value less than 1 × 10⁻⁵ were denoted as “SparsePro+” while results from SparsePro with a functional prior estimated from all 10 annotations were denoted as “SparsePro+1.0”. In these fine-mapping analyses, variants in each 3-Mb sliding window were fine-mapped jointly. However, we only retained PIP for variants located in the 1-Mb region central to the window as well as credible sets with top variants located in this 1-Mb region. Therefore, variants were fine-mapped together with neighboring variants within at least 1 Mb on either side to mitigate boundary effects. To further investigate the impact of annotation misspecification or annotation measurement errors on functionally-informed fine-mapping, we utilized the “conserved sequences” [19] annotation, which partly overlaps with the simulated enriched “non-synonymous” [20] annotation. We used this annotation for deriving the functional prior using both SparsePro and PolyFun, and the corresponding results were labeled as “SparsePro+Misspecified” and “SparsePro+Misspecified PolyFun”, respectively. 
These analyses allowed us to evaluate the robustness of the functionally-informed fine-mapping approach to annotation misspecifications and potential measurement errors.

Fine-mapping of functional biomarkers of clinically relevant phenotypes

To investigate potential genetic coordination mechanisms, we performed GWAS in the UK Biobank [1], focusing on five functional biomarkers: forced expiratory volume in one second to forced vital capacity (FEV1-FVC) ratio for lung function, estimated glomerular filtration rate for kidney function, pulse rate for heart function, gamma-GT for liver function and blood glucose level for pancreatic islet function. For each biomarker, we first regressed out the effects of age, age², sex, genotyping array, recruitment centre, and the first 20 genetic principal components before inverse normal transforming the residuals to z-scores that had a zero mean and unit variance. We then performed GWAS analysis on the resulting z-scores with the fastGWA software [30, 31] to obtain summary statistics. Using the summary statistics and the matched LD information [17], we performed genome-wide fine-mapping with “SparsePro-”, “SparsePro+” and “SparsePro+PolyFun” as described for the genome-wide simulation studies, with annotations from the “baselineLF2.2.UKB” model [17] provided by PolyFun. To assess the biological relevance of fine-mapping results, we used 10 tissue-specific annotations derived from four histone marks H3K4me1, H3K4me3, H3K9ac, and H3K27ac by Finucane et al. [34]. This set of annotations was not used by any functional fine-mapping methods. To assess tissue specificity of the obtained PIP values, we ran the G-test and estimated the enrichment weight (S1 Text) for each tissue-specific annotation. Additionally, we examined whether the top variants from 95% credible sets identified for a trait were more enriched for relevant tissue-specific annotations compared to the top variants identified for other traits by Fisher’s exact tests. 
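
The inverse normal transformation applied to the covariate-adjusted residuals can be sketched as follows. The paper does not state which rank offset was used, so the sketch assumes the common (rank − 0.5) / n convention with average ranks for ties; the function name `inverse_normal_transform` and the input values are hypothetical, and the residuals are assumed to have been computed already.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Rank-based inverse normal transform using the (rank - 0.5) / n offset (an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical residuals after regressing out covariates.
residuals = [3.2, 1.0, 10.5, 2.2, 7.7]
z = inverse_normal_transform(residuals)
print(z)
```

The transform preserves the ordering of individuals while removing skew and outliers, so the resulting scores are symmetric around zero regardless of the original measurement scale.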
We used phenogram [35] to illustrate genes that harbored causal variants for at least two biomarkers to explore possible pleiotropic effects.
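
The sliding-window scheme used in the genome-wide analyses (3-Mb windows, 2-Mb overlap between neighbors, PIP retained only for the central 1 Mb) can be sketched as below. The function name `sliding_windows`, the zero-based half-open coordinates and the 6-Mb toy chromosome are assumptions for illustration, and the handling of chromosome ends is deliberately left out.

```python
def sliding_windows(chrom_length, window=3_000_000, overlap=2_000_000, keep=1_000_000):
    """List fine-mapping windows and the central region whose PIP is retained.

    Coordinates are zero-based, half-open base-pair intervals (an assumption).
    Chromosome ends are not treated specially in this sketch.
    """
    step = window - overlap         # neighboring windows start 1 Mb apart
    flank = (window - keep) // 2    # context on each side of the retained region
    windows = []
    for start in range(0, chrom_length - window + 1, step):
        end = start + window
        windows.append({"window": (start, end), "retained": (start + flank, end - flank)})
    return windows


# Hypothetical 6-Mb chromosome.
layout = sliding_windows(6_000_000)
for entry in layout:
    print(entry)
```

Because the step equals the width of the retained region, the retained regions tile the chromosome without gaps or overlaps, and every retained variant is fine-mapped with 1 Mb of neighboring variants on each side.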

Acknowledgments This study has been conducted using UK Biobank Resources under Application Number 45551 and we thank NeuroHub for providing access to data resources. This study was enabled, in part, by support from Calcul Québec and Compute Canada. We thank Dr. Robert Sladek and Dr. Josée Dupuis for helpful discussion and suggestions.

[END]
---
[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011104

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
