(C) PLOS One
This story was originally published by PLOS One and is unaltered.
Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables
Authors: Mariyam Khan, Adriaan-Alexander Ludl, Sean Bankier, Johan L. M. Björkegren (affiliations include the Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, and the Department of Medicine, Huddinge)
Date: 2024-12
Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated with multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent (not in linkage disequilibrium), which is usually not possible when considering a group of candidate genes from the same locus. Here we used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition, a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of individual causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results. Importantly, the causal effect estimates remained unbiased with small variance even when instruments were highly correlated, while the bias introduced by horizontal pleiotropy or LD matrix sampling error was comparable to that of standard MR. We applied MVMR with correlated instrumental variable sets at genome-wide significant loci for coronary artery disease (CAD) risk using expression quantitative trait locus (eQTL) data from seven vascular and metabolic tissues in the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. We confirm causal roles for PHACTR1 and ADAMTS7 in arterial tissues, among others.
However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR is run on a tissue-by-tissue basis, and testing all gene-tissue pairs with cis-eQTL associations at a given locus in a single model to predict causal gene-tissue combinations remains infeasible. Our results show that within tissues, MVMR with dependent, as opposed to independent, sets of instrumental variables significantly expands the scope for predicting causal genes in disease risk loci with pleiotropic regulatory effects. However, considering risk loci with regulatory pleiotropy that also spans across tissues remains an unsolved problem.
Although genome-wide association studies have mapped thousands of genetic variants that explain the heritable nature of many complex traits and diseases, the causal genes and mechanisms underlying these associations are often unclear. This is partly due to the widespread presence of “regulatory pleiotropy”, a phenomenon where the same genetic variants affect gene expression of multiple genes in the same genomic locus across multiple tissues. Mendelian randomization is a statistical method that uses genetic variants as instrumental variables to estimate causal effects of exposures on outcomes. Here we have extended this technique to the situation where multiple exposures can have a simultaneous effect on an outcome, and no independent instrumental variables are available for each exposure. When applied to a dataset of genetic and gene expression variation in seven vascular and metabolic tissues of 600 individuals undergoing heart surgery, our method identified candidate causal genes and tissues for coronary artery disease risk at genomic positions where regulatory pleiotropy and the extensive correlations between genetic variants made the application of existing Mendelian randomization methods infeasible. Further support for the validity of our method to identify causal genes using sets of correlated instrumental variables was provided by extensive simulations and theoretical results.
Funding: T.M. acknowledges support from the Research Council of Norway (project number 312045). J.L.M.B. acknowledges support from the Swedish Research Council (2018-02529 and 2022-00734), the Swedish Heart Lung Foundation (2017-0265 and 2020-0207), the Leducq Foundation AteroGen (22CVD04) and PlaqOmics (18CVD02) consortia; the National Institute of Health-National Heart Lung Blood Institute (NIH/NHLBI, R01HL164577; R01HL148167; R01HL148239, R01HL166428, and R01HL168174), American Heart Association Transformational Project Award 19TPA34910021, and from the CMD AMP fNIH program. T.M. and J.L.M.B. acknowledge the European Union’s Horizon Europe (European Innovation Council) programme (grant agreement number 101115381). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability: All source code related to the paper is available at
https://github.com/mariyam-khan/Causal_genes_GWAS_loci_CAD . Results reported in this paper used version 1.0.0 of the code, which has been archived at
https://doi.org/10.5281/zenodo.10091331 . Supporting data is available at
https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/VM0WKQ .
Copyright: © 2024 Khan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
To support our theoretical results, we conducted extensive simulations with binomially distributed, correlated instruments mimicking real LD from the human genome. We further applied our MVMR method with correlated instrumental variable sets to transcriptomics data from seven vascular and metabolic tissues in 600 coronary artery disease (CAD) cases undergoing surgery from the STARNET study [ 26 ] to predict causal genes and tissues at 19 genome-wide significant CAD risk loci with regulatory pleiotropy.
We show that for the exactly determined MVMR problem with correlated instruments (equal number of instruments and exposures), replacing the true covariances by their finite-sample estimates is the optimal GMM estimator. For the overdetermined system, a weighted version that corresponds exactly to the two-stage least squares solution is optimal. In particular, the solution of McDaid et al. [ 20 ] of the two-stage least squares problem in terms of summary statistics is shown to be the finite-sample GMM estimator of the “regression of regression coefficients” solution to the causal identification problem.
For the estimation problem, we expand the instrument set to the overcomplete set of all genetic variants in the locus of interest. We assume that this set contains at least as many causal variants as the number of exposures, and analyze a class of methods called the Generalized Method of Moments (GMM) [ 25 ]. As shown by Hansen [ 25 ], every previously suggested instrumental variables estimator, in linear or nonlinear models, with cross-section, time series, or panel data, as well as system estimation of linear equations can be cast as a GMM estimator. GMM is therefore viewed as a unifying framework for estimation in causal inference problems.
For the identification problem, we show that the MVMR causal graph with correlated instruments belongs to a class of causal graphs previously studied in the AI literature [ 24 ]. Specifically, we show that any set of non-redundant causal variants of the same size as the set of exposure traits, regardless of their mutual LD levels and regardless of any interactions between the exposures, satisfies Brito and Pearl’s [ 24 ] “instrumental set” condition, allowing identification of the causal effects of each exposure on the outcome. In particular, the causal effects can be expressed as the coefficients of the linear regression of the instrument-outcome covariances on the instrument-exposure covariances, thereby showing correctness of the “regression of regression coefficients” method in all settings.
The basis of our approach is Wright’s method of path coefficients [ 21 , 22 ] and its generalization into the graph structural causal inference theory by Pearl and colleagues [ 23 ]. Accordingly, we distinguish between identification and estimation of causal effects. Causal identification refers to the question whether causal effects can be expressed in terms of the true (but unknown) joint distribution of the variables in a causal graph. Causal estimation refers to the problem of estimating the identified causal effects from a finite number of observational samples from the joint distribution.
Here we clarify and strengthen the theoretical underpinnings of MVMR with correlated instrumental variables, supported by realistic simulations, to identify causal genes at genomic loci with pleiotropic gene regulatory effects. If we restrict to genetic instruments in the same locus to exclude horizontal pleiotropy, then inclusion of correlated instruments becomes a necessity.
However, the precise conditions to ensure validity of MVMR remain somewhat ambiguous, in particular with regards to the presence of linkage disequilibrium (correlations) between the SNPs in the set of instrumental variables. For instance, Burgess et al. [ 16 ] state that estimates from the two-stage least squares method in the individual-level data setting are valid even if the genetic variants are in LD [ 15 ], but require uncorrelated instruments in their proof of equivalence with the regression method in the summary data setting. McDaid et al. [ 20 ] show that the two-stage least squares estimates can be obtained from univariate summary data even if the genetic variants are in LD, allowing for two-sample MVMR analysis, albeit at the cost of the estimates no longer being expressed as a “regression of regression coefficients”. Nevertheless, despite their formula for the causal effect estimates being valid for correlated instruments, McDaid et al. [ 20 ] still follow common practice in MVMR to select only instruments spread far apart on the genome to eliminate any LD between them. Similarly, Porcu et al. [ 19 ], whose TWMR method is based on the analytic results of McDaid et al. [ 20 ], only use variants with low mutual LD (r 2 ≤ 0.1) in their analysis of loci with regulatory pleiotropy. Moreover, all simulations reported in the literature to test MVMR methods in a controlled situation use independently sampled instruments [ 15 – 17 , 19 ].
As with single-exposure MR, MVMR was first developed for phenotypic exposure traits [ 15 – 18 ]. MVMR requires a set of instrumental variable SNPs, at least as many as the number of exposures, that are associated with the exposure and outcome variables and that do not affect the outcome other than through the set of exposure variables. MVMR can then estimate the direct causal effect of each exposure on the outcome using a two-stage least squares method in the single-sample, individual-level data setting, or a “regression of regression coefficients” method in the two-sample summary data setting [ 15 – 17 ]. Recently, MVMR was applied to transcriptomic and genome-wide association study (GWAS) data in the two-sample setting, in a first attempt to overcome the challenge of regulatory pleiotropy when analyzing mRNA abundances to identify causal disease genes, in a procedure called transcriptome-wide MR (TWMR) [ 19 ].
When genes or proteins are tested as exposures one-by-one using MR, as in most studies to date, regulatory pleiotropy could clearly lead to horizontal pleiotropy as one cannot know a priori to which of the associated genes a genetic instrument exerts its influence on the outcome. Hence, to account for regulatory pleiotropy, all MR analyses using molecular abundances must employ multivariable MR (MVMR) methods, where all gene products in a genomic locus of interest are treated simultaneously as a set of potential causes for an outcome of interest.
However, analyses of large transcriptomic datasets have shown that genetic variants with “regulatory pleiotropy”, defined as variants associated with more than one gene in the same locus [ 12 ], are common. For instance, it has been found that around 10% of non-redundant cis-expression quantitative trait loci (cis-eQTLs; single nucleotide polymorphisms, SNPs) exhibit regulatory pleiotropy [ 14 ]. This covered cases where a true cis-regulatory DNA region was shared between genes as well as cases where genes had distinct regulatory elements that were in high linkage disequilibrium (LD) with one another. More recently, analyses by the GTEx Consortium of RNA abundances across 49 human tissues showed that a median of 57% of variants per tissue are associated with more than one gene in the same locus, typically co-occurring across tissues [ 12 ].
More recently, MR has been applied to estimate causal effects of molecular traits such as mRNA or protein abundances on phenotypic traits [ 7 – 10 ]. Using molecular exposure traits in MR offers potential advantages due to genetic effects on molecular abundances being much larger than those for phenotypic traits, with greater than two-fold allelic effect sizes [ 11 ] on gene expression not being uncommon [ 12 ]. Moreover, fundamental molecular biological principles reduce the risk of horizontal pleiotropy. Indeed, most genetic variants associated with mRNA and protein abundances (and phenotypic traits) are located in non-coding regions of the genome, and there are no known mechanisms by which such variants could affect outcome traits other than through affecting transcription and translation of nearby genes, typically considered to be at most a distance of 1 Mbp away [ 13 ]. Hence, by limiting the selection of genetic instruments to those in the genomic vicinity of the gene whose mRNA or protein product is used as the exposure in an MR analysis, the risk of these variants affecting the outcome through alternative pathways not including the exposure gene or protein is reduced.
Traditionally MR is used to study phenotypic traits where genetic effect sizes are small and horizontal pleiotropy can never be fully excluded, that is, alternative causal paths may exist between the genetic instruments and the outcome trait independent of the exposure trait. Hence much of the methodological development in MR has sought to address these limitations [ 5 , 6 ].
Mendelian randomization (MR) is a statistical technique that uses genetic instruments to estimate causal effects between complex traits and diseases [ 1 – 3 ]. It is based on the fact that genotypes are independently assorted and randomly distributed in a population by Mendel’s laws, and not affected by environmental or genetic confounders that affect both traits (commonly called the “exposure” and the “outcome” trait for the putative causal and affected trait, respectively). If it can be assumed that a genetic locus affects the outcome only through the exposure, then the causal effect of the exposure on the outcome can be derived from their relative associations to the genetic locus, which acts as an instrumental variable [ 4 ].
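The single-instrument logic described here (the Wald ratio) can be sketched in a few lines. This is a minimal illustration with hypothetical effect sizes and sample size, not the paper's analysis:

```python
import numpy as np

# Minimal sketch of single-instrument MR: the causal effect of the exposure
# on the outcome is the ratio of the instrument-outcome and
# instrument-exposure associations.  All effect sizes are hypothetical.
rng = np.random.default_rng(4)
n = 500_000
g = rng.binomial(2, 0.3, size=n)          # genotype used as instrument
u = rng.normal(size=n)                    # unobserved confounder of both traits
exposure = 0.4 * g + u + rng.normal(size=n)
outcome = 0.25 * exposure + u + rng.normal(size=n)   # true causal effect 0.25

beta_gx = np.cov(g, exposure)[0, 1] / np.var(g)      # instrument-exposure
beta_gy = np.cov(g, outcome)[0, 1] / np.var(g)       # instrument-outcome
print(round(beta_gy / beta_gx, 2))        # close to the true effect 0.25
```

Note that a naive regression of outcome on exposure would be biased by the confounder u; the ratio of instrument associations is not, because the genotype is independent of u.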
Materials and methods
The method of path coefficients

Causal effect identification consists of expressing the effect of an intervention on one variable on the marginal distribution of one or more other variables, using knowledge of the underlying causal diagram and the joint distribution of all variables under consideration [23]. A causal diagram is a directed acyclic graph (DAG) of directed edges between variables where a causal relationship is known or hypothesized, augmented with bidirected edges between variables known or hypothesized to be affected by unmodelled confounding factors (Fig 1); bidirected edges can be represented equivalently by two directed edges originating from a common latent factor.

Fig 1. Causal diagrams for identification (A-C) and estimation (D-F) of causal effects at GWAS loci with regulatory pleiotropy. (A) The standard instrumental variable graph is only suitable for modelling the causal effect of gene X on outcome trait Y if X is the only cis-eGene in the locus of the genome-wide significant variant E for Y. (B) A causal graph for MVMR with regulatory pleiotropy, where two or more cis-eGenes X_i are found in a genome-wide significant locus for Y; the causal effects of each X_i on Y can be identified if there exist at least as many causal variants as cis-eGenes in the GWAS locus, even when the variants are in LD with each other, because they form an instrumental set. (C) If the number of causal variants is smaller than the number of cis-eGenes and other variants in the locus are merely associated by LD, the causal effects of X_i on Y are non-identifiable. (D,E) When estimating causal effects from finite samples, we use an overdetermined set of genetic variants as instruments, without knowing the number or identity of the causal variants. (F) If the (unknown) number of causal variants in a locus is smaller than the number of cis-eGenes, causal effects cannot be estimated, a condition that can be tested by analyzing the rank or determinant of the covariance matrix between the variants and cis-eGenes. Solid arrows represent direct causal effects and dashed (bi)directed arrows represent non-zero covariances due to unobserved (U) confounders.
https://doi.org/10.1371/journal.pgen.1011473.g001

A special class of causal models are those where the causal diagram is assumed to define a linear structural equation model (SEM) [27], that is, a system of linear equations among a set of variables {X_1, …, X_p} of the form

X_i = ∑_{j≠i} c_{ij} X_j + U_i.

The U_i are error or disturbance terms representing unobserved background factors, which are assumed to have mean zero and a variance/covariance matrix [Ψ_{ij}] = Cov(U_i, U_j). It is often assumed that the error terms are normally distributed, but, as already pointed out by Wright [22], this is not necessary: the results below hold for any error distribution. Every non-zero coefficient c_{ij} corresponds to a directed edge X_j → X_i in the causal diagram, that is, c_{ij} is the direct causal effect of X_j on X_i, and every non-zero element Ψ_{ij} corresponds to a bidirected edge X_i ↔ X_j. The SEM structure implies that the variables {X_1, …, X_p} have mean zero and covariance matrix

Σ = (I − C)^{-1} Ψ (I − C^T)^{-1},

where C is the matrix of coefficients c_{ij}, C^T is its transpose, and I is the identity matrix. Causal identification in linear SEMs consists of solving for the non-zero c_{ij} (and Ψ_{ij}) in terms of the covariance matrix Σ. The method of path coefficients is a result derived by Sewall Wright [21, 22, 27], which expresses the covariance between any pair of variables in a linear SEM as a polynomial in the parameters (causal coefficients and error covariances) of the model:

σ_{XY} = Cov(X, Y) = ∑_p T(p),   (1)

where the summation ranges over unblocked paths p connecting X and Y, and T(p) is the product of the parameters (c_{ij} or Ψ_{ij}) of the edges along p. A path in a graph is a sequence of edges such that each pair of consecutive edges shares a common node and each node appears only once along the path. A path is unblocked if it does not contain a collider, that is, a pair of consecutive edges pointing at the common node. Two nodes are called d-separated if there are no unblocked paths between them.
Eq (1) is valid for standardized variables that are normalized to have zero mean and unit standard deviation; if this is not the case, a modified rule can be applied where each T(p) is multiplied by the variance of the variable that acts as the “root” for path p [27].
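Wright's rule can be checked numerically for the simplest chain E → X → Y. The following is a minimal sketch; the path coefficients a and c are hypothetical, and the error variances are chosen so that all variables are standardized:

```python
import numpy as np

# Numeric check of Wright's path rule for the chain E -> X -> Y
# with standardized variables.  a and c are hypothetical path coefficients.
a, c = 0.6, 0.3

# Structural coefficient matrix: C[i, j] is the direct effect of variable j
# on variable i, with variable order (E, X, Y).
C = np.array([[0.0, 0.0, 0.0],
              [a,   0.0, 0.0],
              [0.0, c,   0.0]])

# Error variances chosen so every variable has unit variance (standardization).
Psi = np.diag([1.0, 1.0 - a**2, 1.0 - c**2])

# Implied covariance matrix: Sigma = (I - C)^{-1} Psi (I - C^T)^{-1}.
M = np.linalg.inv(np.eye(3) - C)
Sigma = M @ Psi @ M.T

# Wright's rule: cov(E, Y) is the product of the coefficients along the
# single unblocked path E -> X -> Y, i.e. a * c; cov(E, X) is just a.
assert np.isclose(Sigma[0, 2], a * c)
assert np.isclose(Sigma[0, 1], a)
print(np.round(Sigma[0, 2], 6))  # 0.18
```

The check is exact (up to floating point) because it works with the model-implied covariance matrix rather than simulated samples.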
Instrumental sets

Fix a variable Y in a linear SEM, and assume that the SEM contains the equation

Y = c_1 X_1 + ⋯ + c_K X_K + U_Y.

Brito and Pearl [24] derived a sufficient graphical condition for identifying the parameters c_1, …, c_K by application of Wright's rule (1), of which we present a simplified version. The set of variables {E_1, …, E_K} is said to be an “instrumental set” relative to {X_1, …, X_K} if there exist pairs (E_1, p_1), …, (E_K, p_K) such that for each i the following three conditions hold:

1. E_i is a non-descendant of Y, and p_i is an unblocked path between E_i and Y including the edge X_i → Y.
2. E_i is d-separated from Y in the graph obtained by removing the edges X_1 → Y, …, X_K → Y from the original graph.
3. For j > i, the variable E_j does not appear in path p_i, and if paths p_i and p_j have a common variable V, then both p_i[V ∼ Y] and p_j[E_j ∼ V] point to V, where p_i[V ∼ Y] and p_j[E_j ∼ V] denote the truncations of p_i from V onwards and of p_j up to V, respectively.

If {E_1, …, E_K} is an instrumental set relative to {X_1, …, X_K}, then the parameters of the edges X_1 → Y, …, X_K → Y are identifiable and can be computed by solving a set of linear equations [24]. For concrete examples verifying the instrumental set conditions, see S1 File.
Causal effect identification in MVMR with correlated instruments

We assume genes X_1, …, X_K are colocated in the same genomic locus and have a potential causal effect on a phenotypic trait Y associated with the locus. We assume there exist K causal variants E_1, …, E_K in the locus such that each gene has at least one causal variant (causal variants may be shared between genes). The variants are correlated by LD, the exposures X_i can be connected by causal relations or unmodelled latent factors, and likewise there can be latent factors affecting the exposures X_i and outcome Y. Randomization of genotypes ensures that there are no unmodelled correlations between the E_j and X_i. An example causal diagram for K = 2 is shown in Fig 1B. We now show that {E_1, …, E_K} is an instrumental set relative to {X_1, …, X_K}. By relabelling, we may assume that there exists an edge E_i → X_i in the causal diagram. Define the paths p_i = {E_i → X_i, X_i → Y}. Since E_i is d-separated from Y in the diagram where all edges X_j → Y are removed, it follows that the set {(E_1, p_1), …, (E_K, p_K)} satisfies the three instrumental set conditions (see above) and the causal parameters c_i are identifiable. To derive them, we apply Wright's rule (Eq (1)), following the general proof of Brito and Pearl [24]. By Eq (1),

σ_{E_i Y} = ∑_p T(p),   (2)

where the sum is over unblocked paths p between E_i and Y. It is clear that any path that starts in E_i, ends in Y, and includes a bidirected edge X_j ↔ Y must be blocked by a collider V → X_j ↔ Y, where V can be either a variant or another gene. Hence no such paths enter the sum in Eq (2). In other words, all unblocked paths must end with an edge X_j → Y, contributing a factor c_j to T(p). Collecting all paths by their final edge, the sum can be decomposed as

σ_{E_i Y} = ∑_j c_j ∑_{p′} T(p′),

where the inner sum now extends over the subset of paths from E_i to X_j obtained by truncating the unblocked paths from E_i to Y which have X_j as their penultimate node.
Since the truncation of an unblocked path is also necessarily unblocked, each p′ is an unblocked path from E_i to X_j. Vice versa, every unblocked path from E_i to X_j can be extended to an unblocked path from E_i to Y by adding the edge X_j → Y. Hence, the sum over p′ is the sum over all unblocked paths from E_i to X_j. By another application of Wright's rule (1),

∑_{p′} T(p′) = σ_{E_i X_j},

and hence

σ_{E_i Y} = ∑_j c_j σ_{E_i X_j}.

In matrix-vector notation, we obtain

Σ_EX c = Σ_EY,

where c = (c_1, …, c_K)^T is the vector of causal effects, and Σ_EX and Σ_EY are the matrix and vector of covariances σ_{E_i X_j} and σ_{E_i Y}, respectively. Since Σ_EX is not a symmetric matrix, we cannot assume that an inverse exists, but we can always write the solution to the linear set of equations using a generalized left inverse:

c = (Σ_EX^T Σ_EX)^{-1} Σ_EX^T Σ_EY,   (3)

that is, find the least squares solution. These equations are valid for standardized variables.
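The identification result above, causal effects recovered by regressing instrument-outcome covariances on instrument-exposure covariances, can be checked numerically. The following is a sketch with hypothetical effect sizes, two highly correlated instruments, and an unobserved confounder of exposures and outcome:

```python
import numpy as np

# Numeric check: with two exposures and two correlated instruments, the
# direct causal effects are recovered from the covariances Sigma_EX and
# Sigma_EY, even under strong confounding.  All effect sizes are hypothetical.
rng = np.random.default_rng(0)
N = 1_000_000
c_true = np.array([0.3, -0.2])            # direct effects of X1 and X2 on Y

# Two instruments with correlation 0.9, mimicking strong LD.
R = np.array([[1.0, 0.9], [0.9, 1.0]])
E = rng.multivariate_normal(np.zeros(2), R, size=N)

A = np.diag([0.5, 0.5])                   # E_i -> X_i instrument effects
W = rng.normal(size=N)                    # unobserved confounder of X and Y
X = E @ A + 0.5 * W[:, None] + rng.normal(size=(N, 2))
Y = X @ c_true + 0.8 * W + rng.normal(size=N)

# Sample covariances (variables centred).
Ec, Xc, Yc = E - E.mean(0), X - X.mean(0), Y - Y.mean()
Sigma_EX = Ec.T @ Xc / N                  # instrument-exposure covariances
Sigma_EY = Ec.T @ Yc / N                  # instrument-outcome covariances

# Generalized-left-inverse (least squares) solution of Sigma_EX c = Sigma_EY.
c_hat = np.linalg.lstsq(Sigma_EX, Sigma_EY, rcond=None)[0]

# Naive regression of Y on X is biased by the confounder W.
c_ols = np.linalg.lstsq(Xc, Yc, rcond=None)[0]

print(np.round(c_hat, 2))                 # close to [0.3, -0.2]
```

The contrast with the naive estimate illustrates why the instruments are needed: the confounder inflates the direct regression of Y on X, while the covariance-based solution is unaffected.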
The generalized method of moments for causal effect estimation in MVMR with correlated instruments

To estimate the causal effects from observational data, we consider multiple finite-sample approximations to the above equations. For finite-sample estimation, the number of instruments can be greater than the number of exposures, that is, we assume we have N observations of the random variables {E_1, …, E_L}, {X_1, …, X_K}, and Y, with L ≥ K. We use lower-case letters to denote sample values, e.g. e_{il} denotes the value of E_l in observation i. We use the notation e_{i·}, e_{·l}, and e_{··} to represent the vector of sample values for all L instruments in observation i, the vector of all N observations of instrument l, and the N × L matrix of all observations for all instruments, respectively, and similarly for x_{i·}, x_{·k}, and x_{··}. We assume the observations for all variables are standardized to mean zero and unit standard deviation. The generalized method of moments (GMM) [25] is based on moment functions that depend on observable random variables and unknown parameters, and that have zero expectation in the population when evaluated at the true parameters:

E[g(w_i, c)] = 0,   (4)

where g denotes the moment function, w_i is a vector of samples of random variables in observation i, and c is a vector of unknown parameters. The moment function g can be linear or nonlinear. Assuming a linear MVMR SEM as before, the natural moment functions are of the form

g_l(w_i, c) = e_{il} (y_i − x_{i·}^T c),  l = 1, …, L,   (5)

that is, we consider one moment equation per instrument.
From the structural equation for Y in the assumed SEM (see above and Fig 1), it follows that Y − X^T c = U_Y, the residual error on Y, and since all instruments E_l are assumed independent of U_Y (no bidirected arrows between E_l and Y), that is, E[E_l U_Y] = 0, it follows immediately that the moment conditions are satisfied:

E[g_l(w_i, c)] = E[E_l (Y − X^T c)] = E[E_l U_Y] = 0.

Replacing the expectation with the empirical mean over the observations results in L equations in K unknowns (c_1, …, c_K) which can be written in matrix-vector notation as

e_{··}^T (y_{·} − x_{··} c) = 0.   (6)

If the number of instruments equals the number of unknowns, L = K, this system can be solved exactly, and by the standardization of all variables, the solution reduces to the least-squares estimator ĉ_LS,

ĉ_LS = (e_{··}^T x_{··})^{-1} e_{··}^T y_{·},   (7)

which is also obtained by replacing the true covariances in the instrumental variable formula (3) by their empirical estimates. If the model is overdetermined, L > K, then in general there is no unique solution to (6), because not all L sample moments will hold exactly. While the least-squares estimator provides one approximate solution to the overdetermined system, Hansen [25] proposed an alternative solution, based on bringing the sample moments as close to zero as possible by minimizing the quadratic form

J(c) = [ (1/N) ∑_i g(w_i, c) ]^T Δ [ (1/N) ∑_i g(w_i, c) ]   (8)

with respect to the parameters c, where Δ is a positive definite L × L weighting matrix, and the quantity between the square brackets is understood to be the L-vector of moment functions. It can be shown that this yields a consistent estimator of c, under certain regularity conditions. Again using the standardization of all variables, the cost function can be written as

J(c) = (1/N²) (y_{·} − x_{··} c)^T e_{··} Δ e_{··}^T (y_{·} − x_{··} c),

with solution

ĉ = (x_{··}^T e_{··} Δ e_{··}^T x_{··})^{-1} x_{··}^T e_{··} Δ e_{··}^T y_{·}.   (9)

When Δ = I, the identity matrix, we again obtain the least-squares estimator, ĉ_LS. While all choices of Δ lead to a consistent estimator, in finite samples different Δ-matrices will lead to different point estimates.
Hansen [28] showed that the GMM estimator with the smallest asymptotic variance is obtained by taking Δ equal to the inverse of the variance-covariance matrix of the moment functions (5). If the errors are homoskedastic, Var(U_Y) = σ², we have

Cov(g_l, g_{l′}) = σ² E[E_l E_{l′}],

so the optimal Δ is proportional to (e_{··}^T e_{··})^{-1}, and we obtain the estimator

ĉ_GMM = (x_{··}^T e_{··} (e_{··}^T e_{··})^{-1} e_{··}^T x_{··})^{-1} x_{··}^T e_{··} (e_{··}^T e_{··})^{-1} e_{··}^T y_{·}.   (10)

We will refer to this estimator as the GMM estimator. Note that this estimator is equivalent to the two-stage least squares estimator.
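The least-squares estimator (Eq (7)), the optimally weighted GMM estimator (Eq (10)), and the equivalence of the latter with two-stage least squares can be sketched in a short simulation. Effect sizes, the AR(1) LD structure, and sample sizes below are all hypothetical:

```python
import numpy as np

# Sketch of the estimators discussed above, for an overdetermined system
# (L = 5 instruments, K = 2 exposures) with correlated instruments and an
# unobserved confounder.  All effect sizes are hypothetical.
rng = np.random.default_rng(1)
N, L, K = 500_000, 5, 2
c_true = np.array([0.3, -0.2])

# AR(1)-correlated instruments mimicking LD between nearby variants.
R = 0.8 ** np.abs(np.subtract.outer(np.arange(L), np.arange(L)))
E = rng.multivariate_normal(np.zeros(L), R, size=N)

A = np.array([[0.3, 0.0], [0.2, 0.0], [0.1, 0.1], [0.0, 0.2], [0.0, 0.3]])
W = rng.normal(size=N)                       # unobserved confounder
X = E @ A + 0.5 * W[:, None] + rng.normal(size=(N, K))
y = X @ c_true + 0.8 * W + rng.normal(size=N)

E -= E.mean(0); X -= X.mean(0); y -= y.mean()

# Least squares: solve the L sample moment equations E^T (y - X c) = 0
# in the least-squares sense (exact when L == K).
c_ls = np.linalg.lstsq(E.T @ X, E.T @ y, rcond=None)[0]

# GMM with optimal weighting Delta = (E^T E)^{-1}.
Delta = np.linalg.inv(E.T @ E)
B = X.T @ E @ Delta                          # K x L
c_gmm = np.linalg.solve(B @ (E.T @ X), B @ (E.T @ y))

# Two-stage least squares: regress X on E, then y on the fitted values.
X_hat = E @ np.linalg.lstsq(E, X, rcond=None)[0]
c_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

assert np.allclose(c_gmm, c_2sls, atol=1e-6)   # GMM coincides with 2SLS
assert np.allclose(c_gmm, c_true, atol=0.05)
```

The first assertion holds by algebra (both estimators project onto the instrument space with the same projection matrix), not only in this particular simulation.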
Simulations with randomly generated instrument correlations

We simulated data from idealized models corresponding to the causal diagrams in Fig 1, with binomially distributed instruments and normally distributed exposure and outcome variables. For the diagrams with a single exposure (Fig 1A and 1D), we set the true causal effect c_X = 0.3. In the case of one instrument (Fig 1A) we varied the instrument strength a between 0.01 (weak instrument), 0.3 (mediocre instrument), and 0.8 (strong instrument). In the case of multiple instruments (Fig 1D), we simulated ten instruments with randomly sampled instrument strengths, either all strong (effect sizes between 0.1 and 0.3) or all weak (effect sizes between 0.001 and 0.03). For the diagrams with multiple exposures and correlated instruments (Fig 1B, 1C, 1E and 1F), we fixed the true causal effects c_{X_1} and c_{X_2} in all simulations. To test the effect of instrument correlation, we simulated data with two instruments (Fig 1B) where the pairwise correlation of the instruments varied between 0.01, 0.3, 0.7, 0.9 and 0.95. The matrix A = [a_{ij}] of instrument effect sizes, where a_{ij} is the direct causal effect of E_i on X_j, was generated randomly such that its determinant was greater than 0.05 and there were no weak instruments (a_{ij} between 0.1 and 0.3). To test the effect of weak instruments, we simulated data with two instruments (Fig 1B) with randomly generated non-zero pairwise correlation of the instruments and a randomly generated matrix of instrument effect sizes A, with determinant either less than 0.001 and effect size of each instrument less than 0.001, or determinant greater than 0.05 and effect size of each instrument between 0.1 and 0.3.
To test the robustness of the causal effect estimates, we simulated data with two instruments and two exposures (Fig 1B), with randomly generated non-zero pairwise correlation between the instruments, a randomly generated matrix of instrument effect sizes A with determinant greater than 0.05, and the same fixed true causal effects of the exposures on the outcome as above, but estimated the causal effects for each exposure separately under the (false) hypothesis that the data was generated from Fig 1A.
Simulations with instrument correlations derived from real LD matrices

For more realistic simulations, we simulated data with binomially distributed genotypes and normally distributed exposure and outcome variables in such a way that the summary statistics from the simulated data match summary statistics at real GWAS loci for coronary artery disease (CAD). We used eQTL summary statistics from GTEx [12] and CAD GWAS summary statistics from the CARDIoGRAMplusC4D meta-analysis [29]. For a given locus of interest with cis-eQTLs E_1, …, E_L for genes X_1, …, X_K (L ≥ K), with LD matrix Σ_E (size L × L), matrix of eQTL summary statistics A (size L × K), and vector of CAD GWAS summary statistics Σ_EY (size L × 1), we simulated data as follows. To sample binomially distributed genetic instruments according to a predefined LD matrix, we made the simplifying assumption that the correlation structure between the simulated instruments can be described by a Markov chain. We sampled the first SNP's genotype from a binomial distribution with probability of success equal to its real minor allele frequency (MAF). For every other SNP, we sampled a genotype from a binomial distribution with probability of success conditional on the genotype of the previous SNP in the sequence, such that both the MAF and the pairwise LD between successive SNPs match the real data from the simulated locus. Further details are given in S1 File. We assumed that a randomly selected subset of at least K eQTLs were causal and all others were associated with the exposures only through LD. For the causal eQTLs, we sampled the vector a of instrument effect sizes randomly between 0.1 and 0.3, and sampled exposure data with noise variance σ² = 1. Finally, we set the causal effect c of X on the outcome Y equal to the value obtained from Eq (3) with the real Σ_EY, and sampled outcome data from Y = X^T c + U_Y with noise variance σ² = 1.
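The Markov-chain genotype sampler can be sketched as follows. For simplicity, this sketch assumes a single common MAF for all SNPs and a single target LD value r between successive SNPs (both my simplifications; the procedure described above matches per-SNP MAFs and per-pair LD from real data, see S1 File), and works at the haplotype level rather than conditioning genotype probabilities directly:

```python
import numpy as np

# Simplified sketch of a Markov-chain genotype sampler.  Simplifying
# assumptions (not the paper's exact procedure): one common minor allele
# frequency `maf` for all SNPs and one target LD value `r` between each
# pair of successive SNPs.
def sample_genotypes(n, n_snps, maf, r, rng):
    """Binomial(2, maf) genotypes whose successive SNPs have correlation ~ r."""
    # Two haplotypes per individual, each an allele-level Markov chain:
    # each allele is copied from the previous SNP with probability r,
    # otherwise drawn fresh as Bernoulli(maf).  This preserves the marginal
    # MAF at every SNP and gives correlation r between successive SNPs.
    hap = rng.binomial(1, maf, size=(2, n, n_snps)).astype(float)
    for l in range(1, n_snps):
        copy = rng.random(size=(2, n)) < r
        hap[:, :, l] = np.where(copy, hap[:, :, l - 1], hap[:, :, l])
    return hap.sum(axis=0)          # genotypes coded 0/1/2

rng = np.random.default_rng(2)
G = sample_genotypes(100_000, 5, maf=0.3, r=0.8, rng=rng)

corr = np.corrcoef(G, rowvar=False)
print(np.round(corr[0, 1], 2))      # close to the target r = 0.8
```

Under this construction, LD decays geometrically with distance (correlation r^d between SNPs d steps apart), which is the Markov-chain property the simulations rely on.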
We performed simulations to analyze the influence of LD matrix accuracy, weak instrument bias, the standard error calculation method, and horizontal pleiotropy. For the simulations on the accuracy of the LD matrix, we generated datasets for the SLC22A3-LPA-PLG locus (centered on chr 6:161089307), with the expression of its genes in liver as exposures. Causal effects of the genes on the outcome variable Y were calculated to be c_SLC22A3 = 0.15, c_LPA = −0.05 and c_PLG = −0.27 when seven instruments were used, and the same values were also used in the other simulations. We compared the situations where L = 3 (LD threshold of 0.25), L = 4 (LD threshold of 0.3), L = 5 (LD threshold of 0.4), L = 6 (LD threshold of 0.5), and L = 7 instruments (LD threshold of 0.8) were used. We performed one-sample simulations where the sample size varied between 500 and 30000. For the instrument effect sizes, we assumed that three eQTLs were causal for all genes (their effect sizes sampled from a uniform distribution within the range 0.1 − 0.3), while additional eQTLs were assumed to be associated with the exposures only through LD. For the simulations on weak instrument bias, we generated datasets within the same locus and setup as the LD accuracy simulations, except that here we kept L = 8 eQTLs (LD threshold of 0.96) and varied the sample size between 500 and 10000. For the instrument effect sizes, we again assumed that only three eQTLs were causal for all genes, with effect sizes sampled from a uniform distribution within the range 0.1 − 0.3 for strong instruments and within the range 0.001 − 0.01 for weak instruments. We confirmed that these instruments were indeed strong or weak by computing conditional F-statistics using the MendelianRandomization package [30] as well as the MVMR package [31] (S3 Fig).
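The conditional F-statistics mentioned above come from the MendelianRandomization and MVMR R packages. As a simplified, hypothetical illustration, the ordinary (unconditional) first-stage F-statistic for a single exposure can be computed as follows; note that this is not the conditional F-statistic, which additionally accounts for instrument strength given the other exposures:

```python
import numpy as np

# Simplified instrument-strength diagnostic: the ordinary first-stage
# F-statistic for one exposure regressed on all instruments.  Effect-size
# ranges mirror the strong/weak settings described in the text; everything
# else is hypothetical.
def first_stage_F(E, x):
    """F-statistic for the regression of exposure x on instrument matrix E."""
    n, L = E.shape
    Ec, xc = E - E.mean(0), x - x.mean()
    beta, *_ = np.linalg.lstsq(Ec, xc, rcond=None)
    r2 = 1 - np.sum((xc - Ec @ beta) ** 2) / np.sum(xc ** 2)
    return (r2 / L) / ((1 - r2) / (n - L - 1))

rng = np.random.default_rng(3)
n, L = 2000, 8
E = rng.normal(size=(n, L))
x_strong = E @ rng.uniform(0.1, 0.3, L) + rng.normal(size=n)
x_weak = E @ rng.uniform(0.001, 0.01, L) + rng.normal(size=n)

print(first_stage_F(E, x_strong))   # well above the conventional threshold of 10
print(first_stage_F(E, x_weak))     # near 1: weak instruments
```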
For the simulations comparing standard errors computed approximately from summary-level data with those computed exactly from individual-level data, we generated datasets at the same locus and with the same setup as the previous simulations. We kept L = 7 eQTLs (LD threshold of 0.95) and varied the sample size between 500 and 10000. For the instrument effect sizes, we again assumed that only three eQTLs were causal for all genes. Details about the standard error calculation using summary- and individual-level data are given in S1 File. To simulate the effect of horizontal pleiotropy, we generated datasets for the SLC22A3-LPA-PLG locus with causal effects of the genes on the outcome variable Y set to cLPA = −0.05, cPLG = −0.27, and cSLC22A3 ∈ [0.0, 0.4]. We used eight instruments of which only three eQTLs were causal for all genes, with effect sizes sampled from a uniform distribution within the range 0.1 − 0.3. We estimated the effects in a misspecified estimation model in which we performed MVMR using only the LPA and PLG genes, so that horizontal pleiotropy acted through one unknown/unmeasured gene, SLC22A3, in the same locus. To simulate the potential bias introduced by two-sample MR, we generated independently sampled eQTL and GWAS datasets for the MRAS-ESYT3 locus in the aorta tissue of STARNET, which share L = 5 eQTLs (locus centered on chr 3:138121920), with causal effects of the genes on the outcome variable Y set to cMRAS = 0.208 and cESYT3 = −0.294. We simulated eQTL sample sizes between 300 and 4000 and GWAS sample sizes between 10000 and 140000.
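The horizontal-pleiotropy experiment can be mimicked with a minimal individual-level sketch. It uses plain two-stage least squares rather than the paper's LS/GMM estimators, independent SNPs, and a made-up instrument-effect matrix A; only the causal-effect values echo the SLC22A3-LPA-PLG setup. Omitting one colocated exposure turns its instrument effects into unmodeled (horizontal) pleiotropy and biases the remaining estimates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, L, K = 200000, 5, 3
E = rng.binomial(2, 0.3, size=(n, L)).astype(float)  # independent SNPs

# Hypothetical instrument -> exposure effects (not the real eQTL betas).
A = np.array([[0.3, 0.0, 0.2],
              [0.2, 0.3, 0.0],
              [0.0, 0.2, 0.3],
              [0.3, 0.2, 0.0],
              [0.0, 0.3, 0.2]])
c_true = np.array([0.15, -0.05, -0.27])  # SLC22A3, LPA, PLG

X = E @ A + rng.normal(size=(n, K))      # exposures (gene expression)
Y = X @ c_true + rng.normal(size=n)      # outcome

def tsls(E, X, Y):
    """Two-stage least squares: regress Y on the instrument-predicted X."""
    B = np.linalg.lstsq(E, X, rcond=None)[0]
    Xhat = E @ B
    return np.linalg.lstsq(Xhat, Y, rcond=None)[0]

c_full = tsls(E, X, Y)         # all three exposures modeled
c_mis = tsls(E, X[:, 1:], Y)   # SLC22A3 omitted: pleiotropic bias on LPA/PLG
```

In the full model all three estimates land close to c_true, while in the misspecified model the LPA estimate in particular is pulled away from −0.05, because the omitted SLC22A3 shares instruments with the modeled genes.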
Comparison to other MVMR methods

We compared the least squares (LS) (Eq (7)) and GMM (Eq (10)) estimators to seven other multivariable MR methods: Transcriptome-Wide MR (TWMR) [19], the mv_multiple estimator from the TwoSampleMR package [32], the mv_mvivw estimator from the MendelianRandomization package [30], the MVMR-Robust, MVMR-Median, and MVMR-Lasso estimators [33], and the MVMR-cML estimator [34]. The least squares and GMM estimators can also be applied in the univariate MR setting. In this setting we performed a comparison to all available methods in the MR-Base package, namely Inverse Variance Weighted (IVW), Weighted Median (WM), Maximum Likelihood (MaxLik), and MR-Egger (Egger). We also included TWMR in the comparison. In the course of our analysis we identified the presence of a “regularization” term in the TWMR source code that was hard-coded to a specific value, presumably related to the sample size of the use case from their paper [10]. More precisely, for the weight matrix used in the GMM estimator, H = ((XᵀE)(EᵀE)⁻¹(EᵀX))⁻¹, they applied shrinkage towards the identity matrix I, with a shrinkage intensity α determined by a hard-coded sample size N = 3781. Results reported here used the published version of TWMR, including this hard-coded term.
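For concreteness, the GMM weight matrix named in the text can be computed directly from individual-level matrices; the sketch below uses simulated data purely to show the shapes involved (E: n × L instruments, X: n × K exposures). It is this K × K matrix that TWMR's hard-coded term shrinks towards the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, K = 5000, 4, 2
E = rng.binomial(2, 0.3, size=(n, L)).astype(float)                   # instruments
X = E @ rng.uniform(0.1, 0.3, size=(L, K)) + rng.normal(size=(n, K))  # exposures

# Weight matrix from the text: H = ((X^T E)(E^T E)^-1 (E^T X))^-1
XtE = X.T @ E
H = np.linalg.inv(XtE @ np.linalg.solve(E.T @ E, XtE.T))
```

The inner product sandwiches the instrument Gram matrix, so H is symmetric positive definite whenever XᵀE has full row rank, i.e. whenever the K exposures are collectively instrumented.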
[END]
---
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011473
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.