(C) PLOS One

This story was originally published by PLOS One and is unaltered.

Combining Mendelian randomization and network deconvolution for inference of causal networks with GWAS summary data [1]

Zhaotong Lin (Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States of America), Haoran Xue, Wei Pan

Date: 2023-06

Mendelian randomization (MR) has been increasingly applied for causal inference with observational data by using genetic variants as instrumental variables (IVs). However, the current practice of MR has been largely restricted to investigating the total causal effect between two traits, while it would be useful to infer the direct causal effect between any two of many traits (by accounting for indirect or mediating effects through other traits). For this purpose we propose a two-step approach: we first apply an extended MR method to infer (i.e. both estimate and test) a causal network of total effects among multiple traits, then we modify a graph deconvolution algorithm to infer the corresponding network of direct effects. Simulation studies showed much better performance of our proposed method than existing ones. We applied the method to 17 large-scale GWAS summary datasets (with median N = 256879 and median #IVs = 48) to infer the causal networks of both total and direct effects among 11 common cardiometabolic risk factors, 4 cardiometabolic diseases (coronary artery disease, stroke, type 2 diabetes, atrial fibrillation), Alzheimer’s disease and asthma, identifying some interesting causal pathways. We also provide an R Shiny app ( https://zhaotongl.shinyapps.io/cMLgraph/ ) for users to explore any subset of the 17 traits of interest.

Understanding causal relationships and building a causal network among several risk factors and diseases is key to therapeutic development and informed clinical decision making. The development of Mendelian randomization (MR) opens a door to using observational GWAS summary data for causal inference between a pair of traits. Here we first take advantage of bi-directional Mendelian randomization to infer the total causal effect between each pair of traits in the network without specifying an exposure and an outcome a priori. Then we apply network deconvolution to the (estimated) total causal network, obtaining an estimated direct causal network, each edge of which represents the direct causal effect of one trait on another that is not mediated through any other traits. However, most MR methods are vulnerable to violation of the valid instrumental variable assumptions and are based on two independent samples, while GWAS sample overlap becomes inevitable as more large-scale consortium data are used. Therefore we propose a robust and efficient MR method, called MR-cML-C, that accommodates overlapping samples, and show that it has desirable statistical properties. We also clarify both finite-sample and large-sample properties of the causal parameter estimator under the incorrect independence assumption (i.e. ignoring sample overlap). By reconstructing both a total and a direct causal network of 17 traits, including 11 common cardiometabolic risk factors and 6 diseases, we demonstrate the usefulness of our method.

Introduction

A fundamental task in science is to understand causal pathways among various risk factors and diseases. This is particularly challenging with observational data due to the likely presence of hidden confounding, which implies that an observed association is not equivalent to causation. In our real data example, we would like to infer which of several known risk factors are causal to coronary artery disease (CAD). While many previous studies have established, for example, that obesity is associated with CAD [1], whether it is causal, especially independently of other known risk factors, is still debatable, with conflicting results from observational studies [2]. Mendelian randomization (MR) is a powerful tool to infer a causal relationship between two traits in the presence of unmeasured confounding, by using single nucleotide polymorphisms (SNPs) as instrumental variables (IVs) [3–5]. A distinct and useful feature of MR is its applicability when the two traits come from two different genome-wide association study (GWAS) summary datasets. Conventional MR analysis usually assumes that the causal direction, from an exposure to an outcome, is known. When the direction is not clear, bidirectional MR can be applied [6, 7]. However, such a causal estimate only reflects the total causal effect of one trait on the other, which possibly consists of both a direct effect and an indirect effect mediated through other factors [8–11]. In our motivating real data example, we would like to estimate causal relationships among multiple common risk factors and diseases; we are interested not only in the total effect of a risk factor, say obesity/BMI, on a disease, say CAD, but also in its direct effect after accounting for possible mediating effects through other risk factors. In addition, we generally do not want to pre-specify any causal directions because, for example, there may be a bidirectional relationship between BMI and CAD.
For this purpose, we propose a two-step framework to infer both total and direct causal networks, allowing bi-directional relationships (i.e. cycles). In the first step, we apply bidirectional MR on every pair of traits to construct a total causal (effect) graph, depicting the total causal effect from one node to the other. In the second step, we apply network deconvolution [12] to the (estimated) total causal network to estimate the direct causal (effect) graph, each edge of which measures the direct effect of one node on the other after accounting for mediating effects through other nodes in the graph.
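The link between the two graphs can be illustrated with a minimal sketch. It assumes the linear model behind the original network deconvolution [12], in which the total-effect matrix sums the effects along paths of every length, G_tot = G_dir + G_dir^2 + ... = G_dir (I - G_dir)^{-1}. The series converges when the spectral radius of G_dir is below 1, and it inverts to G_dir = (I + G_tot)^{-1} G_tot. The function names, trait labels and effect sizes below are hypothetical, and the code is the plain closed form, not the modified algorithm proposed in this paper.

```python
import numpy as np

def total_from_direct(g_dir):
    """Sum effects over paths of every length: G_dir + G_dir^2 + ..."""
    g_dir = np.asarray(g_dir, dtype=float)
    # The geometric series only converges if the spectral radius is below 1.
    if np.max(np.abs(np.linalg.eigvals(g_dir))) >= 1:
        raise ValueError("path series diverges: spectral radius must be < 1")
    eye = np.eye(g_dir.shape[0])
    return g_dir @ np.linalg.inv(eye - g_dir)

def direct_from_total(g_tot):
    """Closed-form deconvolution: solve (I + G_tot) G_dir = G_tot."""
    g_tot = np.asarray(g_tot, dtype=float)
    eye = np.eye(g_tot.shape[0])
    return np.linalg.solve(eye + g_tot, g_tot)

# Hypothetical 3-trait graph; entry [i, j] is the effect of trait i on trait j.
traits = ["exposure", "mediator", "outcome"]
g_dir = np.array([
    [0.0, 0.3, 0.2],   # exposure -> mediator (0.3), exposure -> outcome (0.2)
    [0.0, 0.0, 0.5],   # mediator -> outcome (0.5)
    [0.0, 0.0, 0.0],
])

g_tot = total_from_direct(g_dir)   # total exposure -> outcome: 0.2 + 0.3 * 0.5
g_rec = direct_from_total(g_tot)   # recovers g_dir up to rounding error
```

In this toy graph the total effect of the exposure on the outcome is the direct effect plus the effect mediated through the second trait, and deconvolution strips the mediated part back out. In practice the total-effect matrix is estimated with error, which is why valid inference on the deconvolved graph needs additional work.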

In principle, any bidirectional MR method could be used in the first step. However, the inference of the direct causal graph depends crucially on the validity of the estimated total causal effects in the first step, which relies on the three key IV assumptions in MR: (i) Relevance assumption: IVs are associated with the exposure; (ii) Independence assumption: IVs are independent of unmeasured confounding; (iii) Exclusion restriction: IVs affect the outcome only through the exposure. These assumptions may be violated due to pervasive horizontal pleiotropy [13, 14]. Under the plurality assumption (that the valid IVs form the largest group of IVs sharing the same causal parameter value), MR-cML is robust to the presence of some invalid IVs violating any or all of the three IV assumptions and has been shown to perform better than many existing methods under various scenarios [15]. Furthermore, as shown before [16], with a simple IV screening procedure, MR-cML achieves good performance in inferring both causal directions and effect sizes between two traits while allowing bidirectional relationships (i.e. each trait may be causal to the other at the same time). Thus, we apply MR-cML in our causal graph framework, which we call Graph-MRcML.
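For orientation, the way summary-data MR turns per-SNP association estimates into a causal estimate can be sketched with the standard inverse-variance weighted (IVW) estimator. This is a simpler baseline, not MR-cML: it assumes that every IV is valid and that the exposure and outcome samples are independent, which are the assumptions relaxed in this work. The function names and the summary statistics below are hypothetical.

```python
import math

def wald_ratios(beta_x, beta_y):
    """Per-SNP causal estimates: SNP-outcome effect over SNP-exposure effect."""
    return [by / bx for bx, by in zip(beta_x, beta_y)]

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate and its standard error.

    Uses first-order weights, i.e. it ignores the uncertainty in beta_x
    and treats all SNPs as valid, mutually independent instruments.
    """
    num = sum(bx * by / s ** 2 for bx, by, s in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y))
    if den == 0:
        raise ValueError("no instrument is associated with the exposure")
    return num / den, math.sqrt(1.0 / den)

# Hypothetical summary statistics for three SNPs used as IVs.
beta_x = [0.10, 0.20, 0.05]    # SNP-exposure associations
beta_y = [0.05, 0.10, 0.025]   # SNP-outcome associations
se_y = [0.01, 0.02, 0.01]      # standard errors of the SNP-outcome associations

ratios = wald_ratios(beta_x, beta_y)
estimate, std_err = ivw_estimate(beta_x, beta_y, se_y)
```

In a bidirectional analysis the same calculation would be run twice with the roles of the two traits swapped, using a separate set of IVs selected for each direction. A pleiotropic SNP would show up as a Wald ratio that departs from the others, which is the kind of invalid IV a robust method is designed to handle.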

One limitation of the original MR-cML is that it is implemented only for two-sample MR (i.e., assuming two independent GWAS summary datasets) [15]. However, in practice, multiple traits may come from the same study, such as the several lipid traits from the Global Lipids Genetics Consortium GWAS data used in our real data example [17]. More generally, as more international consortia and large-scale biobanks emerge, overlapping samples between some GWAS datasets become inevitable. It has been shown that sample overlap may lead to biased estimates and inflated type-I errors in MR [18]. To address this, we first extend MR-cML to the overlapping-sample set-up, which turns out to be non-trivial, especially with respect to valid statistical inference. In addition, we establish theory showing that, perhaps surprisingly, the bias of the causal parameter estimator under the incorrect independence assumption (i.e. ignoring sample overlap) disappears asymptotically (as the sample size increases); however, the usual (model-based) variance estimator is biased, so we propose a robust/sandwich estimator. More importantly, the causal parameter estimator fully accounting for sample overlap is more efficient than the one under the working independence assumption. We emphasize that, as a distinct feature, our proposed method not only estimates causal networks but can also assess the statistical significance of any estimated causal effect. For this purpose, in addition to developing statistical theory for large-sample inference, we also develop a novel and effective data perturbation scheme for more accurate finite-sample inference by accounting for model fitting uncertainties (e.g. in selecting out invalid IVs).
The latter task is technically challenging mainly because of complex dependencies among the traits and the SNPs: we need to fully take into account not only possible correlations among the traits (due to overlapping samples), but also the use of each trait multiple times across many pairs of traits (thus inducing dependencies among the resulting estimates in a causal network) and linkage disequilibrium (LD) among the SNPs/IVs across all the traits (even if the SNPs/IVs are selected to be independent for each trait). In particular, it would be impractical to restrict the SNPs/IVs to be independent across all the traits, as this would leave no or few SNPs.

There are several approaches in the MR literature that aim to estimate the direct causal effects among multiple traits. Ref. [19] proposed a two-step framework similar to ours, which used MR-Egger to construct a causal network of total effects and then, under a sparsity assumption, approximately inverted it by penalized regression to infer the corresponding direct causal network. Apart from our use of the more robust and efficient MR-cML in place of their (modified) MR-Egger, no theory has been established for their method; in particular, it is unclear how their proposed statistical inference would perform, partly due to technical challenges imposed by the use of penalized regression. Another related method is two-step MR [20] or network MR [8], which focuses on the set-up with a candidate mediator between an exposure and an outcome. Our proposed method can be regarded as a generalization of this approach to infer a more complex causal network of multiple traits without pre-specifying causal directions and mediators. Finally, multivariable MR (MVMR) [9, 10] can be used to estimate direct effects of multiple exposures on an outcome. However, first, our method depends only on the validity of univariable MR (UVMR) (and the corresponding valid IV assumptions), while MVMR requires additional assumptions [21]. For example, a valid IV for UVMR may not be valid for MVMR, and there is a potential issue of multicollinearity in MVMR, leading to weak IV biases [22]. Second, existing MVMR methods all require the use of independent IVs for all exposures, sometimes leading to no or only a few IVs for some exposures when the number of exposures is not small. More generally, applying any existing MVMR method would reduce the number of IVs, leading to a loss of estimation efficiency and possible multicollinearity, as will be confirmed in the real data example.

To summarize, we make two main contributions in methods development. First, we propose a general framework for inferring (including estimating and testing) both total and direct causal graphs among multiple traits of interest. Second, for better performance of the proposed framework, we extend the MR-cML method [15] to accommodate overlapping samples, and modify the network deconvolution algorithm, each of which can be useful in its own right. Through extensive simulation studies, we show that the extended MR-cML performed better than the original one and other widely-used MR methods in the presence of sample overlap. We also show improved performance of our modified network deconvolution algorithm over that of the original one. Finally, we applied the proposed framework to 17 large-scale GWAS summary datasets (with a median sample size of 256879 and a median of 48 IVs) to infer causal networks among 11 common cardiometabolic risk factors and 6 diseases, including 4 cardiometabolic diseases (coronary artery disease, stroke, type 2 diabetes, atrial fibrillation), Alzheimer’s disease (AD) (for its associations with some cardiometabolic risk factors/diseases [23]) and asthma (more as a negative control), identifying some interesting causal pathways.

[END]
---
[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010762

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/