(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
The identification of mediating effects using genome-based restricted maximum likelihood estimation [1]
['Cornelius A. Rietveld', 'Department Of Applied Economics', 'Erasmus School Of Economics', 'Erasmus University Rotterdam', 'Rotterdam', 'The Netherlands', 'Erasmus University Rotterdam Institute For Behavior', 'Biology', 'Ronald De Vlaming', 'Department Of Economics']
Date: 2023-03
Abstract Mediation analysis is commonly used to identify mechanisms and intermediate factors between causes and outcomes. Studies drawing on polygenic scores (PGSs) can readily employ traditional regression-based procedures to assess whether trait M mediates the relationship between the genetic component of outcome Y and outcome Y itself. However, this approach suffers from attenuation bias, as PGSs capture only a (small) part of the genetic variance of a given trait. To overcome this limitation, we developed MA-GREML: a method for Mediation Analysis using Genome-based Restricted Maximum Likelihood (GREML) estimation. Using MA-GREML to assess mediation between genetic factors and traits comes with two main advantages. First, we circumvent the limited predictive accuracy of PGSs that regression-based mediation approaches suffer from. Second, compared to methods employing summary statistics from genome-wide association studies, the individual-level data approach of GREML allows to directly control for confounders of the association between M and Y. In addition to typical GREML parameters (e.g., the genetic correlation), MA-GREML estimates (i) the effect of M on Y, (ii) the direct effect (i.e., the genetic variance of Y that is not mediated by M), and (iii) the indirect effect (i.e., the genetic variance of Y that is mediated by M). MA-GREML also provides standard errors of these estimates and assesses the significance of the indirect effect. We use analytical derivations and simulations to show the validity of our approach under two main assumptions, viz., that M precedes Y and that environmental confounders of the association between M and Y are controlled for. We conclude that MA-GREML is an appropriate tool to assess the mediating role of trait M in the relationship between the genetic component of Y and outcome Y. Using data from the US Health and Retirement Study, we provide evidence that genetic effects on Body Mass Index (BMI), cognitive functioning and self-reported health in later life run partially through educational attainment. For mental health, we do not find significant evidence for an indirect effect through educational attainment. Further analyses show that the additive genetic factors of these four outcomes do partially (cognition and mental health) and fully (BMI and self-reported health) run through an earlier realization of these traits.
Author summary Mediation analysis is instrumental to identify intermediate factors between causes and outcomes. We developed Mediation Analysis using Genome-based Restricted Maximum Likelihood (MA-GREML) estimation to quantify to which degree the genetic component (cause) of a trait (outcome) is mediated by another trait (intermediate factor). This method has two main advantages. First, it captures the total additive genetic variance as estimated using GREML rather than the smaller part of the genetic variance typically captured by polygenic scores. Second, the individual-level data approach GREML takes allows users to directly control for confounders in the model. We use analytical derivations, simulations, and empirical data to validate the underlying model that is used to quantify to which degree an intermediate trait mediates the relationship between the genetic component of the outcome and the outcome itself. We conclude that MA-GREML is an appropriate method to estimate genetic mediation and to test the significance of the indirect effect. We implemented the estimation procedure in the freely available Python-based software package MGREML, available at
https://github.com/devlaming/mgreml/.
Citation: Rietveld CA, de Vlaming R, Slob EAW (2023) The identification of mediating effects using genome-based restricted maximum likelihood estimation. PLoS Genet 19(2): e1010638.
https://doi.org/10.1371/journal.pgen.1010638 Editor: Zoltán Kutalik, University Hospital of the Canton Vaud (CHUV), SWITZERLAND Received: June 24, 2022; Accepted: January 23, 2023; Published: February 21, 2023 Copyright: © 2023 Rietveld et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Access to the genetic data of the Health and Retirement Study (HRS) has been obtained through the Database of Genotypes and Phenotypes (dbGaP, application 3544;
https://www.ncbi.nlm.nih.gov/gap/). The RAND HRS data, produced by the RAND Center for the Study of Aging, containing the phenotypic data are available after registration via the HRS website (
https://hrsdata.isr.umich.edu/data-products/rand). Researchers who wish to link genotype and phenotype data from the HRS must apply for approval from HRS (
https://hrsdata.isr.umich.edu/data-products/hrs-dbgap/genetic-data-request-form). The mediation tool, simulation code, and an extensive tutorial are freely available on the MGREML GitHub page (
https://github.com/devlaming/mgreml). S1 Data contains all simulation output. Funding: This work is supported by the European Research Council (Starting Grant 946647 GEPSI to C.A.R.). E.A.W.S. is funded by National Institute for Health Research (NIHR Cambridge BRC). The funders had no role in study design, data analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.
Introduction Mediation analysis is widely used to identify the mechanisms and intermediate factors in between causes and outcomes [1]. The original idea of mediation analysis dates back to at least as early as 1934 [2], but it gained momentum after its seminal introduction in the social sciences by Baron and Kenny almost 40 years ago now [3]. Since then, the number of studies using mediation analysis has grown exponentially [4]. Motivated by the desire to control for genetic confounding [5], the earlier twin-study literature already attempted to investigate the effect of an environmental factor on a trait while controlling for genetic influences assuming that a significant relationship between the mediating variable and the outcome variable in a genetically-informative context would provide evidence for mediation [6]. This approach is not guaranteed to be unbiased because the shared environment (family) component in a twin study may pick up parental genetic influences; moreover, this approach fails in case the mediating variable is shared between twins [6]. The more recently developed biometric mediation model for twin data [7] can detect genetic confounding in a mediation setting, but has been designed to assess mediation between observed (non-genetic) variables in a genetically-informative model. The Baron and Kenny approach, though, to assess mediation draws upon conventional regression techniques in a sample of unrelated individuals. By establishing that the effect of the independent variable on the outcome variable significantly changes upon inclusion of the mediating variable in the model, one can assess mediation [3, 8]. By drawing on polygenic scores (PGSs), several recent studies have employed such regression models to identify the mechanisms through which genotypes affect traits [9–11]. A clear limitation of this approach is that PGSs have limited explanatory power and as such do not fully account for the additive genetic component of traits [12–15]. Although several regression approaches have been developed recently to ‘disattenuate’ the estimated effect of the PGS from measurement error [15–17], these approaches cannot be used to properly account for the total additive genetic component of the outcome variable in a mediation model because these approaches do not recover the relationship between the main explanatory variable (the PGS) and the mediating variable. To overcome this limitation of PGS-based mediation analyses, we developed a structural equation model (SEM) for mediation analysis using Genome-based Restricted Maximum Likelihood (GREML) estimation. This approach we refer to as MA-GREML and has been implemented in MGREML, a Python-based command-line tool [18, 19]. Originally, GREML estimation has been developed to quantify the contribution of random, additive effects of single-nucleotide polymorphisms (SNPs) to variation in a trait [20, 21] (i.e., the genetic variance) and to the covariance between traits [22] (i.e., the genetic covariance). Moreover, GREML also quantifies the environmental variance of traits (i.e., the phenotypic variance not accounted for by the random SNP effects) and the environmental covariance between traits (i.e., the phenotypic covariance not accounted for by the SNP effects). Therefore, MA-GREML is able to quantify not only the genetic (co)variance of supposed mediator M and outcome Y, but also the genetic variance of Y that is mediated by M (i.e., the indirect effect) and the genetic variance of Y that is not mediated by M (i.e., the direct effect). Importantly, these estimates are not subject to the typical attenuation bias one would expect in a mediation analysis using a noisy approximation of the genetic component of M and/or Y, viz., a PGS. MA-GREML, like Genomic SEM [23], relies on a structural equation model (SEM). Importantly, whereas MA-GREML employs individual-level genetic data Genomic SEM draws upon genome-wide association study (GWAS) summary statistics. Genomic SEM is a flexible two-stage tool that allows users to specify and estimate a wide range of SEMs, also ones involving forms of mediation. Nevertheless, MA-GREML uses a specific SEM focused on the question to which degree the relationship between the genetic component of outcome Y and outcome Y itself is mediated by mediator M. Genomic SEM cannot be utilized to answer this specific question because such a model based on GWAS summary statistics for M and Y would involve more parameters than there are degrees of freedom in the model. We refer to Section F in S1 Text for a more elaborate discussion of mediation analysis using Genomic SEM. Mendelian randomization (MR) models, in which the main explanatory variable is only allowed to impact the outcome variable through the mediator, can be conceptualized as a restricted version of the MA-GREML model. This restriction boils down to the so-called exclusion assumption to hold in an MR study, which is often debatable and not empirically verifiable [24, 25]. MA-GREML may however be instrumental to provide evidence in favor of the exclusion restriction to hold (i.e., whether there is indeed full mediation). Relatedly, a GWAS is sometimes conducted on a proxy-trait that is easier to measure than the trait of interest itself [26]. Evidence that genetic effects on the proxy trait run largely or even fully through the actual trait of interest may build confidence in such proxy-GWAS findings. We use analytical derivations, simulations, and empirical data, to validate the structural model underlying MA-GREML and the procedure developed to test the significance of mediating effects. For the empirical analyses, we draw on longitudinal data from the US Health and Retirement Study [27]. While we do find evidence that genetic effects on Body Mass Index (BMI), cognition and self-reported health in later life run through educational attainment, the evidence regarding the mediating role of educational attainment for mental health in later life is inconclusive. In addition, we find that the additive genetic factors of these four later-life health outcomes do partially (cognition and mental health) and fully (BMI and self-reported health) run through an earlier realization of these traits.
Verification and comparison Using simulations, we analyze whether the MA-GREML estimators are consistent and if the LRT for the presence of an indirect effect behaves appropriately (i.e., no inflated false-positive rate for simulations in which the null hypothesis of no indirect effect holds true). Besides a Baseline scenario with partial mediation (a2 + g2 = 2, b = 1, and c2 = 1), we also consider three scenarios with no mediation (i.e., indirect effect = 0), viz., Scenario (i) where a2 + g2 = 0, b = 1, and c2 = 1; Scenario (ii) where a2 + g2 = 2, b = 0, and c2 = 1; Scenario (iii) where a2 + g2 = 0, b = 0, and c2 = 1. Finally, we consider one scenario with full mediation (i.e., direct effect = 0), viz., Scenario (iv) where a2 + g2 = 2, b = 1 and c2 = 0. Section I in S1 Text provides more details about the set-up of the simulations and Table A-E in S1 Data contains simulation output for each run of the simulation. The simulation results are based on 100 runs. Table 2 provides the average estimates and their SEs across runs (more detailed simulation results can be found in Section J in S1 Text). Across scenarios and runs, estimates are close to the true values and have small SEs. For the scenario with partial mediation (Baseline), estimates indeed reveal both the direct and indirect effect; for the scenarios with no mediation (i–iii), the estimated indirect effect is indeed very close to zero; for the scenario with full mediation (iv), the estimated direct effect is very close to zero, as expected. We conclude that MA-GREML consistently estimates the effect of M on Y, the direct effect, and the indirect effect, irrespective of whether we have no mediation, partial mediation, or full mediation. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Table 2. Average estimates and corresponding standard errors of estimated parameters in the mediation model (100 simulation runs for each scenario). is the estimator of b based on Eq 8 (i.e., the effect of mediator M on outcome Y); is an estimator of c2 based on Eq 10 (i.e., the genetic variance of Y that is not mediated by M); is an estimator of (a2 + g2)b2 based on Eq 9 (i.e., the genetic variance of Y that is mediated by M); average standard errors are reported between parentheses.
https://doi.org/10.1371/journal.pgen.1010638.t002 Regarding the behavior of our LRT for the indirect effect, in Scenario (i) the optimum of the likelihood of the model under the null hypothesis (i.e., under the constraint that (a2 + g2) b2 = 0) is typically found in (i.e., where a2 + g2 = 0 and where b can differ from zero) and, therefore, the LRT tends to have two degree of freedom. Fig 2A shows the observed distribution of p-values resulting from the LRTs compared to the expected distribution of p-values. We find that the observations lie somewhat below the 45 degree line. Thus, the LRT is conservative in case M has no genetic variance at all, implying a reduced false-positive rate in that case. By the same token, this finding implies that the LRT is underpowered in case M has a very low but non-zero . This drawback, however, has little practical relevance, as users would typically only consider potential mediators that have an appreciable . PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 2. QQ-plots visualizing the distribution of p-values resulting from the likelihood ratio tests (LRTs) for the presence of the indirect effect (a2 + g2)b2 in different scenarios. Panel (a): Mediator M does not have a genetic component, but has an effect on outcome Y ((a2 + g2) = 0, b ≠ 0); Panel (b): Mediator M has a genetic component, but does not have an effect on outcome Y ((a2 + g2) > 0, b = 0); Panel (c): Mediator M does not have a genetic component, nor an effect on outcome Y ((a2 + g2) = 0, b = 0).
https://doi.org/10.1371/journal.pgen.1010638.g002 In Scenario (ii), the optimum of the likelihood of the model under the null hypothesis is mostly found in (i.e., where b = 0 and where a2 + g2 can exceed zero) and, therefore, the LRT tends to have one degree of freedom. Fig 2B plots the observed distribution of p-values resulting from the LRTs compared to the expected distribution of p-values. With the observations being close to the diagonal, we find that the empirically observed p-values closely follow their theoretically expected distribution in case b = 0 and (a2 + g2)>0. Thus, also in this setting, we do not suffer from an inflated false-positive rate. Moreover, here we find no evidence of any loss in statistical power (e.g., where b is small but nonzero). Within the set of scenarios where the null hypothesis of no indirect effect holds true (i.e., Scenarios i–iii), we judge this scenario to be the most realistic one because in practice mediators will have a non-negligible , irrespective of whether they affect Y. Finally, for Scenario (iii) the observed distribution of p-values resulting from the LRTs compared to the expected distribution of p-values are shown in Fig 2C. In this setting, it is not possible to a priori determine whether the optimum of the likelihood of the model under the null hypothesis will be in either or . This absence of a prior expectation for the degrees of freedom poses no problem; MA-GREML simply sets the degrees of freedom based on where the optimum is found under the restrictions imposed by the null hypothesis (i.e., two degrees of freedom if the optimum under the null hypothesis is found in and one if found in ). The observed p-values, like in Scenario (i), are somewhat conservative. We conclude that also in this scenario the procedure for testing the significance of the indirect effect sufficiently controls for Type I errors. Again, we note that in practice a setting in which the mediator is not heritable is not realistic.
Applications Ethics statement For our empirical illustrations, we employ data from the US Health and Retirement Study (HRS). The HRS is a longitudinal panel study that surveys a representative sample of approximately 20,000 individuals aged 51 years and older (and their spouses) in the United States of America [27]. Collection and production of HRS data comply with the requirements of the University of Michigan’s Institutional Review Board (HUM0006112). The specific research project in which the present study is embedded has been approved locally by the Institutional Review Board of the Erasmus Research Institute of Management (IRB-E 2014–04). Empirical results To construct the genomic-relatedness matrix (GRM) we need for the GREML analyses, we use the 2012 release of genetic data. These genetic data were obtained from the DNA samples collected from HRS participants in the years between 2006 and 2008. Participants signed consent forms stating that the genetic data could be used for non-profit research use only. Genotyping was carried out using the Illumina Human Omni-2.5 Quad BeadChip. After imposing the quality control filters recommended by the genotyping center, removal of non-autosomal SNPs and SNPs with minor allele frequency (MAF) < 0.05, genotyping call rate < 95% and Hardy–Weinberg p < 10−6, and selection of unrelated individuals from European ancestry with <5% SNP missingness, we construct the GRM for a sample of 8,652 individuals using 1,188,298 SNPs. Phenotypic data from the biennial waves of data collection in the HRS are harmonized by the RAND cooperation. We exploit information about educational attainment (years) and four health outcomes in later life—Body Mass Index (BMI; kg/m2), Cognition (index score ranging from 0–35), Mental health (index score ranging from 0–8), and Self-reported health (index score ranging from 1–5)—as available in the RAND HRS Longitudinal File 2018 (V1). By analyzing a variable that is realized in early life as mediator and health measures in later life as outcomes, we make sure that the mediating variable precedes the outcomes. We selected phenotypic data from the wave of data collection with the (combined) lowest number of missing values for these four outcomes (Wave 8). In addition, we analyze to what extent genetic effects run through an earlier realization of these four variables. That is, we analyze whether the Wave 7 measures mediate the additive genetic effect on the Wave 8 outcomes. There are 5,305–8,565 individuals with information on the four later-life health outcomes. For the mediators, the sample size ranges between 5,747–8,638. In our analyses, we control for sex (Female/Male) and birth year. MGREML also adjusts for the first 20 principal components of the genomic-relatedness matrix to control for subtle forms of population stratification [37, 38]. Table 3 provides descriptive statistics of the HRS analysis sample. Due to the exclusion of individuals with a genetic relatedness larger than 0.025 in the analysis sample, the actual analysis samples comprise slightly fewer individuals (see Table 4). PPT PowerPoint slide
PNG larger image
TIFF original image Download: Table 3. Descriptive statistics of the analysis sample from the Health and Retirement Study. Std. dev. = Standard deviation; Min. = Minimum; Max. = Maximum.
https://doi.org/10.1371/journal.pgen.1010638.t003 PPT PowerPoint slide
PNG larger image
TIFF original image Download: Table 4. MA-GREML results for the empirical analyses using data from the Health and Retirement Study. LRT = Likelihood-Ratio Test for significance of indirect effect (a2 + g2)b2 (i.e., the proportion of of outcome Y running through mediator M); Standard errors in parentheses.
https://doi.org/10.1371/journal.pgen.1010638.t004 The main results of the empirical analyses are presented in Table 4. The later-life health outcomes are all heritable. The SNP-based heritability of BMI is 18.7%, and that of cognition is 24.2%. It equals 19.2% for mental health and 10.6% for self-reported health. The heritability estimates of the mediator (educational attainment) ranges from 26.3% to 29.8% across the slightly different analysis samples. The estimated effects b of M on Y show that every additional year of educational attainment leads to a decrease in BMI of 0.222 kg/m2 and an increase in the cognition index score of 0.323. For mental health and self-reported health, the estimates are −0.044 and −0.046, respectively. We find that the effect of the additive genetic factor of BMI on BMI runs through educational attainment for just 1.4%, but this indirect effect is statistically significant (p = 0.025). For mental health, the indirect effect appears to be statistically insignificant (p = 0.182). The proportion mediated for cognition is 27.6% (p = 0.011). This percentage is roughly similar for self-reported health, 30.6%, and for this outcome the indirect effect is also significant (p = 0.010). For the analyses with the lagged variables as mediators, the results are shown in the bottom of Table 4. For BMI, we observe full mediation: the proportion mediated is 99.9%, and the indirect effect is statistically significant (p = 3.615 × 10−4). For self-reported health, we observe a similarly high proportion mediated, i.e., 96.8% (p = 7.785 × 10−5). These estimates suggest that the additive genetic factor of these two outcomes is relatively time-invariant between consecutive data collection waves in the analysis sample. For cognition and mental health the proportions mediated are lower, 71.3% (p = 2.901 × 10−3) and 54.9% (p = 1.563 × 10−3), respectively, suggesting relatively strong impact of the environment on the development of these traits over the life-course.
Discussion We designed a framework to analyze to what extent a particular factor mediates the relationship between the additive genetic factor of a trait and the trait itself. By doing so, we particularly improve on the current practice of using polygenic scores for such analyses, because GREML enables capturing the full SNP-based heritability of a trait rather than just the ‘explained SNP-based heritability’ [15]. The statistical procedure has been implemented in the ready-to-use command-line tool MGREML [19]. The GitHub page (
https://github.com/devlaming/mgreml) accompanying this tool comes with a full tutorial on its usage. Usage of MGREML for mediation analysis requires careful assessment of whether the assumptions of the model hold for the specific empirical application, in particular that the causal direction between mediator M and outcome Y is correctly specified. Moreover, SNP-based heritability estimates may capture both direct genetic effects and gene-environment correlations such as genetic nurture effects (i.e., the effect of parental genotype on the child’s outcomes), depending on the outcome analyzed [15, 39]. In Section H in S1 Text we introduce a structural equation model including genetic nurture effects and we derive that in situations where parental factors (e.g., parental genes or parental educational attainment) affect offspring outcomes estimates of mediating effects are typically affected. As genetic nurture can lead to both over and underestimation of the effect of mediator M on outcome Y, the impact of such effects on the MA-GREML estimates depends on the specific empirical application. Nevertheless, the assessment of mediation is of prime importance in the social sciences [1], because it allows for an improved understanding of the mechanisms underlying the relationship between a predictor and an outcome. As a result, more nuanced questions extending beyond merely determining whether an outcome occurs can be analyzed and answered. However, the results of mediation analyses are also informative for genetic epidemiology. The proxy-phenotype approach has been developed to discover genetic variants in a GWAS for being associated with a trait by investigation of a closely related but easier to measure trait [26]. While a strong genetic correlation between the two traits often serves as a motivation for this experimental design [36], backing up such a study with an analysis showing that genetic effects on the easier to measure trait in fact largely run through the more difficult to measure trait would particularly build confidence in the eventual proxy-GWAS findings. Moreover, Mendelian randomization (MR) models are widely used for causal inferences by exploiting the notion that the inheritance of alleles is random conditional of the genotypes of the parents [24]. In MR models, the exclusion assumption needs to hold meaning that the main genetic factor which is used as instrument is only allowed to impact the outcome variable through the mediator. Whether this assumption holds in empirical applications is often debatable and not empirically verifiable [24, 25], especially when a PGS based on a genome-wide scan of SNPs is being used as instrument. However, our mediation model may provide suggestive evidence in favor of the exclusion restriction to hold in case full mediation is found. Even in case a biology-informed decision is taken regarding the subset of genetic variants to use in MR applications, the GREML-based mediation model can be used for such an informative analysis by constructing a GRM using these genetic variants only rather than a genome-wide scan of SNPs. For example, the GRM can be constructed using SNPs from specific chromosomes, SNPs in genes expressed in particular cell types (functional categories), or SNPs with an allele frequency in a particular range, etc. [40–42]. Moreover, SNPs included in the GRM can be weighted to reflect a specific heritability model [18, 43]. As illustration, Section K in S1 Text contains analysis results for the four later-life outcomes from the empirical illustration with educational attainment as mediator for which we partitioned the GRM by ten functional categories. Interestingly, we find that the evidence for mediation differs across functional categories. The empirical results from the Health and Retirement Study presented in the main text, however, suggest that for two reasons MR analysis based on a genome-wide scan of SNPs (or a PGS as summary of that) cannot be used to analyze the causal impact of educational attainment on BMI, cognition, mental health, or self-reported health in later life. While for BMI, cognition and self-reported health we do find a significant indirect effect, the proportion mediated is (far) below 100% suggesting that there are other channels than educational attainment through which the additive genetic factor for these two outcomes affects the outcomes. For mental health the indirect effect is statistically insignificant, both invalidating the exclusion restriction and the so-called relevance assumption that the instrument has explanatory power for the mediator [24]. In the empirical analyses with lagged versions of the outcome variable as mediator, we find that the additive genetic factor underlying the outcome can be time-variant for some traits. These results underscore that the conventional assumption in MR analyses that the effects of the genetic variants used as instruments on the exposure (i.e., the mediator in the MA-GREML framework) do not vary over the life-course is often questionable [44]. Thus, beyond the assessment of the relative size of indirect effects, these results show that results obtained using MA-GREML are of importance for validating assumptions of other methodologies. Therefore, we believe that the mediation analysis approach as developed in this study and implemented in the freely available MGREML tool will prove to be an important framework not only for the social sciences but also for genetic epidemiology.
Acknowledgments The authors are grateful for computational resources provided by the Dutch national e-infrastructure with support of the SURF cooperative (Project EINF-607). The Health and Retirement Study (HRS) is sponsored by the National Institute on Aging and is conducted by the University of Michigan. Disclaimer The views expressed are those of the authors and not necessarily those of the National Institute for Health Research.
[END]
---
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010638
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/