(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .



Leveraging epigenomes and three-dimensional genome organization for interpreting regulatory variation [1]

['Brittany Baur', 'Wisconsin Institute For Discovery', 'University Of Wisconsin-Madison', 'Madison', 'Wisconsin', 'United States Of America', 'Junha Shin', 'Jacob Schreiber', 'Paul G. Allen School Of Computer Science', 'Engineering']

Date: 2023-08

Understanding the impact of regulatory variants on complex phenotypes is a significant challenge because the genes and pathways that are targeted by such variants and the cell type context in which regulatory variants operate are typically unknown. Cell-type-specific long-range regulatory interactions that occur between a distal regulatory sequence and a gene offer a powerful framework for examining the impact of regulatory variants on complex phenotypes. However, high-resolution maps of such long-range interactions are available only for a handful of cell types. Furthermore, identifying specific gene subnetworks or pathways that are targeted by a set of variants is a significant challenge. We have developed L-HiC-Reg, a Random Forests regression method to predict high-resolution contact counts in new cell types, and a network-based framework to identify candidate cell-type-specific gene networks targeted by a set of variants from a genome-wide association study (GWAS). We applied our approach to predict interactions in 55 Roadmap Epigenomics Mapping Consortium cell types, which we used to interpret regulatory single nucleotide polymorphisms (SNPs) in the NHGRI-EBI GWAS catalogue. Using our approach, we performed an in-depth characterization of fifteen different phenotypes including schizophrenia, coronary artery disease (CAD) and Crohn’s disease. We found differentially wired subnetworks consisting of known as well as novel gene targets of regulatory SNPs. Taken together, our compendium of interactions and the associated network-based analysis pipeline leverages long-range regulatory interactions to examine the context-specific impact of regulatory variation in complex phenotypes.

Single nucleotide polymorphisms (SNPs) are single base pair changes in the human genome that can contribute to variations in complex phenotypes, including diseases. Many SNPs lie in non-coding parts of the genome, far away from genes, making it difficult to identify downstream molecular pathways that are affected by these variants. Long-range gene regulation wherein a regulatory sequence element affects the expression of a gene hundreds of kilobases away, has emerged as a possible mechanism by which non-coding SNPs impact the function of a gene. Hi-C is a high-throughput genomic assay that can be used to infer three-dimensional organization and long-range interactions. However, collection of Hi-C datasets across diverse cell types and contexts is a significant challenge. In our study, we used publicly available datasets to train a model to predict Hi-C measurements across multiple cell types. We use these predictions to infer long-range interactions between SNPs and genes and combine with network-based approaches to identify potential downstream pathways disrupted by these SNPs. Our collection of long-range interactions, affected pathways and analysis approach should be a useful resource to uncover biological pathways that are affected by non-coding genomic variation for different disease and normal conditions.

We applied our approach to fifteen different phenotypes from the NHGRI-EBI GWAS catalog that represent different autoimmune, cardiovascular, neurological and cancer phenotypes [ 1 ]. We identified gene subnetworks that are enriched with specific SNPs and exhibit network rewiring across the 55 cell types. These subnetworks connect known and novel genes associated with a phenotype of interest offering novel hypotheses and prioritize validation experiments needed to improve our understanding of the role of regulatory variation in specific phenotypes.

A. L-HiC-Reg is trained on 1 MB regions of a chromosome with one-dimensional regulatory genomic datasets and high resolution Hi-C data using a random forest regression algorithm. B. The trained models are then applied to measured and imputed datasets in the Roadmap epigenomics database to generate a compendium of predictions in 55 cell types. Generic 5kb bins are shown in gray, bins overlapping SNPs are in orange and TSS bins are in blue. C. SNPs and genes are connected in the SNP-gene network via long-range interactions with the SNPs for a given phenotype. Genes are scored based on the average significance of interactions with SNPs D. Physical molecular interactions from protein-protein interaction networks and transcription factor (TF)-gene interaction networks are used to perform graph diffusion on the scores from C. Multi-task graph clustering is used to identify gene subnetworks jointly across each cell type to identify pathways affected by the set of SNPs.

To address both these challenges – the lack of comprehensive high-resolution cell-type-specific long-range regulatory interactions and defining the target pathways of a set of regulatory variants– we developed a computational pipeline with two components: L-HiC-Reg (Local HiC-Reg), to predict long-range interactions, and graph-diffusion coupled with multi-task graph clustering, to identify the gene networks targeted by a set of variants in a cell type-specific manner ( Fig 1 ). L-HiC-Reg is a Random Forests regression method that adapts our previous tool, HiC-Reg [ 18 ], to leverage the local structure of genomic segments, improving generalizability across cell types and tissues. We used L-HiC-Reg models trained on high-resolution Hi-C data to generate a compendium of in silico Hi-C maps for 55 publicly available cell types and tissues using experimental and imputed data from the Roadmap epigenomics database [ 26 ]. We then leverage the compendium to link non-coding SNPs to target genes via long-range regulation. Compared to existing tools that predict enhancer-gene relationships such as GeneHancer [ 27 ] and JEME [ 14 ], our compendium is both cell-type-specific and is based on regression, rather than classification. The advantage of the regression approach is that it does not rely on specific peak calling tools to define an interaction, which may bias the results. Classification-based approaches would need to use an interaction calling tool [ 15 , 16 , 28 ] to generate training pairs, and these tools can vary in the number of interactions detected and the range of distance between pairs in detected interactions [ 29 ]. Generation of the negative set could also introduce additional biases because of the paired nature of Hi-C interactions and the lack of true negative interactions [ 26 , 27 ]. Furthermore, predicting counts directly allows the end user the flexibility to apply different statistical tools including peak calling and topologically associated domain (TAD) analysis. Our multi-task graph clustering approach identifies gene subnetworks for a particular phenotype of interest in a cell-type-specific manner. Previous approaches to link non-coding SNPs to target genes have focused solely on the direct target genes using predicted or measured chromatin loops [ 22 , 23 , 24 ]. By contrast, our approach integrates both distal and proximal interactions and identifies downstream pathways that may be disrupted by a set of SNPs. Our predicted long-range interactions are corroborated with experimentally derived interactions from complementary ChIA-PET and Capture-Hi-C experiments and link more sequence variants to genes compared to existing approaches.

A related challenge is identifying interactions that link regulatory variants to their target genes and the impact such interactions have on downstream pathways. Computational approaches to identify molecular networks that are impacted by sequence variants have largely focused on coding variants [ 22 ]. Recent computational approaches have generated cell-type-specific enhancer functional networks linking distal regions to disease [ 23 ] and enabled the identification of non-coding risk variants by leveraging disease-relevant gene regulatory networks [ 24 ]. While both approaches accurately identify disease-relevant genomic loci, neither identify the target genes or gene pathways. Another recent approach identifies cell-type-specific enhancers harboring GWAS variants, but requires measured Hi-C to link them to genes [ 25 ]. Furthermore, to our knowledge, no existing approach has examined the downstream pathways impacted by long-range regulatory interactions.

Genome-wide Chromosome Conformation Capture (3C) technologies, such as Hi-C [ 12 ], are experimental methodologies for measuring genome-wide maps of 3D proximity of genomic loci from which we can potentially determine which physical interactions have regulatory roles. However, detecting long-range interactions between regulatory elements and genes requires high-resolution Hi-C datasets (e.g., 5kb), which are limited to a handful of well-characterized model cell types due to sequencing costs and the large number of cells required for making reliable measurements at high resolution. Micro-C, which was recently developed for mammalian cell types, generates higher resolution datasets but requires more cells than Hi-C, roughly 5 million, and is only available for a small number of cell types [ 13 ]. Because of these technical challenges, a number of computational approaches have been developed to predict long-range interactions using classification [ 14 – 16 ] and regression approaches [ 17 , 18 ]. While many of these approaches have primarily used sequence [ 19 – 21 ], several others have additionally used one-dimensional epigenomic signals such as histone modifications, transcription factor (TF) binding and accessibility, which are widely available for many cell types. However, prediction of counts and interactions in new cell types remains difficult and requires a large number of datasets for training, which may not be available for most cell types and tissues.

Genome-wide association studies (GWAS) have identified a large number of variants associated with different phenotypes and diseases [ 1 ]. Approximately, 93% of all GWAS variants are regulatory variants, located in non-coding regions that can regulate gene expression, with nearly 20% located over 100kb away from any genic feature [ 2 , 3 ]. Understanding the mechanisms by which such variants contribute to phenotypic variation is a significant challenge because the target genes and pathways of non-coding variants as well as the specific cell types in which these variants operate are unknown. Recent studies have shown that regulatory sequences such as enhancers can harbor non-coding variants that impact gene expression [ 4 – 7 ] in a cell-type-specific manner [ 8 , 9 ]. Three-dimensional organization of the genome enables long-range regulatory interactions between distal enhancers and genes through chromosomal looping that brings the enhancer in spatial proximity to target genes. Consequently, long-range interactions are emerging as an important component for the interpretation of regulatory variation [ 8 – 10 ] and have been implicated in many diseases such as autoimmune diseases [ 10 ] and cancer [ 11 ]. To systematically examine the impact of regulatory variants, we need long-range regulatory interaction maps across multiple cell types as well as computational tools that can leverage these interactions to identify the gene networks that are targeted by sets of regulatory variants.

Results

L-HiC-Reg accurately predicts contact counts in a large number of cell types We previously developed HiC-Reg, an approach to computationally predict contact counts based on one-dimensional signals such as histone marks and transcription factor binding sites [18]. HiC-Reg performs well across chromosomes within the same cell type; however, predicting across cell types is still challenging and requires a large number of one-dimensional signals. We hypothesized that 3D genome conformation may be driven by different factors at different genomic loci. While a Random Forests prediction model should be able to capture multiple combinations of factors, a predictive model trained in a locus-specific manner can be more expressive and capture more nuanced dependencies than a global model for a whole chromosome. Therefore, we developed L-HiC-Reg, a “local” version of HiC-Reg that divides the chromosome into 1Mb segments and trains a separate model on each segment, as opposed to training on the whole chromosome (Fig 1A and Methods). Training on smaller segments also makes the prediction model much more scalable as it can be easily parallelized. L-HiC-Reg also uses a smaller number of discretized one-dimensional signals than originally used for HiC-Reg since these signals are widely available for many cell types enabling us to have a high coverage across diverse cell types. Briefly, to train L-HiC-Reg. we used features from seven experimentally measured and three computationally derived one-dimensional signals, together with genomic distance to represent a pair of regions. These datasets include the architectural protein CTCF, repressive marks (H3K27me3, H3K9me3), a mark associated with active gene bodies and transcriptional elongation (H3K36me3), active enhancer specific marks (H3K4me1, H3K27ac), a promoter mark (H3Kme3), cohesin component (RAD21), a general transcription factor (TBP) and DNase 1 (open chromatin). The histone marks were selected because they were the most frequently measured across the cell types in the Roadmap database, which we used to generate our compendium of predictions (S1 Fig). Datasets for CTCF, TBP and RAD21 were computationally derived from sequence and open chromatin (Methods) [30]. Similar to HiC-Reg, L-HiC-Reg predicts contact counts between 5kb pairs using the one-dimensional regulatory signals of histone marks, accessibility and architectural protein binding motifs. A pair of regions in L-HiC-Reg and HiC-Reg is represented by the features for both 5kb regions, as well as the average signals between the two regions (window signal) as initially proposed in TargetFinder [16]. The contact counts to be predicted were derived from high-resolution (5kb) Hi-C datasets from Rao et al (Methods) [31]. We applied trained L-HiC-Reg models on adjacent 1Mb regions using both original and imputed signals and predict in the same 1Mb region in a different cell type. We then concatenated the predictions for the entire chromosome. To test whether L-HiC-Reg recapitulates contact counts more accurately than HiC-Reg across cell types, we trained L-HiC-Reg and HiC-Reg models on the five 5kb-resolution Hi-C data sets from Rao et al each corresponding to a different cell line. We then generated all possible cross-cell type predictions for each chromosome resulting in 440 total predictions for each method (22 chromosomes x 20 cross-cell type combinations) [31]. We assessed the performance of L-HiC-Reg and HiC-Reg using distance-stratified Pearson’s correlation computed on the pairs from the test cell type, which we originally used for HiC-Reg [18]. We take the area under the distance-stratified correlation curve (AUC) as a measure of cross-cell type prediction accuracy. We found that in 270 of the 440 cross-cell type predictions, the AUC was higher for L-HiC-Reg than HiC-Reg (Kolmogorov-Smirnov (KS) p-value=0.0011). This suggests that in the majority of cross-cell type predictions, L-HiC-Reg recapitulated contact counts better than HiC-Reg (Fig 2A, top row). Specifically, L-HiC-Reg is significantly better than HiC-Reg with the test cell lines Gm12878 and Nhek (KS p-value < 0.05). Of these 270 instances, Hmec was most frequently used as the training cell line and Nhek as the test cell line (S2A and S2B Fig). Furthermore, chromosomes 2, 7, 16 and 15 benefitted most from L-HiC-Reg (S2C Fig). Full distance stratified correlation curves for cross cell line predictions from chromosome 10 further compares the performance of the two methods as a function of distance (S3 Fig). Additionally, we compared the performance of L-HiC-Reg to a baseline model in which the count from the training cell type is used as the predicted count for the test cell type (“Transfer count”). L-HiC-Reg performed better than transfer count for 275 of the 440 cross-cell type predictions (KS p-value=1.5e-04). L-HiC-Reg also had higher performance than “Transfer count” in more cell lines than HiC-Reg (262 of the 440) suggesting that L-HiC-Reg is predicting more cell-type specific contact counts (Fig 2A, bottom row). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 2. Evaluation of L-HiC-Reg predictions. A. Performance of L-HiC-Reg for count prediction between pairs of regions is assessed with the Area Under the correlation Curve (AUC) against HiC-Reg (top) and against transfer count (bottom). Each panel is a different test cell type and the color indicates the training cell type. B. Performance of L-HiC-Reg based on identified TADs from L-HiC-Reg predictions versus HiC-Reg predictions based on Jaccard coefficient similarity of TADs from true and predicted counts. Jaccard coefficient was used to assess the overlap of TADs found on the true counts with TADs found on the predicted counts by L-HiC-Reg (y-axis) and HiC-Reg (x-axis). C. Heatmaps of exemplar regions on chromosome 17 comparing Huvec L-HiC-Reg predictions and HiC-Reg predictions (top) against measured Huvec Hi-C data (bottom). D. Assessing predictions from measured versus imputed marks. AUC for L-HiC-Reg predictions generated from experimental one-dimensional datasets (y-axis) and imputed one-dimensional datasets (x-axis) in the test cell type (top). Jaccard coefficients comparing TAD recovery from measured (y-axis) and imputed data (x-axis) (bottom). https://doi.org/10.1371/journal.pcbi.1011286.g002 We next used the predicted counts from HiC-Reg and L-HiC-Reg to define topologically associating domains (TADs) by applying the Directionality Index (DI) TAD finding method [32]. We compared these TADs to TADs identified by DI on the true counts based on a metric derived from the Jaccard coefficient (Methods). The Jaccard coefficient is a number between 0 and 1, with 0 representing no agreement and 1 representing perfect agreement. We computed the Jaccard coefficients for all cross-cell type predictions on all chromosomes for L-HiC-Reg and HiC-Reg (Fig 2B). We found that the Jaccard coefficients were higher in L-HiC-Reg than HiC-Reg in 302 of 440 cross-cell type predictions, was significantly better in two of the five test cell types (Gm12878, K562, KS-test p-value <0.05) and was comparable in the other cell types. This suggests that L-HiC-Reg recapitulates TADs better than HiC-Reg when predicting in a new cell type. Visual inspection of the predicted matrices from the two methods for a 1Mb region in the HUVEC cell type (Fig 2C) also demonstrated a clearer TAD structure when using L-HiC-Reg than HiC-Reg. Taken together these results show that L-HiC-Reg can accurately predict contact counts and high-level features of chromatin architecture in new cell types.

Prediction of Hi-C contact counts across multiple cell types with publicly available and imputed epigenomic datasets To generate a compendium of high-resolution cell-type-specific Hi-C maps, we applied L-HiC-Reg Gm12878 models on publicly available epigenomic datasets for different cell types. Here, we considered cell types and tissues in the Roadmap Epigenome database for which a number of one-dimensional signals were available. These include DNase I for accessibility and six histone marks that are commonly measured (H3K27me3, H3K9me3, H3K36me3, H3K27ac, H3K4me1, H3K4me3), although often not within the same cell type. In particular, only 22 of the total 183 cell types had all six epigenomic signals and DNase I (S1 Fig). As the epigenomic dataset availability varies across cell types, it is difficult to make predictions from a model trained on features generated from a certain set of epigenomic marks that are not available in the test cell type. To overcome this limitation, we used imputed datasets in the test cell type to generate predictions in cases where the measured feature dataset is missing. Specifically, to acquire these imputations, we trained a version of Avocado [33] to predict read counts at 5kb resolution, as opposed to the original signal p-values at 25bp resolution, to match the resolution of our L-HiC-Reg models. Our Avocado model was trained using 559 experiments spanning 104 cell types and 33 assays, and we subsequently used the model to impute experimental readouts for histone modifications in the 55 Roadmap cell types and tissues for which DNase I accessibility was available. We chose not to use imputed DNase I because we needed the signal for motif feature generation of CTCF, RAD21 and TBP and for generating TF-gene networks in our subsequent analysis and interpretation pipeline (Methods). To determine if we could accurately predict contact counts in a test cell type where all histone features are imputed (worst case scenario), we generated cross-cell type predictions for the 5 Rao et al. cell types using the imputed features in the test cell type as input to L-HiC-Reg models trained on the original measured values in the training cell type [31]. Based on the area under the distance-stratified correlation, the resulting performance was not significantly different from those generated from entirely experimental features (Fig 2D; KS test p-value=0.11). Additionally, we used the predictions based on imputed features to call TADs and found that the Jaccard coefficients from these TADs were also similar to the ones generated from experimental features (Fig 2D, KS test p-value=0.84). These results suggest that there is no significant deterioration in the quality of the predictions if imputed features are used. Finally, we applied our trained Gm12878 L-HiC-Reg models on measured and imputed chromatin mark signals, DNase I, and accessibility derived binding of CTCF, TBP, and RAD21 to all 55 cell types from the Roadmap project. We then performed a series of validation experiments for these predictions and leveraged the compendium to link SNPs to genes and downstream pathways as discussed in the following sections.

L-HiC-Reg long-range interactions are validated by complementary experimental sources and associated with highly expressed genes We defined long-range interactions as significantly interacting regions (loops) using a distance-stratified Binomial test [34] on the L-HiC-Reg predicted contact count matrices for each of the 55 cell types (Methods). Across 55 cell types, we identified between 681,889 and 1,481,911 significant (q-value <0.05) interactions with an average of 983,250 interactions. To evaluate these significant interactions, we compared them against interactions identified from complementary experimental assays: ChIA-PET and Capture-Hi-C (Methods). We obtained 10 published ChIA-PET datasets for different factors (RNA Pol II, CTCF and RAD21) and histone marks in multiple cell types [35,36]. We estimated fold enrichment of ChIA-PET interactions in the L-HiC-Reg-based interactions in each of the 55 cell types (Fig 3A). We did the same for two other computationally predicted resources, JEME [14] and GeneHancer [27]. GeneHancer is a network of enhancer-promoter interactions that scores potential enhancers and their interactions based on evidence from various publicly available datasets, such as expression Quantitative Trait Loci (eQTL) studies, functional annotation and chromatin accessibility. JEME uses a classification approach that incorporates information from multiple cell types to predict an interaction between a promoter and candidate enhancers. ChIA-PET interactions from all the 10 datasets were enriched (fold enrichment > 1) in the L-HiC-Reg, JEME and GeneHancer predictions (Fig 3A). High fold enrichment also correlates with significant P-values computed from a HyperGeometric test (S4 Fig). Additionally, L-HiC-Reg had a higher fold enrichment compared to JEME and GeneHancer in 7 out of 10 datasets (blue and red asterisk, Fig 3A). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 3. Performance of long-range interaction prediction approaches against experimentally derived datasets. A. Fold enrichment of JEME (cyan), GeneHancer (red) and L-HiC-Reg (purple) against gold standard ChIA-PET datasets. Blue asterisks indicate where L-HiC-Reg performed better than GeneHancer. Red asterisks indicate where L-HiC-Reg performed better than JEME. Predictions for each cell type from L-HiC-Reg and JEME were compared against each ChIA-PET dataset shown as a separate panel. GeneHancer is not cell type specific and had single set of predictions that were compared against all ChIA-PET datasets. Enrichment was calculated separately for each chromosome. The box plot shows the distribution of the enrichment values for each chromosome for a pair of predicted and ChIA-PET interactions B. Fold enrichment against capture Hi-C data. The predictions from L-HiC-Reg and JEME were compared to interactions from the capture Hi-C dataset matched by cell type. Enrichment was calculated separately for each chromosome. Colors correspond to different methods as in A. C. Comparing the fold enrichment of pairs of computational approaches GeneHancer and L-HiC-Reg, JEME and L-HiC-Reg and JEME and GeneHancer in matched cell types. D. Expression of genes (RPKM) with interactions compared to genes with no interaction in GeneHancer (red), JEME (cyan) and L-HiC-Reg (purple). L-HiC-Reg and JEME were matched for the cell type. https://doi.org/10.1371/journal.pcbi.1011286.g003 We next compared L-HiC-Reg, JEME and GeneHancer predictions to several Capture Hi-C datasets from Jung et al., 2019, which profiled 27 cell types, 9 of which overlapped with the cell types we examined [37]. We compared directly with the matching cell type for all chromosomes. For GeneHancer, which does not predict cell type specific interactions, we considered the same set of interactions for each cell type. L-HiC-Reg interactions exhibit the highest fold enrichment for these datasets compared to JEME and GeneHancer, with the exception of h1 derived mesenchymal stem cells (h1_MSC) for which GeneHancer is better (Fig 3B). We also compared the predicted interactions from each of these three methods, L-HiC-Reg, Genehancer and JEME, to each other. All computational methods were mutually enriched among each other, with L-HiC-Reg exhibiting the highest enrichment in JEME interactions (Fig 3C). This is likely because L-HiC-Reg and JEME are both cell-line specific and random forest-based. The fact that there is a high enrichment among the different computational approaches is encouraging, especially since GeneHancer has a different underlying model. Chromosomal looping is often associated with increased gene expression by bringing enhancers bound by transcription factors in close proximity to the promoters of their target genes. It has previously been experimentally demonstrated that transgenes inserted at interactions demonstrate higher expression values over all genomic distances, indicating that the presence of looping tends to increase transcriptional activity [38,39]. To test this property in our predictions, we linked significant interactions to genes in one or both of the interacting regions. Across the four cell types that were common to JEME and L-HiC-Reg and had RNA-seq data available, L-HiC-Reg linked an average of 10,341 genes over all the cell types, and such genes had a significantly higher expression (t-test P-value < 0.05) than genes that are not associated with significant interactions (Figs 3D and S5). GeneHancer linked 14,467 genes and also had a large difference between the genes with interactions versus genes that did not have interactions (t-test P-value<0.01). JEME linked an average of 5,810 genes across the cell types and the difference in expression between genes with and without interactions was not significant, likely because of the small number of genes linked by JEME. Taken together these results show that our compendium of L-HiC-Reg-based cell type-specific interactions is well supported by existing complementary assays and are of as good or higher quality than other computational predictions. While both JEME and L-HiC-Reg are cell type-specific, L-HiC-Reg produces a greater number of long-range connections to genes.

Leveraging L-HiC-Reg long-range interactions to link regulatory SNPs to genes across diverse cell types We next used our compendium of significant L-HiC-Reg interactions to link non-coding SNPs to genes in a cell-type-specific manner across the 55 cell types. We downloaded the NHGRI-EBI GWAS catalog, containing curated GWAS SNPs for thousands of phenotypes [1]. For a given GWAS, we defined a non-coding SNP as one that was labeled as intergenic and identified all SNPs in Linkage Disequilibrium (LD) with it (R2 = 0.8), resulting in a total of 23,116 GWAS non-coding SNPs and 364,893 SNPs in LD (Methods). We mapped all non-coding SNPs and SNPs in LD to genes using L-HiC-Reg significant interactions across all 55 cell types (Methods). Of the total 23,116 GWAS SNPs, we were able to map 35.16% to genes across our 55 cell types with an average 7,450 genes and around 30k interactions across cell types (Fig 4A). Of the 364,893 LD SNPs, L-HiC-Reg mapped 27.9% to an average of 9,549 genes in any cell type with ~40k interactions. In comparison, both JEME and GeneHancer mapped fewer proportions of SNPs (24.47% with JEME and 28.4% using GeneHancer). Similarly, for the SNPs in LD, L-HiC-Reg mapped a higher proportion of SNPs to genes (27.19%, S6 Fig) compared to JEME (18.7%) or GeneHancer (21.17%). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 4. A. Distribution plots for all GWAS non-coding SNPs and SNPs in LD that were linked to genes across the 55 cell types (left), genes linked to SNPs (middle) and SNP-Gene pairs (right). B. Number of genes associated with SNPs (row and then column normalized to length one) across phenotypes (rows) and cell types (columns). Each entry is normalized such the rows are of length 1. Rows and columns are reordered using non-negative matrix factorization. C. Precision (left) and recall (right) of L-HiC-Reg, JEME, GeneHancer and nearest neighbor (NN) for eQTL SNP-gene associations as a function of distance (x-axis). D. Contact count and one-dimensional signals centered around the vitiligo-associated SNP rs10876864. The GDF11 gene is highlighted on the Gene of interest track. The first row of the black-white heatmap below represents the presence(black)/absence(white) of the GDF11-rs10876864 interaction. The columns are cell types and tissues. The second row indicates which cell types are skin-related and the bottom row represents which cell types are immune-related. E. Contact count and one-dimensional signals for coronary artery disease-associated SNP rs599839. The PSMA5 gene is highlighted on the Gene of interest track. The first row of the black-white heatmap below indicates the presence or absence of the PSMA5- rs599839 loop, the second row indicates which are the skin cell types and the third row indicates the immune cell types. https://doi.org/10.1371/journal.pcbi.1011286.g004 As SNP-gene associations could indicate regulatory relationships between the SNP and the gene by impacting its expression, we leveraged expression QTL (eQTL) datasets from the GTEx database [40] to assess the SNP-gene associations from each of the methods (Methods). We considered 4149 GWAS and 73662 LD SNPs for 15 phenotypes in the GWAS catalog with the largest number of SNPs (Table 1). The number of genes linked by L-HiC-Reg varied between 100-2400 across the phenotypes and was relatively stable across cell types (Fig 4B). For all GWAS SNPs in the NHGRI catalog associated with a gene, we calculated the proportion of predicted interactions that are also supported by eQTL SNP-gene associations (precision) and the proportion of eQTL SNP-gene associations in each predicted set (recall). As a baseline, to assess the importance of leveraging long-range interactions to link SNPs to genes, we also found the closest gene (nearest neighbor, NN) based on genomic distance to each GWAS and LD SNP and compared these SNP-gene associations to eQTL associations. To account for the fact that eQTL SNP-gene pairs are more likely to be close together and many of the predictions are between close regions, we stratified these two quantities by distance (Fig 4C). SNP-associations from L-HiC-Reg and GeneHancer had a higher precision than JEME for all distance bins compared, and GeneHancer had a higher precision than L-HiC-Reg for more distance bins (35 vs 16). Based on recall, GeneHancer was generally better than L-HiC-Reg with the exception of associations at 400-600k. Both methods were better than JEME for recall. The higher performance of GeneHancer is not surprising as it uses GTEx data, along with many other resources, to assign a score to an interaction. GeneHancer used GTEx data from versions lower than 6p (<6p) to generate its results, while we used v8 for this analysis. We therefore tested GeneHancer on GTEx data with version greater than 6p and found that its precision and recall decreased (Fig 4C). L-HiC-Reg performed much better than GeneHancer on version greater than 6p (>6p) GTEx data, for precision but still slightly worse in terms of recall. JEME performs well in close-range interactions, but rapidly declines in precision and recall for longer-range interactions. Notably, all computational approaches perform better than the NN approach and, for distances greater than 200kb, very few NN genes are in eQTL (Fig 4C). Similar observations can be made for the set of LD SNPs (S7 Fig). Taken together, these results highlight the importance of leveraging long-range interactions to link SNPs to genes, particularly those predicted at >200kb distances. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Table 1. Statistics for 15 phenotypes analyzed using our framework. Shown are the total number of GWAS non-coding SNPs, total number of SNPs in LD with these SNPs, total number of SNPs (GWAS and LD) linked to genes, total number of genes linked to SNPs and total number of SNP-gene pairs. https://doi.org/10.1371/journal.pcbi.1011286.t001 Next, we manually verified several of the SNP-gene associations that L-HiC-Reg predicted using Capture Hi-C assays [41,42]. We obtained two large-scale Capture-Hi-C datasets that investigated GWAS SNP-gene interactions. One of these focused on blood cell traits and generated a Capture-Hi-C dataset in lymphoblastoid cells [41]. The second study used Capture-Hi-C in macrophages and studied immune and cardiovascular risk variants [42]. In addition to Capture-Hi-C datasets, these studies used a number of computational filters, such as motif disruption and chromatin state signal to identify SNP-gene associations. We filtered these SNP-gene associations based on whether interactions were distal which is defined as any SNP-containing region directly interacting with a probed promoter [41]. In our predictions, we found the SNP rs10876864 to loop to GDF11 which was also identified by Cavalli et al., based on Capture-Hi-C, motif disruption and chromatin signals (Fig 4D). This SNP is associated with vitiligo, a skin disease, and GDF11 is known to be involved in the regulation of skin biology. The black-white heatmap in Fig 4D shows which cell types the loop is present in (top row), which cell types are skin-related (middle row) and which cell types are immune (bottom). The L-HiC-Reg predicted count profile shows a significantly high value for this SNP and GDF11, compared to the background count distribution in one exemplar cell type (keratinocyte, gray shading, Fig 4D). Furthermore, inspection of the one-dimensional signals in the keratinocyte cell type showed a clear association of H3K4me1, an active enhancer mark, between the SNP and the gene. Similarly, we found an interaction between the SNP rs599839, associated with cardiovascular diseases, and the gene PSMA5, that was present among immune cell types (black-white heatmap, Fig 4E). Inspection of the predicted count profile and one-dimensional signals around this SNP in CD4 primary cells again highlighted the high interaction count for this interaction supported by one-dimensional signals such as DNase I and H3k4me3. Overall, these results offer experimental support for our predictions many of which are likely valid regulatory interactions based on the overlap with eQTL studies.

Identifying downstream pathways impacted by regulatory variants Our results so far showed that we can successfully link regulatory variants to genes across diverse cell types. However, phenotypic variation is typically driven by groups of genes that interact in a larger unknown pathway and therefore interpretation of regulatory variation requires us to identify the impact of sequence variants at the pathway level. Systematic identification of pathways affected by a set of SNPs in a specific cell type is challenging because not all pathways are known and the extent to which they are preserved across different cell and tissue types is unknown. To examine the impact of a set of variants at the pathway level, we developed a graph-theoretic framework based on graph diffusion and multi-task graph clustering to simultaneously define gene subnetworks representative of gene pathways across different cell types (Fig 5). We first created a collection of cell-type-specific molecular interaction networks that combined protein-protein interaction networks, promoter proximal and distal transcription factor (TF)-gene interactions across each of the 55 cell types (Methods). The protein-protein interactions were context-unspecific while the TF-gene interactions were cell-type-specific because they were based on DNase I accessible motif instances and our long-range interactions. For a given phenotype, we scored a gene predicted to interact with a SNP based on the average -log(Pvalue) of its L-HiC-Reg interactions with SNPs (Methods). Such genes were called “direct hits.” We used graph diffusion to define additional downstream genes with no direct SNP interactions but with the greatest diffusion signal (top 1%) and created cell-line specific weighted graphs where the edge weight corresponded to the strength of a diffusion signal from the direct hits. We then applied a multi-task graph clustering approach, MUSCARI, to the graphs to identify different gene networks based on their connectivity to the direct hit genes (Fig 5A) [43]. Our multi-task graph clustering approach takes as input a relationship tree between the cell types and leverages these relationships while defining gene subnetworks for each cell type. This approach was found to be more advantageous for inferring gene clusters compared to single task clustering, as well (Fig 5B). We used the similarity of genome-wide H3K4me3 promoter signal between pairs of cell types and performed hierarchical clustering to obtain this tree (S8A Fig). The H3K4me3 signal-based tree was similar to a tree based on shared long-range interactions between pairs of cell types (S8B Fig) with a significantly high Fowlkes-Mallows index used to measure similarity between two hierarchical clusterings (S8C and S8D Fig). We applied our pipeline to each of the 15 phenotypes in the GWAS catalog with the most non-coding SNPs (Fig 4 and Table 1). The phenotypes include a range of neuronal (autism spectrum disorder and schizophrenia), cardio-vascular (coronary artery disease), and auto-immune disorders (Crohn’s disease) as well as other diseases (breast cancer, Type 2 diabetes). Below we discuss two of these phenotypes, coronary artery disease and breast cancer. The results from our other phenotypes are available at https://regvar-networks.wid.wisc.edu/. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 5. A. Overview of our Multi-task Graph Clustering (MTGC) method. Genes are scored based on their significant interactions with SNP-containing regions and mapped to a physical molecular interaction network. Two stage graph diffusion is performed to obtain a fully connected adjacency matrix with edge weights corresponding to the effect of SNP from one gene to another. The input into the multi-task graph clustering approach is a matrix for each cell type, the number of clusters and relationship tree for the cell types. The outputs are the matched gene clusters (dashed purple and magenta groups) corresponding to gene networks for each cell type. B. Comparison of the quality of clusters with Davies-Bouldin index for single task spectral clustering and our multi-task clustering approach (lower values are better) in the breast cancer phenotype for all cell lines and tissues. https://doi.org/10.1371/journal.pcbi.1011286.g005

Multi-task graph clustering for identifying cell type-specific subnetwork targets of non-coding SNPs in CAD Coronary artery disease (CAD) is a type of cardio-vascular disease associated with thickening of arterial walls and is a major cause of disease in developed countries [44]. Recent GWAS of CAD have shown a significant portion of associated SNPs to lie in non-coding regions of the DNA [45,46]. Using L-HiC-Reg we mapped a total of 193 SNPs (141 GWAS and 52 in LD) of total 407 SNPs to 1036 genes across the 55 cell types with a total of 1737 interactions (Fig 6A and Table 1). Among the cell types that had the largest number of genes were fetal heart, skin and immune cell types like CD4 and CD34 (Fig 6A). After graph diffusion and selecting the top 1% of genes with the highest diffused score in each cell type, we had a total of 1866 genes across the cell types (1449-1821 genes in any cell type, S1 Data). We determined the optimal number of clusters (k) for MUSCARI based on modularity of clusters, each cluster corresponding to a gene subnetwork, defined across the 55 cell types and found k=7 to be best. We tested the clusters for enrichment of Gene Ontology (GO) terms in each of the clusters for the 55 cell types and analyzed the enrichment patterns with non-negative matrix factorization (NMF) to identify sets of GO terms associated with groups of cell types/MUSCARI clusters. NMF provided a faithful low-dimensional representation of the enrichment patterns with an explained variance of 0.97. Using NMF factor scores we selected the top GO terms and cell types/MUSCARI cluster for each NMF cluster (S9 Fig and Methods). The clusters exhibited enrichment in diverse processes including immune cell differentiation, heart development, and protein ubiquitination. (Figs 6B and S9). In particular, NMF identified a bi-cluster of GO terms relevant to CAD such as blood coagulation, wound healing, platelet activation and regulation of fluid levels that are associated with fetal heart and other muscle tissues in MUSCARI cluster 2 (S9 Fig). Dysfunctional coagulation has been shown to be associated with coronary artery disease [47]. Low mean platelet volumes are associated with worse outcomes for CAD patients [48]. Additionally, fetal heart is a top cell type for cluster 2 (S9 Fig). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 6. Interpreting regulatory variants in Coronary Artery Disease (CAD). A. Distribution of the number of genes linked to SNPs, number of pairs and number of SNPs linked to genes across tissues (left). Number of genes linked to SNPs for each cell type (right). B. Example networks with nodes (genes) colored by their clustering assignment and the major GO enrichment process for each cluster. C. Cluster assignments across the genes (rows) and cell types (columns). Rows are ordered according to cluster assignment in small bowel mucosa (first column), which is indicated in large font on the row groups. Some genes do not maintain their cluster assignment and are referred to as transitioning or differential. Conserved gene sets are those that have the same cluster assignment across all cell types. D. An example transitioning gene set. Rows are the genes and columns are the cell types with the color indicating cluster assignment (top). E. Networks of the genes in the set in D for representative cell types with (CD4, CD56) or without the interactions (heart) around PSMA5 with edge weight represented by edge thickness (bottom). The size of the node represents the node score after graph diffusion. https://doi.org/10.1371/journal.pcbi.1011286.g006 We next examined the gene cluster assignment across the 55 cell types to identify genes that change their cluster assignment across cell types. Such genes are differentially connected across cell types and represent cell-line specific network components. We clustered genes into gene sets based on their gene assignment across cell types (Methods). Of the 1866 genes in the CAD phenotype, 968 genes were in a total of 44 transitioning gene sets and therefore differentially connected. Twelve of the 44 gene sets were associated with SNPs in at least one cell type indicating they were jointly targeted by these SNPs. We examined each of these gene sets for the correspondence between presence-absence patterns SNP-gene interactions across each cell type and the change in cluster assignment. Several of the 44 transitioning gene sets showed concordant changes in cluster assignment and SNP-gene interaction, including #196, #200 and #243 (S10 Fig). In particular, genes in set #243 were in cluster 6 for most of the cell types but transitioned to cluster 3 in the immune cell types (Fig 6D). This gene set included PSMA5 and PSB4 that had a number of CAD SNPs associated with them, specifically in the immune cell lines that transitioned (S10A Fig). This gene set also exhibits differences in the diffused edge weights between immune cell types and non-immune cell types, particularly around the PSMA5 gene (Fig 6D, bottom). The association of PSMA5 with one of the SNPs was also identified by a Capture-HiC experiment [42] (Fig 4E). Most of the genes in the transitioning gene set are in the ubiquitin-proteasome family of genes, which plays an important role in cardiovascular disease as elevated levels of ubiquitin are found in advanced coronary artery plaques [49]. While PSMA5 has been shown before to associate with CAD GWAS SNPs, PSMB4 is a novel prediction from our study. PSMB4 has been shown to inhibit cardiomyocyte apoptosis and is a potential therapeutic target [50]. Genes in transitioning gene set #196 were in cluster 2 for most cell types but transitioned to cluster 1 in fetal brain, fetal heart, fetal skin and mesendoderm cells (S10B Fig). This gene set contains PIM1, which is specifically targeted by SNPs in the transitioning cell types. PIM1 has been implicated in vascular smooth muscle cell proliferation in atherosclerotic plaques and is strongly upregulated in coronary artery disease [51]. Gene set #200 transitions to cluster 3 in a range of skin-related cell types such as fibroblasts, keratinocytes and melanocytes, as well as heart (S10C Fig). CAD-associated SNP rs6006426 specifically targets LIF in many of these cell types. LIF has been shown to be protective to the myocardium under various stressors such as ischemia, indicating it could be a therapeutic target in various heart diseases [52]. Taken together, our analysis of CAD GWAS SNPs demonstrates the utility of our network-based interpretation framework to pinpoint specific gene subnetworks that could be targets of different regulatory SNPs. These subnetworks included known and novel genes for CAD and represent plausible candidates for future validation studies.

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011286

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/