(C) PLOS One

This story was originally published by PLOS One and is unaltered.

scapGNN: A graph neural network–based framework for active pathway and gene module inference from single-cell multi-omics data [1]

['Xudong Han', 'State Key Laboratory Of Reproductive Medicine', 'Offspring Health', 'School Of Medicine', 'Southeast University', 'Nanjing', 'Department Of Histology', 'Embryology', 'Nanjing Medical University', 'Bing Wang']

Date: 2023-11

Although advances in single-cell technologies have enabled the characterization of multiple omics profiles in individual cells, extracting functional and mechanistic insights from such information remains a major challenge. Here, we present scapGNN, a graph neural network (GNN)-based framework that creatively transforms sparse single-cell profile data into the stable gene–cell association network for inferring single-cell pathway activity scores and identifying cell phenotype–associated gene modules from single-cell multi-omics data. Systematic benchmarking demonstrated that scapGNN was more accurate, robust, and scalable than state-of-the-art methods in various downstream single-cell analyses such as cell denoising, batch effect removal, cell clustering, cell trajectory inference, and pathway or gene module identification. scapGNN was developed as a systematic R package that can be flexibly extended and enhanced for existing analysis processes. It provides a new analytical platform for studying single cells at the pathway and network levels.

Funding: This work was supported by the National Key R&D Program of China (2021YFC2700200 to XG), the Chinese National Natural Science Foundation (Grants No. 82221005 to XG, 81971439 to XG, 82001611 to YL, 31871164 to HZ, 82071702 to HZ) and the fund from Health Commission of Jiangsu Province (M2020071 to YL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: All relevant data are within the paper and its Supporting information files. scapGNN has been implemented as an R package that is freely available from CRAN ( https://cran.r-project.org/web/packages/scapGNN/index.html ), GitHub ( https://github.com/XuejiangGuo/scapGNN ), and FigShare ( https://figshare.com/articles/software/scapGNN/23734017 ). The R packages for scapGNN and related scripts are also available from Zenodo ( https://doi.org/10.5281/zenodo.8322402 ).

Hence, we propose scapGNN, a unified graph neural network (GNN)-based framework that infers and reconstructs gene–cell, gene–gene, and cell–cell associations in order to transform sparse single-cell profile data into a stable gene–cell association network. scapGNN further integrates single-cell multi-omics data, calculates single-cell pathway activity scores, and identifies cell phenotype–associated gene modules by quantifying network information. We benchmarked scapGNN on real and simulated single-cell datasets and found that it outperformed state-of-the-art methods in multiple single-cell data analysis tasks.

With the development of deep learning techniques, many methods have been developed to extract low-dimensional features from high-dimensional single-cell data and to integrate single-cell multi-omics data in a low-dimensional space. Cobolt constructs a multimodal variational autoencoder based on a hierarchical Bayesian generative model that projects single-cell multi-omics data into a shared latent space for visualization and clustering [ 20 ]. Single-cell Deep learning model for ATAC-Seq and RNA-seq Trajectory integration (scDART) is a deep learning framework that compresses scRNA-seq and scATAC-seq data into a shared space and aligns cells according to trajectories [ 21 ]. Graph-linked unified embedding (GLUE) also enables single-cell multi-omics data integration by encoding cells into a latent space [ 22 ]. Compared with the previous 2 methods, GLUE introduces a knowledge-based guidance graph via a graph autoencoder and extracts gene features to correct the alignment of cells in the latent space. However, these methods share a common limitation: they all align cells from different omics data within a latent space. While this facilitates cell clustering and annotation, the latent space is difficult to interpret biologically, which makes it hard to extract deeper mechanisms from the data. Also, Cobolt and scDART rely on shared information, leading to data loss. They extract only low-dimensional features of cells and do not process genes, ignoring potential relationships between genes. scDART aligns cells in low-dimensional space according to cell trajectories, which may not be applicable to single-cell data without differentiation trajectories. GLUE relies on predefined knowledge-based guidance graphs, such as protein interaction networks, which introduce noise from outside the single-cell data. Some statistical frameworks, such as multi-omics factor analysis (MOFA2) and a nonnegative matrix factorization algorithm (UINMF), are also designed to integrate single-cell multi-omics data [ 23 , 24 ].
MOFA2 builds on the Bayesian group factor analysis framework to infer a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the global sources of variability. UINMF derives a nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. These methods still compress data into low-dimensional features for data integration, which still fails to explain the biological mechanisms in single-cell multi-omics data. Integrating multi-omics data at the pathway and gene module levels enables a comprehensive study of complex biological processes, highlights the interrelationship of relevant biomolecules and their functions, and can mine potential biological mechanisms that cannot be discovered by single-omics data [ 25 , 26 ]. Nevertheless, a gap remains in inferring active pathways and cell phenotype–associated gene modules supported by single-cell multi-omics data.

Moreover, these methods are restricted to predefined pathways or gene sets. Gene modules, serving as building blocks of complex biological networks, are structural subnetworks that exhibit the same organizational patterns or functions [ 15 , 16 ]. Module-based analyses can achieve a higher-level understanding of the design and organization of biological systems. Genomap is an entropy-based cartography method that converts high-dimensional single-cell gene expression data into a configured image format and discovers cell-specific gene sets [ 17 ]. It can compute cell type–specific gene importance scores by constructing the class activation map. However, it does not evaluate the significance level of the cell type–specific gene importance. Identifying gene modules autonomously and efficiently based on cell phenotype information is conducive to understanding the mechanism of cell state transitions and the regulation of different cell phenotypes [ 18 , 19 ].

These methods still have limitations that make it difficult to mine information from single-cell data. AUCell depends on the ranked list of genes, which allows it to identify only a few pathways associated with top genes at a time. Pagoda2 only focuses on the first principal component, leading to data loss. UniPath needs to construct the null background model for different species, affecting the scalability of the method. Meanwhile, the completeness of the null background model directly affects its performance. Furthermore, these methods do not make inferences about genes with dropout events for the scRNA-seq data with many zero values [ 9 ]. Besides, AUCell and Pagoda2 are designed to perform pathway analysis only for single-cell transcriptome data. UniPath proposes a corresponding pathway enrichment method that uses the hypergeometric or binomial test for single-cell ATAC sequencing (scATAC-seq) data, although it still relies on the background distribution.
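To make the hypergeometric test mentioned above concrete, the following is a minimal sketch of a generic over-representation test. It is not UniPath's implementation; the function name, the argument names, and the toy numbers are assumptions chosen for illustration. It returns the probability of seeing at least the observed overlap between a pathway and a selected gene (or peak-linked gene) set when the selection is drawn at random from the background.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: number of genes in the background (hypothetical input)
    n_pathway:  number of background genes annotated to the pathway
    n_selected: number of genes selected in the cell (e.g. accessible genes)
    n_overlap:  number of selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    # Number of ways to draw n_selected genes from the background.
    total = comb(n_universe, n_selected)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 3 in the pathway, 3 selected, all 3 overlap.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

Because the p-value depends directly on `n_universe`, the choice of background changes the result, which is one way to see why a method built on such a test remains sensitive to the background distribution.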

Recently, some pathway enrichment methods using single-cell RNA sequencing (scRNA-seq) data, such as AUCell [ 12 ], Pagoda2 [ 13 ], and UniPath [ 14 ], have been proposed to study cellular heterogeneity. For example, AUCell calculates the area under the recovery curve (AUC) score for the pathway in the ranked list of genes for each cell as the pathway activity score. Pagoda2 fits a model to renormalize gene expression profiles and uses the first weighted principal component to quantify pathway activity scores. UniPath models the distribution of gene expression as bimodal and converts nonzero expressions into p-values. It combines the p-values of genes in the pathway and adjusts them as pathway enrichment scores using a common null background model.
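The rank-based scoring idea described for AUCell can be sketched as follows. This is a simplified illustration, not AUCell's actual code: the function name, the `top_n` cutoff, the tie-breaking rule, and the normalisation by the best achievable area are all assumptions. For one cell, genes are ranked by expression, a step curve counts how many pathway genes have been recovered at each rank, and the area under that curve is used as the score.

```python
def recovery_auc(expression, pathway_genes, top_n):
    """Normalised area under the recovery curve for a single cell.

    expression:    dict mapping gene name -> expression value in this cell
    pathway_genes: set of gene names that make up the pathway
    top_n:         number of top-ranked genes to scan (hypothetical cutoff)
    """
    # Rank genes from highest to lowest expression; ties are broken by name
    # so the result is deterministic.
    ranked = sorted(expression, key=lambda g: (-expression[g], g))[:top_n]
    hits = 0
    area = 0
    for gene in ranked:
        if gene in pathway_genes:
            hits += 1
        area += hits  # height of the step curve at this rank
    # Best case: every measured pathway gene sits at the top of the ranking.
    n_ranked = len(ranked)
    n_in = min(len(pathway_genes & set(expression)), n_ranked)
    max_area = sum(min(i, n_in) for i in range(1, n_ranked + 1))
    return area / max_area if max_area else 0.0


# Toy cell with four genes; the pathway {"A", "B"} holds the two top genes.
cell = {"A": 5, "B": 3, "C": 0, "D": 1}
score_high = recovery_auc(cell, {"A", "B"}, top_n=4)
score_low = recovery_auc(cell, {"C", "D"}, top_n=4)
```

The sketch also shows why a score of this kind depends entirely on the ranking: a pathway whose genes fall outside the scanned top of the list contributes nothing to the area.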

A biological pathway is a collection of relationships between genes that lead to a certain product or a change in the biological process in the cell [ 1 ]. Some databases—such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases—have manually grouped interacting or similarly characterized molecules into pathways or gene sets by evidence-supported annotations [ 2 , 3 ]. Biological pathways in distinct cell types have different activation patterns, which facilitates the understanding of cell functions. In single-cell studies, pathway activation analysis has become a powerful approach for the extraction of biologically relevant signatures to uncover the potential mechanisms of cell heterogeneity and dysfunction in human diseases [ 4 , 5 ]. However, the current pathway enrichment analysis methods (e.g., gene set enrichment analysis (GSEA) [ 6 ], single-sample gene set enrichment analysis (ssGSEA) [ 7 ], and gene set variation analysis (GSVA) [ 8 ]) developed for bulk RNA-seq data have been reported to be inappropriate for single-cell sequencing data [ 9 , 10 ]. Compared with RNA-seq data obtained from bulk cell populations, single-cell sequencing data are much sparser, noisier, and lower in library size due to the particular sequencing techniques and experiment protocols [ 11 ]. These seriously compromise the accuracy and integrity of gene-level analyses in single-cell data [ 1 ]. Hence, an efficient method is urgently needed to parcel out the pathway activity of individual cells.

Results

Application of scapGNN to scATAC-seq data

Besides scRNA-seq, scapGNN could also process single-cell epigenome data. We used the mouse cortical brain dataset and the peripheral blood mononuclear cell (PBMC) dataset, covering 2 different species (mouse and human), to evaluate the performance of scapGNN on scATAC-seq data (S5 Table). scapGNN maintained high-accuracy cell clustering and pathway identification performance, which was robust to different strengths of dropout noise (Fig 5A–5C and S18A Fig). For the scATAC-seq data, scapGNN could also stably identify cell-intrinsic active pathways in combination with different cell types (S18B Fig). We next used scapGNN to identify active pathways in the scATAC-seq data of the PBMC dataset. The results of cell type–specific marker gene set identification showed that UniPath, using binomial and hypergeometric tests for pathway enrichment, performed similarly to scapGNN only on monocytes but failed on natural killer cells and naive CD8+ T cells (Fig 5D). For the T-cell receptor signaling pathway, which is known to be active in T cells, we found that scapGNN identified the active pathway in T cells more accurately (Fig 5E). The T-cell receptor signaling pathway had a higher pathway activity score and could be successfully identified using Seurat as a marker pathway for T cells (S19A and S19B Fig). We also evaluated the stability of scapGNN pathway scoring in scATAC-seq data. scapGNN could still consistently identify the T-cell receptor signaling pathway among the top 5 pathways of T cells (S20 Fig). We tested the robustness of scapGNN on scATAC-seq data by adding different strengths of dropout noise to the PBMC dataset. The results showed that scapGNN and the hypergeometric test method of UniPath were highly robust to the dropout noise of the scATAC-seq data (Fig 5F). Thus, scapGNN performed well in the analysis of pathway activities for scATAC-seq datasets.
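The two evaluation ideas used in this section, adding dropout noise and counting how often a known pathway appears among a cell's top-scoring pathways, can be sketched as below. This is an illustrative sketch only: the text does not specify how dropout noise was generated, so the random zeroing of nonzero entries is an assumption, and the function names, the seed, and the toy scores are hypothetical.

```python
import random


def add_dropout(matrix, rate, seed=0):
    """Return a copy of `matrix` with each nonzero entry zeroed with probability `rate`.

    Assumed noise model: independent random zeroing of nonzero values.
    """
    rng = random.Random(seed)
    return [
        [0 if value != 0 and rng.random() < rate else value for value in row]
        for row in matrix
    ]


def top_k_detection_rate(scores, target, k):
    """Fraction of cells whose k highest-scoring pathways include `target`.

    scores: list with one dict per cell, mapping pathway name -> activity score
    """
    detected = 0
    for cell_scores in scores:
        # Sort pathways by descending score; ties are broken by name.
        top = sorted(cell_scores, key=lambda p: (-cell_scores[p], p))[:k]
        detected += target in top
    return detected / len(scores)


# Toy pathway scores for two cells (values are made up for illustration).
toy_scores = [
    {"TCR signaling": 0.9, "Pathway X": 0.1, "Pathway Y": 0.5},
    {"TCR signaling": 0.2, "Pathway X": 0.8, "Pathway Y": 0.5},
]
rate_top1 = top_k_detection_rate(toy_scores, "TCR signaling", k=1)
rate_top3 = top_k_detection_rate(toy_scores, "TCR signaling", k=3)
```

A robustness check of the kind reported here would apply `add_dropout` to the input matrix at increasing rates, recompute the pathway scores with the method under test, and track how the detection rate changes.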

Fig 5. Performance of scapGNN on scATAC-seq data. (A) UMAP visualization of mouse cortical brain dataset using pathway activity score matrix of scapGNN. (B) Bar graph of the 4 cell clustering accuracy indicators for pathway activity score matrix of scapGNN on mouse cortical brain dataset. (C) AUC of 4 cell clustering accuracy indicators for pathway activity score matrix of scapGNN on mouse cortical brain dataset with dropout noise of different strengths. The proportion of cells that detected the corresponding correct cell type marker gene sets (D) and the proportion of T cells that detected the T-cell receptor signaling pathway (E) in the top 1 to 5 of the pathway scores on the PBMC dataset. (F) Robustness evaluation of the scapGNN in correctly detecting the marker gene set of monocytes with different dropout rates on the PBMC dataset. The data underlying this figure can be found in S4 Data. ARI, adjusted rand index; AUC, area under the recovery curve; NMI, normalized mutual information; PBMC, peripheral blood mononuclear cell; scATAC-seq, single-cell ATAC sequencing; SW, silhouette width; UMAP, Uniform Manifold Approximation and Projection. https://doi.org/10.1371/journal.pbio.3002369.g005
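Of the clustering accuracy indicators named in the caption, the adjusted rand index (ARI) is simple enough to compute by hand. The following is a minimal sketch of the standard pair-counting formula, included only to show what the indicator measures; it is not the evaluation code used for the figure, and the function name and toy labels are assumptions. It expects at least 2 cells.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted rand index between two labelings of the same cells (needs >= 2 cells)."""
    n = len(labels_true)
    # Contingency counts: cells sharing both a true label and a predicted label.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in true_counts.values())
    sum_b = sum(comb(c, 2) for c in pred_counts.values())
    # Agreement expected by chance, and the maximum achievable agreement.
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)


# Identical partitions with swapped label names, and a maximally mixed one.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_mixed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 when the clustering reproduces the reference cell types exactly, regardless of how the clusters are named, and falls to around 0 or below when agreement is no better than chance.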

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002369

Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
