(C) PLOS One

This story was originally published by PLOS One and is unaltered.

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration [1]

Cecilia Wieder, Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Faculty of Medicine, Imperial College London, London, United Kingdom

Date: 2024-04

As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. PathIntegrate is available as an open-source Python package.

Omics data, which provides a readout of the levels of molecules such as genes, proteins, and metabolites in a sample, is frequently generated to study biological processes and perturbations within an organism. Combining multiple omics data types can provide a more comprehensive understanding of the underlying biology, making it possible to piece together how different molecules interact. There exist many software packages designed to integrate multi-omics data, but interpreting the resulting outputs remains a challenge. Placing molecules into the context of biological pathways enables us to better understand their collective functions and understand how they may contribute to the condition under study. We have developed PathIntegrate, a pathway-based multi-omics integration tool which helps integrate and interpret multi-omics data in a single step using machine learning. By integrating data at the pathway rather than the molecular level, the relationships between molecules in pathways can be strengthened and more readily identified. PathIntegrate is demonstrated on Chronic Obstructive Pulmonary Disease and COVID-19 metabolomics, proteomics, and transcriptomics datasets, showcasing its ability to efficiently extract perturbed multi-omics pathways from large-scale datasets.

Funding: CW, TE - This research was funded in whole, or in part, by the Wellcome Trust [222837/Z/21/Z]. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. TE acknowledges partial support from BBSRC grants BB/T007974/1 and BB/W002345/1. RPJL was supported by a UK MRC fellowship (MR/R008922/1) which is part of the EDCTP2 programme supported by the European Union and a NIH-NIAID grant (R01 AI145436). JC is supported by a state-funded PhD contract (MESRI (Minister of Higher Education, Research and Innovation)). FJ - This research was funded by the Agence Nationale de la Recherche (ANR, French National Research Agency)—MetaboHUB, the national metabolomics and fluxomics infrastructure (Grant ANR-INBS-0010). KK, RB - Research reported in this publication was supported by the National Heart Lung Blood Institute of the National Institutes of Health to KK and RB under award number R01HL152735. This work was supported by NHLBI grants U01 HL089897 and U01 HL089856 and by NIH contract 75N92023D00011. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The funders did not play any role in the study design, data collection, analysis, or publication of this work.

Data Availability: The COVID dataset is publicly available from Mendeley Data (https://data.mendeley.com/datasets/tzydswhhb5/5). The COPDGene multi-omics data can be found at the following sources: Clinical Data and SOMAScan data are available through COPDGene (https://www.ncbi.nlm.nih.gov/gap/, ID: phs000179.v6.p2). RNA-Seq data is available through dbGaP (https://www.ncbi.nlm.nih.gov/gap/, ID: phs000765.v3.p2). Metabolon data is available at Metabolomics Workbench (https://www.metabolomicsworkbench.org/, ID: PR000907). PathIntegrate is available via the open-source PathIntegrate Python package (www.github.com/cwieder/PathIntegrate). Tutorials and documentation for PathIntegrate can be found at https://cwieder.github.io/PathIntegrate. Source code for benchmarking and applications can be found at https://github.com/cwieder/PathIntegrate_scripts.

Copyright: © 2024 Wieder et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PathIntegrate consists of two supervised learning frameworks for pathway-based multi-omics integration: PathIntegrate Single-View, which produces a multi-omics pathway-transformed dataset and applies a classification or regression model to the data, and PathIntegrate Multi-View, which uses a multi-block partial least squares regression (MB-PLS) model to model interactions between pathway-transformed omics datasets. Note that both PathIntegrate Multi-View and Single-View are multi-omics integration methods, and here we use the terms ‘Multi’ and ‘Single’ to refer to the type of predictive model applied (multi-view or single-view [38]). As both these frameworks rely on pathway transformation (ssPA) of the input omics data, we first demonstrate the ability of univariate methods to detect pathway signals at higher power than molecular-level signals in low signal-to-noise scenarios. We then show that PathIntegrate models can precisely detect enriched pathways even at low effect sizes, as well as use this information to accurately classify samples. PathIntegrate was benchmarked against DIABLO [11], a popular multi-omics integration tool with a similar predictive framework, but which does not use pathway transformation. Finally, we showcase the benefits of using PathIntegrate to interpret complex data using case studies on Chronic Obstructive Pulmonary Disease (COPD) and COVID-19 multi-omics datasets, illustrating the ability of the method to identify important and relevant pathway signatures. The PathIntegrate Python package is freely available at https://github.com/cwieder/PathIntegrate, and is designed to be compatible with many scikit-learn [39] functions, enabling fast and efficient model optimisation and evaluation. PathIntegrate models are fitted in minutes and can run on a laptop with standard hardware (e.g. 8 GB RAM, 1.4 GHz processor).

A) Pathways are represented as sets of molecules, e.g. genes, proteins, and metabolites. B) Pathway transformation by ssPA facilitates a change of dimension of an omics dataset from a molecular space to a pathway space. C) This transforms a sample-by-molecule expression or abundance matrix to a sample-by-pathway matrix, where values represent the ‘activity’ of each pathway for each individual sample.

In this work we introduce PathIntegrate, a modelling framework and corresponding Python toolkit to facilitate pathway-based multi-omics integration. PathIntegrate employs single-sample pathway analysis (ssPA) approaches (Fig 1), which transform molecular-level abundance data matrices into pathway-level matrices by using summarisation approaches (e.g. principal component analysis (PCA)) to condense molecular-level measurements into pathway scores for each individual sample in a dataset [33–37]. By using pathway-transformed multi-omics datasets as input to multivariate supervised models, multi-omics data can be integrated at the pathway level, providing the user with a range of outputs including i) interpretation of multi-omics pathways associated with the outcome, ii) prediction of outcomes, iii) contribution of each omics view to the model and prediction (in the case of multi-view models), and iv) projection of the multi-omics data to a lower dimensional space (in the case of latent variable models). An inherent challenge in multi-omics integration is the heterogeneity between omics datatypes, both in terms of the number of features profiled and the range of numerical values. PathIntegrate substantially contributes to addressing this with the pathway-transformation step, where disparate omics datasets are brought to a common scale, i.e. in terms of pathway ‘activity’. Compared to their molecular-level counterparts, pathway-based multi-omics integration models can provide a more parsimonious model when there are fewer input pathways than molecules, while also enabling the detection of multiple small, correlated signals that may not be detected in the molecular-level data. Moreover, pathway-based modelling could increase robustness to data noise by maximising biological variation and simultaneously reducing technical variation [29].
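The pathway transformation described above can be illustrated with a minimal sketch. It assumes the simplest PCA-style summary (the first principal component of each pathway's standardised molecules) and is not the authors' ssPA package; the function name `pca_pathway_scores`, the `min_size` parameter, and the toy identifiers are hypothetical.

```python
import numpy as np

def pca_pathway_scores(X, molecule_ids, pathways, min_size=2):
    """Turn an N x M abundance matrix into an N x P matrix of pathway scores.

    Each pathway score is the first principal component (PC1) of the
    standardised columns of X that belong to the pathway.
    Assumes every column of X has non-zero variance.
    """
    col_index = {m: j for j, m in enumerate(molecule_ids)}
    # Standardise each molecule to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    names, scores = [], []
    for name, members in pathways.items():
        cols = [col_index[m] for m in members if m in col_index]
        if len(cols) < min_size:
            continue  # too few measured molecules to summarise
        # SVD of the centred block: U[:, 0] * S[0] gives the PC1 score per sample
        U, S, _ = np.linalg.svd(Z[:, cols], full_matrices=False)
        names.append(name)
        scores.append(U[:, 0] * S[0])
    return names, np.column_stack(scores)

# Toy data: 20 samples, 6 molecules, 3 hypothetical pathways
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
molecule_ids = ["m1", "m2", "m3", "m4", "m5", "m6"]
pathways = {
    "pathway_A": ["m1", "m2", "m3"],
    "pathway_B": ["m4", "m5", "m6", "m7"],  # m7 is not measured
    "pathway_C": ["m6"],                    # dropped: fewer than min_size molecules
}
names, A = pca_pathway_scores(X, molecule_ids, pathways)
```

Here `A` has one row per sample and one column per retained pathway. The sign of a principal component is arbitrary, so scores from a sketch like this are only meaningful up to a sign flip per pathway.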

Pathway analysis (PA) refers to computational methods that have been specifically developed to alleviate the task of analysing long lists of molecules by placing them into a functional context based on curated pathway collections [24]. Generally, conventional PA methods such as over-representation analysis or gene set enrichment analysis use statistical tests to determine which pathways are associated with a phenotype of interest [25, 26]. The output is typically a list of significantly enriched pathways and their associated test statistics and p-values. PA methods are frequently used due to their convenient representation of omics data in the form of pathway descriptors, providing a straightforward interpretation of the biological processes that may contribute to disease phenotypes. Multi-omics pathway analysis is a relatively new but promising area of research [27]. Tools such as MultiGSEA [19], ActivePathways [17], PaintOmics [28], and IMPaLA [16] all leverage multiple layers of biological information to compute enrichment of multi-omics pathways, associated statistical significance levels, and visualisations as an end-result. While highly useful, these methods lack certain desirable features, including the ability to predict outcomes, evaluate model performance, or obtain a representation of the data in a lower-dimensional space. These goals can be achieved by using pathway-based predictive models, which use pathway rather than molecular-level features to model and predict new data, and infer pathway enrichment through feature importance [29–32]. We provide a detailed overview of related methods in supplementary information (Related work in S1 Supporting Information), but to the best of our knowledge, no single method provides predictive, integrative modelling of multi-omics data at the pathway level.

Multi-omics data integration is rapidly becoming a mainstream strategy used to elucidate complex molecular mechanisms in biological systems. Data profiled using diverse modalities, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics, provide complementary insights into the regulation of diverse biomolecules and their cellular functions [1]. Multi-omics data integration can delineate the transition from genotype to phenotype, while providing a more holistic view of a biological system. Despite the promise that multi-omics integration holds, the field itself is relatively young and faces numerous challenges [1–6]. Among these is the question of which method to use, and how to interpret the results. Several review papers categorise multi-omics integration methods according to underlying concepts, models, or intended purposes [7]. The choice of method will depend highly on the desired outcome, which can be broadly split into outcome prediction (e.g. sample stratification) or elucidating molecular mechanisms (though often a combination of these). Studies focused on outcome prediction may leverage integration methods based on kernels or deep learning to optimise predictive performance [8–10], whereas those where the goal is hypothesis generation may opt for more explainable models using classical supervised [11, 12] or unsupervised learning approaches [12–15], joint pathway analysis [16–19], network models [12, 20], or Bayesian statistics [7]. The latter ‘hypothesis generation’-based analysis, regardless of the method used, will often output results in the form of lists of molecules (i.e. genes, proteins, metabolites), typically ranked by their contribution to the model. Depending on the parameters and outputs of the model, the end-user may have multiple latent variables [13], clusters [21, 22], or networks [23] composed of many molecules (genes, proteins, and metabolites) to analyse. Doing so is not only time consuming but requires expert domain knowledge to place biomolecules into a functional context.

Results

Pathway transformation increases sensitivity to coordinated, low signal-to-noise biological signals

Aside from improvements in interpretability, we hypothesized that pathway-based modelling or transformation of data can also provide increased sensitivity in detection of pathway signals in the data, particularly in low signal-to-noise scenarios. By combining abundance levels of correlated individual molecules within a pathway, we anticipate that statistical methods will be able to detect the pathway signal with higher power than individual molecular signals alone. Throughout this work, we refer to ‘molecular-level’ models as those with individual molecular entities (such as genes, proteins, and metabolites) as input features, as opposed to ‘pathway-level’ models, which take ssPA pathway-transformed data as input and hence features represent a combination of molecules in each pathway. Briefly, ssPA methods require an N×M matrix X of molecules as input and combine the abundance values of molecules in a set of predefined pathways to provide an N×P pathway-level matrix A, where features represent pathways and each sample has an ‘activity score’ for each pathway (see Methods). The use of ‘semi-synthetic’ data, in which artificial biological signals are inserted into experimental multi-omics data, provides us with a ground truth we can use to benchmark methods throughout this work [33]. We used semi-synthetic multi-omics (metabolomics and proteomics) data derived from COPD and COVID-19 studies (see Methods) to examine whether pathway transformation of multi-omics data allowed pathway signals to be detected by univariate analysis (Mann-Whitney U (MWU) tests) at higher power than individual molecular signals (Fig 2 and Fig C in S1 Supporting Information). Each omics dataset was transformed to the pathway level using the kPCA ssPA method [33] (see Methods). At each realisation of the simulation, repeated for each Reactome pathway accessible in the datasets, we enriched all the molecules in the pathway (metabolites and/or proteins) in the simulated disease group for a range of effect sizes, corresponding to the range of log2 fold changes observed in the original datasets (Fig A and Fig B in S1 Supporting Information).
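As a rough illustration of one simulation realisation, the sketch below inserts an artificial signal into a single target pathway. The details are assumptions rather than the procedure in Methods: samples are split at random into two equal groups, abundances are taken to be on a log2 scale, and enrichment is an additive shift equal to the effect size. The function name `add_pathway_signal` and the toy identifiers are hypothetical.

```python
import numpy as np

def add_pathway_signal(X, molecule_ids, pathway_members, effect_size, rng):
    """Return (X_sim, y): a copy of X with one enriched pathway, plus group labels."""
    n = X.shape[0]
    # Assumption: half of the samples are drawn at random as the simulated disease group
    y = np.zeros(n, dtype=int)
    y[rng.choice(n, size=n // 2, replace=False)] = 1
    members = set(pathway_members)
    cols = [j for j, m in enumerate(molecule_ids) if m in members]
    X_sim = X.copy()
    # Assumption: on log2-scale data an additive shift acts as a log2 fold change
    X_sim[np.ix_(y == 1, cols)] += effect_size
    return X_sim, y

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
molecule_ids = ["a", "b", "c", "d", "e"]
# "zz" is in the pathway definition but not measured, so it is ignored
X_sim, y = add_pathway_signal(X, molecule_ids, ["b", "d", "zz"], 0.5, rng)
```

Only the pathway's measured molecules in the simulated disease group change, so the enriched pathway is known exactly and can serve as ground truth when comparing detection methods.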

Fig 2. Pathway transformation enhances sensitivity to low signal-to-noise signals. The y-axis shows the proportion of MWU tests significant at Bonferroni p ≤ 0.05, performed either on the pathway-level data or the molecular-level data, at varying effect sizes shown on the x-axis. Semi-synthetic data based on COVID-19 dataset. https://doi.org/10.1371/journal.pcbi.1011814.g002

We applied MWU tests to detect differences between the simulated phenotype groups based on the enrichment of each of the individual molecules in the molecular-level data or ssPA scores of the target pathway itself. For the molecular-level simulation, we applied Fisher’s method to combine p-values in the target pathway if at least 50% of constituent molecules were significant (p ≤ 0.05), otherwise the combined p-value was set to 1. Encouragingly, at lower effect sizes (i.e. 0.25–0.55), we observed a higher proportion of significant p-values in the pathway-transformed data than in the molecular-level data. The same trends were observed irrespective of the dataset used to create the simulation (Fig 2 and Fig C in S1 Supporting Information). This suggests that pathway-transformation approaches could improve the detection of low signal-to-noise, correlated signals in multi-omics datasets, and motivates the use of PathIntegrate models in the remainder of this work, which use ssPA pathway transformation to enable pathway-based multi-omics integration.
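The two testing routes compared above can be sketched with SciPy. This is an illustrative reading of the description rather than the authors' benchmarking code: the function names are hypothetical, all molecule p-values in the pathway are assumed to enter Fisher's method once the 50% rule is met, and the Bonferroni helper assumes the number of tests is supplied by the caller.

```python
import numpy as np
from scipy.stats import combine_pvalues, mannwhitneyu

def molecular_level_pvalue(X_pathway, y, alpha=0.05, min_fraction=0.5):
    """Combine per-molecule MWU p-values for one pathway with Fisher's method."""
    pvals = []
    for col in X_pathway.T:
        _, p = mannwhitneyu(col[y == 1], col[y == 0], alternative="two-sided")
        pvals.append(p)
    pvals = np.array(pvals)
    # Rule from the text: combine only if at least half the molecules are significant
    if np.mean(pvals <= alpha) < min_fraction:
        return 1.0
    _, combined = combine_pvalues(pvals, method="fisher")
    return float(combined)

def pathway_level_pvalue(scores, y):
    """Single MWU test on the pathway scores of the target pathway."""
    _, p = mannwhitneyu(scores[y == 1], scores[y == 0], alternative="two-sided")
    return float(p)

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

y = np.array([0] * 10 + [1] * 10)
# Three molecules that all increase in the case group
X_shifted = np.column_stack([np.arange(20.0), 2 * np.arange(20.0), np.arange(20.0) + 3])
# Three molecules with identical values in both groups
X_null = np.tile(np.tile(np.arange(1.0, 11.0), 2)[:, None], (1, 3))

p_molecular = molecular_level_pvalue(X_shifted, y)
p_molecular_null = molecular_level_pvalue(X_null, y)
p_pathway = pathway_level_pvalue(X_shifted.sum(axis=1), y)  # row sum as a stand-in score
```

In a simulation like the one described, each p-value would then be Bonferroni-adjusted and the proportion of realisations with an adjusted p ≤ 0.05 recorded for each effect size.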

PathIntegrate: Supervised pathway-based multi-omics integration frameworks

In this study we present and investigate the use of the PathIntegrate modelling frameworks for multi-omics pathway-based integration (Fig 3). PathIntegrate provides two supervised models: Multi-View and Single-View. Both are designed to take two or more (k) N×M sample-by-molecule omics abundance matrices X, as well as a labelled outcome vector y, as input and apply a single-sample pathway analysis transformation (facilitated by our recently published ssPA Python package [33]) before a predictive model is applied to the data. PathIntegrate can model both continuous and binary outcomes using regression and classification models respectively, but for simplicity we have demonstrated it using binary (e.g. case-control) outcomes throughout this work. Both frameworks achieve the same key outcomes: i) using pathway scores to predict an outcome, and ii) ranking multi-omics pathways by importance in the prediction. PathIntegrate Multi-View uses a multi-table integration model and can therefore provide interpretable insights both within and between omics views, whereas PathIntegrate Single-View provides more flexibility on the high-level predictive model applied and can be better tuned towards prediction. Both models use a single set of multi-omics pathways P, where each pathway has a unique identifier and description, and contains a set of molecular identifiers which can either belong to different omics (i.e. metabolites, proteins, and genes) or in some cases only one omics (i.e. only proteins). Using these pathways, PathIntegrate Multi-View computes pathway scores on each omics view separately, whereas Single-View computes them from multi-omics data.
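The difference between the two scoring routes can be sketched on toy data. This is a minimal illustration, not the PathIntegrate API: a simple z-score summary (sum of member z-scores divided by the square root of the pathway size) stands in for the ssPA transformation, and all view names, identifiers, and the function name `zscore_pathway_scores` are hypothetical.

```python
import numpy as np

def zscore_pathway_scores(X, molecule_ids, pathways):
    """Score each pathway as the sum of member z-scores divided by sqrt(size)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    index = {m: j for j, m in enumerate(molecule_ids)}
    names, cols = [], []
    for name, members in pathways.items():
        idx = [index[m] for m in members if m in index]
        if idx:  # skip pathways with no measured molecules in this matrix
            names.append(name)
            cols.append(Z[:, idx].sum(axis=1) / np.sqrt(len(idx)))
    return names, np.column_stack(cols)

rng = np.random.default_rng(1)
n = 12
# Two hypothetical omics views measured on the same samples
views = {
    "metabolomics": (rng.normal(size=(n, 3)), ["met1", "met2", "met3"]),
    "proteomics": (rng.normal(size=(n, 4)), ["prot1", "prot2", "prot3", "prot4"]),
}
# One shared multi-omics pathway set P
pathways = {
    "P1": ["met1", "met2", "prot1"],
    "P2": ["prot2", "prot3", "prot4"],  # proteins only
    "P3": ["met3", "prot1", "prot4"],
}

# Multi-View route: transform each view separately, giving k pathway-level blocks
multi_view_blocks = {
    view: zscore_pathway_scores(X, ids, pathways) for view, (X, ids) in views.items()
}

# Single-View route: pool molecules across views, then transform once
X_all = np.hstack([X for X, _ in views.values()])
ids_all = [m for _, ids in views.values() for m in ids]
single_names, A_single = zscore_pathway_scores(X_all, ids_all, pathways)
```

In this toy case the metabolomics block contains only P1 and P3, because P2 has no metabolites, while the pooled matrix scores all three pathways from their combined members.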

Fig 3. PathIntegrate Multi-View (left) and Single-View (right) modelling frameworks for multi-omics pathway-based integration. Frameworks are outlined in terms of their input data, pathway-transformation stage, statistical model, and outputs. Blue data blocks represent omics data which has been transformed from the molecular (X, N×M) space to the pathway (A, N×P) space using ssPA. Both Single-View and Multi-View make use of the same multi-omics pathway set. https://doi.org/10.1371/journal.pcbi.1011814.g003

PathIntegrate Multi-View uses a multi-block partial least squares (MB-PLS) latent variable model to integrate ssPA-transformed multi-omics data. Each omics block is transformed to the pathway level individually and the resulting k blocks are used as input to the MB-PLS model. This preserves the block structure of each omics view and importantly allows users to compute how much each view contributes to the prediction of the outcome variable y, as well as extract within- and between-omics level results such as pathway importances and latent variable representations (scores and superscores [40–42]). Importantly, the latent variable model used by Multi-View enables extraction of orthogonal biological effects, similar to PCA, possibly capturing contrasting processes. Furthermore, such models are ideal for pathway-level data, where there is expected to be a high degree of overlap and co-linearity which is accounted for by the PLS framework.

PathIntegrate Single-View begins by computing multi-omics pathway scores by performing ssPA transformation on molecular abundance or expression profiles obtained across multiple omics data blocks (e.g. genes, proteins, and metabolites). A single N×P pathway-level matrix A is returned, in which each feature represents the ‘activity’ of each sample in a multi-omics pathway.
The resulting multi-omics pathway scores are used as input to a predictive model (any scikit-learn compatible model, e.g. partial least squares discriminant analysis (PLS-DA), logistic regression, support vector machine, random forest, etc.). Pathway importances can be obtained using variable selection approaches appropriate for the model used (e.g. Gini impurity for random forests or the β coefficient for regression-based models). By describing and evaluating the two PathIntegrate modelling frameworks we aim to help users select the method best suited to their study design and research questions.
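The modelling step of each framework can be sketched on toy pathway-level data. Neither part is the PathIntegrate implementation. The Single-View part assumes logistic regression and a random forest as example scikit-learn estimators. The Multi-View part is a hand-written first component of a multi-block PLS, assuming block scaling by the square root of the block size and taking each block's share of the squared weight vector as one common way to express its contribution. All names and toy matrices are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60
y = np.repeat([0, 1], n // 2)  # toy case-control labels

# ---- Single-View style: one N x P multi-omics pathway matrix ----
A = rng.normal(size=(n, 8))
A[:, 0] += 2.0 * y  # hypothetical pathway 0 carries the signal
A_std = StandardScaler().fit_transform(A)

logit = LogisticRegression(max_iter=1000).fit(A_std, y)
beta_ranking = np.argsort(-np.abs(logit.coef_[0]))  # rank pathways by |beta|

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(A, y)
gini_ranking = np.argsort(-forest.feature_importances_)  # rank by Gini importance

# ---- Multi-View style: first component of a multi-block PLS ----
def mbpls_first_component(blocks, y):
    """Return the superscore, per-block weights, and each block's weight share."""
    yc = y - y.mean()
    scaled = []
    for B in blocks:
        Z = (B - B.mean(axis=0)) / B.std(axis=0, ddof=1)
        scaled.append(Z / np.sqrt(Z.shape[1]))  # block scaling so large blocks do not dominate
    X = np.hstack(scaled)
    w = X.T @ yc
    w /= np.linalg.norm(w)  # unit-length PLS weight vector
    superscore = X @ w      # one latent score per sample across all blocks
    edges = np.cumsum([0] + [B.shape[1] for B in blocks])
    block_weights = [w[a:b] for a, b in zip(edges[:-1], edges[1:])]
    block_share = np.array([np.sum(wb ** 2) for wb in block_weights])  # sums to 1
    return superscore, block_weights, block_share

# Two toy pathway-level blocks: strong signal in the first, weak in the second
A_metab = rng.normal(size=(n, 4))
A_metab[:, 0] += 2.0 * y
A_prot = rng.normal(size=(n, 6))
A_prot[:, 0] += 0.3 * y
superscore, block_weights, block_share = mbpls_first_component([A_metab, A_prot], y)
```

The rankings and block shares here are computed on the training data only; a real analysis would tune and evaluate the models with cross-validation and would typically extract more than one latent component.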

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011814

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
