(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
Integration of a multi-omics stem cell differentiation dataset using a dynamical model [1]
['Patrick R. Van Den Berg', 'Leiden Institute Of Physics', 'Leiden University', 'Leiden', 'Zuid-Holland', 'The Netherlands', 'Noémie M. L. P. Bérenger-Currias', 'Bogdan Budnik', 'Mass Spectrometry', 'Proteomics Resource Laboratory']
Date: 2023-06
Stem cell differentiation is a highly dynamic process involving pervasive changes in gene expression. The large majority of existing studies has characterized differentiation at the level of individual molecular profiles, such as the transcriptome or the proteome. To obtain a more comprehensive view, we measured protein, mRNA and microRNA abundance during retinoic acid-driven differentiation of mouse embryonic stem cells. We found that mRNA and protein abundance are typically only weakly correlated across time. To understand this finding, we developed a hierarchical dynamical model that allowed us to integrate all data sets. This model was able to explain mRNA-protein discordance for most genes and identified instances of potential microRNA-mediated regulation. Overexpression or depletion of microRNAs identified by the model, followed by RNA sequencing and protein quantification, were used to follow up on the predictions of the model. Overall, our study shows how multi-omics integration by a dynamical model could be used to nominate candidate regulators.
Pluripotent stem cells, which can be derived from an adult individual, can be grown indefinitely in a dish and turned into each cell type of the body. These abilities enable applications of stem cells in basic research and regenerative medicine. Differentiation, the conversion into a precisely defined cell type, typically requires complex protocols that often have low efficiency. A better understanding of the molecular mechanisms underlying differentiation could help us improve existing protocols. Here, we studied the differentiation of embryonic stem cells induced by a small molecule (retinoic acid). We measured the abundances of three important classes of biomolecules–micro RNAs, messenger RNAs and proteins–at multiple time points during a 96 h-long differentiation experiment. We observed changes in the abundances of thousands of molecules. To make sense of these measurements we developed a mathematical model that connects the different classes of biomolecules and aims to predict their dynamics. Such models might help us identify new opportuntities to control differentiation at the molecular level. The data set we created, which we provide through an easily accessible web application, will also be a useful resource for other researchers interested in stem cell biology.
Funding: P. vd B. and S.S. were supported by the Netherlands Organisation for Scientific Research (NWO/OCW), as part of the Frontiers of Nanoscience (NanoFront) program. N.S. was supported by a New Innovator Award from the NIGMS of the NIH under Award number DP2GM123497. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability: The RNA-seq data has been deposited in GEO (ID: GSE93301). A list of the RNA-seq samples can be found in S3 Table . The raw MS data has been deposited in MassIVE (ID: MSV000080461). The processed data can also be mined and accessed through a web application at www.semraulab.com/multi-omics .
For this study, we collected a bulk multi-omics data set of retinoic acid (RA)-driven differentiation of mouse embryonic stem cells (mESCs). Samples taken over a period of 96 h were subjected to: mass spectrometry, bulk RNA-sequencing of nuclear and cytoplasmic fractions, as well as small RNA sequencing to quantify microRNA (miR) abundance. To describe protein dynamics, we refined a birth-death model [ 17 – 22 ] by considering explicitly the cytoplasmic fraction of the mRNA and the influence of certain technical artefacts related to mass spectrometry. In contrast to steady-state models, dynamical models aim to infer kinetic rates for protein synthesis and degradation rather than explain absolute protein levels. Here, we show how such models can in principle be used to nominate candidate regulators of gene expression during stem cell differentiation. By assuming a specific effect of miRs on protein synthesis, we attempted to identify miRs with a potential regulatory function. Finally, we used mimics and inhibitors of several candidate miRs to follow up on the model’s predictions.
Bulk measurements in unperturbed, steady state conditions can only reveal across-gene correlations. Single-cell methods or bulk measurements of dynamic systems, on the other hand, reveal fluctuations across cells or time, respectively, which allows us to study mRNA-protein correlation of individual genes (“within-gene correlation”[ 10 ]). The variability of mRNA-protein ratios in single-cells was found to be influenced by various factors such as protein half-life [ 15 ], as well as phenotypic state or microenvironment [ 16 ]. Within-gene mRNA-protein correlation can also be studied by measuring mRNA and protein abundances at the population level across time in a highly dynamic system, such as differentiating stem cells.
A common first step towards finding novel gene regulatory relationships is the comprehensive, ideally genome-wide, measurement of gene expression dynamics. A large body of work has focused on charting transcriptome changes during differentiation, most recently down to the single-cell level [ 2 – 6 ]. While highly informative, such studies usually make the implicit assumption that mRNA levels are a good proxy for protein levels, despite widespread discordance observed in several mammalian systems [ 7 – 9 ]. The relationship between mRNA and protein abundance has been studied in various systems at different resolutions and time scales. Originally, mRNA-protein correlation was assessed across the genome (“across-gene correlation”[ 10 ]) in cell lines growing in steady state, where population-average abundances were measured with bulk omics methods. In this context, initial estimates claimed that only 40% of protein variability across the genome is explained by mRNA abundance in steady state [ 11 ]. Models of the protein to mRNA ratio explained up to two-thirds of the variability, when sequence features—such as the length of the coding sequence or amino acid frequencies—were taken into account [ 12 ]. Importantly, discordance between mRNA and protein abundance does not immediately imply specific regulation, as technical noise tends to reduce the observed correlation and conventional correction schemes typically ignore the effect of systematic, correlated errors [ 13 ]. In a comparison of relative protein levels across human tissues, about 50% of the variance in protein abundance was ascribed to post-transcriptional regulation [ 14 ]. All in all, a significant amount of protein abundance variability across the genome seems to remain unexplained, even if the effect of technical noise is considered.
Much of the medical potential of pluripotent stem cells is due to their ability to differentiate into all cell types of the adult body [ 1 ]. While tremendous progress has been made in guiding cells through successive lineage decisions, the regulatory mechanisms underlying these decisions often remain unknown, especially at the post-transcriptional level. This gap in knowledge hampers the streamlining and acceleration of differentiation protocols.
Results
Pervasive discordance between mRNA and protein in retinoic acid-driven mESC differentiation We used RA differentiation of mESCs as a generic model for in vitro differentiation. Previously, we characterized this differentiation assay in detail at the transcriptional level by single-cell RNA-seq [2] and showed that RA exposure induces a bifurcation into extraembryonic endoderm-like and ectoderm-like cells. Here, we collected RNA and protein samples during an RA differentiation time course (Fig 1A). For each time point, we quantified poly(A) RNA by RNA-seq and protein expression by tandem mass tag (TMT) labeling followed by tandem mass spectrometry (MS/MS). In total, we obtained RNA and protein abundance estimates for 6271 genes (S1A–S1E Fig) at 8 time points in duplicate. After correction for batch effects due to separate sequencing runs (S1F Fig), we achieved highly similar results for the two biological replicates. To investigate in how far protein expression can be predicted from RNA expression, we started with the simplest conceivable model (termed naive here), which assumes that protein abundance is connected to RNA expression by a constant, gene-dependent scaling factor. This model is justified, if protein synthesis and degradation rates are constant and RNA expression changes slowly on the time scale of protein turnover, thus resulting in a quasi-steady state. Consequently, the protein-to-RNA ratio would be approximately constant over time. To test this model, we scaled both protein and RNA to their respective means, which should result in a constant protein-to-RNA ratio of 1, if the naive model is valid. We observed that for a large fraction of genes, the naive model is inaccurate, resulting in low coefficients of determination (R2) and low correlation (Figs 1B, 1C and S2A). In many cases, R2 assumes negative values, which means that the naive model performs worse than a model predicting a scaled protein level of 1 for all time points. (Note that only for linear regression models R2 is guaranteed to be positive and interpretable as the fraction of variance explained by the model.) For some genes, we even observed significant anti-correlation between RNA and protein (Fig 1B). While technical noise certainly contributes to this result, the high quality of our data sets (S1 Fig) suggests that a substantial part of the observed discordance is of biological origin. The assumptions of the naive model are therefore likely wrong for the majority of genes and a more sophisticated model is necessary to explain the relationship between RNA and protein. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 1. Birth-death models outperform the naive model in predicting dynamics. (A) Schematic overview of RA differentiation time course and subsequent omics measurements. (B) Example fit of the naive model. The naive model is a smoothing spline fit of RNA scaled to match the mean protein expression. (C) R2 distribution of the naive model. (D) Example fit of the total RNA (totRNA) model. (E) R2 distributions of the naïve and total RNA totRNA models. (F) Example fit of the total RNA and cytoplasmic RNA model, replicate 1. (G) R2 distributions of the total RNA and cytoplasmic RNA model. (H) Example fit of the cytoplasmic RNA and ci model, replicate 1. The height of the grey bar indicates the fitted ci parameter. (I) R2 distributions of the cytoplasmic RNA and ci model. Only genes that are improved by the ci model are shown. The distribution of all genes is shown in S2E Fig. (C,E,G,I) Some genes with extremely low R2 values are set to the minimum value of the plot for clarity. Corresponding Pearson’s r distributions are plotted in S2A–S2D Fig.
https://doi.org/10.1371/journal.pgen.1010744.g001
A simple birth-death model explains mRNA-protein discordance for most genes To relax the assumption that expression is in steady state, we next considered a kinetic model that implements a birth-death process for protein dynamics (Eq 1). (1) In this total RNA model, protein production of gene g depends on total mRNA abundance R and the protein synthesis rate k s , while protein degradation depends on protein abundance P and the degradation rate k d . Synthesis and degradation rate are taken to be constant in time but gene-specific (as indicated by the index g). All processes related to protein production (initiation, elongation, etc.) are lumped into k s , while k d represents all processes leading to a reduction in protein levels (dilution due to cell division, active degradation, etc.). Similar models have been used previously to describe protein dynamics during the stress response in yeast [18], as well as embryonic development of Xenopus [19] and Drosophila [22]. We do not consider simpler, degenerate models (without k s and/or k d [19]), because these models are not relevant in our biological system: Synthesis and degradation always occur to some degree during stem cell differentiation. To reduce the influence of uninformative small fluctuations, we applied a smoothing spline to the abundance estimates prior to inferring model parameters by non-linear least-squares fitting. Compared to the naive model, R2 and correlation improved markedly for the total RNA model (Figs 1D, 1E and S2B), which is to be expected given the increase in model flexibility. To correct for a difference in the number of fit parameters and thus compare model performance fairly, we used the Bayesian information criterion (BIC, see Methods). According to the BIC, 3551 out of 4580 proteins were better fit by the kinetic model. These proteins are thus likely out of steady state, for the duration of the experiment, as a result of the differentiation cue. In summary, these results showed that a simple birth-death model outperforms the naive model of protein dynamics. Despite the overall improvement observed with the total RNA model, R2 and correlation were still low for many proteins. We hypothesized that the remaining discrepancies could be explained by kinetic rates changing over time. To evaluate this hypothesis, we first sought to exclude technical limitations of our measurements as possible alternative explanations. We first considered the subcellular localization of mRNA. In our first experiment, we measured total poly(A) RNA, whereas only cytoplasmic mRNA is available for translation. Nuclear retention of mRNA is known to reduce transcriptional noise [23] and has been shown to contribute to translational regulation for specific genes [24–26]. To measure the cytoplasmic mRNA fraction of each gene, we repeated the differentiation experiment in triplicate and separated cell lysates into a nuclear and a cytoplasmic fraction before performing RNA-seq. To obtain a global scaling factor between cytoplasmic and nuclear expression, we regressed total RNA (totRNA) reads, measured previously, on nuclear RNA (nuRNA) reads and cytoplasmic RNA (cyRNA) reads across all genes (see Methods). Then, the cytoplasmic fraction C was calculated for each gene and each time point. To our surprise, C did not vary substantially between genes (mean = 0.82, std = 0.02, calculated for a subset of 3,563 genes without any missing values) (S1G Fig). In addition, C also did not fluctuate much in time for individual genes (S1H Fig). Despite the low variability of C, we incorporated this parameter, leading to the cytoplasmic RNA model (Eq 2). (2) As to be expected, adding C brought overall only a subtle improvement (Fig 1G), although for individual cases, the improvement was significant (Fig 1F). We opted to fit further models including the cytoplasmic fraction, due to the overall slightly better performance. Another potential confounder is inherent to the proteomics method we employed. TMT-based proteomics suffers from co-isolation interference (ci), a process in which two peptides are co-isolated in the second MS step. The contaminating peptide can interfere with the quantification of the peptide of interest and cause a constant offset [27,28]. To model a possible offset, we extended the model by an additional parameter (ci), quantifying the degree of fold-change compression for a protein. Thus, we assume ci to be constant for all TMT tags, i.e. time points, resulting in the ci model (Eq 3). (3) Effectively, including the parameter ci allows protein expression to have a bigger dynamic range, which can improve the fit for certain genes significantly (Figs 1H, 1I, S2D and S2E). Judged by the BIC, 557 genes were fit better including ci. All in all, this result reinforces the importance of considering co-isolation interference.
Including miRs improves model performance and identifies miR-mRNA interactions Having incorporated important confounding factors, we sought to further extend the model by relieving the assumption of constant kinetic rates. Such a model would be able to capture protein synthesis and degradation varying across time, which likely happens during a process as dynamic as stem cell differentiation. A model in which k s and k d are time-dependent and completely arbitrary cannot be sufficiently constrained by our mRNA and protein measurements. Hence, we decided to focus on a specific regulatory mechanism, for which an additional, complementary data set could be obtained. Specifically, we explored the influence of miRs, which are known to be involved in gene regulation during differentiation [29]. In order to study the role of miRs in our system, we repeated the RA differentiation assay and measured the miRnome by small RNA-seq in quadruplicate. We quantified around 1000 mature miRs per time point (S1A, S1C and S1E Fig). For further analysis, we retained miRs with high reproducibility across replicates and high variance across time (S1I Fig). To identify possible effects of miRs, we focused on computationally predicted miR target genes. We further reduced the number of miR-mRNA interactions by filtering based on the “context score” provided by TargetScanMouse [30] (S1D Fig). This score combines information from mutiple sequence features and can be used to rank hits. In the end we retained 4527 genes, 560 unique mature miRs and 45,882 potential interactions between them (S1D and S1E Fig). If multiple miRs with similar temporal profiles targeted the same gene, we considered them to be indistinguishable. Therefore, we grouped all miRs into six clusters by their temporal expression profiles (Fig 2A) and averaged over miRs from the same cluster targeting the same mRNA (Fig 2B). As miR binding is known to trigger translational repression [31], we modified the term describing protein synthesis in our model. In the interest of parsimony, we assumed a linear dependence of protein synthesis on miR abundance for this miR model (Eq 4). (4) Here, Mg m is the geometric mean abundance (scaled to its maximum across time) of miRs in cluster m targeting gene g and αg m parametrizes the interaction between those miRs and the target gene g. For each gene, we fit a model with one of the six miR clusters at a time and identified the improvement in model performance. Including miRs greatly improved the fits for some proteins, especially when there was a transient discordance between RNA and protein expression (Fig 2B). Typically, the "effective" mRNA abundance (cytoplasmic mRNA corrected for miR effects) was more dynamic than nominal mRNA abundance. For many of the genes that benefited from the addition of miRs, their influence is typically large: For these genes, translation was reduced up to 50% by the miR term in the model at peak miR expression (Fig 2C). Overall, the addition of a miR-dependent term significantly improved the coefficient of determination for a quarter of the proteins as determined by the BIC (Fig 2D). At this point we would like to stress that we made particular assumptions about the influence of miRs (miR binding only affects k s and the effect is linear in the abundance etc.). A more general model would have been under-constrained by the available data. Other conceivable interaction terms might result in improvements for different sets of genes. A better performance of the miR model, compared to simpler models, therefore does not prove the assumed regulatory mechanism. PPT PowerPoint slide
PNG larger image
TIFF original image Download: Fig 2. The addition of miRs further improves the dynamical model for a subset of genes and suggests potential miR-mRNA interactions. (A) Expression profiles of 560 miRs in six clusters. (B) Example fit of miR model for the gene Rab8a, replicate 1. First panel: expression of the assigned miRs of a single cluster. Colored lines are individual smoothing spline fits. Second panel: Cytoplasmic RNA expression and the effective RNA concentration available for translation (see Methods). Solid lines represent smoothing splines. Third/fourth panel: cytoplasmic RNA and miR model fits. (C) Distribution of inferred α for genes that benefit from miR model. (D) R2 distribution of the miR model and the next best model (either naive, total RNA, cytoplasmic RNA or ci). Only genes that benefit from the miR model are shown. Some genes with extremely low R2 values are set to the minimum value of the plot for clarity. The corresponding Pearson’s r distribution is shown in S2F Fig.
https://doi.org/10.1371/journal.pgen.1010744.g002
[END]
---
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010744
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/