(C) PLOS One

This story was originally published by PLOS One and is unaltered.

Curated single cell multimodal landmark datasets for R/Bioconductor [1]

Kelly B. Eckenrode (Graduate School of Public Health and Health Policy, City University of New York, NY, United States of America; Institute for Implementation Science in Public Health); Dario Righelli (Department of Statistical Sciences, University of Padova)

Date: 2023-09

Summary of landmark datasets in SingleCellMultiModal

To evaluate and design new statistical methods that accompany experimental single-cell multimodal data, it is important to establish landmark datasets. The goal of this section is to provide an overview of the landmark datasets currently in SingleCellMultiModal as well as to introduce the experimental and technological context for each experimental assay (Table 1). For more information concerning the details of the technologies, consult [22]. We briefly describe each landmark experiment including context, major findings from the publication, and challenges in its analysis, then summarize its accompanying dataset in SingleCellMultiModal including number of cells and features (Fig 1B).

Table 1. Single-cell multimodal datasets included in the SingleCellMultiModal package. Modalities refer to the molecular features measured in the experimental assay. Cell/process type describes the type of material or developmental event from which the data were collected. The datatype name column refers to the dataset name in SingleCellMultiModal. https://doi.org/10.1371/journal.pcbi.1011324.t001

RNA and protein: mass spectrometry-based.

Purpose and goals: CITE-Seq offers valuable information about the expression of surface proteins. However, acquisition is limited to tens of targets because identification relies on antibodies, and the method cannot provide information on intracellular markers. Mass spectrometry (MS)-based single-cell proteomics (SCP) provides a means to overcome these limitations and to perform unbiased single-cell profiling of the soluble proteome. MS-SCP is emerging thanks to recent advances in sample preparation, liquid chromatography (LC) and MS acquisition. The technology is in its infancy, and protocols still need to be adapted in order to acquire multiple modalities from a single cell. In this section, multimodality is achieved by subjecting similar samples to MS-SCP and single-cell RNA-seq.

Technology: The current state-of-the-art protocol for performing MS-SCP is the SCoPE2 protocol [9]. Briefly, single cells are lysed, and proteins are extracted and digested into peptides. The peptides are then labeled using tandem mass tags (TMT) in order to multiplex up to 16 samples per run (Fig 3A). The pooled peptides are then analysed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). LC separates the peptides based on their mass and affinity for the chromatographic column. The peptides are ionized immediately as they elute (Fig 3B) and are sent through two rounds of MS (MS/MS, Fig 3C). The first round isolates the ions based on their mass-to-charge (m/z) value. The isolated ions are fragmented and sent to the second round of MS, which records the m/z and intensity of each fragment. The pattern of intensities over m/z values generated by an ion is called an MS2 spectrum. The MS2 spectra are then computationally matched to a database to identify the peptide sequence from which they originated. Spectra that are successfully matched to a peptide sequence are called peptide-to-spectrum matches (PSMs, Fig 3D). In addition, a specific range of the MS spectrum holds the TMT label information, where each label generates a fragment with an expected m/z value. The intensity of each label peak is proportional to the peptide expression in the corresponding single cell, which allows for peptide quantification (Fig 3D). Finally, the quantified PSM data go through a data processing pipeline that aims to reconstruct the protein data that can be used for downstream analyses (Fig 3E).
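The label-based quantification step described above can be illustrated with a small Python sketch. It is only a toy model: the channel names, the reporter m/z values, the tolerance and the spectrum are all made up for illustration and are not the actual TMT reporter masses or SCoPE2 settings.

```python
from typing import Dict, List, Tuple

# Hypothetical reporter-ion m/z values, one per multiplexed sample.
# These numbers are illustrative only, not real TMT reporter masses.
REPORTER_MZ: Dict[str, float] = {
    "carrier": 126.0,
    "cell_1": 127.0,
    "cell_2": 128.0,
}


def quantify_reporters(
    spectrum: List[Tuple[float, float]],
    reporter_mz: Dict[str, float],
    tol: float = 0.01,
) -> Dict[str, float]:
    """Sum the intensity of peaks lying within `tol` of each expected reporter m/z."""
    result = {}
    for channel, expected in reporter_mz.items():
        result[channel] = sum(
            intensity for mz, intensity in spectrum if abs(mz - expected) <= tol
        )
    return result


# One toy MS2 spectrum as (m/z, intensity) pairs.
# The last peak stands for a peptide fragment outside the reporter range.
toy_spectrum = [(126.001, 9000.0), (127.002, 40.0), (128.0, 65.0), (445.3, 1200.0)]
intensities = quantify_reporters(toy_spectrum, REPORTER_MZ)
```

In this toy spectrum the carrier channel dominates, which mirrors the idea that one label can carry far more peptide material than the single-cell channels. The fragment peak at higher m/z is ignored by the quantification and would instead be used for peptide identification.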

Fig 3. SCoPE2 workflow. The workflow consists of 4 main steps. (A) Sample preparation extracts and labels peptides from single cells. (B) LC separates the peptides based on their mass and affinity for the column. Note that the TMT tag does not influence those properties. Eluting peptides are ionised by electrospray. (C) MS/MS performs an m/z scan of the incoming ions to select the most abundant ones, which are then fragmented separately. A second round of MS acquires the spectrum generated by the ion fragments. (D) Each spectrum is then computationally processed to obtain the cell-specific expression values and the peptide identity. (E) The data processing pipeline reconstructs the protein data from the quantified PSMs. Abbreviations: TMT: tandem mass tags; LC: liquid chromatography; MS: mass spectrometry; MS/MS: tandem MS; m/z: mass over charge; PSM: peptide to spectrum match. https://doi.org/10.1371/journal.pcbi.1011324.g003

The major challenge in MS-SCP is to recover sufficient peptide material for accurate peptide identification and quantification. SCoPE2 addresses this issue by optimizing the sample preparation step to limit sample loss, by providing analytical tools to optimize the MS/MS settings, and most importantly by introducing a carrier sample into the pool of multiplexed samples. The carrier is a sample that contains hundreds of cells instead of a single cell; it boosts the peptide identification rate by increasing the amount of peptide material delivered to the MS instrument. In parallel to SCoPE2, other groups have developed label-free MS-SCP, where each LC-MS/MS run contains unlabelled peptides from a single cell [27]. Although it allows for more accurate quantification, it suffers from low throughput. The current methodological advances in MS-SCP have been reviewed extensively elsewhere [28].

Landmark data: The SCoPE2 dataset we provide in this work was retrieved from the supplementary information of the landmark paper [9]. This is a milestone dataset, as it is the first publication in which over a thousand cells are measured by MS-SCP. The research question is whether a homogeneous monocyte population (U-937 cell line) could differentiate upon PMA treatment into a heterogeneous macrophage population, namely whether M1 and M2 macrophage profiles could be retrieved in the absence of differentiation cytokines. Different replicates of monocyte and macrophage samples were prepared and analyzed using either MS-SCP or single-cell RNA-seq. The MS-SCP data were acquired in 177 batches with on average 9 single cells per batch. The single-cell RNA-seq data were acquired in 2 replicates with on average 10,000 single cells per acquisition using the 10x Genomics Chromium platform. Cell type annotations are only available for the MS-SCP data. Note also that the MS-SCP data provide expression information at the protein level, meaning that the peptide data have already been processed. The processing includes filtering for high-quality features, filtering for high-quality cells, log-transformation, normalization, aggregation from peptides to proteins, imputation and batch correction (Fig 3E). More details on the protein data processing can be found in the original paper or in the paper that reproduced that analysis [29]. Count tables were provided for the single-cell RNA-seq dataset with no additional processing. The data can be accessed in the SingleCellMultiModal package by calling SCoPE2("macrophage_differentiation") (Table 4). Relevant cell metadata is provided within the MultiAssayExperiment object. The MS-SCP dataset contains expression values for 3,042 proteins in 1,490 cells. The single-cell RNA-seq dataset contains expression values for 32,738 genes (of which 10,149 are zero) in 20,274 cells.
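To give a feel for two of the processing steps listed above (log-transformation and aggregation from peptides to proteins), here is a minimal Python sketch. The source does not state which aggregation function was used, so the per-cell median is an assumption made for illustration, and the peptide names, protein names and intensities are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict, List


def aggregate_to_proteins(
    peptide_intensities: Dict[str, List[float]],
    peptide_to_protein: Dict[str, str],
) -> Dict[str, List[float]]:
    """log2-transform peptide intensities, then take the per-cell median per protein.

    Each list holds one positive intensity per cell, in the same cell order.
    The median is an illustrative choice, not necessarily the published method.
    """
    grouped = defaultdict(list)
    for peptide, values in peptide_intensities.items():
        protein = peptide_to_protein[peptide]
        grouped[protein].append([math.log2(v) for v in values])
    # zip(*rows) iterates over cells, collecting that cell's value for every peptide.
    return {
        protein: [median(column) for column in zip(*rows)]
        for protein, rows in grouped.items()
    }


# Toy input: three peptides measured in two cells (hypothetical names and values).
toy_peptides = {
    "PEPTIDE_A": [2.0, 8.0],
    "PEPTIDE_B": [8.0, 8.0],
    "PEPTIDE_C": [4.0, 16.0],
}
toy_mapping = {"PEPTIDE_A": "PROT_1", "PEPTIDE_B": "PROT_1", "PEPTIDE_C": "PROT_2"}
protein_table = aggregate_to_proteins(toy_peptides, toy_mapping)
```

The remaining steps (feature and cell filtering, normalization, imputation and batch correction) are left out of this sketch because they depend on choices described in the original analysis.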

Table 4. SCoPE2 dataset descriptions, with assay types, molecular modes, specimens, dataset version provided, number of features and number of cells. https://doi.org/10.1371/journal.pcbi.1011324.t004

Single-cell nucleosome, methylation and transcription sequencing (scNMT-seq).

Purpose and goals: The profiling of the epigenome at single-cell resolution has received increasing interest, as it provides valuable insights into the regulatory landscape of the genome [30,31]. Although the term epigenome comprises multiple molecular layers, the profiling of chromatin accessibility and DNA methylation has received the most attention to date.

Technology: DNA methylation is generally measured using single-cell bisulfite sequencing (scBS-seq) [32]. The underlying principle of scBS-seq is the treatment of the DNA with sodium bisulfite before DNA sequencing, which converts unmethylated cytosine (C) residues to uracil (and after retro-PCR amplification, to thymine (T)), leaving 5-methylcytosine residues intact. The resulting C→T transitions can then be detected by DNA sequencing. Further methodological innovations enabled DNA methylation and RNA expression to be profiled from the same cell, as demonstrated by the scM&T-seq assay [33]. Chromatin accessibility was traditionally profiled in bulk samples using DNase sequencing (DNase-seq) [34]. However, in recent years, the assay for transposase-accessible chromatin followed by sequencing (ATAC-seq) has displaced DNase-seq as the de facto method for profiling chromatin accessibility due to its fast and sensitive protocol, most notably in single-cell genomics [35]. Briefly, in ATAC-seq, cells are incubated with a hyperactive mutant Tn5 transposase, an enzyme that inserts artificial sequencing adapters into nucleosome-free regions. Subsequently, the adapters are purified, PCR-amplified and sequenced. Notably, single-cell ATAC-seq has also been combined with single-cell RNA-seq to simultaneously survey RNA expression and chromatin accessibility from the same cell, as demonstrated by SNARE-seq [36], SHARE-seq [37] and the recently commercialized Multiome Kit from 10x Genomics [6].

Finally, some assays have been devised to capture at least three molecular layers from the same cell, albeit at a lower throughput than SNARE-seq or SHARE-seq. An example is scNMT-seq (single-cell nucleosome methylation and transcriptome sequencing) [5]. scNMT-seq captures a snapshot of RNA expression, DNA methylation and chromatin accessibility in single cells by combining two previous multi-modal protocols: scM&T-seq [33] and Nucleosome Occupancy and Methylation sequencing (NOMe-seq) [38]. In the first step (the NOMe-seq step), cells are sorted into individual wells and incubated with a GpC methyltransferase. This enzyme labels accessible (or nucleosome-depleted) GpC sites via DNA methylation. In mammalian genomes, cytosine residues in GpC dinucleotides are methylated at a very low rate. Hence, after the GpC methyltransferase treatment, GpC methylation marks can be interpreted as direct readouts of chromatin accessibility, as opposed to the CpG methylation readouts, which can be interpreted as endogenous DNA methylation. In a second step (the scM&T-seq step), the DNA molecules are separated from the mRNA using oligo-dT probes pre-annealed to magnetic beads. Subsequently, the DNA fraction undergoes scBS-seq, whereas the RNA fraction undergoes single-cell RNA-seq.

Landmark data: The scNMT landmark paper reported simultaneous measurements of chromatin accessibility, DNA methylation, and RNA expression at single-cell resolution during early embryonic development, spanning exit from pluripotency to primary germ layer specification [23]. This dataset represents the first multi-omics roadmap of mouse gastrulation at single-cell resolution. Using multi-omic integration methods, the authors detected genomic associations between distal regulatory regions and transcription activity, revealing novel insights into the role of the epigenome in regulating this key developmental process. One of the challenges of this dataset is the complex missing value structure. Whereas RNA expression is profiled for most cells (N = 2480), DNA methylation and chromatin accessibility are only profiled for subsets of cells (N = 986 and N = 1105, respectively). This poses important challenges to some of the conventional statistical methods that do not handle missing information. The output of the epigenetic layers from scNMT-seq is a binary methylation state for each observed CpG (endogenous DNA methylation) and GpC (a proxy for chromatin accessibility). However, instead of working at the single-nucleotide level, epigenetic measurements are typically quantified over genomic features (e.g. promoters and enhancers). This is done assuming a binomial model for each cell and feature, where the number of successes is the number of methylated CpGs (or GpCs) and the number of trials is the total number of CpGs (or GpCs) that are observed. Here we provide DNA methylation and chromatin accessibility estimates quantified over CpG islands, gene promoters, gene bodies and DNase hypersensitive sites (defined in Embryonic Stem Cells). The pre-integrated scNMT dataset is accessed from the SingleCellMultiModal package by calling e.g. scNMT("mouse_gastrulation", version = "1.0.0") (Table 5). Relevant cell metadata is provided within the MultiAssayExperiment object. The overall dataset is 277 MB.
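The binomial quantification described above can be sketched in a few lines of Python. Under a binomial model the maximum-likelihood estimate of the methylation rate is simply successes divided by trials. The cell names, feature names and calls below are hypothetical, and a missing value (None) is returned when a cell has no observed sites in a feature, which is how the missing value structure mentioned above arises.

```python
from typing import Dict, List, Optional, Tuple


def feature_rate(calls: List[int]) -> Optional[float]:
    """Binomial rate estimate: methylated sites / observed sites.

    `calls` holds one binary state (1 = methylated, 0 = unmethylated) per
    observed CpG (or GpC) in the feature. Returns None if nothing was observed.
    """
    if not calls:
        return None
    return sum(calls) / len(calls)


def quantify(
    calls_by_cell_feature: Dict[Tuple[str, str], List[int]],
    cells: List[str],
    features: List[str],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Build a cell-by-feature table of rates, with None for unobserved entries."""
    return {
        cell: {
            feature: feature_rate(calls_by_cell_feature.get((cell, feature), []))
            for feature in features
        }
        for cell in cells
    }


# Toy input: cell_2 has no coverage over promoter_X (hypothetical names and calls).
toy_calls = {
    ("cell_1", "promoter_X"): [1, 1, 0, 1],
    ("cell_1", "gene_body_Y"): [0, 0],
    ("cell_2", "gene_body_Y"): [1, 0],
}
rates = quantify(toy_calls, ["cell_1", "cell_2"], ["promoter_X", "gene_body_Y"])
```

The same function applies to CpG calls (endogenous methylation) and GpC calls (accessibility); only the set of sites fed into it differs.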

Table 5. scNMT-seq dataset description, with assay types, molecular modes, number of specimens, number of features and number of cells. https://doi.org/10.1371/journal.pcbi.1011324.t005

Chromium Single-cell Multiome ATAC and gene expression.

Purpose and goals: A new commercial platform introduced in late 2020 by 10x Genomics, the Chromium Single Cell Multiome ATAC and gene expression (10x Multiome), provides simultaneous gene expression and open chromatin measurements from the same cell at high throughput. This technology is well suited to identifying gene regulatory networks by linking open chromatin regions with changes in gene expression, a task that is harder to perform when the two modalities are derived from separate groups of cells. However, very few datasets have been published to date using the 10x Multiome technology, so how much information can be obtained by simultaneously profiling both modalities in the same cell remains an open question.

Technology: First, cells are purified, single nuclei are isolated, and their chromatin is transposed. Next, ATAC and mRNA sequencing libraries are prepared with the 10x Genomics Chromium microfluidic controller, in which each nucleus is partitioned into a droplet together with a gel bead carrying a 16 nt 10x DNA barcode that allows ATAC and mRNA signals to be paired to the same nucleus. mRNA is tagged with a 12 nt Unique Molecular Identifier (UMI) sequence and a poly(dT)VN sequence for poly-adenylated 3’ ends. ATAC fragments are tagged with an Illumina primer sequence and an 8 nt spacer sequence. All barcoded products are amplified in two rounds of PCR and then processed for sequencing. According to the Chromium Single-Cell Multiome ATAC and gene expression assay product information, it has a flexible throughput of 500–10,000 nuclei per channel and up to 80,000 per run, with a 65% recovery rate and a low multiplet rate of <1% per 1000 cells (10Xgenomics.com).

Landmark data: 10x Genomics has released a dataset of ~10k peripheral blood mononuclear cells (PBMCs) from a healthy human donor. Here we provide the RNA expression matrix and the binary matrix of ATAC fragments for each cell, quantified over a set of pre-computed peaks (Table 6). To access the data in the SingleCellMultiModal package, call scMultiome("pbmc_10x"). Relevant cell metadata is provided within the MultiAssayExperiment object. The overall dataset is 1.1 GB.
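As an illustration of what a binary matrix of ATAC fragments quantified over pre-computed peaks means, the following Python sketch marks a peak as 1 for a cell when at least one of that cell's fragments overlaps it. The coordinates, cell names and the half-open interval convention are assumptions chosen for the example; this is not the 10x Genomics pipeline.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a_start < b_end and b_start < a_end


def binary_peak_matrix(
    fragments_by_cell: Dict[str, List[Interval]],
    peaks: List[Interval],
) -> Dict[str, List[int]]:
    """Return, per cell, a 0/1 vector with one entry per peak (in peak order)."""
    matrix = {}
    for cell, fragments in fragments_by_cell.items():
        row = []
        for p_chrom, p_start, p_end in peaks:
            hit = any(
                f_chrom == p_chrom and overlaps(f_start, f_end, p_start, p_end)
                for f_chrom, f_start, f_end in fragments
            )
            row.append(1 if hit else 0)
        matrix[cell] = row
    return matrix


# Toy peaks and fragments (hypothetical coordinates).
toy_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
toy_fragments = {
    "cell_A": [("chr1", 150, 250), ("chr2", 300, 400)],
    "cell_B": [("chr1", 550, 560), ("chr2", 199, 260)],
}
accessibility = binary_peak_matrix(toy_fragments, toy_peaks)
```

The nested loop is quadratic and only suitable for a toy example; real data would need an interval index or sorted sweep.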

Table 6. 10x Multiome dataset descriptions, with assay types, molecular modes, number of features and number of cells. https://doi.org/10.1371/journal.pcbi.1011324.t006

RNA and spatial sequencing assays.

Purpose and goals: The power of microscopy to resolve spatial information has been paired with single-cell sequencing to measure transcriptomic activity. These microscopy-based sequencing technologies capture a cell population’s heterogeneous gene expression, which is typically lost in bulk assays. Technologies like seqFISH(+) (sequential Fluorescence In Situ Hybridization), fluorescence in situ hybridization sequencing [7], Multiplexed error-robust fluorescence in situ hybridization (MERFISH) [39], and Slide-seq [40,41] combine sequential barcoding with in situ molecular fluorescence probing, allowing the identification of tens to thousands of mRNA transcripts while preserving spatial coordinates at micrometer resolution. We refer to this family of technologies as molecular-based spatial transcriptomics. Another family of spatial omics technologies can be described as spot-based; it includes the 10x Visium Spatial Gene Expression and Slide-seq [40]. In this family, the spatial coordinates are typically associated with barcoded spot-like identities, where the transcripts are amplified and sequenced. Currently, our package does not include any spot-based spatial transcriptomics dataset. The TENxVisiumData package [42] (available at https://github.com/HelenaLC/TENxVisiumData) contains several such datasets. See [43] for a comprehensive review of spatial transcriptomics technologies.

Technology: The seqFISH technology makes use of temporal barcodes that are read over multiple rounds of hybridization in which mRNAs are labeled with fluorescent probes. During each hybridization round, the fluorescent probes are hybridized to the transcripts and imaged by microscopy. The probes are then stripped so that they can be re-used and coupled with different fluorophores in further rounds. In this case, the transcript abundance is given by the number of colocalizing spots for each transcript. The main differences between the technologies lie in how RNAs are barcoded. In seqFISH, barcodes are detected as a color sequence, while in MERFISH they are identified as binary strings, which allows error handling but requires longer transcripts and more rounds of hybridization [44].

Landmark data: The provided seqFISH dataset was generated from mouse visual cortex tissue and can be retrieved in two different versions. Both versions include single-cell RNA-seq and seqFISH data. The single-cell RNA-seq data in version 1.0.0 come from the original paper [24] and comprise 24057 genes in 1809 cells, while version 2.0.0 is a pre-processed adaptation of version 1.0.0 [22], restricted by the authors to the 113 genes shared with the seqFISH data in 1723 cells. The seqFISH data are the same in both versions, come from the original paper [45,46], and comprise 1597 cells and 113 genes. The dataset is accessible via the SingleCellMultiModal Bioconductor package by using the seqFISH(DataType="mouse_visual_cortex", version = "1.0.0") function call, which returns a MultiAssayExperiment object with a SpatialExperiment object for the seqFISH data and a SingleCellExperiment object for the single-cell RNA-seq data (Table 7).
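To illustrate why binary barcodes allow error handling, the Python sketch below decodes an observed on/off pattern across hybridization rounds by finding the nearest codeword in Hamming distance. The codebook, the gene names and the one-error threshold are hypothetical and chosen only so that the codewords differ from each other in four positions; this is a generic nearest-codeword sketch, not the published MERFISH decoding procedure.

```python
from typing import Dict, Optional

# Hypothetical codebook: one binary string per gene, one bit per imaging round.
# Every pair of codewords differs in 4 positions, so a single flipped bit
# still points to a unique gene.
CODEBOOK: Dict[str, str] = {
    "gene_A": "110100",
    "gene_B": "011010",
    "gene_C": "101001",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: str, codebook: Dict[str, str], max_errors: int = 1) -> Optional[str]:
    """Return the gene whose codeword is closest to `observed`.

    Returns None if the closest codeword is more than `max_errors` bits away
    or if several codewords are equally close (ambiguous spot).
    """
    distances = {gene: hamming(observed, code) for gene, code in codebook.items()}
    best = min(distances.values())
    winners = [gene for gene, d in distances.items() if d == best]
    if best > max_errors or len(winners) != 1:
        return None
    return winners[0]


exact = decode("110100", CODEBOOK)      # matches gene_A exactly
corrected = decode("110110", CODEBOOK)  # one bit flipped relative to gene_A
rejected = decode("000000", CODEBOOK)   # equally far from every codeword
```

Spacing the codewords far apart costs extra bits, which is one way to see why this style of barcoding needs more rounds of hybridization.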

Table 7. seqFISH dataset descriptions, with assay types, molecular modes, specimens, dataset version provided, number of features and number of cells. https://doi.org/10.1371/journal.pcbi.1011324.t007

RNA and DNA sequencing assays.

Purpose and goals: Parallel genome and transcriptome sequencing (G&T-seq) of single cells [8] opens new avenues for measuring transcriptional responses to genetic and genomic variation resulting from different allele frequencies, genetic mosaicism [47], single nucleotide variants (SNVs), DNA copy-number variants (CNVs), and structural variants (SVs). Although current experimental protocols are low-throughput with respect to the number of cells, simultaneous DNA and RNA sequencing of single cells resolves the problem of how to associate cells across each modality from independently sampled single-cell measurements [48].

Technology: Following cell isolation and lysis, G&T-seq measures DNA and RNA levels of the same cell by physically separating polyadenylated RNA from genomic DNA using a biotinylated oligo-dT primer [49]. This is followed by separate whole-genome and whole-transcriptome amplification. Whole-genome amplification is carried out via multiple displacement amplification (MDA) or displacement pre-amplification and PCR (DA-PCR) for DNA sequencing, providing targeted sequencing reads or genome-wide copy number estimates. Parallel Smart-seq2 whole-transcriptome amplification is used for Illumina or PacBio cDNA sequencing, providing gene expression levels based on standard computational RNA-seq quantification pipelines. While pioneering technologies such as G&T-seq [8] and DR-seq [50] sequence both the DNA and RNA from single cells, they currently measure only a few cells (50–200 cells [51]) compared to assays that sequence DNA or RNA alone (1,000–10,000 cells [51]), such as Direct Library Preparation [52] or 10x Genomics single-cell RNA-seq [53].

Landmark data: G&T-seq has been applied by Macaulay et al. [8] for parallel analysis of genomes and transcriptomes of (i) 130 individual cells from breast cancer line HCC38 and B lymphoblastoid line HCC38-BL, and (ii) 112 single cells from a mouse embryo at the eight-cell stage. The mouse embryo dataset is publicly available and included in the SingleCellMultiModal package. It assays blastomeres of seven eight-cell cleavage-stage mouse embryos, five of which were treated with reversine at the four-cell stage of in vitro culture to induce chromosome mis-segregation. The dataset is stored as a MultiAssayExperiment [11] consisting of (i) a SingleCellExperiment [10] storing the single-cell RNA-seq read counts, and (ii) a RaggedExperiment [54] storing integer copy numbers as previously described [55] (Table 8). Although it assays only a relatively small number of cells, the dataset can serve as a prototype for benchmarking single-cell eQTL integration of DNA copy number and gene expression levels, given that Macaulay et al. [8] reported copy gains or losses with concomitant increases and decreases in gene expression levels.
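The pairing of copy number with expression described above can be sketched in plain Python. Copy numbers are stored as genomic segments whose boundaries may differ from cell to cell (hence a ragged layout), so a first step is to look up the segment covering each gene in each cell. All cell names, gene names, coordinates and values below are hypothetical, and the code does not use or reproduce the RaggedExperiment API.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[str, int, int, int]  # (chromosome, start, end, integer copy number)
GeneLocus = Tuple[str, int, int]     # (chromosome, start, end)


def copy_number_for_gene(
    segments: List[Segment], chrom: str, start: int, end: int
) -> Optional[int]:
    """Copy number of the first segment overlapping the gene, or None if uncovered."""
    for seg_chrom, seg_start, seg_end, copies in segments:
        if seg_chrom == chrom and seg_start < end and start < seg_end:
            return copies
    return None


def pair_copy_number_with_expression(
    segments_by_cell: Dict[str, List[Segment]],
    counts_by_cell: Dict[str, Dict[str, int]],
    genes: Dict[str, GeneLocus],
) -> List[Tuple[str, str, int, int]]:
    """List (cell, gene, copy number, read count) for every covered gene."""
    pairs = []
    for cell, segments in segments_by_cell.items():
        for gene, (chrom, start, end) in genes.items():
            copies = copy_number_for_gene(segments, chrom, start, end)
            if copies is not None:
                pairs.append((cell, gene, copies, counts_by_cell[cell][gene]))
    return pairs


# Toy data: cell_3 has no segment covering gene_X, so it is skipped.
toy_genes = {"gene_X": ("chr1", 1000, 2000)}
toy_segments = {
    "cell_1": [("chr1", 0, 5000, 3)],
    "cell_2": [("chr1", 0, 5000, 1)],
    "cell_3": [("chr2", 0, 5000, 2)],
}
toy_counts = {
    "cell_1": {"gene_X": 30},
    "cell_2": {"gene_X": 9},
    "cell_3": {"gene_X": 20},
}
pairs = pair_copy_number_with_expression(toy_segments, toy_counts, toy_genes)
```

A table of this shape is the kind of input one would need to ask whether copy gains or losses coincide with higher or lower expression.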

Table 8. G&T-seq dataset description, with assay types, molecular modes, number of specimens, number of features and number of cells. https://doi.org/10.1371/journal.pcbi.1011324.t008

---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011324

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
