(C) PLOS One
This story was originally published by PLOS One and is unaltered.



The specious art of single-cell genomics [1]

Tara Chari (Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America), Lior Pachter (Department of Computing and Mathematical Sciences)

Date: 2023-09

Dimensionality reduction is standard practice for filtering noise and identifying relevant features in large-scale data analyses. In biology, single-cell genomics studies typically begin with reduction to 2 or 3 dimensions to produce “all-in-one” visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative exploratory analysis. However, there is little theoretical support for this practice, and we show that extreme dimension reduction, from hundreds or thousands of dimensions to 2, inevitably induces significant distortion of high-dimensional datasets. We therefore examine the practical implications of low-dimensional embedding of single-cell data and find that extensive distortions and inconsistent practices make such embeddings counter-productive for exploratory, biological analyses. In lieu of this, we discuss alternative approaches for conducting targeted embedding and feature exploration to enable hypothesis-driven biological discovery.

Funding: L.P. received the National Institutes of Health (nih.gov) award U19MH114830, administered by the National Institute of Mental Health (nimh.nih.gov). T.C. and L.P. were partially funded by this award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: Download links for the original data used to generate the figures and results in the paper are listed in Table A in S1 Text. Processed and normalized versions of the count matrices are available on CaltechData, with links provided in Table B in S1 Text. All analysis code used to generate the figures and results in the paper is available at https://github.com/pachterlab/CP_2023 and deposited at Zenodo (DOI https://doi.org/10.5281/zenodo.8087950). Code is provided in Colab notebooks which can be run for free on the Google cloud.

Copyright: © 2023 Chari, Pachter. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Here, we assess dimensionality reduction for single-cell gene expression, first investigating the preservation of the necessary properties comprising the columns of Fig 1, then assessing the impact of these embeddings across the applications comprising the rows of Fig 1.

Yet, despite the goals of these methods [2, 3, 6] to preserve local and/or global structure, there is little theory or empirical analysis to support these claims. For example, while the popular t-SNE and UMAP methods claim faithful representation of local and/or global structure in low dimensions [1, 2, 5], there is evidence they fail in this regard [1, 35], and theorems providing guarantees on the embeddings rely on numerous assumptions unlikely to hold in practice and ignore the preprocessing by PCA prior to nonlinear reduction [36].

Inherent in these applications are assumptions of preservation of local and global cell properties, as well as distances, delineated in Fig 1. For each application, we demarcate which of these are the “necessary” or key geometric properties that each task inherently assumes to be represented (and preserved). Based on previous works [6, 13, 27, 28] and the objective functions of UMAP and t-SNE [4, 5], “local” is defined as nearest neighbor relationships, “global” as neighbor relationships and properties of groups of cells (e.g., cell types), and “distance” as Euclidean distance (L2 norm) or Manhattan distance (L1 norm) between points. Note that preservation of distance implies preservation of local and global properties. We utilize the L2 norm as it is the default metric of UMAP/t-SNE. We also present results with the L1 norm (see S1 Text), as L1 is more suitable for measuring distance in high dimensions, particularly in comparison to other Lk norms [29, 30], and is commonly applied to transcriptomic data [31–33], with comparable performance to the probabilistic Jensen–Shannon divergence in single-cell distance calculations [34].

Embeddings are used to justify or measure changes in cell populations between different conditions, by comparing contour locations and sizes in the density diagrams, as well as changes in intensity or spread of gene expression [16–20].

Visual applications range from assessing the existence of and relationships between predefined clusters, to inferring properties of the clusters (e.g., spread or heterogeneity) [1, 2, 13], and to generating the clusters themselves from the 2D space (e.g., to define cell types or to detect doublets) [3, 14, 15].

Embeddings are used to visually assess the extent of integration, mixing, or similarities between cells from different batches [7–9] and to compare methods of integration/batch-correction [10]. For query dataset(s) mapped onto reference datasets/embeddings, visuals likewise provide an assessment of merged data similarities or differences [11, 12].

The high dimensionality of “big data” genomics datasets has led to the ubiquitous application of dimensionality reduction to filter noise, enable tractable computation, and facilitate exploratory data analysis (EDA). Ostensibly, the goal of this reduction is to preserve and extract local and/or global structures from the data for biological inference [1–3]. Trial-and-error application of common techniques has resulted in a currently popular workflow combining initial dimensionality reduction to a few dozen dimensions, often using principal component analysis (PCA), with further nonlinear reduction to 2 dimensions using t-SNE [4] or UMAP [1, 2, 5, 6]. For single-cell genomics in particular, these embeddings are used extensively in qualitative and quantitative EDA tasks that fall into 4 main categories of applications (Fig 1, “Application”):

In practice, measuring these “max/min ratios” in 2D embeddings, for the ex and in utero data (E10.5) as well as the 10× VMH neurons, revealed 4- to 200-fold increases in these ratios whether compared to the relevant PCA space or ambient space (with or without PCA-preprocessing). This was the case in groups of equidistant cells as well as groups of nearest neighbors (Figs F and G in S1 Text) and can result in trends such as those displayed in Fig 2C, with cells shot out across the embedding. For both datasets, we empirically verified the growth of this distortion with the number of cells considered in each equidistant group, i.e., as more cells are considered in 2D, the distortion grows (Fig H in S1 Text). Higher dimensional PCA spaces largely maintained similar max/min ratios to the ambient space (Figs G and H in S1 Text). However, we note that in low dimensions PCA embedding of equidistant points is tantamount to applying a random projection, similarly resulting in projected points displaying numerous mirages of structure or outliers (Fig I in S1 Text).
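As a minimal sketch of the “max/min ratio” (the largest pairwise distance in a group of cells divided by the smallest), the snippet below compares a toy equidistant group in 3 dimensions with a 2D version of the same points. The function name `max_min_ratio` and the toy coordinates are illustrative assumptions, and the 2D points come from simply dropping one coordinate, not from UMAP or t-SNE.

```python
import math
from itertools import combinations


def max_min_ratio(points):
    """Ratio of the largest to the smallest pairwise Euclidean distance."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return max(dists) / min(dists)


# Hypothetical equidistant group: the four vertices of a regular
# tetrahedron, so every pairwise distance equals sqrt(8).
group_3d = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

# Crude stand-in for an embedding: drop the third coordinate.
# The four points become the corners of a square.
group_2d = [(x, y) for x, y, _ in group_3d]

ratio_3d = max_min_ratio(group_3d)  # all distances equal
ratio_2d = max_min_ratio(group_2d)  # diagonal / side of the square
print(ratio_3d, ratio_2d)
```

Four mutually equidistant points cannot be placed in the plane, so some increase in the ratio is unavoidable here regardless of the embedding method; the sketch only shows the smallest such case.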

This is not surprising, given previous theoretical work on the limits of distance preservation in low dimensions, particularly for equidistant points [42–44]. The Johnson–Lindenstrauss lemma on the optimality of linear embedding [45–47] shows that preservation of pairwise distances with a margin of error of at most 20% for a modestly sized dataset of 10,000 cells would require at least 1,842 dimensions [48]. Distortion is inevitable: given n points embedded in 2 dimensions, the distortion of the ratio of their maximum distance, D, to minimum distance, d (“max/min ratio”), grows as Ω(√n) [49] (see Note in S1 Text).
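The figure of 1,842 dimensions can be reproduced with one commonly quoted form of the Johnson–Lindenstrauss bound, k ≥ 8 ln(n) / ε². That this particular form is the one behind the number above is an assumption on our part, although it does match the quoted value; the function name `jl_min_dim` is illustrative.

```python
import math


def jl_min_dim(n_points, eps):
    """Target dimension from the bound k >= 8 * ln(n) / eps**2.

    n_points: number of points (cells) whose pairwise distances
              should be preserved.
    eps:      allowed relative distortion (0.2 means 20%).
    """
    return 8.0 * math.log(n_points) / eps ** 2


k = jl_min_dim(10_000, 0.2)
print(f"{k:.1f}")  # about 1842 dimensions for 10,000 cells at 20% error
```

Because the bound depends on n only through ln(n), increasing the dataset to 100,000 cells raises it by a factor of 5/4, while halving the tolerated error quadruples it.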

To examine distance preservation, we extracted groups of cells with quantitatively distinct relationships in the ambient space of the Seurat-integrated [7] ex and in utero mouse embryo dataset (at the E10.5 stage) [8], specifically equidistant groups of cells, where the distances between cells were all either equally small (“near”) or large (“far”) (Fig 2C) (see Methods in S1 Text). This revealed upwards of 2.5 million such groups, with 3 to 8 cells in each (Figs Fa and Fe in S1 Text). However, once embedded into 2 dimensions, these quantitatively distinct groups of cells (orange dots on UMAPs, Fig 2C) displayed the same dispersion patterns, violating distance preservation, and rendering these distinct, transcriptomic relationships indistinguishable.

Turning to global relationships, we measured the preservation of the rankings of neighbors of cell “types” rather than individual cells. Cell “types” denote either author-provided cell type (Fig 2Bii) or cell condition annotations. Rankings were constructed from average pairwise distances between the cells of the different types, across replicate 2D embeddings (see Methods in S1 Text). For the same datasets as above, and a multiplexed dataset of human monocytes treated with 40 drugs [41], correlations of cell type neighbor rankings to those of the ambient space were low (≤ 0.4) in PCA-preprocessed 2D embeddings, and at least 33% lower than those of the higher dimensional PCA spaces, with warped or even reversed correlations in comparison to the ambient (Fig 2Bi) or relevant PCA space (Fig 2Bii, Fig Ca in S1 Text). These distortions were not specific to the distance measure used; we observed similar results when using the L1 norm to determine cell type neighbors (Fig Cb in S1 Text). This is consistent with observations made in other studies [6, 28]. In general, correlation decreased over each step in the reduction process, though there was not a clear trend related to other dataset properties (Figs Da and Ea in S1 Text). For analyses of recapitulation of cluster properties such as inferred heterogeneity or spread, see “Clustering validation and relationships” and “Embedding properties are arbitrary” below.
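The ranking comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the Methods: the use of Spearman correlation, the function names, and the toy coordinates are all our own choices, and the rank formula assumes no ties.

```python
import math
from itertools import product


def mean_type_distance(points, labels, a, b):
    """Average L2 distance over all pairs (cell of type a, cell of type b)."""
    xs = [p for p, lab in zip(points, labels) if lab == a]
    ys = [p for p, lab in zip(points, labels) if lab == b]
    total = sum(math.dist(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def neighbor_ranks(points, labels, cell_type, others):
    """Rank the other types by closeness to cell_type (1 = nearest)."""
    dists = {t: mean_type_distance(points, labels, cell_type, t) for t in others}
    ordered = sorted(others, key=lambda t: dists[t])
    return [ordered.index(t) + 1 for t in others]


def spearman(r1, r2):
    """Spearman correlation of two tie-free rank vectors."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical data: type A has two cells; B, C, D have one cell each.
labels = ["A", "A", "B", "C", "D"]
ambient = [(0, 0, 0), (0, 1, 0), (2, 0, 0), (5, 0, 0), (9, 0, 0)]
# A made-up 2D layout in which the order of B, C, D around A is flipped.
embedding = [(0, 0), (0, 1), (9, 0), (5, 0), (2, 0)]

others = ["B", "C", "D"]
ambient_ranks = neighbor_ranks(ambient, labels, "A", others)
embedding_ranks = neighbor_ranks(embedding, labels, "A", others)
rho = spearman(ambient_ranks, embedding_ranks)
print(ambient_ranks, embedding_ranks, rho)
```

In this toy layout the neighbor order of type A is fully reversed, which gives the most negative correlation possible; real embeddings would fall somewhere between this extreme and perfect agreement.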

Fig 2. (a) (i) Distribution of Jaccard distance of cell neighbors in PCA-preprocessed 2D embeddings and the relevant PCA space, as compared to ambient space. (ii) Distribution of Jaccard distance of cell neighbors in PCA-preprocessed 2D embeddings, as compared to the higher dimensional PCA space. (b) (i) Boxplot of correlations of cell type neighbor rankings to ambient space for the PCA-preprocessed 2D embeddings and the relevant PCA space. (ii) Boxplot of correlations of cell type neighbor rankings to the relevant higher dimensional PCA space for the PCA-preprocessed 2D embeddings. Embeddings generated n = 3 times. (c) Selection of equidistant groups with “near” or “far” distances in ambient space. UMAP embedding of the data in gray circles, with orange circles denoting all cells within the previously determined equidistant groups.

The 2D t-SNE/UMAP embeddings (e.g., “PCA-50D→UMAP” in Fig 2A) displayed large Jaccard distances with respect to the neighbors in ambient dimension, with an average consistently above 0.7 (70%). Generally, dissimilarity increased with the size of the dataset (Fig 2A, Figs A and Ba in S1 Text). When the number of neighbors (k) considered in the dissimilarity calculation was varied from 5 to 100, smaller dataset embeddings displayed slightly improved dissimilarity scores with larger k (Figs Bb and Bc in S1 Text). Interestingly, the embeddings of the more homogeneous mESCs dataset displayed relatively higher dissimilarity despite the small number of cells (Figs Bb and Bc in S1 Text). Poor neighborhood overlap was additionally retained, and often worsened, without PCA-preprocessing (i.e., direct reduction to 2D from ambient space). In some cases, the dissimilarity of neighbors was worse for 2D PCA (“PCA-2D”) as compared to t-SNE or UMAP reduction without PCA-preprocessing, consistent with other findings on the poor preservation of local neighborhoods by both PCA and the nonlinear reduction methods [1, 35] (Figs A and Bc in S1 Text). Similarly poor neighbor retention from the ambient space was observed in the higher dimensional PCA spaces as well (“PCA-50D” Fig 2Ai, Figs A and B in S1 Text) [35], particularly for larger datasets. Even between the PCA-preprocessed 2D embeddings and their corresponding PCA space, Jaccard distances were consistently above 0.8 on average, regardless of the dimension of the initial PCA reduction (Fig 2Aii, right panels Fig A in S1 Text).

Given the focus on preserving local nearest neighbors in the objectives of the UMAP and t-SNE methods, we first measured the recapitulation of nearest neighbors in 2D embeddings, as compared to the neighbors defined in ambient space. We used Euclidean (L2) distance, the default for these nonlinear reduction methods, to define each cell’s 30 nearest neighbors and measured Jaccard distance (dissimilarity) between the neighbors in embedding and ambient space (where 1.0 denotes no overlap). Several in vivo datasets were reduced to 2D, with PCA-preprocessing, including 10× Genomics and SMART-Seq assayed mouse ventromedial hypothalamus (VMH) neuron datasets [37], an ex utero cultured mouse embryo dataset (at the E8.5 stage) and an ex and in utero mouse embryo dataset (at the E10.5 stage) from [8], and a mouse primary motor cortex (MOp) dataset [38]. We additionally reduced cell culture-derived datasets, with and without external perturbations, including mouse embryonic stem cells (mESCs) treated in DMSO from [39] and multiplexed mouse neural stem cells (NSCs) in 96 drug combination conditions (labeled “96-plex”) [40] (see Table A in S1 Text).
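The neighbor-overlap measurement described above can be sketched in a few lines. This is a brute-force illustration on made-up coordinates, not the analysis code from the paper: the function names are hypothetical, the toy data use k = 1 rather than the 30 neighbors used in the study, and the 2D layout is invented rather than produced by UMAP or t-SNE.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest neighbors (L2)."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort every other point by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 1.0 means the neighbor sets share nothing."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def neighbor_dissimilarity(ambient, embedding, k):
    """Per-cell Jaccard distance between ambient and embedded k-NN sets."""
    amb = knn_sets(ambient, k)
    emb = knn_sets(embedding, k)
    return [jaccard_distance(a, e) for a, e in zip(amb, emb)]


# Hypothetical toy data: four "cells" in 3D and an invented 2D layout.
ambient = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 5.0), (5.0, 5.0, 5.0)]
embedding = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.1), (5.0, 5.0)]

scores = neighbor_dissimilarity(ambient, embedding, k=1)
print(scores, sum(scores) / len(scores))
```

For datasets of realistic size the quadratic neighbor search above would be replaced by an indexed or approximate nearest-neighbor search; the Jaccard step itself is unchanged.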

“Ambient” space refers to the gene count matrix after highly variable gene selection and log-normalization of the counts (see Methods in S1 Text). We denote “PCA-preprocessing” as the higher dimensional reduction of the ambient space by PCA, followed by a (nonlinear) reduction to 2D (e.g., “PCA-50D→UMAP”) which mimics standard practice. Additionally, cell annotations or labels (such as cell type or condition) used in the following analyses were taken from the original studies.

We begin with the columns of Fig 1, and assess the preservation of these properties by 2D embedding, as compared to the ambient space or higher-dimensional PCA space to which the ambient space is initially reduced prior to reduction to 2D (see Methods in S1 Text).

Distortion of trends in applications

Given the distortions of the necessary properties in Fig 1, we then investigated their impact on each row or application, i.e., how in practice such embeddings affect the inferences and implications made in each application.

---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288

Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
