This story was originally published in Plos One Journal:

URL: plosone.org. The content has not been altered
Licensed under Creative Commons Attribution (CC BY) license .
url:https://journals.plos.org/plosone/s/licenses-and-copyright
(C) Plos One [1]
--------------------

Assessing the replicability of spatial gene expression using atlas data from the adult mouse brain

Shaina Lu, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America, Cantin Ortiz, Department of Neuroscience, Karolinska Institutet, Solna, Daniel Fürth


Allen Reference Atlas brain areas are classifiable using gene expression alone

With the advent of new high-throughput capture technologies for ST, we present, as is necessary for all new biological assays, a cross-technology assessment of generalizability in a well-characterized model system: the adult mouse brain. These new technologies allow, for the first time, the cross-platform assessment of canonical, atlas brain area subdivisions relative to gene expression at a whole-brain scale. Traditionally, parcellation of the mouse brain has depended on anatomical landmarks and cytoarchitecture, at times also including interregion connectivity and molecular properties [17,22,23]. By enabling the relatively rapid and high-throughput collection of spatially resolved, whole-transcriptome data in the adult mouse brain, these new spatial assays pave the way for a multimodality assessment of canonical brain area labels. Specifically, in the present work, we ask if brain areas from the ARA [17] are classifiable using 2 spatial gene expression datasets: the Allen Institute’s own ISH data [17,18] and a second dataset collected using ST [1,21] (Fig 1A and 1B). After filtering, the ABA consists of 62,527 voxels (rows) with expression from 19,934 unique genes (columns) mapping to 569 nonoverlapping brain area labels, and the ST consists of 30,780 spots (rows) with 16,557 genes (columns) mapping to 461 brain area labels (see Methods for details). The ABA dataset consists of a minimum of roughly 3,260 brains, while the ST dataset is collected from 3 mice [17,21] (see Methods). Comparing accuracy in classification of ARA brain areas across 2 technological platforms and datasets allows us to draw conclusions about spatial expression that are more likely to be biological and generalizable than subject to the technical biases of any one dataset.

To determine whether canonical brain areas can be identified from spatial gene expression more generally, we first asked if we could do so within each of the 2 datasets independently. Given the known high correlation structure of gene expression [24], we hypothesized that we could determine the brain area of origin of a gene expression sample using only a subset of the total genes. Fitting these criteria, we chose least absolute shrinkage and selection operator, or LASSO regression [25]. LASSO is a regularized linear regression model that, in addition to minimizing the prediction error, penalizes the L1 norm of the coefficients (i.e., the sum of the absolute values of the coefficients). LASSO typically drives most coefficients to zero and thus leaves few genes contributing to the final model; LASSO in effect picks “marker genes” of spatial expression in the brain. We use LASSO in a supervised learning framework with a random 50/50 train–test split for two-class classification of all pairwise brain areas successively (Fig 1C) (see Methods). The brain areas included here are nonoverlapping and are the smallest brain areas present in the ARA naming hierarchy. We subsequently refer to these areas as leaf brain areas since they form the leaves of the tree-based representation of the ARA-named brain areas [17]. The performance of the test set classification is reported using the area under the receiver operating characteristic curve (AUROC). The AUROC can be thought of as the probability of correctly predicting a given brain region from its gene expression in a comparison with an outgroup (here, a different brain region) and is calculated by taking the predictions from the trained LASSO model and evaluating their correspondence with the known labels in the test fold (see Methods). For example, if ranking the samples by the LASSO predictions separates the samples from the 2 classes perfectly without being interspersed, we would get perfect classification with an AUROC of 1, while a score of 0.5 corresponds to random performance. More generally, in this manuscript, we say a brain area pair is classifiable with respect to each other to indicate a high performance in classification with an AUROC greater than 0.5 and generally closer to 1.
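The pairwise procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the coordinate-descent solver, the scaling of the penalty (`lam`), the simulated data, and all function names are assumptions made for illustration. It fits a LASSO model on a random 50/50 split of two simulated "brain areas" and scores the held-out half with a rank-based AUROC.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; this is what zeroes out coefficients."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_lasso(X, y, lam=0.1, n_iter=100):
    """Coordinate descent on (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1 (assumed scaling)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [y[i] - b0 for i in range(n)]  # residuals for the current b0 and w
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Fit of gene j to the residual with gene j's own contribution added back.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
        shift = sum(resid) / n  # re-center residuals by updating the intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, w

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of a tied block
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pairwise_demo(seed=0, n_per_area=100, n_genes=10):
    """Simulate two areas (hypothetical data), train on half, report test AUROC."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_area):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] += 3.0 * label  # gene 0 is the only simulated marker gene
            X.append(row)
            y.append(label)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    b0, w = fit_lasso([X[i] for i in train], [y[i] for i in train], lam=0.1)
    preds = [b0 + sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
    return auroc(preds, [y[i] for i in test]), w
```

Calling `pairwise_demo()` returns the test AUROC and the fitted coefficients. In this toy setup the simulated marker gene keeps a nonzero coefficient while the noise genes are mostly set exactly to zero, which mirrors the "marker gene" behavior described in the text.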

After preliminary filtering (see Methods), we use this approach in both the ST and ABA to classify all the leaf brain areas against each of the others (461 ST areas; 560 ABA areas) (Fig 1C; see Methods). ARA leaf brain areas are classifiable using LASSO (lambda = 0.1) from all other leaf brain areas using only gene expression data from (1) the ABA (mean AUROC = 0.996) (Fig 2A, S1A Fig) and from (2) the ST (mean AUROC = 0.883) (Fig 2B, S1B Fig). These results are consistent across an additional, independent train/test fold split for both datasets (ABA mean AUROC = 0.996, correlation to first split, rho = 0.732; ST mean AUROC = 0.882, correlation to first split, rho = 0.860) (S1C–S1F Fig). As expected, performance falls to chance when brain area labels are permuted as a control (ABA mean AUROC = 0.510; ST mean AUROC = 0.501) (S2A–S2D Fig). Together, these results indicate that, for each leaf brain area, there is a set of genes whose expression levels can be used to identify it, and they suggest that canonical brain area labels do reflect spatial patterning of gene expression assayed in both the ABA and ST datasets.
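The logic of the label-permutation control can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the Methods: the scores are simulated from a single hypothetical marker rather than produced by LASSO, the labels are shuffled only at evaluation time (a full pipeline would presumably also retrain on the permuted labels), and the function names are made up for this example.

```python
import random

def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def simulate_scores(seed=0, n_per_area=300):
    """Hypothetical classifier scores: area 1 is shifted upward relative to area 0."""
    rng = random.Random(seed)
    scores, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_area):
            scores.append(rng.gauss(3.0 * label, 1.0))
            labels.append(label)
    return scores, labels

def permutation_control(scores, labels, seed=0):
    """Return (AUROC with true labels, AUROC after shuffling the labels)."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)  # breaks any link between score and brain area
    return auroc(scores, labels), auroc(scores, shuffled)
```

With the true labels the AUROC in this toy example is close to 1, while the shuffled labels give a value near 0.5, the same qualitative pattern as the permuted-label results reported above.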

Fig 2. Canonical brain areas are classifiable using gene expression alone in the ABA and ST datasets. Heat map of AUROC for classifying leaf brain areas from all other leaf brain areas in (A) ABA and (B) ST using LASSO (lambda = 0.1). Dendrograms on the far left side represent clustering of leaf brain areas based on the inverse of AUROC; areas with an AUROC near 0.5 get clustered together, while areas with an AUROC near 1 are further apart. Color bar on the left represents the major brain structure that the leaf brain area is grouped under. These areas include CTX, MB, CB, CNU, HB, and IB. (C) Average AUROC (y-axis) of classifying all brain areas from all other brain areas using LASSO across various values of lambda (x-axis): 0, 0.01, 0.05, and 0.1 for ABA train (blue diamond), ABA test (blue dot), ST train (orange diamond), and ST test (orange dot). (D) Number of principal components to capture at least 80% of variance of genes in each of the leaf brain areas after applying PCA to ABA (blue) and ST (orange). ABA brain areas that are larger than ST are randomly down-sampled to have the same number of samples as ST prior to applying PCA. (E) Gene–gene correlations calculated as Spearman’s rho between all pairwise genes across the whole dataset for both the ABA (blue) and ST (orange) independently. ABA, Allen Brain Atlas; AUROC, area under the receiver operating curve; CB, cerebellum; CNU, striatum and pallidum; CTX, cortex; HB, hindbrain; IB, thalamus and hypothalamus; ISH, in situ hybridization; LASSO, least absolute shrinkage and selection operator; MB, midbrain; PCA, principal component analysis; ST, spatial transcriptomics. https://doi.org/10.1371/journal.pbio.3001341.g002

Since our task can be conceived as a multiclass classification problem, we asked if brain area classification performance could be improved using a true multiclass classifier. To test this question, we used the k-nearest neighbors (k-NN) algorithm, which simply assigns the class identity of a test sample based on the majority class label (brain area) of its k closest neighbors in feature (here, expression) space. Using k-NN (k = 5), classification performance for leaf brain areas fell in both the ABA (mean AUROC = 0.695; S2E Fig) and the ST (mean AUROC = 0.508; S2F Fig) (see Methods). Given the lack of improvement in performance and our preference for a biologically interpretable approach, we chose to continue most analyses using LASSO.
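The majority-vote rule described above can be written in a few lines. The sketch below assumes Euclidean distance in expression space and uses hypothetical two-gene samples; the distance metric and any preprocessing actually used are given in the Methods and are not reproduced here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Assign x the majority brain-area label among its k nearest training samples."""
    # Sort training samples by Euclidean distance to x (an assumed metric).
    neighbors = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression vectors for two areas.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["CA2", "CA2", "CA2", "ARH", "ARH", "ARH"]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # prints "CA2"
```

Unlike LASSO, this rule yields no per-gene coefficients, which is one reason it is less directly interpretable in terms of marker genes.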

We next asked if single-gene marker selection strategies could outperform LASSO. Highlighting specific brain areas where such markers are known, we looked at classifying the CA2 of the hippocampus and the arcuate hypothalamic nucleus with Amigo2 and Pomc, respectively [26–28]. Following long-standing anatomical divisions of the mouse brain, the hippocampal subregions were redefined in the mid-2000s using differences in gene expression [29,30]. Follow-up work to these early redefinitions found that, while not exclusively expressed in the CA2, Amigo2 showed high expression levels there [28]. Indeed, in the CA2 of the hippocampus, Amigo2 performs better than any other single gene in the ABA (Amigo2 ABA AUROC = 0.920) and ST datasets (Amigo2 ST AUROC = 0.612) (S3A Fig). However, classification of the CA2 using Amigo2 is still outperformed by the average performance of genes selected by LASSO. One of the major neuronal populations of the arcuate hypothalamic nucleus is the POMC-expressing neurons, which have been shown to have a role in food intake and metabolism [27]. In the arcuate hypothalamic nucleus, Pomc performance in the ABA (Pomc ABA AUROC = 0.993) and ST (Pomc ST AUROC = 0.910) is better than most other single genes and comparable to or lower than the average LASSO performance for each dataset (S3B Fig). Given the comparable performance and, more importantly, since no such markers are known for most brain areas, we again turned our attention to using LASSO for classifying brain areas.

Notably, performance using LASSO in the ABA is nearly perfect. That the classification in the ABA performs so well is striking, especially considering the potential loss of ISH-level resolution in the voxel representation of the ABA. For the median-performing pair of brain areas in ABA (median AUROC = 1), there is a threshold in classification that can be drawn where all instances of one class can be correctly predicted without any false positives (precision = 1). In contrast, in the ST, no such threshold can be found for the median-performing (median AUROC = 0.959) brain areas (average precision = 0.846) (see Methods). Further, performance in the ABA is consistently higher than the ST across various parameterizations of LASSO (Fig 2C) (see Methods). Despite the comparatively lower performance in the ST, clustering brain areas by AUROC shows brain areas belonging to the same major anatomical region grouping together (Fig 2B) (see Methods). For example, most brain areas belonging to the cortex group together in the middle of the heat map (green bar on left) with a few interspersed areas. This grouping suggests that patterns of expression track with broad anatomical labels. Examining the relative expression of genes that are assayed in both datasets, we see that ranked mean expression is comparable across the 2 datasets (Spearman’s ρ = 0.599) (S3C Fig), suggesting that the observed difference in performance is not due to poorly detected genes being well detected in the opposite dataset or vice versa.
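The distinction between a pair that admits a perfect threshold and one that does not can be made concrete with a short sketch. The definition of average precision used here (the mean of the precision values at the rank of each true positive, ignoring ties) is a common one and is an assumption; the exact computation behind the reported values is described in the Methods. Function names and example scores are hypothetical.

```python
def has_perfect_threshold(scores, labels):
    """True if some cutoff puts every positive above every negative."""
    lowest_pos = min(s for s, l in zip(scores, labels) if l == 1)
    highest_neg = max(s for s, l in zip(scores, labels) if l == 0)
    return lowest_pos > highest_neg

def average_precision(scores, labels):
    """Mean precision at the rank of each true positive (ties not handled)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / hits

# Positives cleanly above negatives: a perfect threshold exists.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))  # prints 1.0
# One negative outranks a positive: no perfect threshold.
print(has_perfect_threshold([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # prints False
```

In the second example the average precision is (1/1 + 2/3) / 2, about 0.83, which shows how a single interleaved sample already pulls the value below 1.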

Observing the nearly perfect performance in the ABA, we next hypothesized that this dataset may be lower dimensional than suggested by its feature size and may contain many highly correlated features when compared to the ST dataset. We applied principal component analysis (PCA) in each brain area separately by subsetting the data by brain areas, then calculating PCA in each of these subsets independently. Using this approach, we find that on average in individual brain areas, 2 PCs are enough to summarize 80% of the variance per brain area in ABA versus 21 PCs in ST (Fig 2D, S3D Fig) (see Methods). In other words, within each brain area in the ABA, many genes are highly coexpressed. Zooming out to the whole brain, using 200 PCs captures nearly 70% of the variance in ABA compared to nearly 20% in ST (S3E Fig). Further, gene–gene coexpression across the whole dataset is on average higher in the ABA (gene–gene mean Spearman’s rho = 0.525) than in the ST (gene–gene mean Spearman’s rho = 0.049) (Fig 2E). The near-perfect performance, low dimensionality on a per brain area basis, and high coexpression all support the idea that although there is meaningful variation in the ABA, it can be captured in few dimensions. In summary, canonical ARA brain areas are classifiable from each other using gene expression alone, but performance is likely inflated in the ABA.
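The gene–gene coexpression summary above is a mean of Spearman correlations over all gene pairs. A minimal sketch of that calculation follows, assuming each gene is given as a list of expression values across the same samples; the function names and example values are hypothetical, and the sketch is quadratic in the number of genes, so it is meant to show the definition rather than to scale to a whole dataset.

```python
import math
from itertools import combinations

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(a), average_ranks(b))

def mean_pairwise_spearman(genes):
    """Average Spearman's rho over all unordered pairs of genes."""
    pairs = list(combinations(genes, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Hypothetical expression of 3 genes across 4 samples.
genes = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1]]
print(mean_pairwise_spearman(genes))
```

A dataset in which most genes rise and fall together across samples gives a mean close to 1, whereas largely independent genes give a mean near 0, which is the contrast drawn between the ABA and the ST.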

An aside of note is that in the ABA, the one brain area that is consistently lower performing when classified against most other brain areas is the Caudoputamen (mean AUROC = 0.784) (Fig 2A, black arrows). In the ST, the Caudoputamen is not the lowest performing area, but it also has a low mean AUROC (mean AUROC = 0.619) relative to the other brain areas in ST. In both datasets, the Caudoputamen is the largest leaf brain area, composed of the most samples (ABA Caudoputamen number of voxels = 3,012 versus an average of 85.6 voxels; ST number of spots = 2,051 versus an average of 57 spots). The Caudoputamen is similarly large in other rodent brain atlases, reflecting its lack of cytoarchitectural features [31]. We hypothesized that its relatively larger size could mean that it consists of transcriptomically disparate subsections that are not captured with canonical ARA labeling. Although the Caudoputamen is not an outlier in this respect, we do observe that its mean sample correlation in both the ST (mean Pearson’s r = 0.727) and ABA (mean Pearson’s r = 0.665) is slightly lower than the mean across brain areas in either case (ST mean Pearson’s r = 0.783; ABA mean Pearson’s r = 0.696) (S4A Fig). More generally, however, we observe that there is no relationship between size and performance across brain regions (S4B and S4C Fig). In addition to being an outlier in terms of size, the Caudoputamen is the dorsal part of the striatum that encompasses many different functional subdivisions evident through the various corticostriatal projections [31]. Together with the low classification performance of the Caudoputamen using gene expression, this reflects the shortcomings of the ARA Caudoputamen label and the likely need to subdivide the Caudoputamen functionally.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001341
