(C) PLOS One
This story was originally published by PLOS One and is unaltered.
End-to-end deep learning approach to mouse behavior classification from cortex-wide calcium imaging [1]
Takehiro Ajioka, Nobuhiro Nakai (Department of Physiology and Cell Biology, Kobe University School of Medicine, Chuo, Kobe); Okito Yamashita (Department of Computational Brain Imaging, ATR Neural Information Analysis Laboratories)
Date: 2024-04
Deep learning is a powerful tool for neural decoding, broadly applied in systems neuroscience and clinical studies. Interpretable and transparent models that can explain neural decoding of intended behaviors are crucial for identifying the essential features that deep learning decoders extract from brain activity. In this study, we examine the performance of deep learning in classifying mouse behavioral states from mesoscopic cortex-wide calcium imaging data. Our convolutional neural network (CNN)-based end-to-end decoder, combined with a recurrent neural network (RNN), classifies the behavioral states with high accuracy and robustness to individual differences on sub-second temporal scales. Using the CNN-RNN decoder, we identify the forelimb and hindlimb areas in the somatosensory cortex as significant contributors to behavioral classification. Our findings imply that the end-to-end approach has the potential to be an interpretable deep learning method with unbiased visualization of critical brain regions.
Deep learning is widely used in neuroscience, making it possible to classify and predict behavior from massive datasets of neural signals recorded from animals, including humans. However, little is known about how deep learning discriminates the features of neural signals. In this study, we perform behavioral classification from calcium imaging data of the mouse cortex and investigate which brain regions are important for the classification. Using an end-to-end approach, an unbiased method that does not require selecting regions of interest, we show that information from the somatosensory areas of the cortex is important for distinguishing between resting and moving states in mice. This study will contribute to the development of interpretable deep learning technology.
Funding: KAKENHI from the Japan Society for the Promotion of Science (JSPS) (https://www.jsps.go.jp/english/), JP16H06316, JP16H06463, JP21H04813, JP23H04233, and JP23KK0132 to TT; Japan Agency for Medical Research and Development (https://www.amed.go.jp/en/index.html), JP21wm0425011 to TT; Japan Science and Technology Agency (https://www.jst.go.jp/EN/), JPMJMS2299 and JPMJMS229B to TT; Intramural Research Grant (30-9) for Neurological and Psychiatric Disorders of the National Center of Neurology and Psychiatry (https://www.ncnp.go.jp/en/) to TT; the Takeda Science Foundation (https://www.takeda-sci.or.jp/en/) to TT; and the Taiju Life Social Welfare Foundation (http://www.kousei-zigyodan.or.jp/9) to TT. KAKENHI from JSPS: JP19K16886, JP23K14673, and JP23H04138 to NN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Neural decoding is a method for understanding how neural activity relates to perception and the intended behaviors of animals. Deep learning is a powerful tool for accurately decoding movement, speech, and vision from neural signals recorded from the brain, and for neuroengineering applications such as brain-computer interface (BCI) technology, which exploits the correspondence between neural signals and their intended behavioral expressions [ 1 – 3 ]. In clinical studies, electrical potentials measured by electrodes implanted in a specific brain area, such as the motor cortex, are often used to decode intended movements such as finger motion, hand gestures, and limb-reaching behavior [ 4 – 7 ]. In contrast, neural decoding of whole-body movements such as running and walking remains uncertain due to technical difficulties. For example, contamination by noise signals (e.g., muscular electrical signals during muscle contraction) in electroencephalography (EEG) recordings disturbs the decoding of behaviors, and the immobilized conditions in functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) scanners preclude neural recording during whole-body movement. It is therefore challenging to decode voluntary behaviors involving whole-body movements from brain dynamics that contain complex information processing from motor planning to sensory feedback.
The calcium imaging technique allows us to measure in vivo neural activity during behavior, from microscopic cellular to mesoscopic cortex-wide scales [ 8 , 9 ]. Recent studies suggest that cellular activities have sufficient resolution for decoding behaviors. Cellular imaging data obtained with microendoscopy in the hippocampal formation have been used to decode freely moving mouse behaviors with Bayesian- and recurrent neural network (RNN)-based decoders [ 10 – 12 ]. In addition, a convolutional neural network (CNN) has been used to predict the outcome of lever movements from microscopic images of the motor cortex in mice [ 13 ]. On the other hand, little is known about whether mesoscopic cortex-wide calcium imaging, which captures neural activity at regional population resolution rather than cellular resolution, is applicable to neural decoding of animal behaviors. Our recent study suggests the potential to classify mouse behavioral states from mesoscopic cortex-wide calcium imaging data using a support vector machine (SVM) classifier [ 14 ]. This mesoscopic strategy may be well suited to end-to-end analyses, since it deals with substantial spatiotemporal information of neural activity across the cortex.
Preprocessing calcium imaging data, encompassing steps such as downsampling the spatiotemporal dimensions and selecting specific regions of interest (ROIs) within the images, can refine the data and generally enhance decoder performance, but it may also obscure valuable spatiotemporal information. Conversely, using images with minimal or no processing preserves the integrity of the original data, enabling more immediate decoding. This approach is suitable for near real-time behavior decoding and for identifying image areas significant for neural decoding without arbitrary data handling. CNNs are best suited to image data, while RNNs are often used for sequential inputs, including time-varying data [ 2 ]. By combining these architectures, CNN-RNN decoders better capture the temporal dynamics of behavioral features such as hand and finger movements from intracortical microelectrode array, electrocorticography, and electromyogram recordings, compared with classical machine learning methods [ 6 , 7 , 15 , 16 ]. Given these technological advances, we designed a two-step CNN-RNN model for decoding mouse behavioral states from mesoscopic cortical fluorescent images without intermediate processing. Moreover, identifying biologically essential features of deep learning classification is desirable to make the models interpretable and transparent explanations of neural decoding, as advocated by XAI (Explainable Artificial Intelligence) [ 17 ]. To this end, we applied a visualization strategy [ 18 ], previously applied to electrophysiology in neuroscience [ 19 ], to identify the features that contributed to the performance of the CNN-RNN-based classifications of our calcium imaging data. We identified the somatosensory areas as the most significant features for classifying behavioral states during voluntary locomotion.
This unbiased identification was supported by separate analyses of regional cortical activity using deep learning with an RNN and by assessment with Deep SHAP, an extension of Shapley additive explanations (SHAP) developed for deep learning [ 20 , 21 ]. Our findings demonstrate the feasibility of decoding voluntary whole-body behaviors from cortex-wide images and the advantages of this approach for identifying essential features of the decoders.
Results
To perform behavior classification from cortical activity with deep learning, we used previously reported data comprising mesoscopic cortex-wide 1-photon calcium imaging in mice exhibiting voluntary locomotion in a virtual environment under head-fixed conditions [14]. The fluorescent calcium signals from most of the dorsal cortex were imaged at a frame rate of 30 frames/s during a 10-min session (18,000 frames/session) from behaving mice (Fig 1A and 1B). Two behavioral states (run or rest) were defined by a threshold on locomotion speed (>0.5 cm/s) and binarized as 1 for run and 0 for rest in each frame. The proportion of run states during a session differed across mice (mean ± SD; mouse ID1, 36 ± 8% (n = 11 sessions); ID2, 66 ± 22% (n = 12 sessions); ID3, 65 ± 16% (n = 14 sessions); ID4, 58 ± 11% (n = 15 sessions); ID5, 80 ± 8% (n = 12 sessions); Fig 1C). We used all image data (1,152,000 images from 64 sessions) for deep learning decoding. To generalize decoding across individuals, we assigned the data to training, validation, and testing at a ratio of 3:1:1 on a per-mouse basis (Fig 1D). We thus generated 20 models covering all allocation combinations and classified the test data with each.
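The 3:1:1 per-mouse allocation can be sketched as follows. This is a minimal illustration, assuming each mouse's data are divided into five equal parts, with one part held out for validation and one for testing; choosing every ordered (validation, test) pair yields the 20 allocation combinations mentioned above. The helper names (`label_state`, `split_combinations`) are hypothetical, not from the paper.

```python
from itertools import permutations

def label_state(speed_cm_s: float) -> int:
    """Binarize a frame's behavioral state: 1 = run (>0.5 cm/s), 0 = rest."""
    return int(speed_cm_s > 0.5)

def split_combinations(n_parts: int = 5):
    """Enumerate (validation, test) part assignments; the remaining
    three parts form the training set, giving a 3:1:1 ratio."""
    combos = []
    for valid, test in permutations(range(n_parts), 2):
        train = [p for p in range(n_parts) if p not in (valid, test)]
        combos.append({"train": train, "valid": valid, "test": test})
    return combos

combos = split_combinations()
print(len(combos))  # 20 allocations, one model trained per allocation
```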
Fig 1. Cortical activity and behavioral states in behaving mice. (A) A schematic illustration of the experimental setup for measuring mesoscopic cortical calcium imaging and locomotor activity. (B) Images were obtained at 30 frames per second during a 600 s session. The label of the behavioral state was based on locomotion speed (>0.5 cm/s) at the corresponding frame. (C) Proportions of the behavioral states in each mouse (n = 11–15 sessions from 5 mice). (D) The data allocation on a per-mouse basis. The data of each mouse were assigned at a ratio of 3:1:1 for training (Train), validation (Valid), and testing (Test).
https://doi.org/10.1371/journal.pcbi.1011074.g001
CNN-based end-to-end deep learning accurately classified behavioral states from functional cortical imaging signals
We classified the behavioral states from images of cortical fluorescent signals using deep learning with a CNN. A pre-trained CNN, such as EfficientNet [22], allows for efficient learning. To handle the single-channel images obtained from calcium imaging, we converted a sequence of three images into a pseudo-3-channel RGB image by combining the previous and next images with the target image (Fig 2A). First, we trained a CNN with EfficientNet B0, using the individual RGB images as input and the binary behavior labels as output (Fig 2B). We used the model pre-trained on ImageNet for the initial weights. During training, the loss decreased with increasing epochs in the CNN decoders (Fig 2D, left), whereas in validation the loss increased with every epoch (Fig 2D, left). These results suggest that with a large amount of input data (more than 1 million images), CNN learning progresses efficiently even within one epoch, and the models easily overfit during training. We chose the model with the lowest validation loss as the decoder for each data allocation. Decoder performance was evaluated by the area under the receiver operating characteristic curve (AUC) over all test-data frames. The decoder using the CNN alone classified the behavioral states with an AUC of about 0.90 (0.896 ± 0.071, mean ± SD, n = 20 models; Fig 2E).
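The pseudo-3-channel conversion described above can be sketched as follows: each single-channel frame is stacked with its immediate neighbors to form an RGB-like input for the pre-trained CNN. This is a minimal sketch with NumPy; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def to_pseudo_rgb(frames: np.ndarray, t: int) -> np.ndarray:
    """Stack frames t-1, t, and t+1 of a single-channel movie
    (shape [T, H, W]) into a 3-channel image of shape [H, W, 3]."""
    return np.stack([frames[t - 1], frames[t], frames[t + 1]], axis=-1)

# Toy stand-in for a 1-photon calcium imaging movie
movie = np.random.rand(100, 128, 128).astype(np.float32)
img = to_pseudo_rgb(movie, t=50)
print(img.shape)  # (128, 128, 3)
```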
Fig 2. Behavioral state classification using deep learning with CNN. (A) Image preprocessing for deep learning with CNN. An image at frame t, together with the images at the neighboring frames (t−1 and t+1), was converted to an RGB image (image I t) labeled with the behavioral state. (B) Schematic diagram of the CNN decoder. The CNN was trained with individual RGB images and outputs the probability of running, computed from the 1,280 features extracted from each image. (C) Schematic diagram of the CNN-RNN decoder. The pre-trained CNN extracted 1,280 features from individual RGB images in the first step. In the second step, a series of 1,280-feature vectors obtained from consecutive images (e.g., eleven images from I t−5 to I t+5 (= input window, length ±0.17 s)) was input to the GRU-based RNN, which then output the probability of running. (D) Loss of CNN and CNN-GRU during training and validation across three epochs. (E) The area under the receiver operating characteristic curve (AUC) was used to indicate decoder accuracy; performance of decoders with CNN, CNN-LSTM, and CNN-GRU. ***P < 0.001, Wilcoxon rank-sum test with Holm correction, n = 20 models. (F) The performance of CNN-GRU decoders deteriorated with smaller input windows but did not change significantly at window lengths of 0.17 s and above. **P < 0.01, N.S., not significant, Wilcoxon rank-sum test with Holm correction, n = 20 models.
https://doi.org/10.1371/journal.pcbi.1011074.g002
To improve decoding performance, we then created a two-step deep learning architecture that combines the CNN with a long short-term memory (LSTM) [23]- or gated recurrent unit (GRU) [24]-based RNN, in which the output of the final CNN layer was compressed by average pooling and connected to the RNN (Fig 2C). At this stage, the input data were the sequential RGB images from −0.17 s to +0.17 s around image I t, located at the center of the input time window. We chose this window size for the decoder tests because performance deteriorated with smaller windows (see Fig 2F). We used the weights of the former CNN decoders as the initial values for the two-step CNN-RNN. As with the CNN decoders, the loss of the two-step CNN-RNNs decreased with increasing epochs in training but increased in validation (Fig 2D, right). Behavioral state classification improved with the two-step CNN-RNNs regardless of individual cortical images and behavioral activities (GRU, 0.955 ± 0.034; LSTM, 0.952 ± 0.041; mean ± SD, n = 20 models; Fig 2E). In addition, we confirmed that classification accuracy deteriorated slightly with smaller time windows in the two-step deep learning (mean ± SD; 0.033 s, 0.896 ± 0.100; 0.067 s, 0.929 ± 0.072; n = 20 models; Fig 2F). Performance improved gradually but not significantly as the time window increased from 0.17 s to 0.50 s (0.17 s, 0.955 ± 0.034; 0.33 s, 0.960 ± 0.040; 0.50 s, 0.960 ± 0.044; Fig 2F). These results demonstrate that deep learning decoding with a CNN classifies locomotion and rest states accurately from functional cortical imaging, consistently across individual mice, and that performance can be improved by combining it with an RNN.
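The second step of the two-step architecture can be sketched as a GRU running over the per-frame CNN feature vectors. This is a minimal PyTorch sketch under stated assumptions: the first-step CNN (EfficientNet B0 with average pooling) is omitted and replaced by pre-computed 1,280-d feature vectors; the class name `CNNGRUDecoder` and the hidden size of 128 are illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn

class CNNGRUDecoder(nn.Module):
    """GRU head over a sequence of CNN feature vectors.
    Input: [batch, 11, 1280], i.e. features for I t-5 .. I t+5."""
    def __init__(self, n_features: int = 1280, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(x)                       # h: [1, batch, hidden]
        return torch.sigmoid(self.head(h[-1]))   # probability of "run"

# Toy stand-in for CNN features of an 11-frame window (batch of 4)
feats = torch.randn(4, 11, 1280)
probs = CNNGRUDecoder()(feats)
print(probs.shape)  # torch.Size([4, 1])
```

Swapping `nn.GRU` for `nn.LSTM` (whose second return value is a `(h, c)` tuple) gives the CNN-LSTM variant compared in Fig 2E.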
The somatosensory area contains valuable information for the behavioral classification
To make the deep learning decoding interpretable, we quantified the critical areas of the images that contributed to the behavioral classification in the CNN-RNN decoder. Zeiler and Fergus proposed a validation method for CNN decoders that removes information from images by masking areas [18]. Similarly, we calculated and visualized an importance score for subdivisions of the images in each decoder using a method we term cut-out importance (see Methods for details). Briefly, a subdivision of the image was covered with a mask filled with 0 before evaluation, and the decoder tested with the masked images was compared with the decoder tested with the original unmasked images (Fig 3A). The importance score indicates how much the decoder's performance was affected by masking that area. The highest importance score was detected slightly above the middle of the left hemisphere (0.054 ± 0.045; mean ± SD, n = 20 models; Fig 3B). The symmetrically opposite area also scored higher than the other subdivisions within the right hemisphere (0.024 ± 0.014). This laterality appeared to derive from individual differences (S1 Fig). These subdivisions corresponded to the anterior forelimb and hindlimb areas of the somatosensory cortex (Fig 3C and S2 Fig), which were among the essential cortical areas identified in our previous study using SVM machine learning classification [14]. When these subdivisions together with adjacent areas in the middle of the left and right hemispheres were occluded simultaneously, decoding performance dropped significantly (S3 Fig), suggesting that the middle left and right hemispheres are crucial for behavioral classification.
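The cut-out importance procedure described above can be sketched as follows: zero-mask each subdivision in turn and record the drop in decoder performance relative to the unmasked images. This is a minimal sketch, assuming a 4×4 grid (each mask covering 1/16 of the image, as in Fig 3A); `cutout_importance` and the toy `score` function are illustrative, and in the paper the score would be the decoder's AUC on the test frames.

```python
import numpy as np

def cutout_importance(images: np.ndarray, score_fn, grid: int = 4) -> np.ndarray:
    """Importance of each of the grid x grid subdivisions of
    images (shape [N, H, W]): the drop in score_fn when that
    subdivision is masked with zeros."""
    base = score_fn(images)
    H, W = images.shape[1], images.shape[2]
    h, w = H // grid, W // grid
    scores = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            masked = images.copy()
            masked[:, i * h:(i + 1) * h, j * w:(j + 1) * w] = 0.0
            scores[i, j] = base - score_fn(masked)
    return scores

# Toy example: a "decoder" whose score depends only on the top-left corner,
# so only masking that subdivision should matter.
images = np.ones((10, 32, 32), dtype=np.float32)
score = lambda ims: float(ims[:, :8, :8].mean())
imp = cutout_importance(images, score, grid=4)
print(imp[0, 0])  # 1.0 — masking the top-left subdivision removes all signal
```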
Fig 3. Visualization of essential features in the CNN-RNN decoder. (A) An importance score was calculated by averaging the differences in classification accuracy caused by a 1/16 masking area in each image (see Methods for details). (B) Importance scores for each subdivision (mean ± SD, n = 20 models). (C) Overlay of importance scores on the cortical image with ROI positions. See S2 Fig for ROIs 1–50.
https://doi.org/10.1371/journal.pcbi.1011074.g003
---
[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011074
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.