(C) PLOS [1]. This unaltered content originally appeared in PLOS Computational Biology (journals.plos.org).
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Benchmarking of deep learning algorithms for 3D instance segmentation of confocal image datasets

Anuradha Kar (Laboratoire RDP, Université de Lyon, ENS de Lyon, INRAE, Inria, CNRS, UCBL, Lyon; Institut du Cerveau, Paris Brain Institute, Paris)

Date: 2022-05

Segmenting three-dimensional (3D) microscopy images is essential for understanding phenomena like morphogenesis, cell division, cellular growth, and genetic expression patterns. Recently, deep learning (DL) pipelines have been developed, which claim to provide high accuracy segmentation of cellular images and are increasingly considered as the state of the art for image segmentation problems. However, it remains difficult to assess their relative performance, as the diversity of these pipelines and the lack of uniform evaluation strategies make their results hard to compare. In this paper, we first made an inventory of the available DL methods for 3D cell segmentation. We next implemented and quantitatively compared a number of representative DL pipelines, alongside a highly efficient non-DL method named MARS. The DL methods were trained on a common dataset of 3D cellular confocal microscopy images. Their segmentation accuracies were also tested in the presence of different image artifacts. A specific method for segmentation quality evaluation was adopted, which isolates errors due to under- or oversegmentation. This is complemented with a 3D visualization strategy for interactive exploration of segmentation quality. Our analysis shows that the DL pipelines have different levels of accuracy. Two of them, which are end-to-end 3D and were originally designed for cell boundary detection, show high performance and offer clear advantages in terms of adaptability to new data.

In recent years, a number of deep learning (DL) algorithms based on computational neural networks have been developed, which claim to achieve high accuracy and automatic segmentation of three-dimensional (3D) microscopy images. Although these algorithms have received considerable attention in the literature, it is difficult to evaluate their relative performances, and it remains unclear whether they really perform better than other, more classical segmentation methods. To clarify these issues, we performed a detailed, quantitative analysis of a number of representative DL pipelines for cell instance segmentation from 3D confocal microscopy image datasets. We developed a protocol for benchmarking the performances of such DL-based segmentation pipelines using common training and test datasets, evaluation metrics, and visualizations. Using this protocol, we evaluated and compared 4 different DL pipelines to identify their strengths and limitations. A high performance non-DL method was also included in the evaluation. We show that DL pipelines can differ significantly in performance depending on their model architecture and pipeline components, but overall show excellent adaptability to unseen data. We also show that our benchmarking protocol can be extended to a variety of segmentation pipelines and datasets.

Funding: This study was supported by the Agence Nationale de la Recherche-ERA-CAPS grant, Gene2Shape (17-CAPS-0006-01) attributed originally to JT. AK was funded with a Post-doctoral fellowship under this grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Kar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The use of three-dimensional (3D) quantitative microscopy has become essential for understanding morphogenesis at cellular resolution, including cell division and growth as well as the regulation of gene expression [1]. In this context, image segmentation to identify individual cells in large datasets is a critical step. Segmentation methods broadly fall into 2 types. In the first, "semantic segmentation," each pixel within an image is associated with one of the predefined categories of objects present in the image. The other type, which is of interest in this paper, is "instance segmentation" [2]. This type of method goes one step further by associating each pixel with an independent object within the image. Segmenting cells from microscopy images falls within this second type of problem. It involves locating the cell contours and cell interiors such that each cell within the image may be identified as an independent entity [3]. High accuracy cell instance segmentation is essential to capture significant biological and morphological information such as cell volumes, shapes, growth rates, and lineages [4].

A number of computational approaches have been developed for instance segmentation (e.g., [1,5–7]), such as the commonly used watershed, graph partitioning, and gradient-based methods. In watershed approaches, seed regions are first detected using criteria like local intensity minima or user-provided markers. Starting from the seed locations, these techniques group neighboring pixels by imposing similarity measures until all the individual regions are identified. In graph partitioning, the image is treated as a graph, with the image pixels as its vertices. Subsequently, pixels with similar characteristics are clustered into regions also called superpixels. Superpixels represent groups of pixels sharing some common characteristic such as pixel intensity. In some graph-based approaches such as [8–10], superpixels are first estimated by oversegmenting an image, followed by graph partitioning to aggregate these superpixels into efficiently segmented regions of the image. Gradient-based methods use edge or region descriptors to drive a predefined contour shape (usually rectangles or ellipses) and progressively fit it to accurate object boundaries, based on local intensity gradients [11,12].
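As a toy illustration of the superpixel-plus-graph-partitioning idea (not the specific algorithms of [8–10]), the following Python sketch oversegments an image with SLIC and then merges similar superpixels through a region adjacency graph using scikit-image; the segment count and merge threshold are arbitrary illustrative values:

```python
# Oversegment into superpixels, then merge similar ones via a
# region adjacency graph (RAG). Parameter values are illustrative only.
import skimage.data
import skimage.segmentation as seg
from skimage import graph  # in older scikit-image versions: skimage.future.graph

img = skimage.data.coffee()                      # example RGB image
superpixels = seg.slic(img, n_segments=400, compactness=10)
rag = graph.rag_mean_color(img, superpixels)     # nodes = superpixels, edges = color similarity
merged = graph.cut_threshold(superpixels, rag, thresh=29)  # aggregate similar regions
```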

Common challenges for these segmentation methods arise in low-contrast images containing fuzzy cell boundaries. These may be due to nearby tissue structures and anisotropy of the microscope that perturb signal quality, poor intensity in deeper cell layers, or blur and random intensity gradients arising from varied acquisition protocols [13,14]. Some errors can also be due to the fact that cell wall membrane markers are not homogeneous at tissue and organ level: In some regions, the cell membrane is very well marked, resulting in an intense signal, while in other regions, this may not be the case. These different problems lead to segmentation errors such as incorrect cell boundary estimation, single cell regions mistakenly split into multiple regions (oversegmentation), or multiple cell instances fused into a single condensed region (undersegmentation).

In recent years, a number of computational approaches based on large neural networks (commonly known as deep learning or DL) [15] have been developed for image segmentation [16–18]. The key advantages of DL-based segmentation algorithms include automatic identification of image features, high segmentation accuracy, minimal human intervention (after the training phase), no need for manual parameter tuning during prediction, and very fast inference. These DL algorithms are made of computational units ("neurons"), which are organized into multiple interconnected layers. To train a network, one needs to provide input training data (e.g., images) and the corresponding target output (ground truth). Each network layer transforms the input data from the previous level into a more abstract feature map representation for the next level. The final output of the network is compared with the ground truth using a loss (or cost) function. Learning in a neural network involves repeating this process many times while automatically tuning the network parameters. By passing the full set of training data through the DL network a number of times (each full pass is termed an "epoch"), the network estimates the optimal mapping function between the input and the target or ground truth data. The number of epochs can be on the order of hundreds to thousands depending on the type of data and the network. The training runs until the training error is minimized. Thereafter, in a "recognition" phase, the neural network with these learned parameters can be used to identify patterns in previously unseen data.
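The training procedure described above can be condensed into a few lines of code. The following PyTorch sketch uses a toy model and random stand-in data purely to make the epoch/loss/parameter-tuning loop concrete; it is not any of the benchmarked pipelines:

```python
# Minimal sketch of a DL training loop: each epoch passes the training set
# through the network, compares predictions with ground truth via a loss
# function, and tunes the parameters by gradient descent.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 2, 1))            # toy 2-class segmenter
loss_fn = nn.CrossEntropyLoss()                      # compares output with ground truth
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.rand(16, 1, 64, 64)                   # stand-in training images
labels = torch.randint(0, 2, (16, 64, 64))           # stand-in ground-truth masks

for epoch in range(100):                             # one epoch = full pass over the data
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)            # training error
    loss.backward()                                  # backpropagation
    optimizer.step()                                 # tune network parameters
```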

When DL is applied to image segmentation, the training and inference processes are identical to those described above. The input data comprise raw images (grayscale or RGB), and the ground truth data are composed of highly precise segmentations of these input images in which the desired regions are labeled.

Instance segmentation using DL is a challenging task, especially for 3D data, due to the large computational time and memory required to extract individual object instances [19,20]. Therefore, the current trend in DL-based segmentation methods is to proceed in 2 steps. First, deep networks are used to provide high-quality semantic segmentation outputs. This involves the extraction of several classes of objects within an image such as cell boundaries, cell interiors, and background. These DL outputs are then combined with traditional segmentation methods to achieve the final high accuracy and automatic instance segmentation, even in images with noise and poor signal quality [21]. A generic workflow of such a DL-based instance segmentation process is shown in Fig 1.

Fig 1. Generic workflow of a DL-based image segmentation pipeline. The DL network is first trained to produce a semantic segmentation which corresponds as closely as possible to a given ground truth. The trained network is then used to segment unseen images. The resulting semantic segmentation is then further processed to obtain the final instance segmentation. DL, deep learning. https://doi.org/10.1371/journal.pcbi.1009879.g001
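A minimal sketch of this second, postprocessing step, assuming a hypothetical 3D boundary probability map produced by some trained network (here replaced by random values), could look as follows; real pipelines add seed filtering and careful parameter tuning:

```python
# From semantic to instance segmentation: threshold the predicted boundary
# map, label the resulting cell interiors as seeds, and run a watershed.
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

boundary_prob = np.random.rand(64, 128, 128)   # stand-in for a 3D DL boundary output
interior = boundary_prob < 0.5                 # voxels far from predicted cell walls
seeds, n_cells = ndi.label(interior)           # one seed label per candidate cell
instances = watershed(boundary_prob, seeds)    # flood from seeds over the boundary map
```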

In contemporary DL literature, 2 types of architecture are commonly used for segmentation: those based on the UNet/residual UNet network [22,23] and approaches using region proposal networks (RPNs) or Region-based Convolutional Neural Networks (RCNNs) [24]. We briefly present the general properties of both types of networks. The UNet [22] has a symmetric DL architecture. One part (called the encoder) extracts the image features, and the other part (named the decoder) combines the features and spatial information to obtain the semantic segmentation, for example, cell boundaries, cell body, and image background. To obtain the instance segmentation, this is followed by methods such as watershed or graph partitioning. Examples of UNet-based two-dimensional (2D) and 3D segmentation algorithms include [21,25,26].
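To make the encoder/decoder structure concrete, here is a deliberately tiny 2D UNet-style model in PyTorch, with one downsampling level and one skip connection; the actual pipelines discussed below use much deeper variants, with residual blocks and/or 3D convolutions:

```python
# Tiny UNet-style encoder-decoder with a single skip connection.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, n_classes=3):                 # e.g., boundary/interior/background
        super().__init__()
        self.enc1, self.enc2 = block(1, 16), block(16, 32)   # encoder path
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = block(32, 16)                             # decoder path
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                            # high-resolution features
        e2 = self.enc2(self.pool(e1))                # coarse features
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))    # skip connection
        return self.head(d)                          # per-pixel class scores

scores = TinyUNet()(torch.rand(1, 1, 64, 64))        # -> shape (1, 3, 64, 64)
```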

Besides UNet, the other state-of-the-art DL architecture is the RCNN, such as Mask R-CNN or MRCNN [27]. MRCNN differs from UNet in that it includes modules for object detection and classification. MRCNN has been used for high accuracy and automatic segmentation of microscopy images in several works (e.g., [28,29]).

There currently exists a large number of DL pipelines (we have identified and reviewed up to 35 works from the last 5 years in S4 File), in which variants of both of the above architectures are used to address specific challenges in segmentation such as sparse datasets, availability of partial ground truths, temporal information, etc. (see S4 File for a more extensive review). However, the diversity of the currently available pipelines and the inconsistent use of segmentation accuracy metrics make it difficult to characterize and evaluate their relative performance based on the literature. This diversity has motivated several benchmarking studies such as [30,31]. Ulman and colleagues [30] describe a thorough evaluation of 21 cell tracking pipelines, one of which includes a 3D UNet DL system for segmentation. This study evaluates the capability of the methods to segment and track different types of data correctly (optical and fluorescent imaging, and single and densely packed cells). Leal-Taixé and colleagues [31] compare 4 epithelial cell tracking algorithms using 8 time-lapse series of epithelial cell images. Although both papers highlight the importance of accurate image segmentation and underline the performance of DL, their characterization of method errors is focused on the cell tracking part. In this paper, we focus in detail on the segmentation itself, comparing extensively and quantitatively the capacity of a number of selected DL protocols to accurately segment 3D images. To do so, we retrained the DL systems on a common benchmark 3D dataset and analyzed the segmentation characteristics of each pipeline at cellular resolution. The pipelines used here are based on either UNet or RCNN architectures and were selected from the literature based on the following criteria. (i) First, as the focus of this work is on 3D confocal datasets, the pipelines must be built for 3D instance segmentation of static images. Analyses of temporal information or specific architectures for cell or particle tracking are not included, as these are extensively covered in [30,31]. (ii) Next, the pipeline implementations, including pre- and postprocessing methods, must be available in open-source repositories. (iii) To ensure that the pipelines can be properly reproduced on other machines, the training datasets originally used by the authors must be publicly available. (iv) Last, the DL pipelines must be trainable with new datasets. Based on these criteria, we identified 4 pipelines ([21,24,26,32]), which we describe below.

The first pipeline is an adapted version of Plantseg [26], which can be trained using 3D images composed of voxels. It uses a variant of 3D UNet called residual 3D-UNet [23] (see Materials and methods section) to predict cell boundaries in 3D, resulting in a semantic segmentation. The predicted boundaries are then used in a postprocessing step to estimate the final instance segmentation using graph partitioning. Examples of such graph partitioning methods include GASP [33] and Multicut [10].

The second DL pipeline [21] (referred to below as UNet+WS) comprises a 3D UNet, which can be trained using 3D confocal images (i.e., composed of voxels) to predict cell boundary, cell interior, and image background regions (as 3D images). These semantic outputs of the 3D UNet are then used to generate a seed image for watershed-based postprocessing. Seeds in watershed segmentation indicate the locations in the image from which the growth of connected regions starts in the watershed map. The seed images produced from the UNet outputs in this pipeline are therefore used to perform a 3D watershed and obtain the final segmentation output.

The third pipeline is adapted from Cellpose [32]. It uses a residual 2D-UNet architecture, which is trained using 2D images (composed of pixels). The trained 2D UNet predicts horizontal (X) and vertical (Y) vector gradients of pixel values, or flows, along with a pixel probability map (indicating whether pixels are inside or outside of cell regions) for each 2D image. By following the vector fields, the pixels corresponding to each cell region are clustered around the cell center. For 3D data, 2D gradients are estimated in this way in the XY, YZ, and ZX planes. These 6 gradients are then averaged together to obtain 3D vector gradients, which are used to estimate the cell regions in 3D.
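For reference, running this kind of 3D prediction through the Cellpose Python API looks roughly as follows (based on the Cellpose documentation at the time of writing; model names and the exact signature of model.eval may differ between versions):

```python
# Cellpose in 3D: with do_3D=True, 2D flows are predicted slice-wise in the
# XY, YZ, and ZX planes and averaged internally into a 3D gradient field.
import numpy as np
from cellpose import models

volume = np.random.rand(64, 256, 256)        # stand-in 3D (Z, Y, X) image
model = models.Cellpose(gpu=False, model_type='cyto')
masks, flows, styles, diams = model.eval(
    volume, channels=[0, 0], diameter=30, do_3D=True)
# masks: 3D label image with one integer label per segmented cell
```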

The fourth DL pipeline is adapted by the authors of this paper from the well-documented, open Mask R-CNN repository [24]; the 3D segmentation concept using this model is inspired by [34]. For the Mask R-CNN-based segmentation, a hybrid approach is adopted, as shown in Fig 2. The pipeline uses an MRCNN algorithm, which is trained using 2D image data to predict which pixels belong to cell areas and which do not in each Z slice of a 3D volume, leading to a semantic segmentation. Then, the Z slices containing the identified cell regions are stacked into a binary 3D seed image. The cell regions in this binary image are labeled using the connected component approach, where all voxels belonging to a cell are assigned a unique label. These labeled cell regions are used as seeds for watershed-based processing to obtain the final 3D instance segmentation, as sketched in code after Fig 2.

Fig 2. All the 3D segmentation pipelines displayed together. The green colored boxes indicate the training process for the respective pipeline. The blue boxes indicate the predicted, semantic segmentations generated by the trained DL algorithms, and the orange boxes indicate phases of postprocessing, leading to the final instance segmentation. The MARS pipeline does not include a training or postprocessing step, but parameter tuning is required. 3D, three-dimensional. https://doi.org/10.1371/journal.pcbi.1009879.g002
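A minimal sketch of the Mask R-CNN postprocessing described above, with the per-slice 2D masks replaced by random stand-ins, might look like this:

```python
# Stack per-slice 2D cell masks into a binary 3D seed image, label it by
# connected components, and use the labels as watershed seeds.
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

slices_2d = [np.random.rand(128, 128) > 0.5 for _ in range(64)]  # stand-in MRCNN masks
binary_3d = np.stack(slices_2d, axis=0)       # binary 3D seed image
seeds, n_cells = ndi.label(binary_3d)         # connected components: one label per cell
intensity = np.random.rand(*binary_3d.shape)  # stand-in for the raw 3D image
instances = watershed(intensity, seeds)       # final 3D instance segmentation
```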

A further aspect of investigation in this work is to observe how these DL pipelines compare to a classical non-DL pipeline in terms of segmentation accuracy. They were therefore compared with a watershed-based segmentation pipeline named MARS [35], which uses automatic seed detection and watershed segmentation. In the MARS pipeline, local intensity minima detected by an h-minima operator are used to initialize seeds in the image, which are then used for 3D watershed segmentation of cells. This pipeline therefore does not involve any model training component.
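The essence of this seeding strategy can be sketched with scikit-image (illustrative only; MARS itself is a separate implementation with its own tunable parameters):

```python
# h-minima seeded watershed, in the spirit of MARS: suppress shallow local
# minima (depth < h), label the remaining minima, and flood from them.
import numpy as np
from scipy import ndimage as ndi
from skimage.morphology import h_minima
from skimage.segmentation import watershed

img = np.random.rand(64, 128, 128)            # stand-in 3D membrane-stained image
minima = h_minima(img, h=0.2)                 # keep only minima deeper than h
seeds, n_seeds = ndi.label(minima)            # one seed per retained minimum
labels = watershed(img, seeds)                # 3D watershed from the seeds
```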

As these 5 segmentation pipelines have been developed and tested on different datasets and have been characterized using different evaluation metrics, it is difficult to directly compare their performance. For example, in the original papers, the Plantseg and UNet+WS pipelines were trained on and designed for images with membrane staining and therefore use a UNet-based boundary detection method. The Cellpose model was originally trained with diverse types of bio-images, such as those with cytoplasmic and nuclear stains and microscopy images with and without fluorescent membrane markers. The Mask R-CNN adopted in this work was originally trained using images of cell nuclei.

Therefore, the first step of the benchmarking protocol was to train the 4 DL pipelines on a common 3D image dataset. We next tested all 5 segmentation pipelines (DL and non-DL) on a common 3D test image dataset. This was followed by estimating and comparing their performance based on a common set of metrics. Through this protocol, we aimed to develop an efficient strategy for quantitative and in-depth comparison of any 3D segmentation pipeline that currently exists or is under development. Our results show clear differences in performance between the different pipelines and highlight the adaptability of the DL methods to unseen datasets.
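As a flavor of what such cell-level metrics involve (a hypothetical helper, not the exact metric defined later in the paper), one can match each ground-truth cell to its best-overlapping predicted cell by intersection over union (IoU); systematically low IoU values, or several predicted cells overlapping one ground-truth cell, point to over- or undersegmentation:

```python
# Match each ground-truth cell to its best-overlapping predicted cell by IoU.
import numpy as np

def best_iou_per_gt_cell(gt, pred):
    """Return {gt_label: (best_pred_label, best_iou)} for two 3D label images."""
    matches = {}
    for g in np.unique(gt):
        if g == 0:                                # skip background label 0
            continue
        mask = gt == g
        best, best_iou = 0, 0.0
        for p in np.unique(pred[mask]):           # predicted labels overlapping this cell
            if p == 0:
                continue
            inter = np.logical_and(mask, pred == p).sum()
            union = np.logical_or(mask, pred == p).sum()
            if inter / union > best_iou:
                best, best_iou = p, inter / union
        matches[g] = (best, best_iou)
    return matches
```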


[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009879
