(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Non-destructive classification of unlabeled cells: Combining an automated benchtop magnetic resonance scanner and artificial intelligence [1]

Authors: Philipp Fey, Daniel Ludwig Weber, Jannik Stebani, Philipp Mörchel, Peter Jakob
Affiliations: Fraunhofer Institute for Integrated Circuits IIS, Development Center X-Ray Technology, Würzburg; Experimental Physics V, Biophysics

Date: 2023-04

An agarose-Dotagraf cell phantom (ADCP), consisting of RPMI 1640 cell-culture medium with 0.25 mM Dotagraf for T2 adjustment and 0.6 wt% agarose for T1 adjustment, was used as an MR phantom with T1/T2 values similar to those of exemplary cell measurements (a). The ADCP was used to demonstrate the signal's short-term reproducibility by measuring twelve separate samples in a row and comparing their T1 and T2 spectra (b). In addition, long-term reproducibility was demonstrated by measuring multiple ADCP samples over a 14-day period (c). ADCP samples were also used to investigate how sample volume influences the generated data (d). Within the tested range of 10 μl to 50 μl, the signal showed little to no variation, indicating that it is independent of sample volume.

The ADCP was used to investigate the influence of sample volume on the T1/T2 signal. For this, samples containing 10 μl to 50 μl of ADCP in 5 μl increments were prepared. The experiment was repeated three times, with freshly prepared ADCP for each repetition. As in Fig 1C, the values of all repetitions were combined to reflect the general tendency at each sample point. The T1 and T2 peak positions were consistent across all recorded samples (Fig 1D), indicating that the acquired T1/T2 distribution was independent of sample volume. As in the previous section, statistical analysis confirmed the initial visual interpretation of the plotted data (T1 p-value = 0.9736; T2 p-value = 0.7419; S4B Fig).

Twelve individual ADCP samples were measured from one stock solution, and their T1 and T2 spectra were plotted stacked above each other (Fig 1B). The peaks fluctuated in width and height to a certain degree, but their positions stayed consistent throughout all repetitions, in both the T1 and the T2 dimension. To investigate the long-term reproducibility of the tabletop scanner, a series of samples was measured daily over a period of two weeks (Fig 1C). At each time point, several samples were measured, and the resulting signals were combined to reflect the overall trend at that time point. The data showed the same fluctuation around a constant position in T1 and T2, indicating that the measured values remained constant over time. Statistical analysis confirmed the initial visual interpretation of the data (T1 p-value = 0.0932; T2 p-value = 0.1807; S4A Fig).
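The reproducibility argument rests on peak positions staying fixed while peak heights and widths fluctuate. A minimal sketch of one way to quantify this, assuming each repeat is available as a 1D spectrum on a shared, increasing relaxation-time axis: take the position of the maximum for every repeat and report the coefficient of variation (CV) of those positions. This is an illustration only, not the statistical test behind the reported p-values; the function names and the synthetic data are hypothetical.

```python
import numpy as np


def peak_position(spectrum, axis_values):
    """Return the axis value (e.g. T2 in seconds) at which the spectrum is maximal."""
    return float(axis_values[int(np.argmax(spectrum))])


def position_spread(spectra, axis_values):
    """Mean peak position and its coefficient of variation across repeated samples."""
    positions = np.array([peak_position(s, axis_values) for s in spectra])
    mean = positions.mean()
    cv = positions.std(ddof=1) / mean
    return float(mean), float(cv)


# Synthetic example: 12 repeats whose peak height and width vary
# but whose position (50 ms) does not.
t2 = np.logspace(-3, 0, 200)  # hypothetical T2 axis from 1 ms to 1 s
rng = np.random.default_rng(0)
spectra = []
for _ in range(12):
    height = rng.uniform(0.8, 1.2)
    width = rng.uniform(0.15, 0.25)  # peak width in log10(seconds)
    log_offset = (np.log10(t2) - np.log10(0.05)) / width
    spectra.append(height * np.exp(-0.5 * log_offset ** 2))

mean_t2, cv_t2 = position_spread(spectra, t2)
print(f"mean peak position {mean_t2 * 1000:.1f} ms, CV {cv_t2:.2e}")
```

A CV near zero corresponds to the "constant position" behavior described above; with real data, the resolution of the relaxation-time grid limits how finely peak positions can be resolved.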

To obtain a defined and reproducible reference measurement, a phantom sample was established to test the MR setup. A custom MR phantom was created to match the data obtained from Chinese hamster ovary (CHO) cells; the corresponding CHO measurements are included in Fig 3. The phantom combined the gadoteric acid-based contrast agent Dotagraf and agarose, dissolved in Roswell Park Memorial Institute (RPMI) cell-culture medium. The concentrations of both components were adjusted so that the resulting T1/T2 distribution (Fig 1A) matched that of the measured CHO cells. With this defined agarose-Dotagraf cell phantom (ADCP), multiple measurements were performed to investigate the system's reproducibility.

Automation of the measurement process enabled the efficient generation of a comprehensive dataset for analysis. For this purpose, a low-cost robotic arm (a-1) was combined with a self-designed, 3D-printed sample holder (a-2; the STL files can be found in the attachment). The automation platform was controlled by a microcontroller (a-3) connected to a laptop running a self-designed Matlab user interface (a-4). To keep sample viability as high as possible, the samples were preheated to 37°C in a heating block (a-5). The measurement system consisted of a console (a-6), a temperature-controlled neodymium magnet operating at 37°C (a-7), and a low-noise amplifier (LNA) (a-8). Subsequent viability assessments of three biological replicates (n_b) showed a significant decrease only for HEK293T and Vero cells; compared with the positive control, none of the other cells showed a statistically significant decrease in viability (b). When the root cause of this decrease was investigated, the only culture factor that produced significant differences was keeping the cells at room temperature (RT) for 5 hours (c). Besides keeping the samples at 37°C, this leaves improving the supply of cell culture medium as the only factor for improving culture conditions in future applications and development. All samples showed highly significant differences compared to the negative control. All measurements were carried out with n_b = 3. *: P ≤ 0.05 / **: P ≤ 0.01 / ***: P ≤ 0.001 / ****: P ≤ 0.0001; n_b: biological replicates.
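This legend, like those of Figs 6 and 7, reports significance with a star notation. A small helper that reproduces the mapping is sketched below; the function name is hypothetical, and the "ns" label for P > 0.05 is an assumption, since the legends do not name the non-significant case.

```python
def significance_stars(p):
    """Map a p-value to the star notation used in the figure legends.

    *: P <= 0.05, **: P <= 0.01, ***: P <= 0.001, ****: P <= 0.0001.
    Values above 0.05 are labelled "ns" (an assumed label for "not significant").
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    for threshold, stars in ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p <= threshold:
            return stars
    return "ns"


# Usage with a p-value reported in the text (long-term reproducibility, T1):
print(significance_stars(0.0932))  # → ns
```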

The NucleoCounter provided not only information on cell viability but also an estimate of cell diameter. Although the diameters differed significantly (S4A Fig), these differences did not correlate with the observed changes in viability. CHO cells had the largest diameter, i.e., they yielded the largest cell sample for a given cell number and thus the least amount of excess medium, yet they had one of the highest observed viabilities. The opposite can be assumed for the Vero cells, which had the smallest diameter, i.e., the largest amount of excess medium, but showed the lowest viability of all cell lines examined. This indicated that cell diameter did not have a significant effect on cell viability after processing.

In preliminary studies, it was determined that samples should contain 5 × 10^6 cells, as this number yielded reliable results and could be obtained from most cell cultures (S3 and S4C Figs). Therefore, every cell sample was composed of this number of cells. To verify their viability, cells were processed in the automation platform and then measured with the NucleoCounter NC-200 as an unbiased, automated counting method. Three biological replicates (n_b) were measured for each cell line. Statistical analysis showed that only two of the ten measured cell lines (HEK293T and Vero) exhibited a significant decrease in viability (Fig 2B). None of the other cell lines differed significantly from the positive control, which was measured immediately after the cells were harvested from the culture flask. An examination of the possible influences of culture conditions on sample viability revealed a decrease in viability when samples were stored at RT for 5 hours (the maximum time a sample would spend in the measurement cycle). None of the other factors studied resulted in a significant change in viability (Fig 2C). This was most evident when comparing the results with a positive control in which the cells were prepared in the same manner but kept in an excess amount of culture medium and stored in the incubator for the duration of the assay. These observations suggest that the undersupply of culture medium may be responsible for the reduced viability.

Once it had been demonstrated that the measurement method provided reproducible data, it was of central importance for the subsequent analysis to provide the AI algorithms with a sufficiently large pool of measurement data. The only way to achieve this in a timely manner was to automate the measurement process. Therefore, a low-cost robotic arm was combined with self-designed 3D-printed parts to build an autonomously operating platform (Fig 2A; the STL files for 3D printing can be found in the attachment).

When contour plots of the measured 2D spectra were generated, two major signal clusters became apparent: the cell peak (yellow arrowhead) and the medium peak (green arrowhead). For the cell lines, these clusters were indistinguishable at first glance (a), whereas the MSC spectra showed only one cell peak for undifferentiated cells (b-1, yellow arrowhead; n_b = 8) and two cell peaks for differentiated cells (b-2, yellow and green arrowheads; n_b = 7). The total dataset included n_b = 362 (cell lines: n_b = 354; MSCs: n_b = 8) with a total of 369 samples, as MSCs were measured twice, once in the undifferentiated stage and once in the differentiated stage (c). For experimental reasons, the dataset was not balanced, meaning that not all cells were measured equally often. To reduce sample complexity, the weighted centroid of each cell peak was calculated and plotted as a single T1/T2 coordinate (d1, cell lines; d2, MSCs). Here, the clustering of the different cell lines and MSC differentiation stages became most apparent, providing first evidence of how the different cells could be distinguished based on their spectral data. n_b: biological replicates.

To reduce the complexity of the signals, each spectrum was reduced to a single T1/T2 coordinate by calculating the weighted centroid of its cell peak. When these centroids were plotted for the measured cell lines (Fig 3D1) and the MSCs (Fig 3D2), the regions of the individual clusters became more distinguishable (S6 Fig). Because the clusters occupied distinct T1/T2 regions, this provided initial evidence that the weighted centroid of the cell peak could carry suitable information for non-invasive classification.
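The weighted centroid reduces a 2D spectrum to one (T1, T2) coordinate. A minimal sketch, assuming the spectrum has already been masked to the cell peak and is stored as a matrix whose rows index T1 and whose columns index T2. Whether the original analysis averaged in linear or logarithmic relaxation time is not stated, so both variants are offered; the function name is hypothetical.

```python
import numpy as np


def weighted_centroid(spectrum, t1, t2, log_space=True):
    """Intensity-weighted centroid (T1, T2) of a 2D T1-T2 spectrum.

    spectrum: array of shape (len(t1), len(t2)), already masked to the cell peak.
    log_space: average log10(T) instead of T (an assumption; the scale used
    in the original analysis is not stated).
    """
    w = np.asarray(spectrum, dtype=float)
    total = w.sum()
    if total <= 0:
        raise ValueError("spectrum contains no positive intensity")
    x1 = np.asarray(t1, dtype=float)
    x2 = np.asarray(t2, dtype=float)
    if log_space:
        x1, x2 = np.log10(x1), np.log10(x2)
    c1 = (w.sum(axis=1) * x1).sum() / total  # marginal over T2, weighted along T1
    c2 = (w.sum(axis=0) * x2).sum() / total  # marginal over T1, weighted along T2
    if log_space:
        return float(10 ** c1), float(10 ** c2)
    return float(c1), float(c2)
```

Averaging in log space tends to match how such spectra are usually displayed (double-logarithmic axes), whereas linear averaging weights long relaxation times more heavily.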

Using the automation platform described previously, a dataset of 362 independent measurements was created. For the cell lines (n = 354), each measurement represented one biological replicate. For the MSCs, eight biological replicates were measured; one was contaminated during differentiation, resulting in 15 independent measurements of MSC-associated data (Fig 3C). When all spectra of the measured cell lines were plotted, the signal clustered in two distinct locations. One was referred to as the medium peak (Fig 3A; green arrowhead) and the other as the cell peak (Fig 3A; yellow arrowhead), as the latter only occurred when cells were added to the samples (S2 Fig). While the cell peaks of the measured cell lines differed markedly in shape and location (Figs 3A and S7), the spectra obtained for undifferentiated (Fig 3B1) and differentiated (Fig 3B2) MSCs were more narrowly localized. Another difference was that the cell lines and undifferentiated MSCs showed only one cell peak (yellow arrowhead), whereas the differentiated MSCs showed two (red arrowhead). This might be due to lipid buildup during the differentiation process (S8 Fig). As described in the introduction, lipids usually show substantially different MR behavior compared to aqueous solutions such as cells and cell culture media.
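The bookkeeping behind the two totals (362 biological replicates, 369 measured samples) follows from the counts stated in this section, with 8 undifferentiated and 7 differentiated MSC measurements:

```latex
\begin{aligned}
n_{b} &= n_{b,\mathrm{CL}} + n_{b,\mathrm{MSC}} = 354 + 8 = 362\\
n_{\mathrm{MSC\ samples}} &= \underbrace{8}_{\text{undifferentiated}} + \underbrace{8 - 1}_{\text{differentiated}} = 15\\
n_{\mathrm{samples}} &= 354 + 15 = 369
\end{aligned}
```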

Since this study aimed to demonstrate that the presented method can non-invasively discriminate between different cell types, most measurements were performed on commonly used cell lines (S8 Fig). Although these represent a biologically stable and reproducible model system, they are of little relevance in a clinical context. Therefore, patient-derived mesenchymal stromal cells (MSCs) were also differentiated into adipocytes (S8 Fig), and both stages, undifferentiated and differentiated cells, were measured.

When an SVM was trained on the calculated weighted centroids of different combinations of cell lines, the accuracy decreased as the number of cell lines included in the data increased (a). With the exception of 5 vs. 6, 7 vs. MSC, and 8 vs. MSC, all pairwise comparisons yielded significant differences. The decision boundaries showed that the SVM had difficulties classifying the cell lines correctly (b-d; S9 and S10 Figs). For each iteration, the data were split into training and test data, and the final accuracy was calculated on the test data. Because the MSC classification depended on two data points for each differentiated sample rather than one, the SVM yielded widely varying results in these cases, ranging from 0% to 100%. Since the SVM calculation was computationally inexpensive, 300 technical replicates (n_t) were calculated for each combination of cell lines.

As stated above, the weighted centroid showed promise for classifying the cells. Therefore, the calculated weighted centroids were used as input for a support vector machine (SVM), which classifies data by separating the classes with a (here non-linear) decision boundary. For every run, the data were randomly split 50:50 into training and test data, which ensured that the SVM was evaluated only on previously unseen data points. To account for the SVM's performance with different cell lines, the overall dataset was split into subsets containing only some of the cell lines (S1 Table). To investigate how performance depended on the composition of the dataset, each combination was tested individually, and the resulting accuracies of 300 technical replicates were compared (Fig 4A). The accuracy decreased as the number of included cell lines increased, which was also visible in the decision boundaries of the SVM (Figs 4B–4D, S9 and S10). While the results were very promising for cell line classification (average accuracy of 300 technical replicates: 2 CL = 98.89%; 3 CL = 89.3%; 4 CL = 84.04%; 5 CL = 72.16%; 6 CL = 70.34%; 7 CL = 64.25%; 8 CL = 62.03%; 9 CL = 55.84%; 10 CL = 50.82%; CL = cell lines), the performance for MSC classification (62.67%) was highly variable and in most cases significantly different from that for the cell lines. Only for the comparisons of 7 and 8 cell lines with MSC could no significant differences be detected. The decision boundaries for the MSC classification (Fig 4E) showed the algorithm's struggle to classify MSCs, because the decision was based not on one data point but on two data points simultaneously. The SVM was not designed for this task, which limits its applicability to situations in which the MR data can be reduced to a single centroid. Due to this high variability in results, the SVM did not provide a suitable machine learning approach for classifying the development of MSCs.
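The evaluation protocol described here (a fresh random 50:50 split per run, accuracy on the held-out half, repeated 300 times) can be sketched independently of the classifier. In the sketch below, a nearest-centroid classifier stands in for the SVM so that the block runs with NumPy alone; it is not an SVM. Any object with scikit-learn-style `fit` and `predict` methods could be passed instead, for example `make_model=lambda: SVC(kernel="rbf")` with `from sklearn.svm import SVC`, although the kernel and hyperparameters used by the authors are not stated here. A purely random split can leave a rare class out of the training half (which an SVC rejects); a stratified split would avoid this, but the text only specifies a random split. All names are hypothetical.

```python
import numpy as np


class NearestCentroid:
    """Stand-in classifier (NOT an SVM): assigns each point to the closest class mean."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every point to every class mean.
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[np.argmin(d, axis=1)]


def repeated_split_accuracy(X, y, make_model, n_repeats=300, test_fraction=0.5, seed=0):
    """Mean and standard deviation of test accuracy over repeated random splits.

    X: (n_samples, n_features) array, e.g. one (T1, T2) centroid per sample.
    y: (n_samples,) class labels. make_model: callable returning a fresh model.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(n * test_fraction))
    accuracies = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)  # new random 50:50 assignment per technical replicate
        test, train = idx[:n_test], idx[n_test:]
        model = make_model().fit(X[train], y[train])
        accuracies.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

Passing the class itself (`NearestCentroid`) as `make_model` works because calling it returns a fresh, unfitted instance, so no state leaks between technical replicates.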

Augmentation of measured data to obtain enough data points to train an artificial neural network (ANN)

While the SVM provided suitable results for classifying cell lines, it did not provide the expected results for classifying differentiated vs. undifferentiated MSCs. Another consideration was that reducing each spectrum to a single T1 and T2 coordinate discarded all information on the orientation and shape of the original signal peak. To incorporate this information into the evaluation, artificial neural networks (ANNs) were chosen as the classification approach. Instead of a single coordinate, these take the entire matrix of 300 × 300 data points as input.

The drawback of an ANN is that it depends on a large amount of training data. Therefore, a self-designed augmentation algorithm was used to increase the amount of available training and test data. The algorithm selected the individual cell peaks and shifted them in T1 and T2 by a random value drawn with a definable standard deviation. It could also stretch or shrink the selected peak by a random percentage of the original peak width (again with a definable standard deviation) while keeping the overall signal intensity constant, thereby generating differently shaped signal peaks. To estimate optimal augmentation values, the cell line and MSC data were each augmented 40 times using different augmentation values. The weighted centroids of each augmented dataset were calculated and plotted to visualize the effect of augmentation on the data (Fig 5). As the augmentation values (shift and stretch) increased, the individual signal groups started to merge into each other, resulting in areas of overlapping signal. This was most prominent for shift values ≥ 5 ms (standard deviation). Because there were fewer data points, the merging effect was less prominent for the MSCs than for the cell line data. The observed merging of the centroids indicated a first possible limit for the augmentation parameters used in ANN training.
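The shift-and-stretch augmentation described above can be sketched as follows. This is not the authors' algorithm, whose implementation is not given. The sketch assumes the cell peak has already been isolated into its own matrix (rows = T1, columns = T2, both axes increasing and in seconds), applies a random shift and a random width scaling about the peak's intensity-weighted center along each axis by linear interpolation, and then rescales the result so that the total intensity matches the original. Shift standard deviations are in seconds (0.001 = 1 ms) and stretch standard deviations are relative (0.01 = 1%). All names are hypothetical.

```python
import numpy as np


def _resample(spec, axis_vals, center, shift, scale, axis):
    """Shift by `shift` and scale the width by `scale` about `center` along one axis.

    Uses inverse mapping: the output at position x takes the (linearly
    interpolated) input value at center + (x - center - shift) / scale.
    """
    src = center + (axis_vals - center - shift) / scale
    return np.apply_along_axis(
        lambda v: np.interp(src, axis_vals, v, left=0.0, right=0.0), axis, spec
    )


def augment_peak(peak, t1, t2, rng, shift_sd=0.001, stretch_sd=0.01):
    """Randomly shift and stretch an isolated cell peak at constant total intensity.

    peak: array of shape (len(t1), len(t2)) holding only the cell peak.
    t1, t2: increasing relaxation-time axes in seconds.
    shift_sd: standard deviation of the shift in seconds (0.001 = 1 ms).
    stretch_sd: standard deviation of the relative width change (0.01 = 1 %).
    """
    total = peak.sum()
    if total <= 0:
        return peak.copy()
    c1 = (peak.sum(axis=1) * t1).sum() / total  # intensity-weighted center along T1
    c2 = (peak.sum(axis=0) * t2).sum() / total  # intensity-weighted center along T2
    d1, d2 = rng.normal(0.0, shift_sd, size=2)
    s1, s2 = np.clip(1.0 + rng.normal(0.0, stretch_sd, size=2), 0.1, None)
    out = _resample(peak, t1, c1, d1, s1, axis=0)
    out = _resample(out, t2, c2, d2, s2, axis=1)
    out_total = out.sum()
    # Rescale so the augmented peak carries the same total intensity as the original.
    return out * (total / out_total) if out_total > 0 else out


def augment_dataset(peaks, factor, t1, t2, seed=0, **kwargs):
    """Return the original peaks followed by `factor` augmented copies of each."""
    rng = np.random.default_rng(seed)
    augmented = [
        augment_peak(p, t1, t2, rng, **kwargs) for p in peaks for _ in range(factor)
    ]
    return list(peaks) + augmented
```

Under one plausible reading of an augmentation factor of five, `augment_dataset(peaks, 5, t1, t2, shift_sd=0.005, stretch_sd=0.01)` would mirror the 1% stretch / 5 ms shift setting chosen later for the MSC data. Increasing `shift_sd` moves the centroids further apart from their originals, which is the kind of cluster merging visible in Fig 5.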

The overall effect of the augmentation algorithm also became visible when the individual spectra were projected onto separate T1 and T2 axes (S11 and S12 Figs). Here, the shift of the augmented values relative to the original values could be shown more comprehensively, and the stretching was visible as differing peak widths and heights.

Fig 5. Visualization of the effect of augmentation on the data. Different combinations of increasing stretch and shift values were tested to optimize the augmentation parameters. To visualize the results, the weighted centroids of the augmented data (augmentation factor of 40) were calculated and plotted as scatter plots, separately for the augmented cell line data and the MSC data. Due to the double-logarithmic scaling of the x- and y-axes, the effect was more visible at T2 values. With increasing values, the originally localized signal clusters merged into overlapping regions. This merging may indicate an initial limit for the augmentation values, as strong augmentation leads to indistinguishable signal clusters. Due to the small number of MSC data points, the merging of augmented values was less pronounced than for the cell lines and did not result in indistinguishable data. https://doi.org/10.1371/journal.pcbi.1010842.g005

The augmented data, together with the original data, were then used to train neural networks. The network structure was based on a modified version of the VGG network [37]. For MSC classification, a network consisting of seven convolutional layers and four dense layers was chosen. Normalization and max-pooling layers were inserted between the convolutional layers. ReLU was chosen as the activation function and Adamax as the optimizer (S14 Fig). For training and testing, the spectra were cropped to the region containing the cell peak (T1: 0–3.0079 s / T2: 0–0.4062 s).
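Cropping to the cell-peak region amounts to selecting the rows and columns whose relaxation times fall inside the stated bounds (T1 up to 3.0079 s, T2 up to 0.4062 s). A minimal NumPy sketch, assuming the spectrum matrix has rows indexed by T1 and columns indexed by T2 and that both axes are available in seconds; the function name and the log-spaced example axes are hypothetical, since the actual axis limits of the 300 × 300 grid are not given here.

```python
import numpy as np

T1_MAX_S = 3.0079  # upper T1 bound of the cell-peak region stated in the text
T2_MAX_S = 0.4062  # upper T2 bound of the cell-peak region stated in the text


def crop_to_cell_peak(spectrum, t1, t2, t1_max=T1_MAX_S, t2_max=T2_MAX_S):
    """Keep rows with T1 <= t1_max and columns with T2 <= t2_max.

    Returns the cropped spectrum together with the matching T1 and T2 axes.
    """
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    rows = t1 <= t1_max
    cols = t2 <= t2_max
    return np.asarray(spectrum)[np.ix_(rows, cols)], t1[rows], t2[cols]


# Hypothetical 300 x 300 grid with log-spaced axes.
t1_axis = np.logspace(-3, 1, 300)
t2_axis = np.logspace(-4, 0.5, 300)
cropped, t1_c, t2_c = crop_to_cell_peak(np.ones((300, 300)), t1_axis, t2_axis)
print(cropped.shape)
```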

When the influence of the augmentation factor on ANN performance was examined, statistics showed no significant differences between the assessed augmentation factors (Fig 6A). The only significant difference was between non-augmented and augmented data, with the latter yielding significantly higher accuracies. This confirmed the original assumption that the initial dataset was too small and that augmentation was crucial for training success. Since the amount of data directly determined the time the ANN needed to train, the augmentation factor was set to five for further MSC-based ANN analyses.

Further evaluation of the augmentation parameters showed no significant differences between the different stretch and shift values (Fig 6B). Based on the highest mean value, 1% stretch and 5 ms shift were selected as the final augmentation parameters for MSC-based ANN training (1% / 1 ms: 83.33%; 5% / 1 ms: 73.89%; 1% / 5 ms: 85%; 5% / 5 ms: 76.67%; 10% / 5 ms: 81.11%; 5% / 10 ms: 74.44%; 10% / 10 ms: 81.11%). Testing higher augmentation values showed a significant decrease in accuracy above a certain threshold. At 25% stretch and 25 ms shift, the results were not significantly different from those of the previously tested settings, even though the average accuracy dropped to 57%. At 25% stretch and 50 ms shift, accuracy fell below 50%. Statistical analysis showed that this was significantly lower than all previously tested settings except 25% stretch and 50 ms shift (S13A Fig). Data augmented with 1% stretch and 5 ms shift but left uncropped yielded significantly lower accuracies than the corresponding cropped data (S13 Fig). This demonstrated the importance of cropping the MSC data for ANN training, and showed that the medium peak did not carry useful information for MSC classification but, on the contrary, degraded performance.

Based on these initial assessments, the performance of the trained ANN was examined in more detail. For this purpose, ten independent runs were performed; their cumulative training loss had already converged well after four epochs (Fig 6C). Similarly, the test accuracy reached its plateau after only two to four epochs (Fig 6D). The confusion matrix of the ten test runs (Fig 6E) also showed that most samples were correctly classified.
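Statements such as "the plateau was reached after two to four epochs" depend on how a plateau is defined, which the text does not specify. One simple, hypothetical criterion is sketched below: the first epoch from which every subsequent value stays within a tolerance of the mean of the last few epochs. The tolerance, window, and example accuracies are arbitrary choices for illustration.

```python
def plateau_epoch(values, tol=0.02, window=3):
    """First epoch (1-based) from which all later values stay within `tol`
    of the mean of the final `window` values.

    The criterion, `tol` and `window` are illustrative assumptions.
    """
    if not values:
        raise ValueError("values must not be empty")
    window = min(window, len(values))
    final = sum(values[-window:]) / window
    for i in range(len(values)):
        if all(abs(v - final) <= tol for v in values[i:]):
            return i + 1
    return len(values)


# Hypothetical per-epoch test accuracies of one run:
accuracies = [0.50, 0.80, 0.84, 0.85, 0.86, 0.85, 0.85]
print(plateau_epoch(accuracies))  # → 3
```

Applying the same function to each of the ten runs and reporting the range of results would give a reproducible version of statements like "after two to four epochs".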

The only drawback of ANN training on the MSC data was the relatively high standard deviation of the test accuracy (Fig 6D; shaded area), which ranged between 70% and 100%, indicating some instability between runs. This could be due to the limited number of samples initially measured. With an average test accuracy of 85%, the ANN still provided significantly better results than MSC classification using the SVM (S18 Fig), which had an average accuracy of 62.67%.
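Averaging a confusion matrix over ten runs, as in Fig 6E, while also keeping the per-run accuracies exposes the run-to-run spread discussed above. A minimal sketch, assuming integer class labels and row normalization (the figure may instead show averaged counts); the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def averaged_confusion(runs, n_classes):
    """Average row-normalized confusion matrices over independent runs.

    runs: iterable of (y_true, y_pred) pairs, one per train/test split.
    Returns the mean matrix and the accuracy of each individual run.
    """
    mats, accs = [], []
    for y_true, y_pred in runs:
        cm = confusion_matrix(y_true, y_pred, n_classes)
        row_sums = cm.sum(axis=1, keepdims=True)
        # Row-normalize so each row shows how one true class was predicted.
        mats.append(np.divide(cm, row_sums, out=np.zeros(cm.shape), where=row_sums > 0))
        accs.append(np.trace(cm) / cm.sum())
    return np.mean(mats, axis=0), np.array(accs)


# Two hypothetical runs on a binary task (0 = undifferentiated, 1 = differentiated):
runs = [([0, 0, 1, 1], [0, 1, 1, 1]), ([0, 0, 1, 1], [0, 0, 1, 1])]
mean_cm, run_acc = averaged_confusion(runs, n_classes=2)
```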

Fig 6. Optimization of augmentation parameters and investigation of ANN performance for training on MSC data. Examining the effect of the augmentation factor on ANN performance showed that every augmentation factor resulted in significantly higher accuracy than training on non-augmented data (a; average accuracy = 40%). No significant differences were found between the different augmentation factors (5: 86.67%; 10: 79.7%; 20: 78.1%; 40: 76.18%). Therefore, an augmentation factor of five was used for subsequent analyses, since the amount of data directly determined the time needed for training. A similar evaluation was performed for the augmentation parameters stretch and shift (b). None of the combinations examined differed significantly. Based on the mean accuracy, 1% stretch and 5 ms shift were chosen as the final augmentation parameters for MSC classification. With these settings, an ANN was trained on the corresponding dataset. The cumulative loss values (c) converged after only three to four epochs, while the validation accuracy (d) reached its plateau of 85% at the same time. While the standard deviation (shaded area) of the loss was very small, the validation accuracy ranged from 70% to 100%, indicating that the evaluation result depended on the train-test assignment. The confusion matrix (e; average values of ten replicates) also showed that most samples were correctly classified. All measurements were performed ten times independently and the resulting values were averaged. All depicted values are based on n_t = 10. *: P ≤ 0.05 / **: P ≤ 0.01 / ***: P ≤ 0.001 / ****: P ≤ 0.0001. https://doi.org/10.1371/journal.pcbi.1010842.g006

Based on the MSC ANN, a similar network architecture, optimized to achieve the highest possible accuracy, was used to classify the data from the cell line samples. The network consisted of four convolutional layers separated by normalization and max-pooling layers, followed by four dense layers. ReLU was used as the activation function and Adamax as the optimizer. Additionally, dropout layers with a dropout rate of 25% were used to reduce overfitting. As with the MSC classification, the architecture ended with a softmax function (S15 Fig).

The augmentation pipeline was optimized by iterating over different parameters. Optimizing the augmentation factor (Fig 7A) showed a behavior similar to that previously observed for MSC augmentation: each tested augmentation factor (5, 10, 20, 40) performed significantly better than no augmentation, with no detectable significant differences among them. As with the MSCs, the augmentation factor was therefore set to five, as this improved performance over non-augmented data while keeping the training time short. This observation was consistent across multiple cell combinations with increasing numbers of cell lines (S17B Fig).

The shift and stretch parameters tested for MSC augmentation were also tested on the cell line data. Unlike for the MSCs, the resulting accuracies for the cell lines differed significantly (Fig 7B): data shifted by 1 ms yielded significantly higher accuracy than data shifted by 5 ms or 10 ms. The analysis also showed that stretching the signal peaks did not significantly affect the outcome of ANN training, meaning that the key factor was the position of the signal peak. Since the stretch values did not differ significantly, 1% stretch and 1 ms shift were chosen as the final augmentation parameters for optimal ANN performance.

For the optimal cell composition analysis, all possible combinations of the provided cells were tested, and the average accuracy of ten independent runs with independent train-test split assignments was evaluated. The best combinations (CHO, K562, HEK293T, HeLa, THP1, A549, Vero, MDA231, C2C12, in that order) were compared (Fig 7C). Similar to the previous observations, the average test accuracy decreased as the number of included cell lines increased. In 70% of the cases, adding one more cell line resulted in no significant change, while adding two more cell lines resulted in a significant decrease in 80% of the cases examined (see table below Fig 7C).
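Testing "all possible combinations" grows quickly with the number of cell lines. A sketch, assuming that every subset of at least two cell lines counts as one combination: ten cell lines then give 2^10 - 1 - 10 = 1,013 subsets, each of which would be trained in ten independent runs. The helper names and the score dictionary are hypothetical; the scores would come from a repeated train-test evaluation such as the one described for the SVM.

```python
from itertools import combinations
from math import comb


def candidate_subsets(cell_lines, min_size=2):
    """Yield every combination of at least `min_size` cell lines."""
    for k in range(min_size, len(cell_lines) + 1):
        yield from combinations(cell_lines, k)


def n_subsets(n, min_size=2):
    """Number of combinations yielded by candidate_subsets for n cell lines."""
    return sum(comb(n, k) for k in range(min_size, n + 1))


def best_subset_per_size(scores):
    """scores: dict mapping a tuple of cell lines to its mean test accuracy.

    Returns {subset size: best-scoring subset of that size}.
    """
    best = {}
    for subset, acc in scores.items():
        k = len(subset)
        if k not in best or acc > scores[best[k]]:
            best[k] = subset
    return best


print(n_subsets(10))  # → 1013
```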

For training with two cell lines (CHO and K562) and with more cell lines (S16 Fig), the validation accuracy (Fig 7E) plateaued after four to six epochs and the training loss (Fig 7D) converged well after four to ten epochs, indicating that the ANN was successfully trained. Unlike in MSC training, the standard deviation was much smaller, which can likely be explained by the larger amount of training data leading to more stable training.

Because the training length, given as the number of training epochs, also plays an important role in the resulting accuracy of an ANN, two training lengths (12 and 25 epochs) were compared. They showed no significant differences (S17A Fig). Since the shorter training roughly halves the training time, twelve epochs were chosen as the final value.

As in the MSC ANN analysis, the influence of cropping the data to the cell peak (T1: 0–3.0079 s / T2: 0–0.4062 s) on the training outcome was investigated, using the previously used cell combinations to assess the network's accuracy. The cropped results did not differ significantly from those obtained with the previously used, uncropped data (S18A Fig).

Fig 7. Augmentation and ANN performance for cell line classification. Different augmentation parameters were tested to evaluate their individual performance (a). As already described for MSC classification (Fig 6), non-augmented data (augmentation factor = 0) yielded significantly lower accuracies than all augmentation factors, and there were no significant differences between the compared augmentation factors (5, 10, 20, 40). As the amount of data directly determined the time needed to train the ANN, an augmentation factor of 5 was used for subsequent ANN trainings. The comparison of different augmentation parameters revealed significantly lower accuracies when the data were shifted by more than 1 ms (b). According to the observed data, stretching the selected signal peak did not significantly affect the outcome of ANN training. Therefore, 1% stretch and 1 ms shift were chosen as the final augmentation parameters. When these were used to train an appropriate ANN, accuracy decreased as the number of cell lines in the dataset increased (c). Performance was evaluated based on the cell line combinations used for training. When two cell lines (CHO, K562) were included in the training, the cumulative loss converged after five to six epochs (d), while the test accuracy (e) reached its plateau of 98% after two to three epochs of training. All depicted values are based on n_t = 10; *: P ≤ 0.05 / **: P ≤ 0.01 / ***: P ≤ 0.001 / ****: P ≤ 0.0001. https://doi.org/10.1371/journal.pcbi.1010842.g007

Finally, the results of the SVM training for the different cell combinations were compared with the results of ANN training (S18B Fig). The only comparison that showed significant differences was for MSC classification. Here, the ANN resulted in better classification accuracy. Otherwise, none of the cell combinations examined provided conclusive evidence that either the SVM or the ANN resulted in higher classification accuracy.

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010842

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/