


Can subjective pain be inferred from objective physiological data? Evidence from patients with sickle cell disease
Mark J. Panaggio (Johns Hopkins University Applied Physics Laboratory, Laurel, Maryland, United States of America); Daniel M. Abrams (Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston)

In our case, we found that most of the physiological measures were weakly correlated (see Table 6 and S1 Fig), lending support to the independence assumption. Unfortunately, we found that they were not normally distributed. However, we circumvented this issue by using a non-parametric standardization method that transforms each feature's empirical distribution to a normal distribution.
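The exact transformation is not reproduced in this extract; a common non-parametric standardization of this kind is the rank-based inverse normal (quantile) transform, sketched below as an illustration rather than the authors' exact method.

```python
import numpy as np
from scipy import stats

def rank_based_normalize(values):
    """Map a 1-D sample onto a standard normal distribution via its ranks.

    A common non-parametric standardization: each observation is replaced by
    the normal quantile of its (shifted) rank, so the transformed sample is
    approximately N(0, 1) regardless of the original distribution.
    """
    values = np.asarray(values, dtype=float)
    ranks = stats.rankdata(values)            # average ranks, 1..n
    quantiles = (ranks - 0.5) / len(values)   # push ranks into the open interval (0, 1)
    return stats.norm.ppf(quantiles)          # inverse normal CDF

# Example: a heavily skewed feature becomes approximately standard normal.
pulse = np.random.default_rng(0).exponential(scale=80.0, size=500)
standardized = rank_based_normalize(pulse)
print(round(float(standardized.mean()), 3), round(float(standardized.std()), 3))
```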

Probabilistic models typically require assumptions about the probability distribution of the features. For example, many models rely on the Gaussian naive Bayes assumption: that the features are independent conditioned on the class and are distributed according to a normal distribution. Even when these assumptions are not strictly satisfied, the resulting models often provide useful predictions [ 22 ].

After removing outliers, namely records with Diastol < 10 (severe hypotension), Systol > 180 (severe hypertension), or SpO2 < 80 (severe hypoxia), as well as values more than 5 standard deviations from the mean, we divided the data into 10 subsets, each with an approximately equal number of records (because there were 105 records total, each subset contained either 10 or 11). Models were trained and tested using 10-fold cross-validation. In other words, the models were trained (i.e., parameters were estimated) using 9 subsets, and then tested (i.e., performance metrics were computed) on the remaining subset. This training and testing process was repeated 10 times, once for each choice of held-out subset, and the evaluation metrics on the testing folds were then averaged to provide an estimate of the generalization performance of the algorithm [21].
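A sketch of this preprocessing and evaluation setup is shown below. The column names and the way the removal criteria are combined are assumptions made for illustration, and the training and scoring functions are left abstract; this is not the study's code.

```python
import numpy as np
from sklearn.model_selection import KFold

VITALS = ["SpO2", "Systol", "Diastol", "Pulse", "Resp", "Temp"]  # column names assumed

def remove_outliers(df):
    """Drop physiologically implausible records and values more than 5 SD from the mean."""
    implausible = (df["Diastol"] < 10) | (df["Systol"] > 180) | (df["SpO2"] < 80)
    z = (df[VITALS] - df[VITALS].mean()) / df[VITALS].std()
    extreme = (z.abs() > 5).any(axis=1)
    return df[~(implausible | extreme)]

def cross_validate(visits, train_fn, score_fn, n_splits=10, seed=0):
    """10-fold cross-validation over hospital-visit records.

    `visits` is a list with one entry per hospital visit; `train_fn` fits a
    model on a list of visits and `score_fn` evaluates it on held-out visits.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(visits):
        model = train_fn([visits[i] for i in train_idx])
        scores.append(score_fn(model, [visits[i] for i in test_idx]))
    return float(np.mean(scores))  # average metric across the 10 test folds
```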

Although aggregation into two-hour windows decreases the fraction of missing features in the data, it does not eliminate them entirely. We considered using imputation methods like fully conditional specification (FCS) to fill in the remaining missing observations. However, these approaches rely on independent observations and therefore ignore sequential information during imputation. This means that trends in the physiological measures may not be captured during the imputation process. The imputation process is also computationally intensive and may not be suitable for real-time pain estimation. We therefore opted not to impute missing features and instead pursued probabilistic models that would allow us to perform inference conditioned on the available data.

There are various ways to address the challenges of missing features and irregular sampling. We addressed the irregular sampling problem by aggregating the data into two-hour windows. When multiple observations were available during a given window, the observed features were averaged. This aggregation process decreases the total number of data points, but no data is lost and it results in uniformly spaced observations. Averaging has the added benefit of smoothing the fluctuations in the physiological measures and pain, and is justified as long as the changes in those features occur over sufficiently long time-scales. The percent of data present (non-missing) for each feature after this procedure is displayed in Table 5 .
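Assuming each visit's observations are stored in a pandas DataFrame indexed by timestamp (an illustrative schema, not the study's), the aggregation step amounts to a resampling operation:

```python
import pandas as pd

def aggregate_two_hour_windows(visit_df):
    """Average irregularly sampled observations into uniform two-hour windows.

    `visit_df` holds the observations for one hospital visit, indexed by the
    recording timestamp, with vital-sign and pain-score columns. Windows with
    no observations for a feature remain NaN, i.e., missing.
    """
    return visit_df.sort_index().resample("2h").mean(numeric_only=True)

# Usage (illustrative): one evenly spaced series per hospital visit.
# windows = {visit_id: aggregate_two_hour_windows(df) for visit_id, df in visits.items()}
```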

Time series models have been studied extensively [20]. Unfortunately, most time series models are designed for uniformly sampled observations that are completely observed, and they are therefore not directly applicable to these datasets, which contain irregularly sampled pain scores and physiological measures as well as many missing features.

We extracted records containing pain scores (PainScore) on an 11-point visual analogue scale along with six physiological measures: (i) peripheral capillary oxygen saturation (SpO2), (ii) systolic blood pressure (Systol), (iii) diastolic blood pressure (Diastol), (iv) heart rate (Pulse), (v) respiratory rate (Resp), and (vi) temperature (Temp). Each observation includes a unique patient identifier along with a timestamp indicating when the data was recorded. As such, the data for each patient can be viewed as a time series.

Our results come from a dataset containing electronic medical records from both pediatric and adult patients with sickle cell disease who were admitted for vaso-occlusive crises at Duke Medical Center between January 1, 2014 and January 31, 2017. The dataset was limited to patients who were inpatient for at least 48 hours and required intravenous narcotics for treatment, such as via a patient-controlled analgesia system. This included data from 46 unique patients over a total of 105 hospital visits. The demographics of this cohort are typical of patients admitted with SCD pain in the US; however, the conclusions we draw may not be applicable to outpatients or to patients with SCD outside the US. We note that reference [18] also studied a subset of this data, using a different approach based on multiple imputation [19]. In this paper, in contrast, we use semi-supervised learning to develop models that can be used with incomplete data. As a result, our models could be used for real-time pain inference without a need for imputation of missing values.

One way to take the ordinality of pain into account is to interpret this as a regression problem rather than a classification problem. One could attempt to predict the pain, the deviation from the median pain, or the change in pain directly instead of assigning those continuous quantities to discrete classes. Standard methods such as linear regression or linear Gaussian state space models (the continuous analog of the HMMs discussed above) could be used to uncover the dependencies between the physiological measures and these numerical pain values. Unfortunately, these methods either operate under the assumption that these dependencies are linear or they require assumptions about the form of any nonlinearities. In preliminary testing, we found that these linear models underperformed their discrete counterparts and therefore we do not discuss them here.

One limitation of the classification models presented here is that the pain scores are binned into unordered classes. As far as the classifier is concerned, pain scores of 6 and 7 are just as distinct as pain scores of 4 and 7, since both pairs correspond to the same pair of classes: medium and high. Similarly, there is no requirement that the medium and high pain classes should be more similar than the low and high classes in the model. The models are free to discover this ordinality from the data, but the ordinality is not imposed a priori.

Since 10-fold cross-validation involves random assignment of patients to training and testing folds, the predictions of both of these naive models can be viewed as random variables. One can use bootstrapping to simulate this assignment and prediction process in order to estimate the probability distributions for those predictions as well as any relevant performance metrics (such as those discussed in Section 4.3). Given these distributions and the performance of a trained model, it is straightforward to estimate the probability that the naive model outperforms the trained model according to a given metric. These probabilities are analogous to the so-called p-values used in hypothesis testing. In other words, we can use these p-values to test the null hypothesis that a naive model performs as well as or better than the trained model. The p-values reported in Section 2 were computed using this approach. According to convention, p-values less than α = 0.05 are deemed statistically significant at the 0.05 level, and constitute evidence that the null hypothesis can be rejected in favor of the alternative. In this context, p < 0.05 constitutes statistically significant evidence that the trained model outperforms the naive model.
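A sketch of this bootstrap procedure follows; the naive-model simulation is left abstract and the function names are illustrative rather than the study's implementation.

```python
import numpy as np

def bootstrap_p_value(simulate_naive_metric, trained_score, n_boot=10_000, seed=0):
    """Estimate P(naive model performs at least as well as the trained model).

    `simulate_naive_metric(rng)` should carry out one random fold assignment,
    apply the naive model (random or mode guessing), and return its metric on
    the test folds; `trained_score` is the cross-validated metric of the
    trained model. The returned proportion plays the role of a p-value for the
    null hypothesis that the naive model performs as well as or better than
    the trained model.
    """
    rng = np.random.default_rng(seed)
    naive_scores = np.array([simulate_naive_metric(rng) for _ in range(n_boot)])
    return float(np.mean(naive_scores >= trained_score))
```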

In the mode guessing model, we simply predict the most common class from the training data. That is analogous to the heuristic that a human might use in the absence of outside information. In the context of pain, when asked to estimate pain without additional data, a clinician might predict that each patient is experiencing the level of pain that is typical among similar patients or that the pain has not changed since the last time the pain was reported. A predictive model that fails to outperform such a model would have limited usefulness in a clinical setting.

In the random guessing model, we make predictions by randomly sampling from a multinomial distribution with class probabilities determined from the class proportions in the training data. Note that this model is “naive” in the sense that it ignores the input features entirely. If a predictive model fails to outperform this baseline, then one can conclude that either the features contain no useful information about the variable that is being predicted, or the model is misspecified and is not appropriate for the dataset.

Before training a sophisticated statistical or machine learning model like an HMM, it is useful to consider an appropriate baseline model in order to put the model’s performance in context. We consider two such models: a random guessing model and a mode guessing model. Both types of models can be derived from the proportions of observations from each class in the training data.
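A minimal sketch of these two baselines, assuming the pain classes are encoded as integers, might look like the following; the class names and structure are illustrative rather than the study's implementation.

```python
import numpy as np

class RandomGuessingModel:
    """Predict by sampling classes according to their training-set proportions."""

    def fit(self, y_train):
        classes, counts = np.unique(y_train, return_counts=True)
        self.classes_, self.probs_ = classes, counts / counts.sum()
        return self

    def predict(self, n, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        return rng.choice(self.classes_, size=n, p=self.probs_)

class ModeGuessingModel:
    """Always predict the most common class in the training data."""

    def fit(self, y_train):
        classes, counts = np.unique(y_train, return_counts=True)
        self.mode_ = classes[np.argmax(counts)]
        return self

    def predict(self, n):
        return np.full(n, self.mode_)

# Usage (illustrative): y_train holds the pain classes from the training folds.
# y_hat = ModeGuessingModel().fit(y_train).predict(len(y_test))
```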

The parameters of these distributions can be estimated in a semi-supervised fashion using a simplified version of the expectation maximization process used for HMMs. Rather than using the forward-backward procedure to compute the posterior probabilities, we can instead exploit the IID assumption to compute them as p(Y t = i| S T , θ) = p(Y t = i| X t = x t , θ), (4) where the term on the right hand side is given by Eq (3). Using this result, both the parameters and the predictions of the model can be updated iteratively.
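The iterative updates can be sketched as follows. This is an illustrative implementation under the assumptions stated in the text (uniform class priors, independent Gaussian emissions, missing features encoded as NaN, and pain classes clamped where the pain score was reported), not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def semisupervised_gnb(X, y, n_classes=3, n_iter=50, seed=0):
    """Semi-supervised EM for a Gaussian naive Bayes model with missing data.

    X : (T, K) array of physiological measures; missing entries are NaN.
    y : (T,) array of pain classes in {0, ..., n_classes - 1}, or -1 when the
        pain score was not reported (hidden).
    Returns per-class means, variances, and the (T, n_classes) posteriors.
    Uniform class priors are used, as in the text.
    """
    T, K = X.shape
    observed = ~np.isnan(X)
    X0 = np.nan_to_num(X)                                   # zeros where missing (masked below)
    mu = np.tile(np.nanmean(X, axis=0), (n_classes, 1))
    mu += np.random.default_rng(seed).normal(scale=0.1, size=mu.shape)  # break symmetry
    var = np.tile(np.nanvar(X, axis=0) + 1e-6, (n_classes, 1))

    for _ in range(n_iter):
        # E-step: per-window posteriors (Eqs 3 and 4), using only observed features.
        log_lik = np.zeros((T, n_classes))
        for i in range(n_classes):
            logpdf = norm.logpdf(X0, mu[i], np.sqrt(var[i]))
            log_lik[:, i] = np.where(observed, logpdf, 0.0).sum(axis=1)
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        labeled = y >= 0
        post[labeled] = np.eye(n_classes)[y[labeled]]       # clamp known pain classes

        # M-step: posterior-weighted means and variances over observed entries.
        for i in range(n_classes):
            w = post[:, i:i + 1] * observed
            mu[i] = (w * X0).sum(axis=0) / w.sum(axis=0)
            var[i] = (w * (X0 - mu[i]) ** 2).sum(axis=0) / w.sum(axis=0) + 1e-6
    return mu, var, post
```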

The GNB model is a simplified version of the HMM. Rather than modeling the transitions between states, we treat the observations as if they are independent and identically distributed (IID). As with the HMM, we assume that the physiological measures are independent so that the kth emission is Gaussian when conditioned on the pain class.

This step can be modified as follows. When no emissions are observed, we use uniform probabilities p( X t = x |Y t = i) = 1/3. When some but not all emissions are missing, p( X t = x |Y t = i, θ ) is computed by integrating over all possible values of the missing emissions. Since the emissions are independent, this distribution is just a product of Gaussians and one therefore only needs to multiply the densities that correspond to the observed emissions (when present). Finally, when the state variable is observed, we simply set the posterior probability p(Y t = i| X t = x , θ ) equal to one for the observed state and zero for all others.
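A sketch of how these modifications affect the per-window emission likelihood and posterior, assuming independent Gaussian emissions with missing values encoded as NaN (the variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def emission_probability(x, mu_i, var_i):
    """p(X_t = x | Y_t = i): product of Gaussian densities over observed features.

    `x` is one window's feature vector with NaN marking missing measures;
    `mu_i` and `var_i` are the class-i emission means and variances. When all
    features are missing, the emission term is a constant, so the posterior
    falls back to the (uniform) prior.
    """
    observed = ~np.isnan(x)
    if not observed.any():
        return 1.0  # no evidence: contributes equally to every class
    dens = norm.pdf(x[observed], mu_i[observed], np.sqrt(var_i[observed]))
    return float(np.prod(dens))

def state_posterior(x, mu, var, observed_state=None, n_classes=3):
    """Posterior over pain classes for one window, with uniform priors (Eq 3)."""
    if observed_state is not None:
        post = np.zeros(n_classes)
        post[observed_state] = 1.0  # pain score reported: posterior is a point mass
        return post
    lik = np.array([emission_probability(x, mu[i], var[i]) for i in range(n_classes)])
    return lik / lik.sum()
```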

The first step of the forward-backward procedure involves computing the posterior probabilities for each state conditioned on the observed emissions using the current parameter estimates. This is given by p(Y t = i| X t = x , θ) = p( X t = x |Y t = i, θ) p(Y t = i| θ) / Σ j p( X t = x |Y t = j, θ) p(Y t = j| θ), (3) which is derived from Bayes rule. The terms on the right hand side are straightforward to compute using the emission distributions, which are defined by the current parameter estimates θ, and uniform priors p(Y t = i| θ) = 1/3 for i = 0, 1, 2.

In its original formulation, the Baum-Welch algorithm was designed for the case where the hidden states are never observed and the emission variables are completely observed. In the datasets studied here, some of the hidden states are known and many of the emission variables are missing. In this case, the forward-backward procedure must be modified slightly, and the model is trained in a semi-supervised manner.

A fully specified three-state model with six independent emissions is therefore determined by a set of 48 parameters: three priors (π i ), nine transition probabilities (A ij ), 18 emission means (μ ik ), and 18 emission variances (σ² ik ). Training an HMM involves estimating these parameters from data. Typically this is achieved using maximum likelihood estimation (MLE), in which one selects the model parameters that are most likely to produce the observed sequences of emission variables. When the data is incomplete, MLE can be carried out using an expectation maximization method known as the Baum-Welch algorithm [23], which alternates between an expectation step, in which posterior state probabilities are computed from the current parameter estimates via the forward-backward procedure, and a maximization step, in which the parameters are re-estimated from those posteriors.
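For concreteness, a compact sketch of this EM loop for fully observed emission sequences is shown below; the modifications for missing emissions and partially observed states described elsewhere in this section are omitted, and this is an illustration rather than the study's implementation.

```python
import numpy as np
from scipy.stats import norm

def baum_welch(X_seqs, n_states=3, n_iter=30, seed=0):
    """Sketch of Baum-Welch (EM) for an HMM with independent Gaussian emissions.

    X_seqs is a list of (T, K) emission arrays, one per hospital visit, assumed
    fully observed here. Internally, A[i, j] = p(Y_t = j | Y_t-1 = i)
    (row-stochastic), the transpose of the convention used in the text.
    """
    rng = np.random.default_rng(seed)
    K = X_seqs[0].shape[1]
    pi = np.full(n_states, 1.0 / n_states)
    A = np.full((n_states, n_states), 1.0 / n_states)
    stacked = np.vstack(X_seqs)
    mu = stacked.mean(0) + rng.normal(scale=0.1, size=(n_states, K))  # break symmetry
    var = np.tile(stacked.var(0) + 1e-6, (n_states, 1))

    for _ in range(n_iter):
        g0 = np.zeros(n_states)                    # expected initial-state counts
        g = np.zeros(n_states)                     # expected state counts
        xi_sum = np.zeros((n_states, n_states))    # expected transition counts
        mu_num = np.zeros((n_states, K))
        sq_num = np.zeros((n_states, K))
        for X in X_seqs:
            T = len(X)
            # Emission likelihoods B[t, i] = p(x_t | Y_t = i).
            B = np.exp(np.stack([norm.logpdf(X, mu[i], np.sqrt(var[i])).sum(1)
                                 for i in range(n_states)], axis=1))
            # E-step: scaled forward-backward recursions.
            alpha = np.zeros((T, n_states)); c = np.zeros(T)
            alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[t]
                c[t] = alpha[t].sum(); alpha[t] /= c[t]
            beta = np.ones((T, n_states))
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (B[t + 1] * beta[t + 1]) / c[t + 1]
            gamma = alpha * beta
            gamma /= gamma.sum(1, keepdims=True)
            # Accumulate expected sufficient statistics.
            g0 += gamma[0]; g += gamma.sum(0)
            for t in range(T - 1):
                xi = alpha[t, :, None] * A * (B[t + 1] * beta[t + 1])[None, :]
                xi_sum += xi / xi.sum()
            mu_num += gamma.T @ X
            sq_num += gamma.T @ (X ** 2)
        # M-step: re-estimate the parameters from the expected counts.
        pi = g0 / g0.sum()
        A = xi_sum / xi_sum.sum(1, keepdims=True)
        mu = mu_num / g[:, None]
        var = sq_num / g[:, None] - mu ** 2 + 1e-6
    return pi, A, mu, var
```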

The emission variables are modeled as independent Gaussian random variables when conditioned on the hidden variable. In other words, the probability distribution of the physiological measures satisfies p(X t,k = x|Y t = i) = N(x; μ ik , σ² ik ), where i and k denote the particular class and physiological measure respectively and μ ik and σ² ik denote the corresponding mean and variance. The initial state in each sequence is modeled as a random variable that follows a multinomial distribution with prior probabilities π 0 , π 1 , and π 2 .

The datasets discussed here contain multiple sequences of states and emission variables, each of which corresponds to a single hospital stay. Below, we use the superscript (n) for n = 1, 2, …N to reference individual sequences when necessary, but suppress this superscript to simplify the notation when there is no ambiguity.

We chose this type of model, which treats the evolution of pain as a random walk, due to the absence of consistent long-term trends in the observed pain (see S2 Fig). In the direct and median-referenced models, we found that the most common transitions were from state i to state i. In other words, the pain class remains constant most of the time. We also observed that transitions between adjacent pain classes occurred more frequently than transitions between non-adjacent classes. These frequencies were reflected in the transition matrices learned by the model (see S3 Table). In the difference classification problem, we observed that increases in pain were most often followed by decreases in pain and vice versa. Again, this was consistent with the learned transition matrices.

In an HMM, the state variable (the pain class) Y t is hidden and evolves according to a probabilistic process that possesses the Markov property, i.e., the probability of transitioning to state i is independent of the earlier history and depends only on the most recent state j. This process can therefore be fully specified by a transition matrix A with A ij = p(Y t = i|Y t−1 = j).

For both classifiers, we were interested in estimating the probability distribution for the classes conditioned on the physiological measures, p(Y t = i| S T , θ), where θ denotes the set of parameters, S T denotes the entire sequence of physiological measures x 1 , x 2 , … x T , and T is the length of the sequence for a given hospital stay. In the context of HMMs, the observed variables (the physiological measures) are referred to as emissions.

4.3 Evaluation metrics

In Section 4.2, we presented a variety of classification models for the evolution of pain. Given a sequence of emissions, these models can be used to produce probabilistic predictions for the corresponding pain categories. We now discuss various metrics used to compare the performance of these models.

The most natural tool for describing the performance of a classifier is the confusion matrix. Given an N-class classification problem, a confusion matrix is an N × N matrix with the entry in row i and column j indicating the number of examples in the test set for which the true class was i and the predicted class was j. Unfortunately, there is no straightforward way to compare confusion matrices since they are multi-dimensional. Researchers often attempt to condense that information into a single metric allowing for direct comparisons. For example, given a confusion matrix C, the diagonal entries correspond to correctly classified observations, and one can therefore compute the accuracy as the proportion of correct predictions, i.e., the sum of the diagonal entries of C divided by the sum of all of its entries.
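As a brief illustration (not the study's code), the confusion matrix and the accuracy derived from it can be computed as follows:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Entry (i, j) counts test examples with true class i and predicted class j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def accuracy(C):
    """Proportion of correct predictions: the diagonal sum over the total count."""
    return np.trace(C) / C.sum()
```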

Unfortunately, accuracy is not necessarily a good measure of performance. In classification problems with imbalanced class-sizes, naive models like the mode-guessing model can produce a high degree of accuracy despite the fact that their predictions are useless. This is analogous to a weatherman who never forecasts rain: although the forecast may be correct 90% of the time, it does not provide any useful information.

Alternatives to accuracy include the precision and recall, which measure the number of true positives divided by the total number of positive predictions and the number of true positives divided by the number of positive examples respectively. Precision is therefore the more useful measure when one is primarily concerned with avoiding false positives, and recall is the more useful measure when one is primarily concerned with avoiding false negatives. Both metrics are bounded between zero and one, with zero corresponding to a classifier that is always wrong and one corresponding to a perfect classifier. For N-class classification problems, one can compute each metric N times, once for each “positive” class, and then average the results to obtain a single number. When the classes are weighted equally during the averaging, the metrics are referred to as the macro averaged precision and recall. When the classes are weighted in proportion to the number of observations in each class, the metrics are referred to as the weighted average precision and recall. Both averaged metrics are difficult to interpret without knowing the sizes of the classes.

The F 1 score is the harmonic mean of the precision and recall and is a way to combine both quantities into a single metric. Like precision and recall, it is possible to compute macro average and weighted average F 1 scores by averaging over the classes. Unfortunately, this can lead to counterintuitive results. For example, it is possible to have a macro averaged F 1 score that is lower than both the precision and recall. Consider a three-class problem with 10 examples from each class and a classifier whose predictions are biased toward class 2 due to an overabundance of examples from class 2 in the training data. Suppose that this classifier produces the confusion matrix shown in Table 7. The single-class precisions are given by 1/1, 1/1, and 10/28 and the single-class recall values are given by 1/10, 1/10, and 10/10 for classes 0, 1, and 2 respectively. This means the F 1 scores are given by 2/11, 2/11, and 10/19. These results can be averaged across classes (in this case macro and weighted averaging are equivalent), yielding averaged scores of 11/14 ≈ 0.786, 2/5 = 0.4 and 62/209 ≈ 0.297 for precision, recall and F 1 score respectively. In other words, the averaged F 1 score is lower than both the precision and the recall, despite the fact that it is the harmonic mean of those quantities when applied to each class. On the other hand, if we take the harmonic mean of the averaged precision and recall, we obtain 0.530, which lies between the precision and the recall.
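The arithmetic in this example can be checked numerically. The sketch below reconstructs the confusion matrix implied by the precision and recall values above and verifies the averaged metrics with scikit-learn; the use of scikit-learn is an assumption about tooling, not the study's own computation.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Confusion matrix implied by the precision and recall values quoted above
# (rows: true class, columns: predicted class).
C = np.array([[1, 0, 9],
              [0, 1, 9],
              [0, 0, 10]])
# Expand the counts back into per-example labels so sklearn's metrics apply.
y_true = np.repeat([0, 1, 2], 10)
y_pred = np.concatenate([np.repeat(np.arange(3), row) for row in C])

print(precision_score(y_true, y_pred, average="macro"))  # 11/14 ~ 0.786
print(recall_score(y_true, y_pred, average="macro"))     # 2/5  = 0.400
print(f1_score(y_true, y_pred, average="macro"))         # 62/209 ~ 0.297
```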

Table 7. Confusion matrix for a hypothetical classifier. In this table, rows indicate the true class and columns indicate the predicted class. The entries in the table describe the number of observations of each type. When computing the F 1 score for this classifier, one finds that, counterintuitively, the averaged result is smaller than both the precision and the recall. https://doi.org/10.1371/journal.pcbi.1008542.t007

Because average F 1 scores are popular multi-class classification metrics, we include the macro-averaged F 1 score in Tables 2–4 for the sake of completeness. However, due to these counter-intuitive properties, we do not consider these metrics to be particularly informative in the context of pain estimation.

Up to this point, all of the metrics we have discussed are based on the confusion matrix. In other words, the metric is computed from the predictions rather than the class probabilities. These predictions are obtained by selecting the class with the highest posterior probability. Unfortunately, in problems with imbalanced classes that are not well-separated, the priors (related to the class frequencies) can dominate the likelihoods, so that the posterior probability is always (or almost always) largest for the most common class and the classifier will never (or rarely) predict the minority classes. This does not necessarily mean that the posterior probabilities are uninformative. For example, suppose a patient usually has high pain with probability 0.8 and low pain with probability 0.1. If at a particular point in time, the posterior probabilities for high and low pain change to 0.55 and 0.3 respectively, then that is a sign that the patient’s condition may have improved and that the model is detecting evidence for that change. Metrics that are based on the confusion matrix cannot detect these small changes since the “High” pain class has the highest posterior probability in both cases.

The volume under the receiver operating characteristic surface (VUS) and its binary analog, the area under the receiver operating characteristic curve (AUC), use the posterior probabilities directly instead of the confusion matrix. For example, in a binary classifier one can use a threshold on the posterior probability for the positive class to determine whether to label the example as positive or negative. When this threshold is close to zero, the true positive rate (TPR) will be high, but the false positive rate (FPR) will also be high. One can reduce the false positive rate by increasing the threshold, but this may decrease the true positive rate as well. As the threshold varies, the TPR and FPR trace out a parametric curve from (FPR, TPR) = (0, 0) to (FPR, TPR) = (1, 1). A useful classifier will have some threshold that achieves a low FPR and a high TPR. The area under this curve therefore provides a useful metric for measuring classifier performance.

The VUS is a generalization of the AUC to multi-class problems [15]. Direct computation of the VUS for an n-class classification problem involves integrating over a surface in n(n − 1) dimensions. This can be computationally intensive, so VUS is usually approximated using certain heuristics. The approach we use is motivated by reference [16], where the authors provide a concise interpretation of the VUS. They demonstrate that the VUS is equal to the accuracy obtained in a forced-choice experiment where the classifier is presented with examples from each class and asked to assign labels to each. This procedure can be implemented as follows:

1. Initialize two counters: Total = 0 and TotalCorrect = 0.
2. Select one example from each class from the testing set.
3. Using the posterior class probabilities, attempt to assign labels to each of these examples. Note that this is essentially a matching problem since we know that there must be an example from each class. We used decision rule III from [15] for the matching procedure.
4. If all of the labels are correct, then increment TotalCorrect and Total. Otherwise, increment Total only.
5. Repeat steps 2 through 4 until all combinations of examples from each class have been exhausted.
6. The accuracy in this task, and therefore the VUS score, of this classifier is given by TotalCorrect/Total.

Note that a trivial classifier that guesses randomly will have an accuracy of 1/n! where n is the number of classes. This means that, in binary classification problems, nontrivial models should exceed a baseline VUS of 0.5, and in three-class classification problems such as those discussed here, the VUS score should exceed 0.167.

The number of combinations of examples from the classes grows quickly with the size of the dataset and the number of classes, making this procedure impractical for even moderately sized datasets. Fortunately, one can obtain a satisfactory approximation by using random samples from each class rather than deterministically exploring every combination of examples from each class. We found 1000 samples to be sufficient for obtaining satisfactory estimates of the VUS.
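A sketch of this sampling-based approximation is shown below, assuming the posterior probabilities and true labels for the test set are available as arrays. The exhaustive permutation matching used here is a stand-in for decision rule III of [15]; with three classes there are only six candidate assignments per trial.

```python
import numpy as np
from itertools import permutations

def estimate_vus(posteriors, labels, n_samples=1000, seed=0):
    """Approximate the VUS by the forced-choice matching accuracy, using sampling.

    posteriors : (N, n_classes) posterior class probabilities on the test set.
    labels     : (N,) true classes. Each trial draws one example per class and
    checks whether assigning the classes to the examples so as to maximize the
    joint posterior recovers the true labels.
    """
    rng = np.random.default_rng(seed)
    n_classes = posteriors.shape[1]
    by_class = [np.flatnonzero(labels == c) for c in range(n_classes)]
    correct = 0
    for _ in range(n_samples):
        idx = [rng.choice(members) for members in by_class]   # one example per class
        P = posteriors[idx]
        best = max(permutations(range(n_classes)),
                   key=lambda perm: np.sum(np.log(P[np.arange(n_classes), perm] + 1e-12)))
        correct += best == tuple(range(n_classes))             # identity = correct matching
    return correct / n_samples
```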

Out of all the metrics presented here, we believe that the VUS is the most meaningful for the pain estimation problem due to its robustness to class imbalance and its ability to detect whether the posterior probabilities contain useful information for distinguishing between classes. According to this metric, both HMMs and GNB classifiers provide a significant improvement over null models despite the fact that the trivial mode classifier has higher accuracy.
