(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Unifying cardiovascular modelling with deep reinforcement learning for uncertainty aware control of sepsis treatment

Authors: Thesath Nanayakkara (Department of Mathematics, University of Pittsburgh, Pittsburgh, PA, United States of America); Gilles Clermont (Department of Critical Care Medicine, The Clinical Research, Investigation)

Date: 2022-04

Abstract
Sepsis is a potentially life-threatening inflammatory response to infection or severe tissue damage. It has a highly variable clinical course, requiring constant monitoring of the patient’s state to guide the management of intravenous fluids and vasopressors, among other interventions. Despite decades of research, there is still debate among experts on optimal treatment. Here, we combine, for the first time, distributional deep reinforcement learning with mechanistic physiological models to find personalized sepsis treatment strategies. Our method handles partial observability by leveraging known cardiovascular physiology, introducing a novel physiology-driven recurrent autoencoder, and quantifies the uncertainty of its own results. Moreover, we introduce a framework for uncertainty-aware decision support with humans in the loop. We show that our method learns physiologically explainable, robust policies that are consistent with clinical knowledge. Further, our method consistently identifies high-risk states that lead to death, which could potentially benefit from more frequent vasopressor administration, providing valuable guidance for future research.

Citation: Nanayakkara T, Clermont G, Langmead CJ, Swigon D (2022) Unifying cardiovascular modelling with deep reinforcement learning for uncertainty aware control of sepsis treatment. PLOS Digit Health 1(2): e0000012. https://doi.org/10.1371/journal.pdig.0000012 Editor: Matthew Chua Chin Heng, National University of Singapore, SINGAPORE Received: August 14, 2021; Accepted: November 29, 2021; Published: February 17, 2022 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. Data Availability: This work uses the publicly available MIMIC-III database, which can be accessed by visiting https://physionet.org/content/mimiciii/1.4/. Funding: The author(s) received no specific funding for this work. Competing interests: The authors have declared that no competing interests exist.

Sepsis is a major host response to infection which can result in tissue damage, organ damage, and death. The mortality and economic burden of sepsis are substantial. In the U.S., sepsis is responsible for 6% of all hospitalizations and 35% of all in-hospital deaths [1, 2], and imposes an economic burden of more than $20B per year [3]. The treatment of sepsis is extremely challenging, due to the high variability among patients with respect to the progression of the disease, the host response to infection, and the response to medical interventions, suggesting the need for a dynamic and personalized approach to treatment [4–6]. Presently, the search for treatment strategies to optimize sepsis patient outcomes remains an open challenge in critical care medicine, despite decades of research.

Recently, there has been considerable interest in the application of Reinforcement Learning (RL) [7] to extract vasopressor and intravenous (IV) fluid treatment policies (i.e., strategies) for septic patients from electronic health records data (ex. [8–12]). Informally, the goal is to learn a policy that maps the patient’s current state to an action (i.e., medical intervention), so as to maximize the chances of future recovery. The RL framework is well-matched to the actual behaviors of physicians, who continuously observe, interpret, and react to their patient’s condition. The promise of RL in medicine is that we might be able to find policies that outperform humans (as it has in other domains, ex. [13–15]), by automatically personalizing the treatment strategy for each patient, as opposed to using one that is expected to work well on the typical patient [16, 17]. However, there are many challenges that must be met before RL can be used to guide medical decision making in real-life settings [18].

A particularly severe challenge is partial observability of patient state. Despite the richness of data collected at the ICU, the mapping between true patient states and clinical observables is often ambiguous. We believe that this ambiguity can be reduced through the use of mechanistic mathematical models of physiology that relate observables to a more complete representation of the patient’s cardiovascular state. Such models are plentiful in the literature, and embody decades of research in physiology and medicine. Our proposed solution integrates, for the first time, a clinically relevant mechanistic model into a Deep RL framework. The specific model we use was chosen because it estimates the unobservable aspects of cardiovascular state that are relevant to specific interventions (vasopressors and IV fluids), and the clinician’s goals—counteracting hypovolemia, vasodilation, and other physiological disturbances. This model is integrated into our framework using a self-trained deep recurrent autoencoder that uses a variety of inputs, including the patient’s vital signs, organ function scores, and previous treatments.

The second challenge addressed by our framework is uncertainty in the learned policy, and thus the expected outcomes. Similar to previous efforts to extract sepsis treatment policies from retrospective data (ex. [8]), our method works in the Batch Reinforcement Learning setting [19], where the agent cannot explore the environment freely. In this setting, it is well known that RL can perform poorly [20], if the agent encounters states that are rare or even unobserved in the training data. For this reason, it has been argued that all forms of uncertainty should be quantified in any application of Artificial Intelligence to Medicine [21]. Thus, we quantify model uncertainty (This should not be confused with the model-based vs model-free RL distinction, because once we have inferred latent states, our approach qualifies as ‘model-free’. The literature also uses the term epistemic uncertainty and parametric uncertainty for model uncertainty.) via bootstrapping and take a distributional approach to factor in environment uncertainty. We also propose a decision framework where the clinician is presented with a quantitative assessment of the distribution over outcomes for each state-action pair.

1 Background and related work

1.1 Reinforcement learning

Reinforcement Learning is a framework for optimizing sequential decision making. In its standard form, the framework considered is a Markov Decision Process (MDP), consisting of a 5-tuple (S, A, r, γ, p). Here, S and A are state and action spaces, r : S × A × S → R is a reward function, p : (S, A, S) → [0, ∞) denotes the unknown environment dynamics, which specifies the distribution of the next state s′ given the state-action pair (s, a), and γ is a discount rate applied to rewards. A policy is a (possibly stochastic) mapping from S to A. The agent aims to compute the policy π which maximizes the expected future reward E_{p,π}[Σ_t γ^t r_t]. In the partially observed setting there is a distinction between the observations, denoted o_t, and the state s_t, and the environment dynamics includes the conditional probability density p(o_t | s_t). This extends the MDP formalism to that of a Partially Observed Markov Decision Process (POMDP).

The search for an optimal policy can be performed in several ways, including the iterative calculation of the value function, V^π(s) = E_{p,π}[Σ_t γ^t r_t | s_0 = s], which returns the expected future discounted reward when following policy π starting from the state s, or the Q-function, Q^π(s, a) = E_{p,π}[Σ_t γ^t r_t | s_0 = s, a_0 = a], which returns the expected future reward when choosing action a in state s and then following policy π. Central to many RL algorithms is the Bellman equation [22]:

Q^π(s, a) = E_{s′∼p(·|s,a)}[ r(s, a, s′) + γ E_{a′∼π}[Q^π(s′, a′)] ]   (1)

and the Bellman optimality equation:

Q*(s, a) = E_{s′∼p(·|s,a)}[ r(s, a, s′) + γ max_{a′} Q*(s′, a′) ]   (2)

(where Q*(s, a) is the optimal Q-function, and s′ denotes the random next state).

1.2 Distributional and uncertainty aware reinforcement learning

Distributional Reinforcement Learning [23–25] extends traditional RL methods by estimating the entire return distribution from a given state, rather than simply an expected value. It has been shown that distributional RL can achieve superior performance in the context of Batch RL [26]. For this reason, and because distributions are relevant to our overall goal of providing clinicians with an assessment of the range of possible outcomes for each state-action pair, we employ Categorical Distributional RL [23]. Here the state-action value distribution is approximated by a discrete distribution with equally spaced support. Further, we employ Deep Ensembles [27] to quantify the uncertainty associated with each state-action pair. These ensembles are constructed using bootstrap estimates, as explained in the Methods section.

1.3 Reinforcement learning in medicine

Reinforcement Learning has been used for various healthcare applications. References [17] and [16] provide comprehensive surveys of healthcare and critical care applications, respectively. In the specific context of sepsis treatment, Komorowski et al. [8] used a discrete state representation created by clustering patient physiological readouts, and a 25-dimensional discrete action space, to compute optimal treatment strategies using dynamic programming-based methods. Others have considered continuous state representations [9] and partial observability [10]. Our proposed decision support system is based on a preference score, as shown in Fig 1A. In contrast to previous work, we choose a lower-dimensional action space (9 actions), to ensure sufficient coverage in the training data, and a reduced decision time-scale, to be more aligned with clinical practice. The short time scale also provides a clinical justification for the less granular action space.
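To make the categorical distributional update of Section 1.2 concrete, the following minimal sketch (not the authors' implementation) projects a distributional Bellman target onto a fixed, equally spaced support; the support range, atom count, and all variable names are illustrative assumptions.

```python
import numpy as np

def categorical_projection(next_probs, reward, gamma, support):
    """Project the distributional Bellman target r + gamma * Z(s', a*)
    onto a fixed discrete support (the C51-style categorical update [23]).

    next_probs : probabilities over `support` for the greedy next action.
    reward     : scalar immediate reward.
    gamma      : discount factor (use 0 for terminal transitions).
    support    : equally spaced atom locations, e.g. np.linspace(v_min, v_max, 51).
    """
    v_min, v_max = support[0], support[-1]
    delta_z = support[1] - support[0]
    projected = np.zeros_like(next_probs)

    for prob, z in zip(next_probs, support):
        tz = np.clip(reward + gamma * z, v_min, v_max)   # shifted and clipped atom
        b = (tz - v_min) / delta_z                        # fractional index on the support
        lower, upper = int(np.floor(b)), int(np.ceil(b))
        if lower == upper:                                # lands exactly on an atom
            projected[lower] += prob
        else:                                             # split mass between neighbours
            projected[lower] += prob * (upper - b)
            projected[upper] += prob * (b - lower)
    return projected

# Toy usage: a support spanning the +/-15 terminal rewards, purely for illustration.
support = np.linspace(-15.0, 15.0, 51)
next_probs = np.full(51, 1.0 / 51)                        # uniform next-state distribution
target = categorical_projection(next_probs, reward=-0.1, gamma=0.99, support=support)
assert np.isclose(target.sum(), 1.0)
```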
Our rewards are based on previous work [9] (see Methods), which uses intermediate SOFA-based rewards and ±15 terminal rewards, depending on survival.
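A rough sketch of this reward shape follows; the coefficient values below are illustrative placeholders in the spirit of [9], not values taken from this paper (the form actually used here is given in the Methods, Eq (5)).

```python
def sketch_reward(sofa_now, sofa_next, terminal=False, survived=None,
                  c0=-0.025, c1=-0.125):
    """Illustrative SOFA-based reward; c0 and c1 are placeholder constants."""
    if terminal:
        return 15.0 if survived else -15.0
    r = c1 * (sofa_next - sofa_now)              # penalize SOFA increases, reward decreases
    if sofa_next == sofa_now and sofa_next > 0:
        r += c0                                  # small penalty for a persistently abnormal SOFA
    return r
```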


Fig 1. Proposed decision support system (A): We use the complete patient history, which includes vitals, scores, labs, and previous treatments, to infer hidden states. These are combined to form the state S_t. Our trained agent takes this state and outputs value distributions for each treatment, its own uncertainty, and an approximate clinician policy. We then factor in all three to propose uncertainty-aware treatment strategies. The electrical analog of the cardiovascular model (B): This provides a lumped representation of the resistive and elastic properties of the entire arterial circulation using just two elements, a resistance R and a capacitance C. This model is used to derive algebraic equations relating R, C, stroke volume (SV), and filling time (T) to heart rate (F) and pressure. The cardiac output (CO) can then be computed as (SV)F. These equations define the decoder of the physiology-driven autoencoder. Complete physiology-driven autoencoder network structure (C): Patient history is sequentially encoded using three neural networks: a patient encoder that computes initial cardiovascular state estimates from patient characteristics, a recurrent neural network (RNN) that encodes the past history of vitals and scores up to and including the current time point, and a transition network that takes the previous cardiovascular state, the action, and the history representation and outputs new cardiovascular state estimates. https://doi.org/10.1371/journal.pdig.0000012.g001

3 Discussion and conclusion

We present an interdisciplinary approach which we believe takes a significant step towards improving the current state of data-driven interventions in clinical sepsis, in terms of both outcomes and interpretability. Indeed, we believe that the maximum benefit of Artificial Intelligence applied to medicine is best realized through the integration of mechanistic models of physiology whenever possible, uncertainty quantification, and human expert knowledge into sequential decision making frameworks. Our contribution improves the status quo in several ways.

Compared to prior work, our approach deals with partial observability of the data by leveraging known physiology, in the form of a low-order, two-compartment Windkessel-type cardiovascular model, in the context of self-supervised representation learning. As mentioned previously, this has several benefits. First, in the context of sepsis treatment, estimating the cardiovascular state is essential because the clinical decision to administer intravenous fluids or vasopressors is driven by an implicit differential diagnosis by the clinician as to whether insufficient organ perfusion and shock are secondary to insufficient circulating volume (thus requiring fluids), vasoplegia (thus requiring vasopressors), or some combination of both fundamental pathophysiologies. Second, there is typically insufficient data to determine whether heart function is adequate (contractile dysfunction), but a mechanistic model provides an indirect means for estimating cardiac function by imposing known physiology. Finally, the incorporation of physiologic models improves explainability, while deep neural networks and stochastic gradient-based optimizers make it possible to learn robust and generalizable representations from large data. We expect the unification of models based on first principles and data-driven approaches will provide a powerful interface between traditional computational sciences and modern machine learning research, mutually benefiting both disciplines. We have not fully examined the association between inferred physiological state and treatment recommendation to confirm whether recommended actions are indeed clinically sensible. Such work is currently underway.

We also introduce an approach to quantifying model uncertainty, which is essential in any practical application of RL-based inference using clinical data. To the best of our knowledge, this is the first time uncertainty quantification has been used to quantify epistemic uncertainty in RL-based optimization of sepsis treatment, and in critical care applications more generally. (Previous approaches, ex. [11], have considered inherent environment uncertainty.) The method’s uncertainty estimates, combined with the recommended action, comprise a simple framework for automated clinical decision support. This principle aligns with the larger goal of combining different forms of expertise and knowledge for better decision making, a philosophy consistent with the rest of this work.

We chose a decision time step of one hour. Compared to similar work, this is much more compatible with the time scale of medical decision making in sepsis, where fluid and vasopressor treatments are titrated continuously. Accordingly, on such a time scale, there do not appear to be large differences in the relative merit of different dosing strategies. This makes intuitive sense: there is presumably a lesser need for major treatment modifications if decisions are made more frequently.
Yet, a frequent finding across patients, especially the sickest ones, was that inaction (no intervention) was a consistently worse strategy. This also meets clinical intuition. Reducing the time scale of decisions is not only appealing clinically in situations of rapidly evolving physiological states, such as early sepsis, but it also provides a more compelling basis for a less granular action space. Indeed, if decisions are made hourly, it meets clinical intuition to have fewer discrete actions. Few physicians would argue that there is a meaningful difference between administering 100 cc and 200 cc of fluids in the next hour. In the extreme, if time were continuous, the likely decision space at any given time is whether a fluid bolus should be administered or not. A similar reasoning applies to vasopressors (increase, reduce, status quo). We further notice that our methods consistently identify high-risk, non-survivor patient states which could potentially benefit from more frequent vasopressor treatment. These results should, of course, be subject to clinical verification.

An important open problem in the application of offline RL to medicine is the means by which one evaluates learned treatment policies, given the obvious ethical issues associated with allowing an AI to exert some control over treatment. Still, proper clinical trials will eventually be necessary, so the critical care community should define for itself the standards by which an AI would be deemed safe enough to enter clinical trials [32]. In this work, we have largely relied on a combination of medical expertise, and the fact that our model leverages prior knowledge in the form of a simple model of cardiovascular physiology, to argue that the learned policy is reasonable. We make no claim that the policy is expected to produce superior outcomes in sepsis patients relative to human clinicians. One important area for future work may be the incorporation of more detailed models of physiology into our framework, or perhaps using such models in the context of in silico trials (ex. [33]) as a first step towards demonstrating that a learned policy is safe, and perhaps suitable for pre-clinical and clinical trials. Additional areas for future work include the design of alternative rewards (ex. based on time-dependent hazard ratios for death), and the application of risk-averse offline RL (ex. [34]).

4 Methods

4.1 Data sources and preprocessing

Our cohort consisted of adult patients (age ≥ 17) who satisfied the Sepsis-3 [35] criteria from the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC-III v1.4) database [36, 37]. We excluded patients with more than 25% missing values after creating hourly trajectories, and patients with no weight measurements recorded. The starting point of trajectories is ICU admission. We further excluded patients who were discharged from the ICU but died days or weeks later in the hospital: since we do not have access to their data after ICU release, treating the final ICU data as a terminal state would damage generalizability, yet we cannot treat these patients as survivors, as they were not released from the hospital.

Actions were selected by considering the hourly total volume of fluids (adjusted for tonicity) and the norepinephrine-equivalent hourly dose (mcg/kg) for vasopressors. In computing the equivalent rates of each treatment, we followed the exact same queries as Komorowski et al. [8]. When different fluids were administered, we summed the total fluid intake within the hour and discretized the resulting distribution. For vasopressors, we considered the maximum norepinephrine-equivalent rate administered within the hour to infer the hourly dose. We used a 0.15 mcg/kg/min norepinephrine-equivalent rate and 500 ml of fluids as the cutoffs between bins 1 and 2 when discretizing. These were chosen by considering the mean and median of non-zero rates together with medical knowledge; we also observe that, due to the low-dimensional action space, there is flexibility in choosing the cutoffs. A separate 0 action for each treatment was added to denote no treatment. Missing vitals and lab values were imputed using a last-value-carried-forward scheme, as long as missingness remained less than 25% of values. A detailed description of extraction, cleaning, and implementation-specific processing, as well as additional cohort details, is included in the supplementary information (Appendix A and Appendix B in S1 Text).
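As an illustration of this 9-action discretization, a minimal sketch follows; the fluid-major flat indexing and the treatment of the cutoffs as inclusive upper bounds are assumptions, not details from the paper.

```python
def discretize_action(fluid_ml_per_hr: float, vaso_mcg_kg_min: float) -> int:
    """Map hourly fluid volume and maximum norepinephrine-equivalent rate to one of
    9 discrete actions (3 fluid bins x 3 vasopressor bins), using the cutoffs
    stated in Section 4.1. Bin-edge conventions here are assumptions."""
    def bin3(value, cutoff):
        if value <= 0.0:
            return 0            # no treatment
        return 1 if value <= cutoff else 2

    fluid_bin = bin3(fluid_ml_per_hr, 500.0)     # ml of fluid given within the hour
    vaso_bin = bin3(vaso_mcg_kg_min, 0.15)       # max norepinephrine-equivalent rate
    return 3 * fluid_bin + vaso_bin              # flat index in {0, ..., 8}

assert discretize_action(0.0, 0.0) == 0          # no intervention
assert discretize_action(1000.0, 0.3) == 8       # high fluids + high vasopressor
```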
4.2 Models

4.2.1 Physiology-driven autoencoder. Autoencoders are a type of neural network which learns a useful latent, typically lower-dimensional, representation of input data, while assessing the fidelity of this representation by minimizing a data reconstruction error. Our autoencoder architecture provides an implicit regularization by constraining the latent states to have physiological meaning, and the decoder to be a fixed physiologic model described in the next section. We further use a denoising scheme by randomly zeroing out the input with a probability of 10–25% when feeding it into the network. This random corruption forces the network to take the whole patient trajectory (prior to the current time point) and previous treatment into account when producing its output, because it prevents the network from memorizing the current observation. In essence, we ask the inference network to predict the observable blood pressures and heart rate from a corrupted version of the history, by first projecting it into the cardiovascular latent state and then decoding that state to reconstruct the observations.

More precisely, at time t, suppose the full history up to and including t is represented by h_t. Then the output of the system, the reconstructed observables x̂_t, satisfies x̂_t = f(g(h̃_t, a_t, d)), where h̃_t is the corrupted history computed as h̃_t = p ⊙ h_t, where p is a vector of the same dimensions as h_t such that each element is sampled independently from a Bernoulli distribution, and ⊙ denotes element-wise multiplication. Here g and f are the encoder and the decoder, respectively, a_t denotes the treatment at time t, and d denotes the demographic variables. The decoder f is detailed in the next section, and g is the composition of neural networks shown in Fig 1.

Fig 1C shows the complete architecture of our inference network. As shown in the figure, the encoder is comprised of three neural networks: a patient encoder which computes initial hidden state estimates, a gated recurrent unit (GRU) [38] based recurrent neural network which encodes the past history of vitals and scores up to and including the current time point, and a transition network which takes the previous state, the action, and the history representation and outputs new cardiovascular state estimates. We train this structure end-to-end by minimizing the reconstruction loss using stochastic gradient-based optimization. The supplementary material (Appendix B in S1 Text) provides a detailed description of model and architecture hyper-parameters, and training details.

Cardiovascular model. The cardiovascular model is based on a two-element Windkessel model, illustrated using the electrical analog in Fig 1B. This model provides a lumped representation of the resistive and elastic properties of the entire arterial circulation using just two elements: a resistance R and a capacitance C, which represent the systemic vascular resistance (SVR) and the elastance properties of the entire systemic circulation, respectively. Despite its simplicity, this model has been previously used to predict hemodynamic responses to vasopressors [39] and as an estimator of cardiac output and SVR [40]. The differential equation representing this model is:

C dP(t)/dt + P(t)/R = Q(t)   (3)

where Q(t) represents the flow of blood into the arterial system. As explained in [39], over the interval [0, T] (where T is the filling time of the arterial system) we can write Q(t) as Q(t) = SV δ(t), where SV stands for Stroke Volume, the volume of blood ejected from the heart in a heartbeat. When the system is integrated over the interval [0, T] we obtain the following expressions for P_sys, P_dias, P_MAP, i.e., the systolic, diastolic, and mean arterial pressure, respectively:

P_sys = SV / [C (1 − e^{−T/(RC)})],   P_dias = P_sys e^{−T/(RC)},   P_MAP = (SV R) / T   (4)

T is the filling time and F is the heart rate, which is determined by T. This system of algebraic equations is used for the decoder of our autoencoder. Since heart rate can itself be affected by vasopressors and fluids, we added heart rate (F) as an additional cardiovascular state despite it being observable. Therefore we have a multivariate function f : {R, C, SV, F, T} → {P_sys, P_dias, P_MAP, F}, represented by the equations above and the trivial relationship F = F. (Despite the obvious relationship, we used both F and T for ease of training and stability.) As stated previously, to prevent the network from simply using the current observations, we use a denoising scheme for training. This ensures that, at a fixed time, the model cannot memorize the current observation and learn to invert f, since there is a nonzero probability of corruption. Thus it has to learn to factor in the history and the treatments when determining the cardiovascular states. Once SV is inferred, the cardiac output (CO) can be computed as CO = (SV)F. Since f is not one-to-one, typically not all states are identifiable. To arrive at a better approximation, we used the latent space to model only deviations from fixed baselines. We also posit that identifiable combinations of states, when trained with a denoising scheme, should provide important cardiovascular representations in the POMDP setting.
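A minimal sketch of the Windkessel decoder f follows, using the closed-form relations sketched in Eq (4); this is not the authors' code, and the unit conventions and example parameter values are illustrative assumptions.

```python
import math

def windkessel_decoder(R, C, SV, T):
    """Map cardiovascular states (R, C, SV, T) to observable pressures via the
    two-element Windkessel relations of Eq (4). Units are assumed consistent
    (e.g. R in mmHg.s/ml, C in ml/mmHg, SV in ml, T in s); values are illustrative."""
    tau = R * C                                   # arterial time constant
    decay = math.exp(-T / tau)                    # pressure decay over one beat
    p_sys = SV / (C * (1.0 - decay))              # peak (systolic) pressure after ejection
    p_dias = p_sys * decay                        # pressure at the end of the filling interval
    p_map = R * SV / T                            # mean arterial pressure = CO x R
    heart_rate = 60.0 / T                         # beats per minute, assuming T in seconds
    cardiac_output = SV * heart_rate              # ml/min
    return p_sys, p_dias, p_map, heart_rate, cardiac_output

# Illustrative values only: R ~ 1 mmHg.s/ml, C ~ 1.5 ml/mmHg, SV ~ 70 ml, T ~ 0.8 s
# give pressures of roughly 113/66 mmHg with a MAP near 88 mmHg.
print(windkessel_decoder(R=1.0, C=1.5, SV=70.0, T=0.8))
```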
4.2.2 Denoising GRU autoencoder for representing lab history. We use another recurrent autoencoder to represent patient lab history, motivated by the fact that labs are recorded only about once every 12 hours. Forward-filling the same observation for 12 time points is almost certainly sub-optimal, and the patterns of change in lab history can be helpful in learning a more faithful representation. Thus, we use a denoising GRU autoencoder constructed by stacking three multi-layer GRU networks on top of each other, with a decreasing number of nodes in each layer; the last, 10-dimensional, hidden layer was used as our representation. This architecture is motivated by architectures used in speech recognition [41]. This model was also trained by corrupting the input, where each data point was zeroed with a probability of up to 50% (the rate was gradually increased from 0 to 50%). As with the previous autoencoder, this provides an extra form of regularization and forces the learned representation to encode the entire history. Model architecture and training details are presented in the supplementary materials (Appendix B in S1 Text).

4.2.3 Behavior cloner. We use a standard multi-layer neural network as our imitation learner. This model is trained using stochastic gradient-based optimization by minimizing the negative log-likelihood loss between the predicted action and the observed clinician action, with added regularization to prevent overfitting. We note that there are many other options that could be used as an imitation learner, including nearest neighbor-based methods as in [10].

4.3 POMDP formulation

States. A state is represented by a 41-dimensional real-valued vector consisting of:

Demographics: Age, Gender, Weight.
Vitals: Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Mean Arterial Blood Pressure, Temperature, SpO2, Respiratory Rate.
Scores: 24-hour SOFA, Liver, Renal, CNS, and Cardiovascular scores.
Labs: Anion Gap, Bicarbonate, Creatinine, Chloride, Glucose, Hematocrit, Hemoglobin, Platelet, Potassium, Sodium, BUN, WBC.
Latent States: Cardiovascular states and the 10-dimensional lab history representation.

Actions. To ensure each action has considerable representation in the dataset, we discretize vasopressor and fluid administrations into 3 bins each, instead of 5 as in previous work [8–10]. This results in a 9-dimensional action space.

Timestep. 1 hour.

Rewards. We use the reward structure suggested by Raghu et al. [9], with a minor modification: since lactate was very sparse in our cohort, we only considered SOFA-based intermediate rewards. Specifically, whenever s_{t+1} is not terminal, we use a reward of the form:

r(s_t, a_t, s_{t+1}) = C0 · 1(SOFA_{t+1} = SOFA_t and SOFA_{t+1} > 0) + C1 (SOFA_{t+1} − SOFA_t)   (5)

For terminal rewards we set r(s_t, a, s_{t+1}) = 15 for survival and r(s_t, a, s_{t+1}) = −15 for non-survival.

4.3.1 Training. We only mention important details of training the RL algorithms here; representation learning-related training and implementation details are given in the supplementary information (Appendix B in S1 Text). We train the Q networks using a weighted random sampling-based experience replay, analogous to prioritized experience replay [42], which has resulted in superior performance in classical DRL domains such as Atari games. In particular, for each batch we sample transitions from a multinomial distribution, with higher weights given to terminal death states, near-death states (measured by time to eventual death), and terminal surviving states. We used a batch size of 100, and adjusted weights such that on average there is 1 surviving state and 1 death state in each batch. This does introduce bias with respect to the existing transition dataset; however, we argue that this corresponds to sampling transitions from a different data distribution which is closer to the true patient transition distribution we are interested in, as we are necessarily interested in reducing mortality. We empirically observe that, when using such a weighting scheme, the value distributions align more closely with clinical knowledge in identifying risky and near-death states. The same weighting scheme was used for all ensemble networks, which are trained to estimate uncertainty. As mentioned previously, we verified that the main results on vasopressor treatment strategies hold even for pure random sampling.

4.4 Uncertainty

In this section, we consider model uncertainty, not the inherent environment uncertainty. Model uncertainty stems from the data used in training, neural network architectures, training algorithms, and the training process itself. Inspired by statistical learning theory [43] and the associated structured risk minimization problem [44], we define the model uncertainty (conditioned on a state s and an action a), given our learning algorithm and model architecture, as:

u(s, a) = E_θ[ l(θ(s, a), θ̄(s, a)) ]   (6)

Here, D denotes the unknown distribution of ICU patient transitions with respect to which we are attempting to learn our policies, and θ is a random variable which characterizes the value distributions (for the C51 algorithm this can be interpreted as an element of the probability simplex over the fixed support points). This is output by our networks trained on a dataset sampled from D, for a given state-action pair, and θ̄(s, a) denotes a reference value distribution. The random variable θ is certainly dependent on the training data, and the remaining randomness stems from the inherent randomness of stochastic gradient-based optimization [45] and random weight initialization. The quantity l is a divergence metric appropriate for comparing probability distributions; we use the Kullback–Leibler divergence [46] for l.
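A minimal sketch of the resulting uncertainty estimate (made concrete in Section 4.4.1 below): average the KL divergence between each bootstrap-trained member's value distribution and a reference distribution, here taken as the ensemble mean. The direction of the KL divergence, the array shapes, and all names are assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions on the same support."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def model_uncertainty(member_dists, reference=None):
    """Monte-Carlo estimate of Eq (6) for one (state, action) pair.

    member_dists : array of shape (n_models, n_atoms); each row is the categorical
                   value distribution produced by one bootstrap-trained model.
    reference    : reference distribution; defaults to the ensemble mean, one of the
                   two choices mentioned in Section 4.4.1.
    """
    member_dists = np.asarray(member_dists)
    if reference is None:
        reference = member_dists.mean(axis=0)
    return float(np.mean([kl_divergence(d, reference) for d in member_dists]))

# Illustrative ensemble of 25 value distributions over 51 atoms (random here).
rng = np.random.default_rng(0)
fake_ensemble = rng.dirichlet(np.ones(51), size=25)
print(model_uncertainty(fake_ensemble))
```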
4.4.1 Estimating the uncertainty measure. We construct a Monte-Carlo estimate of the expectation in (6) by bootstrapping 25 different datasets, each substantially smaller than the full training dataset, and training identical distributional RL algorithms on each. This can be done efficiently due to the sample efficiency of distributional methods. Additionally, we can approximate the reference distribution θ̄(s, a) either by the ensemble value distribution or by the value distribution of the model trained on the full training dataset.

4.5 Uncertainty aware treatment

In this section, we describe a general framework for choosing actions that factors in uncertainty. Notice that, because our RL algorithm learns (an approximation of) the optimal value distributions, making decisions by considering additional information does not violate any assumption underlying the learning process. When suggesting safe treatment strategies, we want the proposed action to have a high expected value; however, we would also like our agent to be flexible enough to propose an action with less model uncertainty if two actions have expected values very close to each other. Another important factor to consider is how likely an action is to be taken by a human clinician. This is significant in situations where human expertise is scarce: large retrospective datasets subsume the experience of hundreds of clinicians, and knowing what previous clinicians have done in similar situations will be valuable in such situations. Therefore we use behavior cloning to learn an approximate average behavior policy of clinicians.

To satisfy all three goals, we propose a general framework for choosing actions based on an action preference score, p_{β,λ}(s, a), parameterized by two parameters. This general framework is flexible yet simple, and the end-user can choose the parameters to reflect their own expert knowledge and their confidence in the framework. Let G(s, a) be a human behavior likelihood score function; in this work we equate G(s, a) with the probabilities output by the behavior cloning network described in Section 4.2.3. Given a state s, we define the preference score associated with each action a as:

p_{β,λ}(s, a) = β Q̄(s, a) + (1 − β) G(s, a) − λ u(s, a)   (7)

where β, λ ≥ 0, u(s, a) is the parametric uncertainty associated with the state-action pair (s, a), G(s, a) is the behavior likelihood probability, and Q̄(s, a) is the Q function computed from the ensembled value distributions. When human expertise is available, G(s, a) can be modified or even re-defined to factor in expert opinion. λ penalizes uncertainty, and a low β forces the action to be close to a clinician action. We recover the expected value criterion by setting β = 1, λ = 0, and we can use the system as a pure behavior cloner by setting β = 0, λ = 0. Therefore β controls how far from the highest expected value (or behavior likelihood score) the agent can move when choosing an action.
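The preference score of Eq (7) reduces to a few lines. In this sketch the scaling of the ensemble Q-values relative to the behavior probabilities is left as-is, which is an assumption; all names and example values are illustrative.

```python
import numpy as np

def preference_scores(q_values, clinician_probs, uncertainties, beta, lam):
    """Action preference score in the spirit of Eq (7): a convex combination of the
    ensemble Q-value and the clinician behavior likelihood, penalized by model
    uncertainty. Any rescaling of Q to [0, 1] before mixing is an assumption."""
    q = np.asarray(q_values, dtype=float)
    g = np.asarray(clinician_probs, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    return beta * q + (1.0 - beta) * g - lam * u

# Illustrative 9-action example.
q = np.linspace(-2.0, 3.0, 9)                              # ensemble Q-values
g = np.full(9, 1.0 / 9)                                    # approximate clinician policy
u = np.abs(np.random.default_rng(1).normal(0.0, 0.5, 9))   # bootstrap uncertainties

greedy = int(np.argmax(preference_scores(q, g, u, beta=1.0, lam=0.0)))    # pure expected value
cautious = int(np.argmax(preference_scores(q, g, u, beta=0.7, lam=0.5)))  # uncertainty-aware
print(greedy, cautious)
```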

Supporting information

S1 Text. In this section, we provide brief descriptions of the supplementary text file, S1 Text. It provides additional exposition on results and methods complementing the main text and is divided into 4 sections. Cohort Details (Appendix A): This appendix summarizes our cohort, presenting summary statistics on the patients. Neural Network Architectures and Implementation Details (Appendix B): This section presents a detailed description of the neural networks used, implementation details, and the hyper-parameters involved; these cover the representation learning, RL, and uncertainty quantification components. Additional Results (Appendix C): This section presents further results in three subsections. RL Results: We present additional Reinforcement Learning results, including figures showing feature importance scores for all features, heat maps presenting the recommended actions under different schemes, and expected value evolution for validation patients. Uncertainty Quantification Results: We continue our discussion of uncertainty quantification results; the section presents a table of the mean model uncertainties for non-survivors and survivors, stratified by cohort and by action. OPE Results: This subsection includes results from off-policy evaluation (OPE), with value estimates under different preference scores covering all the ensembles; however, these results are subject to the caveats mentioned previously. Limitations and Open Problems (Appendix D): We conclude with a high-level discussion of the limitations of both our methods and RL methods in general, and we further detail some ideas for future work. https://doi.org/10.1371/journal.pdig.0000012.s001 (PDF)

[END]

[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000012


