


Emergence of belief-like representations through reinforcement learning [1]

Authors: Jay A. Hennig (Department of Psychology, Harvard University, Cambridge, Massachusetts, United States of America; Center for Brain Science); Sandra A. Romero Pinto (Department of Molecular and Cellular Biology)

Date: 2023-12

In this study we will consider Pavlovian associative learning tasks, where the sequence of observations and rewards is effectively independent of the agent’s actions. As a result, both the beliefs and the value function are independent of the agent’s actions, and the value function is simply a linear transformation of the beliefs (see Materials and methods). This motivates a straightforward model for estimating value in such partially observable environments [6, 7]: First compute beliefs, and then compute the value estimate as a linear transformation of those beliefs, with weights updated by TD learning. This model, which we will refer to as the “Belief model”, can be written as $\hat{V}_t = \mathbf{w}^\top \mathbf{b}_t$, where $\mathbf{b}_t \in [0, 1]^K$ is the model’s belief over the K discrete states, and only $\mathbf{w}$ is learned.
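
To make this concrete, the following is a minimal Python sketch of the Belief model’s TD learning rule, assuming the beliefs have already been computed for a sequence of time steps; the function and variable names (train_belief_model, beliefs, rewards) and the hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def train_belief_model(beliefs, rewards, gamma=0.93, eta=0.01, n_epochs=10):
    """TD(0) learning of a linear value readout on a fixed belief representation.

    beliefs: (T, K) array of belief vectors b_t (assumed precomputed)
    rewards: (T,) array of rewards r_t
    Returns weights w such that V_hat(t) = w @ beliefs[t].
    """
    T, K = beliefs.shape
    w = np.zeros(K)
    for _ in range(n_epochs):
        for t in range(T - 1):
            v_t, v_next = w @ beliefs[t], w @ beliefs[t + 1]
            delta = rewards[t] + gamma * v_next - v_t  # TD error
            w += eta * delta * beliefs[t]              # only w is updated
    return w
```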

The question is, what is an appropriate sufficient statistic? One standard answer is the posterior probability distribution over hidden states given the history of observations and actions, also known as the belief state [5]:

$b_t(i) = p(s_t = i \mid o_{1:t}, a_{1:t-1}) \propto p(o_t \mid s_t = i, a_{t-1}) \sum_{j=1}^{K} p(s_t = i \mid s_{t-1} = j, a_{t-1}) \, b_{t-1}(j) \quad (5)$

which stipulates how to update the belief in state i given observation $o_t$ and action $a_{t-1}$. Here we suppose there are K discrete states, though the above equation can be extended naturally to continuous state spaces.
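
As a rough illustration of this update in the Pavlovian setting (where the action dependence drops out), here is a sketch assuming a known transition matrix and observation likelihoods; the variable names (T_mat, obs_lik) are placeholders rather than the paper’s notation.

```python
import numpy as np

def update_belief(b_prev, obs, T_mat, obs_lik):
    """One step of the belief update (Eq 5), ignoring actions.

    b_prev:  (K,) previous belief b_{t-1}
    obs:     integer index of the current observation o_t
    T_mat:   (K, K) transition matrix, T_mat[j, i] = p(s_t = i | s_{t-1} = j)
    obs_lik: (K, n_obs) likelihoods, obs_lik[i, o] = p(o_t = o | s_t = i)
    """
    prior = b_prev @ T_mat               # predict: sum_j p(i | j) b_{t-1}(j)
    posterior = obs_lik[:, obs] * prior  # weight by likelihood of o_t
    return posterior / posterior.sum()   # normalize to a probability vector
```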

As for how we construct $\hat{V}_t$, note that in a partially observable Markov process, agents do not observe the state $s_t$ directly, but instead receive only observations $o_t$. However, observations are not in general Markovian, which means that $\hat{V}_t$ cannot be naively written as a function of $o_t$ alone, but must instead be a function of the entire history of observations: $(o_1, \ldots, o_t)$. One way of understanding this is to note that the value of an observation may depend on the long-term past [22]. In the dog leash example from the Introduction, the value (to the dog) of her owner picking up the leash depends on the history of events leading up to that moment—for example, if her owner recently announced that his car keys were missing. To use methods such as TD learning, which assume a Markovian state space, we require a compression of this history into a “sufficient statistic”—that is, a transformed state space over which the Markov property holds. Here we will suppose that, given such a sufficient statistic, $\mathbf{z}_t \in \mathbb{R}^D$, we can write $\hat{V}_t$ as a linear function:

$\hat{V}_t = \mathbf{w}^\top \mathbf{z}_t = \sum_{d=1}^{D} w_d z_{td} \quad (4)$

where $z_{td}$ is some feature (indexed by d) summarizing the history of observations, and $\mathbf{w} \in \mathbb{R}^D$ is a learned set of weights on those features. We can learn $\mathbf{w}$ using Eq (3) by noting that $\nabla_{\mathbf{w}} \hat{V}_t = \mathbf{z}_t$, and thus $\Delta \mathbf{w} = \eta \delta_t \mathbf{z}_t$.

Agents do not typically have access to the true value function. Instead, they have an estimate, $\hat{V}$, which they can update over time using sample paths of states and rewards. In TD learning, agents estimate the discrepancy in their estimated value function using a so-called temporal difference (TD) error, which is the definition of the RPE used in TD learning:

$\delta_t = r_t + \gamma \hat{V}_{t+1} - \hat{V}_t \quad (2)$

where $\hat{V}_{t+1}$ is taken to be 0 when $t+1$ is a terminal time step. In general, we will suppose $\hat{V}$ is determined by a set of adaptable parameters $\boldsymbol{\theta}$. We can improve $\boldsymbol{\theta}$ by following the stochastic gradient of the squared TD error:

$\Delta \boldsymbol{\theta} = \eta \delta_t \nabla_{\boldsymbol{\theta}} \hat{V}_t \quad (3)$

where 0 < η < 1 is a learning rate, and $\nabla_{\boldsymbol{\theta}} \hat{V}_t$ is the gradient of $\hat{V}_t$ with respect to $\boldsymbol{\theta}$.
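
As a toy numerical illustration of Eqs (2) and (3), the sketch below performs a single semi-gradient TD update for a linear value estimate; the feature vectors and all numbers are made up purely for illustration.

```python
import numpy as np

gamma, eta = 0.93, 0.1                 # discount factor and learning rate
theta = np.array([0.5, -0.2])          # adaptable parameters of V_hat
x_t = np.array([1.0, 0.0])             # features of the current state
x_next = np.array([0.0, 1.0])          # features of the next state
r_t = 1.0                              # reward at time t

delta = r_t + gamma * theta @ x_next - theta @ x_t  # TD error (Eq 2)
theta += eta * delta * x_t             # Eq (3): for linear V_hat, the gradient is x_t
```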

One standard objective of RL is to learn the expected discounted future return, or value, of each state:

$V(s_t) = \mathbb{E}\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \mid s_t \right] \quad (1)$

where $s_t$ is the state of the environment at time t, 0 ≤ γ < 1 is a discount factor, and $r_t$ is the reward at time t. Rewards are random variables that depend on the environment state, and $\mathbb{E}[\cdot]$ denotes an expectation over the potentially stochastic sequences of states and rewards. For notational simplicity, we will use the shorthand $V_t = V(s_t)$.
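
For intuition, the discounted sum in Eq (1) can be computed for a single sampled reward sequence with a backward recursion; averaging such sampled returns over many sequences beginning in the same state approximates V(s_t). This sketch is illustrative and not code from the paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.93):
    """Compute G_t = sum_i gamma^i * r_{t+i} for every t in one sampled
    reward sequence, using the recursion G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```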

Importantly, this RNN-based approach resolves all three challenges for learning a belief representation that we raised in the Introduction: 1) the model can learn from observations alone, as no information is provided about the statistics of the underlying environment; 2) the model’s size (parameterized by H, the number of hidden units) can be controlled separately from the number of states in the environment; and 3) the model’s only objective is to estimate value. Though such a model has no explicit objective of learning beliefs, the network may discover a belief representation implicitly. We next asked what signatures, if any, would indicate the existence of a belief representation. In the sections that follow we develop an analytical approach for determining whether the Value RNN’s learned representations resemble beliefs.

The Belief model presupposes that animals use a particular feature representation (i.e., beliefs) for estimating value. However, as we described in the Introduction, there are difficulties with assuming animals use a belief representation. Here we ask whether an alternative representation could be learned from the task of estimating value itself, rather than chosen a priori. Note that beliefs can be written as follows:

$\mathbf{b}_t = f(\mathbf{b}_{t-1}, o_t; \boldsymbol{\phi}) \quad (6)$

where f is a function parameterized by a specific choice of (fixed) parameters $\boldsymbol{\phi}$ to ensure the equality holds. This latter equation has the same form as a generic recurrent neural network (RNN). This suggests a model could learn its own representation by treating $\boldsymbol{\phi}$ as a learnable parameter. We refer to this alternative model as a “Value RNN”:

$\mathbf{z}_t = f(\mathbf{z}_{t-1}, o_t; \boldsymbol{\phi}) \quad (7)$
$\hat{V}_t = \mathbf{w}^\top \mathbf{z}_t \quad (8)$
$\Delta \boldsymbol{\theta} = \eta \delta_t \nabla_{\boldsymbol{\theta}} \hat{V}_t \quad (9)$

where $\mathbf{z}_t \in \mathbb{R}^H$ is the activity of an RNN with H hidden units and parameters $\boldsymbol{\phi}$, $\boldsymbol{\theta} = [\boldsymbol{\phi}, \mathbf{w}]$ is our vector of learned parameters, and $\nabla_{\boldsymbol{\theta}} \hat{V}_t$ can be calculated using backpropagation through time. The only difference from the Belief model is that the representation, $\mathbf{z}_t$, is learned (via $\boldsymbol{\phi}$). This allows the network to discover a representation—potentially distinct from beliefs—that is sufficient for estimating value.
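
A minimal PyTorch sketch of a Value RNN along these lines is shown below, assuming observations arrive as indicator channels (e.g., odor and reward) in discrete time bins; the class name, hyperparameters, and training loop are illustrative stand-ins rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class ValueRNN(nn.Module):
    """GRU whose hidden state z_t is the learned representation (Eq 7),
    with a linear readout V_hat_t = w^T z_t (Eq 8)."""
    def __init__(self, n_obs=2, hidden_size=50):
        super().__init__()
        self.gru = nn.GRU(n_obs, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, obs):                     # obs: (batch, T, n_obs)
        z, _ = self.gru(obs)                    # z_t = f(z_{t-1}, o_t; phi)
        return self.readout(z).squeeze(-1), z   # values: (batch, T)

def td_loss(values, rewards, gamma=0.93):
    """Squared TD error with the bootstrapped target held fixed
    (semi-gradient TD); gradients w.r.t. phi flow through z_t via
    backpropagation through time (Eq 9)."""
    target = rewards[:, :-1] + gamma * values[:, 1:].detach()
    return ((target - values[:, :-1]) ** 2).mean()

# One illustrative training step on a batch of observation/reward sequences:
# model = ValueRNN()
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# values, _ = model(obs)
# loss = td_loss(values, rewards)
# opt.zero_grad(); loss.backward(); opt.step()
```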

RNNs learn belief-like representations

As a working example, we will consider the probabilistic associative learning paradigm where dopamine RPEs were shown to be consistent with a belief representation [7, 13]. This has the added benefit of allowing us to confirm that the RNN-based approach described above can recapitulate these previous results.

This paradigm consisted of two tasks, which we will refer to as Task 1 and Task 2 (Fig 1). In both tasks, mice were trained to associate an odor cue with probabilistic delivery of a liquid reward 1.2–2.8s later. The tasks were each composed of two states: an intertrial interval (ITI), during which animals waited for an odor; and an interstimulus interval (ISI), during which animals waited for a reward. In Task 1, every trial contained both an odor and a reward. As a result, the animal’s observations could fully disambiguate the underlying state: An odor signaled a transition to the ISI state, while a reward signaled a transition to the ITI state. In Task 2, by contrast, reward on a given trial was omitted with 10% probability. This meant the underlying states were now only partially observable; for example, in Task 2 an odor signaled a transition to the ISI state with 90% probability.

Fig 1. Associative learning tasks with probabilistic rewards and hidden states. A. Trial structure in Starkweather et al. (2017) [7]. Each trial consisted of a variable delay (the intertrial interval, or ITI), followed by an odor, a second delay (the interstimulus interval, or ISI), and a potential subsequent reward. Reward times were sampled from a discretized Gaussian ranging from 1.2–2.8s (see Materials and methods). B-C. In both versions of the task, there were two underlying states: the ITI and the ISI. In Task 1, every trial was rewarded. As a result, an odor always indicated a transition from the ITI to the ISI, while a reward always indicated a transition from the ISI to the ITI. In Task 2, rewards were omitted on 10% of trials; as a result, an odor did not reveal whether or not the state transitioned to the ISI. https://doi.org/10.1371/journal.pcbi.1011067.g001

To formalize these tasks, we largely followed previous work [7, 13]. Each task was modeled as a discrete-time Markov process with states s_t ∈ {1, …, K}, where each t denotes a 200ms time bin (Fig 2A). These K “micro” states can be partitioned into those belonging to one of two “macro” states (corresponding to the ITI and the ISI; see Materials and methods). At each point in time, the agent’s observation is one of o_t ∈ {odor, reward, null} (Fig 2B). For each task, we trained the Belief model and multiple Value RNNs (N = 12, each initialized randomly) on a series of observations from that task to estimate value using TD learning (see Materials and methods). Each RNN was a gated recurrent unit (GRU) cell [23] with H = 50 hidden units. Before training, the Value RNN’s representation consisted of transient responses to each observation (S1 Fig). After training, we evaluated each model on a sequence of new trials from the same task (Fig 2C).

Fig 2. Observations, model representations, value estimates, and reward prediction errors (RPEs) during Task 2. A. State transitions and observation probabilities in Task 2. Each macro-state (ISI or ITI) is composed of micro-states denoting elapsed time; this allows for probabilistic reward times and minimum dwell times in the ISI and ITI, respectively. B. Observations emitted by Task 2 during two example trials. Note that omission trials are indicated only implicitly as the absence of a reward observation. C. Example representations (b_t, z_t) and value estimates ($\hat{V}_t$) of two models (Belief model, left; Value RNN, right) for estimating value in partially observable environments, after training. D. After training, both models exhibit similar RPEs. https://doi.org/10.1371/journal.pcbi.1011067.g002
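
To give a sense of the simulated inputs, here is a simplified generator of observation sequences with this trial structure; the ITI distribution and reward-time discretization below are rough stand-ins for those in the paper (Task 1 corresponds to omission_prob = 0), and all names are illustrative.

```python
import numpy as np

def simulate_task(n_trials=1000, omission_prob=0.1, dt=0.2, seed=0):
    """Generate odor and reward indicator sequences in 200 ms bins:
    a variable ITI, an odor, then a reward 1.2-2.8 s later (omitted
    with probability omission_prob)."""
    rng = np.random.default_rng(seed)
    reward_delays = np.arange(1.2, 2.81, 0.2)                # possible delays (s)
    probs = np.exp(-0.5 * ((reward_delays - 2.0) / 0.5) ** 2)
    probs /= probs.sum()                                     # discretized Gaussian
    odors, rewards = [], []
    for _ in range(n_trials):
        iti_bins = int(2.0 / dt) + rng.geometric(0.1)        # simplified variable ITI
        isi_bins = int(round(rng.choice(reward_delays, p=probs) / dt))
        o = np.zeros(iti_bins + isi_bins + 1)
        r = np.zeros_like(o)
        o[iti_bins] = 1.0                                    # odor ends the ITI
        if rng.random() > omission_prob:
            r[iti_bins + isi_bins] = 1.0                     # reward, unless omitted
        odors.append(o)
        rewards.append(r)
    return np.concatenate(odors), np.concatenate(rewards)
```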

To confirm that this approach could recapitulate previous results, we measured the RPEs of each trained model (Fig 2D), where the model RPEs are calculated using Eq (2). Previous work showed that dopamine activity depended on the reward time differently in the two tasks, with activity decreasing as a function of reward time in Task 1, but increasing as a function of reward time in Task 2 [7] (Fig 3A). As in previous work, we found that this pattern was also exhibited by the RPEs of the Belief model (Fig 3B). We found that the RPEs of the Value RNN exhibited the same pattern (Fig 3C). In particular, the Value RNN’s RPEs became nearly identical to those of the Belief model after training (Fig 3D). We emphasize that the Value RNN was not trained to match the value estimate from the Belief model; rather, it was trained via TD learning using only observations. This result shows that, through training on observations alone, Value RNNs discovered a representation that was sufficient both for learning the value function and for reproducing previously observed patterns in empirical dopamine activity.

Fig 3. RPEs of the Value RNN resemble both mouse dopamine activity and the Belief model. A. Average phasic dopamine activity in the ventral tegmental area (VTA) recorded from mice trained in each task separately. Black traces indicate trial-averaged RPEs relative to an odor observed at time 0, prior to reward; colored traces indicate the RPEs following each of nine possible reward times. RPEs exhibit opposite dependence on reward time across tasks. Reproduced from Starkweather et al. (2017) [7]. B-C. Average RPEs of the Belief model and an example Value RNN, respectively. Same conventions as panel A. D. Mean squared error (MSE) between the RPEs of the Value RNN and Belief model, before and after training each Value RNN. Small dots depict the MSE of each of N = 12 Task 1 RNNs and N = 12 Task 2 RNNs, and circles depict the median across RNNs. https://doi.org/10.1371/journal.pcbi.1011067.g003
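
For reference, the model RPEs used in these comparisons follow directly from Eq (2) applied to a model’s value estimates; a minimal sketch (the function name and discount value are illustrative):

```python
import numpy as np

def rpes_from_values(values, rewards, gamma=0.93):
    """RPEs delta_t = r_t + gamma * V_hat_{t+1} - V_hat_t (Eq 2),
    computed from a model's value estimates along one session."""
    values, rewards = np.asarray(values), np.asarray(rewards)
    return rewards[:-1] + gamma * values[1:] - values[:-1]
```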

We next asked whether the Value RNN learned to estimate value using representations that resembled beliefs. We considered three approaches to answering this question. First, we asked whether beliefs could be linearly decoded from the RNN’s activity. Next, because beliefs are the optimal estimate of the true state in the task, we asked whether RNN activity could similarly be used to decode the true state. Finally, we took a dynamical systems perspective, and asked whether the RNN and beliefs had similar dynamical structure.

RNN activity readout was correlated with beliefs. We first asked whether there was a readout of the Value RNN’s representation, z_t, that correlated with beliefs. Because the belief and RNN representations did not necessarily have the same dimensionality, we performed a multivariate linear regression to find the linear transformation of each RNN’s activity that came closest to matching the beliefs (see Materials and methods). In other words, we found the linear transformation, $\mathbf{W} \in \mathbb{R}^{K \times H}$, that could map each RNN’s activity, $\mathbf{z}_t \in \mathbb{R}^H$, to best match the belief vector, $\mathbf{b}_t \in [0, 1]^K$, across time: $\hat{\mathbf{b}}_t = \mathbf{W} \mathbf{z}_t \approx \mathbf{b}_t$. To quantify performance, we used held-out sessions to measure the total variance of the beliefs that was explained by the linear readout of RNN activity (R2; see Materials and methods). We found that this readout of the Value RNN’s activity explained most of the variance of beliefs (Fig 4B; Task 1 R2: 0.61 ± 0.01, mean ± SE, N = 12; Task 2 R2: 0.67 ± 0.02, mean ± SE, N = 12), substantially above the variance explained when using an RNN’s activity before training (Task 1 R2: 0.38 ± 0.01, mean ± SE, N = 12; Task 2 R2: 0.41 ± 0.00, mean ± SE, N = 12). This is not a trivial result of the network’s training objective, as the Value RNN’s target (i.e., value) is only a 1-dimensional signal, whereas beliefs are a K-dimensional signal (here, K = 25). Nevertheless, we found that training a Value RNN to estimate value resulted in its representation becoming more belief-like, in the sense of encoding more information about beliefs.

Fig 4. Value RNN activity readout was correlated with beliefs and could be used to decode hidden states. A. Example observations, states, beliefs, and Value RNN activity from the same Task 2 trials shown in Fig 2. States and beliefs are colored as in Fig 2, with black indicating ITI microstates, and other colors indicating ISI microstates. Note that the states following the second odor observation remain in the ITI (black) because the second trial is an omission trial. Bottom traces depict the linear transformation of the RNN activity that comes closest to matching the beliefs. Total variance explained (R2) is calculated on held-out trials. B. Total variance of beliefs explained (R2), on held-out trials, using different trained and untrained Value RNNs, in both tasks. Same conventions as Fig 3D. C. In purple, the cross-validated log-likelihood of linear decoders trained to estimate true states using RNN activity. Same conventions as Fig 3D. Black circle indicates the log-likelihood when using the beliefs as the decoded state estimate (i.e., no decoder is “trained”). https://doi.org/10.1371/journal.pcbi.1011067.g004
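
A sketch of this belief-readout analysis using ordinary least squares and held-out data is given below; the paper’s exact regression and cross-validation procedure may differ, and the variable names are placeholders.

```python
import numpy as np

def belief_readout_r2(z_train, b_train, z_test, b_test):
    """Fit a linear map (with intercept) from RNN activity z_t to beliefs b_t
    by least squares, then report total variance explained (R^2) on held-out
    data: 1 - SS_residual / SS_total."""
    Z = np.hstack([z_train, np.ones((len(z_train), 1))])
    W, *_ = np.linalg.lstsq(Z, b_train, rcond=None)
    Z_test = np.hstack([z_test, np.ones((len(z_test), 1))])
    b_pred = Z_test @ W
    ss_res = np.sum((b_test - b_pred) ** 2)
    ss_tot = np.sum((b_test - b_test.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```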

RNN activity could be used to decode hidden states. The previous analysis assessed how much information about beliefs was encoded by the RNN’s representation. Given that the belief representation is a probability estimate over all hidden states, we next asked whether the ground truth state could be decoded from the RNN’s representation. To do this, we performed a multinomial logistic regression to find a linear transformation of each RNN’s activity that maximized the log-likelihood of the true states (see Materials and methods). We quantified performance on held-out sessions by evaluating the log-likelihood of the decoded estimates. Because the beliefs capture the posterior distribution of the state given the observations under the true generative model, the log-likelihood of the beliefs is a ceiling on performance. We found that the log-likelihoods of the decoders trained on the RNNs’ activity approached those of the beliefs, and easily outperformed the decoders that used the activity of the RNNs before training (Fig 4C). Thus, training an RNN to estimate value resulted in a representation that could be used to more accurately decode the true state.
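
A sketch of this state-decoding analysis using scikit-learn’s multinomial logistic regression, evaluated by the mean log-likelihood of held-out states, is given below; the regularization and cross-validation settings are placeholders for those described in Materials and methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def state_decoding_loglik(z_train, s_train, z_test, s_test):
    """Decode the true hidden state from RNN activity and report the mean
    log-likelihood of the held-out states under the fitted decoder."""
    clf = LogisticRegression(max_iter=1000).fit(z_train, s_train)
    probs = clf.predict_proba(z_test)                 # (T, K) decoded state probabilities
    col = {c: i for i, c in enumerate(clf.classes_)}  # map state labels to columns
    idx = np.array([col[s] for s in s_test])
    return np.mean(np.log(probs[np.arange(len(s_test)), idx] + 1e-12))
```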

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011067

Published and (C) by PLOS Computational Biology
Content appears here under the following license: Creative Commons Attribution 4.0 (CC BY).
