


Mixtures of strategies underlie rodent behavior during reversal learning [1]

Nhat Minh Le, Murat Yildirim (Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America; Picower Institute for Learning and Memory)

Date: 2023-10

In reversal learning tasks, the behavior of humans and animals is often assumed to be uniform within single experimental sessions to facilitate data analysis and model fitting. However, the behavior of agents can display substantial variability within single experimental sessions, as they execute different blocks of trials with different transition dynamics. Here, we observed that in a deterministic reversal learning task, mice display noisy and sub-optimal choice transitions even at the expert stages of learning. We investigated two sources of this sub-optimality in the behavior. First, we found that mice exhibit a high lapse rate during task execution, as they reverted to unrewarded directions after choice transitions. Second, we unexpectedly found that a majority of mice did not execute a uniform strategy, but rather mixed between several behavioral modes with different transition dynamics. We quantified the use of such mixtures with a state-space model, the block Hidden Markov Model (blockHMM), to dissociate the mixtures of dynamic choice transitions in individual blocks of trials. Additionally, we found that blockHMM transition modes in rodent behavior can be accounted for by two different types of behavioral algorithms that might be used to solve the task: model-free and inference-based learning. Combining these approaches, we found that mice used a mixture of both exploratory, model-free strategies and deterministic, inference-based behavior in the task, explaining their overall noisy choice sequences. Together, our combined computational approach highlights intrinsic sources of noise in rodent reversal learning behavior and provides a richer description of behavior than conventional techniques, while uncovering the hidden states that underlie the block-by-block transitions.

Humans and animals can use diverse decision-making strategies to maximize rewards in uncertain environments, but previous studies have not investigated the use of multiple strategies that involve distinct latent switching dynamics in reward-guided behavior. Here, using a reversal learning task, we showed that mice displayed much more variable behavior than would be expected from a uniform strategy, suggesting that they mix between multiple behavioral modes in the task. We developed a computational method to dissociate these learning modes from behavioral data, addressing the challenges faced by current analytical methods when agents mix between different strategies. We found that the use of multiple strategies is a key feature of rodent behavior even in the expert stages of learning, and applied our tools to quantify the highly diverse strategies used by individual mice in the task. We further mapped these behavioral modes to two types of underlying algorithms, model-free Q-learning and inference-based behavior. These rich descriptions of underlying latent states form the basis for detecting abnormal patterns of behavior in reward-guided decision-making.

Reversal learning is a behavioral paradigm that is often used to study the cognitive processes underlying reward-guided action selection and the evaluation of actions based on external feedback [1, 2]. Experiments using this task in humans and diverse animal models have contributed to our understanding of the cortical and subcortical circuits that are involved in components of value-guided decision-making such as the evaluation of reward-prediction errors and value [2-6], assessment of uncertainty [7, 8], and model-based action selection [9-11]. Detection of aberrant behavioral patterns in reversal learning is critical in clinical diagnostics, as these disruptions are often involved in neuropsychiatric disorders such as obsessive-compulsive disorder, schizophrenia, and Parkinson's disease [12-14], as well as neurodevelopmental disorders [15, 16].

In uncertain environments, simple models with relatively few parameters can be fitted to predict the behavior of reinforcement learning agents. For example, an often-used approach is to fit the behavior with a reinforcement learning agent with a learning rate parameter, together with an inverse temperature or exploration parameter [4, 17-19]. However, recent studies of rodent behavior in reversal learning have revealed more complex behavior in this task, suggesting that simple models might not be sufficient to capture natural behavior. For example, when the level of uncertainty in the environment changes over the course of the experiments, mice can adapt their learning rates according to the statistics of the environment [20, 21], suggesting that the learning rate is not fixed across trials but varies depending on their internal estimates of environmental uncertainty. Furthermore, rodent behavior comprises two concurrent cognitive processes, a reward-seeking component and a perseverative component, which operate on different time scales during the training session [22]. These previous modeling approaches suggest a rich diversity in rodent behavior in the task and the need for sophisticated computational techniques to model the behavior.

A source of behavioral variability that has not been well characterized in previous studies is the use of mixed strategies in reversal learning. In other behavioral tasks, state-space modeling has revealed the existence of multiple behavioral states that interchange during sensory discrimination [23, 24]. It is unclear whether the use of multiple strategies also exists in rodents' reversal learning behavior, and if so, what components exist in the rodents' behavioral repertoire. It is also unknown from previous studies whether mice use these complex strategies only in difficult reversal learning tasks, or whether complex strategies are also commonly used even in relatively simple, deterministic reversal learning environments.

Here, we investigated these questions using a combination of behavioral experiments and new computational methods to analyze the mixture of strategies in rodent reward-guided behavior. We studied the behavior of mice in a deterministic reversal learning task involving two alternative choices, a simple task that can be solved optimally by a "win-stay, lose-shift" strategy. Despite this simplicity, we found that mice exhibited sub-optimal behavior in the task and deviated significantly from a uniform strategy. To dissociate the components of these mixed strategies, we built on a previous state-space approach [24] to develop a blockwise hidden Markov model (blockHMM), which allows inferring the latent states that govern rodent behavior within single sessions. Using this tool, we classified and characterized different modes of behavior and found that they can be grouped into four main classes: a "low-performance" class, two "intermediate-performance" classes, and a "high-performance" class. Finally, we showed that these diverse modes of behavior can be accounted for by two different models: model-free behavior involving trial-by-trial value adjustments, and inference-based behavior involving Bayesian inference of the underlying hidden world state. These new results and methods highlight the use of mixtures of strategies as a significant source of variability in rodent behavior during reversal learning, even in a deterministic setting with little uncertainty.
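To make the blockHMM idea concrete, the following is a minimal generative sketch in Python, assuming that each block of trials is emitted by one of K latent modes, each described by a sigmoidal transition function (offset, slope, lapse), and that the active mode switches between blocks according to a Markov chain. The parameter values, the specific emission form, and the transition matrix are illustrative placeholders; the actual model definition and fitting procedure are given in the paper's Methods.

```python
import numpy as np

def sigmoid_transition(t, offset, slope, lapse):
    """P(correct choice) on trial t of a block, under a sigmoidal switch
    with a given offset, slope, and asymptotic lapse rate (assumed form)."""
    return lapse + (1 - 2 * lapse) / (1 + np.exp(-slope * (t - offset)))

def sample_blockhmm(n_blocks=20, block_len=20, seed=0):
    """Generative sketch: a Markov chain over K = 3 latent modes,
    each emitting a block of binary correct/incorrect choices."""
    rng = np.random.default_rng(seed)
    # Illustrative mode parameters: (offset, slope, lapse)
    modes = [(6.0, 0.5, 0.3),   # slow and lapsey ("low performance")
             (3.0, 1.0, 0.15),  # intermediate
             (1.0, 3.0, 0.02)]  # sharp, near win-stay lose-shift
    K = len(modes)
    # Sticky transition matrix over modes (assumed, not fitted)
    A = np.full((K, K), 0.1)
    np.fill_diagonal(A, 0.8)
    z = rng.integers(K)                 # initial mode
    modes_seq, choices = [], []
    for _ in range(n_blocks):
        offset, slope, lapse = modes[z]
        p_correct = sigmoid_transition(np.arange(block_len), offset, slope, lapse)
        choices.append(rng.random(block_len) < p_correct)
        modes_seq.append(z)
        z = rng.choice(K, p=A[z])       # mode for the next block
    return np.array(modes_seq), np.array(choices)

modes_seq, choices = sample_blockhmm()
print("mode per block:", modes_seq)
print("block performance:", choices.mean(axis=1).round(2))
```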

Results

Mice display sub-optimal behavior in a 100-0 reversal learning task

We trained head-fixed mice on a reversal-learning task involving two alternative actions (Fig 1A). Mice were placed on a vertical rotating wheel [25], and on each trial they were trained to perform one of two actions, left or right wheel turns. On each trial, one movement was rewarded with a probability of 100% and the other with the complementary probability of 0% (Fig 1B). The environments were volatile such that the high- and low-value sides switched after a random number of trials sampled between 15 and 25, without any external cues, requiring animals to recognize block transitions using only the reward feedback. To ensure stable behavioral performance, we required the average performance over the last 15 trials of each block to be at least 75% before a state transition occurred. We collected behavioral data from n = 21 mice that were trained in the task for up to 30 sessions per animal (typical animal behavior shown in Fig 1C).
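As a concrete reference for the task structure described above, here is a minimal simulation sketch. The block-length range, deterministic 100-0 feedback, and 75% running-performance criterion follow the description in the text; the policy interface and the trial cap that keeps the sketch finite are simplifying assumptions, not part of the actual experiment.

```python
import numpy as np

def run_session(policy, n_blocks=10, seed=0):
    """Sketch of the 100-0 reversal environment: the rewarded side flips
    after a block whose nominal length is drawn uniformly from 15-25 trials,
    and only once running performance over the last 15 trials reaches 75%.
    `policy(prev_choice, prev_reward)` returns 0 (left) or 1 (right)."""
    rng = np.random.default_rng(seed)
    state = int(rng.integers(2))          # 0: left rewarded, 1: right rewarded
    choices, rewards, states = [], [], []
    prev_c, prev_r = None, None
    for _ in range(n_blocks):
        block_len = int(rng.integers(15, 26))
        t, window = 0, []
        while True:
            c = policy(prev_c, prev_r)
            r = int(c == state)           # deterministic 100-0 feedback
            choices.append(c); rewards.append(r); states.append(state)
            window = (window + [r])[-15:]
            t += 1
            prev_c, prev_r = c, r
            # block ends once length and performance criterion are met
            # (the t > 200 cap only keeps this sketch finite)
            if t >= block_len and (np.mean(window) >= 0.75 or t > 200):
                break
        state = 1 - state                 # uncued reversal of the hidden state
    return np.array(choices), np.array(rewards), np.array(states)

def wsls(prev_c, prev_r):
    """Win-stay lose-shift policy, used here as an example."""
    if prev_c is None:
        return 0
    return prev_c if prev_r == 1 else 1 - prev_c

choices, rewards, states = run_session(wsls)
print("overall accuracy:", round(float((choices == states).mean()), 3))
```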


Fig 1. Mice display noisy behavior in a deterministic reversal learning task. (a) (Top) Behavioral task setup for head-fixed mice with a freely rotating wheel. Schematic created with biorender.com. (Bottom) Timing structure for each trial, demarcating the cue, movement, and outcome epochs. (b) Structure of the deterministic reversal learning task in a '100-0' environment. Hidden states alternated between right-states, with high reward probability for right actions, and left-states, with high reward probability for left actions. The block lengths were randomly sampled from a uniform distribution between 15-25 trials. (c) Example behavioral performance of an animal in the reversal learning task; block transitions are demarcated by vertical dashed lines. Dots and crosses represent individual trials (correct or incorrect). Black trace indicates the rolling performance over 15 trials. (d) Session-averaged performance of all mice (n = 21) during training over 30 sessions. Dashed line indicates the ideal win-stay lose-shift (WSLS) strategy. (e) Illustration of the sigmoidal transition function with three parameters: switch offset s, switch slope α, and lapse ε. Session-averaged switch offset (f), slope (g), and lapse (h) of all mice (n = 21) during training over 30 sessions. Dashed line indicates the ideal win-stay lose-shift (WSLS) strategy. https://doi.org/10.1371/journal.pcbi.1011430.g001

In the 100-0 environment, the optimal strategy that yields maximum rewards is win-stay lose-shift: repeating rewarded actions and switching actions after receiving an error, which signals the beginning of the next block. Following this strategy, the accuracy in each block would be 93-96% (14/15 to 24/25, depending on the block length). Expert rodent behavior fell below this optimal level (Fig 1D), as performance asymptoted at only 62% (the range of expert performance on day 30 across all animals was 30%-76%), pointing to sub-optimal reversal strategies that fundamentally underlie their behavioral patterns in the task. Deviations from a perfect win-stay lose-shift strategy could occur for two reasons: animals might make more errors at the beginning of the block (early, persistent errors), or they might have a sustained error rate even after switching sides (late, regressive errors) [26, 27]. To determine the source of the sub-optimality, we examined the number of "initial errors", the average number of trials it took for animals to switch directions per block, and the "late performance", their average performance on the last 10 trials of a block. A win-stay lose-shift agent would have 1 initial error and 100% late performance (dashed lines, S1A and S1B Fig). Experimental animals showed both a higher number of initial errors (1.9 ± 0.2 initial errors) and lower late performance (79 ± 4% performance; mean ± standard error for n = 21 animals) compared to the ideal win-stay lose-shift agent (S1A and S1B Fig). Overall performance was not correlated with side bias (difference in performance between left and right blocks; R = -0.3, p = 0.2, S1C Fig). Across all mice, there was no significant difference in block performance on left versus right blocks (S1D Fig; p > 0.05 for 29/30 sessions, Wilcoxon signed rank test). To characterize their switching dynamics more precisely, we fitted a logistic regression model with three parameters to the observed choices (Fig 1E).
These three parameters represent the latent transition between actions: the switch offset s, slope α, and lapse ε. The switch offset measures the latency of the switch, the slope measures the sharpness of the transition, and the lapse rate reflects the behavioral performance after the transition. A win-stay lose-shift agent would have an offset close to 1, a very high slope, and zero lapse ("WSLS" dashed lines, Fig 1F-1H). Expert mice on day 30 instead showed a significantly longer switch offset, gentler slope, and higher lapse rate (p < 0.01, p < 0.001, p < 0.001, respectively, Wilcoxon signed rank test; Fig 1F-1H). Thus, sub-optimal reversal learning in rodents was due to a combination of both slow switching and high lapse rates.
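For readers who want to reproduce this kind of fit, the sketch below estimates the three transition parameters by maximum likelihood on block-aligned binary choices. The exact parameterization of the sigmoid (here saturating at 1 - ε), the bounds, and the helper names are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def transition_curve(t, offset, slope, lapse):
    """Assumed three-parameter form: probability of choosing the newly
    rewarded side on trial t after a block switch, saturating at 1 - lapse."""
    return lapse + (1 - 2 * lapse) / (1 + np.exp(-slope * (t - offset)))

def fit_transition(trials, switched, x0=(3.0, 1.0, 0.1)):
    """Maximum-likelihood fit of (offset, slope, lapse) to binary data:
    `trials` is the trial index within each block, `switched` is 1 if the
    animal chose the newly rewarded side on that trial."""
    def nll(params):
        offset, slope, lapse = params
        p = np.clip(transition_curve(trials, offset, slope, lapse), 1e-6, 1 - 1e-6)
        return -np.sum(switched * np.log(p) + (1 - switched) * np.log(1 - p))
    res = minimize(nll, x0, bounds=[(0, 25), (0.01, 10), (0, 0.5)],
                   method="L-BFGS-B")
    return res.x

# Toy usage with synthetic block-aligned choices
rng = np.random.default_rng(1)
trials = np.tile(np.arange(20), 30)                     # 30 blocks, 20 trials each
p_true = transition_curve(trials, offset=2.0, slope=1.5, lapse=0.15)
switched = (rng.random(trials.size) < p_true).astype(int)
print("fitted (offset, slope, lapse):", np.round(fit_transition(trials, switched), 2))
```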

Mice display diverse and non-uniform switching dynamics in single sessions

Our logistic regression model assumes rodent behavior is uniform in each session, but it is not clear whether this assumption is valid. For instance, if mice use different transition modes within single sessions, the previous analysis would be insufficient to describe the switching dynamics. Supporting the possibility of multiple strategies within a session, we observed highly variable block-by-block performance and lapse rates within behavioral sessions. For example, the block-by-block performance and lapse rates of one animal, E54, were highly variable even at the expert stage (days 26-30 of training; Fig 2A). This variability is much higher than would be expected from a uniform strategy (red error bars; Fig 2A). There was also no apparent change of strategies between the first and last blocks of the session, suggesting that these sources of randomness occur sporadically during the session rather than reflecting a change in the animal's motivation with satiety.


Fig 2. Non-uniform performance of mice in reversal learning. (a) Example performance of a mouse (E54) in relation to its lapse rate in the last 5 days of training (days 26-30). Individual dots show the combinations of performance and lapse rates in single blocks. Lighter blue dots represent early blocks in the session while dark blue dots are late blocks. Red error bars represent the expected mean and standard deviation in performance and lapse rate assuming the mouse uses a single strategy. (b) Comparison of the observed standard deviation of block performances in the final five training sessions (black vertical lines) with the expected standard deviation in performance for an agent that uses a uniform strategy (box plots, n = 100 bootstrap runs). Each row represents one of 21 experimental mice (IDs of animals shown on the y-axis). The average performance of each animal on the last 5 days of training, E, is shown on the right. https://doi.org/10.1371/journal.pcbi.1011430.g002

This unexpected increase in behavioral variability was consistently observed in our cohort of animals (Fig 2B). For each animal, we computed the "observed" variability in performance, the standard deviation of performance across all blocks in the final 5 training sessions (black bars, Fig 2B). We then computed the degree of variability that would be expected from a uniform strategy. To measure this expected variability, we fit a single sigmoidal transition curve to the behavior in these last 5 sessions and generated the behavior of an agent that always executes these transition dynamics in all blocks of these sessions. We again computed the standard deviation in performance of this simulated behavior, and repeating this procedure for N = 100 runs yielded a distribution of this variability measure for the uniform agent. In 20/21 animals, the observed variability in performance was significantly higher than expected (p < 0.01 relative to the bootstrapped distribution), suggesting that rodent behavior is highly non-uniform and that mice could be using multiple transition strategies within their behavioral sessions.
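A sketch of this bootstrap comparison is shown below: simulate an agent that applies a single fitted transition curve to every block, collect the standard deviation of block performances over repeated runs, and compare an observed value against that distribution. The transition-curve form, the fitted parameters, and the "observed" value are illustrative placeholders, not values from the data.

```python
import numpy as np

def transition_curve(t, offset, slope, lapse):
    # Same assumed sigmoidal form as in the earlier sketch
    return lapse + (1 - 2 * lapse) / (1 + np.exp(-slope * (t - offset)))

def block_perf_sd_uniform_agent(params, block_lens, n_boot=100, seed=0):
    """Distribution of the standard deviation of block performance for an
    agent that applies one fixed transition curve in every block."""
    rng = np.random.default_rng(seed)
    sds = []
    for _ in range(n_boot):
        perfs = []
        for L in block_lens:
            p = transition_curve(np.arange(L), *params)
            perfs.append((rng.random(L) < p).mean())
        sds.append(np.std(perfs))
    return np.array(sds)

# Toy usage: compare an "observed" SD against the uniform-agent distribution
rng = np.random.default_rng(2)
block_lens = rng.integers(15, 26, size=60)        # roughly 5 sessions of blocks
fitted_params = (2.5, 1.2, 0.1)                   # stands in for the fitted curve
boot_sds = block_perf_sd_uniform_agent(fitted_params, block_lens)
observed_sd = 0.25                                # placeholder observed value
p_value = (boot_sds >= observed_sd).mean()
print(f"expected SD {boot_sds.mean():.3f} +/- {boot_sds.std():.3f}, "
      f"observed {observed_sd}, p = {p_value:.2f}")
```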

The diversity of transition modes is accounted for by the spectrum of model-free and inference-based strategies

Our analytical approach decomposes rodent behavior in the reversal learning task into a number of switching modes, each characterized by a respective transition curve. Some of these transition modes have very fast offsets and close-to-zero lapse rates, resembling the behavior of a win-stay lose-shift agent. However, other modes were sub-optimal and cannot be explained by such a strategy. What types of underlying algorithms might give rise to these modes? We show next that the diversity in transition modes can be sufficiently accounted for by the space of Q-learning and inference-based behavior, two types of algorithms that are frequently discussed in the reversal learning literature. More concretely, the spectrum of transition functions might be accounted for by the variability between two classes of agents: Q-learning and inference-based agents. Q-learning is a model-free learning strategy that performs iterative value updates based on external feedback from the environment (Fig 5A, top). In the reversal learning task, the agent maintains two values for the left and right actions, q_L and q_R. The value of the chosen action on each trial is updated according to

q_chosen ← q_chosen + γ (r − q_chosen),    (1)

where r is the trial outcome (0 for errors or 1 for rewards), and γ is the learning rate parameter. We additionally assumed that the agent adopts an ε-greedy policy, choosing the higher-valued action with probability 1 − ε, and choosing actions at random (with probability 50%) on a small fraction ε of trials.
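A minimal simulation of such an ε-greedy Q-learning agent in a simplified reversal environment is sketched below. The delta-rule update follows Eq 1 as described above; the fixed-length blocks, parameter values, and tie-breaking by argmax are simplifying assumptions for illustration.

```python
import numpy as np

def simulate_q_agent(gamma=0.3, eps=0.1, n_blocks=20, block_len=20, seed=0):
    """Model-free agent: two action values updated by a delta rule with
    learning rate gamma; epsilon-greedy action selection."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    state = 0                                  # 0: left rewarded, 1: right rewarded
    choices, rewards, states = [], [], []
    for _ in range(n_blocks):
        for _ in range(block_len):             # fixed-length blocks for simplicity
            if rng.random() < eps:
                c = int(rng.integers(2))       # explore: random action
            else:
                c = int(np.argmax(q))          # exploit: higher-valued action
            r = int(c == state)                # deterministic 100-0 feedback
            q[c] += gamma * (r - q[c])         # delta-rule update (Eq 1)
            choices.append(c); rewards.append(r); states.append(state)
        state = 1 - state                      # reversal
    return np.array(choices), np.array(rewards), np.array(states)

choices, rewards, states = simulate_q_agent(gamma=0.5, eps=0.2)
print("accuracy:", round(float((choices == states).mean()), 3))
```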


Fig 5. Mapping transition dynamics to underlying behavioral strategies. (a) Implementation of Q-learning (top) and inference-based algorithms (bottom) for simulating choice sequences of artificial agents. (b) Example behavior of simulated Q-learning (top) and inference-based agents (bottom). Each dot or cross represents the outcome of a single trial. In the Q-learning plot, black and blue traces represent the values of each of the two actions. In the inference-based plot, the black trace represents the posterior probability of the right state, P(s_t = R | c_1, r_1, ..., c_{t-1}, r_{t-1}). (c) We performed a computational simulation of an ensemble of Q-learning and inference-based agents taken from grids that spanned the Q-learning parameter space (top) or the inference-based parameter space (bottom). Based on the results of the simulations, the spaces were clustered into six groups (represented by different colors) that showed qualitatively different behavior. (d) Transition functions grouped according to the behavioral regime Q1-4, IB5-6. Black lines represent single agents and the red trace represents the mean across all transition functions in each group. (e) Behavioral regime composition of each of the six algorithmic domains (Q1-4, IB5-6). (f) Cross-validated confusion matrix showing the classification performance of a k-nearest neighbor (kNN) classifier trained to predict the class identity (Q1-4, IB5-6) based on the observed transition curve. Diagonal entries show the accuracy for each respective class. https://doi.org/10.1371/journal.pcbi.1011430.g005

In contrast, "inference-based" agents select actions by inferring the world's hidden state, i.e., which side is more rewarding, on each trial (Fig 5A, bottom). The internal model of these agents consists of two hidden states, L and R, that determine whether the left or right action is associated with higher reward probability. The transitions of these hidden states are approximated by a Markov process with probability P_switch of switching states and 1 − P_switch of remaining in the same state on each trial. Given this model and the observed outcomes on each trial, the ideal observer can perform Bayesian updates to keep track of the posterior distribution over the two states (see update equations in Methods). The agent then uses the posterior over the world states to select the action that maximizes the expected reward on that trial. To understand the correspondence between the type of algorithm (Q-learning or inference-based) and the shape of the transition function, we break our analysis into two steps. We first built a forward model by simulating the behavior exhibited by Q-learning and inference-based agents with different model parameters. This analysis characterizes and quantifies the features of the transition dynamics shown by each agent. Then, we evaluated whether it is possible to infer the underlying strategy based on the observed transition function using a decoding approach.
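The filtering step for such an observer can be sketched as follows, assuming a standard two-state hidden-Markov update with internal parameters P_switch and P_rew and a greedy readout of the posterior. The paper's exact update equations are in its Methods; treat this as an illustrative reconstruction under those assumptions.

```python
import numpy as np

def simulate_inference_agent(p_switch=0.1, p_rew=0.9, n_blocks=20,
                             block_len=20, seed=0):
    """Inference-based agent: tracks P(state = Right | history) with a
    two-state HMM filter and chooses the side favored by the posterior."""
    rng = np.random.default_rng(seed)
    state = 0                               # true world state (0: left, 1: right)
    belief_right = 0.5                      # posterior P(right state)
    choices, states = [], []
    for _ in range(n_blocks):
        for _ in range(block_len):
            c = int(belief_right > 0.5)     # act on the current posterior
            r = int(c == state)             # deterministic 100-0 feedback
            choices.append(c); states.append(state)
            # Likelihood of the observed outcome under each hypothesized state,
            # using the agent's internal reward probability p_rew
            lik_right = p_rew if (c == 1) == bool(r) else 1 - p_rew
            lik_left = p_rew if (c == 0) == bool(r) else 1 - p_rew
            post = lik_right * belief_right / (
                lik_right * belief_right + lik_left * (1 - belief_right))
            # Prediction step: the world may switch before the next trial
            belief_right = post * (1 - p_switch) + (1 - post) * p_switch
        state = 1 - state                   # true (uncued) reversal
    return np.array(choices), np.array(states)

choices, states = simulate_inference_agent()
print("accuracy:", round(float((choices == states).mean()), 3))
```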

Forward simulations

For forward simulations, the behavior of model-free agents was simulated over a range of parameters with 0.01 ≤ γ ≤ 1.4 and 0.01 ≤ ε ≤ 0.5, and inference-based agents were simulated over the parameter range 0.01 ≤ P_switch ≤ 0.45 and 0.55 ≤ P_rew ≤ 0.99 (example simulations shown in Fig 5B). As expected from the roles of these parameters in the previous literature, the transition curves reflect variations of these parameters along principled axes. For instance, for model-free agents, increasing γ leads to faster switch offsets (S4A-S4C Fig), while varying ε predominantly affects the lapse rates of the sigmoidal transitions (S4B and S4D Fig). Inference-based agents are clearly distinguished from model-free behavior by their small lapse rates (S5 Fig). Their behavior varies along an axis that corresponds to the volatility of the environment. As P_switch and P_rew increase, the internal model assumed by the agents becomes increasingly volatile. This makes agents more sensitive to errors and hence results in faster switch offsets (S5B Fig). Despite these variations in the latency of the switch, the lapse rates of inference-based agents generally remain close to zero.
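A sketch of this kind of parameter sweep is shown below for the model-free agent, using rough proxies for the transition features: the number of trials to the first correct choice after a reversal as an offset proxy, and the error rate over the last 10 trials of a block as a lapse proxy. The proxies, grid values, and fixed-length blocks are simplifications; the paper instead fits the full sigmoidal transition curve to each simulated agent.

```python
import numpy as np

def q_agent_features(gamma, eps, n_blocks=100, block_len=20, seed=0):
    """Run an epsilon-greedy Q-learning agent in fixed-length 100-0 blocks
    and return rough proxies for (switch offset, lapse)."""
    rng = np.random.default_rng(seed)
    q, state = np.zeros(2), 0
    switch_latencies, late_errors = [], []
    for _ in range(n_blocks):
        switched_at, block_correct = None, []
        for t in range(block_len):
            c = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
            r = int(c == state)
            q[c] += gamma * (r - q[c])
            block_correct.append(r)
            if switched_at is None and r == 1:
                switched_at = t + 1          # trials taken to reach the new side
        switch_latencies.append(switched_at if switched_at else block_len)
        late_errors.append(1 - np.mean(block_correct[-10:]))
        state = 1 - state
    return np.mean(switch_latencies), np.mean(late_errors)

# Sweep a small illustrative grid and print the feature proxies
for gamma in [0.05, 0.3, 1.0]:
    for eps in [0.05, 0.3]:
        offset_proxy, lapse_proxy = q_agent_features(gamma, eps)
        print(f"gamma={gamma:4.2f} eps={eps:4.2f} -> "
              f"offset~{offset_proxy:5.2f}, lapse~{lapse_proxy:4.2f}")
```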

Backward inference

Although there is wide variation in the features of the transition curves exhibited by the various agents, the four types of transition modes that we previously observed in rodent behavior can be seen in particular regimes of either Q-learning or inference-based behavior. (1) "Low-performance", random behavior is seen in model-free agents with low learning rates. (2) "Intermediate-performance" behavior with high offset is seen in model-free agents with low learning rates and low exploration rates. It is also seen in inference-based agents in the stable regime. (3) "Intermediate-performance" behavior with high lapse is seen in model-free agents with high exploration rates and high learning rates. (4) "High-performance" behavior is seen in inference-based agents in the volatile regime. We aimed to establish a more precise mapping between the algorithmic spaces (Q-learning and inference-based) and the transition dynamics exhibited by the agents. To simplify this mapping, we clustered the sigmoidal transition features of all simulated inference-based and Q-learning agents into groups that display similar transitions. Each agent's transition curve was reduced to four features (the fitted switch offset, slope, lapse, and overall performance), which were used to perform an unsupervised clustering into six groups (S6 Fig). These groups clustered together in the parameter spaces (Fig 5C). The transition curves of these six classes resembled the types of transitions observed in the rodents' HMM modes (Fig 5D). Matching these groups to the corresponding regimes in the parameter spaces, we found that "low-performance" behavior was seen primarily in the Q1 regime (Fig 5E), which had the lowest learning rate γ. "Intermediate-performance" behavior with high offset was seen in the Q2 regime, which had an intermediate learning rate but low exploration ε. "Intermediate-performance" behavior with high lapse was seen in regime Q3, with the same learning rate as Q2 but higher exploration. It was also seen in regime Q4, the class containing agents with very high learning rates (γ > 1). "High-performance" behavior was seen in the inference-based agents (IB5-6). The regime IB5 also contains a small number of agents with intermediate performance and high offset. To validate the utility of these classes, we trained a k-nearest neighbor (kNN) classifier to predict the class identity (Q1-4 or IB5-6) of synthetic agents given the observed transition curve. The classifier performed with a high accuracy of 88% on a held-out test set (Fig 5F), compared to chance performance of 17%. Altogether, these results establish a consistent and reliable mapping from the underlying strategies in the model-free or inference-based parameter spaces to the observed transition functions, which can account for the diversity of rodent behavior during reversal learning.
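The decoding step can be sketched as follows: group transition-curve features into six clusters and train a k-nearest-neighbor classifier to recover the cluster labels on a held-out split. The synthetic feature matrix below is a placeholder standing in for the simulated agents' fitted (offset, slope, lapse, performance) features, and the choice of KMeans and k = 5 neighbors is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Placeholder feature matrix: (offset, slope, lapse, performance) per agent
n_agents = 600
features = np.column_stack([
    rng.uniform(1, 10, n_agents),    # switch offset
    rng.uniform(0.1, 5, n_agents),   # slope
    rng.uniform(0, 0.5, n_agents),   # lapse
    rng.uniform(0.4, 1.0, n_agents), # overall block performance
])

# Unsupervised grouping into six classes (stand-in for Q1-4, IB5-6)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(features)

# Train a kNN classifier to predict class identity from the features
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25,
                                          random_state=0, stratify=labels)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("held-out accuracy:", round(knn.score(X_te, y_te), 2))
print("chance level:", round(1 / 6, 2))
```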

Changes in the composition of strategies during task learning

We next investigated changes in the composition of strategies during learning of the task. Behavioral modes of animals were first re-classified into the six behavioral strategies, Q1-4 and IB5-6 (Fig 6A and 6B). Animals typically used a combination of Q-learning and inference-based modes throughout their training sessions. However, there was some variability in the types of strategies used by different animals. A small subset of animals, such as f16 (Fig 6C), consistently employed model-free strategies during execution of the task. Other animals, such as f11 (Fig 6D), started with a model-free strategy but transitioned to inference-based modes with learning. Remarkably, even at the expert stage (day 30 of training), this animal never operated fully in the inference-based regime but continued to execute a mixture of strategies. This was a common feature of many animals that reached the inference-based stage (such as animals e46, e54, e56, f01, f11, f12, fh02, and fh03; Fig 6A). The compositions of blockHMM-decoded strategies of individual animals are shown in S7 Fig, together with the individual session performances.


Fig 6. Mice use a combination of model-free and inference-based strategies in reversal learning. (a) Composition of blockHMM mixtures for individual animals. Each row represents one mouse, with its ID shown on the left. The color of each square represents the decoded behavioral regime of each HMM mode (Q1-4, IB5-6). The number of modes for each animal, K, was selected by cross-validation, and the modes are sorted here in descending order. (b) Transition functions of HMM modes for all animals, grouped according to the decoded behavioral regime. (c, d) Distribution of HMM modes for two example animals, f16 and f11, which displayed vastly different behavioral strategies with learning. The average performances of the two animals on the last 5 days of training, E, are shown. (e) Average frequency of HMM modes for n = 21 experimental animals (mean ± standard error), showing the average evolution of behavioral mixtures over the course of training. https://doi.org/10.1371/journal.pcbi.1011430.g006

This shift from model-free to inference-based behavior was seen in the average mixture composition of the animals (Fig 6E). On average, animals started training with a significant fraction of the Q1 mode and a smaller fraction of Q4 (56% in Q1 and 24% in Q4, averaged across days 1-5). Over the course of training, the mixture of behavioral strategies slowly shifted from Q1 to Q4, such that by around day 15 there was a higher fraction of the Q4 than the Q1 mode (39% in Q4 compared to 35% in Q1, averaged across days 16-20). This shift reflects an average increase in learning rate within the Q-learning regime. At the same time, the fraction of inference-based modes, IB5 and IB6, was low at the beginning (3% in IB5 and 6% in IB6, averaged across days 1-5), but continuously increased as animals gained experience with the task. At the expert stage, there was an increase in the fraction of blocks in the inference-based modes (7% in IB5 and 15% in IB6 on day 30), but the mixture of strategies still remained, with Q1 and Q4 being the primary Q-learning modes of the animals. These patterns of strategy mixture were consistent between male and female mice, with no statistically significant difference in composition between sexes (S8 Fig; p > 0.05 for all modes and session groups, Mann-Whitney U-test). Overall, this ubiquitous use of mixtures of strategies, which was distinctive in both naïve and expert animals, further underscores the importance of our approach for dissociating and characterizing the features that constitute individual modes of behavior.

[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011430

Published and (C) by PLOS (PLOS Computational Biology)
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
