(C) PLOS One
This story was originally published by PLOS One and is unaltered.
Active reinforcement learning versus action bias and hysteresis: control with a mixture of experts and nonexperts [1]
Jaron T. Colas (Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, United States of America; Division of the Humanities and Social Sciences, California Institute of Technology)
Date: 2024-04
Active reinforcement learning enables dynamic prediction and control, where one should not only maximize rewards but also minimize costs such as those of inference, decisions, actions, and time. For an embodied agent such as a human, decisions are also shaped by physical aspects of actions. Beyond the effects of reward outcomes on learning processes, to what extent can modeling of behavior in a reinforcement-learning task be complicated by other sources of variance in sequential action choices? What of the effects of action bias (for actions per se) and action hysteresis determined by the history of actions chosen previously? The present study addressed these questions with incremental assembly of models for the sequential choice data from a task with hierarchical structure for additional complexity in learning. With systematic comparison and falsification of computational models, human choices were tested for signatures of parallel modules representing not only an enhanced form of generalized reinforcement learning but also action bias and hysteresis. We found evidence for substantial differences in bias and hysteresis across participants—even comparable in magnitude to the individual differences in learning. Individuals who did not learn well revealed the greatest biases, but those who did learn accurately were also significantly biased. The direction of hysteresis varied among individuals as repetition or, more commonly, alternation biases persisting from multiple previous actions. Considering that these actions were button presses with trivial motor demands, the idiosyncratic forces biasing sequences of action choices were robust enough to suggest ubiquity across individuals and across tasks requiring various actions. In light of how bias and hysteresis function as a heuristic for efficient control that adapts to uncertainty or low motivation by minimizing the cost of effort, these phenomena broaden the consilient theory of a mixture of experts to encompass a mixture of expert and nonexpert controllers of behavior.
Reinforcement learning unifies neuroscience and AI with a universal computational framework for motivated behavior. Humans and robots alike are active and embodied agents who physically interact with the world and learn from feedback to guide future actions while weighing costs of time and energy. Initially, the modeling here attempted to identify learning algorithms for an interactive environment structured with patterns in counterfactual information that a human brain could learn to generalize. However, behavioral analysis revealed that a wider scope was necessary to identify individual differences in not only complex learning but also action bias and hysteresis. Sequential choices in the pursuit of rewards were clearly influenced by endogenous action preferences and persistent bias effects from action history causing repetition or alternation of previous actions. By modeling a modular brain as a mixture of expert and nonexpert systems for behavioral control, a distinct profile could be characterized for each individual attempting the experiment. Even for actions as simple as button pressing, effects specific to actions were as substantial as the effects from reward outcomes that decisions were supposed to follow from. Bias and hysteresis are concluded to be ubiquitous and intertwined with processes of active reinforcement learning for efficiency in behavior.
Funding: STG was supported by the Institute for Collaborative Biotechnologies under Cooperative Agreement W911NF‑19‑2‑0026 and grant W911NF‑16‑1‑0474 from the Army Research Office. JPOD was supported by National Institute on Drug Abuse grant R01 DA040011 and the National Institute of Mental Health’s Caltech Conte Center for Social Decision Making (P50 MH094258). The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.
To the end of establishing guidelines for behavioral modeling in general, there were further questions concerning how exactly these directional biases would manifest and how substantial they would be for the experimenter’s default choice of pressing a button, which is a simple and familiar action with trivial motor demands. For proof of concept, the present paradigm can query not only the suitability of these particular forms of biases for button presses but also the viability of these factors as additional complexities while learning theory is advanced. With reference to analogous architectures in machine learning [ 49 – 54 ] as well as with general appeal to modular parallelism and conditional computation for balancing versatility and efficiency in optimal control, the consilient theory of a mixture of experts [ 6 – 8 , 55 – 57 ] can be broadened further for a mixture of expert and nonexpert controllers of behavior (see Discussion ). This contrast of expertise versus efficiency is represented here by different types of expert RL versus nonexpert bias and hysteresis.
Abiding by Occam’s razor [ 48 ], the more parsimonious factors of action bias and hysteresis should be granted first priority for inclusion if they are sufficiently substantial, but testing empirical data was necessary to verify practical feasibility in consideration of the compounded complexity with different forms of learning. Individuals found to not learn well were expected to reveal the greatest effects of bias and hysteresis. Yet those who learned accurately were also hypothesized to exhibit biases that would account for significant variance (even if this were to amount to less variance than that from learning).
Previously, the GRL model was built with fixed prior assumptions for another three free parameters representing action bias and hysteresis. One of these parameters specifies the constant lateral bias; the other two specify a decaying exponential function for the hysteresis trace extending backward across the sequence. This particular configuration of constant bias and exponential hysteresis was initially arrived at intuitively more so than empirically [ 12 , 21 ] while drawing elements from earlier models [ 17 , 18 ]. Now, the 3-parameter adjunct was to actually be tested against GRL alone as well as both simpler and more complex variations for bias and (state-independent) hysteresis. Subsequent testing also proceeded to alternative model features that could be other sources of action repetition or alternation, including state-dependent hysteresis, state-independent action value, confirmation bias in learning, or asymmetric learning rates more generally.
Free parameters are listed for the 72 behavioral models in ascending order of complexity within and across classes. The models are coded with the first letter of the label referring to four possibilities: an absence of learning (“X”), reinforcement learning (RL) without generalization (“0”), generalized reinforcement learning (GRL) with one shared generalization parameter g 1 (“1”), or GRL with two separate generalization parameters g 1 and g 2 (“2”). RL itself required free parameters for the learning rate α and the softmax temperature τ. Models labeled with “C” for the second letter included a constant lateral bias, which was arbitrarily designated as a rightward bias β R (where β R < 0 is leftward). The list is condensed with bracket notation to represent the range for the n-back horizons of each successive model within a hysteresis category (e.g., “2CE[1–3]” for models 2CE1, 2CE2, and 2CE3). Models labeled with “N” and ending with a positive integer (from the range in brackets) included n-back hysteresis with free parameters β n for repetition (β n > 0) or alternation (β n < 0) of each previous action represented—up to 4 trials back (β 4 ) with learning and up to 8 trials back (β 8 ) without learning. Models labeled with “E” and ending with a positive integer N (from the range in brackets) included exponential hysteresis with inverse decay rate λ H taking effect N+1 trials back. Exponential models could also be both parametric and nonparametric with N free parameters β n for initial n-back hysteresis up to 3 trials back (β 3 ), where the final β N is the initial magnitude of the exponential component. “df” stands for degrees of freedom. See also Table A in S1 Text for the unrolled version of the list. This ordering of the models corresponds to the ordering in Figs 2 and 3.
The primary model comparison here ( Table 2 and Table A in S1 Text ) exhaustively tested various combinations of action-specific effects as well as “generalized reinforcement learning” (GRL), which is a quasi-model-based extension of model-free RL that can flexibly generalize value information across states and actions ( Fig 1B and Fig B in S1 Text ) [ 12 ]. GRL per se is somewhat incidental for the present purposes, but what matters as far as a test case here is that a model incorporating the complexities of bias and hysteresis should still be amenable to exploring complex learning algorithms beyond the most basic RL. GRL is especially complicating in this regard because it introduces high-frequency dynamics to learning with counterfactual updates of multiple value representations in parallel.
Perhaps surprisingly, the hypothesis for hysteresis in the present experiment was that alternation would predominate rather than repetition. An action policy biased toward alternation would follow from the fact that, by design, choosing actions optimally in response to the rotating states of this environment would result in alternating more frequently. Yet, also by design, this perseverative alternation that is characteristically independent of learned external value was not conducive to obtaining more rewards from this environment.
A more comprehensive model of action selection can also enhance identifiability with respect to actual learning (or lack thereof) as opposed to other components of variance that may mimic or otherwise obscure signatures of learning with spurious correlations across the finite sequence of actions [ 17 , 18 , 27 , 28 , 39 – 47 ]. As external reinforcement promotes consistent repetition of responses within a state, so too can action bias, and both repetition and alternation from hysteresis can coincidentally align with the reward contingencies of the sequence of states. Whereas preexisting constant biases interact with learning when base rates for actions are unbalanced in sequence, hysteretic biases can further complicate action sequences with not only intrinsic dynamics but also more possibilities for interactions across any sequential patterns in the environment and the dynamics of learning.
Too often, such action-specific effects have been overlooked altogether or given only cursory mention as if they were inconsequential in the context of a learning model. If considered at all, the scope of hysteresis has also usually been limited to only one trial back. (To address this issue here, we modeled hysteresis over a time horizon longer than one trial.) Moreover, because repetition tends to predominate in aggregate behavior for RL and other sequential paradigms, manifestations of hysteresis have mostly been framed so as to deemphasize or entirely disregard alternation biases in favor of repetition biases. Autocorrelational effects have thus been referred to in the literature with unidirectional and often imprecise terminology such as “perseveration”, “perseverance” (a misnomer), “persistence”, “habit”, “choice stickiness”, “choice consistency”, “repetition priming”, “response inertia”, or “behavioral momentum”. Semantics of interpretation aside, the common thread for hysteresis is a past action’s influence on an upcoming action with independence from learnable external feedback and typically, albeit not necessarily, from external states as well.
Fundamentally for even basic RL, the possibilities for variables in a more comprehensive behavioral model can be classified according to dependence on (or independence of) states, actions, previous actions, and reward outcomes. In principle, whereas action value is outcome-dependent, action hysteresis is outcome-independent. However, when modeling actual behavior, this conceptual independence does not guarantee statistical independence because of incidental correlations in finite sequences of action choices. For the present study, the primary model comparison focuses on the three variables (marked with an asterisk) that are the most fundamental and typically the most dissociable—namely, constant bias B(a), state-independent action hysteresis H(a), and state-dependent action value Q(s,a). The extended model comparison also incorporates state-dependent action hysteresis H(s,a) and state-independent action value Q(a). Note that state value V(s) is generally relevant in RL but is not considered here. The abbreviations “PrevAction”, “dep.”, and “indep.” correspond to “previous action”, “dependent”, and “independent”, respectively.
In the present study, we hypothesize that behavior during active learning is determined not only by RL and stochasticity but also by action bias and hysteresis, which are independent of the current state of the external environment and its reward history ( Fig 1 ). This state-independent hysteresis in particular makes actions depend on previous actions regardless of states, but state-dependent hysteresis was also considered later ( Table 1 ). The interplay of these different forces was investigated for human behavior in a task that in one sense is a hierarchical reversal-learning task but in another sense is a sequential button-pressing task ( Fig A in S1 Text ). Hence the behavioral data of a multisite neuroimaging study reported previously [ 12 ] were reanalyzed with further model comparison from this bias-centric perspective.
In practice, model fitting is nontrivial with a sequence of choices typically limited to hundreds or even just dozens of observations. Adding to this challenge, increasingly complex behavior under study imposes greater demands for accommodating multidimensional individual differences and optimizing individual fits without hierarchical Bayesian fitting [ 13 , 29 ] and its disadvantage of estimation bias [ 30 – 35 ]. (For a random grouping of independent data sets, even hierarchical fitting compromises their independence with the strong assumption of a common distribution for every individual based on the ecological fallacy [ 36 – 38 ].) Both within and between individual sequences, sources of variance other than RL may be crucial to complement an RL model despite the costs of additional degrees of freedom. In other words, including modules beyond RL in a model of actual behavior can alleviate estimation bias and other distortions of learning parameters that would otherwise be forced to simultaneously fit other phenomena with omitted variables.
The standard setup for fitting RL to behavior (e.g., [ 22 ]) begins with a 2-parameter model tuned for the learning rate and the softmax temperature, where the latter represents stochasticity [ 3 , 23 – 25 ]. This base model is then built upon with additional free parameters to test for more complex learning phenomena, which should include the due diligence of model comparison and qualitative falsification [ 26 – 28 ]. However, an alternative line of questioning could instead begin with asking whether more parsimonious and perhaps more substantial sources of variance merit prioritization before making any new assumptions about complexities within learning. The emphasis can also be shifted away from the prescriptive (i.e., “According to some notion of ‘optimality’, what should a person do here?”) in favor of the descriptive (“What do people actually do here?”) while creating an opportunity to circle back from empirical findings to a new perspective on different aspects of optimality in behavior.
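For concreteness, a minimal sketch of this 2-parameter baseline follows (in Python with NumPy), pairing a delta-rule value update with a softmax policy whose temperature represents stochasticity; the array-based data format and function names are illustrative assumptions rather than the present study's actual code.

    import numpy as np

    def softmax_policy(q_values, tau):
        # Softmax action-selection policy with temperature tau (higher tau = more stochastic).
        z = (q_values - q_values.max()) / tau  # shift by the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def negloglik_basic_rl(params, states, actions, rewards, n_states=4, n_actions=2):
        # Negative log-likelihood of the 2-parameter RL model (learning rate alpha, temperature tau).
        alpha, tau = params
        Q = np.zeros((n_states, n_actions))  # initial state-dependent action values
        nll = 0.0
        for s, a, r in zip(states, actions, rewards):
            p = softmax_policy(Q[s], tau)
            nll -= np.log(p[a] + 1e-12)      # likelihood of the observed choice
            delta = r - Q[s, a]              # reward-prediction error
            Q[s, a] += alpha * delta         # delta-rule value update
        return nll

    # Individual fits could then proceed with, e.g., scipy.optimize.minimize over (alpha, tau).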
The present case of two available actions (one per hand) reduces the first component of action bias to a single bidirectional constant for left versus right [ 14 – 16 ]. Hysteresis is bidirectional as well and adds dynamics in the form of either repetition or alternation of previous actions, which may also manifest for a horizon beyond just the most recent action [ 17 – 20 ]. Despite at least some precedent for either action bias or action hysteresis (more so the latter), the combination of both bias and hysteresis has even less precedent for RL [ 12 , 21 ].
(a) Each trial of the structured reward-learning task was initiated with an image cue symbolizing the state of the environment (e.g., “A” or “B”), where the optimal action given the state was a button press with either the left (“L”) or right (“R”) hand. In contrast to the expert control of GRL for mapping state-action pairs to rewards, the nonexpert forces of action bias and hysteresis were modeled as leftward or rightward bias and repetition or alternation bias. These action-specific effects manifest independently of the external state and reward history. (b) What matters for the present purposes is that, while a model with GRL adds complexity to basic RL, even more complexity must be accommodated for action bias and hysteresis. The agent’s mixture policy π t (s t ,a*) is probabilistic over available actions a* in state s t . The action selection of this mixture policy is determined by not only learned value for state-action pairs Q t (s t ,a*) but also constant bias B(a*) and dynamic hysteretic bias H t (a*) with an exponentially decaying hysteresis trace. The outcome of the chosen action a t is a reward r t+1 that updates Q t (s t ,a t ) via the reward-prediction error (RPE) δ t+1 weighted by a learning rate α. For GRL specifically, this RPE signal is generalized to representations of other state-action pairs according to extra parameters for action generalization (g A ) and state generalization (g S ). See Figs 8 and 13 for details of the plots representing individual differences in constant lateral bias (left versus right) and the exponential hysteresis trace (repeat versus alternate). See also the original report of this study with additional details about the paradigm and GRL per se [ 12 ].
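To make the generalization step in panel (b) concrete, a sketch of one possible GRL update follows; which alternative action and which anticorrelated state receive the relayed RPE, and with what exact weighting, are simplifying assumptions for illustration rather than the authors' exact implementation.

    import numpy as np

    def grl_update(Q, s, a, r, alpha, g_A, g_S, paired_state, other_action):
        # One generalized-RL (GRL) update: the factual reward-prediction error (RPE)
        # is also relayed to other state-action representations, weighted by g_A
        # (across actions within the state) and g_S (across anticorrelated states
        # within the category). Discriminative generalization corresponds to
        # negative values of g_A and g_S. The pairing arguments are assumptions
        # standing in for the task's anticorrelational structure.
        delta = r - Q[s, a]                          # RPE for the chosen state-action pair
        Q[s, a] += alpha * delta                     # standard factual update
        Q[s, other_action] += alpha * g_A * delta    # counterfactual update across actions
        Q[paired_state, a] += alpha * g_S * delta    # counterfactual update across states
        return Q, delta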
The RL framework has appreciable predictive validity [ 4 , 5 ] when accounting for human choices and learning behavior in a variety of settings [ 6 – 8 ]—let alone the power of extensions of RL [ 9 – 12 ]. However, such models sometimes fail to account well for an individual’s behavior even in a relatively simple task that should be amenable to RL in principle [ 13 ]. An open question concerns whether other components of variance not based on learning also exist alongside RL so as to collectively provide a better account of motivated behavior and even learning itself within a more comprehensive model. The present study focuses on the contributions of other elements of active learning that are also essential in their own way: action bias—specifically for actions per se—and action hysteresis, which is determined by the history of previously selected actions ( Fig 1A ).
Whether in machine learning and artificial intelligence or in animal learning and neural intelligence, the most crucial portion of reinforcement learning (RL) [ 1 – 3 ] is not passive, offline, or observational but instead active and online with a challenge of not only prediction but also real-time control. In the real world, resources for activity are finite, and much of active RL is also embodied RL. Whether robot or human, the embodied agent learns from feedback to make decisions and select physical actions that maximize future reward while minimizing various costs of energy as well as time.
Results
Paradigm

Additional details of the study and previous results can be found in the original report for these data sets [12]. The hierarchical reversal-learning task delivered probabilistic outcomes for combinations of categorized states and contingent actions with reward distributions changing across 12 blocks of trials (Figs A and B in S1 Text). Suitably for first testing GRL, the state (or context) of each trial represented a two-armed contextual bandit belonging to one of two categories (e.g., faces or houses) with two anticorrelated states per category and two anticorrelated actions per state (i.e., left-hand button press or right-hand button press). For an optimal learner, the counterfactual information in this anticorrelational structure could be leveraged with the discriminative generalization of GRL. The action-generalization weight g A and state-generalization weight g S , which would ideally both be negative for discriminative generalization, govern the relaying of the reward-prediction error across state-dependent actions or across states within a category, respectively.

For standard behavioral RL (with or without an extension such as GRL), the state-dependent action values Q t (s,a) that are learned over time would be the only inputs to a probabilistic action-selection policy π t (s,a) characterized by a softmax function with temperature τ:

π t (s,a) = exp[Q t (s,a)/τ] / Σ a* exp[Q t (s,a*)/τ],

where the sum runs over the available actions a*. As the scope of the model is expanded, the present study emphasizes that the action policy is a function of not only action value Q t (s,a) but also constant bias B(a) and dynamic hysteretic bias H t (a) as modules within a mixture of experts and nonexperts (Fig 1) [12,21]. Constant bias B(a) becomes a lateral bias between left and right actions in this case, whereas the dynamic hysteretic bias H t (a) maps repetition and alternation to positive and negative signs, respectively. To represent these action-specific biases that are independent of external state and reward history, the equation for the mixture policy incorporates additional terms like so:

π t (s,a) = exp{[Q t (s,a) + B(a) + H t (a)]/τ} / Σ a* exp{[Q t (s,a*) + B(a*) + H t (a*)]/τ}.
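A minimal sketch of this mixture policy and its hysteresis trace follows, assuming two actions, a constant rightward bias β R , and a recursive exponential trace whose per-lag weights equal β 1 λ H ^(n-1); whether the bias terms share the temperature scaling exactly as written here is an assumption of the illustration.

    import numpy as np

    def mixture_policy(Q_s, B, H, tau):
        # Softmax mixture policy over two actions combining learned value Q_t(s, .),
        # constant lateral bias B(.), and hysteretic bias H_t(.).
        v = (Q_s + B + H) / tau
        v = v - v.max()                      # shift for numerical stability
        p = np.exp(v)
        return p / p.sum()

    def update_hysteresis_trace(H, chosen, beta_1, lambda_H):
        # Exponentially decaying hysteresis trace: after this update, the action taken
        # n trials ago contributes beta_1 * lambda_H**(n - 1), favoring repetition if
        # beta_1 > 0 or alternation if beta_1 < 0.
        H = lambda_H * H
        H[chosen] += beta_1
        return H

    # Example trial: value favors the left action, but a rightward constant bias and a
    # repetition-leaning trace on the right action partly offset it (values illustrative).
    Q_s = np.array([0.6, 0.4])               # Q_t(s, left), Q_t(s, right)
    B = np.array([0.0, 0.1])                 # constant rightward bias
    H = np.array([0.0, 0.15])                # current hysteresis trace
    print(mixture_policy(Q_s, B, H, tau=0.3))
    H = update_hysteresis_trace(H, chosen=1, beta_1=0.2, lambda_H=0.5)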
Action bias and hysteresis versus learning performance

In keeping with the previous point about idiosyncratic environments, the statistics of a given task environment must be considered to set reference points for quantifying and interpreting truly action-specific components of variance. While triple dissociation of bias, hysteresis, and learning is generally nontrivial for a short sequence of active states, this challenge can be exacerbated even further by class imbalance depending on the temporal statistics of states, actions, and rewards. In arriving at a fully interpretable quantitative model amenable to individual differences, the challenge was first met here by a hierarchically counterbalanced experimental design that was tightly controlled within and across sessions. Regarding the constant lateral bias, available rewards were thus evenly distributed between left-hand and right-hand actions all throughout the experiment. Hence an omniscient optimal agent with perfect 100% accuracy would be guaranteed to produce an even 50% probability of a left- or right-hand action.

This was not the case for hysteresis, however: that same agent would produce an uneven 66.7% probability of action alternation as a byproduct of choosing the optimal actions here. This incidental asymmetry can superficially mimic an internal alternation bias while a learner actually responds to the external structured sequence of four randomly rotating states. (States were never repeated in consecutive trials, and of the three remaining states, only one from the complementary category would reward the action just performed in a given state for the block—resulting in two-thirds or 66.7% alternation.) Note that a naïve policy with a 100% probability of alternation irrespective of state would nonetheless produce chance accuracy at 50% by design. Such ambiguity for a raw, model-independent measure again underscores the need for comprehensive computational modeling that accounts for multiple implicit effects simultaneously.

To the extent that the forces of bias and learning compete with each other to drive behavior, an inverse relation was expected between learning performance and the weight of action bias and hysteresis. Again omitting Nonlearners, overall bias |β R |+|β 1 | in actual learners was inversely correlated with accuracy as the probability of choosing the correct action (FH: r = -0.290, t 38 = 1.87, p = 0.035, r S = -0.374, p = 0.009 for monotonicity; CM: r = -0.472, t 19 = 2.33, p = 0.015, r S = -0.605, p = 0.002 for monotonicity). This inverse relation between modeled bias and objective performance was monotonic across not only all learners but also the alternation-bias group specifically (FH: r = -0.383, t 23 = 1.99, p = 0.029, r S = -0.475, p = 0.009 for monotonicity; CM: r = -0.453, t 14 = 1.90, p = 0.039, r S = -0.618, p = 0.006 for monotonicity), demonstrating that bias as extracted with modeling was not confounded with alternation that may incidentally result from pursuing reward. (See the next section for more detail about the alternation-bias group.)

To complement the initial quantitative model comparison for overall goodness of fit, a series of posterior predictive checks followed for evidence of bias and hysteresis with qualitative falsification of the null hypotheses in nested models [26–28]. The same technique had been used previously to falsify basic RL against GRL [12].
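The two reference points above—66.7% alternation under an omniscient optimal policy but only 50% accuracy under naive alternation—can be verified with a toy simulation; the state space below is a simplified stand-in for the task (four states, two per category, exactly one of the other three states sharing the current optimal action, and no immediate state repetitions), not the actual stimulus schedule.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simplified stand-in for the rotating four-state structure.
    optimal_action = {"F1": 0, "F2": 1, "H1": 0, "H2": 1}
    states = list(optimal_action)

    def simulate(policy, n_trials=100_000):
        s = rng.choice(states)
        prev_a, n_correct, n_alternations = None, 0, 0
        for _ in range(n_trials):
            a = policy(s, prev_a)
            n_correct += (a == optimal_action[s])
            if prev_a is not None:
                n_alternations += (a != prev_a)
            prev_a = a
            s = rng.choice([x for x in states if x != s])  # states never repeat consecutively
        return n_correct / n_trials, n_alternations / (n_trials - 1)

    # An omniscient optimal agent alternates on about two-thirds of trials...
    print(simulate(lambda s, prev: optimal_action[s]))                      # ~ (1.00, 0.667)
    # ...whereas a naive strict alternator achieves only chance-level accuracy.
    print(simulate(lambda s, prev: 1 - prev if prev is not None else 0))    # ~ (0.50, 1.00)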
Each check entailed juxtaposition of empirical behavior and the behavior simulated by GRL models that, while holding a fixed assumption of two new learning parameters for generalization, are incrementally tested with up to three more action-bias parameters. First separating groups on the basis of learning performance, a binary model comparison could illustrate some fundamental limitations of the pure GRL model “2” with no bias as opposed to the final 2CE1 model with three parameters for constant bias and exponential hysteresis. (The intermediate models between these 4- and 7-parameter end points are investigated in greater depth later.) Posterior predictive checks for these two models were tested against empirical results for not only the probability of a correct (versus incorrect) action—as is standard for a learning paradigm—but also the probability of a right-hand (versus left-hand) action and the probability of a repeated (versus alternated) action independent of state. From a naïve perspective it would appear that, by qualitatively capturing the probability of a correct choice across levels of learning performance (FH-G: M = 12.8%, t 30 = 13.13, p < 10^-13; FH-P: M = 0.1%, p > 0.05; FH-N: M = 0.1%, p > 0.05; CM-G: M = 12.3%, t 15 = 8.75, p = 10^-7; CM-P: M = -0.2%, p > 0.05) in silico as well (FH-G: p < 0.05; FH-P: p > 0.05; FH-N: p > 0.05; CM-G: p < 0.05; CM-P: p > 0.05) (Figs 6A/6D and 7A/7D and Fig Oa/d in S1 Text), the 4-parameter GRL model “2” with no bias seemingly accounts for human behavior comparably to the 7-parameter 2CE1 model expanded with action bias and hysteresis. However, the shortcomings of a purely learning-based account can be revealed even in 0-back and 1-back action-specific effects. Remarkably, these action-specific effects (Figs 6E–6F and 7E–7F) are quite substantial in effect size as compared with the value-based effects (Figs 6D and 7D) typically and most intuitively emphasized in a paradigm for active learning.
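In outline, each such check compares an empirical summary statistic with the same statistic averaged over simulations yoked to each participant's own trial schedule; the data keys and the simulate_choices interface below are hypothetical stand-ins for the fitted GRL variants rather than the study's actual code.

    import numpy as np

    def posterior_predictive_checks(participants, simulate_choices, n_sims=100):
        # Each participant dict is assumed to hold numpy arrays: 'actions' (0 = left,
        # 1 = right), 'correct' (1 if the optimal action was chosen), 'schedule' (the
        # state/reward schedule), and 'params' (that individual's fitted parameters).
        # simulate_choices(params, schedule) is assumed to return (actions, correct).
        def summarize(actions, correct):
            return {"P(correct)": correct.mean(),
                    "P(right)": actions.mean(),
                    "P(repeat)": (actions[1:] == actions[:-1]).mean()}

        results = []
        for person in participants:
            empirical = summarize(person["actions"], person["correct"])
            sims = [simulate_choices(person["params"], person["schedule"])
                    for _ in range(n_sims)]
            simulated = {key: np.mean([summarize(a, c)[key] for a, c in sims])
                         for key in empirical}
            results.append({"empirical": empirical, "simulated": simulated})
        return results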
Fig 6. Action bias and hysteresis versus learning performance: 3-T Face/House version. To compare the pure GRL model (“2”) with the final 2CE1 model adding three parameters for constant bias and exponential hysteresis, simulated data sets from each model were yoked to their respective empirical data sets. Posterior predictive checks were tested for the probability of a correct action, the probability of a right-hand action, or the probability of a repeated action independent of state. (a) If only examining accuracy in terms of correct choices for maximizing reward, the shortcomings of the reduced model without bias are not immediately apparent. (b) Upon considering action bias, these right-handed individuals mostly had a tendency to select the right-hand action (p < 0.05). Whereas the 2CE1 model could account for this effect with a constant lateral bias (p < 0.05), the reduced model could not (p > 0.05). (c) Regarding the probability of repetition versus alternation, note that 100% accuracy would produce 66.7% alternation for the present experimental design, but 100% alternation would still produce 50% accuracy. The Good-learner group exhibited a tendency to alternate in the aggregate as expected (p < 0.05), whereas the Poor-learner and Nonlearner groups did not (p > 0.05). Only the 2CE1 model featuring exponential hysteresis could match this pattern with quantitative precision. (d-f) Independent of direction, absolute differences from the chance level of 50% reveal the full extent of the action-specific components of variance, which are as substantial as the effects of reward typically emphasized in active learning. For fitting the probability of a right-hand action or a repeated action, a margin of roughly 2% for pure GRL was insubstantial in comparison. Error bars indicate standard errors of the means.
https://doi.org/10.1371/journal.pcbi.1011950.g006
Fig 7. Action bias and hysteresis versus learning performance: 7-T Color/Motion version. Compare to Fig 6. Results were replicated in the 7-T Color/Motion version of the experiment.
https://doi.org/10.1371/journal.pcbi.1011950.g007

Across these right-handed participants, all five groups in the aggregate performed the right-hand action more often (FH-G: M = 1.8%, t 30 = 2.11, p = 0.022; FH-P: M = 9.3%, t 8 = 3.99, p = 0.002; FH-N: M = 5.1%, t 6 = 1.54, p = 0.088; CM-G: M = 4.8%, t 15 = 3.21, p = 0.003; CM-P: M = 9.9%, t 4 = 2.36, p = 0.039) (Figs 6B/6E and 7B/7E and Fig Ob/Oe in S1 Text), and greater or marginally greater rightward bias was observed in Poor learners and Nonlearners relative to Good learners (FH-PG: M = 7.6%, t 38 = 3.80, p < 10^-3; FH-NG: M = 3.3%, t 36 = 1.43, p = 0.081; CM-PG: M = 5.2%, t 19 = 1.47, p = 0.079). Hence this measure of absolute lateral bias |P(Right)-50%| was also greater in Poor learners and Nonlearners (FH-PG: M = 6.0%, t 38 = 3.81, p < 10^-3; FH-NG: M = 3.7%, t 36 = 2.14, p = 0.020; CM-PG: M = 4.8%, t 19 = 1.51, p = 0.074), which likewise held true when correlating across the continuous measure of accuracy rather than discrete participant groups (FH: r = -0.544, t 38 = 4.00, p = 10^-4; CM: r = -0.540, t 19 = 2.80, p = 0.006). Whereas the full 2CE1 model could replicate all of these effects (p < 0.05), the reduced GRL model could not (p > 0.05). As a reflection of individual-specific class imbalance or overfitting in the absence of constant bias, a roughly 2% margin was apparent in the absolute difference between the reduced model’s right-hand probability and the chance level of 50% (Figs 6E and 7E). Yet this margin was insubstantial in comparison to the true effect sizes of constant bias that were quantitatively matched by only the full model.

Note again that 100% accuracy in this contrived environment would produce 66.7% alternation because of rotating states, but 100% alternation would produce 50% accuracy. The interpretation of this raw measure is thus confounded between effects of reward and hysteresis, but in keeping with the statistics of the environment, the Good-learner groups did exhibit a tendency to alternate in the aggregate while the Poor-learner and Nonlearner groups did not (FH-G: M = -2.9%, t 30 = 2.94, p = 0.003; FH-P: M = 1.5%, p > 0.05; FH-N: M = 0.8%, p > 0.05; CM-G: M = -4.2%, t 15 = 4.34, p < 10^-3; CM-P: M = 3.5%, p > 0.05) (Figs 6C/6F and 7C/7F and Fig Oc/f in S1 Text). In contrast, the absolute repetition-or-alternation frequency |P(Repeat)-50%| was significantly greater than chance for all subgroups (FH-G: M = 5.0%, t 30 = 8.11, p < 10^-8; FH-P: M = 5.5%, t 8 = 3.73, p = 0.003; FH-N: M = 8.2%, t 6 = 3.84, p = 0.004; CM-G: M = 4.8%, t 15 = 6.15, p < 10^-5; CM-P: M = 13.8%, t 4 = 2.60, p = 0.030). Relative to Good learners, Nonlearners exhibited even greater deviation from chance with repetition or alternation (M = 3.2%, t 36 = 1.97, p = 0.028), as did the Poor learners of at least the second data set (M = 9.1%, t 19 = 2.89, p = 0.005). The latter trend held true for the second data set with marginal significance for the continuous measure of accuracy as well (r = -0.312, t 19 = 1.43, p = 0.084). Only the 7-parameter model could match net 1-back effects with quantitative precision (FH-G: p < 0.05; FH-P: p > 0.05; FH-N: p > 0.05; CM-G: p < 0.05; CM-P: p > 0.05), and qualitative falsification of the pure GRL model for such hysteretic effects was to be found in follow-up analyses disambiguating effects of reward and hysteresis. Owing to this disambiguation, the model-based results that follow are more reliable than these model-independent measures for inference about actual hysteresis per se.
Different forms of action bias and hysteresis

The 2CE1 model should accommodate the idiosyncrasies of individual participants with respect to not only GRL, which has already been demonstrated [12], but also action bias and hysteresis. Based on parameter fits, Good and Poor learners were combined and then reclassified according to the directionality of either constant bias or hysteretic bias—that is, leftward (β R < 0) versus rightward (β R > 0) or alternation (β 1 < 0) versus repetition (β 1 > 0). Nonlearners were again omitted for more rigorous testing of biases in the presence of actual learning. Each posterior predictive check was extended to the eight models previously highlighted in the reduced model comparison—that is, incrementally building up from the no-bias model “2” with only GRL (4 parameters) to the full 2CE1 model (7 parameters). Necessity could thus be verified for every single parameter of the 2CE1 model.

Among these right-handed learners, 28% exhibited a contrary leftward bias (FH: n = 14/40; CM: n = 3/21). Those with leftward bias (FH: M = -2.0%, t 13 = 2.29, p = 0.020; CM: M = -2.3%, t 2 = 3.12, p = 0.045) exhibited a smaller (or marginally smaller) absolute magnitude of bias (FH: M = 4.2%, t 38 = 2.84, p = 0.004; CM: M = 5.1%, t 19 = 1.31, p = 0.103) relative to the rightward-bias group (FH: M = 6.4%, t 25 = 6.30, p < 10^-6; CM: M = 7.4%, t 17 = 4.73, p < 10^-4) (Fig 8), but the existence of so many leftward biases among right-handed individuals is noteworthy. The models with a parameter for constant bias (2C through 2CE1) could replicate these effects (p < 0.05), whereas those without the parameter could not at all (p > 0.05). These findings falsify the naïve hypothesis that handedness might determine the direction of constant bias invariably. The unpredictable distribution of an effect as simple as laterality stands among the evidence that, in general, individual differences must be modeled without a-priori distributional assumptions—whether about a random sample of individuals or about the population from which they are drawn (see Discussion).
Fig 8. Constant bias. (a) Based on individual fits of the 2CE1 model, Good and Poor learners were combined and then reclassified according to whether the constant lateral bias was a leftward bias (β R < 0) (magenta bars) or a rightward bias (β R > 0) (cyan bars). The model comparison extended this posterior predictive check and others to another six intermediate models—four models nested within the 2CE1 model featuring exponential hysteresis (2N1, 2E1, 2C, 2CN1) and two models substituting 2-back hysteresis (2N2, 2CN2) but matched for degrees of freedom. For the probabilities of left or right actions, some of these right-handed people actually exhibited a contrary leftward bias; those who did exhibited a smaller absolute magnitude of bias than that of the rightward-bias group (p < 0.05). The models with a parameter for constant bias (2C through 2CE1) could replicate these effects (p < 0.05), falsifying the models that could not at all for lack of this parameter (p > 0.05). (b) Results were replicated in the 7-T Color/Motion version of the experiment.
https://doi.org/10.1371/journal.pcbi.1011950.g008

Bear in mind that optimal behavior results in more frequent alternation of actions in this particular setting. Conversely, naïve alternation does not result in above-chance performance for the aforementioned reasons. Despite the latter fact, behavior was hypothesized to be predisposed to alternation that is independent of states and outcomes after an agent has been alternating actions at the appropriate times due to learning that is dependent on states and outcomes. This hypothesis might initially appear at odds with the typical narrative in the RL literature emphasizing perseveration as naïve action repetition, but here, that would only represent first-order perseveration at the level of actions. At the level of policies, second-order perseveration suggests that a learner in such an environment perseverates from an expert reward-seeking policy of optimal alternation when appropriate to a nonexpert default policy of perseverative alternation whenever.

In keeping with this hypothesis, the alternation-bias group (FH: n = 25/40; CM: n = 16/21) was expected to outnumber the repetition-bias group (FH: n = 15/40; CM: n = 5/21) as well as exhibit an effect on the raw probability of alternation (FH: M = -5.0%, t 24 = 7.32, p < 10^-7; CM: M = -5.4%, t 15 = 4.93, p < 10^-4) (Fig 9). Yet reward-maximizing accuracy was not significantly higher for the alternation-bias group than for the repetition-bias group (FH: M = 3.2%, p > 0.05; CM: M = 2.2%, p > 0.05), confirming the action-specific nature of this bias as a nonexpert heuristic. The arrow of causality for the hypothesis of second-order perseveration primarily points from optimal alternation to perseverative alternation rather than vice versa. These results lend themselves to an analogy with the previously described cohort that was left-biased despite being right-handed, whereas there was still also a sizable repetition-bias group in which some learners instead adhered to a more intrinsic first-order perseveration effect like what has typically been reported in the literature. That is, this learning cohort could sometimes alternate to exploit actions with high estimated reward when appropriate but still perseverated so as to repeat actions according to a more robust default repetition bias (FH: M = 3.3%, t 14 = 2.24, p = 0.021; CM: M = 7.4%, t 4 = 1.06, p = 0.175; nonsignificant, but versus alternation-bias group: M = 12.9%, t 19 = 3.06, p = 0.003). Whereas the models with at least one parameter for hysteretic bias (including the simplest 2N1 model) could replicate these 1-back effects (p < 0.05), the models with no such parameter could not (p > 0.05).
Fig 9. Hysteresis represented by the previous trial. The learners were next reclassified according to whether the hysteretic bias was an alternation bias (β 1 < 0) (violet bars) or a repetition bias (β 1 > 0) (orange bars). With some adhering to a more typical profile of first-order perseveration, the repetition-bias group did retain a substantial effect on the probability of repeating an action independent of state (p < 0.05). However, in keeping with second-order perseveration, the alternation-bias group actually outnumbered and outweighed in effect size the repetition-bias group (p < 0.05). That is, extra alternation could follow from the design feature whereby optimal behavior would more frequently result in alternating actions. In contrast to optimal alternation when appropriate for a given state, this perseverative alternation was action-specific so as to not actually improve reward-maximizing accuracy for the alternation-bias group (p > 0.05). The models with at least one parameter for hysteretic bias could replicate these 1-back effects (p < 0.05). Although the 2C model with constant bias could partially mimic action repetition with a nonsignificant trend, the models without any hysteresis parameters (2 and 2C) could not properly match the empirical 1-back effect (p > 0.05).
https://doi.org/10.1371/journal.pcbi.1011950.g009

Notably, the 2C model with constant bias but no hysteresis could partially mimic the repetition effect observed in the repetition-bias group (with a trending but nonsignificant result, p > 0.05). That is, a true action-repetition effect could be overfitted to some extent by instead representing only imbalanced base rates for actions. Although this reduced constant-only model fails to match the empirical repetition result quantitatively, there is cause for alarm in the qualitative trend that spuriously arises in both data sets. As discussed previously, the present environment represents a distinct active-learning paradigm in which such class imbalance is actually minimized—unlike most other environments with greater confounding in distributions for classes such as those of the actions per se or repetitions versus alternations. In general, omission of repetition bias may inflate estimates of constant bias with limited data if there is insufficient opportunity for repetition to be demonstrated across multiple actions. Likewise, omission of constant bias may inflate estimates of a confounded repetition effect. Conversely, omission of alternation bias may deflate estimates of constant bias because this alternation counteracts the incidental repetition of an action with a greater base rate. The different forms of bias and hysteresis all need to be accounted for comprehensively.
Psychometric modeling of the mixture policy

More quantitatively precise modeling of psychometric functions followed to examine the interface of value-based learning, action-specific effects, and the softmax function determining the mixture policy for action selection. The breadth of this mixture of experts and nonexperts integrated modular elements of basic RL, generalized RL, constant bias, hysteretic bias, and stochasticity from exploration as well as noise. As expected across all subgroups of learners, the probability of an action increased with the difference between the state-dependent action values Q t (s t ,a) learned by the GRL component of the 2CE1 model as fitted to empirical behavior (FH-L: β = 1.544, t 13 = 6.38, p = 10^-5; FH-R: β = 2.084, t 25 = 6.74, p < 10^-6; FH-A: β = 1.682, t 24 = 9.60, p < 10^-9; FH-P: β = 2.316, t 14 = 4.61, p < 10^-3; CM-L: β = 0.938, t 2 = 2.67, p = 0.058; CM-R: β = 1.494, t 17 = 7.20, p < 10^-6; CM-A: β = 1.443, t 15 = 7.20, p < 10^-5; CM-P: β = 1.76, t 4 = 2.97, p = 0.021) (Figs 10 and 11). In determining the probability of left-hand versus right-hand actions, constant bias was derived from the logistic model in the appropriate directions for both the leftward-bias (FH: β = -0.113, t 13 = 2.93, p = 0.006; CM: β = -0.103, t 2 = 2.97, p = 0.049) and rightward-bias (FH: β = 0.265, t 25 = 6.98, p = 10^-7; CM: β = 0.302, t 17 = 5.08, p < 10^-4) groups (Fig 10). The models featuring constant bias could replicate these effects with comparable psychometric functions (p < 0.05), whereas models without the parameter could not (p > 0.05).
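In regression form, such a psychometric function can be sketched as a logistic model of the choice of the right-hand action given the GRL value difference, with the intercept playing the role of constant bias; the variable names below are hypothetical, and the study's actual psychometric analysis may differ in its details (an analogous fit with a 1-back repetition regressor would serve for hysteretic bias).

    import numpy as np
    from scipy.optimize import minimize

    def fit_choice_logistic(value_diff, chose_right):
        # Psychometric model: P(right) = sigmoid(b0 + b1 * value_diff), where
        # value_diff = Q_t(s_t, right) - Q_t(s_t, left) from the fitted GRL component.
        # b0 acts as a constant lateral bias and b1 as the weight on learned value.
        def nll(b):
            z = b[0] + b[1] * value_diff
            p = 1.0 / (1.0 + np.exp(-z))
            p = np.clip(p, 1e-9, 1.0 - 1e-9)
            return -np.sum(chose_right * np.log(p) + (1 - chose_right) * np.log(1 - p))
        return minimize(nll, x0=np.zeros(2), method="BFGS").x  # (bias, value weight)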
Fig 10. Psychometric modeling of constant bias. The probability of an action increased with the difference between action values Q t (s t ,a) derived from the GRL component of the 2CE1 model as fitted to empirical behavior (p < 0.05). Constant bias was derived from a logistic model in the appropriate directions for both the leftward-bias and rightward-bias groups (p < 0.05). The models featuring constant bias could replicate these effects with quantitative precision as well (p < 0.05), whereas models without the parameter could not (p > 0.05). The nine plots per row each have an identical x-axis despite omission of tick labels from every other plot for readability. Error bars indicate standard errors of the means.
https://doi.org/10.1371/journal.pcbi.1011950.g010

For the probability of repeated versus alternated actions independent of state, in turn, hysteretic bias was derived from the logistic model in the appropriate directions for both the alternation-bias (FH: β = -0.178, t 24 = 5.21, p = 10^-5; CM: β = -0.220, t 15 = 5.31, p < 10^-4) and repetition-bias (FH: β = 0.218, t 14 = 4.79, p = 10^-4; CM: β = 0.462, t 4 = 1.35, p = 0.124; nonsignificant, but versus alternation-bias group: M = 0.682, t 19 = 3.51, p = 0.001) groups (Fig 11). The models featuring at least one parameter for hysteretic bias could replicate these 1-back effects with comparable psychometric functions (p < 0.05), and while models without the parameter could not (p > 0.05), the solitary constant bias of the 2C model does deceptively mimic repetition with a nonsignificant trend.
Fig 11. Psychometric modeling of hysteresis represented by the previous trial. For the probabilities of alternated or repeated actions, hysteretic bias was likewise derived from a GRL-based logistic model in the appropriate directions for both the alternation-bias and repetition-bias groups (p < 0.05). The models featuring at least one parameter for hysteretic bias could replicate these 1-back effects with comparable psychometric functions (p < 0.05), and while models without the parameter could not (p > 0.05), the 2C model could again deceptively mimic repetition with a nonsignificant trend.
https://doi.org/10.1371/journal.pcbi.1011950.g011
Dynamics of action hysteresis

The hysteresis trace of the 2CE1 model extends its temporal horizon beyond the 1-back effects examined thus far. For the preceding posterior predictive checks, the extra parameter for exponential decay could not explicitly show the full extent of its impact—showing instead only subtle quantitative improvement. If this costly free parameter were to be justified, its improvement for the model would need to also be qualitative and substantial. Considering that the 2CE1 model has already been shown to outperform both simpler and more complex implementations of hysteresis overall, the assumption of two parameters for exponential hysteresis must provide a superior parsimonious fit for effects of action history ranging from 2-back onward with an indefinite horizon. Moreover, 2-parameter exponential hysteresis outperformed n-back models for not only n = 1 but also n = 2 (2CN1 and 2CN2), establishing that the significant weight beyond 1-back must extend not only to 2-back effects but also to 3-back and beyond.

Accordingly, hysteretic effects were explored directly up to eight trials back. The probability of a repeated action was now conditioned on each respective action from the eight most recent trials (Fig 12; see Fig P in S1 Text for distributions of runs of consecutive repeats). As expected for the repetition-bias group, this probability of repeating a previous action (FH: M = 3.3%, t 14 = 2.24, p = 0.021; CM: M = 7.4%, t 4 = 1.06, p = 0.175; nonsignificant, but versus alternation-bias group: M = 12.9%, t 19 = 3.06, p = 0.003) was elevated above chance prior to 1-back as well (FH: M = 4.1%, t 14 = 3.39, p = 0.002; CM: M = 8.3%, t 4 = 1.83, p = 0.070 with marginal significance) and remained elevated. Conversely, for the alternation-bias group, this probability returned from a 1-back alternation effect (FH: M = -5.0%, t 24 = 7.32, p < 10^-7; CM: M = -5.4%, t 15 = 4.93, p < 10^-4) to the chance level prior to 1-back (FH: M = -0.3%, p > 0.05; CM: M = -0.4%, p > 0.05) as it increased slightly thereafter. Only the models with exponential hysteresis (2E1 and 2CE1) could match the shapes of the action-history curves, and the addition of constant bias made the correspondence even more precise. Concerning its pitfall of mimicry, constant bias alone (2C) manifests as an across-trial increase in the probability of repetition that superficially resembles the multitrial signature of an extended hysteresis trace.
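The model-independent measure used here can be sketched as follows, assuming a binary action vector per participant and ignoring block boundaries or other exclusions that the actual analysis may have applied.

    import numpy as np

    def action_history_curve(actions, max_lag=8):
        # For each lag n, the probability that the current action matches the action
        # chosen n trials back, irrespective of state. Values above 0.5 indicate net
        # repetition of that past action; values below 0.5 indicate net alternation.
        actions = np.asarray(actions)
        curve = []
        for n in range(1, max_lag + 1):
            curve.append(np.mean(actions[n:] == actions[:-n]))
        return np.array(curve)  # index 0 corresponds to 1-back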
Fig 12. Hysteresis represented across multiple trials. Here the scope of hysteresis was extended to previous actions up to eight trials back. For the repetition-bias group, this probability of repeating a previous action remained elevated above chance prior to 1-back (p < 0.05). For the alternation-bias group, this probability instead returned from a 1-back alternation effect (p < 0.05) to chance prior to 1-back as it increases backward (p > 0.05). Only the models with exponential hysteresis could properly match the shapes of the action-history curves, and the addition of constant bias made the correspondence even more precise. With regard to mimicry, an upward shift in the curve from constant bias in the 2C model superficially resembles the autocorrelational signature of repetition across multiple trials with exponential hysteresis. The nine plots per row each have an identical x-axis despite omission of tick labels from every other plot for readability. Error bars indicate standard errors of the means.
https://doi.org/10.1371/journal.pcbi.1011950.g012

To better interpret the preceding model-independent time courses, the fitted parameters of the GRL model with either exponential or n-back (i.e., 4-back) hysteresis provide context by explicitly factoring out confounds in constant bias as well as the effects of value-based learning (Fig 13). (The selection of 4-back is only for comparison of action-history curves, as the corrected fit of the 9-parameter 2CN4 model was actually worse than that of 2CN2 after adding two more free parameters.) This juxtaposition of parametric and nonparametric implementations of hysteresis revealed notably close correspondence for at least the first two trials back. However subtle the correspondence may be for decaying 3-back and 4-back effects, the superior overall fit of the exponential model relative to a simpler 2-back model (2CN2) already indicated the persistence of collectively significant cumulative effects from 3-back and beyond. Moreover, omission of constant bias (2E1 or 2N4) consistently inflated all of the modeled repetition weights, revealing the source of the mimicry between constant bias and repetition—especially in the persistent exponential form—that was alluded to with posterior predictive checks. The 3-parameter adjunct of constant bias and exponential hysteresis proves necessary as well as largely sufficient to distill the action-specific aspects of individual behavioral profiles.
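As a small worked example of the correspondence plotted in Fig 13, the 2-parameter exponential trace implies one repetition (or alternation) weight per lag, directly comparable to the per-lag parameters of the nonparametric n-back models; the numerical values below are illustrative.

    import numpy as np

    def exponential_hysteresis_weights(beta_1, lambda_H, max_lag=4):
        # Per-lag weights implied by exponential hysteresis: the action n trials back
        # carries weight beta_1 * lambda_H**(n - 1), comparable to beta_n in an n-back model.
        lags = np.arange(1, max_lag + 1)
        return beta_1 * lambda_H ** (lags - 1)

    # e.g., an alternation-biased profile (negative beta_1) decaying across lags:
    print(exponential_hysteresis_weights(beta_1=-0.3, lambda_H=0.5))
    # -> [-0.3, -0.15, -0.075, -0.0375]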
Fig 13. Hysteresis parameters with exponential or nonparametric models. The fitted parameters of the GRL model with either exponential or 4-back hysteresis are plotted as repetition weights (or alternation if negative)—simply β n for n-back models or the corresponding weights β 1 λ H ^(n-1) in the exponential function. Action-specific effects are better illuminated here by explicitly factoring out effects of RL and GRL within the comprehensive model. There is close correspondence between these parametric (2E1 and 2CE1) and nonparametric (2N4 and 2CN4) implementations of hysteresis for at least the first two trials back. The need for a scope extending beyond 1-back demands more than one free parameter, and a proper hysteresis trace with exponential decay yields an even better fit than a scope of 2-back due to subtle effects from 3-back and beyond. As further evidence of interactions among parameters, omission of constant bias (2E1 or 2N4) consistently inflated the modeled repetition weights as they were forced to attempt to mimic the necessary third parameter for constant bias. Altogether, the CE1 adjunct is essential. Error bars indicate standard errors of the means.
https://doi.org/10.1371/journal.pcbi.1011950.g013
Different forms of bias and hysteresis versus learning performance

The first set of analyses originally split the three levels of learning performance without splitting directions of action biases, whereas the second split directions of bias across Good and Poor learners without splitting levels of learning performance. For this final stage, participants were further divided into six subgroups that separated the two directions of either form of bias as well as the three levels of learning performance—this time also plotting the two directions for previously omitted Nonlearners. There are statistical limitations with this next degree of granularity, which left some of the subgroups with a small sample, but these intersectional subgroups are worth consideration even if only to verify that the main effects essentially extend to this level as well. With respect to the first set of original findings, action bias and hysteresis were significant for Good learners but even more pronounced for Poor learners and Nonlearners (Figs 6 and 7). Second, 2CE1 simulations modeled with constant bias and exponential hysteresis could replicate the directions and magnitudes of empirical action-specific effects both qualitatively and quantitatively (Figs 8 and 9). Notwithstanding the lack of statistical significance in a few of the smallest samples, these trends from either two or three groups consistently held true with the scrutiny of their interface within the six subgroups (Figs Q and R in S1 Text).
Alternatives to state-independent action hysteresis

With the primary model comparison establishing that the 2CE1 model has the ideal architecture among the 72 models compared thus far, what follows are other possibilities that could be considered instead of or in addition to state-independent action hysteresis for comparable effects and possible confounds. In other words, these factors could ultimately relate to some form of repetition or alternation across the sequence of action choices. The list of alternative features includes state-dependent action hysteresis H t (s t ,a) (cf. [21]), state-independent action value Q t (a), confirmation bias in learning that weighs positive outcomes over negative with the constraint α N < α P (i.e., only optimism), or asymmetric learning rates with flexibility in the possibilities for α N ≠ α P (i.e., optimism or pessimism).

Parsimony is paramount here, and none of these alternatives are as parsimonious as basic hysteresis that is both outcome-independent and state-independent. Take, for example, certain instances of action repetition: Rather than default attribution to a more general optimistic confirmation bias for learning [64–68], first-order perseveration may offer a more parsimonious explanation for some observations. As mentioned for RL, confirmation bias can translate to an asymmetry in learning rates favoring positive over negative outcomes [69–78]—but at the cost of greater susceptibility to overfitting relative to state-dependent or state-independent hysteresis [42,44,46,79–81], which can manifest its own sort of outcome-independent confirmation bias (see Discussion). (Moreover, as option values become relative in the action policy, the action generalization of GRL can also achieve effects comparable to what asymmetric learning rates might otherwise produce. This point is beyond the present scope but illustrates the broader issue of compounding complexity across the many possibilities that a model could incorporate.)

The initial round of analyses for this extended model comparison began with substitutions of the factors of interest so as to test—and presumably falsify—their alternative hypotheses for the origins of repetition and alternation biases that state-independent hysteresis has been shown to account for with the posterior predictive checks above. Qualitative falsification was indeed robust for all four alternatives, such that none of these model features were capable of generating the original action-history curves that only state-independent action hysteresis could produce (Fig 14 and Fig S in S1 Text). These falsifications were hypothesized a priori in consideration of the following conceptual distinctions.
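Of the alternatives listed above, the asymmetric-learning-rate variants are the simplest to sketch; a minimal stand-in for the “cLR” and “LR” features follows, with the caveat that the study's exact parameterization may differ.

    def asymmetric_value_update(q, r, alpha_pos, alpha_neg):
        # Value update with asymmetric learning rates: positive reward-prediction
        # errors are weighted by alpha_pos and negative ones by alpha_neg.
        # A confirmation-bias ("cLR") variant constrains alpha_neg < alpha_pos
        # (optimism only), whereas the unconstrained ("LR") variant also allows
        # pessimism (alpha_neg > alpha_pos).
        delta = r - q
        alpha = alpha_pos if delta > 0 else alpha_neg
        return q + alpha * delta, delta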
Fig 14. Alternatives to state-independent action hysteresis. Compare to Fig 12. To falsify alternative hypotheses concerning the origins of the apparent effects of state-independent action hysteresis H t (a) (“2CE1”), the model comparison was first extended to test substitution of state-dependent action hysteresis H t (s t ,a) (“sE1+2C”), state-independent action value Q t (a) (“Qa+2C”), confirmation bias in learning with the constraint α N < α P (“cLR+2C”), or asymmetric learning rates with no constraint for α N ≠ α P (“LR+2C”). As expected, none of these alternatives were capable of generating the original action-history curves that only state-independent action hysteresis could produce.
https://doi.org/10.1371/journal.pcbi.1011950.g014

First, state-dependent hysteresis (“sE1+2” or “2sE1”) would not align with state-independent hysteresis because the four states were rotated in sequence such that there were variable numbers of trials between the origins and consequences of state-dependent effects. In keeping with this point, only a subtle repetition effect emerged after two trials back. For the original repetition-bias group, the effect sizes were nonexistent for one trial back and quantitatively insufficient from two trials back onward. Furthermore, for the original alternation-bias group, the emergent repetition effect was actually counterproductive such that it pointed in the opposite direction.

Second, state-independent action value (“Qa+2”) is unlike state-independent action hysteresis inasmuch as action value is outcome-dependent while action hysteresis is outcome-independent. In principle, there is potential for some degree of confounding if actions that are rewarded consistently end up being repeated consistently. However, in this controlled environment, state-independent action value had little impact on the action-history curves. For the second data set at least, there was a subtle alternation effect in both the original alternation-bias group and the original repetition-bias group—counterproductively for the latter.

Third, confirmation bias in learning (“cLR+2”) is generally limited to action repetition and is not only outcome-dependent but also state-dependent in the presence of rotating states here. Like with state-dependent hysteresis, there was only a subtle repetition effect from two trials back onward. However, unlike with state-dependent hysteresis, model simulations for the alternation-bias group did not exhibit a contrary repetition bias.

Fourth, a more flexible asymmetry in learning rates (“LR+2”), including either an optimistic confirmation bias or a pessimistic doubt bias, is again state- and outcome-dependent in the presence of rotating states here. Notably, not all participants in the repetition-bias group adhered to the rule of α N < α P in the absence of the constraint forcing confirmation bias. Hence the action-history curve for the repetition-bias group was not elevated above chance beyond 2-back as before with the constrained “cLR+2” result. Instead, the unconstrained asymmetry of “LR+2” produced a 1-back alternation effect for both groups—that is, also counterproductively for the repetition-bias group. With respect to the alternation-bias group, the model’s effect was insufficient in magnitude to quantitatively account for the actual effect observed.
[END]
---
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011950
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/