


Understanding dual process cognition via the minimum description length principle [1]

['Ted Moskovitz', 'Gatsby Computational Neuroscience Unit', 'University College London', 'London', 'United Kingdom', 'Google DeepMind', 'Kevin J. Miller', 'Department Of Ophthalmology', 'Maneesh Sahani', 'Matthew M. Botvinick']

Date: 2024-12

William James famously distinguished between two modes of action selection, one based on habit and the other involving effortful deliberation [1]. This idea has since ramified into a variety of “dual-process” theories in at least three distinct domains of psychology and neuroscience. One of these domains concerns executive control, and distinguishes action selection that is “automatic”, reflecting robust stimulus-response associations, from that which is “controlled”, overriding automatic actions when necessary [2, 3]. A second focuses on reward-based learning, distinguishing behavior that is sensitive to current goals (“goal-directed” or “model-based”) from that which is habitual [4, 5]. The third addresses judgment and decision making (JDM), where canonical theories distinguish between two cognitive systems: a “System 1”, which employs fast and frugal heuristic decision strategies, and a “System 2”, which supports more comprehensive reasoning [6, 7].

While the reduction of action selection to dual processes is undoubtedly a simplification, across these three domains, dual-process models have accumulated considerable empirical support, and each domain has developed explicit computational models of how dual processes might operate and interact [3, 5, 8–14]. These computational models, however, are typically domain-specific, reproducing behavioral phenomena that are within the scope of their domain. It remains unknown whether dual-process phenomena in different domains result from different sets of computational mechanisms, or whether they can be understood as different manifestations of a single, shared set. That common mechanisms might be at play is suggested by a wealth of neuroscientific data. Specifically, studies have linked controlled behavior, model-based action selection, and System-2 decision making with common circuits centering on the prefrontal cortex [2, 4, 15–18], while automatic behavior, habitual action selection, and heuristic decision making appear to engage shared circuits lying more posterior and running through the dorsolateral striatum [18–21]. While further study into these neuroanatomical relationships is required, these results raise the question of whether a single computational model could account for these patterns of decision-making.

In this work, we add to the growing body of literature which seeks a normative explanation for these phenomena [22–26]. That is, we seek a theory that can reproduce behavioral findings associated with dual process cognition, but which is derived instead from an optimization principle, allowing dual process cognition to be understood as the solution to a fundamental behavioral or computational problem. To identify such a principle, we begin by considering a fundamental problem confronting both biological and machine intelligence: generalization. We discuss a fundamental computational theory of generalization, link it to behavior, and demonstrate that a recently-proposed behavioral model from machine learning based on this principle can successfully reproduce canonical dual-process phenomena from executive control, reward-based learning, and JDM.

In order to practically model these codes, one can choose from a number of universal coding schemes, so-called because they are guaranteed to behave monotonically with respect to the true underlying code lengths. One such encoding scheme is the variational code [33–35], which implements the deviation cost via the negative log-likelihood of the data under the model and the complexity cost as the Kullback-Leibler (KL) divergence between the model and a sparse base distribution. Minimizing this objective is equivalent to performing variational inference with a particular choice of simplicity-inducing prior. While there are many such universal coding schemes [35, 36], we focus on the variational code in this work due to its compatibility with neural network implementation.
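To make the variational code concrete, the sketch below (in Python with PyTorch) fits a one-parameter regression model by minimizing a negative log-likelihood term (the deviation cost) plus a KL term against a fixed base distribution (the complexity cost). The toy data, the narrow zero-mean Gaussian standing in for a sparsity-inducing base distribution, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.distributions as dist

# Sketch of a variational code for a one-parameter model: the total
# description length is approximated by E_q[-log p(D|w)] + KL(q(w) || p(w)),
# i.e. a deviation (negative log-likelihood) term plus a complexity term
# measured against a simplicity-inducing base distribution. The narrow
# zero-mean Gaussian base used here is an illustrative stand-in, not the
# specific choice made in the paper.

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 50)
y = 0.7 * x + 0.1 * torch.randn(50)            # toy regression data

mu = torch.zeros(1, requires_grad=True)        # variational posterior q(w)
log_sigma = torch.zeros(1, requires_grad=True)
prior = dist.Normal(0.0, 0.1)                  # simple "sparse" base distribution
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    q = dist.Normal(mu, log_sigma.exp())
    w = q.rsample()                            # reparameterized weight sample
    nll = -dist.Normal(w * x, 0.1).log_prob(y).sum()   # deviation: L(D|M)
    complexity = dist.kl_divergence(q, prior).sum()    # complexity: L(M)
    loss = nll + complexity                    # variational code length
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"posterior mean weight: {mu.item():.3f}")
```

Minimizing this sum trades data fit against closeness to the base distribution, which is the sense in which the variational code implements the deviation/complexity balance described here.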

A fundamental demand of intelligent behavior is to capitalize on past learning in order to respond adaptively to new situations, that is, to generalize. Humans in particular show a remarkable capacity for behavioral generalization, to such a degree that this has been regarded as one of the hallmarks of human intelligence [27]. From a modeling perspective, one way to generalize is to capture shared structure underlying the tasks with which a decision-maker is faced. However, if a model has too many degrees of freedom, it can overfit to noise in the observed data, which may not reflect the true distribution. In approaching this problem, the machine learning literature points consistently to the importance of compression: in order to build a system that effectively predicts the future, the best approach is to ensure that the system accounts for past observations in the most compact or economical way possible [28–31]. This Occam’s Razor-like philosophy is formalized by the minimum description length (MDL) principle, which prescribes finding the shortest solution, written in a general-purpose programming language, that accurately reproduces the data, an idea rooted in Kolmogorov complexity [32]. Given that actually computing Kolmogorov complexity is impossible in general, MDL theory advocates for a more practical approach, proposing that the best representation or model M for a body of data D is the one that minimizes the expression

L(M) + L(D|M). (1)

Here, L(M) is the description length of the model, that is, the number of bits it would require to encode that model, a measure of its complexity [30]. L(D|M), meanwhile, is the description length of the data given the model, that is, an information measure indicating how much the data deviates from what is predicted by the model. In short, MDL favors the model that best balances deviation against complexity, encoding as much of the data as it can while also remaining as simple as possible.
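To make the two-part score in Eq (1) concrete, the following toy Python calculation compares the total description length L(M) + L(D|M) of two candidate models of a biased binary sequence. The data, the 8-bit parameter encoding, and both models are illustrative assumptions, not examples drawn from the paper.

```python
import numpy as np

# Toy illustration of the two-part MDL score L(M) + L(D|M) (Eq 1), in bits:
# compare a parameter-free fair-coin model against a fitted Bernoulli model
# whose single parameter is transmitted at a fixed 8-bit precision.

data = np.array([1] * 48 + [0] * 16)           # a biased binary sequence
n = len(data)

def nll_bits(p):
    """L(D|M): bits needed to encode the data under Bernoulli(p)."""
    eps = 1e-12
    return -np.sum(data * np.log2(p + eps) + (1 - data) * np.log2(1 - p + eps))

# Model A: fair coin -- nothing to encode about the model itself.
L_model_A, L_data_A = 0.0, nll_bits(0.5)

# Model B: fitted Bernoulli, with its parameter sent at 8-bit precision.
p_hat = data.mean()
L_model_B, L_data_B = 8.0, nll_bits(p_hat)

for name, Lm, Ld in [("fair coin", L_model_A, L_data_A),
                     ("fitted Bernoulli", L_model_B, L_data_B)]:
    print(f"{name:16s} L(M) = {Lm:5.1f}  L(D|M) = {Ld:6.2f}  total = {Lm + Ld:6.2f}")
```

Here the fitted model pays 8 extra bits to describe itself but saves more than that in describing the data, so MDL prefers it; with a shorter or less regular sequence the simpler model would win, which is exactly the deviation/complexity balance the principle prescribes.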

While our simulations focus on target phenomena that have been documented across many experimental studies, in presenting each simulation we focus on observations from one specific (though representative) empirical study, to provide a concrete point of reference. It should be noted that the target phenomena we address, in almost all cases, take the form of qualitative rather than quantitative patterns. Our statistical tests, described in S1 Methods, thus take the form of qualitative hypothesis tests rather than quantitative fits to data, paralleling the reference experimental research.

For each target phenomenon, we pursue the same approach to simulation: We begin with a generic MDL-C agent model, configured and initialized in the same way across simulations (with the exception of input and output unit labels tailored to the task context). The model is then trained on an appropriate target task, and its behavior or internal computations are queried for comparison with the target phenomena. Importantly, the model is in no case directly optimized to capture target phenomena, only to solve the task at hand. In the rare case where target effects depend sensitively on experimenter-chosen hyperparameters of MDL-C, this dependency is described alongside other results.

A detailed description of simulation methods, sufficient to fully replicate our work, is presented in S1 Methods. Briefly, for each target dual-process domain, we focused on a set of empirical phenomena that the relevant specialty literature treats as fundamental or canonical. We do not, of course, address all behavioral and neural phenomena that might be considered relevant to constrain theory in each domain, and we dedicate a later section to the question of whether any empirical findings that we do not directly model might present challenges for our theory. Nevertheless, the core phenomena in each field are fairly well recognized, and we expect our selections will be uncontroversial. Indeed, each target phenomenon has been the focus of previous computational work, and we dedicate a later section to comparisons between our modeling approach and previous proposals. While such comparisons are of course important, one point that we continue to stress throughout is that no previous model has addressed the entire set of target phenomena, bridging the three domains we address.

Having established these points, we are now in a position to advance the central thesis of the present work: We propose that MDL-C may offer a useful normative model for dual-process behavioral phenomena. As in dual-process theory, MDL-C contains two distinct decision-making mechanisms. One of these (corresponding to RNN π 0 in Fig 1A) distills as much target behavior as possible in an algorithmically simple form, reminiscent of the habit system or System 1 in dual-process theory. Meanwhile, the other (RNN π ) enjoys greater computational capacity and intervenes when the simpler mechanism fails to select the correct action, reminiscent of executive control or System 2 in dual-process theory. MDL-C furnishes a normative explanation for this bipartite organization by establishing a connection with the problem of behavioral generalization. To test whether MDL-C can serve as such a model, we conducted a series of simulation studies spanning the three behavioral domains where dual-process theory has been principally applied: executive control in Simulation 1, reward-based decision making in Simulation 2, and JDM in Simulation 3.

Equipped with this runnable implementation, we can return to the problem of generalization and ask whether MDL-C in fact enhances generalization performance. In other words, we would like to verify that this regularization enables the agent to adapt more quickly than it would otherwise to new goals. Fig 1B and 1C presents relevant simulation results (see also S1 Methods, and [38] for related theoretical analysis and further empirical evaluation). When our MDL-C agent is trained on a set of tasks from a coherent domain (e.g., navigation or gait control) and then challenged with a new task from this same domain, it learns faster than an agent with the same architecture but lacking MDL regularization. In short, policy compression, following the logic of MDL, enhances generalization. For further examples, see [38].

A: Neural network implementation of MDL-C. Perceptual observations (input o) feed into two recurrent networks. The lower pathway (RNN π 0 ) has noisy connections with variational dropout (VDO) regularization and outputs the action distribution π 0 . The upper pathway (RNN π ) outputs the distribution π, which overwrites π 0 . The KL divergence between the policies is computed, and action a is selected from π. B: MDL regularization enhances generalization. Left: MDL-C agent vs. unregularized baseline (Standard RL) in a grid navigation task. The bar plot shows the average number of trials needed to find the shortest path to new goals. Right: Average reward in a continuous control task. MDL-C learns faster with related task experience and outperforms Standard RL.

Recent advances in artificial intelligence (AI) allow us to implement MDL-C in the form of a runnable simulation model, as diagrammed in Fig 1 (see S1 Methods). Here, both policy π and policy π 0 are parameterized as identical recurrent neural networks, both receiving the same perceptual inputs. On every time-step, the network implementing the reference policy π 0 —henceforth RNN π 0 —outputs a probability distribution over actions. That distribution is then updated by the network implementing policy π (RNN π ), and the agent’s overt action is selected (see S1 Methods). To implement MDL regularization using a variational code, the deviation term L(π|π 0 ) is quantified as the KL divergence between the two policies π and π 0 , consistent with the fact that the KL divergence represents the amount of information required to encode samples from one probability distribution (here π) given a second reference distribution (π 0 ). In order to implement the complexity cost L(π 0 ), we apply a technique known as variational dropout (VDO) [41]. VDO applies a form of multiplicative Gaussian noise to the network activations which is equivalent to applying a KL divergence penalty between the distribution over model weights and a sparse prior. There are multiple possible choices for such a prior, but we apply the Jeffreys prior [42], which, in conjunction with a policy distribution in the exponential family, is asymptotically equivalent to the normalized maximum likelihood estimator, perhaps the most fundamental MDL estimator [43]. For more details, see the S1 Methods section and [38].
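The following schematic PyTorch sketch conveys the shape of this architecture, though it is not the authors’ implementation: two recurrent pathways receive the same observation, multiplicative Gaussian noise on the hidden activations of the default pathway stands in for variational dropout, and the KL divergence between the two action distributions provides the deviation term. The network sizes, the noise level alpha, and the direct sampling of the overt action from π are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.distributions as dist

# Schematic MDL-C forward pass (illustrative, not the authors' code):
# RNN_pi0 is the noisy default pathway, RNN_pi the control pathway.

class MDLCAgent(nn.Module):
    def __init__(self, obs_dim=10, hidden=64, n_actions=4, alpha=0.5):
        super().__init__()
        self.rnn_pi0 = nn.GRUCell(obs_dim, hidden)   # default / reference pathway
        self.rnn_pi = nn.GRUCell(obs_dim, hidden)    # control pathway
        self.head_pi0 = nn.Linear(hidden, n_actions)
        self.head_pi = nn.Linear(hidden, n_actions)
        self.alpha = alpha                           # variance of multiplicative noise

    def forward(self, obs, h_pi0, h_pi):
        h_pi0 = self.rnn_pi0(obs, h_pi0)
        if self.training:                            # noisy activations in RNN_pi0 only,
            noise = 1 + self.alpha ** 0.5 * torch.randn_like(h_pi0)
            h_pi0 = h_pi0 * noise                    # a crude stand-in for VDO
        h_pi = self.rnn_pi(obs, h_pi)
        pi0 = dist.Categorical(logits=self.head_pi0(h_pi0))
        pi = dist.Categorical(logits=self.head_pi(h_pi))
        kl = dist.kl_divergence(pi, pi0)             # deviation term D_KL(pi || pi0)
        action = pi.sample()                         # overt behavior comes from pi
        return action, pi, pi0, kl, h_pi0, h_pi

agent = MDLCAgent()
h0 = h1 = torch.zeros(1, 64)
obs = torch.randn(1, 10)
action, pi, pi0, kl, h0, h1 = agent(obs, h0, h1)
print(action.item(), kl.item())
```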

This principle is applicable to any behaviorally defined objective function (e.g., imitation learning [39]). In our simulations, we consider objective functions defined via the reinforcement learning framework (RL; [40]), in which the environment delivers quantitative ‘rewards’ in a way that depends on its state and on the agent’s actions, and the agent attempts to maximize these rewards. This framework is appealing for modeling behavior in tasks from multiple disciplines, as it assumes no a priori access to a model of the world, generalizes a number of other learning paradigms (e.g., any supervised learning problem can be cast as an RL task), and can be adapted to both simple and complex observation types via function approximation. These objectives can be combined in the following expression:

R − λ [L(π|π 0 ) + L(π 0 )], (2)

where R denotes cumulative reward and λ is a weighting parameter. Maximizing this objective yields a form of regularized policy optimization which [38] call minimum description length control (MDL-C). At an intuitive level, MDL-C trains the learning agent to formulate a policy that maximizes reward while also staying close to a simpler or more compressed reference policy. By compressing useful behavioral patterns from past experience, this default policy can guide the control policy to more quickly find solutions to new tasks [38]. This division of the agent into two modules, one of which is incentivized to solve new tasks and the other to compress those solutions, is reminiscent of the many dual-process theories in psychology and neuroscience. Crucially, this organization is here derived from first-principles reasoning about the requirements of combining the MDL principle with adaptive behavior, rather than from neuroscientific or psychological data.
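As a rough illustration of how the terms in Eq (2) might be assembled during training, the sketch below combines a REINFORCE-style surrogate for the reward term with the λ-weighted description-length terms. The function name, the particular policy-gradient estimator, and the dummy rollout quantities are illustrative assumptions; they are not the optimization scheme used in the paper.

```python
import torch

def mdlc_loss(log_probs, rewards, kl_terms, complexity, lam=0.1, gamma=0.99):
    """Illustrative single-episode loss: -(reward surrogate) + lam * description length."""
    # Discounted return-to-go for each step.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Reward term R: REINFORCE surrogate whose gradient increases cumulative reward.
    pg_loss = -(torch.stack(log_probs) * returns).sum()
    # Description-length terms: per-step deviation D_KL(pi || pi0) plus the
    # complexity penalty on the default policy (e.g. the VDO KL term).
    dl_loss = torch.stack(kl_terms).sum() + complexity
    return pg_loss + lam * dl_loss

# Toy usage with dummy rollout quantities (in practice these would come from
# rolling out a two-pathway agent like the one sketched earlier).
log_probs = [torch.tensor(-1.2, requires_grad=True) for _ in range(3)]
loss = mdlc_loss(log_probs, rewards=[0.0, 0.0, 1.0],
                 kl_terms=[torch.tensor(0.05)] * 3,
                 complexity=torch.tensor(0.3))
loss.backward()
print(loss.item())
```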

Given a normative principle for generalization, the next step in developing our model is to apply the MDL principle in the context of decision-making. This means defining an ‘agent’ that receives observations of the environment and emits actions based on an adjustable ‘policy,’ a mapping from situations to actions. A ‘task’ is defined as a combination of an environment and some objective that the agent’s policy is optimized to accomplish within that environment. The MDL principle holds that learning is the process of discovering regularity in data, and that any regularity in the data can be used to compress it [37]. In order to apply MDL theory to an agent, then, we must define what exactly the “data” are that we want to compress. [38] propose that agents faced with a multitude of tasks should aim to identify common behavioral patterns that arise in the solutions to these tasks. In other words, the data that the agent should seek to compress are useful patterns of interaction with the world—optimal policies—for solving the problems it most frequently faces. To align with MDL theory, a behavioral system for generalization needs to accomplish two objectives. First, it must generate data by solving tasks, and second, it must identify useful structure in these data through compression. These objectives are assigned to two processes: a behavioral, or “control” policy π, which aims to find solutions to new tasks, and an auxiliary, or “default” policy π 0 , which attempts to compress these solutions.

Results

Simulation 3: Judgment and decision making. As noted earlier, dual-process models in JDM research distinguish between System-1 and System-2 strategies, the former implementing imprecise heuristic procedures, and the latter sounder but more computationally expensive analysis [6, 7]. As in the other dual-process domains we have considered, there appears to be a neuroanatomical dissociation in this case as well, with System-2 responses depending on prefrontal computations [15, 16]. Recent research on heuristics has increasingly focused on the hypothesis that they represent resource-rational approximations to rational choice [26]. In one especially relevant study, [24] proposed that heuristic decision making arises from a process that “controls for how many bits are required to implement the emerging decision-making algorithm” (p. 8). This obviously comes close to the motivations behind MDL-C. Indeed, [24] implement their theory in the form of a recurrent neural network, employing the same regularization that we apply to our RNN π 0 . They then proceed to show how the resulting model can account for heuristic use across several decision-making contexts. One heuristic they focus on, called one-reason decision making, involves focusing on a single choice attribute to the exclusion of others [69]. As shown in Fig 6A, reproduced from [24], a description-length regularized network, trained under conditions where one-reason decision making is adaptive (see [24] and S1 Methods), shows use of this heuristic in its behavior, as also seen in human participants performing the same task. In contrast, an unregularized version of the same network implements a more accurate but also more expensive “compensatory” strategy, weighing choice features more evenly.


Fig 6. A. Heuristic one-reason decision making (left) and richer compensatory decision making (right) in a multi-attribute choice task, from [24]. Gini coefficients, on the y axis, capture the degree to which decisions depend on one feature (higher values, with an asymptotic maximum of one) versus all features evenly (zero), with references for one-reason decision making (single cue) and a fully compensatory strategy (equal weighting) indicated. Data points for each trial correspond to observations from separate simulation runs. Human participants in the study displayed both patterns of behavior, depending on the task conditions. B. Behavior of MDL-C in the task from [24], under conditions where human participants displayed one-reason decision making. C. Behavior of π 0 (left) and π (right) when the KL penalty for divergence between the two policies is reduced (see S1 Methods). D. In the simulation from panel C, the divergence between policies is increased when the agent emits a non-heuristic decision. https://doi.org/10.1371/journal.pcbi.1012383.g006

As illustrated in Fig 6B, when MDL-C is trained on the same task as the one used by [24] (see S1 Methods), it displays precisely the same heuristic behavior those authors observed in their human experimental participants. Digging deeper, MDL-C provides an explanation for some additional empirical phenomena that are not addressed by [24] or, to the best of our knowledge, any other previous computational model. In an experimental study of one-reason decision making, [69] observed that application of the heuristic varied depending on the available payoffs. Specifically, heuristic use declined as the relative cost of applying a compensatory strategy, taking more feature values into account, decreased. MDL-C shows the same effect. When the weighting of the deviation term D KL (π||π 0 ) is reduced relative to the value-maximization term in the MDL-C objective (see S1 Methods), the policy π and thus the agent’s behavior take on a non-heuristic compensatory form (Fig 6C). Critically, in this case MDL-C instantiates the non-heuristic policy side-by-side with the heuristic policy, which continues to appear at the level of π 0 . This aligns with work suggesting that System-1 decision making can occur covertly even in cases where overt responding reflects a System-2 strategy. In particular, [15] observed activation in prefrontal areas associated with conflict detection in circumstances where a tempting heuristic response was successfully overridden by fuller reasoning (see also [16]). A parallel effect is seen in our MDL-C agent in the degree of conflict (KL divergence) between policies π and π 0 (Fig 6D).
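To give a concrete handle on the index plotted on the y axis of Fig 6A, the short Python sketch below computes a standard Gini coefficient over the absolute weights a decision rule places on the choice attributes: values near one indicate one-reason decision making, values near zero an equal-weighting compensatory strategy. This uses the textbook Gini formula; the exact analysis pipeline of [24] may differ in detail.

```python
import numpy as np

def gini(weights):
    """Gini coefficient of non-negative decision weights
    (0 = equal weighting, (n-1)/n = all weight on a single cue)."""
    w = np.sort(np.abs(np.asarray(weights, dtype=float)))   # ascending order
    n = len(w)
    if w.sum() == 0:
        return 0.0
    idx = np.arange(1, n + 1)
    return 2.0 * np.sum(idx * w) / (n * w.sum()) - (n + 1) / n

print(gini([0.0, 0.0, 0.0, 1.0]))       # one-reason rule -> 0.75
print(gini([0.25, 0.25, 0.25, 0.25]))   # equal weighting -> 0.0
```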

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012383

Published and (C) by PLOS (PLOS Computational Biology)
Content appears here under this condition or license: Creative Commons Attribution (CC BY) 4.0.
