Bonsai Trees in Your Head: How the Pavlovian System Sculpts Goal-Directed Choices

Decision field theory-planning: A cognitive model of planning and dynamic decision making

The world is full of complex environments in which individuals must plan a series of choices to obtain some desired outcome. In these situations, entire sequences of events, including one's future decisions, should be considered before taking an action. Backward induction provides a normative strategy for planning, in which one works backward, deterministically, from the end of a scenario. However, it often fails to account for human behavior. I propose an alternative account, Decision Field Theory-Planning (DFT-P), in which individuals plan future choices on the fly through repeated mental simulations. A key prediction of DFT-P is that payoff variability produces noisy simulations and reduces sensitivity to utility differences. In two multistage risky decision tree experiments I obtained this payoff variability effect, with choice proportions moving toward 0.50 as variability increased. I showed that DFT-P provides valuable insight into the strategies that people used to plan future choices and allocate cognitive resources across decision stages.
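As an illustration of this prediction, here is a minimal Python sketch (my own toy parameterization, not the fitted DFT-P model): repeated noisy mental simulations are modeled as a random-walk preference state that accumulates sampled payoff differences until a decision threshold is crossed, and raising the payoff standard deviation pushes the resulting choice proportion toward 0.50.

```python
import numpy as np

def choice_proportion(mean_a=10.0, mean_b=8.0, payoff_sd=1.0,
                      threshold=20.0, n_trials=5000, seed=0):
    """Proportion of trials on which the higher-mean option A is chosen.

    Each trial is a sequence of noisy 'mental simulations': sampled payoffs
    for A and B are compared, and their difference is accumulated into a
    preference state until it crosses +/- threshold.
    """
    rng = np.random.default_rng(seed)
    chose_a = 0
    for _ in range(n_trials):
        preference = 0.0
        while abs(preference) < threshold:
            sample_a = rng.normal(mean_a, payoff_sd)  # one simulated payoff for A
            sample_b = rng.normal(mean_b, payoff_sd)  # one simulated payoff for B
            preference += sample_a - sample_b
        chose_a += preference > 0
    return chose_a / n_trials

# Higher payoff variability -> noisier simulations -> choices closer to 0.50.
for sd in (1.0, 5.0, 20.0):
    print(f"payoff SD = {sd:5.1f}  ->  P(choose A) ~ {choice_proportion(payoff_sd=sd):.2f}")
```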

Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum

Behavioral and neural evidence reveal a prospective goal-directed decision process that relies on mental simulation of the environment, and a retrospective habitual process that caches returns previously garnered from available choices. Artificial systems combine the two by simulating the environment up to some depth and then exploiting habitual values as proxies for consequences that may arise in the further future. Using a three-step task, we provide evidence that human subjects use such a normative plan-until-habit strategy, implying a spectrum of approaches that interpolates between habitual and goal-directed responding. We found that increasing time pressure led to shallower goal-directed planning, suggesting that a speed-accuracy tradeoff controls the depth of planning, with deeper search leading to more accurate evaluation at the cost of slower decision-making. We conclude that subjects integrate habit-based cached values directly into goal-directed evaluations in a normative manner.

Keywords: planning | habit | reinforcement learning | speed/accuracy tradeoff | tree-based evaluation

Behavioral and neural evidence suggest that the brain uses distinct goal-directed and habitual systems for decision-making (1–5). A goal-directed system exploits an individual's model, i.e., their knowledge of environmental dynamics, to simulate the consequences that will likely follow a choice (6) (Fig. 1A). Such evaluations, which assess a decision tree expanding into the future to estimate the total reward, adapt flexibly to changes in environmental dynamics or the values of outcomes. Evaluating deep trees, however, is computationally expensive (in terms of time, working memory, metabolic energy, etc.) and potentially error-prone. By contrast, the habitual system simply caches the rewards received on previous trials conditional on the choice (Fig. 1C), without a representational characterization of the environment (hence being called "model-free") (6, 7). This process hinders adaptation to changes in the environment but has the advantage of computational simplicity. Previous studies show distinct behavioral and neurobiological signatures of both systems (8–18). Furthermore, consistent with the theoretical strengths and weaknesses of each system (2, 19), different experimental conditions influence the relative contributions of the two systems in controlling behavior according to their respective competencies (20–23). Here, we suggest that individuals, rather than simply showing greater reliance on the more competent system in each condition, combine the relative strengths of the two systems in a normative manner by integrating habit-based cached values directly into goal-directed evaluations. Specifically, we propose that given available resources (time, working memory, etc.), individuals decide the depth k up to which they can afford full forward simulations and use cached habitual values thereafter. That is, individuals compute the value of a choice by adding the first k rewards, predicted by the explicit simulation, to the value of the remaining actions, extracted from the cache. We call this process an integrative "plan-until-habit" system (Fig. 1B). The greater flexibility of planning implies that a larger k in the plan-until-habit system leads to more accurate evaluations. This accuracy comes at the cost of spending more time and using more cognitive resources.
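A minimal sketch of this evaluation scheme (my own illustrative Python, not the authors' implementation; the function and argument names are invented): the decision tree is expanded for k steps using the agent's model of the environment, and cached, habit-like action values stand in for everything beyond that horizon.

```python
from typing import Callable, Dict, List, Tuple

State, Action = str, str
# model(state, action) -> list of (probability, reward, next_state) triples.
Model = Callable[[State, Action], List[Tuple[float, float, State]]]

def plan_until_habit_value(state: State, action: Action, k: int,
                           model: Model,
                           cached_q: Dict[Tuple[State, Action], float],
                           actions: Callable[[State], List[Action]]) -> float:
    """Value of taking `action` in `state` with a planning depth of k.

    The first k rewards come from explicit simulation of the model; once the
    depth budget is exhausted, the habitual cache supplies the remaining value.
    """
    if k == 0:
        # No simulation budget left: fall back on the cached (model-free) value.
        return cached_q.get((state, action), 0.0)
    value = 0.0
    for prob, reward, next_state in model(state, action):
        # Best continuation from the successor, evaluated one step shallower.
        future = max((plan_until_habit_value(next_state, a, k - 1,
                                             model, cached_q, actions)
                      for a in actions(next_state)), default=0.0)
        value += prob * (reward + future)
    return value
```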
If the depth is zero (k = 0), for example because of severe time constraints, the overall plan-until-habit system would appear purely habitual. In contrast, given a sufficiently great depth (k → ∞), it would appear purely goal-directed. Intermediate integer values of k could permit a normative balance, whereby depth of planning is optimized with respect to available resources. Previous studies of planning have used shallow tasks (8–18, 20–23) and have found evidence for the two extreme values of k. Rather than this dichotomous dependence on either goal-directed or habitual systems, we hypothesize that individuals use an integrative plan-until-habit system for decision making with intermediate values of k. We further hypothesize that the choice of k is a covert internal decision that is influenced by the availability of cognitive resources. To test these hypotheses, we designed a three-step task that was adapted from a popular methodology for assessing model-based and model-free control (12). Our version involves a deeper planning problem that provides the opportunity for subjects to exhibit a plan-until-habit strategy with an intermediate value of k. In brief, our human behavioral data demonstrate that individuals indeed used intermediate depths in the plan-until-habit system and that limiting the time allowed to make a decision led to significantly smaller values of k (i.e., shallower goal-directed planning).

Results

Two groups of subjects performed ∼400 trials of a three-stage task (Fig. 2). The first stage involved two choices, represented by different fractal images, each of which led commonly to one, and rarely to the other, of two second-stage states. These states were […]

Significance

Solving complex tasks often requires estimates of the future consequences of current actions. Estimates could be learned from past experience, but they then risk being out of date, or they could be calculated by a form of planning into the future, a process that is computationally taxing. We show that humans integrate learned estimates into their planning calculations, saving mental effort and time. We also show that increasing time pressure leads to reliance on learned estimates after fewer steps of planning. We suggest a normative rationale for this effect using a computational model. Our results provide a perspective on how the brain combines different decision processes collaboratively to exploit their comparative computational advantages.
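To make the role of k concrete, here is a toy usage of the plan_until_habit_value sketch given earlier, on a hypothetical three-stage layout loosely modeled on this kind of task (all transition probabilities, rewards, and cached values below are invented for illustration and are not taken from the experiment).

```python
def model(state, action):
    # Hypothetical dynamics: first-stage choices lead commonly (0.7) or
    # rarely (0.3) to one of two second-stage states; later stages are fixed.
    transitions = {
        ("start", "left"):  [(0.7, 0.0, "s2A"), (0.3, 0.0, "s2B")],
        ("start", "right"): [(0.7, 0.0, "s2B"), (0.3, 0.0, "s2A")],
        ("s2A", "go"): [(1.0, 0.0, "s3A")],
        ("s2B", "go"): [(1.0, 0.0, "s3B")],
        ("s3A", "go"): [(1.0, 1.0, "end")],   # rewarding branch
        ("s3B", "go"): [(1.0, 0.2, "end")],
    }
    return transitions[(state, action)]

def actions(state):
    return {"start": ["left", "right"], "s2A": ["go"], "s2B": ["go"],
            "s3A": ["go"], "s3B": ["go"]}.get(state, [])

# A stale habitual cache that still favors "right" at the first stage.
cached_q = {("start", "left"): 0.3, ("start", "right"): 0.6,
            ("s2A", "go"): 0.4, ("s2B", "go"): 0.6,
            ("s3A", "go"): 1.0, ("s3B", "go"): 0.2}

for k in (0, 1, 2, 3):
    values = {a: round(plan_until_habit_value("start", a, k,
                                              model, cached_q, actions), 2)
              for a in actions("start")}
    print(f"k = {k}: {values}")
# k = 0 or 1 echoes the stale cache ("right" looks better); k >= 2 propagates
# value from deeper in the tree, and the model-based preference for "left" emerges.
```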