Doina Precup | McGill University

Papers by Doina Precup

Research paper thumbnail of Quantifying the determinants of outbreak detection performance through simulation and machine learning

•We developed a model for quantifying determinants of outbreak detection performance.
•We used Bayesian networks to model relations between outbreak and algorithm characteristics and detection performance.
•We used the model to predict detection performance for different outbreak scenarios.
•The model can provide a quantitative evaluation of new methods and data in biosurveillance systems.

Research paper thumbnail of Sparse Distributed Memories for On-Line Value-Based Reinforcement Learning

In this paper, we advocate the use of Sparse Distributed Memories (SDMs) for on-line, value-based reinforcement learning (RL). SDMs provide a linear, local function approximation scheme, designed to work when a very large/high-dimensional input (address) space has to be mapped into a much smaller physical memory. We present an implementation of the SDM architecture for on-line, value-based RL in continuous state spaces. An important contribution of this paper is an algorithm for dynamic on-line allocation and adjustment of memory resources for SDMs, which eliminates the need for choosing the memory size and structure a priori. In our experiments, this algorithm provides very good performance while efficiently managing the memory resources.
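
Not code from the paper, but a minimal sketch of the general mechanism the abstract describes: a sparse distributed memory keeps a set of hard locations, activates those within a radius of the queried state, predicts with a weighted combination of their values, and can allocate new locations on-line when too few are active. The radius, kernel, learning rate, and allocation rule below are placeholder choices, not the paper's.

```python
import numpy as np

class SDMValueApproximator:
    """Minimal sparse-distributed-memory-style value approximator (illustrative only)."""

    def __init__(self, radius=0.2, min_active=3, lr=0.1):
        self.radius = radius          # activation radius around each hard location
        self.min_active = min_active  # allocate a new location if fewer are active
        self.lr = lr                  # step size for value updates
        self.addresses = []           # hard-location centres (state vectors)
        self.values = []              # one learned value per location

    def _activations(self, state):
        if not self.addresses:
            return np.array([]), np.array([], dtype=int)
        dists = np.linalg.norm(np.array(self.addresses) - state, axis=1)
        idx = np.where(dists <= self.radius)[0]
        weights = 1.0 - dists[idx] / self.radius   # triangular kernel: closer = heavier
        return weights, idx

    def predict(self, state):
        weights, idx = self._activations(state)
        if len(idx) == 0 or weights.sum() == 0:
            return 0.0
        return float(np.dot(weights, np.array(self.values)[idx]) / weights.sum())

    def update(self, state, target):
        # dynamic allocation: add a hard location centred on the visited state
        weights, idx = self._activations(state)
        if len(idx) < self.min_active:
            self.addresses.append(np.array(state, dtype=float))
            self.values.append(self.predict(state))
            weights, idx = self._activations(state)
        error = target - self.predict(state)       # distribute the error over active locations
        for w, i in zip(weights, idx):
            self.values[i] += self.lr * w * error


sdm = SDMValueApproximator()
rng = np.random.default_rng(0)
for _ in range(2000):
    s = rng.uniform(0, 1, size=2)
    sdm.update(s, target=np.sin(3 * s[0]) + s[1])  # toy stand-in for a value-learning target
print(round(sdm.predict(np.array([0.5, 0.5])), 3), len(sdm.addresses))
```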

Research paper thumbnail of Automatic basis function construction for approximate dynamic programming and reinforcement learning

We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castañon (1989), who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis, a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.
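
As a rough illustration of the pipeline the abstract outlines (hedged: scikit-learn's NCA implementation expects discrete class labels, so this sketch bins the Bellman/TD errors into quantile classes, which is a simplification of my own, and the RBF placement is arbitrary):

```python
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis

rng = np.random.default_rng(0)

# toy data: high-dimensional states with an associated TD (or Bellman) error per state
states = rng.normal(size=(500, 20))                 # 20-dimensional state vectors
td_errors = states[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

# NCA needs discrete labels, so bin the TD errors into quantile classes
labels = np.digitize(td_errors, np.quantile(td_errors, [0.25, 0.5, 0.75]))

nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
low_dim = nca.fit_transform(states, labels)          # 2-D embedding of the state space

# place Gaussian radial basis functions at a handful of points in the embedding
centers = low_dim[rng.choice(len(low_dim), size=10, replace=False)]
width = np.median(np.linalg.norm(low_dim[:, None] - centers[None, :], axis=2))

def rbf_features(embedded_states):
    d = np.linalg.norm(embedded_states[:, None] - centers[None, :], axis=2)
    return np.exp(-(d / width) ** 2)                  # one feature per basis function

phi = rbf_features(low_dim)   # new features for a linear value-function approximator
print(phi.shape)              # (500, 10): one column per constructed basis function
```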

Research paper thumbnail of Redagent: winner of TAC SCM 2003

Research paper thumbnail of RedAgent-2003: An Autonomous Market-Based Supply-Chain Management Agent

The Supply Chain Management track of the international Trading Agents Competition (TAC SCM) was introduced in 2003 as a test-bed for researchers interested in building autonomous agents that act in dynamic supply chains. TAC SCM provides a challenging scenario for existing AI decision-making algorithms, due to the high dimensionality and the non-determinism of the environment, as well as the combinatorial nature of the problem. In this paper we present RedAgent, the winner of the first TAC SCM competition. RedAgent is based on a multi-agent design, in which many simple, heuristic agents manage tasks such as fulfilling customer orders or procuring particular resources. The key idea is to use internal markets as the main decision mechanism, in order to determine what products to focus on and how to allocate the existing resources. The internal markets ensure the coordination of the individual agents, but at the same time provide price estimates for the goods that RedAgent has to sell and purchase, a key feature in this domain. We describe RedAgent's architecture and analyze its behavior based on data from the competition.
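
The paper specifies RedAgent's actual agents and protocols; purely to illustrate the internal-market idea in miniature, the toy auction below lets hypothetical internal bidders compete for a scarce resource, with the clearing price doubling as a rough internal price estimate. All names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    bidder: str       # internal agent submitting the bid (e.g. an order-fulfilment task)
    quantity: int     # units of the resource requested
    price: float      # internal valuation per unit

def run_internal_auction(bids, supply):
    """Allocate a scarce internal resource to the highest-valuing bidders.

    Returns the allocation and the clearing price (lowest accepted bid), which can
    serve as a rough internal price estimate for the resource.
    """
    allocation, remaining, clearing_price = {}, supply, 0.0
    for bid in sorted(bids, key=lambda b: b.price, reverse=True):
        if remaining == 0:
            break
        granted = min(bid.quantity, remaining)
        allocation[bid.bidder] = granted
        remaining -= granted
        clearing_price = bid.price
    return allocation, clearing_price

# hypothetical internal agents competing for 100 units of one component
bids = [Bid("order_A", 60, 11.5), Bid("order_B", 50, 10.0), Bid("procurement", 30, 8.0)]
allocation, price = run_internal_auction(bids, supply=100)
print(allocation, price)   # {'order_A': 60, 'order_B': 40} 10.0
```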

Research paper thumbnail of Characterizing Markov Decision Processes

Problem characteristics often have a significant influence on the difficulty of solving optimization problems. In this paper, we propose attributes for characterizing Markov Decision Processes (MDPs), and discuss how they affect the performance of reinforcement learning algorithms that use function approximation. The attributes measure mainly the amount of randomness in the environment. Their values can be calculated from the MDP model or estimated on-line. We show empirically that two of the proposed attributes have a statistically significant effect on the quality of learning. We discuss how measurements of the proposed MDP attributes can be used to facilitate the design of reinforcement learning systems.
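
The paper's specific attributes are not reproduced in the abstract; as one example of the kind of randomness measure that can be computed from an MDP model (an assumption of mine, not necessarily one of the paper's attributes), the snippet below averages the entropy of the next-state distributions.

```python
import numpy as np

def mean_transition_entropy(P):
    """Average entropy (in bits) of an MDP's next-state distributions.

    P has shape (num_states, num_actions, num_states), with P[s, a] a probability
    distribution over next states. Higher values indicate a more stochastic MDP.
    """
    eps = 1e-12
    entropies = -np.sum(P * np.log2(P + eps), axis=2)   # entropy per (s, a) pair
    return float(entropies.mean())

# near-deterministic vs. uniformly random 3-state, 2-action toy MDPs
deterministic = np.zeros((3, 2, 3))
deterministic[:, :, 0] = 1.0
uniform = np.full((3, 2, 3), 1.0 / 3.0)
print(mean_transition_entropy(deterministic))  # ~0.0 bits
print(mean_transition_entropy(uniform))        # ~1.585 bits (log2 3)
```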

Research paper thumbnail of Metrics for Finite Markov Decision Processes

We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
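
The metric itself is not shown in the abstract. For orientation, bisimulation metrics of this kind are usually defined as the least fixed point of an operator of the following form, where c_R and c_T are positive weights (c_T typically tied to the discount factor) and T_K(d) is the Kantorovich (Wasserstein-1) distance computed with d as the ground metric; the exact constants and bounds in the paper may differ:

```latex
d(s, s') \;=\; \max_{a \in A} \Big( c_R \,\big|\, r(s,a) - r(s',a) \,\big|
      \;+\; c_T \, T_K(d)\big( P(\cdot \mid s, a),\; P(\cdot \mid s', a) \big) \Big)
```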

Research paper thumbnail of Learning in non-stationary Partially Observable Markov Decision Processes

We study the problem of finding an optimal policy for a Partially Observable Markov Decision Process (POMDP) when the model is not perfectly known and may change over time. We present the algorithm MEDUSA+, which incrementally improves a POMDP model using selected queries, while still optimizing the reward. Empirical results show the response of the algorithm to changes in the parameters of a model: the changes are learned quickly and the agent still accumulates high reward throughout the process.
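
MEDUSA+ itself is specified in the paper; as a hedged sketch of the underlying bookkeeping (MEDUSA-style learners maintain Dirichlet distributions over the unknown model parameters and update their counts when a query reveals the hidden state), the snippet below shows a transition-model version only. The decay factor used to track non-stationarity, and the absence of an observation model and of active query selection, are simplifications of mine.

```python
import numpy as np

class DirichletTransitionModel:
    """Dirichlet counts over the unknown transition probabilities of a small model.

    Illustrative only: a full MEDUSA-style learner also keeps observation-model
    priors, selects queries actively, and interleaves planning on sampled models.
    """

    def __init__(self, n_states, n_actions, prior=1.0, decay=0.99):
        self.alpha = np.full((n_states, n_actions, n_states), prior)
        self.decay = decay   # forgetting factor so the model can track changes over time

    def query_update(self, s, a, s_next):
        # an oracle query revealed the hidden states before and after action a
        self.alpha *= self.decay          # gently forget old evidence (non-stationarity)
        self.alpha[s, a, s_next] += 1.0

    def mean_model(self):
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)

    def sample_model(self, rng):
        # sample one plausible transition model; a planner could be run on such samples
        flat = self.alpha.reshape(-1, self.alpha.shape[-1])
        sampled = np.array([rng.dirichlet(row) for row in flat])
        return sampled.reshape(self.alpha.shape)


model = DirichletTransitionModel(n_states=3, n_actions=2)
rng = np.random.default_rng(0)
for _ in range(200):
    model.query_update(s=0, a=1, s_next=2)    # repeated evidence for one transition
print(np.round(model.mean_model()[0, 1], 2))  # mass concentrates on next state 2
```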

Research paper thumbnail of Active Learning in Partially Observable Markov Decision Processes

Learning in Partially Observable Markov Decision Processes is a notoriously difficult problem. The goal of our research is to address this problem for environments in which a partial model may be available in the beginning, but in which there is uncertainty about the model parameters. We developed an algorithm called MEDUSA, which is based on ideas from active learning [1,2,3].

Research paper thumbnail of A formal framework for robot learning and control under model uncertainty

While the Partially Observable Markov Decision Process (POMDP) provides a formal framework for the problem of robot control under uncertainty, it typically assumes a known and stationary model of the environment. In this paper, we study the problem of finding an optimal policy for controlling a robot in a partially observable domain, where the model is not perfectly known and may change over time. We present an algorithm called MEDUSA which incrementally learns a POMDP model using queries, while still optimizing a reward function. We demonstrate the effectiveness of the approach in a simple scenario, where a robot seeking a person has minimal a priori knowledge of its own sensor model, as well as of where the person is located.

Research paper thumbnail of Eligibility Traces for Off-Policy Policy Evaluation

Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods.
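
The abstract is truncated here; the paper studies eligibility-trace methods for evaluating one policy from data generated by another. As a hedged sketch of one such scheme (a per-decision importance-sampling variant of tabular TD(lambda), not necessarily the exact algorithm or weighting analyzed in the paper):

```python
import numpy as np

def off_policy_td_lambda(episodes, target_pi, behavior_b, n_states,
                         gamma=0.95, lam=0.5, alpha=0.05):
    """Tabular TD(lambda) with per-decision importance-sampling ratios.

    episodes: list of trajectories [(s, a, r, s_next, done), ...] generated by the
    behaviour policy; target_pi[s, a] and behavior_b[s, a] give action probabilities
    for the target and behaviour policies.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                          # eligibility trace
        for s, a, r, s_next, done in episode:
            rho = target_pi[s, a] / behavior_b[s, a]    # importance-sampling ratio
            e *= gamma * lam * rho                      # decay and reweight the trace
            e[s] += rho
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V += alpha * td_error * e
            if done:
                break
    return V

# tiny 2-state example with made-up policies and transitions
rng = np.random.default_rng(0)
target_pi = np.array([[0.7, 0.3], [0.7, 0.3]])
behavior_b = np.array([[0.5, 0.5], [0.5, 0.5]])
episodes = []
for _ in range(500):
    ep, s = [], 0
    for t in range(10):
        a = rng.choice(2, p=behavior_b[s])
        s_next = (s + 1) % 2 if a == 1 else s
        ep.append((s, a, 1.0 if s_next == 1 else 0.0, s_next, t == 9))
        s = s_next
    episodes.append(ep)
print(np.round(off_policy_td_lambda(episodes, target_pi, behavior_b, 2), 2))
```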

Research paper thumbnail of Temporal Abstraction in Reinforcement Learning

Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such ...

Research paper thumbnail of Classification Using Phi-Machines and Constructive Function Approximation

This article presents a new classification algorithm, called CLEF, which induces a Phi-machine by constructing its own features based on the training data. The features can be viewed as defining subsets of the instance space, and they allow CLEF to create useful non-linear functions over the input variables. The algorithm is guaranteed to find a classifier that separates the training instances, if such a separation is possible. We compare CLEF empirically to several other classification algorithms, including a well-known decision tree inducer, an artificial neural network inducer, and a support vector machine inducer. Our results show that the CLEF-induced Phi-machines and support vector machines have similar accuracy on the suite tested, and that both are significantly more accurate than the other classifiers produced. We argue that the classifiers produced by CLEF are easy to interpret, and hence may be preferred over support vector machines in certain circumstances.
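
CLEF's constructive feature-building procedure is the article's contribution and is not reproduced here. Purely to illustrate what a Phi-machine is (a linear threshold unit applied to a feature mapping phi), the sketch below trains a perceptron over a hand-written feature map; the features and update rule are generic placeholders, not CLEF.

```python
import numpy as np

def phi(x):
    """A hand-written feature map: raw inputs plus a couple of conjunctive features.

    A Phi-machine is a linear threshold unit over such a mapping; CLEF's contribution
    is constructing the features automatically from the training data.
    """
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, float(x1 > 0 and x2 > 0)])

def train_perceptron(X, y, epochs=50):
    w = np.zeros(len(phi(X[0])))
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ phi(x) > 0 else -1
            if pred != label:                 # mistake-driven perceptron update
                w += label * phi(x)
    return w

# XOR-like data: not linearly separable in the raw inputs, but separable under phi
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])
w = train_perceptron(X, y)
print([1 if w @ phi(x) > 0 else -1 for x in X])  # matches y once separation is found
```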

Research paper thumbnail of Learning Options in Reinforcement Learning

Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is otherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989). We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.
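
The paper's greedy criterion is not given in the abstract; as a loose sketch of the visitation-count intuition, the snippet below counts how often each state occurs across a batch of trajectories and proposes the most frequent non-excluded states as candidate subgoals. The filtering and the choice of k are placeholders.

```python
from collections import Counter

def candidate_subgoals(trajectories, k=3, exclude=()):
    """Propose subgoal states from visitation counts over trajectories.

    trajectories: iterable of state sequences (each a list of hashable states).
    exclude: states to ignore, e.g. start and goal states that are trivially frequent.
    """
    counts = Counter()
    for trajectory in trajectories:
        counts.update(s for s in trajectory if s not in exclude)
    return [state for state, _ in counts.most_common(k)]

# toy gridworld trajectories that all squeeze through a doorway state (2, 3)
trajectories = [
    [(0, 0), (1, 1), (2, 3), (3, 4), (4, 4)],
    [(0, 1), (1, 2), (2, 3), (3, 3), (4, 4)],
    [(0, 0), (1, 3), (2, 3), (3, 4), (4, 4)],
]
print(candidate_subgoals(trajectories, k=1, exclude={(0, 0), (4, 4)}))  # [(2, 3)]
```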

Research paper thumbnail of Using Options for Knowledge Transfer in Reinforcement Learning

... Our model can also be viewed as a partially observable Markov decision problem (POMDP), with a special structure that we describe. ... Although solving POMDPs is very difficult, we incorporate ideas from POMDP theory in our algorithm. ...

Research paper thumbnail of Intra-Option Learning about Temporally Abstract Actions

Several researchers have proposed modeling temporally abstract actions in reinforcement learning by the combination of a policy and a termination condition, which we refer to as an option. Value functions over options and models of options can be learned using methods designed for semi-Markov decision processes (SMDPs). However, all these methods require an option to be executed to termination. In this paper we explore methods that learn about an option from small fragments of experience consistent with that option, even if the option itself is not executed. We call these methods intra-option learning methods because they learn from experience within an option. Intra-option methods are sometimes much more efficient than SMDP methods because they can use off-policy temporal-difference mechanisms to learn simultaneously about all the options consistent with an experience, not just the few that were actually executed. In this paper we present intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options. We present computational examples in which these new methods learn much faster than SMDP methods and learn effectively when SMDP methods cannot learn at all. We also sketch a convergence proof for intra-option value learning.
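
To make the mechanism concrete, here is a sketch of a one-step intra-option Q-learning update for options with deterministic internal policies, following the general form described in the paper: after a primitive transition (s, a, r, s'), every option that would have chosen a in s is updated, bootstrapping from either continuing the option or terminating and switching. The data structures and toy numbers are mine.

```python
import numpy as np

def intra_option_q_update(Q, options, s, a, r, s_next, gamma=0.95, alpha=0.1):
    """One intra-option Q-learning step over a list of options.

    Q: array of shape (n_states, n_options).
    options: list of (policy, beta) pairs, where policy[s] is the (deterministic)
    action the option takes in s and beta[s] is its termination probability.
    Every option consistent with the observed action is updated from the one sample.
    """
    for o, (policy, beta) in enumerate(options):
        if policy[s] != a:
            continue                      # option would not have taken this action
        continue_value = (1.0 - beta[s_next]) * Q[s_next, o]     # keep executing o
        switch_value = beta[s_next] * np.max(Q[s_next])          # terminate, pick best
        target = r + gamma * (continue_value + switch_value)
        Q[s, o] += alpha * (target - Q[s, o])
    return Q

# tiny example: 4 states, 2 options with hypothetical policies and terminations
n_states = 4
options = [
    (np.array([0, 0, 1, 1]), np.array([0.0, 0.0, 0.5, 1.0])),   # option 0
    (np.array([1, 0, 0, 1]), np.array([0.1, 0.1, 0.1, 1.0])),   # option 1
]
Q = np.zeros((n_states, len(options)))
Q = intra_option_q_update(Q, options, s=1, a=0, r=1.0, s_next=2)
print(np.round(Q, 2))   # both options choose action 0 in state 1, so both are updated
```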

Research paper thumbnail of Theoretical Results on Reinforcement Learning with Temporally Abstract Options

We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among “options”, whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
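
The equations themselves are not reproduced in the abstract. For orientation, Bellman optimality equations over a set of options O, with multi-time models that fold the discounting over an option's (random) duration into the state-prediction part, typically take the following form; the notation follows the later options papers and may differ slightly from this one:

```latex
V^{*}_{\mathcal{O}}(s) \;=\; \max_{o \in \mathcal{O}_s}
  \Big[\, r^{o}_{s} + \sum_{s'} p^{o}_{ss'}\, V^{*}_{\mathcal{O}}(s') \,\Big],
\qquad
Q^{*}_{\mathcal{O}}(s,o) \;=\; r^{o}_{s} + \sum_{s'} p^{o}_{ss'}\,
  \max_{o' \in \mathcal{O}_{s'}} Q^{*}_{\mathcal{O}}(s',o')
```

Here, for an option o started in s and lasting k steps, r_s^o is the expected discounted reward accumulated during the option (r_1 + gamma r_2 + ... + gamma^{k-1} r_k) and p_{ss'}^o is the expectation of gamma^k times the indicator that the option terminates in s'.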

Research paper thumbnail of Improved Switching among Temporally Abstract Actions

In robotics and other control applications it is commonplace to have a preexisting set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs. In many applications, solutions to parts of a task are known, either because they were handcrafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions.
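
Not the paper's theorem or experiments, just a generic sketch of the switching idea: given value estimates Q(s, o) for the available controllers, an executing controller is interrupted whenever its continuation value is no longer the best available, and execution switches to the greedy choice. The environment, controllers, and values below are made up.

```python
import numpy as np

def run_with_interruption(env_step, option_policies, Q, s, max_steps=100):
    """Execute a greedy controller choice, interrupting it when it is no longer best.

    env_step(s, a) -> (s_next, done); option_policies[o][s] -> primitive action;
    Q[s, o] -> estimated value of running controller o from state s.
    """
    current = int(np.argmax(Q[s]))                    # start with the greedy controller
    for _ in range(max_steps):
        a = option_policies[current][s]
        s, done = env_step(s, a)
        if done:
            break
        # interruption test: keep the current controller only while it is still best
        if Q[s, current] < np.max(Q[s]):
            current = int(np.argmax(Q[s]))
    return s

# toy 5-state chain: action 1 moves right, action 0 stays; terminate at state 4
def env_step(s, a):
    s_next = min(s + 1, 4) if a == 1 else s
    return s_next, s_next == 4

option_policies = [np.array([0, 0, 1, 1, 1]),   # controller 0: only useful near the goal
                   np.array([1, 1, 1, 0, 0])]   # controller 1: useful early, then stalls
Q = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.6], [0.9, 0.2], [0.0, 0.0]])
print(run_with_interruption(env_step, option_policies, Q, s=0))   # reaches state 4
```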

Research paper thumbnail of Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales

Artificial Intelligence, 1998

Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options: whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Parr (in prep.) and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions to improve over SMDP methods with no additional cost. We also introduce intra-option temporal-difference methods that are able to learn from fragments of an option's execution. Finally, we propose a notion of subgoal which can be used to improve the options themselves. Overall, we argue that options and their models provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.
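
One standard piece of machinery behind the SMDP-level analysis the abstract refers to (shown here for orientation; the paper's contributions concern what can be done beyond it): when an option o is executed to termination from state s, lasting k steps, accumulating discounted reward R, and ending in s', its value can be updated with the SMDP Q-learning rule

```latex
Q(s, o) \;\leftarrow\; Q(s, o) + \alpha \Big[ R + \gamma^{k}
    \max_{o' \in \mathcal{O}_{s'}} Q(s', o') - Q(s, o) \Big],
\qquad
R = r_{1} + \gamma r_{2} + \dots + \gamma^{k-1} r_{k}
```

The intra-option temporal-difference methods mentioned in the abstract avoid having to wait for termination before making such updates.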

Research paper thumbnail of Planning with Closed-Loop Macro Actions

Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Conventional model-based reinforcement learning uses primitive actions that last one time step and that can be modeled independently of the learning agent. These can be generalized to macro actions, multi-step actions specified by an arbitrary policy and a way of completing. Macro actions generalize the classical notion of a macro operator in that they are closed loop, uncertain, and of variable duration. Macro actions are needed to represent common-sense higher-level actions such as going to lunch, grasping an object, or traveling to a distant city. This paper generalizes prior work on temporally abstract models (Sutton 1995) and extends it from the prediction setting to include actions, control, and planning. We define a semantics of models of macro actions that guarantees the validity of planning using such models. This paper presents new results in the theory of planning with macro actions and illustrates its potential advantages in a gridworld task.
