3.6 Related Work
Recent research on learning policy priors (Wingate et al., 2011) (and the earlier work by Doshi-Velez et al. (2010)) has similar aspirations to our own. They propose a policy search algorithm, based on MCMC search, which learns priors over the problem domain. The method requires the specification of a hyperprior (abstract prior knowledge) over the prior, with the effect that the method learns priors which it can share among states. For example, the method can discover the dominant direction in a navigation domain, or that certain sequences of motor primitives are effective and should always be prioritised during search. They perform inference over (π, θ), where π is the policy and θ its parameters, by casting the search problem as approximate inference, over which they can specify or learn the priors. Our work does not assume a known model, or kernel, over policy space. This means we cannot sample from a generative model, as is typical in the Bayesian reinforcement learning setting.
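As a rough schematic (in our notation, and not necessarily the exact formulation of Wingate et al.), the structure being exploited is a hierarchical posterior over policies and their parameters,
\[
p(\pi, \theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \pi)\, p(\pi \mid \theta)\, p(\theta),
\]
where $p(\theta)$ is the specified hyperprior, $p(\pi \mid \theta)$ is the learned prior that can be shared among states, and $p(\mathcal{D} \mid \pi)$ is a (pseudo-)likelihood scoring trajectories by their return; MCMC is then used to draw samples from this posterior.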
An alternative method for transferring information from previous tasks, to act as priors for a new task, is through model-based approaches, such as that taken by Sunmola (2013). This embraces the Bayesian reinforcement learning paradigm of maintaining a distribution over all possible transition models which could describe the current task and environment, and updating this distribution (belief) every time an action is taken. Transfer here is achieved by using the experience of state transitions in previous tasks to update the agent's beliefs when it first encounters each state in the new task, before anything is known about the transition probabilities from that state. Local feature models are also used to facilitate generalisation.
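As a minimal illustration of this style of transfer (our simplification, using a standard Dirichlet parameterisation rather than the local feature models of Sunmola (2013)), transition counts $n^{\text{prev}}_{sas'}$ accumulated in previous tasks can act as pseudo-counts that initialise the belief in the new task:
\[
\mathbb{E}\big[P(s' \mid s, a)\big] = \frac{n^{\text{prev}}_{sas'} + n^{\text{new}}_{sas'}}{\sum_{s''} \big(n^{\text{prev}}_{sas''} + n^{\text{new}}_{sas''}\big)},
\]
so that before any experience is gathered in the new task ($n^{\text{new}} = 0$) the expected model defaults to the transition statistics of the previous tasks, and is gradually overridden by new evidence.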
The strength of transferring action probabilities, rather than model probabilities, is threefold. Firstly, the information present in action priors is very general, the only requirement being local usefulness of actions. As such, agents learning with action priors can easily be put through learning curricula, and then deployed to vastly different environments. Secondly, transferring knowledge through action models is an idea applicable to a wide range of different decision making paradigms, including model-free and model-based reinforcement learning, planning and online learning. Finally, an action prior has an intuitive interpretation as an understanding of the “common sense” behaviours of some local space.
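To make the first point concrete, the following sketch (a simplified illustration in Python, not the exact algorithm developed in this chapter) shows one way an action prior can bias exploration in otherwise standard ε-greedy Q-learning: exploratory actions are drawn from the prior rather than uniformly at random.

import random

def select_action(Q, prior, state, actions, epsilon=0.1):
    """Epsilon-greedy selection with exploration biased by an action prior.

    Q:     dict mapping (state, action) -> estimated value
    prior: dict mapping state -> {action: probability}, summing to 1 per state
    """
    if random.random() < epsilon:
        # Explore: sample an action from the prior instead of uniformly.
        weights = [prior[state][a] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]
    # Exploit: act greedily with respect to the current value estimates.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Provided the prior assigns non-zero mass to every action, this changes only how exploration effort is allocated, not which actions remain available to the learner.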
Work by Sherstov and Stone (2005) also has aspirations similar to those of our own methods. In problems with large action sets (|A| ∼ 100, often arising from parametrised actions), they also try to either cut down the action set or bias exploration in learning. The difference in this work is that the reduced action set, based on what they call the relevance of an action, is determined from the training data of optimal policies for the entire domain, rather than for each state or observation. This has the effect of pruning away actions that are always harmful throughout the domain, but the pruning is not context-specific.
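The distinction can be illustrated with a small sketch (ours, with hypothetical data structures). Both approaches count how often each action appears in the optimal policies of previous tasks, but domain-wide pruning aggregates these counts over all states, whereas a per-state scheme such as ours retains the context in which each action was useful.

from collections import Counter, defaultdict

# optimal_policies: one dict per previous task, mapping state -> optimal action

def domain_wide_relevant_actions(optimal_policies, threshold=1):
    # Keep any action that was optimal anywhere in the domain, in any task.
    counts = Counter(a for pi in optimal_policies for a in pi.values())
    return {a for a, c in counts.items() if c >= threshold}

def per_state_action_counts(optimal_policies):
    # Count, separately for each state, how often each action was optimal.
    counts = defaultdict(Counter)
    for pi in optimal_policies:
        for s, a in pi.items():
            counts[s][a] += 1
    return counts  # normalising each counter yields a state-conditioned prior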
Our interpretation of action priors as distilling domain-specific knowledge from a set of task-instance-specific behaviours is similar to the idea of dividing a problem (or set thereof) into an agent space and a problem space (Konidaris and Barto, 2006). The agent space refers to commonalities between the problems, whereas the problem space is specific to each task. This formulation involves learning a reward predictor which, in a sense, can be used to guide action selection.
Where our approach reduces computation by biasing and restricting the search over action space, similar benefits have been found by only searching over limited aspects of the state space, particularly in relational planning problems. Notable examples include reasoning only in terms of the subset of objects that are relevant for current planning purposes (relevance grounding) (Lang and Toussaint, 2009), or using variables to stand in for the objects relevant to the current actions (deictic references) (Pasula et al., 2007). This is similar to the way in which pruning is used in search, but we prune based on the expected utility of an action, which is estimated from the utility of that action over the optimal policies for a set of previous tasks.
Options (Precup et al., 1998) are a popular formalism for hierarchical reinforcement learning, and are defined as temporally extended actions with initiation sets, where they can be invoked, and termination conditions. There are many approaches to learning these; see e.g. Pickett and Barto (2002). Although there are similarities between learning the initiation sets of options and learning action priors, they are distinct, in that an initiation set defines where the option can physically be instantiated, whereas an action prior describes regions where the option is useful. This is the same distinction as must be drawn between learning action priors and learning the preconditions of planning operators (e.g. Mourão et al. (2012)). For example, while pushing hard against a door may always be physically possible, this level of force would be damaging to a glass door, and yet that choice would not be ruled out by options or planning preconditions. Consequently, action priors not only augment preconditions, but are beneficial when using large sets of options or operators, in that they mitigate the negative impact of exploration with a large action set.
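The difference might be summarised as follows (a schematic sketch with hypothetical names): an initiation set answers a boolean question about where an option may physically be invoked, whereas an action prior supplies a graded weight reflecting how useful the option has proven to be in that context.

def applicable_options(state, options):
    # Initiation sets: which options *can* be invoked in this state.
    return [o for o in options if o.initiation_set(state)]

def prioritised_options(state, options, prior):
    # Action priors: rank the applicable options by how useful they have
    # been in this context across previously learned tasks.
    usable = applicable_options(state, options)
    return sorted(usable, key=lambda o: prior[state].get(o.name, 0.0), reverse=True)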
One approach to reusing experience is to decompose an environment or task into a set of subcomponents, learn optimal policies for these common elements, and then piece them together (Foster and Dayan, 2002), possibly applying transforms to make the subcomponents more general (Ravindran and Barto, 2003). This is largely the philosophy taken by the options framework. Our method differs by discovering a subset of reasonable behaviours in each perceptual state, rather than a single optimal policy. Our priors can thus be used for a variety of different tasks in the same domain, although the policy must still be learned. As a result, our method is also complementary to this decomposition approach.
The idea of using a single policy from a similar problem to guide exploration is a common one. For example, policy reuse (Fernandez and Veloso, 2006) involves maintaining a library of policies and using these to seed a new learning policy. The key assumption in this work is that the task addressed by one of the policies in the library is similar to the new task. While we acknowledge that this is a very sensible approach if the policies are indeed related, we are instead interested in extracting a more abstract level of information about the domain which is task independent, and thereby hopefully useful for any new task.
Other authors have explored incorporating a heuristic function into the action selection process to accelerate reinforcement learning (Bianchi et al., 2007), but this work does not address the acquisition of these prior behaviours. The approach is also sensitive to the choice of values in the heuristic function, and requires setting additional parameters.
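For reference, the action selection rule in this family of methods is roughly of the form (our paraphrase; the formulation of Bianchi et al. (2007) differs in its details)
\[
\pi(s) = \arg\max_{a} \Big[ \hat{Q}(s,a) + \xi\, H(s,a) \Big]
\]
with probability $1 - \varepsilon$, where $H$ is the hand-specified heuristic and $\xi$ weights its influence; the sensitivity noted above lies in the choice of $H$ and of these additional parameters.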
Action priors are related to the idea of learning affordances (Gibson, 1986), being the action possibilities provided by some environment. These are commonly modelled as properties of objects, and can be learnt from experience (see e.g. Sun et al. (2010)). The ambitions of action priors are, however, slightly different from those of affordances. As an example, learning affordances may equate to learning that a certain class of objects is “liftable” or “graspable” by a particular robot. We are instead interested in knowing how likely it is that lifting or grasping such an object will be useful for the tasks this robot has been learning. Ideally, action priors should be applied over action sets which arise as the result of affordance learning.
Another related concept is that of optimising the ordering of behaviours in a policy library (Dey et al., 2012a,b). The idea is to use a set of classifiers or experts to select a sequence of behaviours which should be tried by an agent in its current setting, thereby providing an ordered list of fallback plans. The difficulty in our reinforcement learning setting is not only that the state or observation context is important, but also that our setting considers different tasks, each of which would require different local actions.
Additionally, in the reinforcement learning context, it can be difficult to evaluate how good a choice a particular action is, which is important for updating the weights over the experts. This approach would also not necessarily guarantee that every action is chosen infinitely often in the limit of infinite trials, which is a requirement for the convergence of Q-learning.
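By contrast, an exploration scheme biased by an action prior can preserve this guarantee provided the prior is kept strictly positive, for example (a sketch in our notation)
\[
\pi(a \mid s) = (1 - \varepsilon)\, \mathbb{1}\big[a = \textstyle\arg\max_{a'} Q(s,a')\big] + \varepsilon\, \theta_s(a), \qquad \theta_s(a) > 0 \;\; \forall a,
\]
under which every action retains probability at least $\varepsilon \min_a \theta_s(a)$ of being selected in state $s$, and is therefore chosen infinitely often in the limit.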