Action Priors for Domain Reduction
3.2 Action Priors
3.2.3 Action Priors as Domain Reduction
tainty (Ortega and Braun, 2013). As these probabilities relate to the domain, rather than the current task, this is still traded off against exploiting the current Q-function.
Definition 3.2.3 (Domain Futile Action). For a domain $D = (S, A, T, \gamma)$ and a set of tasks $\mathcal{T} = \{\tau_i\}$ with corresponding task futile actions $A_{B_i} \subset S \times A$, let the domain futile actions be $A_B = \bigcap_i A_{B_i}$.

Definition 3.2.4 (Domain Futile State). For a domain $D = (S, A, T, \gamma)$ and a set of tasks $\mathcal{T} = \{\tau_i\}$ with corresponding task futile states $S_{B_i} \subset S$, let the domain futile states be $S_B = \bigcap_i S_{B_i}$.

Given a domain $D$ and a set of tasks $\mathcal{T}$, the domain futile states $S_B$ and domain futile actions $A_B$ are those which are consistently avoided by all optimal policies which complete the tasks in $\mathcal{T}$. We can then use these to define a task-set minimised domain.

Definition 3.2.5 (Task-Set Minimised Domain). For a domain $D = (S, A, T, \gamma)$ and a set of tasks $\mathcal{T} = \{\tau_i\}$ with corresponding domain futile states $S_B$ and domain futile actions $A_B$, the task-set minimised domain under task set $\mathcal{T}$ is given by $D_{\mathcal{T}} = (\hat{S} = S \setminus S_B, A, \hat{T}, \gamma)$, with
$$
\hat{T}(s, a, s') \longleftarrow
\begin{cases}
\eta(s, a)\, T(s, a, s') & \text{if } s \in \hat{S},\ (s, a) \in \hat{A},\ s' \in \hat{S} \\
0 & \text{otherwise,}
\end{cases}
$$
where $\hat{A} = (S, A) \setminus A_B$, and $\eta(s, a) = \frac{1}{\sum_{s' \in \hat{S}} T(s, a, s')}$, $\forall (s, a) \in \hat{A}$, indicates normalisation of $T$, such that $\sum_{s' \in \hat{S}} \hat{T}(s, a, s') = 1$, $\forall (s, a) \in \hat{A}$.
The task-set minimised domain contains all states and actions from the original domain which are non-futile. As a result, entries of the state transition function transitioning from a futile state $s$, or using a futile action $a$, are excluded. Additionally, transitions from a non-futile state $s$ with a non-futile action $a$ into a futile state $s'$ are not permitted by the definition of futile states.
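To make this construction concrete, the following is a minimal sketch that assembles a task-set minimised domain from per-task futile sets, assuming a tabular MDP whose transition function is stored as a dictionary. The function name `build_minimised_domain`, the argument names, and the data layout are illustrative assumptions, not part of the thesis.

```python
def build_minimised_domain(S, A, T, futile_state_sets, futile_action_sets):
    """Construct the task-set minimised domain D_T (cf. Definitions 3.2.3-3.2.5).

    S: set of states.  A: set of actions.
    T: dict mapping (s, a) -> {s_next: probability}.
    futile_state_sets: list with one set S_{B_i} per training task.
    futile_action_sets: list with one set of (s, a) pairs A_{B_i} per training task.
    """
    # Domain futile states/actions are those futile under *every* task
    # (the intersections of Definitions 3.2.3 and 3.2.4).
    S_B = set.intersection(*futile_state_sets) if futile_state_sets else set()
    A_B = set.intersection(*futile_action_sets) if futile_action_sets else set()

    S_hat = set(S) - S_B                                   # \hat{S} = S \ S_B
    A_hat = {(s, a) for s in S_hat for a in A              # \hat{A} = (S, A) \ A_B
             if (s, a) not in A_B}

    # Restrict T to non-futile states and actions, and renormalise with
    # eta(s, a) = 1 / sum_{s' in S_hat} T(s, a, s'), as in Definition 3.2.5.
    T_hat = {}
    for (s, a) in A_hat:
        successors = T.get((s, a), {})
        mass = sum(p for s_next, p in successors.items() if s_next in S_hat)
        if mass > 0.0:
            T_hat[(s, a)] = {s_next: p / mass
                             for s_next, p in successors.items() if s_next in S_hat}
        # If mass == 0, every successor of (s, a) is futile, so (s, a) should
        # itself already appear in A_B; it is simply dropped here.
    return S_hat, A_hat, T_hat
```

Taking intersections rather than unions is what keeps the reduction conservative: a state or action is only discarded when every training task agrees that it is futile.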
Theorem 3.2.6 (Properties of a Task-Set Minimised Domain). If a set of tasks $\mathcal{T}$ in the domain $D$ is used to construct $D_{\mathcal{T}}$, then the following properties of $D_{\mathcal{T}}$ hold:

1. The optimal reward achievable for each task in $\mathcal{T}$ can still be obtained in $D_{\mathcal{T}}$.
2. Each optimal policy $\pi_i$ used to achieve a task $\tau_i \in \mathcal{T}$ in the domain $D$ is optimal for that same task $\tau_i$ in $D_{\mathcal{T}}$.

Proof. 1. The goals $G \subseteq S$ of task $\tau_i$ are the states which, by the definition of $\tau_i$, must be visited by the optimal policy $\pi_i$ in order for it to be optimal. Therefore $G \subseteq S_{R_i}$ by definition, so $G$ does not contain task futile states, and is thus in the reduced MDP $D_{\mathcal{T}}$ by construction.
2. A policy $\pi_i$ has a zero probability of invoking a futile action, making these actions unnecessary, by construction. All states and actions reachable by this policy must, by construction, be included in $D_{\mathcal{T}}$. As the optimal policy $\pi_i$ transitions only through reachable states by definition, and all of those states are contained in $D_{\mathcal{T}}$, $\pi_i$ must also be valid in $D_{\mathcal{T}}$.
Proposition 3.2.7. Exploration will be safer in $D_{\mathcal{T}}$ than in $D$, as $D_{\mathcal{T}}$ contains fewer trap states and dead ends than $D$.
Proof. Futile actions and states which are common to all tasks are excluded from the reduced MDP $D_{\mathcal{T}}$ by construction. This implies that fewer trap states will be accessed during exploration, and that the states which remain are all reachable, and thus on the path to a goal state, for at least one task. Any state that remains must therefore lie on an optimal path for at least one task.
Proposition 3.2.8. For an unknown task drawn from $\mathcal{T}$, a learning agent is expected to learn that task no slower in $D_{\mathcal{T}}$ than in $D$, and possibly faster.

Proof. The number of state-action pairs in the reduced MDP satisfies $|(\hat{S}, \hat{A})| \leq |(S, A)|$, which implies there are no more, but possibly fewer, states and actions to be explored by the learning agent. With the resulting decrease in the size of the search space, learning can be no slower, and is faster whenever the search space is strictly smaller.
These properties hold if every task encountered at test time was experienced during training time, and so was incorporated into $D_{\mathcal{T}}$. In practice, this may not be the case, and we compensate for this through the addition of the pseudo-counts $\alpha_0$. Assume there is a complete set of tasks $\mathcal{T}$, such that a subset $\mathcal{T}_s \sim \mathcal{T}$ has been sampled by the agent during training. Assume also that for some new task, there is a goal state which is futile given the tasks in $\mathcal{T}_s$. In the worst case, there is a single action in a single state which will transition into this new goal state. Thus, in this worst case, transitions from a state $\hat{s} \in \hat{S}$ with a single action leading to a state in $S_B$ will be made with probability
$$
P(\hat{S} \rightarrow S_B) = \frac{\alpha_0}{(|A| - 1)(|\mathcal{T}_s| + \alpha_0) + \alpha_0}. \tag{3.6}
$$
The effect of the probability in Equation (3.6) is that when no prior tasks have been sampled (i.e. $|\mathcal{T}_s| = 0$), this transition from $\hat{s}$ will be explored with probability $1/|A|$, as expected. As the number of tasks experienced for which $S_B$ is futile increases, the probability of this transition decreases proportionally to $1/|\mathcal{T}_s|$.
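As a quick numerical check of Equation (3.6), the snippet below evaluates this worst-case probability for a few values of $|\mathcal{T}_s|$. The action-set size $|A| = 4$ and the pseudo-count $\alpha_0 = 1$ are arbitrary illustrative choices, not values taken from the experiments.

```python
def worst_case_transition_prob(num_actions, num_sampled_tasks, alpha_0):
    """Worst-case probability of exploring a transition from S_hat into S_B,
    as given by Equation (3.6)."""
    return alpha_0 / ((num_actions - 1) * (num_sampled_tasks + alpha_0) + alpha_0)

# Illustrative values only: |A| = 4 actions and pseudo-count alpha_0 = 1.
for n_tasks in (0, 1, 5, 20, 100):
    p = worst_case_transition_prob(num_actions=4, num_sampled_tasks=n_tasks, alpha_0=1.0)
    print(f"|T_s| = {n_tasks:3d}  ->  P = {p:.4f}")
```

With no sampled tasks the probability is $1/|A| = 0.25$; as $|\mathcal{T}_s|$ grows it decays as roughly $1/\big((|A|-1)\,|\mathcal{T}_s|\big)$, matching the limiting behaviour described above.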
As a result, we consider that there is a distribution over all tasks, which has been sampled to give the current action prior. $\alpha_0$ must therefore reflect this sampling, and is chosen based on the variability and diversity of tasks in the domain.

Exploring using action priors is therefore a form of safe exploration, achieved in task $N+1$ by biasing exploration towards behaviours which have been used successfully in $N$ related tasks in the same domain. The key assumption here is that interaction with unsafe states and actions is consistent across all tasks, such that states which are unsafe given one set of tasks cannot constitute goal states for others. Unsafe states are the subset of $S_B$, and unsafe actions the subset of $A_B$, which cause a large negative reward, or even damage to the agent. This effect of safe exploration is evident in the results presented in Section 3.4: the higher initial rewards indicate both interaction with fewer negative-reward states and fewer failed episodes.