Reinforcement learning (RL) deals with the problem of an agent that must learn how to behave in order to maximize its utility through its interactions with an environment (Sutton & Barto, 1998; Kaelbling, Littman & Moore, 1996). Reinforcement learning problems are usually formalized as Markov Decision Processes (MDPs), which consist of a finite set of states and a finite set of actions that the agent can perform. At any given point in time, the agent is in a certain state and picks an action. It then observes the new state this action leads to and receives a reward signal. The goal of the agent is to maximize its long-term reward. In this standard formalization, no particular structure or relationship between states is assumed. However, learning in environments with extremely large state spaces is infeasible without some form of generalization. Exploiting the underlying structure of a problem can enable such generalization, and this has long been recognized as an important aspect of representing sequential decision tasks (Boutilier et al., 1999).
Hierarchical Reinforcement Learning is the subfield of RL that deals with the discovery and/or exploitation of this underlying structure. Two main ideas come into play in hierarchical RL. The first is to break a task into a hierarchy of smaller subtasks, each of which can be learned faster and more easily than the whole problem. Subtasks can also be performed multiple times in the course of achieving the larger task, reusing accumulated knowledge and skills. The second idea is to use state abstraction within subtasks: not every subtask needs to be concerned with every aspect of the state space, so some states can be abstracted away and treated as equivalent for the purposes of the given subtask.
In this section, we will introduce the MDP formalism, where most of the research in standard RL has been done. We will then mention the two main approaches used for learning MDPs: model-based and model-free RL. Finally, we will introduce two formalisms that extend MDPs and are widely used in the Hierarchical RL field: semi-Markov Decision Processes (SMDPs) and Factored MDPs.
Markov Decision Processes (MDPs)
A Markov Decision Process consists of:
a set of states S
a set of actions A
a transition probability function Pr(s′ | s, a), representing the probability of the environment transitioning to state s′ when the agent performs action a from state s. It is sometimes written T(s, a, s′).
a reward function R(s, a) = E[r | s, a], representing the expected immediate reward obtained by taking action a from state s.
a discount factor γ ∈ (0, 1], that downweights future rewards and whose precise role will be clearer in the following equations.
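These components can be encoded directly as data. The sketch below is one possible illustration of a tiny two-state MDP; the state names, action names, and numbers are hypothetical, not taken from the text:

```python
# A toy two-state MDP built from the components listed above.
# All names and numbers are illustrative.
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)] maps each next state s' to Pr(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the expected immediate reward E[r | s, a].
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 0.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9  # discount factor in (0, 1]

# Sanity check: each transition distribution sums to 1.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Note that the transition and reward functions are indexed by state-action pairs, which makes the Markov property explicit: the next-state distribution depends only on the current state and action.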
A deterministic policy π: S → A is a function that determines, for each state, what action to take. For any given policy π, we can define a value function Vπ, representing the expected infinite-horizon discounted return obtained by following that policy starting at state s:
Vπ(s) = E[r0 + γr1 + γ²r2 + γ³r3 + …].
Bellman (1957) provides a recursive way of determining the value function when the reward and transition probabilities of an MDP are known, called the Bellman equation:
Vπ(s) = R(s, π(s)) + γ Σs′∈S T(s, π(s), s′) Vπ(s′),
commonly rewritten as an action-value function or Q-function:
Qπ(s, a) = R(s, a) + γ Σs′∈S T(s, a, s′) Vπ(s′).
An optimal policy π*(s) is a policy that returns the action a that maximizes the optimal value function:
π*(s) = argmaxa Q*(s, a).
States can be represented as a set of state variables or factors, representing different features of the environment: s = ⟨f1, f2, f3, …, fn⟩.
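To make the Bellman equation concrete, the sketch below evaluates a fixed deterministic policy on a hypothetical two-state MDP by repeatedly applying the Bellman backup until the values stop changing. The MDP, policy, and numbers are illustrative assumptions, not from the text:

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
#   V(s) <- R(s, pi(s)) + gamma * sum_{s'} T(s, pi(s), s') * V(s')
# on a toy two-state MDP (hypothetical numbers).
states = ["s0", "s1"]
T = {  # T[(s, a)][s'] = Pr(s' | s, a), for the actions the policy uses
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "move"): 0.0, ("s1", "stay"): 1.0}
pi = {"s0": "move", "s1": "stay"}  # fixed deterministic policy
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(1000):
    # One Bellman backup for every state under policy pi.
    V_new = {
        s: R[(s, pi[s])]
        + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
        for s in states
    }
    converged = max(abs(V_new[s] - V[s]) for s in states) < 1e-10
    V = V_new
    if converged:
        break

# For the absorbing state s1, the fixed point of V = 1 + gamma * V
# gives V(s1) = 1 / (1 - gamma) = 10.
```

Because the Bellman backup is a γ-contraction, this iteration converges to the unique fixed point Vπ regardless of the initial values.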
Key Terms in this Chapter
Markov Decision Process: The most common formalism for environments used in reinforcement learning, where the problem is described in terms of a finite set of states, a finite set of actions, transition probabilities between states, a reward signal, and a discount factor.
Semi-Markov Decision Process: An extension to the MDP formalism that deals with temporally extended actions and/or continuous time.
Factored-State Markov Decision Process: An extension to the MDP formalism used in Hierarchical RL where the transition probability is defined in terms of factors, allowing the representation to ignore certain state variables in certain contexts.
Hierarchical Task Decomposition: A decomposition of a task into a hierarchy of smaller subtasks.
Hierarchical Reinforcement Learning: A subfield of reinforcement learning concerned with the discovery and use of task decomposition, hierarchical control, and temporal and state abstraction (Barto & Mahadevan, 2003).
State-Space Generalization: The technique of grouping together states in the underlying MDP and treating them as equivalent for certain purposes.
Reinforcement Learning: The problem faced by an agent that must learn behavior that maximizes a utility measure through its interaction with the environment.