Model-Based Multi-Objective Reinforcement Learning by a Reward Occurrence Probability Vector

Tomohiro Yamaguchi, Shota Nagahama, Yoshihiro Ichikawa, Yoshimichi Honma, Keiki Takadama
Copyright: © 2020 | Pages: 27
DOI: 10.4018/978-1-7998-1382-8.ch010

Abstract

This chapter describes solving multi-objective reinforcement learning (MORL) problems where there are multiple conflicting objectives with unknown weights. Previous model-free MORL methods require a large number of calculations to collect a Pareto optimal set for each V/Q-value vector. In contrast, model-based MORL can reduce this calculation cost compared with model-free MORL. However, the previous model-based MORL method works only in deterministic environments. To address these problems, this chapter proposes a novel model-based MORL method based on a reward occurrence probability (ROP) vector with unknown weights. Experimental results are reported for stochastic learning environments with up to 10 states, 3 actions, and 3 reward rules. The results show that the proposed method collects all Pareto optimal policies, and the total learning time was about 214 seconds (10 states, 3 actions, 3 rewards). As future research directions, ways to speed up the method and how to use non-optimal policies are discussed.

Introduction

Reinforcement learning (RL) is a popular algorithm for automatically solving sequential decision problems such as robot behavior learning, and most work has focused on single-objective settings that yield a single solution. A single-objective RL can solve a simple learning task in a simple situation. However, in real-world robotics, a robot often faces situations in which the optimal condition for its own objective changes, for example an automated driving car on a public road shared with many human-driven cars. The real-world learner therefore has to handle multiple objectives that may conflict, for instance through a subsumption architecture (Tajmajer, 2017), or whose weights may depend on the situation around the learner. It is therefore important to study multi-objective optimization problems in both robotics and reinforcement learning research.

In multi-objective reinforcement learning (MORL), the reward function emits a reward vector instead of a scalar reward. A scalarization function with a vector of n weights (a weight vector) is commonly used to decide a single solution. The simplest scalarization function is a linear one such as a weighted sum. The main problem of previous MORL methods is the huge learning cost required to collect all Pareto optimal policies, which makes it hard to learn high-dimensional Pareto optimal policies. To solve this, this chapter proposes a novel model-based MORL method based on the reward occurrence probability (ROP) with unknown weights. There are two main features. The first is that the average reward of a policy is defined by the inner product of the ROP vector and the weight vector. The second is that the method learns the ROP of each policy instead of Q-values. Pareto optimal deterministic policies directly form the vertices of a convex hull in the ROP vector space, so they can be computed independently of the weights and only once. The experimental results show that the authors' proposed method collected all Pareto optimal policies in three-dimensional stochastic environments with a small computation time, whereas previous MORL methods learn at most two- or three-dimensional deterministic environments.
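As a rough illustration of these two features, the sketch below (hypothetical data and function names, not the authors' implementation) treats every deterministic policy as a point in the ROP vector space: the scalarized average reward is the inner product of a policy's ROP vector with the weight vector, and the candidates for Pareto optimal policies under linear scalarization are the vertices of the convex hull of those points, so they can be found once without fixing the weights.

```python
# Minimal sketch (not the chapter's code): policies whose ROP vectors lie
# strictly inside the convex hull can never maximize a linear scalarization,
# so the hull vertices are the candidate Pareto optimal policies.
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical ROP vectors of 5 deterministic policies for 3 reward rules.
rop_vectors = np.array([
    [0.10, 0.30, 0.05],
    [0.25, 0.10, 0.10],
    [0.05, 0.05, 0.40],
    [0.15, 0.20, 0.15],   # interior point: never optimal for any weighting
    [0.30, 0.25, 0.02],
])

def average_reward(rop, weights):
    """Average reward of a policy = inner product of its ROP vector and the weight vector."""
    return float(np.dot(rop, weights))

# Hull vertices are computed once, independently of the weights.
# (For nonnegative weights only the maximizing face of the hull matters,
# so some vertices may still be irrelevant in practice.)
hull = ConvexHull(rop_vectors)
candidates = sorted(set(hull.vertices))
print("Convex-hull policies:", candidates)

# Any weight vector can then be evaluated without relearning.
weights = np.array([0.5, 0.3, 0.2])
best = max(candidates, key=lambda i: average_reward(rop_vectors[i], weights))
print("Best policy for weights", weights, "is policy", best)
```

Once the hull vertices are stored, evaluating a new weight vector is a single pass over the candidates rather than a new learning run.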

The objectives of this chapter are as follows:

  • 1. Solving multi-objective reinforcement learning problems where there are multiple conflicting objectives with unknown weights.

  • 2. Learning all Pareto optimal solutions, which maximize the average reward defined by the reward occurrence probability (ROP) vector of a solution and unknown weights.

  • 3. Visualizing the distribution of all Pareto optimal solutions in the ROP vector space.

Background

Reinforcement learning (RL) is a popular algorithm for a learning agent to automatically solve sequential decision problems, which are commonly modeled as Markov decision processes (MDPs). An MDP is a discrete-time stochastic control process in which outcomes are partly random and partly under the control of a decision maker. At each time step, the process is in some state s, and the decision maker may choose any action a available in state s. The process responds at the next time step by randomly moving into a new state s' and giving the decision maker a corresponding reward Ra(s, s'). In most reinforcement learning methods, the reward is simplified as Ra(s) = R(a, s). The probability that the process moves into its new state s' is influenced by the chosen action; specifically, it is given by the state transition function Pa(s, s'). When the next state s' depends only on the current state s and the decision maker's action a (that is, it is independent of all previous states and actions), this property is called the Markov property. A discrete MDP model is represented by both the state transition matrix Pa(s, s') and the reward matrix Ra(s, s') for every triple of state s, action a, and new state s' in the environment.
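The following toy sketch (not one of the chapter's benchmark environments) shows one common way to store such a discrete MDP in tabular form: a transition array indexed as Pa(s, s') and a reward array in the simplified form R(a, s).

```python
# Minimal sketch of a discrete MDP stored as tabular arrays
# (a toy example, not one of the chapter's benchmark environments).
import numpy as np

n_states, n_actions = 3, 2

# P[a, s, s'] = probability of moving to s' when taking action a in state s.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0],
        [0.0, 0.7, 0.3],
        [0.5, 0.0, 0.5]]
P[1] = [[0.1, 0.0, 0.9],
        [0.6, 0.4, 0.0],
        [0.0, 0.9, 0.1]]

# R[a, s] = reward for taking action a in state s (the simplified form R(a, s)).
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

# Each row of P[a] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)

# Sampling one step of the process from state s under action a.
rng = np.random.default_rng(0)
s, a = 0, 1
s_next = rng.choice(n_states, p=P[a, s])
print(f"s={s}, a={a} -> s'={s_next}, reward={R[a, s]}")
```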

Key Terms in this Chapter

Markov Chain: A stochastic model describing a sequence of possible states in which the probability of each state depends only on the previous state. It can be obtained from a Markov decision process by removing the actions and rewards.

Weight Vector: A vector expressing the trade-off among multiple objectives; each element represents the weight of one objective.

Multi-Objective MDP (MOMDP): An MDP in which the reward function describes a vector of n rewards (reward vector), one for each objective, instead of a scalar.

Model-Based Approach: A reinforcement learning approach that starts by directly estimating the MDP model statistically, then calculates the value of each state V(s) or the quality of each state-action pair Q(s, a) using the estimated MDP, and searches for the optimal solution that maximizes V(s) in each state.

LC-Learning: An average-reward model-based reinforcement learning method. It collects all reward-acquiring deterministic policies under the unichain condition.

Pareto Optimization: Finding multiple policies that cover the Pareto front, which requires a collective search that samples the Pareto set.

Reward Occurrence Probability (ROP): The expected occurrence probability of the reward per step.

Reinforcement Learning: A popular learning algorithm for automatically solving sequential decision problems, which are commonly modeled as Markov decision processes (MDPs).

Average Reward: The expected reward received per step when an agent performs state transitions routinely according to a policy.

Markov Decision Process (MDP): A discrete-time, discrete-state-space stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
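As a worked illustration of the Average Reward and Reward Occurrence Probability terms above (a hypothetical example, not the chapter's algorithm): under the unichain condition, a fixed deterministic policy induces a Markov chain, and the average reward per step is the stationary-distribution-weighted sum of the per-state rewards; replacing the rewards with 0/1 indicators of whether a given reward rule fires would yield the occurrence probability per step of that reward.

```python
# Minimal sketch (not the chapter's algorithm): average reward of a fixed
# deterministic policy on a unichain MDP via the stationary distribution of
# the Markov chain induced by the policy.
import numpy as np

# Hypothetical 3-state example: P_pi[s, s'] is the policy-induced transition
# matrix, r_pi[s] is the expected reward received in state s under the policy.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.7, 0.3],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([0.0, 1.0, 2.0])

# Stationary distribution d solves d P_pi = d with the entries of d summing to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()

# Average reward per step = sum_s d(s) * r_pi(s).
rho = float(d @ r_pi)
print("stationary distribution:", d, "average reward:", rho)
```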
