Tomohiro Yamaguchi, Shota Nagahama, Yoshihiro Ichikawa, Yoshimichi Honma, Keiki Takadama

Copyright: © 2020
Pages: 27

DOI: 10.4018/978-1-7998-1382-8.ch010

Chapter Preview

Reinforcement learning (RL) is a popular algorithm for automatically solving sequential decision problems such as robot behavior learning, and most RL methods focus on single-objective settings that yield a single solution. Single-objective RL can solve a simple learning task in a simple situation. In real-world robotics, however, a robot often faces situations in which the optimal condition for its own objective changes, as with an automated car driving on a public road shared with many human-driven cars. A real-world learner therefore has to handle multiple objectives that may conflict, for example through a subsumption architecture (Tajmajer 2017), or objectives whose weights depend on the situation around the learner. It is thus important to study multi-objective optimization problems in both robotics and reinforcement learning research.

In multi-objective reinforcement learning (MORL), the reward function emits a reward vector instead of a scalar reward. A scalarization function with a vector of n weights (the weight vector) is commonly used to decide a single solution. The simplest scalarization function is linear scalarization, such as a weighted sum. The main problem of previous MORL methods is the huge learning cost required to collect all Pareto optimal policies, which makes it hard to learn Pareto optimal policies in high dimensions. To solve this, this chapter proposes a novel model-based MORL method based on reward occurrence probability (ROP) with unknown weights. It has two main features. First, the average reward of a policy is defined as the inner product of the ROP vector and the weight vector. Second, the method learns the ROP of each policy instead of Q-values. Pareto optimal deterministic policies directly form the vertices of a convex hull in the ROP vector space. Therefore, Pareto optimal policies are calculated independently of the weights and only once. The experimental results show that the authors' proposed method collected all Pareto optimal policies in three-dimensional stochastic environments with a small computation time, whereas previous MORL methods learn deterministic environments of at most two or three dimensions.
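The inner-product definition of average reward can be illustrated with a minimal sketch. The policy names, ROP values, and weight vectors below are hypothetical and are not taken from the chapter; the function names `average_reward` and `best_policy` are likewise introduced here only for illustration.

```python
# Hypothetical ROP vectors: one tuple per deterministic policy, one entry per
# objective. These are illustrative values, not results from the chapter.
rop_vectors = {
    "policy_a": (0.50, 0.00, 0.10),
    "policy_b": (0.20, 0.30, 0.10),
    "policy_c": (0.00, 0.10, 0.40),
}


def average_reward(rop, weights):
    """Average reward as the inner product of a ROP vector and a weight vector."""
    if len(rop) != len(weights):
        raise ValueError("ROP vector and weight vector must have the same length")
    return sum(p * w for p, w in zip(rop, weights))


def best_policy(rop_by_policy, weights):
    """Return the name of the policy whose ROP vector maximizes the weighted sum."""
    return max(
        rop_by_policy,
        key=lambda name: average_reward(rop_by_policy[name], weights),
    )


# Two hypothetical weight vectors expressing different trade-offs.
weights_1 = (1.0, 0.5, 0.2)
weights_2 = (0.1, 0.2, 1.0)

choice_1 = best_policy(rop_vectors, weights_1)
choice_2 = best_policy(rop_vectors, weights_2)
```

Because the ROP vectors do not depend on the weights, they are computed once and only the cheap inner product is repeated when the weights change, which is the property the abstract relies on.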

The objectives of this chapter are as follows:

*1.* To solve multi-objective reinforcement learning problems in which there are multiple conflicting objectives with unknown weights.

*2.* To learn all Pareto optimal solutions that maximize the average reward defined by the reward occurrence probability (ROP) vector of a solution and the unknown weights.

*3.* To visualize the distribution of all Pareto optimal solutions in the ROP vector space.
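As a rough illustration of what collecting Pareto optimal solutions involves, the sketch below filters a set of hypothetical ROP vectors down to the non-dominated ones. It is a plain dominance filter, a simplification that is not the convex hull computation the chapter describes, and the sample points are invented for the example.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(points):
    """Keep the points that no other point dominates, preserving input order."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical three-objective ROP vectors (illustrative values only).
rops = [
    (0.5, 0.0, 0.1),
    (0.2, 0.3, 0.1),
    (0.0, 0.1, 0.4),
    (0.1, 0.1, 0.1),  # dominated by (0.2, 0.3, 0.1)
    (0.2, 0.2, 0.1),  # dominated by (0.2, 0.3, 0.1)
]

front = pareto_front(rops)
```

The surviving points in `front` are the ones that would be plotted when visualizing the solution set in the ROP vector space. Under linear scalarization, only points on the convex hull can be optimal for some weight vector, so a hull computation would be a stricter filter than this one.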

Reinforcement learning (RL) is a popular algorithm with which a learning agent automatically solves sequential decision problems, which are commonly modeled as Markov decision processes (MDPs). An MDP is a discrete-time stochastic control process in which outcomes are partly random and partly under the control of a decision maker. At each time step, the process is in some state *s*, and the decision maker may choose any action *a* that is available in state *s*. The process responds at the next time step by randomly moving into a new state *s'*, and giving the decision maker a corresponding reward *R _{a}*(

Markov Chain: A stochastic model describing a sequence of possible states in which the probability of each state depends only on the previous state. It can be viewed as a Markov decision process with the actions and rewards removed.

Weight Vector: A vector that expresses the trade-off among multiple objectives; each element is the weight of one objective.

Multi-Objective MDP (MOMDP): An MDP in which the reward function describes a vector of n rewards (reward vector), one for each objective, instead of a scalar.

Model-Based Approach: A reinforcement learning approach that first estimates the MDP model directly from statistics, then uses the estimated MDP to calculate the value of each state, V(s), or the quality of each state-action pair, Q(s, a), in order to search for the optimal solution that maximizes V(s) in each state.

LC-Learning: An average-reward, model-based reinforcement learning method. It collects all reward-acquiring deterministic policies under the unichain condition.

Pareto Optimization: The task of finding multiple policies that cover the Pareto front, which requires a collective search that samples the Pareto set.

Reward Occurrence Probability (ROP): The expected probability per step that the reward occurs.

Reinforcement Learning: A popular learning algorithm for automatically solving sequential decision problems, which are commonly modeled as Markov decision processes (MDPs).

Average Reward: The expected reward received per step when an agent repeatedly makes state transitions according to a policy.

Markov Decision Process (MDP): A discrete-time, discrete-state-space stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
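Several of the terms above (Markov chain, ROP, average reward) fit together as in the following sketch. It assumes that fixing a deterministic policy in an MDP yields a Markov chain that is unichain and aperiodic, so that simple power iteration converges to a stationary distribution. The transition matrix, the per-state reward occurrence probabilities, and the function names are all hypothetical and chosen only for illustration; this is not the chapter's algorithm.

```python
def stationary_distribution(transition, iterations=200):
    """Approximate the stationary distribution of a Markov chain by power iteration.

    transition[s][t] is the probability of moving from state s to state t.
    Assumes the chain is unichain and aperiodic so the iteration converges.
    """
    n = len(transition)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        dist = [sum(dist[s] * transition[s][t] for s in range(n)) for t in range(n)]
    return dist


def rop_vector(transition, reward_prob):
    """ROP per objective: stationary state probability times per-state reward occurrence.

    reward_prob[s][k] is the (hypothetical) probability that reward k occurs
    in one step taken from state s under the fixed policy.
    """
    dist = stationary_distribution(transition)
    n_objectives = len(reward_prob[0])
    return [
        sum(dist[s] * reward_prob[s][k] for s in range(len(dist)))
        for k in range(n_objectives)
    ]


# Hypothetical two-state chain induced by one deterministic policy.
transition = [
    [0.5, 0.5],
    [0.2, 0.8],
]
# Reward 0 always occurs in state 0; reward 1 occurs half the time in state 1.
reward_prob = [
    [1.0, 0.0],
    [0.0, 0.5],
]

rop = rop_vector(transition, reward_prob)
weights = (1.0, 2.0)  # hypothetical weight vector
avg_reward = sum(p * w for p, w in zip(rop, weights))
```

For this chain the stationary distribution is (2/7, 5/7), so the ROP vector is (2/7, 5/14) and the average reward under the weights (1.0, 2.0) is 1.0.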


Copyright © 1988-2023, IGI Global - All Rights Reserved