Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

by Jingfeng Wu, Vladimir Braverman, Lin F. Yang

Released as an article.

2021  

Abstract

In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound 𝒪(√(min{d,S}· H^2 SAK)), where d is the number of objectives, S is the number of states, A is the number of actions, H is the length of the horizon, and K is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to ϵ error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity 𝒪(min{d,S}· H^3 SA/ϵ^2). This result partly resolves an open problem raised by <cit.>.
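The abstract's problem setup scalarizes a d-dimensional reward via an inner product with a preference vector that may change adversarially across episodes. Below is a minimal illustrative sketch of that scalarization only, not the authors' algorithm; the table `multi_reward`, the helper `scalarized_reward`, and the dimensions S, A, d are assumptions made for the example.

```python
import numpy as np

# Minimal sketch (assumption: illustrative sizes and random rewards, not from the paper).
S, A, d = 5, 3, 4                      # states, actions, number of objectives
rng = np.random.default_rng(0)
multi_reward = rng.random((S, A, d))   # pre-specified multi-objective rewards r(s, a) in [0, 1]^d

def scalarized_reward(state, action, preference):
    """Scalarized reward under a preference vector w: r_w(s, a) = <w, r(s, a)>."""
    return float(np.dot(preference, multi_reward[state, action]))

# Each episode the (possibly adversarial) preference may change; the agent then
# acts to maximize the scalarized return for that episode.
w = rng.random(d)
w /= w.sum()                           # normalizing the preference vector (assumption)
print(scalarized_reward(state=2, action=1, preference=w))
```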

Archived Files and Locations

application/pdf  749.1 kB
file_j5g6aysi5zgjvm43jklbvqf3dm
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2021-10-27
Version   v3
Language   en
arXiv  2011.13034v3
Work Entity
access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 9e9d95f3-98e6-4908-ba2d-bef0302ed3bf