Learning to Score Behaviors for Guided Policy Optimization
by
Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska,
Krzysztof Choromanski, Michael I. Jordan
2020
Abstract
We introduce a new approach for comparing reinforcement learning policies,
using Wasserstein distances (WDs) in a newly defined latent behavioral space.
We show that by utilizing the dual formulation of the WD, we can learn score
functions over policy behaviors that can in turn be used to guide policy
optimization toward desired behaviors, or away from undesired ones. Combined
with smoothed WDs, the dual formulation allows us to devise efficient
algorithms that take stochastic gradient descent steps through WD regularizers.
incorporate these regularizers into two novel on-policy algorithms,
Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which
we demonstrate can outperform existing methods in a variety of challenging
environments. We also provide an open-source demo.
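The key mechanism above — optimizing a dual potential of a smoothed WD by stochastic gradient steps, so that the potential acts as a learned score over behaviors — can be illustrated with a minimal numpy sketch. This is our own toy illustration under stated assumptions, not the paper's implementation: we assume an entropic smoothing parameter `eps`, a squared-Euclidean cost between behavior embeddings, and the semi-dual form of entropy-regularized optimal transport; the function name `smoothed_wd_dual` is hypothetical.

```python
import numpy as np

def smoothed_wd_dual(X, Y, eps=0.1, lr=0.5, iters=500, seed=0):
    """Estimate an entropy-smoothed Wasserstein distance between two
    empirical clouds of behavior embeddings X (n, d) and Y (m, d) by
    stochastic gradient ascent on the semi-dual potential g, which
    plays the role of a learned score over Y's behaviors.
    """
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    # squared-Euclidean ground cost between embeddings (an assumption)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    g = np.zeros(m)  # dual potential / score, one value per Y sample
    for _ in range(iters):
        idx = rng.integers(0, n, size=32)        # minibatch from X
        logits = (g[None, :] - C[idx]) / eps
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # soft assignments
        # gradient of the semi-dual objective w.r.t. g
        grad = 1.0 / m - p.mean(axis=0)
        g += lr * grad                           # ascent step
    # evaluate the semi-dual objective on the full sample:
    # mean(g) + mean_x[ -eps * log(1/m * sum_j exp((g_j - c(x,y_j))/eps)) ]
    logits = (g[None, :] - C) / eps
    mx = logits.max(axis=1, keepdims=True)
    lse = eps * (np.log(np.mean(np.exp(logits - mx), axis=1)) + mx[:, 0])
    return g, float(np.mean(g) - np.mean(lse))
```

The returned scalar approximates the smoothed WD, and its differentiability in the samples is what makes it usable as a regularizer inside a policy-optimization loop; here we only show the dual-potential training itself.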
Archived: arXiv 1906.04349v4 (PDF, 3.1 MB), via arxiv.org and web.archive.org.