Safe Policy Improvement with Baseline Bootstrapping
release_q7vb7w3ugvdtpdgrpdv7ghoe2e

by Romain Laroche, Paul Trichelair, Rémi Tachet des Combes

Released as an article.

2019  

Abstract

This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data. Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high. Our first algorithm, Π_b-SPIBB, comes with SPI theoretical guarantees. We also implement a variant, Π_≤b-SPIBB, that is even more efficient in practice. We apply our algorithms to a motivational stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB with respect to existing algorithms, not only in safety but also in mean performance. Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with a deep RL implementation called SPIBB-DQN, which is, to the best of our knowledge, the first RL algorithm relying on a neural network representation able to train efficiently and reliably from batch data, without any interaction with the environment.
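To make the "bootstrap with the baseline when uncertainty is high" idea concrete, below is a minimal tabular sketch of the Π_b-SPIBB policy-improvement step, assuming uncertainty is measured by state-action visit counts in the batch and a count threshold (written N_∧ in the paper). The function and variable names (spibb_greedy_step, n_wedge) are illustrative, not taken from the authors' code.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    """One Pi_b-SPIBB policy-improvement step (sketch, not the authors' code).

    q:       (S, A) estimated action values on the batch-estimated MDP
    pi_b:    (S, A) baseline policy probabilities
    counts:  (S, A) state-action visit counts in the batch dataset
    n_wedge: count threshold below which a pair is considered uncertain
    Returns an (S, A) policy that copies pi_b on uncertain ("bootstrapped")
    pairs and puts the remaining probability mass on the best well-estimated
    action in each state.
    """
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    bootstrapped = counts < n_wedge  # uncertain state-action pairs
    for s in range(n_states):
        # Keep the baseline probabilities where the data is too scarce.
        pi[s, bootstrapped[s]] = pi_b[s, bootstrapped[s]]
        free_mass = 1.0 - pi[s].sum()
        non_boot = np.where(~bootstrapped[s])[0]
        if len(non_boot) > 0:
            # Assign the remaining mass greedily among well-estimated actions.
            best = non_boot[np.argmax(q[s, non_boot])]
            pi[s, best] += free_mass
        else:
            # Every action is uncertain in this state: keep the baseline.
            pi[s] = pi_b[s]
    return pi
```

In the full method this improvement step alternates with policy evaluation on the MDP estimated from the batch until convergence; the Π_≤b-SPIBB variant relaxes the equality constraint on bootstrapped pairs to an upper bound by the baseline probabilities.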

Archived Files and Locations

application/pdf  2.5 MB
file_etpbup5u6rf5fchvc4s7vw6lni
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2019-06-07
Version   v5
Language   en
arXiv  1712.06924v5
Work Entity
access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 2dd56d87-8e47-41e7-a07d-b3c62c54a8e4