Policy Optimization with Stochastic Mirror Descent
by
Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jun Wen, Gang Pan
2021
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement
learning. This paper proposes the VRMPO algorithm: a sample-efficient
policy gradient method with stochastic mirror descent. In VRMPO, a novel
variance-reduced policy gradient estimator is presented to improve sample
efficiency. We prove that the proposed VRMPO needs only O(ε^{-3}) sample
trajectories to achieve an ε-approximate first-order stationary point,
which matches the best-known sample complexity for policy optimization.
Extensive experimental results demonstrate that VRMPO outperforms
state-of-the-art policy gradient methods in various settings.
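
The abstract names the two ingredients of the method: a variance-reduced
policy gradient estimator and a stochastic mirror descent update. The
following minimal Python sketch illustrates both, assuming a SARAH-style
recursive estimator and a p-norm mirror map as common instantiations; it
is not the authors' implementation, and every name in it (mirror_step,
vr_gradient, grad_fn, q) is hypothetical.

import numpy as np

def mirror_step(theta, grad, lr, q=2.0, eps=1e-12):
    """One stochastic mirror descent step under the mirror map
    psi(theta) = 0.5 * ||theta||_q^2; q = 2 recovers plain gradient descent."""
    # Dual-space step: z = grad_psi(theta) - lr * grad, where
    # grad_psi(theta)_i = ||theta||_q^(2-q) * sign(theta_i) * |theta_i|^(q-1).
    nrm = np.linalg.norm(theta, ord=q) + eps
    z = np.sign(theta) * np.abs(theta) ** (q - 1.0) * nrm ** (2.0 - q) - lr * grad
    # Map back to the primal with grad_psi*, which has the same form under
    # the conjugate exponent p (1/p + 1/q = 1).
    p = q / (q - 1.0)
    znrm = np.linalg.norm(z, ord=p) + eps
    return np.sign(z) * np.abs(z) ** (p - 1.0) * znrm ** (2.0 - p)

def vr_gradient(grad_fn, theta, theta_prev, v_prev, trajectories):
    """SARAH/SPIDER-style recursive variance-reduced estimator:
    v_t = g(tau; theta_t) - g(tau; theta_{t-1}) + v_{t-1},
    with both gradients evaluated on the same trajectories tau. In the
    policy setting, evaluating g at theta_{t-1} on trajectories sampled
    under theta_t needs an importance-weight correction, assumed here to
    be folded into grad_fn."""
    return grad_fn(theta, trajectories) - grad_fn(theta_prev, trajectories) + v_prev

A full optimization loop would alternate vr_gradient with mirror_step,
refreshing the estimate with a large-batch gradient at the start of each
epoch; the exact estimator, mirror map, and schedule used by VRMPO are
specified in the paper.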
Archived Files and Locations
application/pdf 2.2 MB
arxiv.org (repository) | web.archive.org (webarchive)
1906.10462v4