Policy Optimization with Stochastic Mirror Descent
release_sdlzc557kveyrbndppqyyt6ft4

by Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jun Wen, Gang Pan

Released as an article.

2021  

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only 𝒪(ε^{-3}) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
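To make the stochastic mirror descent idea in the abstract concrete, below is a minimal sketch of such an update in a toy setting: exponentiated gradient ascent on the probability simplex (mirror descent with a negative-entropy mirror map) driven by noisy gradient estimates. This is not the paper's VRMPO algorithm or its variance-reduced estimator; the reward vector, step size, and function name are illustrative assumptions.

import numpy as np

def exponentiated_gradient_step(p, grad, step_size):
    # Mirror ascent step on the probability simplex with the negative-entropy
    # mirror map: p_{t+1} is proportional to p_t * exp(step_size * grad).
    w = p * np.exp(step_size * grad)
    return w / w.sum()

# Toy usage: maximize a linear expected reward r路p over the simplex using a
# noisy (stochastic) gradient estimate, mimicking a sampled policy gradient.
rng = np.random.default_rng(0)
r = np.array([0.1, 0.5, 0.9])        # assumed per-action expected rewards
p = np.ones(3) / 3                   # initial uniform policy
for _ in range(200):
    grad_hat = r + 0.1 * rng.standard_normal(3)  # stochastic gradient estimate
    p = exponentiated_gradient_step(p, grad_hat, step_size=0.1)
print(p)  # probability mass concentrates on the highest-reward action

A variance-reduced estimator, as in the abstract, would replace the plain noisy gradient with one that reuses gradient information from a reference iterate; the mirror-map update itself is unchanged.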

Archived Files and Locations

application/pdf  2.2 MB
file_vluzkdbwabgubdq2umeoelyh44
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2021-12-09
Version   v4
Language   en
arXiv  1906.10462v4
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: e3465c58-cd67-4fba-8384-ad52fa97d5d0
API URL: JSON