GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning
by Jiajun Fan, Changnan Xiao, Yue Huang (2022)
Abstract
Deep Q Network (DQN) opened the door to deep reinforcement learning (DRL) by
combining deep learning (DL) with reinforcement learning (RL), and it observed
that the distribution of the acquired data changes during training. Because
this property can destabilize training, DQN introduced effective methods to
mitigate its downsides. Instead of focusing on the unfavourable aspects, we
argue that it is critical for RL to narrow the gap between the estimated data
distribution and the ground-truth data distribution, something supervised
learning (SL) cannot do.
From this new perspective, we extend the basic paradigm of RL, Generalized
Policy Iteration (GPI), into a more general version that we call Generalized
Data Distribution Iteration (GDI). Many existing RL algorithms and techniques
can be unified under the GDI paradigm, each of which can be viewed as a
special case of GDI. We provide theoretical proof of why GDI is superior to
GPI and of how it works, and we propose several practical algorithms based on
GDI to verify its effectiveness and generality.
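The abstract states GDI only at this level of generality. As a rough, runnable
illustration under our own assumptions (not the authors' algorithm), the sketch
below grafts a data-distribution iteration step onto ordinary GPI-style
Q-learning on a toy chain MDP: the "data distribution" is controlled by the
exploration rate, and a simple bandit over candidate rates plays the role of
the extra iteration. The environment, epsilon grid, and bandit rule are all
illustrative choices.

    import random
    from collections import defaultdict

    # Toy chain MDP: action 1 moves right, action 0 moves left,
    # reward 1.0 for reaching the goal state at the right end.
    N_STATES, GOAL, EPISODES = 8, 7, 500
    EPSILONS = [0.01, 0.1, 0.3]        # candidate behavior policies
    counts = {e: 1 for e in EPSILONS}  # bandit statistics per epsilon
    values = {e: 0.0 for e in EPSILONS}
    Q = defaultdict(float)             # Q[(state, action)]

    def step(s, a):
        s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
        return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

    for _ in range(EPISODES):
        # Data distribution iteration: pick the exploration rate whose
        # episodes have yielded the best average return (plus a small
        # exploration bonus for rarely tried rates).
        eps = max(EPSILONS, key=lambda e: values[e] + 1.0 / counts[e])
        s, ret, done, t = 0, 0.0, False, 0
        while not done and t < 50:
            a = random.randint(0, 1) if random.random() < eps else \
                max((0, 1), key=lambda x: Q[(s, x)])
            s2, r, done = step(s, a)
            # GPI inner loop: policy evaluation plus improvement,
            # here a standard Q-learning update.
            target = r + 0.9 * max(Q[(s2, 0)], Q[(s2, 1)])
            Q[(s, a)] += 0.1 * (target - Q[(s, a)])
            s, ret, t = s2, ret + r, t + 1
        # Update the bandit: shift the behavior distribution toward
        # exploration settings that produce better data.
        counts[eps] += 1
        values[eps] += (ret - values[eps]) / counts[eps]

    print({e: round(values[e], 3) for e in EPSILONS})

The inner Q-learning update is plain GPI (evaluation plus improvement); the
outer bandit is the part GDI adds, steering data collection toward behavior
that yields better returns.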
Empirical experiments demonstrate state-of-the-art (SOTA) performance on the
Arcade Learning Environment (ALE), where our algorithm achieves a 9620.98%
mean human normalized score (HNS), a 1146.39% median HNS, and 22 human world
record breakthroughs (HWRB) using only 200M training frames. Our work aims to
lead RL research into the journey of conquering human world records and to
seek agents that are truly superhuman in both performance and efficiency.
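For reference, HNS follows the normalization standard in the ALE literature: a
per-game score is rescaled so that 0% matches a random policy and 100% matches
the human reference, and the mean and median are then taken across games. A
minimal sketch with made-up scores (the game names and numbers below are
placeholders, not the paper's data):

    from statistics import mean, median

    def hns(agent, random_score, human):
        """Human normalized score, as a percentage: 0% matches a random
        policy, 100% matches the human reference score."""
        return 100.0 * (agent - random_score) / (human - random_score)

    # Illustrative per-game raw scores: (agent, random, human).
    scores = {"GameA": (9000.0, 200.0, 7000.0),
              "GameB": (120.0, 10.0, 30.0),
              "GameC": (5.0, 0.0, 50.0)}

    per_game = [hns(*v) for v in scores.values()]
    print(f"mean HNS = {mean(per_game):.2f}%, "
          f"median HNS = {median(per_game):.2f}%")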
Archived Files and Locations
application/pdf, 11.0 MB (arXiv:2106.06232v6), available from arxiv.org (repository) and web.archive.org (webarchive)