Model Selection in Batch Policy Optimization
by
Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai
2021
Abstract
We study the problem of model selection in batch policy optimization: given a
fixed, partial-feedback dataset and M model classes, learn a policy with
performance that is competitive with the policy derived from the best model
class. We formalize the problem in the contextual bandit setting with linear
model classes by identifying three sources of error that any model selection
algorithm should optimally trade off in order to be competitive: (1)
approximation error, (2) statistical complexity, and (3) coverage. The first
two sources are common in model selection for supervised learning, where
optimally trading off these properties is well-studied. In contrast, the third
source is unique to batch policy optimization and is due to dataset shift
inherent to the setting. We first show that no batch policy optimization
algorithm can achieve a guarantee addressing all three simultaneously,
revealing a stark contrast between difficulties in batch policy optimization
and the positive results available in supervised learning. Despite this
negative result, we show that relaxing any one of the three error sources
enables the design of algorithms achieving near-oracle inequalities for the
remaining two. We conclude with experiments demonstrating the efficacy of these
algorithms.
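
To make the setting concrete, the following is a minimal sketch of hold-out model selection over nested linear model classes on a logged contextual bandit dataset. It is an illustration only and is not the algorithm from the paper. Everything in it is an assumption introduced for the example: the synthetic data generator `make_batch`, the uniform logging policy (which sidesteps the coverage issue), the nested classes defined by the first d feature coordinates, the ridge parameter `lam`, and the 50/50 train/validation split. Under these assumptions the sketch trades off only approximation error against statistical complexity.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_ridge(X, y, lam=1e-3):
    """Ridge regression: solve (X^T X + lam I) theta = X^T y."""
    d = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)

def mse(theta, X, y):
    """Mean squared prediction error of a linear model."""
    return sum((sum(t * xi for t, xi in zip(theta, x)) - yi) ** 2
               for x, yi in zip(X, y)) / len(y)

def make_batch(n, theta_star, n_actions, rng):
    """Hypothetical logged data: (features of all actions, logged action, reward)."""
    D = len(theta_star)
    data = []
    for _ in range(n):
        feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n_actions)]
        a = rng.randrange(n_actions)  # assumed uniform logging policy
        r = sum(t * f for t, f in zip(theta_star, feats[a])) + rng.gauss(0, 0.1)
        data.append((feats, a, r))
    return data

def select_model(data, dims, lam=1e-3):
    """Fit each nested class on the first half, score it on the second half."""
    half = len(data) // 2
    train, val = data[:half], data[half:]
    scores = {}
    for d in dims:
        theta = fit_ridge([f[a][:d] for f, a, _ in train],
                          [r for _, _, r in train], lam)
        err = mse(theta, [f[a][:d] for f, a, _ in val], [r for _, _, r in val])
        scores[d] = (err, theta)
    best = min(dims, key=lambda d: scores[d][0])
    return best, scores

def greedy_action(theta, feats):
    """Policy derived from a fitted model: pick the action with highest predicted reward."""
    d = len(theta)
    return max(range(len(feats)),
               key=lambda a: sum(t * f for t, f in zip(theta, feats[a][:d])))

rng = random.Random(0)
theta_star = [1.0, 2.0, -2.0, 0.0, 0.0, 0.0]  # only the first 3 coordinates matter
batch = make_batch(400, theta_star, n_actions=4, rng=rng)
best_dim, scores = select_model(batch, dims=[1, 3, 6])
action = greedy_action(scores[best_dim][1], batch[0][0])
```

Here the class with d = 1 is misspecified and incurs a large hold-out error, while d = 3 and d = 6 both contain the true model and differ only in statistical complexity. With a non-uniform logging policy, hold-out error on logged actions would no longer reflect the value of the derived greedy policy, which is the coverage issue the abstract identifies.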
arXiv:2112.12320v1