Deep Exploration via Bootstrapped DQN (Osband et al. 2016)
Goal
- Offer a mechanism for deep exploration, that is, learners that can plan to
learn over multiple time steps.
Motivation
- $\epsilon$-greedy-like strategies explore only locally: each exploratory move
is an isolated random action, taken with no plan for what could be learned
several steps later.
- If a distant part of the state space is unexplored, but a known positive
reward lies close by on the path leading to it, the learner is highly unlikely
to draw the long run of consecutive exploratory $\epsilon$ moves needed to pass
up the nearby reward and reach that far-away zone.
Proposed solution: Bootstrapped DQN
- Train a DQN with $K$ different “heads”, i.e. $K$ different estimators
of the $Q$ function.
- These are all initialized at random independently.
- The assumption is that they will learn different policies, which helps the
agent explore the space better.
- For each episode:
- Pick a head $k$ uniformly at random over the $K$ heads
- Act greedily with respect to $Q_k$ for the whole episode (see the sketch
after this list).
- Thanks to the independent random initializations, one of the heads may well
believe there is some invaluable cheese in the “unobserved” part of the space;
whenever that head is the one driving an episode, it will naturally steer the
agent into that region to collect the reward.
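To make the per-episode head selection concrete, here is a minimal PyTorch-style sketch. The shared trunk, layer sizes, Gymnasium-style `env` interface, and the omission of the replay buffer and DQN update are all assumptions made for illustration, not details taken from the paper.

```python
import random
import torch
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    """K independent Q-value heads on top of a shared feature trunk (assumed architecture)."""

    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        # Each head is initialized independently at random, so the heads start
        # out disagreeing about Q-values in unexplored parts of the state space.
        self.heads = nn.ModuleList(
            [nn.Linear(64, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.trunk(obs))


def run_episode(env, net: BootstrappedQNetwork, n_heads: int):
    """Sample one head uniformly and act greedily w.r.t. its Q estimates."""
    k = random.randrange(n_heads)        # pick head k, kept for the whole episode
    obs, _ = env.reset()                  # Gymnasium-style API assumed
    done = False
    while not done:
        with torch.no_grad():
            q = net(torch.as_tensor(obs, dtype=torch.float32), head=k)
        action = int(q.argmax())          # greedy w.r.t. Q_k, no epsilon noise
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # (store the transition and run the usual DQN update here)
```

Because the head is fixed for the whole episode and the agent acts greedily with respect to it, any optimism that head happens to hold about an unvisited region is pursued consistently over many steps, instead of being washed out by per-step random noise as with $\epsilon$-greedy.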