Deep Exploration via Bootstrapped DQN (Osband et al. 2016)


Goal

  • Offer a mechanism to perform deep exploration, that is, produce learners that can plan to learn over multiple time steps rather than exploring one step at a time.

Motivation

  • $\epsilon$-greedy-like strategies only explore locally: exploration comes from isolated, single-step random actions.
  • If a distant part of the state space is unexplored while a known positive reward lies close by on the path leading to it, the learner is highly unlikely to draw the long run of consecutive exploratory ($\epsilon$) actions needed to reach that far-away zone; it will almost always settle for the nearby known reward instead.

Proposed solution: Bootstrapped DQN

  • Train a DQN with $K$ different “heads”, i.e. $K$ different estimators of the $Q$ function.
  • The heads are all initialized independently at random.
    • The assumption is that they will learn different policies, which helps explore the space better.
  • For each episode:
    • Pick a head $k$ uniformly at random over the $K$ heads
    • Act according to the value function $Q_k$ for the whole episode (see the sketch after this list).
  • Thanks to the independent random initializations, one of the heads may well believe there is some invaluable cheese in the “unobserved” part of the space; if that head is picked for an episode, it will naturally commit to travelling into that part of the space to collect the reward it imagines there, which is exactly the deep, multi-step exploration that $\epsilon$-greedy fails to perform.
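
To make the recipe concrete, here is a minimal sketch of the two main ingredients: a $Q$-network with $K$ independently initialized heads, and the per-episode head selection. This is not the authors' implementation; the tiny MLP body, the simplified environment interface (`reset`/`step`), and the omission of the training step are simplifying assumptions.

```python
import random
import torch
import torch.nn as nn


class BootstrappedQNet(nn.Module):
    """A shared body feeding K independently initialized Q-heads."""

    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        # Shared feature body (the paper shares a convolutional trunk;
        # a small MLP stands in for it here).
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        # K heads, each a separate estimator of Q(s, .),
        # initialized at random independently of one another.
        self.heads = nn.ModuleList(
            [nn.Linear(128, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.body(obs))


def run_episode(env, qnet: BootstrappedQNet, n_heads: int):
    """Play one episode greedily with respect to a single randomly chosen head."""
    k = random.randrange(n_heads)        # pick a head uniformly at random
    obs, done = env.reset(), False       # assumed simplified env interface
    transitions = []
    while not done:
        with torch.no_grad():
            q = qnet(torch.as_tensor(obs, dtype=torch.float32), head=k)
        action = int(q.argmax())         # act greedily under Q_k all episode long
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions                   # stored in a replay buffer to train the heads
```

In the paper, each stored transition additionally carries a bootstrap mask indicating which heads are allowed to train on it; that bookkeeping is left out of the sketch above.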