Explore-then-commit algorithm

Explore-then-commit (ETC) is an algorithm for the multi-armed bandit problem focused on finding the best trade-off between exploration and exploitation.

Multi-armed bandit problem

The multi-armed bandit problem is a sequential game where one player has to choose at each turn between $K$ actions (arms). Behind every arm $a$ is an unknown distribution $\nu _{a}$ that lies in a set ${\mathcal {D}}$ known by the player (for example, ${\mathcal {D}}$ can be the set of Gaussian distributions or Bernoulli distributions).

At each turn $t$ the player chooses (pulls) an arm $a_{t}$ and then receives an observation $X_{t}$ drawn from the distribution $\nu _{a_{t}}$.
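The interaction described above can be simulated with a small environment. The following is only an illustrative sketch: the class name `GaussianBandit`, its `pull` method and the choice of unit-variance Gaussian arms are assumptions made for the example, not part of the problem definition.

```python
import random


class GaussianBandit:
    """Hypothetical K-armed bandit whose arm a has distribution N(means[a], 1)."""

    def __init__(self, means, seed=None):
        self.means = list(means)          # unknown to the player in a real game
        self._rng = random.Random(seed)   # private generator for reproducibility

    def pull(self, arm):
        """Return one observation X_t drawn from the distribution of `arm`."""
        return self._rng.gauss(self.means[arm], 1.0)


bandit = GaussianBandit([0.0, 0.5], seed=0)
observation = bandit.pull(1)  # the player pulls arm 1 and observes a reward
```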

Regret minimization

The goal is to minimize the regret at time $T$, defined as

$$R_{T}:=\sum _{a=1}^{K}\Delta _{a}\mathbb {E} [N_{a}(T)]$$

where

$\mu _{a}:=\mathbb {E} _{X\sim \nu _{a}}[X]$ is the mean of arm $a$
$\mu ^{*}:=\max _{a}\mu _{a}$ is the highest mean
$\Delta _{a}:=\mu ^{*}-\mu _{a}$ is the gap between the highest mean and the mean of arm $a$
$N_{a}(t)$ is the number of pulls of arm $a$ up to turn $t$
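As a worked illustration of this definition, the regret can be computed directly from the arm means and the expected pull counts. The function name `regret` and the numbers below are made up for the example.

```python
def regret(means, expected_pulls):
    """Compute R_T = sum_a Delta_a * E[N_a(T)] from arm means and expected pulls."""
    best_mean = max(means)                   # mu^*
    gaps = [best_mean - mu for mu in means]  # Delta_a for every arm
    return sum(gap * n for gap, n in zip(gaps, expected_pulls))


# Three arms played for T = 100 turns in total (hypothetical values).
means = [1.0, 0.5, 0.0]
expected_pulls = [80, 15, 5]
print(regret(means, expected_pulls))  # 0.5 * 15 + 1.0 * 5 = 12.5
```

Pulls of the best arm contribute nothing, since its gap is zero; only the 20 pulls of suboptimal arms add to the regret.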

The player has to find an algorithm that chooses at each turn $t$ which arm to pull based on the previous actions and observations $(a_{s},X_{s})_{s<t}$ to minimize the regret $R_{T}$.

This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).^[1]

Algorithm

Figure: Two runs of ETC with the same $M=10$. In the first run the algorithm identifies the best arm after the exploration phase; in the second run it does not.

The algorithm explores each arm $M$ times. For the rest of the game, it exploits what it has learned by playing the arm with the highest empirical mean. If the horizon $T$ is known, then the number of explorations $M$ can be chosen as a function of $T$.

Variants of the algorithm exist,^[2] and adaptations to other settings can be found in the literature.^[3]

Pseudocode

The player chooses M (with MK <= T)
for each arm i from 1 to K do:
    select arm i M times
    set mu[i] to the empirical mean of the M observations
a := arm with the highest empirical mean mu[a]
for t from MK+1 to T do:
    select arm a
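The pseudocode above could be written in Python roughly as follows. This is a minimal sketch, not a reference implementation: the function name `explore_then_commit`, the `pull` callback and the zero-based arm indices are assumptions of the example.

```python
import random


def explore_then_commit(pull, K, M, T):
    """Run ETC for T turns on K arms, exploring each arm M times.

    `pull(a)` is a hypothetical callback returning one reward for arm `a`.
    Returns the committed arm and the number of pulls of every arm.
    """
    if M < 1 or M * K > T:
        raise ValueError("need M >= 1 and M * K <= T")
    counts = [0] * K
    means = [0.0] * K
    # Exploration phase: pull every arm M times and record its empirical mean.
    for a in range(K):
        total = 0.0
        for _ in range(M):
            total += pull(a)
        counts[a] = M
        means[a] = total / M
    # Commit phase: play the empirically best arm for the remaining turns.
    best = max(range(K), key=lambda a: means[a])
    for _ in range(T - M * K):
        pull(best)
        counts[best] += 1
    return best, counts


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.0, 0.5]  # hypothetical unit-variance Gaussian arms
    arm, pulls = explore_then_commit(
        lambda a: rng.gauss(true_means[a], 1.0), K=2, M=10, T=1000
    )
    print("committed arm:", arm, "pull counts:", pulls)
```

The empirical means are frozen after the exploration phase, so the committed arm never changes. An unlucky exploration phase therefore costs regret for every remaining turn.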

Theoretical results

Figure: Trade-off between exploration (large $M$) and exploitation (small $M$) for ETC

When all arms are $1$-sub-Gaussian and each arm is explored $M$ times, the regret at time $T$ satisfies^[1]

$$R_{T}\leq M\sum _{i=1}^{K}\Delta _{i}+(T-MK)\sum _{i=1}^{K}\Delta _{i}\exp \left(-{\frac {M\Delta _{i}^{2}}{4}}\right)$$

The first term is the cost of exploration:

$$M\sum _{i=1}^{K}\Delta _{i}$$

The second term is the cost of not having explored enough, which leaves a positive probability that the arm with the highest empirical mean is not an optimal arm:

$$(T-MK)\sum _{i=1}^{K}\Delta _{i}\exp \left(-{\frac {M\Delta _{i}^{2}}{4}}\right)$$

Increasing $M$ increases the first term while decreasing the second term. The best possible $M$ therefore depends on the gaps $(\Delta _{i})_{i}$, which are unknown to the player.
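To make the trade-off concrete, the upper bound can be evaluated numerically for different values of $M$. The sketch below assumes the gaps are known, which is not the case for the player; the function name `etc_regret_bound` and the values of `T` and `gaps` are chosen only for illustration.

```python
import math


def etc_regret_bound(M, T, gaps):
    """Evaluate the ETC regret upper bound for exploration length M."""
    K = len(gaps)
    exploration_cost = M * sum(gaps)
    commit_cost = (T - M * K) * sum(d * math.exp(-M * d * d / 4) for d in gaps)
    return exploration_cost + commit_cost


T = 1000
gaps = [0.0, 0.5]  # hypothetical two-armed problem: one optimal arm, one gap of 0.5

# Brute-force search over every M with M * K <= T.
best_M = min(range(1, T // len(gaps) + 1),
             key=lambda M: etc_regret_bound(M, T, gaps))
print(best_M, etc_regret_bound(best_M, T, gaps))
```

The minimizer found this way depends on both `T` and `gaps`, which is why a player who does not know the gaps cannot tune $M$ optimally in advance.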

For two arms with Gaussian distributions of variance $1$, it was proved that ETC cannot achieve the asymptotically optimal regret given by the Lai-Robbins lower bound.^[4]

References

[1] Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. doi:10.1017/9781108571401.
[2] Jin, Tianyuan; Xu, Pan; Xiao, Xiaokui; Gu, Quanquan (2021). "Double Explore-then-Commit: Asymptotic Optimality and Beyond". Conference on Learning Theory. Proceedings of Machine Learning Research. pp. 2584–2633.
[3] Nie, Guanyu; Agarwal, Mridul; Umrawal, Abhishek Kumar; Aggarwal, Vaneet; Quinn, Christopher John (2022). "An Explore-then-Commit Algorithm for Submodular Maximization under Full-Bandit Feedback". Uncertainty in Artificial Intelligence. Proceedings of Machine Learning Research. pp. 1541–1551.
[4] Garivier, Aurélien; Kaufmann, Emilie; Lattimore, Tor (2016). "On Explore-Then-Commit Strategies". Advances in Neural Information Processing Systems 29.
