Explore-then-commit algorithm
Explore-then-commit (ETC) is an algorithm for the multi-armed bandit problem, focused on finding the best trade-off between exploration and exploitation.
Multi-armed bandit problem
The multi-armed bandit problem is a sequential game in which a player chooses at each turn between $K$ actions (arms). Behind every arm $a$ is an unknown distribution $\nu_a$ that lies in a set $\mathcal{D}$ known to the player (for example, $\mathcal{D}$ can be the set of Gaussian distributions or of Bernoulli distributions).
At each turn $t$, the player chooses (pulls) an arm $a_t$ and then receives an observation $X_t$ drawn from the distribution $\nu_{a_t}$.
Regret minimization
The goal is to minimize the regret at time $T$, which is defined as
$$R_T = \sum_{a} (\mu^* - \mu_a)\, \mathbb{E}[N_a(T)]$$
where the sum runs over all arms and
- $\mu_a$ is the mean of arm $a$
- $\mu^* = \max_a \mu_a$ is the highest mean
- $N_a(T)$ is the number of pulls of arm $a$ up to turn $T$
The player has to find an algorithm that chooses at each turn which arm to pull, based on the previous actions and observations, so as to minimize the regret $R_T$.
This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).[1]
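The regret definition above can be evaluated directly once the arm means and the expected pull counts are known. The following is a minimal sketch; the function name `regret` and its arguments are illustrative and not taken from the cited sources.

```python
from typing import Sequence


def regret(means: Sequence[float], expected_pulls: Sequence[float]) -> float:
    """Sum over arms of (highest mean - arm mean) * expected number of pulls."""
    if len(means) != len(expected_pulls):
        raise ValueError("means and expected_pulls must have the same length")
    best = max(means)  # highest mean among all arms
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


if __name__ == "__main__":
    # Two arms with means 0.5 and 0.7; the worse arm was pulled 30 times
    # out of 100, so the regret is about 0.2 * 30 = 6.
    print(regret([0.5, 0.7], [30, 70]))
```

Only pulls of suboptimal arms contribute to the sum, since the gap of an optimal arm is zero.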
Algorithm
The algorithm explores each arm $M$ times. For the rest of the game, the algorithm exploits its discoveries by playing the arm with the highest empirical mean. If the horizon $T$ is known, then the number of explorations $M$ can depend on $T$.
Adaptations of the algorithm exist[2] and can be found in the literature for other settings.[3]
Pseudocode
The player chooses M
for each arm i from 1 to K do:
    select arm i M times
    update empirical mean mu[i]
for t from MK+1 to T do:
    select arm a with highest empirical mean mu[a]
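The pseudocode above could be turned into runnable Python along the following lines. This is a sketch under assumptions: each arm is modeled as a zero-argument callable returning one reward, the function name `explore_then_commit` is illustrative, and ties between empirical means are broken in favor of the lowest arm index.

```python
import random
from typing import Callable, List, Sequence


def explore_then_commit(
    arms: Sequence[Callable[[], float]], m: int, horizon: int
) -> List[int]:
    """Run ETC for `horizon` turns and return the index of the arm pulled at each turn."""
    k = len(arms)
    if m < 1 or m * k > horizon:
        raise ValueError("need m >= 1 and m * k <= horizon")

    sums = [0.0] * k
    pulls: List[int] = []

    # Exploration phase: pull every arm exactly m times.
    for i in range(k):
        for _ in range(m):
            sums[i] += arms[i]()
            pulls.append(i)

    # Commit to the arm with the highest empirical mean (first index on ties).
    empirical_means = [s / m for s in sums]
    best = max(range(k), key=empirical_means.__getitem__)

    # Exploitation phase: play the committed arm for the remaining turns.
    for _ in range(m * k, horizon):
        arms[best]()  # the reward is observed but no longer changes the choice
        pulls.append(best)
    return pulls


if __name__ == "__main__":
    rng = random.Random(0)
    # Two Gaussian arms with unit variance and means 0 and 1 (illustrative values).
    gaussian_arms = [lambda: rng.gauss(0.0, 1.0), lambda: rng.gauss(1.0, 1.0)]
    history = explore_then_commit(gaussian_arms, m=50, horizon=1000)
    print("committed arm:", history[-1])
```

Because the empirical means are never updated after the exploration phase, the committed arm is computed once and reused for every remaining turn.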
Theoretical results
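When all arms are $1$-subgaussian and each arm is explored $M$ times, the regret at time $T$ satisfies
$$R_T \le M \sum_{a} \Delta_a + (T - MK) \sum_{a} \Delta_a \exp\left(-\frac{M \Delta_a^2}{4}\right)$$
where $K$ is the number of arms and $\Delta_a$ is the gap between the highest mean and the mean of arm $a$.
The first term, $M \sum_{a} \Delta_a$, is the cost of the exploration.
The second term is the cost of not having explored enough, which leaves a positive probability that the arm with the highest empirical mean is not an optimal arm.
Increasing $M$ increases the first term while decreasing the second term. The best possible $M$ must depend on the gaps $\Delta_a$, which are unknown to the player.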
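The trade-off in the number of explorations can be illustrated numerically. The sketch below assumes the bound for 1-subgaussian arms in the form given by Lattimore and Szepesvári, namely M times the sum of the gaps plus (T - MK) times the sum of gap * exp(-M * gap^2 / 4), and searches for the number of explorations that minimizes it. The function names are illustrative, and the search uses the gaps, which a real player does not know.

```python
import math
from typing import Sequence


def etc_regret_bound(gaps: Sequence[float], m: int, horizon: int) -> float:
    """Assumed ETC bound for 1-subgaussian arms; `gaps` holds one gap per arm (0 for an optimal arm)."""
    k = len(gaps)
    exploration_cost = m * sum(gaps)
    commit_cost = (horizon - m * k) * sum(
        d * math.exp(-m * d * d / 4.0) for d in gaps
    )
    return exploration_cost + commit_cost


def best_exploration_count(gaps: Sequence[float], horizon: int) -> int:
    """Brute-force the m in 1..horizon//k that minimizes the bound."""
    k = len(gaps)
    return min(
        range(1, horizon // k + 1),
        key=lambda m: etc_regret_bound(gaps, m, horizon),
    )


if __name__ == "__main__":
    gaps = [0.0, 0.5]  # two arms, the second one is worse by 0.5
    horizon = 1000
    m_star = best_exploration_count(gaps, horizon)
    print(m_star, etc_regret_bound(gaps, m_star, horizon))
```

For small values of the exploration count the second term dominates, and for large values the first term does, so the minimizer lies strictly between the two extremes in this example.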
For two arms with Gaussian distributions of variance $1$, it was proved that ETC cannot achieve the asymptotically optimal regret given by the Lai–Robbins lower bound.[4]
References
- Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. doi:10.1017/9781108571401.
- Jin, Tianyuan; Xu, Pan; Xiao, Xiaokui; Gu, Quanquan (2021). "Double Explore-then-Commit: Asymptotic Optimality and Beyond". Conference on Learning Theory. Proceedings of Machine Learning Research. pp. 2584–2633.
- Nie, Guanyu; Agarwal, Mridul; Umrawal, Abhishek Kumar; Aggarwal, Vaneet; Quinn, Christopher John (2022). "An Explore-then-Commit Algorithm for Submodular Maximization under Full-Bandit Feedback". Uncertainty in Artificial Intelligence. Proceedings of Machine Learning Research. pp. 1541–1551.
- Garivier, Aurélien; Kaufmann, Emilie; Lattimore, Tor (2016). "On Explore-Then-Commit Strategies". Advances in Neural Information Processing Systems 29.