Algorithm IMED

A run of IMED with 3 Beta arms. Arm 1 is the optimal arm.

In multi-armed bandit problems, IMED (Indexed Minimum Empirical Divergence) is an algorithm developed in 2015 by Junya Honda and Akimichi Takemura. It is the first algorithm proved to be asymptotically optimal with respect to the problem-dependent Lai–Robbins lower bound^[1] for distributions supported in $(-\infty ,1]$.^[2]

Multi-armed bandit problem

The multi-armed bandit problem is a sequential game in which a player chooses at each turn between $K$ actions (arms). Behind every arm $a$ there is an unknown distribution $\nu _{a}$ that lies in a set ${\mathcal {D}}$ known to the player (for example, ${\mathcal {D}}$ can be the set of Gaussian distributions or Bernoulli distributions).

At each turn $t$, the player chooses (pulls) an arm $a_{t}$ and then receives an observation $X_{t}$ drawn from the distribution $\nu _{a_{t}}$ .

Regret minimization

The goal is to minimize the regret at time $T$ that is defined as

R_{T}:=\sum _{a=1}^{K}\Delta _{a}\mathbb {E} [N_{a}(T)]

where

$\mu _{a}:=\mathbb {E} [\nu _{a}]$ is the mean of arm $a$
$\mu ^{*}:=\max _{a}\mu _{a}$ is the highest mean
$\Delta _{a}:=\mu ^{*}-\mu _{a}$ is the suboptimality gap of arm $a$
$N_{a}(t)$ is the number of pulls of arm $a$ up to turn $t$

The player has to find an algorithm that chooses at each turn $t$ which arm to pull based on the previous actions and observations $(a_{s},X_{s})_{s<t}$ to minimize the regret $R_{T}$ .
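
As a small illustration of the regret definition above, the following Python sketch evaluates $R_{T}$ from the arm means and the expected pull counts. The function name `regret` and the 3-arm instance are hypothetical and chosen only for the example.

```python
from typing import Sequence


def regret(means: Sequence[float], expected_pulls: Sequence[float]) -> float:
    """Return R_T = sum_a Delta_a * E[N_a(T)] for the given arm means."""
    best = max(means)  # mu^*, the highest mean
    # Each arm contributes its gap Delta_a times its expected number of pulls.
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


# Hypothetical 3-arm instance after T = 100 turns.
means = [0.5, 0.25, 0.0]
pulls = [80, 12, 8]
print(regret(means, pulls))  # 0.25 * 12 + 0.5 * 8 = 7.0
```

Only the suboptimal arms contribute: pulls of the best arm have gap zero and add nothing to the regret.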

This is a trade-off between exploration, to find the best arm (the arm with the highest mean), and exploitation, to play the arm believed to be the best as often as possible.^[3]

Applications

Multi-armed bandit algorithms are used in a variety of fields; for example, they have applications in clinical trials, recommender systems, telecommunications^[4], and agriculture.^[5]

Algorithm IMED

The algorithm computes an index for each arm at each turn. It then pulls the arm with the smallest index.^[2]

The index is the sum of two terms. The first is the cost of transporting the empirical distribution to a distribution under which the arm is optimal. The second is the cost of having been pulled too often.

Formally, the index $I_{a}(t)$ of an arm $a$ at turn $t$ is defined as follows:

I_{a}(t):=N_{a}(t){\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),{\hat {\mu }}^{*}(t))+\ln(N_{a}(t))

where

${\mathcal {K}}_{\inf }(\nu ,\mu ):=\inf \left\{\mathrm {KL} (\nu ,{\tilde {\nu }})\ |\ {\tilde {\nu }}\in {\mathcal {P}}((-\infty ,1]),\ \mathbb {E} [{\tilde {\nu }}]>\mu \right\}$
$\mathrm {KL}$ is the Kullback–Leibler divergence
${\mathcal {P}}((-\infty ,1])$ is the set of distributions supported in $(-\infty ,1]$
${\hat {\nu }}_{a}(t)$ is the empirical distribution of arm $a$ at turn $t$
${\hat {\mu }}^{*}(t)$ is the highest empirical mean at turn $t$

Remark: For arms $a$ that satisfy ${\hat {\mu }}_{a}(t)={\hat {\mu }}^{*}(t)$ we have ${\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),{\hat {\mu }}^{*}(t))=0$ , so their index is equal to $\ln(N_{a}(t))$ .
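
The index can be sketched in Python under a simplifying assumption that is not part of the general definition: the rewards are Bernoulli, in which case the infimum reduces to the Kullback–Leibler divergence between two Bernoulli distributions. The names `bernoulli_kl` and `imed_index` and the clipping constant are hypothetical choices for this example.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # assumption: clip probabilities to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def imed_index(n_pulls: int, mean: float, best_mean: float) -> float:
    """IMED index N_a * K_inf + ln(N_a), with K_inf taken as the Bernoulli KL."""
    # An arm whose empirical mean equals the best one has K_inf = 0.
    k_inf = bernoulli_kl(mean, best_mean) if mean < best_mean else 0.0
    return n_pulls * k_inf + math.log(n_pulls)


# The empirically best arm is scored by ln(N_a) alone.
print(imed_index(10, 0.5, 0.5) == math.log(10))  # True
# An arm that looks worse pays an extra N_a * K_inf and gets a larger index.
print(imed_index(10, 0.2, 0.5) > imed_index(10, 0.5, 0.5))  # True
```

For general distributions in ${\mathcal {P}}((-\infty ,1])$ the infimum has no such closed form and has to be computed numerically.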

Pseudocode

for each arm i do:
    n[i] ← 0; nu[i] ← None; mu[i] ← None
for t from 1 to K do:
    select arm t
    observe reward r
    n[t] ← n[t] + 1
    nu[t] ← update empirical distribution
    mu[t] ← update empirical mean
for t from K+1 to T do:
    mu* ← highest mu
    for each arm i do:
        scoreK[i] ← n[i] K_inf(nu[i],mu*)
        scoreN[i] ← ln(n[i])
        index[i] ← scoreK[i] + scoreN[i]
    select arm a with smallest index[a]
    observe reward r
    n[a] ← n[a] + 1
    nu[a] ← update empirical distribution
    mu[a] ← update empirical mean
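
The pseudocode above can be turned into a runnable Python sketch. This version assumes Bernoulli rewards, so that the empirical mean summarizes the empirical distribution and the infimum is the Bernoulli Kullback–Leibler divergence; the function names, the seed and the arm means are hypothetical and serve only as an illustration.

```python
import math
import random


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    eps = 1e-12  # assumption: clip probabilities to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def run_imed(means, horizon, seed=0):
    """Simulate IMED on Bernoulli arms and return the number of pulls of each arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # N_a(t): number of pulls of each arm
    sums = [0.0] * k   # sum of observed rewards of each arm
    for t in range(horizon):
        if t < k:
            arm = t  # initial round: pull every arm once
        else:
            mu = [sums[i] / counts[i] for i in range(k)]
            mu_star = max(mu)  # highest empirical mean
            index = [
                counts[i] * bernoulli_kl(mu[i], mu_star) + math.log(counts[i])
                for i in range(k)
            ]
            arm = min(range(k), key=index.__getitem__)  # smallest index wins
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


print(run_imed([0.9, 0.5, 0.2], horizon=2000))
```

On an instance like this one, most pulls are expected to go to the arm with mean 0.9, while the other arms are pulled only a logarithmic number of times.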

Theoretical results

For the multi-armed bandit problem, the Lai–Robbins lower bound^[1] is an asymptotic lower bound on the regret. IMED is the first algorithm shown to match the first-order term of this lower bound for distributions supported in $(-\infty ,1]$ . If the distributions are also bounded, it matches the second-order term as well, and it is the first algorithm shown to do so.^[2]

Lai–Robbins lower bound

In 1985, Lai and Robbins proved an asymptotic, problem-dependent lower bound on regret.^[1] In 2018, Aurélien Garivier, Pierre Ménard and Gilles Stoltz proved a refined lower bound that gives the second-order term.^[6]

It states that for every consistent algorithm on the set ${\mathcal {P}}((-\infty ,1])$ , that is, an algorithm for which, for every $(\nu _{1},\dots ,\nu _{K})\in {\mathcal {P}}((-\infty ,1])^{K}$ , the regret $R_{T}$ is subpolynomial (i.e. $R_{T}=o_{T\to +\infty }(T^{\alpha })$ for all $\alpha >0$ ), we have:

R_{T}\geq \left(\sum _{a:\mu _{a}<\mu ^{*}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{*})}}\right)\ln T-\Omega _{T\to +\infty }(\ln \ln T).

This bound is asymptotic (as $T\to +\infty$ ). It gives a first-order term of order $\ln T$ with the optimal constant in front of it, and a second-order term in $-\Omega (\ln \ln T)$ .
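
To make the first-order constant concrete, the following Python sketch evaluates the factor in front of $\ln T$ for a hypothetical instance. It assumes Bernoulli arms with means strictly between 0 and 1, for which ${\mathcal {K}}_{\inf }(\nu _{a},\mu ^{*})$ is the Bernoulli Kullback–Leibler divergence; the function names are illustrative.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means: list[float]) -> float:
    """Sum over suboptimal arms of Delta_a / K_inf(nu_a, mu^*)."""
    best = max(means)
    return sum(
        (best - mu) / bernoulli_kl(mu, best)
        for mu in means
        if mu < best  # optimal arms do not appear in the sum
    )


# Hypothetical instance: the first-order lower bound is roughly c * ln(T).
c = lai_robbins_constant([0.9, 0.5, 0.2])
print(c, c * math.log(10_000))
```

Arms whose mean is close to $\mu ^{*}$ have a small divergence in the denominator, so they dominate the constant even though their gap is small.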

Regret bound for IMED

If the distribution of every arm $a$ is supported in $(-\infty ,1]$ (i.e. $\nu _{a}\in {\mathcal {P}}((-\infty ,1])$ ), then the regret of IMED satisfies

R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{*}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{*})}}\right)\ln T+O(1)

^[2]

If all the distributions $\nu _{a}$ are bounded, then there exists a constant $C>0$ such that, for $T$ large enough, the regret of IMED is upper bounded by

R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{*}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{*})}}\right)\ln T-C\ln \ln T

^[2]

Computation time

The algorithm only requires computing ${\mathcal {K}}_{\inf }$ for suboptimal arms, which are pulled $O(\ln T)$ times. This makes it considerably faster than KL-UCB. A faster version of IMED was developed in 2023, using a first-order Taylor expansion of ${\mathcal {K}}_{\inf }$ .^[7]

References

[1] Lai, T.L.; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics. 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
[2] Honda, Junya; Takemura, Akimichi (2015). "Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards". Journal of Machine Learning Research. 16 (113): 3721–3756.
[3] Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge: Cambridge University Press.
[4] Bouneffouf, Djallel; Rish, Irina (2019). "A survey on practical applications of multi-armed and contextual bandits". arXiv:1904.10040 [cs.LG].
[5] Gautron, Romain; Baudry, Dorian; Adam, Myriam; Falconnier, Gatien N; Hoogenboom, Gerrit; King, Brian; Corbeels, Marc (2024). "A new adaptive identification strategy of best crop management with farmers". Field Crops Research. 307. Elsevier: 109249.
[6] Garivier, Aurélien; Ménard, Pierre; Stoltz, Gilles (2019). "Explore first, exploit next: The true shape of regret in bandit problems". Mathematics of Operations Research. 44 (2). INFORMS: 377–399.
[7] Baudry, Dorian; Pesquerel, Fabien; Degenne, Rémy; Maillard, Odalric-Ambrym (2023). "Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits". Advances in Neural Information Processing Systems. 36: 11469–11514.
