A multi-armed bandit problem occurs when the benefits of each option are not known or are incompletely known at the time of allocation, but can be discovered as time passes and resources are reallocated. The name comes from a particular visualization of this problem.
Imagine a gambler playing several different slot machines (sometimes called “one-armed bandits”), each of which has a different expected payout; some arms are better than others, but the gambler doesn’t know which ones. The gambler wants to maximize his total reward, and each round he may pull any one of the available arms. Iterated over many rounds, this predicament gives the gambler two choices each turn: keep pulling whichever arm has had the greatest return so far, or pull some other arm at random, knowing that it may pay out better than his current best arm, or worse. In machine learning, the tradeoff between these options is called the exploration/exploitation tradeoff.
This may seem like a highly specific, non-generalizable problem, but its applications range from clinical trials to financial portfolio design to adaptive routing to feature experimentation. The exploration/exploitation tradeoff arises for any agent that cannot plan and execute simultaneously.
In general, multi-armed bandit algorithms (also called multi-arm bandits or MABs) attempt to solve these kinds of problems by converging on the allocation that yields the greatest total return and the lowest total regret, where regret is the reward given up by not always pulling the best arm.
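For concreteness, total regret after T rounds is commonly written as the gap between always pulling the best arm and what was actually collected; this is the standard textbook formulation rather than anything specific to one bandit algorithm:

$$\text{Regret}(T) = T\,\mu^{*} - \sum_{t=1}^{T} \mu_{a_t}$$

where \(\mu^{*}\) is the expected reward of the best arm and \(\mu_{a_t}\) is the expected reward of the arm pulled in round \(t\).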
Types of Multi-Armed Bandits
There are different approximate solutions to the multi-armed bandit problem. The simplest is the “epsilon-greedy” algorithm: given a small value epsilon (ε) between 0 and 1, it spends a fraction ε of its pulls exploring (choosing an arm at random) and the remaining fraction (1-ε) exploiting (choosing the arm with the best observed payout so far). The algorithm is called “greedy” because of all the exploiting.
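A minimal sketch of that loop in Python is below; the Bernoulli payout rates and the choice of ε = 0.1 are illustrative rather than taken from any particular system:

```python
import random

def epsilon_greedy(pull_arm, n_arms, n_rounds, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the arm with the
    best observed average reward. Unpulled arms are treated as best so that
    each arm gets tried at least once."""
    counts = [0] * n_arms      # pulls per arm
    totals = [0.0] * n_arms    # cumulative reward per arm

    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)       # explore: pick a random arm
        else:
            averages = [t / c if c > 0 else float("inf")
                        for t, c in zip(totals, counts)]
            arm = averages.index(max(averages))  # exploit: best average so far
        reward = pull_arm(arm)
        counts[arm] += 1
        totals[arm] += reward

    return counts, totals

# Illustrative use: three Bernoulli "slot machines" with hidden payout rates.
true_rates = [0.20, 0.50, 0.35]
counts, totals = epsilon_greedy(
    pull_arm=lambda a: 1.0 if random.random() < true_rates[a] else 0.0,
    n_arms=3, n_rounds=10_000)
print(counts)  # most pulls should concentrate on the 0.50 arm
```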
There are many variations on the basic epsilon-greedy algorithm: strategies for finite experiments, such as epsilon-first (pure exploration followed by pure exploitation) and epsilon-decreasing (ε shrinks over the course of the experiment), as well as strategies suited to infinite or continuous experiments, such as value-difference-based epsilon (ε is reduced automatically as the learning process converges) and contextual-epsilon-greedy (ε is computed based on the situation). There are also probability-matching strategies (also called Thompson sampling or Bayesian bandits), which pull each arm in proportion to the probability that it is the optimal one.
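To make probability matching concrete, here is a minimal Thompson sampling sketch for arms with 0/1 (Bernoulli) rewards, keeping a Beta posterior per arm; the function and variable names are mine, not from any particular library:

```python
import random

def thompson_sampling(pull_arm, n_arms, n_rounds):
    """Keep a Beta(successes + 1, failures + 1) posterior per arm, draw one
    sample from each posterior, and pull the arm with the highest draw; over
    time, pulls match each arm's probability of being the best."""
    successes = [0] * n_arms
    failures = [0] * n_arms

    for _ in range(n_rounds):
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if pull_arm(arm):               # reward is assumed to be 0 or 1
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures

# Reuses the same illustrative Bernoulli arms as the epsilon-greedy sketch.
true_rates = [0.20, 0.50, 0.35]
wins, losses = thompson_sampling(
    pull_arm=lambda a: random.random() < true_rates[a],
    n_arms=3, n_rounds=10_000)
print([w + l for w, l in zip(wins, losses)])  # pulls per arm
```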
You may notice the similarity to A/B/n testing: both aim to find the best alternative among many so that it can be exploited.
Benefits and Drawbacks
Multi-armed bandit algorithms are best suited to two use cases: very short experiments, where the time it would take to gather statistically significant data in an A/B test is prohibitive (like finding the best headline for a hot new article), and very long or ongoing experiments, where waiting for a “final answer” from an A/B test doesn’t make sense (like optimizing each user’s news feed).
The main drawback of bandit algorithms is that they are difficult to implement. If an organization is falling at all short in its DevOps practices, trying to implement a bandit will expose those gaps. Further, because data scientists who are also excellent programmers are rare, bandit projects frequently require more people and are therefore more expensive.