Do a simple search on Google: ‘how do bandit algorithms work?’

Do the results look confusing? Some links (here1, here2) say they are better than A/B tests. Then there are other links that say otherwise (here3, here4).

In fact, when one hears about bandit problems, there are a couple of questions to think about:

**Questions:**

*1. Is it an ‘Experiment Strategy’?*

- MAB gets compared with A/B tests. So is it an ‘experiment strategy’ like A/B testing?

*2. Is it an ‘ML Algorithm’?*

- Bandit algorithms select the optimal ‘action’. So is it fundamentally an ML algorithm?
- If yes, what is the relation between these ‘bandit problems’ and supervised ML algorithms like Logistic Regression and Decision Trees?

*3. Where do algorithms like epsilon-greedy, UCB, etc. fit in?*

**Thoughts:**

- The correct way of looking at bandit problems is to think of them as optimization problems for online interactive systems.
- The goal of a bandit algorithm is to select the policy that maximizes rewards. The space of policies is extremely large (or infinite).

- In the literature, bandit problems have been treated in different settings:
    - Multi-Armed Bandit (MAB) setting
    - Contextual Bandit setting
- Multi-Armed Bandit setting
    - In the MAB setting, there are a few known approaches for selecting the best policy:
        - Naive
        - Epsilon-Greedy
        - Upper Confidence Bounds (UCB)
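To make the MAB approaches concrete, here is a minimal sketch (not from the post) of Epsilon-Greedy arm selection and the UCB1 rule, with incremental mean-reward updates. The function and variable names are my own:

```python
import math
import random

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon explore a random arm, else exploit the best mean."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(counts, values, t):
    """Pick the arm with the highest upper confidence bound at round t."""
    for a, n in enumerate(counts):
        if n == 0:  # play every arm once before applying the bound
            return a
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward of the pulled arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```

Both strategies keep only per-arm counts and running means; they differ in how they decide when to stop exploring (a fixed exploration rate vs. a confidence bound that shrinks with more pulls).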
- Contextual Bandit setting
- In one of my previous posts I noted the ML reduction stack in VW for the contextual bandits problem. In a separate post, I have also noted some thoughts on the use of the IPS score for counterfactual evaluation.
- In the Full Information setting, the task of selecting the best policy is mapped to a **cost-sensitive classification** problem where:
    - context <-> example
    - action <-> label/class
    - policy <-> classifier
    - reward <-> gain / (negative) cost
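A toy sketch of this mapping (my own illustration, not from the post): each full-information example carries a context and the cost of every action, a "classifier" is fit to the costs, and the resulting policy picks the lowest-cost action. Here the supervised learner is just a per-(context, action) mean-cost estimate, standing in for a real regressor or tree:

```python
from collections import defaultdict

def train_cost_sensitive(examples):
    """Full-information reduction: examples are (context, {action: cost}) pairs.
    Fits per-(context, action) mean costs -- a stand-in for any supervised
    learner (Decision Trees, Logistic Regression, ...)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for context, costs in examples:
        for action, cost in costs.items():
            sums[(context, action)] += cost
            counts[(context, action)] += 1
    estimates = {k: sums[k] / counts[k] for k in sums}

    def policy(context, actions):
        # the learned "classifier": choose the action with lowest estimated cost
        return min(actions, key=lambda a: estimates.get((context, a), float("inf")))

    return policy
```

The point is only the correspondence: contexts play the role of examples, actions of labels, and the argmin-cost rule is the classifier/policy.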

- Thereby, we can use known supervised techniques like Decision Trees, Logistic Regression, etc. to solve the cost-sensitive classification problem.
- This was an interesting insight for me, and it helped me answer question #2 above.

- In the Partial Information (a.k.a. Bandit) setting, there are two more issues we would like to handle:
    - Filling in missing data.
    - Overcoming bias.

- The Partial Information (a.k.a. Bandit) setting can further be looked at in two different ways:
    - Online.
        - In the online setting the problem has been solved in different ways:
            - Epsilon-Greedy / Epoch-Greedy [Langford & Zhang].
            - “Monster” Algorithm [Dudik, Hsu, Kale, Langford].
        - They mostly vary in how they optimize regret and/or computational efficiency.
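As a rough illustration of the epsilon-greedy idea in the online contextual setting (a sketch under my own naming, not the Epoch-Greedy or Monster algorithms themselves): with probability epsilon the system explores uniformly, otherwise it exploits the current policy, and it records the propensity (probability of the chosen action), which the offline evaluation below needs:

```python
import random

def contextual_epsilon_greedy(policy, actions, context, epsilon=0.1):
    """One online decision: explore uniformly with probability epsilon,
    else exploit the current learned policy. Returns the chosen action
    and its propensity (probability of being chosen), which should be
    logged for later counterfactual evaluation."""
    k = len(actions)
    greedy = policy(context)
    if random.random() < epsilon:
        chosen = random.choice(actions)
    else:
        chosen = greedy
    # chance of arriving at `chosen` via exploration, plus exploitation mass
    propensity = (epsilon / k) + (1 - epsilon if chosen == greedy else 0.0)
    return chosen, propensity
```

Logging the propensity alongside the action and reward is what makes the partially-observed data usable later.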

    - Offline.
        - This is where counterfactual evaluation and learning come in.
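The workhorse of offline counterfactual evaluation is the IPS estimator mentioned earlier. A minimal sketch (my own, assuming logs of the form (context, action, reward, propensity) collected by some logging policy):

```python
def ips_estimate(logs, target_policy):
    """Inverse Propensity Scoring: estimate the average reward a target
    policy would have collected, using logs gathered by a different
    (logging) policy. `propensity` is the logging policy's probability
    of having chosen the logged action."""
    total = 0.0
    for context, action, reward, propensity in logs:
        # count the reward only where the target policy agrees with the
        # logged action, re-weighted by how likely that action was to be logged
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logs)
```

The importance weights correct for the selection bias of the logging policy, which is exactly the "overcoming bias" issue noted above; they also blow up the variance when propensities are small.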

- Bandit algorithms are not just an alternate ‘experiment strategy’ that is ‘better’ or ‘worse’ than A/B tests.
- The objectives behind doing an A/B test are different from the objectives of using a bandit system (which is to do continuous optimization).

- Typical issues to consider for bandit problems:
    - Explore-Exploit
        - Exploit what has been learned.
        - Explore to learn which behaviour might give the best results.

    - Context
        - In the contextual setting (‘contextual bandit’) there are many more choices available; we are unlikely to see the same context twice.

    - Selection bias
        - Exploitation introduces a bias that must be accounted for.

- Efficiency.


**References:**

- https://support.google.com/analytics/answer/2844870?hl=en&ref_topic=1745207
*This post gives a nice overview of how Google’s Analytics Content Experiment platform uses the multi-armed bandit approach for managing online experiments*

- https://www.youtube.com/watch?v=gzxRDw3lXv8
*Rob Schapire explains the fundamentals of the bandits problem. Very Cool!*

- http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
*Tutorial by Thorsten Joachims on the bandits problem*

- http://engineering.richrelevance.com/bandits-recommendation-systems/
*Nice read!*

- http://stevehanov.ca/blog/index.php?id=132
- https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html
- https://vwo.com/blog/multi-armed-bandit-algorithm/
- https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html