28 | September | 2016 | abgoswam's tech blog

Do a simple search on Google – ‘how do bandit algorithms work’ ?

Do the results look confusing ? Some links (here1, here2) say they are better than A/B. Then there are other links which say otherwise (here3, here4).

In fact, when one hears about Bandit Problems, there are couple of questions to think about:

Questions:

1.Is it an ‘Experiment Strategy’ ?

MAB gets compared with A/B tests. So is it an ‘experiment strategy’ like A/B testing ?

2. Is it an ‘ML Algorithm’ ?

Bandit algorithms select the most optimal ‘action’. So is it fundamentally an ML Algorithm ?
If yes, whats the relation between these ‘bandit problems’ v/s supervised ML algos like Logistic Regression and Decision Trees.

3. Where do the algorithms like epsilon-greedy, UCB etc fit into ?

Thoughts:

The correct way of looking at bandit problems is to think of it as an optimization problem for online interactive systems.
- The goal of bandit algorithms is to select the best policy that will maximize rewards. The space of policies is extremely large (or infinite)
In literature, people have treated bandit problems in different settings:
- Multi Armed Bandit setting
- Contextual Bandit
Multi Armed Bandit setting.
- In the MAB setting, there are a few known approaches for selecting the best policy.
  - Naive
  - Epsilon-Greedy
  - Upper Confidence Bounds.
Contextual Bandit.
- In one of my previous posts I noted the ML reduction stack in VW for the contextual bandits problem. In a separate post, I have also noted some thoughts on the use of the IPS score for conterfactual evaluation.
- In the Full Information Setting, the task of selecting the best policy is mapped to a cost-sensitive classification problem where:
  - context <-> example
  - action <-> label/class
  - policy <-> classifier
  - reward <-> gain / (negative) cost
- Thereby, we can use known supervised techniques like Decision Trees, Logistic Regression etc. to solve the cost-sensitive classification problem.
  - This was an interesting insight for me, and helped me answer the question #2 above
- In the Partial Information aka. Bandit setting, there would be two more issues we would like to handle
  - Filling in missing data.
  - Overcoming Bias.
The Partial Information aka. Bandit setting can further be looked into in 2 different ways:
- Online.
  - In the online setting the problem has been solved in different ways
  - Epsilon-Greedy / Epoch Greedy [Langford & Zhang].
  - “Monster” Algorithm [Dudik, Hsu, Kale, Langford]
  - They mostly vary in how they optimize regret. And/Or computational efficiency.
- Offline.
  - This is where Counterfactual evaluation and Learning comes in..
Bandit algorithms are not just an alternate ‘experiment strategy’ that is ‘better’ or ‘worse’ than A/B tests.
- The objectives behind doing an A/B test are different from the objectives of using a bandit system (which is to do continuous optimization).
Typic issues to consider for bandit problems:
- Explore-Exploit
  - exploit what has been learned
  - explore to learn which behaviour might give best results.
- Context
  - In the contextual setting (‘contextual bandit’) there are many more choice available. unlikely to see the same context twice.
- Selection bias
  - the exploit introduces bias that must be accounted for
- Efficiency.

References:

https://support.google.com/analytics/answer/2844870?hl=en&ref_topic=1745207
- This post gives a nice overview of how Google’s Analytics Content Experiment platform uses the multi-armed bandit approach for managing online experiments
https://www.youtube.com/watch?v=gzxRDw3lXv8
- Rob Schapire explains the fundamentals of the bandits problem. Very Cool!
http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
- Tutorial by Thorsten Joachims on the bandits problem
http://engineering.richrelevance.com/bandits-recommendation-systems/
- Nice read!
http://stevehanov.ca/blog/index.php?id=132
https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html
https://vwo.com/blog/multi-armed-bandit-algorithm/https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html

Here are some of my notes based on a toe-dipping into R

Tips:

use .libPaths() to get the path where the libraries are located:
- > .libPaths() –> this is in RGui
  [1] “C:/Users/agoswami/Documents/R/win-library/3.3”
  [2] “C:/Program Files/R/R-3.3.0/library”

One can look for the Rprofile file like so:
- $ find . -name ‘*Rprofile’ -type f 2>/dev/null
  ./Microsoft/MRO/R-3.2.5/library/base/R/Rprofile –> this is MRO (Microsoft R Open)
  ./Microsoft/MRO-for-RRE/8.0/R-3.2.2/library/base/R/Rprofile –> this is MRS (Microsoft R Server)
  ./Microsoft SQL Server/130/R_SERVER/library/base/R/Rprofile –> this is MRS (Microsoft R Server) through SQL install
  ./R/R-3.3.0/library/base/R/Rprofile –> this is a regular R install

To look at packages available, use the following:
- https://www.r-bloggers.com/list-of-user-installed-r-packages-and-their-versions/
- ```
ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)
```
- this shows the difference in the packages between MRO, MRS and Base R

References:

abgoswam's tech blog

Data Science, Machine Learning, CS Theory, Systems & Web

Day: September 28, 2016

Bandit Problems: an ‘Experiment Strategy’ or a ‘ML Algorithm’ ?

The initialization sequence in R