# math

**Contents:**

- Indian National Math Olympiad
- Video Lectures on Discrete Mathematics and Algorithms
- Alias method
- Bandits for Online Recommendations
- Balanced Ternary
- Bandit Problems: an ‘Experiment Strategy’ or an ‘ML Algorithm’?
- Euclid’s algorithm and GCD
- Confidence Intervals and Significance Levels
- Counterfactual Evaluation and Learning

# Indian National Math Olympiad

Seems like a lifetime ago, but I suddenly wanted to pull up some reference books I had used when preparing for INMO-2000 (class 10), INMO-2002 (class 11) and INMO-2003 (class 12).

**Resources:**

# Video Lectures on Discrete Mathematics and Algorithms

**Stony Brook** [my Alma Mater 🙂 ]

- http://www3.cs.stonybrook.edu/~algorith/math-video/
- http://www3.cs.stonybrook.edu/~algorith/video-lectures/

**IMSc** (The Institute of Mathematical Sciences):

- https://www.youtube.com/watch?v=Lv3o4OOHG_w
- https://www.youtube.com/watch?v=fH0V-sQrBqE
- Check out the channel! Some very cool videos

**MIT**

- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/index.htm
- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-042j-mathematics-for-computer-science-fall-2010/index.htm
- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/index.htm
- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-design-and-analysis-of-algorithms-spring-2015/index.htm

# Alias method

“You are given an n-sided die where side i has probability p_{i} of being rolled. What is the most efficient data structure for simulating rolls of the die?”

A very similar question was posed to me recently:

- http://stackoverflow.com/questions/4511331/randomly-selecting-an-element-from-a-weighted-list
- http://stackoverflow.com/questions/3120035/indexing-count-of-buckets/3120179#3120179

The approaches above are very cool, and illustrate the use of augmented search trees.
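The cumulative-sum approach from those answers can be sketched in a few lines of Python (my own illustration, not code from the posts): precompute prefix sums of the weights, then binary-search a uniform draw. Setup is O(n) and each sample is O(log n).

```python
import bisect
import itertools
import random

def make_weighted_sampler(weights):
    """Return a function that samples index i with probability
    weights[i] / sum(weights), using binary search over prefix sums."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]

    def sample():
        # Pick a uniform point in [0, total) and find which
        # cumulative bucket it falls into.
        x = random.uniform(0, total)
        return bisect.bisect_right(cumulative, x)

    return sample

sample = make_weighted_sampler([1, 1, 8])  # index 2 should dominate
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample()] += 1
# counts[2] will be roughly 8000
```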

However, it seems there is a better method for this problem, and it has been out there for a while now. This was a fascinating read:

Additional pointers for the alias method:
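For concreteness, here is a Python sketch of the alias method (Vose’s variant) as I understand it from the write-ups above; the function names are mine. Table construction is O(n) and each roll is O(1).

```python
import random

def build_alias_table(probs):
    """Vose's alias method. Each slot i stores a threshold prob[i]
    and an alias; a draw picks a slot uniformly, then uses the
    threshold to choose between the slot itself and its alias."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Donate the slack of slot s from slot l's mass.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:   # numerical leftovers get prob 1
        prob[leftover] = 1.0
    return prob, alias

def roll(prob, alias):
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.5, 0.3, 0.2])
counts = [0, 0, 0]
for _ in range(10_000):
    counts[roll(prob, alias)] += 1
# counts ≈ [5000, 3000, 2000]
```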

# Bandits for Online Recommendations

I came across this interesting set of blog posts by Sergei Feldman on the use of bandit approaches in online recommendation.

In particular, the one I really enjoyed reading was the comparison of the approaches used to solve the multi-armed bandit problem. I need to play around with his code someday.

**References:**

# Balanced Ternary

I was solving a math problem which had to do with representing every natural number as a summation/subtraction of distinct powers of 3.

Interestingly this led me to this branch of mathematics called ‘Balanced Ternary’. Check it out!

Exploring this problem gave me interesting insights into the base representation of a number, something that I have been keeping on the back burner for a long while now. I finally got a chance to follow up on this.
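A small Python sketch of the connection (my own code): converting n to balanced ternary digit by digit, where a remainder of 2 becomes a digit of -1 plus a carry, yields exactly the powers of 3 to add or subtract.

```python
def to_balanced_ternary(n):
    """Return the balanced-ternary digits of n (least significant
    first), each digit in {-1, 0, 1}. Digit d at position k
    contributes d * 3**k, so +1/-1 digits give the powers of 3
    that are added or subtracted."""
    digits = []
    while n != 0:
        r = n % 3
        if r == 2:          # represent 2 as 3 - 1: emit -1, carry 1 up
            digits.append(-1)
            n = n // 3 + 1
        else:
            digits.append(r)
            n //= 3
    return digits

# 8 = 9 - 1, i.e. digits [-1, 0, 1]
print(to_balanced_ternary(8))   # [-1, 0, 1]
```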

**References:**

- https://en.wikipedia.org/wiki/Balanced_ternary
- http://math.stackexchange.com/questions/119606/represent-every-natural-number-as-a-summation-subtraction-of-distinct-power-of
- http://userpages.wittenberg.edu/bshelburne/BalancedTernaryTalkSu09.pdf

**Problem:**

**Code:**

# Bandit Problems: an ‘Experiment Strategy’ or an ‘ML Algorithm’?

Do a simple search on Google: ‘how do bandit algorithms work?’

Do the results look confusing? Some links (here1, here2) say they are better than A/B tests. Then there are other links which say otherwise (here3, here4).

In fact, when one hears about bandit problems, there are a couple of questions to think about:

**Questions:**

*1. Is it an ‘Experiment Strategy’?*

- MAB gets compared with A/B tests. So is it an ‘experiment strategy’ like A/B testing?

*2. Is it an ‘ML Algorithm’?*

- Bandit algorithms select the most optimal ‘action’. So is it fundamentally an ML algorithm?
- If yes, what’s the relation between these ‘bandit problems’ and supervised ML algorithms like Logistic Regression and Decision Trees?

*3. Where do algorithms like epsilon-greedy, UCB, etc. fit in?*

**Thoughts:**

- The correct way of looking at bandit problems is to think of them as an optimization problem for online interactive systems.
- The goal of a bandit algorithm is to select the policy that maximizes rewards. The space of policies is extremely large (or infinite).
- In the literature, bandit problems have been treated in different settings:
  - Multi-armed bandit (MAB) setting.
    - In the MAB setting, there are a few known approaches for selecting the best policy:
      - Naive
      - Epsilon-Greedy
      - Upper Confidence Bounds (UCB)
  - Contextual bandit setting.
    - In one of my previous posts I noted the ML reduction stack in VW for the contextual bandit problem. In a separate post, I have also noted some thoughts on the use of the IPS score for counterfactual evaluation.
    - In the Full Information setting, the task of selecting the best policy is mapped to a **cost-sensitive classification** problem where:
      - context <-> example
      - action <-> label/class
      - policy <-> classifier
      - reward <-> gain / (negative) cost
    - Thereby, we can use known supervised techniques like Decision Trees, Logistic Regression, etc. to solve the cost-sensitive classification problem.
    - This was an interesting insight for me, and helped me answer question #2 above.
    - In the Partial Information (aka Bandit) setting, there are two more issues we would like to handle:
      - Filling in missing data.
      - Overcoming bias.
    - The Partial Information (aka Bandit) setting can further be looked at in two different ways:
      - Online.
        - In the online setting the problem has been solved in different ways:
          - Epsilon-Greedy / Epoch-Greedy [Langford & Zhang]
          - “Monster” algorithm [Dudik, Hsu, Kale, Langford]
        - These mostly vary in how they optimize regret and/or computational efficiency.
      - Offline.
        - This is where counterfactual evaluation and learning comes in.
- Bandit algorithms are not just an alternate ‘experiment strategy’ that is ‘better’ or ‘worse’ than A/B tests.
  - The objectives behind doing an A/B test are different from the objectives of using a bandit system (which is to do continuous optimization).
- Typical issues to consider for bandit problems:
  - Explore-Exploit
    - Exploit what has been learned.
    - Explore to learn which behaviour might give the best results.
  - Context
    - In the contextual setting (‘contextual bandit’) there are many more choices available; it is unlikely to see the same context twice.
  - Selection bias
    - Exploitation introduces bias that must be accounted for.
  - Efficiency.
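To make the explore-exploit trade-off above concrete, here is a minimal epsilon-greedy simulation on Bernoulli arms (the arm means and parameters are made up for illustration):

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=10_000, seed=0):
    """Epsilon-greedy on a Bernoulli multi-armed bandit: with
    probability epsilon explore a random arm, otherwise exploit
    the arm with the best empirical mean so far."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # pulls per arm
    values = [0.0] * n        # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, total_reward

counts, total = epsilon_greedy([0.2, 0.5, 0.8])
# the 0.8 arm should receive the bulk of the pulls
```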

**References:**

- https://support.google.com/analytics/answer/2844870?hl=en&ref_topic=1745207
*This post gives a nice overview of how Google’s Analytics Content Experiment platform uses the multi-armed bandit approach for managing online experiments*

- https://www.youtube.com/watch?v=gzxRDw3lXv8
*Rob Schapire explains the fundamentals of the bandits problem. Very Cool!*

- http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
*Tutorial by Thorsten Joachims on the bandits problem*

- http://engineering.richrelevance.com/bandits-recommendation-systems/
*Nice read!*

- http://stevehanov.ca/blog/index.php?id=132
- https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html
- https://vwo.com/blog/multi-armed-bandit-algorithm/
- https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html

# Euclid’s algorithm and GCD

Euclid’s algorithm for computing the GCD has a lot of interesting extensions for exploration.
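A quick Python sketch of the algorithm and of its extended variant, which also produces the Bézout coefficients x and y with a*x + b*y = gcd(a, b):

```python
def gcd(a, b):
    """Euclid's algorithm: gcd(a, b) = gcd(b, a mod b)."""
    while b:
        a, b = b, a % b
    return a

def extended_gcd(a, b):
    """Extended Euclidean algorithm: returns (g, x, y) such that
    a*x + b*y == g == gcd(a, b) (Bezout's identity)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

print(gcd(252, 105))            # 21
g, x, y = extended_gcd(252, 105)
# x and y satisfy 252*x + 105*y == 21
```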

**References:**

- http://www.cut-the-knot.org/blue/Euclid.shtml#
- https://en.wikipedia.org/wiki/Euclidean_algorithm
*Factor tree, Bézout’s identity, Extended Euclidean Algorithm*

**Code:**

# Confidence Intervals and Significance Levels

In a previous post, I wrote about the expected value and variance of different distributions.

Taking the same statistical concepts further, we now want to compute confidence intervals for our estimate.

**Note:**

- While thinking about confidence intervals, it is a good exercise to identify what distribution is representative of your estimate.
- This is needed because the confidence interval depends on the standard deviation, so you need to know how you are computing your standard deviation.
- An alternative is to compute the variance from first principles.

(https://en.wikipedia.org/wiki/Standard_deviation)

- Here are a couple of very interesting posts that explain the relationship between confidence intervals, significance levels and p-values in a very simple way.
- http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests:-significance-levels-alpha-and-p-values-in-statistics
- https://www.quora.com/Whats-the-difference-between-significance-level-and-confidence-level
- http://sphweb.bumc.bu.edu/otlt/MPH-Modules/EP/EP713_RandomError/EP713_RandomError6.html

- CTR: Here are some interesting discussions around computing confidence intervals for a metric like CTR
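For illustration, here is one standard option, the Wilson score interval for a binomial proportion like CTR (this is my own sketch, not code from those discussions). It behaves better than the normal approximation when the proportion is small:

```python
import math

def wilson_interval(clicks, impressions, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z=1.96 gives roughly a 95% interval)."""
    if impressions == 0:
        return 0.0, 1.0
    p = clicks / impressions
    denom = 1 + z * z / impressions
    centre = (p + z * z / (2 * impressions)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / impressions + z * z / (4 * impressions ** 2)
    )
    return centre - margin, centre + margin

low, high = wilson_interval(clicks=30, impressions=1000)
print(f"CTR 3.0%, 95% CI: [{low:.4f}, {high:.4f}]")
```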

**References:**

- http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
- http://onlinestatbook.com/2/estimation/mean.html
- http://www.measuringu.com/blog/ci-10things.php
- https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

**Videos:**

- https://www.youtube.com/watch?v=tFWsuO9f74o
- The two videos here are helpful in understanding the basic concept.

**Code:**

# Counterfactual Evaluation and Learning

I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.

Some thoughts:

- Systems which do a lot of offline evaluation might benefit greatly from logging the probability score of the chosen action, because that is what enables computing IPS scores later.
- Counterfactual evaluation deals with the **offline** scenario. There are two primary parts to it, both of which I think go hand in hand:
  - Evaluation of a given policy.
    - IPS seems to be a very attractive measure for counterfactual evaluation, because it produces an unbiased estimate of the utility function.
  - Learning the best policy.
- For both evaluation and learning, the *standard* approach would be to model the reward, i.e. build a reward predictor.
  - I have to admit the reward predictor approach is much more intuitive for me.
- The approach proposed by Joachims et al. is how to do *better*:
  - For **evaluation**, they propose a “model the bias” approach using IPS as the evaluation metric.
  - For **learning**, they use the AMO (“arg-max oracle”) approach, i.e. reduce the problem of finding the best policy to a weighted multi-class classification problem. In a previous post I mentioned this reduction, which is implemented in the VW library.
- For **online settings**, the contextual bandit problem can be solved using Epsilon-Greedy / Epoch-Greedy.
  - Schapire’s video explains this and proposes a new algorithm to solve it with better regret bounds and fewer calls.
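A toy Python sketch of the IPS idea (the logging setup and all names are mine): because the logged propensities are recorded, rewards can be reweighted to estimate a new policy’s value offline.

```python
import random

def ips_estimate(logs, new_policy):
    """Inverse Propensity Scoring: estimate the value of new_policy
    from logs gathered by a different (stochastic) logging policy.
    Each log entry is (context, action, reward, propensity), where
    propensity is the probability the logging policy gave the action."""
    total = 0.0
    for context, action, reward, propensity in logs:
        if new_policy(context) == action:    # importance weight is
            total += reward / propensity     # 1/propensity on agreement,
                                             # 0 otherwise
    return total / len(logs)

# Toy world: two contexts, two actions, reward 1 iff action == context.
rng = random.Random(0)
logs = []
for _ in range(5000):
    context = rng.randrange(2)
    action = rng.randrange(2)                # uniform logging policy
    reward = 1.0 if action == context else 0.0
    logs.append((context, action, reward, 0.5))

# The policy "pick action = context" has true value 1.0;
# the IPS estimate should be close to it despite the biased logs.
estimate = ips_estimate(logs, new_policy=lambda c: c)
```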

**References:**

- http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
*Counterfactual Evaluation. Mostly deals with evaluation and learning of policies in **offline** scenarios.*

- https://www.youtube.com/watch?v=gzxRDw3lXv8
*Discusses an approach to solve the contextual bandit problem in the online setting. Provides a good overview of the contextual bandits problem.*

- http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/counterfactual_reasoning.pdf

**Code:**