Alias method

“You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?”

A very similar question was posted to me recently  :

The approaches above are very cool, and illustrate the use of augmented search trees.

However, it seems there is a better method for this problem – and it has been out there for a while now. This was a fascinating read :

Additional pointers for the alias method:



Balanced Ternary

I was solving this math problem which had to do with representing every Natural number as a summation/subtraction of distinct power of 3

Interestingly this led me to this branch of mathematics called ‘Balanced Ternary’. Check it out!

Exploration of this problem gave me interesting insights about base representation of a number, something that I have been keeping in the backburner for a long while now. Finally got a chance to follow up on this.





Bandit Problems: an ‘Experiment Strategy’ or a ‘ML Algorithm’ ?

Do a simple search on Google –  ‘how do bandit algorithms work’ ?

Do the results look confusing ?  Some links (here1, here2) say they are better than A/B.  Then there are other links which say otherwise (here3, here4).

In fact, when one hears about Bandit Problems, there are couple of questions to think about:


1.Is it an ‘Experiment Strategy’ ?

  • MAB gets compared with A/B tests. So is it an ‘experiment strategy’ like A/B testing ?

2. Is it an ‘ML Algorithm’ ?

  • Bandit algorithms select the most optimal ‘action’. So is it fundamentally an ML Algorithm ?
  • If yes, whats the relation between these ‘bandit problems’ v/s supervised ML algos like Logistic Regression and Decision Trees.

3. Where do the algorithms like epsilon-greedy, UCB etc fit into ?



  • The correct way of looking at bandit problems is to think of it as an optimization problem for online interactive systems.
    • The goal of bandit algorithms is to select the best policy that will maximize rewards. The space of policies is extremely large (or infinite)
  • In literature, people have treated bandit problems in different settings:
    • Multi Armed Bandit setting
    • Contextual Bandit
  • Multi Armed Bandit setting.
    • In the MAB setting,  there are a few known approaches for selecting the best policy.
      • Naive
      • Epsilon-Greedy
      • Upper Confidence Bounds.
  • Contextual Bandit.
    • In one of my previous posts I  noted the ML reduction stack in VW for the contextual bandits problem. In a separate post, I have also noted some thoughts on the use of the IPS score for conterfactual evaluation.
    • In the Full Information Setting, the task of selecting the best policy is mapped to a cost-sensitive classification problem where:
      • context <-> example
      • action <-> label/class
      • policy <-> classifier
      • reward <-> gain / (negative) cost
    • Thereby, we can use known supervised techniques like Decision Trees, Logistic Regression etc. to solve the cost-sensitive classification problem.
      • This was an interesting insight for me, and helped me answer the question #2 above
    • In the Partial Information aka. Bandit setting, there would be two more issues we would like to handle
      • Filling in missing data.
      • Overcoming Bias.
  • The Partial Information aka. Bandit setting can further be looked into in 2 different ways:
    • Online.
      • In the online setting the problem has been solved in different ways
      • Epsilon-Greedy / Epoch Greedy [Langford & Zhang].
      • “Monster” Algorithm [Dudik, Hsu, Kale, Langford]
      • They mostly vary in how they optimize regret. And/Or computational efficiency.
    • Offline.
      • This is where Counterfactual evaluation and Learning comes in..
  • Bandit algorithms are not just an alternate ‘experiment strategy’ that is  ‘better’ or ‘worse’ than A/B tests.
    • The objectives behind doing an A/B test are different from the objectives of using a bandit system (which is to do continuous optimization).
  • Typic issues to consider for bandit problems:
    • Explore-Exploit
      • exploit what has been learned
      • explore to learn which behaviour might give best results.
    • Context
      • In the contextual setting (‘contextual bandit’) there are many more choice available. unlikely to see the same context twice.
    • Selection bias
      • the exploit introduces bias that must be accounted for
    • Efficiency.



Euclid’s algorithm and GCD

Euclid’s algorithm for computing the GCD has a lot of interesting extensions for exploration.






Confidence Intervals and Significance Levels

In a previous post , I mentioned about  expected value and variance of different distributions.

Taking the same statistical concepts further, we now want to compute confidence intervals for our estimate.


  • While thinking about Confidence Intervals, it is a good exercise to identify what distribution is representative of your estimate.
    • The reason this is needed is because the confidence interval  is dependent on standard deviation.  As such, it would be necessary to know how you are computing your standard deviation.
    • An alternative would be if we compute the variance from base principles.










Counterfactual Evaluation and Learning

I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.

Some thoughts:

  • Systems which deal with a lot of offline evaluation might benefit a lot if they log they probability score when choosing the best action, because it would enable them to compute the IPS scores.
  • Counterfactual evaluation deals with the offline scenario.  There are 2 primary parts to it, both of which I think go hand in hand.
    • Evaluation of a given policy
      • IPS seems to be a very attractive measure for counterfactual evaluation, because it produces an unbiased estimate of the utility function.
    • Learning the best policy
    • For both evaluation and learning, the standard approach would be to model the reward. aka a reward predictor.
      • I have to admit the reward predictor approach is much more intuitive for me.
    • The approach proposed by Joachim’s et al. is how to do better.
      • For evaluation, they propose “modeling a bias” approach using IPS as the evaluation metric.
      • For learning, they use the AMO (“arg-max oracle”) approach i.e. reduce the problem of finding the best policy to a weighted multi-class classification problem. In a previous post I had mentioned about this reduction which is implemented in the VW library.



  • For online settings, the contextual bandit problem can be solved using Epsilon Greedy / Epoch Greedy.
    • Schapire’s video explains this and proposes a new algorithm to solve it  with better regret bounds, and fewer calls