Feature Scaling in SGD

SGD is the perfect algorithm for use in online learning. Except it has one major drawback – is sensitive to feature scaling.

In some of my trials with the SGD learner in scikit-learn, I have seen terrible performance if I don’t do feature scaling.

Which begs the question – How does VW do feature scaling ? After all VW does online learning.


It seems VW uses a kind of SGD that is scale variant:




Bandit Problems: an ‘Experiment Strategy’ or a ‘ML Algorithm’ ?

Do a simple search on Google –  ‘how do bandit algorithms work’ ?

Do the results look confusing ?  Some links (here1, here2) say they are better than A/B.  Then there are other links which say otherwise (here3, here4).

In fact, when one hears about Bandit Problems, there are couple of questions to think about:


1.Is it an ‘Experiment Strategy’ ?

  • MAB gets compared with A/B tests. So is it an ‘experiment strategy’ like A/B testing ?

2. Is it an ‘ML Algorithm’ ?

  • Bandit algorithms select the most optimal ‘action’. So is it fundamentally an ML Algorithm ?
  • If yes, whats the relation between these ‘bandit problems’ v/s supervised ML algos like Logistic Regression and Decision Trees.

3. Where do the algorithms like epsilon-greedy, UCB etc fit into ?



  • The correct way of looking at bandit problems is to think of it as an optimization problem for online interactive systems.
    • The goal of bandit algorithms is to select the best policy that will maximize rewards. The space of policies is extremely large (or infinite)
  • In literature, people have treated bandit problems in different settings:
    • Multi Armed Bandit setting
    • Contextual Bandit
  • Multi Armed Bandit setting.
    • In the MAB setting,  there are a few known approaches for selecting the best policy.
      • Naive
      • Epsilon-Greedy
      • Upper Confidence Bounds.
  • Contextual Bandit.
    • In one of my previous posts I  noted the ML reduction stack in VW for the contextual bandits problem. In a separate post, I have also noted some thoughts on the use of the IPS score for conterfactual evaluation.
    • In the Full Information Setting, the task of selecting the best policy is mapped to a cost-sensitive classification problem where:
      • context <-> example
      • action <-> label/class
      • policy <-> classifier
      • reward <-> gain / (negative) cost
    • Thereby, we can use known supervised techniques like Decision Trees, Logistic Regression etc. to solve the cost-sensitive classification problem.
      • This was an interesting insight for me, and helped me answer the question #2 above
    • In the Partial Information aka. Bandit setting, there would be two more issues we would like to handle
      • Filling in missing data.
      • Overcoming Bias.
  • The Partial Information aka. Bandit setting can further be looked into in 2 different ways:
    • Online.
      • In the online setting the problem has been solved in different ways
      • Epsilon-Greedy / Epoch Greedy [Langford & Zhang].
      • “Monster” Algorithm [Dudik, Hsu, Kale, Langford]
      • They mostly vary in how they optimize regret. And/Or computational efficiency.
    • Offline.
      • This is where Counterfactual evaluation and Learning comes in..
  • Bandit algorithms are not just an alternate ‘experiment strategy’ that is  ‘better’ or ‘worse’ than A/B tests.
    • The objectives behind doing an A/B test are different from the objectives of using a bandit system (which is to do continuous optimization).
  • Typic issues to consider for bandit problems:
    • Explore-Exploit
      • exploit what has been learned
      • explore to learn which behaviour might give best results.
    • Context
      • In the contextual setting (‘contextual bandit’) there are many more choice available. unlikely to see the same context twice.
    • Selection bias
      • the exploit introduces bias that must be accounted for
    • Efficiency.



Using Vowpal Wabbit : Tips and Tricks

As I play around more with the machine learning toolkit Vowpal Wabbit, I realized there are several subtle flags/functionality in the toolkit.

Here is an effort to aggregate my leanings in using this toolkit.


  • -q [ –quadratic ] arg
    • Create and use quadratic features
    • -q is a very powerful option. It takes as an argument a pair of two letters. Its effect is to create interactions between the features of two namespaces. Suppose each example has a namespaceuser and a namespace document, then specifying -q ud will create an interaction feature for every pair of features (x,y) where x is a feature from the user namespace and y is a feature from the document namespace. If a letter matches more than one namespace then all the matching namespaces are used. In our example if there is another namespace url then interactions between url and document will also be modeled. The letter : is a wildcard to interact with all namespaces. -q a: (or -q :a) will create an interaction feature for every pair of features (x,y)where x is a feature from the namespaces starting with a and y is a feature from the all namespaces. -q :: would interact any combination of pairs of features.
  • –print
    • Use this to understand how VW constructs   the highlighted number of features.
    • When using contextual bandit mode, you will notice it gets added automatically per action
  • Feature ‘116060’
    • This is a constant feature with value 1, that essentially captures the intercept term in a linear model.
    • You may come across this feature if you look closely into the VW output.
  • Output Feature Ranking and Feature Weights Using VW
    • Is it possible to output the feature rankings after every update ?
      • try –audit
    • Is it possible to output the feature rankings at the end of training ?
      • use a combination of –-readable_model  foo and –l1 1e-3.  Any features surviving a high level of l1 regularization must be important according to the gradient.
    • Is it possible to output the feature weights after every update ?
      • possible, but this will be expensive.  In between each example you can put in a special examples with a ‘tag’ that says save_<filename>..
    • Is it possible to output the feature weights at the end of training ?
      •  that’s what –f does.  If you want it in a readable format then use –readable_model.
  • The learning rate schedule is influenced by 4 parameters, outlined here.
    • Look into the code below regarding how I am doing parameter sweeps over these 4 parameters.
    • capture
  • More to come…



  1. https://github.com/datasciencedojo/meetup/tree/master/getting_started_with_vowpal_wabbit
  2. http://mlwave.com/tutorial-titanic-machine-learning-from-distaster/
  3. http://zinkov.com/posts/2013-08-13-vowpal-tutorial/
  4. http://stackoverflow.com/questions/24822288/correctness-of-logistic-regression-in-vowpal-wabbit



ML Reductions for Contextual Bandit

In a previous post, I had mentioned about ML Reductions in general.

For the specific case of Contextual bandit class of problems, there are 2 seminal papers. This goes hand in hand with the counterfactual evaluation and learning problem which i have discussed in other posts (here, here)





Counterfactual Evaluation and Learning

I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.

Some thoughts:

  • Systems which deal with a lot of offline evaluation might benefit a lot if they log they probability score when choosing the best action, because it would enable them to compute the IPS scores.
  • Counterfactual evaluation deals with the offline scenario.  There are 2 primary parts to it, both of which I think go hand in hand.
    • Evaluation of a given policy
      • IPS seems to be a very attractive measure for counterfactual evaluation, because it produces an unbiased estimate of the utility function.
    • Learning the best policy
    • For both evaluation and learning, the standard approach would be to model the reward. aka a reward predictor.
      • I have to admit the reward predictor approach is much more intuitive for me.
    • The approach proposed by Joachim’s et al. is how to do better.
      • For evaluation, they propose “modeling a bias” approach using IPS as the evaluation metric.
      • For learning, they use the AMO (“arg-max oracle”) approach i.e. reduce the problem of finding the best policy to a weighted multi-class classification problem. In a previous post I had mentioned about this reduction which is implemented in the VW library.



  • For online settings, the contextual bandit problem can be solved using Epsilon Greedy / Epoch Greedy.
    • Schapire’s video explains this and proposes a new algorithm to solve it  with better regret bounds, and fewer calls




Comparing ML algos : Multi Armed bandit, Contextual Bandit, Logistic Regression, Online Learning

We have a system running Multi-Armed Bandit.

So when it came to select the next generation of ML algo to try out, we had a few choices:

  1. Multi-Armed Bandit  (we had this running)
    • This entails ranking the items based on their respective conversion rates till that point of time.
  2. Contextual Bandit
    • We use Vowpal Wabbit for this.
    • Internally Vowpal Wabbit treats contextual bandit in 2 distinct ways:
      • Without Action Dependent Features (non-ADF)
      • With Action Dependent Features (ADF)
    • Interestingly there is a difference between non-ADF and ADF modes.
      • In non-ADF mode, the VW creates multiple models (i.e. creates a model for each class).
      • In ADF mode, VW creates a single model.
  3. Logistic Regression.
    • This entails reducing the problem to a binary classification problem.
    • Then using the model to score the items. Finally ranking the items based on the model score.
  4. Online ML
    • Again treating this as a binary classification model, except this time we are updating the model in an online fashion.


Interestingly, on the dataset I was using I didn’t see much of a difference in algorithmic performance across the 4 different algorithms above.






Machine Learning Reductions.

I was recently reading this paper which discusses Machine Learning Reductions.

It has a very interesting Abstract which got me hooked. The references contains two papers which explain ML reductions in general.


“Machine learning involves optimizing a loss function on unlabeled data points given examples of labeled data points, where the loss function measures the performance of a learning algorithm. We give an overview of techniques, called reductions, for converting a problem of minimizing one loss function into a problem of minimizing another, simpler loss function. This tutorial discusses how to create robust reductions that perform well in practice. The reductions discussed here can be used to solve any supervised learning problem with a standard binary classification or regression algorithm available in any machine learning toolkit. We also discuss common design flaws in folklore reductions.”