Counterfactual Evaluation and Learning

I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.

Some thoughts:

  • Systems that rely heavily on offline evaluation could benefit a lot from logging the probability with which each action was chosen (the propensity), because that makes it possible to compute IPS (inverse propensity scoring) estimates later.
  • Counterfactual evaluation deals with the offline scenario. There are two primary parts to it, which I think go hand in hand:
    • Evaluation of a given policy
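      • IPS seems to be a very attractive estimator for counterfactual evaluation, because it produces an unbiased estimate of the utility function, provided the logging policy gave non-zero probability to every action the evaluated policy can take.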
    • Learning the best policy
    • For both evaluation and learning, the standard approach would be to model the reward, i.e. to train a reward predictor.
      • I have to admit the reward predictor approach is much more intuitive for me.
    • Joachims et al. propose a way to do better:
      • For evaluation, they propose a “modeling the bias” approach, using IPS as the estimator.
      • For learning, they use the AMO (“arg-max oracle”) approach, i.e. they reduce the problem of finding the best policy to a weighted multi-class classification problem. In a previous post I mentioned this reduction, which is implemented in the VW (Vowpal Wabbit) library.
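To make the IPS idea concrete, here is a minimal sketch in Python. It is my own illustration, not code from the talk or from VW: the names `LoggedEvent`, `ips_estimate` and `to_weighted_multiclass` are hypothetical, and it assumes every logged propensity is positive and rewards are non-negative. Each logged reward is reweighted by (probability under the new policy) / (logged propensity), and the same r / p weights turn the logs into weighted classification examples for the learning step.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Tuple


@dataclass
class LoggedEvent:
    context: Hashable   # what the logging policy saw
    action: int         # action the logging policy chose
    reward: float       # reward observed for that action only
    propensity: float   # probability with which the logger chose that action


def ips_estimate(logs: List[LoggedEvent],
                 target_policy: Callable[[Hashable, int], float]) -> float:
    """Estimate the average reward target_policy would have earned.

    target_policy(context, action) returns the probability that the new
    policy picks `action` in `context`.
    """
    if not logs:
        raise ValueError("need at least one logged event")
    total = 0.0
    for e in logs:
        if e.propensity <= 0.0:
            raise ValueError("logged propensity must be positive")
        # Reweight the observed reward by how much more (or less) likely
        # the new policy is to take the logged action.
        total += e.reward * target_policy(e.context, e.action) / e.propensity
    return total / len(logs)


def to_weighted_multiclass(logs: List[LoggedEvent]) -> List[Tuple[Hashable, int, float]]:
    """Build (context, label, weight) examples with weight = reward / propensity.

    Assumes non-negative rewards, so a classifier that maximizes weighted
    accuracy also maximizes the IPS estimate of a deterministic policy.
    """
    return [(e.context, e.action, e.reward / e.propensity) for e in logs]


# Toy logs (made-up numbers, for illustration only).
logs = [
    LoggedEvent("u1", 0, 1.0, 0.5),
    LoggedEvent("u1", 1, 0.0, 0.5),
    LoggedEvent("u2", 1, 1.0, 0.25),
    LoggedEvent("u2", 0, 0.0, 0.75),
]

# A deterministic policy that always plays action 1.
always_1 = lambda context, action: 1.0 if action == 1 else 0.0

value = ips_estimate(logs, always_1)
examples = to_weighted_multiclass(logs)
```

Only the third event contributes to `value` here: it is the only one where the new policy agrees with the logged action and the reward is non-zero. This also shows why IPS can have high variance when propensities are small.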



  • For online settings, the contextual bandit problem can be solved using Epsilon Greedy / Epoch Greedy.
    • Schapire’s video explains this and proposes a new algorithm to solve it with better regret bounds and fewer calls.
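As a rough sketch of how Epsilon Greedy ties back to the logging point at the top of this post: the function below picks an action and also returns the exact probability with which that action was chosen, so it can be logged for IPS later. The function name `epsilon_greedy_choice` and the score list are hypothetical, and this is plain Epsilon Greedy, not Epoch Greedy or the algorithm from Schapire's talk.

```python
import random
from typing import Sequence, Tuple


def epsilon_greedy_choice(scores: Sequence[float],
                          epsilon: float,
                          rng: random.Random) -> Tuple[int, float]:
    """Return (action, propensity) for one epsilon-greedy decision.

    scores[a] is the current estimated reward of action a for this context.
    """
    k = len(scores)
    best = max(range(k), key=lambda a: scores[a])  # ties go to the lowest index
    if rng.random() < epsilon:
        action = rng.randrange(k)   # explore: uniform over all k actions
    else:
        action = best               # exploit: current best action
    # Every action gets epsilon / k from exploration; the best action
    # additionally gets the (1 - epsilon) exploitation mass.
    propensity = epsilon / k + ((1.0 - epsilon) if action == best else 0.0)
    return action, propensity


rng = random.Random(0)
action, propensity = epsilon_greedy_choice([0.1, 0.7, 0.2], epsilon=0.3, rng=rng)

# What a system would write to its log for later counterfactual evaluation.
log_record = {"action": action, "propensity": propensity}
```

With three actions and epsilon = 0.3, the best action is logged with propensity 0.8 and each of the others with 0.1, so no action ever has zero probability.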



