I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.

Some thoughts:

- Systems which do a lot of offline evaluation might benefit greatly from logging the probability (propensity) of the chosen action, because that makes it possible to compute IPS scores later.
- Counterfactual evaluation deals with the **offline** scenario. There are 2 primary parts to it, both of which I think go hand in hand:
    - Evaluation of a given policy
        - IPS seems to be a very attractive measure for counterfactual evaluation, because it produces an unbiased estimate of the utility function.
    - Learning the best policy
- For both evaluation and learning, the *standard* approach would be to model the reward, a.k.a. a reward predictor.
    - I have to admit the reward predictor approach is much more intuitive for me.

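To make the reward-predictor idea concrete, here is a minimal sketch. The per-(context, action) mean used below is just a stand-in for any regression model, and all names and data are made up for illustration:

```python
# Hypothetical sketch of the "reward predictor" (direct method) approach:
# fit r_hat(x, a) on logged data, then score a new policy by predicting
# the reward of the action *that policy* would have chosen.

from collections import defaultdict

def fit_reward_predictor(logs):
    """logs: list of (context, action, reward). Returns r_hat(x, a).
    The 'model' here is a per-(context, action) mean reward."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, a, r in logs:
        total[(x, a)] += r
        count[(x, a)] += 1

    def r_hat(x, a):
        return total[(x, a)] / count[(x, a)] if count[(x, a)] else 0.0

    return r_hat

def direct_method_value(policy, contexts, r_hat):
    """Estimated value of `policy`: average predicted reward of its actions."""
    return sum(r_hat(x, policy(x)) for x in contexts) / len(contexts)

# Toy usage: two contexts, two actions.
logs = [("u1", "a", 1.0), ("u1", "b", 0.0), ("u2", "a", 0.0), ("u2", "b", 1.0)]
r_hat = fit_reward_predictor(logs)
best = lambda x: "a" if x == "u1" else "b"
print(direct_method_value(best, ["u1", "u2"], r_hat))  # 1.0
```

The intuitiveness comes at a cost: any bias in the reward model propagates directly into the policy-value estimate.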
- The approach proposed by Joachims et al. is how to do *better*:
    - For **evaluation**, they propose to "model the bias" rather than the reward, using IPS as the evaluation metric.
    - For **learning**, they use the AMO ("arg-max oracle") approach, i.e. reduce the problem of finding the best policy to a weighted multi-class classification problem. In a previous post I mentioned this reduction, which is implemented in the VW library.

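The reduction to weighted classification is also easy to sketch. The snippet below shows only the data transformation at its core (each logged record becomes a classification example labeled with the logged action and weighted by reward over propensity); it is not actual VW usage, and the names are made up:

```python
# Sketch of reducing policy learning to weighted multi-class
# classification (the arg-max-oracle route): label = logged action,
# importance weight = r_i / p_i. Any weighted classifier trained on
# these examples then acts as the learned policy.

def to_weighted_classification(logs):
    """logs: (context, action, reward, propensity) tuples ->
    (features, label, importance_weight) classification examples."""
    return [(x, a, r / p) for x, a, r, p in logs]

logs = [("u1", "a", 1.0, 0.5), ("u2", "b", 1.0, 0.25)]
print(to_weighted_classification(logs))
# [('u1', 'a', 2.0), ('u2', 'b', 4.0)]
```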
- For **online settings**, the contextual bandit problem can be solved using Epsilon-Greedy / Epoch-Greedy.
    - Schapire's video explains this and proposes a new algorithm that solves it with better regret bounds and fewer calls to the arg-max oracle.
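For completeness, here is a minimal epsilon-greedy sketch for the online setting: with probability epsilon explore uniformly, otherwise exploit the action with the best running mean reward for the context. The class and the toy environment are entirely illustrative:

```python
# Minimal epsilon-greedy sketch for the online contextual bandit setting.
import random
from collections import defaultdict

class EpsilonGreedy:
    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.total = defaultdict(float)   # (context, action) -> reward sum
        self.count = defaultdict(int)     # (context, action) -> pulls

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)  # explore uniformly
        # exploit: highest empirical mean (untried actions count as 0)
        return max(self.actions,
                   key=lambda a: self.total[(context, a)] /
                                 max(self.count[(context, a)], 1))

    def update(self, context, action, reward):
        self.total[(context, action)] += reward
        self.count[(context, action)] += 1

# Toy environment where action "a" always pays off for context "u1".
bandit = EpsilonGreedy(["a", "b"], epsilon=0.1)
for _ in range(200):
    act = bandit.choose("u1")
    bandit.update("u1", act, 1.0 if act == "a" else 0.0)
print(bandit.count[("u1", "a")], bandit.count[("u1", "b")])  # "a" dominates
```

A nice side effect: an epsilon-greedy logger records a known propensity for every action (1 - epsilon + epsilon/K for the greedy arm, epsilon/K otherwise), which is exactly what IPS needs later.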

**References:**

- http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
  *Counterfactual Evaluation. Mostly deals with evaluation and learning of policies in **offline** scenarios.*

- https://www.youtube.com/watch?v=gzxRDw3lXv8
  *Discusses an approach to solve the contextual bandit problem in the online setting. Provides a good overview of the contextual bandits problem.*

- http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/counterfactual_reasoning.pdf
