I came across this very interesting talk by some folks at Cornell on Counterfactual Evaluation.
Some thoughts:
- Systems that rely heavily on offline evaluation could benefit a lot from logging the probability with which the chosen action was selected (its propensity), because that makes it possible to compute IPS (Inverse Propensity Score) estimates later.
- Counterfactual evaluation deals with the offline scenario. There are 2 primary parts to it, both of which I think go hand in hand.
- Evaluation of a given policy
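- IPS seems to be a very attractive estimator for counterfactual evaluation, because it produces an unbiased estimate of a policy's utility, provided the logging policy gives nonzero probability to every action the evaluated policy can take.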
- Learning the best policy
- For both evaluation and learning, the standard approach would be to model the reward, i.e. to train a reward predictor.
- I have to admit the reward predictor approach is much more intuitive for me.
- The approach proposed by Joachims et al. aims to do better than that.
- For evaluation, they propose a “modeling the bias” approach, using IPS as the evaluation metric.
- For learning, they use the AMO (“arg-max oracle”) approach, i.e. they reduce the problem of finding the best policy to a weighted multi-class classification problem. I mentioned this reduction, which is implemented in the VW library, in a previous post.
- Evaluation of a given policy
- For online settings, the contextual bandit problem can be solved using Epsilon Greedy / Epoch Greedy.
- Schapire’s video explains this and proposes a new algorithm to solve it with better regret bounds, and fewer calls
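To make the IPS idea concrete, here is a minimal sketch in Python. It assumes each log entry is a `(context, action, reward, propensity)` tuple and that the policy being evaluated is a function returning the probability of taking an action in a context. The names `ips_estimate` and `always_action_one`, and the toy log, are made up for illustration and do not come from the talk.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity score estimate of a policy's average reward.

    logs: iterable of (context, action, reward, propensity), where
          propensity is the probability with which the logging policy
          chose `action` (it must be > 0).
    target_policy: function (context, action) -> probability that the
          policy being evaluated takes `action` in `context`.
    """
    logs = list(logs)
    total = 0.0
    for context, action, reward, propensity in logs:
        # Re-weight the observed reward by how much more (or less) likely
        # the evaluated policy is to take the logged action.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)


def always_action_one(context, action):
    """Hypothetical deterministic policy that always picks action 1."""
    return 1.0 if action == 1 else 0.0


# Toy log collected by a uniform-random logging policy over two actions,
# so every propensity is 0.5.
toy_logs = [
    ("u1", 0, 1.0, 0.5),
    ("u1", 1, 0.0, 0.5),
    ("u2", 1, 1.0, 0.5),
    ("u2", 0, 0.0, 0.5),
]

estimate = ips_estimate(toy_logs, always_action_one)
print(estimate)  # 0.5
```

Only the log entries where the evaluated policy agrees with the logged action contribute, and dividing by the propensity corrects for how often the logging policy happened to choose that action. This is why the propensity has to be recorded at logging time.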
References:
- http://www.cs.cornell.edu/~adith/CfactSIGIR2016/
- Counterfactual Evaluation. Mostly deals with evaluation and learning of policies in offline scenarios.
- https://www.youtube.com/watch?v=gzxRDw3lXv8
- Discusses an approach to solving the contextual bandit problem in the online setting. Provides a good overview of the contextual bandit problem.
- http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/counterfactual_reasoning.pdf
Code:
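A minimal sketch, not taken from either talk, of the logging idea from the first thought: an epsilon-greedy policy that also returns the propensity of the action it picked, so that it can be stored with each log entry. The function name `epsilon_greedy`, the `scores` input (one predicted score per action) and the uniform exploration step are assumptions made for illustration.

```python
import random


def epsilon_greedy(scores, epsilon, rng=random):
    """Choose an action epsilon-greedily and report its propensity.

    scores: list with one predicted score per action.
    epsilon: probability of exploring uniformly at random.
    rng: source of randomness exposing random() and randrange().
    Returns (action, propensity).
    """
    n_actions = len(scores)
    best = max(range(n_actions), key=lambda a: scores[a])
    if rng.random() < epsilon:
        action = rng.randrange(n_actions)  # explore: uniform over actions
    else:
        action = best  # exploit: highest-scoring action
    # Every action gets epsilon / n from exploration; the best action
    # additionally gets the (1 - epsilon) exploitation mass.
    propensity = epsilon / n_actions
    if action == best:
        propensity += 1.0 - epsilon
    return action, propensity


rng = random.Random(0)
action, propensity = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.1, rng=rng)
# Store (context, action, reward, propensity) once the reward is observed.
```

Note that the best action can also be drawn during exploration, which is why its propensity is `epsilon / n_actions + (1 - epsilon)` rather than just `1 - epsilon`.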