A very interesting problem in ML is: what to try next? Andrew Ng has some very useful insights on this topic (see the References section below).
- Nowadays most ML platforms, e.g. AzureML, provide the ability to run parameter sweeps.
- Most of them also perform cross validation while sweeping.
- This simplifies model selection: the platform automatically selects the parameters that give the best accuracy/AUC on the cross-validation dataset.
- This is usually the first thing to do for pretty much any ML problem (a cross-validated sweep is sketched in the Code section below).
- However, an interesting question still remains, especially from a practical standpoint:
- Should I focus more on feature engineering, i.e., adding more features, or on getting more data?
- For these cases I generally use learning curves.
- There are some nuances, so let me explain what I usually do.
- Plot the training error v/s the cross-validation error.
- This usually indicates whether the model is suffering from a high-bias (underfit) problem or a high-variance (overfit) problem.
- High bias (underfit):
- High training error and high generalization (CV) error.
- High variance (overfit):
- Low training error but high generalization (CV) error.
- High variance (overfitting): plot how the error/accuracy varies with increasing amounts of training data.
- A good idea here is to use a log-base-2 scale on the x-axis.
- Doubling the data at each step gives a good sense of how quickly the error/accuracy will decrease/increase with more data (see the learning-curve sketch in the Code section below).
- Based on the intuition above, the following steps can be taken:
| What to try next? | Underfit (high bias) | Overfit (high variance) |
| --- | --- | --- |
| Get more training examples | No | Yes |
| Try a smaller set of features | No | Yes, but first see if you can get more training examples. |
| Add more features | Yes | Maybe. If we get a feature that gives a strong signal, then yes, add it. But also invest in more data collection in parallel. |
Code:
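A minimal sketch of a cross-validated parameter sweep, using scikit-learn's `GridSearchCV` to stand in for what platforms like AzureML do automatically. The dataset, model, and grid values are illustrative assumptions, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in dataset (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Sweep the regularization strength C; each candidate is scored
# with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # select the parameters with the best CV AUC
    cv=5,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 4))
```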
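And a minimal sketch of the learning-curve diagnostic described above: training error v/s cross-validation error as the training set doubles, plotted on a log-base-2 x-axis. Again, the dataset and model are stand-in assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in dataset (an assumption for illustration).
X, y = make_classification(n_samples=4096, n_features=20, random_state=0)

# Training-set sizes doubling from 64 to 2048 examples.
sizes = np.array([64, 128, 256, 512, 1024, 2048])
train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=sizes, cv=5, scoring="accuracy",
)

# Convert accuracy to error and average over the CV folds.
train_err = 1.0 - train_scores.mean(axis=1)
cv_err = 1.0 - cv_scores.mean(axis=1)

plt.plot(train_sizes, train_err, marker="o", label="training error")
plt.plot(train_sizes, cv_err, marker="o", label="CV error")
plt.xscale("log", base=2)  # log-base-2 x-axis, as suggested above
plt.xlabel("training examples (log2 scale)")
plt.ylabel("error")
plt.legend()
plt.show()

# Reading the plot:
#   both errors high and close together -> high bias (underfit)
#   low training error with a large gap -> high variance (overfit);
#   if the CV curve is still falling, more data should help.
```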
References: