Need to understand:
- https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
- https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
- https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3
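
Judging by their titles, these links seem to be about gradient accumulation (training with a larger effective batch than fits in memory) and about why gradients have to be cleared manually between updates. Below is a minimal sketch of that idea in plain Python. It is not PyTorch and is not taken from the linked pages. The toy data, the one-parameter model, and names such as `mse_grad` and `accumulation_steps` are assumptions made for illustration only.

```python
import math

# Toy data for a one-parameter model y_hat = w * x (hypothetical example values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Reference: gradient computed on the full batch of 4 examples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: 2 micro-batches of 2 examples each.
accumulation_steps = 2
micro_batch_size = 2
grad = 0.0  # stands in for a gradient buffer that keeps adding until cleared
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient so the sum is an average over the full batch.
    grad += mse_grad(w, xs[start:end], ys[start:end]) / accumulation_steps

accumulated_grad = grad

# One parameter update using the accumulated gradient, then clear the buffer.
learning_rate = 0.01
w_updated = w - learning_rate * accumulated_grad
grad = 0.0  # the manual "zero the gradients" step

print(math.isclose(accumulated_grad, full_batch_grad))
```

The point of the sketch: because the buffer adds up rather than overwrites, several small backward passes can stand in for one large one, and forgetting to reset the buffer would leak old gradients into the next update. The equivalence holds exactly here only because the micro-batches are equal in size and the loss is a plain mean.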