In this section, you will learn how to distribute computation in TensorFlow. Knowing how to do this is important for two main reasons:
- Run more experiments in parallel (for example, a hyperparameter grid search)
- Distribute model training over multiple GPUs (possibly spread across multiple servers) to reduce training time
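For the first case, the experiments themselves are independent, so the work reduces to enumerating hyperparameter combinations and launching each one as its own job. A minimal sketch of building such a grid with the standard library (the parameter names and values here are hypothetical, not from the text):

```python
from itertools import product

# Hypothetical hyperparameter grid; each combination could be launched
# as an independent training job on its own worker or GPU.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [64, 128],
}

# Cartesian product of all values -> one dict per experiment.
experiments = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(experiments))  # 3 learning rates x 2 batch sizes = 6 runs
```

Because the runs do not communicate with each other, this kind of search scales almost linearly with the number of available machines.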
One famous use case is Facebook's paper *Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour*, in which a ResNet-50 was trained on ImageNet in one hour (instead of weeks) by distributing the training over 256 GPUs across 32 servers, with a batch size of 8,192 images.
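For the second case, TensorFlow provides the `tf.distribute` API for data-parallel training. The sketch below (a minimal illustration on toy data, not the setup from the paper) uses `tf.distribute.MirroredStrategy`, which replicates the model on every visible GPU on one machine and splits each batch across the replicas; for the multi-server case, `tf.distribute.MultiWorkerMirroredStrategy` is the analogous API. When no GPU is present, the strategy falls back to a single CPU replica:

```python
import numpy as np
import tensorflow as tf

# One replica per visible GPU; falls back to the CPU when none are found.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables and the model must be created inside the strategy's scope so
# that every replica holds a synchronized copy of the weights.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# Toy data; each global batch of 32 is split evenly across the replicas,
# and the gradients are averaged before the weight update.
x = np.random.rand(256, 10).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=1, verbose=0)
```

Note that with this kind of synchronous data parallelism the *effective* batch size grows with the number of replicas, which is exactly why the paper above had to use a batch size as large as 8,192.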