Implementing Synchronized Multi-GPU Batch Normalization: Doing It Exactly Right
Why synchronize the BN layer?
In most deep learning frameworks (Caffe, Torch, TensorFlow, PyTorch, etc.), the implementation of Batch Normalization only normalizes the data within each single GPU due to data parallelism. We implement synchronized BN for memory-consuming tasks such as semantic segmentation and object detection, where the mini-batch size within a single GPU is too small for BN to work well. Therefore, we discuss the synchronized implementation here.
What is Batch Normalization (BN) and how does it work?
Batch Normalization was introduced in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, which dramatically speeds up the training process of the network (it enables larger learning rates) and makes the network less sensitive to weight initialization. The idea is to perform the normalization within the mini-batch. In training mode:
- Forward Pass: For the input data $x = \{x_1, \dots, x_m\}$ in a mini-batch, the data are normalized to be zero-mean and unit-variance, then scaled and shifted:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i,\quad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2,\quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad y_i = \gamma \hat{x}_i + \beta,$$
where $\gamma$ and $\beta$ are the learnable scale and shift parameters.
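A minimal NumPy sketch of this training-mode forward pass (the function name and shapes are illustrative, not the library API):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN: normalize within the mini-batch, then scale and shift.

    x has shape (m, features); statistics are computed per feature.
    """
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-mean, unit-variance
    y = gamma * x_hat + beta                # learnable scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(8, 4))
y, x_hat, mu, var = bn_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With $\gamma = 1$ and $\beta = 0$, the output is zero-mean and (up to $\epsilon$) unit-variance per feature.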
- Backward Pass:
We need to consider the partial gradients from the output $\frac{\partial \ell}{\partial y_i}$, and the gradients through $\mu$ and $\sigma^2$, because the mean and variance are functions of the input (we use the notation of partial gradients here):
$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\hat{x}_i,\quad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i},\quad \frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial \ell}{\partial \mu}\frac{1}{m} + \frac{\partial \ell}{\partial \sigma^2}\frac{2(x_i - \mu)}{m},$$
where $\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\gamma$.
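The backward pass can be sketched in the same style; this is the standard BN gradient formula with the biased variance (names are illustrative):

```python
import numpy as np

def bn_backward(dy, x_hat, gamma, var, eps=1e-5):
    """Training-mode BN gradients w.r.t. the input, gamma and beta.

    dy:    upstream gradient dl/dy, shape (m, features)
    x_hat: normalized input saved from the forward pass
    """
    m = dy.shape[0]
    dgamma = (dy * x_hat).sum(axis=0)   # dl/dgamma = sum_i dl/dy_i * x_hat_i
    dbeta = dy.sum(axis=0)              # dl/dbeta  = sum_i dl/dy_i
    dx_hat = dy * gamma                 # dl/dx_hat_i = dl/dy_i * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    # chain rule through mu and sigma^2, which are functions of the input
    dx = inv_std / m * (m * dx_hat
                        - dx_hat.sum(axis=0)
                        - x_hat * (dx_hat * x_hat).sum(axis=0))
    return dx, dgamma, dbeta
```

The three terms inside the parentheses correspond to the direct path through $\hat{x}_i$ and the two indirect paths through $\mu$ and $\sigma^2$.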
- Data Parallel in Deep Learning Frameworks:
Standard DataParallel pipeline of public frameworks (MXNet, PyTorch, …) in each training iteration:
- duplicate the network (weights) to all the GPUs,
- split the training batch across the GPUs,
- run forward and backward passes to calculate the gradients,
- update the network parameters (weights), then go to the next iteration.
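To see why this matters for BN, here is a small NumPy stand-in for the multi-GPU case: normalizing each split with its own statistics does not match normalizing the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(loc=2.0, scale=3.0, size=(16, 1))

def bn(x, eps=1e-5):
    """Training-mode BN with gamma=1, beta=0."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# standard DataParallel: each "GPU" normalizes only its own split
per_gpu = np.concatenate([bn(chunk) for chunk in np.split(batch, 4)])
# a synchronized BN would use the statistics of the whole batch
whole_batch = bn(batch)

# the two disagree, because each split has its own mean and variance
diff = np.abs(per_gpu - whole_batch).max()
```

The smaller the per-GPU split, the noisier the local statistics, which is exactly the problem for segmentation and detection workloads.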
Therefore, standard Batch Normalization only normalizes the data within each GPU individually.
Synchronized Batch Normalization implementation
The mean and variance need to be calculated across all the GPUs. Instead of synchronizing twice, once for the global mean and again for the global variance, we apply a very simple strategy: calculate the sum of the elements $\sum x_i$ and the sum of the squared elements $\sum x_i^2$ on each GPU, then apply a single all-reduce operation to sum them across GPUs. The global mean and variance follow as
$$\mu = \frac{\sum x_i}{N},\quad \sigma^2 = \frac{\sum x_i^2}{N} - \mu^2,$$
where $N$ is the total number of elements across all GPUs.
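A sketch of this one-synchronization strategy, with Python lists standing in for devices and a plain sum standing in for the all-reduce (names are illustrative):

```python
import numpy as np

def sync_bn_statistics(chunks):
    """Global mean/variance from one round of synchronization.

    Each "device" contributes sum(x), sum(x^2) and its element count;
    a plain Python sum stands in for the all-reduce across GPUs.
    """
    partials = [(c.sum(axis=0), (c ** 2).sum(axis=0), c.shape[0]) for c in chunks]
    s = sum(p[0] for p in partials)    # all-reduce of sum(x)
    ss = sum(p[1] for p in partials)   # all-reduce of sum(x^2)
    n = sum(p[2] for p in partials)
    mean = s / n
    var = ss / n - mean ** 2           # E[x^2] - (E[x])^2
    return mean, var
```

The result matches computing the statistics over the concatenated batch directly, but needs only one communication round.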
- The output $y_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ can then be calculated locally on each GPU.
- Calculate the gradients of the sums $\sum x_i$ and $\sum x_i^2$. These gradients are summed by an all-reduce operation during the backward pass.
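The backward half can be sketched the same way: each device accumulates its local contribution to the gradients of the two sums, a single all-reduce combines them, and the input gradients are then finished locally. This is a sketch under the $\sigma^2 = \sum x_i^2 / N - \mu^2$ parametrization above; the function name and argument layout are illustrative:

```python
import numpy as np

def sync_bn_backward(dy_chunks, x_chunks, gamma, mean, var, n, eps=1e-5):
    """Backward pass that synchronizes only the gradients of the two sums.

    mean and var are the global statistics (var = sum(x^2)/n - mean^2);
    a plain Python sum stands in for the all-reduce across GPUs.
    """
    inv_std = 1.0 / np.sqrt(var + eps)
    d_s_parts, d_ss_parts = [], []
    for dy, x in zip(dy_chunks, x_chunks):
        # local contributions to dl/dmean and dl/dvar
        d_mean = (dy * (-gamma * inv_std)).sum(axis=0)
        d_var = (dy * gamma * (x - mean) * (-0.5) * inv_std ** 3).sum(axis=0)
        # chain through mean = s/n and var = ss/n - mean^2
        d_s_parts.append(d_mean / n - d_var * 2.0 * mean / n)
        d_ss_parts.append(d_var / n)
    d_s = sum(d_s_parts)     # all-reduce of dl/d(sum x)
    d_ss = sum(d_ss_parts)   # all-reduce of dl/d(sum x^2)
    # with d_s and d_ss shared, each device finishes its input gradient locally
    return [dy * gamma * inv_std + d_s + 2.0 * x * d_ss
            for dy, x in zip(dy_chunks, x_chunks)]
```

Only `d_s` and `d_ss` cross the device boundary, so the backward pass, like the forward pass, needs a single all-reduce.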
We discussed this sync-once implementation in our recent paper Context Encoding for Semantic Segmentation.