In most training circumstances, PyTorch runs in eager execution mode, where a kernel is launched on the GPU, results are returned, and the next operation is launched. But this pattern can slow down training when kernels are small relative to their launch time. Depending on the model, kernel launch time can take the majority of training time. Removing those launch times would provide higher GPU utilization.
We implement Nvidia’s MLPerf CUDA Graphs tool as a generalized function that checks a model for a graphable section prior to training or inference. The tool identifies the graphable segment of the model and uses the image and batch sizes specified in the configuration to trace the necessary kernels. During training, the entire graphable section launches as a single operation. In our testing on 8 A100 GPUs, average GPU utilization increased from ~65% without CUDA Graphs to ~85% with CUDA Graphs.
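As a rough illustration of the idea (not the SMCV tool itself), PyTorch exposes CUDA Graph capture through torch.cuda.make_graphed_callables. The model constructor, the input shape, and the choice of the backbone as the graphed section below are placeholders:

```python
import torch

# Sketch only: capture a static-shape section of the model as a CUDA graph.
# build_model() and the input shape are hypothetical stand-ins; a captured
# graph replays its kernels only for the shapes it was traced with, which is
# why the image and batch sizes must come from the training configuration.
model = build_model().cuda()
static_input = torch.randn(8, 3, 800, 1344, device="cuda")

# Warm up and capture the backbone; afterwards each forward pass through
# model.backbone replays the recorded kernels with a single launch.
model.backbone = torch.cuda.make_graphed_callables(model.backbone, (static_input,))
```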
GPU kernels often require frequent synchronizations across threads, meaning that if one thread finishes sooner, it stops to wait on other threads to catch up. While synchronization is necessary to prevent race conditions, some operations will synchronize more often than necessary to ensure safety. The latest MLPerf submission includes special versions of bounding box operations for anchor generation and ROI align designed to employ only the truly necessary synchronizations. Including these “SyncFree” operations in SMCV provided an additional 5% reduction in step time.
When training is distributed across multiple GPUs, the devices need to communicate their status to each other. In the synchronous data-parallel pattern, this means averaging (all-reducing) gradients across all GPUs so that every device applies the same update. In PyTorch, the all-reduce occurs at the end of the training step, when loss.backward() is called, and it can take a significant amount of time in the training loop. To reduce its impact on step time, we use Nvidia’s Apex and manually call PyTorch’s all-reduce asynchronously, so that we can simultaneously prepare the next batch of data for training, as shown in the code snippet below.
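The published snippet relies on Apex utilities; the sketch below captures the same pattern with plain torch.distributed calls, and the function and variable names are illustrative:

```python
import torch
import torch.distributed as dist

# Sketch of manual, asynchronous gradient all-reduce (assumes the process group
# has already been initialized via torch.distributed.init_process_group).
def training_step(model, optimizer, batch, data_iter, world_size):
    loss = model(batch)
    loss.backward()

    # Launch the gradient all-reduce without blocking the host thread.
    handles = [
        dist.all_reduce(p.grad, async_op=True)
        for p in model.parameters()
        if p.grad is not None
    ]

    # Overlap communication with host-side work: prefetch the next batch.
    next_batch = next(data_iter)

    # Wait for communication to finish, average, then apply the update.
    for handle in handles:
        handle.wait()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world_size)
    optimizer.step()
    optimizer.zero_grad()
    return loss, next_batch
```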
While it does add a few lines of code, the technique is generally applicable across many deep learning models, and shows approximately a 5% improvement in model step time.
Combined, these tools lower our training time to standard MLPerf convergence (0.377/0.339 box and segmentation mAP) by approximately 35% compared to the times we published last year. Both tests were performed on p4d instances with 8 A100 GPUs.
Version       Training Time (mins)   Box mAP   Seg mAP
SMCV          85.98                  0.378     0.341
MLPerf 2.0    58.04                  0.377     0.342
The purpose of MLPerf is to produce the fastest training times on a set of standard machine learning tasks. As a result, the models are built to run in one specific environment, which makes them difficult to apply in practical settings. To make training easier, we show how to build a SageMaker-ready MLPerf Docker image and convert Nvidia’s MLPerf 2.0 model to run in PyTorch Lightning.
To enable Debugger on your training job, just add the Debugger Lightning callback to your trainer.
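A minimal sketch of what such a callback can look like, assuming the smdebug library is available inside the training container (the callback name and trainer arguments are placeholders; the callback shipped with SMCV may differ):

```python
import smdebug.pytorch as smd
from pytorch_lightning import Callback, Trainer

class DebuggerCallback(Callback):
    """Create an smdebug hook from the JSON config that SageMaker writes
    into the container, and register the LightningModule with it."""

    def on_fit_start(self, trainer, pl_module):
        self.hook = smd.Hook.create_from_json_file()
        self.hook.register_module(pl_module)

trainer = Trainer(callbacks=[DebuggerCallback()])
```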
Then, when launching the training job, pass a collection config specifying which tensors to capture and which reductions to apply.
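With the SageMaker Python SDK, this might look like the following; the entry point, instance settings, S3 path, and save intervals are placeholders:

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

# Capture the loss plus mean and L2-norm reductions of weights and gradients.
hook_config = DebuggerHookConfig(
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "50"}),
        CollectionConfig(name="weights", parameters={"save_interval": "500", "reductions": "mean,l2"}),
        CollectionConfig(name="gradients", parameters={"save_interval": "500", "reductions": "mean,l2"}),
    ]
)
tb_config = TensorBoardOutputConfig(s3_output_path="s3://your-bucket/tensorboard")  # placeholder bucket

estimator = PyTorch(
    entry_point="train.py",             # placeholder training script
    role=role,                          # your SageMaker execution role
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config,
)
estimator.fit()
```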
The configuration generates a JSON file that is automatically picked up by the Debugger callback in your training script, telling the Debugger hook what information to collect from your model.
Debugger automatically generates model event files that can be read by TensorBoard. To monitor a training job with TensorBoard, output the write location from the SageMaker estimator.
Then launch TensorBoard with that S3 location as your logdir.
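Assuming the estimator above was given a TensorBoardOutputConfig, one way to find and use the write location is:

```python
# Print the S3 prefix where Debugger wrote the TensorBoard event files.
tb_s3_path = estimator.latest_job_tensorboard_artifacts_path()
print(tb_s3_path)

# Then, from a shell with S3 read access:
#   tensorboard --logdir <tb_s3_path>
```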
In the first part of this blog, we described how we could train Mask RCNN in under an hour by adapting the latest MLPerf model. MLPerf restricts which parameters can be tuned to improve training; for example, entrants must use the same learning rate and schedule. For practical implementations, we want to explore how we can improve the hyper-parameters outside these limitations. But hyper-parameter tuning is often a haphazard process of trial and error until you find something effective. Instead, we can use the Debugger hook and TensorBoard to examine what is happening inside the model during training and make informed decisions about our tuning.
Below is the TensorBoard output for a model trained with the standard MLPerf hyper-parameters. We have set the Debugger hook to collect loss, image predictions, and the mean and L2 norms of all weights and gradients.
Notice that the gradients for the mask head are about an order of magnitude larger than those for the box head. This means that in order to balance box and mask performance, the model might benefit from placing more weight on the box head, at the cost of slightly lower mask performance. We can do this by adjusting the bounding box regression weights. The standard hyper-parameters for Mask RCNN set these to (10, 10, 5, 5), so we might try increasing them to (20, 20, 10, 10).
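As a concrete illustration, in a maskrcnn-benchmark-style configuration (which the MLPerf implementation derives from) the override might look like this; the exact config key depends on the codebase:

```python
# Hypothetical override; defaults in maskrcnn-benchmark-style configs are (10., 10., 5., 5.).
cfg.MODEL.ROI_HEADS.BBOX_REG_WEIGHTS = (20.0, 20.0, 10.0, 10.0)
```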
If we examine the L2 norm of the model weights, we find a consistent discontinuity at 12,000 steps. Under the standard Mask RCNN training schedule, this is when the first learning rate decay occurs. A sudden shift in the direction or magnitude of the weights can be a sign that our training schedule is pushing the model weights toward values that are too large, or in the wrong direction, before the decay reins them in. It’s also a sign that we might be able to converge faster using a smoother learning rate decay schedule. A common method is cosine learning rate decay, which declines smoothly rather than holding constant values followed by sudden drops.
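A minimal sketch of swapping the stepwise schedule for a cosine one using PyTorch’s built-in scheduler; the optimizer settings, model, dataloader, and total_steps are placeholders:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.04, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step, batch in enumerate(dataloader):
    loss = model(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()   # decay the learning rate smoothly each step instead of in sudden drops
```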
Using the increased bounding box regression weights and cosine decay, the model converges in 3,000 fewer steps, cutting another ~15% off our training time, from 58 minutes down to 48 minutes, while maintaining the same mAP convergence of 0.377/0.339.