Title: INA: Accelerating DNN Training via In-Network Aggregation
Advisors: Luis Ceze, Arvind Krishnamurthy
Abstract: Recent growth of DNN development outpaces the growth in network bandwidth and GPU raw power. This causes the experiment time for finding the right neural network model to explode from a few days on a single machine to several weeks even months on a cluster with thousands of machines. DNN Frameworks such as MXNET enabled training DNNs on cluster or datacenter level to bring down training time. However, even with aggressive optimizations built-in, we found deficiency when scaling to more machines and deeper neural networks which stops the system throughput from scaling linearly, due to frequent parameter updates in the physical network, limited physical network and parameter server capacity.
This talk characterizes a detailed DNN training of today, pinpoints the current bottleneck in large scale DNN training, proposes principles and designs of In-Network Aggregation (INA) for accelerating datacenter level DNN workloads by exploiting parallel parameter aggregation right inside the network, in a hierarchical manner. INA removes the bottleneck in parameter synchronization by reducing both required network bandwidth and latency for DNN training with techniques such as in-place parallel stream aggregation and merging. We also project the effectiveness of INA using an INA model, which sees a sizable improvement in training time for RESNET-269 when compared with current state-of-the-art. Our results show the feasibility of running much larger DNNs with INA on existing datacenter infrastructures than they currently can handle.