Title: Efficient Neural Network Serving System
Advisors: Arvind Krishnamurthy and Matthai Philipose
Supervisory Committee: Arvind Krishnamurthy (Co-Chair), Matthai Philipose (Co-Chair, MSR), Raadhakrishnan Poovendran (GSR, EE), Tom Anderson, and Ali Farhadi
Today, Deep Neural Networks (DNNs) can recognize faces, detect objects, and transcribe speech with near-human performance. We expect that DNN-based applications will soon become an important workload. Despite their excellent performance, the computational demands of DNN workloads are high enough that such applications strain device batteries, cloud cost budgets, and latency requirements. Existing techniques such as model optimization and batching, though promising, require compromises in real-world deployments. Model optimization trades accuracy for lower computational demand. Batching, on the other hand, cannot achieve optimal performance because of low-latency requirements and the increased number of model variants that model optimization produces.
In this proposal, we demonstrate that it is possible to build end-to-end systems that execute DNNs efficiently at low cost, low latency, and high accuracy. To achieve this, we leverage existing techniques, including model optimization and batching, and propose three systems that improve efficiency. First, sequential specialization takes advantage of temporal locality in streaming settings and produces low-cost, high-accuracy DNN models at runtime. Second, MCDNN is an approximation-based execution framework spanning mobile devices and the cloud that achieves high accuracy under resource constraints. It uses a heuristic scheduler that allocates resources to models in proportion to their frequency of use and systematically trades off accuracy for resource use. Third, we present Nexus, a DNN execution service optimized for GPU clusters. Nexus models DNNs as networks of dense linear algebra operations and adopts a batching-aware cluster resource allocation and scheduling framework to improve GPU utilization. We evaluate all three systems under realistic workloads, and the results show that they achieve significant improvements in cost, accuracy, and utilization over the baselines.
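To make the idea of frequency-proportional allocation concrete, the following is a minimal sketch and not MCDNN's actual scheduler. It assumes each application has a catalog of model variants with a known resource cost and accuracy, splits a fixed budget in proportion to how often each application is used, and then picks the most accurate variant that fits each share. The names `Variant`, `proportional_shares`, `pick_variant`, and `schedule`, and the abstract "resource units", are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Variant:
    """One approximated version of a model (hypothetical structure)."""
    name: str
    cost: float      # resource units needed to run this variant (assumed known)
    accuracy: float  # expected accuracy of this variant (assumed known)


def proportional_shares(freq: Dict[str, float], budget: float) -> Dict[str, float]:
    """Split the budget among applications in proportion to frequency of use."""
    total = sum(freq.values())
    if total <= 0:
        raise ValueError("at least one application must have positive frequency")
    return {app: budget * f / total for app, f in freq.items()}


def pick_variant(variants: List[Variant], share: float) -> Optional[Variant]:
    """Return the most accurate variant whose cost fits within the share."""
    feasible = [v for v in variants if v.cost <= share]
    if not feasible:
        return None  # nothing fits; the caller must decide how to degrade
    return max(feasible, key=lambda v: v.accuracy)


def schedule(freq: Dict[str, float],
             catalog: Dict[str, List[Variant]],
             budget: float) -> Dict[str, Optional[Variant]]:
    """Assign one variant per application under a shared resource budget."""
    shares = proportional_shares(freq, budget)
    return {app: pick_variant(catalog[app], shares[app]) for app in freq}
```

For example, with a budget of 8 units and a face recognizer used three times as often as an object detector, the face recognizer receives 6 units and the object detector 2, so a frequently used application can afford a costlier, more accurate variant while a rarely used one falls back to a cheaper approximation.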