Datacenter Scheduling for Approximate Computing and Machine Learning

The rapidly growing size of data and the complexity of analytics present new challenges to large-scale data processing systems. Modern distributed computing systems must support not only batch processing jobs, but also more advanced applications that analyze multimedia data and train machine learning models. Given the high costs of data processing, resource management is crucial. These new applications and data expose vastly different requirements that render traditional resource management systems obsolete, while at the same time offering opportunities for new system designs with better performance.

We present resource management systems that efficiently schedule and utilize hardware resources by exploiting insights from advanced distributed applications. We identify and study the following two key scenarios in big-data analytics:

(i) VideoStorm: Video cameras are pervasively deployed for security and smart-city scenarios, with millions of them in large cities worldwide. Achieving the potential of these cameras requires efficiently analyzing the live video streams in real time. VideoStorm is a video analytics system that processes thousands of video analytics queries on live video streams over large clusters. We consider two key characteristics of video analytics: the resource-quality tradeoff with multi-dimensional configurations, and the variety in quality and lag goals. VideoStorm’s offline profiler generates each query’s resource-quality profile, while its online scheduler allocates resources to queries to maximize performance on quality and lag, in contrast to the commonly used fair sharing of resources in clusters.
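The core idea of profile-driven allocation can be illustrated with a small sketch. This is not VideoStorm's actual scheduler (which also weighs lag goals and multi-dimensional configurations); it is a simplified greedy allocator, under the assumption that each query's profile is a list of (resource demand, quality) configurations, that repeatedly grants the configuration upgrade with the best marginal quality gain per unit of resource:

```python
import heapq

def allocate(profiles, capacity):
    """Greedy profile-driven allocation (illustrative sketch only).

    profiles: {query: [(resource, quality), ...]}, each list sorted by
              increasing resource demand.
    capacity: total cluster resource budget.
    Returns {query: chosen (resource, quality) configuration}.
    """
    # Start every query at its cheapest configuration.
    chosen = {q: p[0] for q, p in profiles.items()}
    used = sum(r for r, _ in chosen.values())

    idx = {q: 0 for q in profiles}   # index of each query's current config
    heap = []                         # max-heap of upgrade candidates

    def push_upgrade(q):
        """Push q's next configuration upgrade, keyed by marginal
        quality gain per unit of extra resource."""
        i, p = idx[q], profiles[q]
        if i + 1 < len(p):
            dr = p[i + 1][0] - p[i][0]
            dq = p[i + 1][1] - p[i][1]
            if dr > 0:
                heapq.heappush(heap, (-dq / dr, q))

    for q in profiles:
        push_upgrade(q)

    while heap:
        _, q = heapq.heappop(heap)
        i, p = idx[q], profiles[q]
        dr = p[i + 1][0] - p[i][0]
        if used + dr <= capacity:     # upgrade only if it fits the budget
            idx[q] += 1
            chosen[q] = p[idx[q]]
            used += dr
            push_upgrade(q)
    return chosen
```

With capacity 5 and two hypothetical queries, the allocator upgrades the query whose profile is steepest first, which is exactly the tradeoff a fair-sharing scheduler ignores.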

(ii) SLAQ: Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to the jobs with the most potential for improvement. SLAQ is a cluster scheduling system for approximate ML training jobs that aims to maximize overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime tradeoffs across multiple jobs to maximize system-wide quality improvement. SLAQ leverages the iterative nature of ML training algorithms by collecting quality and resource usage information from concurrent jobs, and then generating highly tailored quality-improvement predictions for future iterations.
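The scheduling loop described above can be sketched as follows. This is a simplified illustration, not SLAQ's implementation: the predictor here (the mean of a job's recent loss reductions) is a placeholder assumption standing in for SLAQ's actual quality-improvement prediction, and workers are simply divided in proportion to predicted improvement:

```python
def schedule_epoch(jobs, total_workers):
    """Divide workers among iterative ML jobs by predicted quality gain
    (illustrative sketch in the spirit of quality-driven scheduling).

    jobs: {name: [loss history, most recent value last]}
    total_workers: number of workers to hand out this epoch.
    Returns {name: worker count}.
    """
    # Predict each job's next-iteration improvement as the mean of its
    # recent loss reductions (placeholder predictor, clamped at zero).
    pred = {}
    for name, losses in jobs.items():
        deltas = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
        recent = deltas[-3:] or [0.0]
        pred[name] = max(sum(recent) / len(recent), 0.0)

    total = sum(pred.values())
    if total == 0:
        # No predicted progress anywhere: fall back to fair sharing.
        return {name: total_workers // len(jobs) for name in jobs}

    # Allocate workers proportionally to predicted quality improvement.
    return {name: int(total_workers * p / total) for name, p in pred.items()}
```

A job whose loss is still dropping steeply receives most of the workers, while a nearly converged job is throttled back, which is how directing resources by potential improvement beats fair sharing in exploratory settings.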


Main Publications

Live Video Analytics at Scale with Approximation and Delay-Tolerance
Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman
Proc. 14th Symposium on Networked Systems Design and Implementation
(NSDI ’17) Boston, MA, March 2017.  [pdf]

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Haoyu Zhang*, Logan Stafman*, Andrew Or, and Michael J. Freedman
Proc. ACM Symposium on Cloud Computing
(SoCC ’17) Santa Clara, CA, September 2017.  [pdf]
Best Paper Award.

Other Publications

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman
1st SysML Conference
(SysML ’18) Stanford, CA, February 2018.  [extended abstract]