Title: Building and accelerating a declarative platform for cloud machine learning
Advisor: Linda Shapiro
Supervisory Committee: Linda Shapiro (Chair), Mark Ganter (GSR, ME), Srikanth Kandula (Microsoft Research), and Magda Balazinska
Machine learning over big data is flourishing, with numerous applications including traffic video analytics, internet user modeling, and fraud detection. Processing machine learning queries on existing cloud platforms remains unsatisfactory: degrees of parallelism, resource usage, and algorithmic optimizations all require manual, ad hoc tuning.
We describe Optasia, a novel dataflow system that efficiently processes large-scale machine learning queries on the cloud. Optasia maps machine learning operations to a SQL-like declarative language and applies a powerful query optimizer for automatic parallelization and multi-query de-duplication. Evaluation on a traffic video dataset shows many-fold improvements in query completion time and resource usage relative to existing systems.
To further accelerate query processing, we note that machine learning algorithms are modeled as user-defined functions (UDFs) in current dataflow systems. Predicates on the UDF outputs (e.g., "has-red-cars=true"), even if highly selective, force a large amount of data to be analyzed needlessly, because the predicate can be evaluated only after the UDF has run on every input. Accelerating machine learning queries with UDFs remains an open question. In this work, we review and extend predicate pushdown by constructing and applying probabilistic predicates (PPs) that filter out data blobs that do not satisfy the query predicate; this filtering is parametrized by the target accuracy. To support complex query predicates, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments on several practical machine learning workloads show that query processing can be sped up by as much as 10x without losing accuracy.
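The filtering idea can be illustrated with a minimal Python sketch. It is not the implementation from this work: the names `pick_threshold`, `run_query`, `cheap_score`, and `expensive_udf` are hypothetical, and the sketch assumes a probabilistic predicate is a cheap scoring function whose threshold is calibrated on labeled positives so that at least the target fraction of them passes the filter.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def pick_threshold(scores_of_positives: Sequence[float], target_accuracy: float) -> float:
    """Return the largest threshold that still keeps at least
    `target_accuracy` (a fraction in (0, 1]) of the labeled positives."""
    if not scores_of_positives:
        raise ValueError("need at least one labeled positive")
    if not 0.0 < target_accuracy <= 1.0:
        raise ValueError("target_accuracy must be in (0, 1]")
    ranked = sorted(scores_of_positives, reverse=True)
    keep = max(1, math.ceil(target_accuracy * len(ranked)))
    # Every positive scoring at least ranked[keep - 1] passes the filter.
    return ranked[keep - 1]


def run_query(
    blobs: Sequence[Any],
    cheap_score: Callable[[Any], float],    # hypothetical cheap filter
    expensive_udf: Callable[[Any], bool],   # hypothetical costly ML UDF
    threshold: float,
) -> Tuple[List[Any], int]:
    """Apply the cheap filter first; run the UDF only on surviving blobs.
    Returns the matching blobs and the number of UDF invocations."""
    matches: List[Any] = []
    udf_calls = 0
    for blob in blobs:
        if cheap_score(blob) < threshold:
            continue  # dropped early, the UDF never sees this blob
        udf_calls += 1
        if expensive_udf(blob):
            matches.append(blob)
    return matches, udf_calls
```

Lowering `target_accuracy` raises the threshold, so fewer blobs reach the UDF at the cost of possibly missing some true matches. That is the accuracy-for-speed trade-off the parametrization is meant to expose.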