Title: PipeGen: Automatic Generation of Data Pipes for Hybrid Analytics

Advisors: Magda Balazinska & Alvin Cheung

Abstract: Moving data between database systems with different storage formats, capabilities, and query languages has become increasingly common in modern data analytics. A typical way to handle data movement is either to materialize the transferred data to the file system in a common intermediate format (e.g., CSV) or to manually construct data transfer logic, which we refer to as data pipes. Unfortunately, the former approach incurs excessive file system I/O and data conversions, while the latter requires implementing a custom, dedicated data pipe for each pair of systems. In this work, we develop a tool called PipeGen that automatically generates optimized data pipes between arbitrary pairs of database systems without materializing the transferred data to the file system. Our generated data pipes let users easily leverage functionality in external DBMSs for which no efficient data transfer mechanism exists. We implement a prototype of PipeGen and evaluate it by automatically generating data pipes between five DBMSs. Our generated data pipes perform within 15% of manually implemented streaming binary versions and attain a speedup of 2.3× compared to file system materialization.

Place: Database Lab, CSE 405, Paul G. Allen Center

When: Friday, January 22, 2016, 13:30 to 14:30