Title: In-Database Analytics for NoSQL Key-Value Stores
Advisors: Bill Howe, Dan Suciu
Abstract: Key-value databases emphasize record-level read-write operations, relying on external infrastructure such as Spark or MapReduce to implement complex analytics (e.g., matrix math, relational joins, machine learning). The "external system" approach is expensive for small yet complex analytics, requiring long code paths to extract data from the database and prepare native data structures. In response, developers implement custom applications that push simple filters and sums into the database's scans in order to maximize in-database processing. We generalize this approach to provide native, in-database support for complex analytics and evaluate its performance relative to external systems.
To begin, we implement the Graphulo API for sparse matrix processing inside the scan-time iterators of the Apache Accumulo NoSQL database. We compare Graphulo's performance on several matrix algorithms against off-the-shelf MapReduce and two in-memory linear algebra libraries. We find that memory requirements and I/O differences between the in-database and external-system approaches are the key factors determining when one outperforms the others.
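As a rough illustration of the scan-time approach (not the actual Graphulo API; all names here are invented), a sparse matrix can be stored as sorted (row, col) -> value entries, and a matrix product can be computed by streaming over one operand's entries while emitting partial products that a server-side summing combiner aggregates:

```python
# Hypothetical sketch: sparse matrices as key-value entries
# (row, col) -> value, with a sparse matrix-matrix product computed
# during a scan, in the spirit of Graphulo's iterator-based SpGEMM.
from collections import defaultdict

def to_entries(dense):
    """Flatten a dense matrix into (row, col) -> value entries for
    its nonzeros, mimicking a sorted key-value table."""
    return {(i, j): v
            for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0}

def scan_time_spgemm(a_entries, b_entries):
    """Compute C = A * B in one pass over A's entries, emitting partial
    products that a summing combiner would aggregate server-side."""
    # index B by row, as an iterator would scan B's sorted rows
    b_rows = defaultdict(list)
    for (k, j), v in b_entries.items():
        b_rows[k].append((j, v))
    c = defaultdict(int)
    for (i, k), a_val in a_entries.items():
        for j, b_val in b_rows[k]:
            c[(i, j)] += a_val * b_val  # combiner sums partial products
    return dict(c)

A = to_entries([[1, 0], [2, 3]])
B = to_entries([[0, 4], [5, 0]])
print(scan_time_spgemm(A, B))  # {(0, 1): 4, (1, 1): 8, (1, 0): 15}
```

The point of the sketch is that multiplication never materializes dense intermediates: only nonzero entries flow through the scan, which is why memory footprint and I/O dominate the comparison with external systems.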
Next, we implement relational algebra in Accumulo using the same approach. Relations admit many more mappings onto key-values than sparse matrices do, and choosing a good mapping matters: the key-value "physical schema" controls how data are grouped and partitioned, and therefore the performance of both ingest and queries. We catalog a number of key-value design patterns used in practice and generalize them into a grammar of access paths that insulates applications from the details of data layout. We prototype a query compiler that generates iterator code from access paths and show how the compiler is used in a cybersecurity application.
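To make the layout question concrete, here is a small sketch (with invented relation and field names, not the actual design-pattern grammar) of two common key-value mappings for one relation. Because key-value stores sort entries by key, the choice of key layout determines which queries become contiguous range scans:

```python
# Hypothetical sketch of two key-value "physical schemas" for a relation.
# The key layout determines sort order, and therefore which access path
# (point lookup by id vs. scan by attribute value) is efficient.

rows = [
    {"id": "a1", "src_ip": "10.0.0.1", "bytes": 120},
    {"id": "a2", "src_ip": "10.0.0.2", "bytes": 55},
]

def row_major(rows):
    """One entry per (record, attribute): key = (id, attr).
    All attributes of a record sort together -> fast lookups by id."""
    return sorted(((r["id"], attr), val)
                  for r in rows
                  for attr, val in r.items() if attr != "id")

def attr_index(rows, attr):
    """Inverted-index pattern: key = (value, id) for one attribute.
    Records sharing a value sort together -> fast scans by value."""
    return sorted(((r[attr], r["id"]), "") for r in rows)

# "fetch record a1" is a range scan over keys starting with "a1":
print(row_major(rows))
# "which records have src_ip 10.0.0.2?" scans the index layout:
print(attr_index(rows, "src_ip"))
```

An access-path grammar in this spirit lets an application declare which lookups it needs, while the compiler chooses among such layouts and generates the corresponding iterator code.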
Monday, December 5, 2016 - 11:00 to 12:30