The Second Annual DB Extravaganza in Honor of

Professor Joseph M. Hellerstein

February 6 (Wednesday) 10am-12pm

Database Lab (CSE 605)

 

Description The DB Extravaganza is a chance for members of the database group to present work to both a distinguished guest and each other. The talks are short (five minutes) and intended to stimulate discussion. The work presented varies from mature material that has appeared in conferences to more speculative work that is early in the pipeline.

 

Evan Welbourne

Complex Event Specification for Non-Expert Users of RFID Data

 

A key challenge in pervasive computing is the transformation of sensor data into higher-level events (e.g., Bob is getting coffee, Alice is printing a paper). Emerging event detection systems can provide suitably flexible services for extracting these high-level events, but they require event specifications in complex, expert-oriented languages. We present the design and implementation of SCENIC, a tool that allows end users to graphically specify spatio-temporal event definitions using an intuitive iconic language and storyboard metaphor. SCENIC automatically translates user-specified events into an event language used by a sophisticated event detection engine. We show that non-experts can use SCENIC to generate event specifications which produce meaningful results when run over real experimental data collected in an RFID-based pervasive computing environment.

Christopher Ré

PEEX+: Extracting Events from Probabilistic Streams

 

A major problem in detecting events in streams of data is that the data can be imprecise, e.g., RFID data. However, current state-of-the-art event detection systems such as Cayuga, SASE or SnoopIB, assume the data is precise.  Noise in the data can be captured using techniques such as hidden Markov models.  Inference on these models creates a stream of probabilistic events which cannot be directly queried by existing systems. To address this challenge we propose PEEX+, an event processing system for probabilistic event streams. By exploiting the probabilistic nature of the data, PEEX+ yields a much higher recall and precision than deterministic techniques operating over only the most probable tuples.  In this talk, I will give an overview of our approach and explain some preliminary results.

 

This is joint work with Julie Letchner, Magdalena Balazinska and Dan Suciu.

Julie Letchner

Event Detection on Archived Probabilistic Streams With Cross-Time Correlations

 

Recent work in our group has developed streaming algorithms for event detection on probabilistic streams with Markovian correlations.  Once these streams are archived on disk, however, the unique structure of Markovian stream data can be leveraged to answer a broader and more compelling set of queries.  We will overview our system for managing these streams by giving examples of the novel query types it supports and by providing insights about query optimizations on archived correlated streams.

Abhay Jha

Query Evaluation with soft keys 

 

Key Violations often occur in real-life datasets, especially in those integrated from different sources. Enforcing constraints strictly on these datasets is not feasible. We formalize the notion of soft-key constraints on probabilistic databases, which allow for violation of key constraint by penalizing every violating world by a quantity proportional to the violation. To represent our probabilistic database with constraints, we define a new class of markov networks, Size-Constrained Markov Networks(SCMN) and show that many inference problems that cannnot be solved for a general Markov Network can be evaluated in PTIME for an SCMN. We also study the evaluation of conjunctive queries on relations with soft keys and present a dichotomy that separates this set into those in PTIME and the rest which are #P-Hard.
Joint work with Vibhor Rastogi, Dan Suciu

YongChul Kwon

Moirae: History Enhanced Monitoring

 

Today, many applications process continuous streams of information in near real-time. These applications include financial services, network monitoring, computer system monitoring, and sensor-based environment monitoring. These applications also archive the data streams they process for reasons including audit, billing, quality assurance, and forensic analysis. This archived data is an abundant source of information that may help improve the quality of the real-time monitoring tasks. Existing stream processing systems, however, provide limited support for exploiting stream data archives. The goal of the Moirae project is to provide a new type of stream processing engine that integrates near real-time information with data archives to enable new types of smart monitoring tasks. In this talk, we present the current status and future goals of the Moirae project.

Mike Cafarella

Uncovering the Relational Web

 

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables.  We extracted 14.1 billion HTML tables from a general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data.  Because each relational table has its own ``schema'' of labeled and typed columns, each such table can be considered a small structured database.  The resulting corpus of HTML-embedded relations forms the largest collection of databases that we are aware of.

A database collection of this scale poses unique problems.  First, simply constructing a high-quality collection is difficult, as non-relational tables make up more than 98% of the raw HTML tables that we encounter. Second, no current query tool is designed to query millions of databases with a very diverse set of schemas (2.6 million unique schemas, in our case).  To solve both of these problems, we propose the WebTables system. WebTables uses trained statistical classifiers to both recover relations (with metadata) from the web crawl, and to rank these recovered relations as part of a keyword-driven structured-data search engine. WebTables also generates statistical schema information that can be applied to several traditional database problems.

Nodira Khoussianova

Automat. Low Cost. Great Data.

 Web-search engines offer a keyword interface because its simplicity appeals to the masses.
On the other hand, it has been noted repeatedly that keyword search does not enable formulating more complex information needs (such as mashups).

To date, the main approach to handling these information needs is based on ideas from data integration, examples being CIMPLE, Yahoo Pipes, etc.  These systems are designed for a particular domain represented by a mediated schema, use wrappers to extract data from a pre-defined set of data sources, reconcile the data across these sources, and answers queries formulated over the mediated schema. Although powerful, these systems require the user to locate sources and create semantic mappings between them. Thus requiring the user to have an expert knowledge of the set of relationships and attributes we expect to query.

In this talk, we present Automat, a system that engages the user in solving complex information tasks on the web, and therefore offers more flexibility and coverage than the data integration approach. With Automat, the user starts by issuing a keyword query, and is presented with the structured data found on the web that is relevant to the query. The structured data may be extracted from HTML tables, lists, or sources. Automat then helps the user to organize the data and further structure and manipulate it with a set of operators.

Wolfgang Gatterbauer

Incorporating uncertainty metrics into a general-purpose data integration system

 

In discovery-oriented fields such as life sciences, researchers often have to quickly retrieve information related to a given item of interest. Such queries commonly require integrating several distinct data bases as the evidence linking relevant information to the query item can often not be found in one single source. This "workflow-like" concatenation of different sources, however, can quickly lead to a large number of possible answers, many of which are actually less relevant.

 

In order to avoid overloading the user with irrelevant information, we apply a formal probabilistic framework to data integration. In this approach, we let data items retain their uncertainty in each step of the integration process and rank query results according to their relevance before presenting them to the user.

 

This talk will shortly outline our approach to Uncertain Information Integration (UII) and highlight two current research foci: (i) how can we efficiently evaluate relevance queries over a probabilistic graph from integrated data sets; and (ii) how robust are our results to varying quantifications of the uncertainties?

 

 

Todd Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter Tarczy-Hornoch

Fei Wu

Autonomously Semantifying Wikipedia

 

Berners-Lee’s compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping method — creating enough structured data to motivate the development of applications. This talk argues that autonomously “Semantifying Wikipedia” is the best way to solve the problem. We choose Wikipedia as an initial data source, because it is comprehensive, not too large, high-quality, and contains enough manually derived structure to bootstrap an autonomous, self-supervised process. We identify several types of structures which can be automatically enhanced in Wikipedia (e.g., link structure, taxonomic data, infoboxes, etc.), and we describe a prototype implementation of a self-supervised, machine learning system which realizes our vision.

Joint work with Daniel S. Weld.