The Second Annual DB
Extravaganza in Honor of
Professor Joseph M. Hellerstein
February 6 (Wednesday) 10am-12pm
Database Lab (CSE 605)
Description The DB Extravaganza is a chance for members of the database group to present work to both a distinguished guest and each other. The talks are short (five minutes) and intended to stimulate discussion. The work presented varies from mature material that has appeared in conferences to more speculative work that is early in the pipeline.
|
Complex Event Specification for Non-Expert
Users of RFID Data |
|
|
|
A key challenge in pervasive computing is the transformation of sensor data into higher-level events (e.g., Bob is getting coffee, Alice is printing a paper). Emerging event detection systems can provide suitably flexible services for extracting these high-level events, but they require event specifications in complex, expert-oriented languages. We present the design and implementation of SCENIC, a tool that allows end users to graphically specify spatio-temporal event definitions using an intuitive iconic language and storyboard metaphor. SCENIC automatically translates user-specified events into an event language used by a sophisticated event detection engine. We show that non-experts can use SCENIC to generate event specifications which produce meaningful results when run over real experimental data collected in an RFID-based pervasive computing environment. |
|
PEEX+: Extracting Events from Probabilistic Streams |
|
|
|
A major problem in detecting events in streams of data is that the data can be imprecise, e.g., RFID data. However, current state-of-the-art event detection systems such as Cayuga, SASE or SnoopIB, assume the data is precise. Noise in the data can be captured using techniques such as hidden Markov models. Inference on these models creates a stream of probabilistic events which cannot be directly queried by existing systems. To address this challenge we propose PEEX+, an event processing system for probabilistic event streams. By exploiting the probabilistic nature of the data, PEEX+ yields a much higher recall and precision than deterministic techniques operating over only the most probable tuples. In this talk, I will give an overview of our approach and explain some preliminary results.
This is joint work with Julie Letchner, Magdalena Balazinska and Dan Suciu. |
|
Event Detection on Archived Probabilistic Streams With Cross-Time Correlations |
|
|
|
Recent work in our group has developed streaming algorithms for event detection on probabilistic streams with Markovian correlations. Once these streams are archived on disk, however, the unique structure of Markovian stream data can be leveraged to answer a broader and more compelling set of queries. We will overview our system for managing these streams by giving examples of the novel query types it supports and by providing insights about query optimizations on archived correlated streams. |
|
Abhay Jha |
Query Evaluation with soft keys |
|
|
Key Violations often occur in real-life datasets,
especially in those integrated from different sources. Enforcing constraints
strictly on these datasets is not feasible. We formalize the notion of
soft-key constraints on probabilistic databases, which allow for violation of
key constraint by penalizing every violating world by a quantity proportional
to the violation. To represent our probabilistic database with constraints,
we define a new class of markov networks,
Size-Constrained Markov Networks(SCMN) and show that many inference problems
that cannnot be solved for a general Markov Network
can be evaluated in PTIME for an SCMN. We also study the evaluation of
conjunctive queries on relations with soft keys and present a dichotomy that
separates this set into those in PTIME and the rest which are #P-Hard. |
|
Moirae: History Enhanced Monitoring |
|
|
|
Today, many applications process continuous streams of information in near real-time. These applications include financial services, network monitoring, computer system monitoring, and sensor-based environment monitoring. These applications also archive the data streams they process for reasons including audit, billing, quality assurance, and forensic analysis. This archived data is an abundant source of information that may help improve the quality of the real-time monitoring tasks. Existing stream processing systems, however, provide limited support for exploiting stream data archives. The goal of the Moirae project is to provide a new type of stream processing engine that integrates near real-time information with data archives to enable new types of smart monitoring tasks. In this talk, we present the current status and future goals of the Moirae project. |
|
Uncovering the Relational Web |
|
|
|
The World-Wide Web consists of a huge number of
unstructured hypertext documents, but it also contains structured data in the
form of HTML tables. We extracted 14.1 billion HTML tables from a
general-purpose web crawl, and estimate that 154 million of these tables
contain high-quality relational-style data. Because each relational
table has its own ``schema'' of labeled and typed columns, each such table
can be considered a small structured database. The resulting corpus of
HTML-embedded relations forms the largest collection of databases that we are
aware of. |
|
Automat. Low Cost. Great Data. |
|
|
Web-search engines offer a keyword interface because
its simplicity appeals to the masses. |
|
|
Incorporating uncertainty metrics into a general-purpose data integration system |
|
|
|
In discovery-oriented fields such as life sciences, researchers often have to quickly retrieve information related to a given item of interest. Such queries commonly require integrating several distinct data bases as the evidence linking relevant information to the query item can often not be found in one single source. This "workflow-like" concatenation of different sources, however, can quickly lead to a large number of possible answers, many of which are actually less relevant.
In order to avoid overloading the user with irrelevant information, we apply a formal probabilistic framework to data integration. In this approach, we let data items retain their uncertainty in each step of the integration process and rank query results according to their relevance before presenting them to the user.
This talk will shortly outline our approach to Uncertain Information Integration (UII) and highlight two current research foci: (i) how can we efficiently evaluate relevance queries over a probabilistic graph from integrated data sets; and (ii) how robust are our results to varying quantifications of the uncertainties?
Todd Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter Tarczy-Hornoch |
|
Autonomously Semantifying Wikipedia |
|
|
|
Berners-Lee’s
compelling vision of a Semantic Web is hindered by a chicken-and-egg problem,
which can be best solved by a bootstrapping method — creating enough structured
data to motivate the development of applications. This talk argues that
autonomously “Semantifying Wikipedia” is the best way
to solve the problem. We choose Wikipedia as an initial data source, because
it is comprehensive, not too large, high-quality, and contains enough
manually derived structure to bootstrap an autonomous, self-supervised
process. We identify several types of structures which can be automatically
enhanced in Wikipedia (e.g., link structure, taxonomic data, infoboxes, etc.), and we describe a prototype
implementation of a self-supervised, machine learning system which realizes
our vision. |