Management of Probabilistic Data

Description

Management of Probabilistic Data

NSF Proposal Number: 0713576

NSF Abstract page: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0713576

Databases and data management tools have a deterministic semantics: a data item either is in the data set or is not. But when data comes from multiple sources, or is extracted automatically, it often contains a variety of imprecisions that are difficult to model using the standard deterministic semantics: the same data item may have different representations in different sources and matching algorithms are imprecise; schemas differ across sources and schema matching tools are imprecise or incomplete or both; data at different sources may hold contradictory information; finally, some data, such as sensor data, is inherently probabilistic and hence imprecise. This project represents all sources of imprecision in a single uniform way, as data with a probabilistic semantics, and extendeds today's data management tools to manage efficiently data with probabilistic semantics.

Re-designing databases to handle probabilistic data is a daunting task. This project studies two problems that lie at the core of probabilistic data management: the complexity of the query evaluation problem on probabilistic databases, and the view materialization problem (deciding whether a view can be materialized and whether it can be used in other query plans). The results of this research will consists of a range of fundamental techniques to be used in a general purpose probabilistic query processor.

Intellectual Merits . The project makes new contributions that lie at the intersection of several disparate fields: logic, probability theory, knowledge representation, and traditional query processing and optimization. The project enhances the understanding of the query evaluation problem on probabilistic data, develops new algorithms for efficiently evaluating such queries and for materializing views, while leveraging existing database technology.

Broader Impact . Searching large information spaces (the Web; large collections of scientific databases; Homeland Security data) is one the new and most challenging frontiers in Computer Science. The innovation that is needed to support complex searches that scale to large and heterogeneous information spaces, has to come from the data management research community. This project makes contributions to broaden our ability to search large information spaces. If successful, the project will be one of the pieces that will help data management technology undergo a new paradigm shift, from supporting complex queries with deterministic semantics, to supporting complex explorations with probabilistic semantics.

Principal Investigator: Dan Suciu

Students:

Related Project: MystiQ


Supported by:

NSF IIS-0713576 and a Gift from Microsoft

Members

Chris Re, Dan Suciu, Magdalena Balazinska, Abhay Jha, Vibhor Rastogi, Evan Welbourne, Nodira Khoussainova, Gaetano Borriello, Nilesh Dalvi,

Publications

Chris Re, Dan Suciu,
Approximate Lineage for Probabilistic Databases
In VLDB, 2008
Chris Re, Dan Suciu,
Management of Data with Uncertainties
In CIKM, pp. 3-8, 2008
Chris Re, Magdalena Balazinska, Dan Suciu,
Event Queries on Correlated Probabilistic Streams
In SIGMOD, 2008
Abhay Jha, Vibhor Rastogi, Dan Suciu,
Evaluating Queries in the Presence of Soft Key Constraints
In PODS, 2008
Evan Welbourne, Nodira Khoussainova, Magdalena Balazinska, Gaetano Borriello, Dan Suciu,
Cascadia: A System for Specifying, Detecting, and Managing RFID Events
In MobySYS, 2008
Dan Suciu,
Probabilistic Databases
Published in SIGACT News, vol. 39 , no. 2 , pp. 111-124 , June , 2008
Chris Re, Nilesh Dalvi, Dan Suciu,
Efficient Top-k Query Evaluation on Probabilistic Data
In ICDE, 2007
Nilesh Dalvi, Dan Suciu,
Management of Probabilistic Data: Foundations and Challenges
In PODS, pp. 1-12, 2007
Note: (invited talk)
Chris Re, Dan Suciu,
Efficient Evaluation of HAVING Queries on a Probabilistic Database
In Proceedings of DBPL, 2007
Dan Suciu,
Tutorial at the EDBT Summer School: Techniques for managing probabilistic data
Unpublished ,2007
Nilesh Dalvi, Chris Re, Dan Suciu,
Query Evaluation on Probabilistic Databases
Published in IEEE Data Engineering Bulletin, vol. 29 , no. 1 , pp. 25-31 , 2006
Nodira Khoussainova, Magdalena Balazinska, Dan Suciu,
Towards correcting input data errors probabilistically using integrity constraints
In MobiDB, pp. 43-50, 2006
Nilesh Dalvi, Dan Suciu,
Efficient Query Evaluation on Probabilistic Databases
In VLDB, 2004