Management of Probabilistic Data
Description
Management of Probabilistic Data
NSF Proposal Number: 0713576
NSF Abstract page:
http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0713576
Databases and data management tools have a deterministic semantics: a
data item either is in the data set or is not. But when data comes
from multiple sources, or is extracted automatically, it often
contains a variety of imprecisions that are difficult to model using
the standard deterministic semantics: the same data item may have
different representations in different sources and matching algorithms
are imprecise; schemas differ across sources and schema matching tools
are imprecise or incomplete or both; data at different sources may
hold contradictory information; finally, some data, such as sensor
data, is inherently probabilistic and hence imprecise. This project
represents all sources of imprecision in a single uniform way, as data
with a probabilistic semantics, and extendeds today's data management
tools to manage efficiently data with probabilistic semantics.
Re-designing databases to handle probabilistic data is a daunting
task. This project studies two problems that lie at the core of
probabilistic data management: the complexity of the query evaluation
problem on probabilistic databases, and the view materialization
problem (deciding whether a view can be materialized and whether it
can be used in other query plans). The results of this research will
consists of a range of fundamental techniques to be used in a general
purpose probabilistic query processor.
Intellectual Merits
. The project makes new contributions that
lie at the intersection of several disparate fields: logic,
probability theory, knowledge representation, and traditional query
processing and optimization. The project enhances the understanding
of the query evaluation problem on probabilistic data, develops new
algorithms for efficiently evaluating such queries and for
materializing views, while leveraging existing database technology.
Broader Impact
. Searching large information spaces (the Web;
large collections of scientific databases; Homeland Security data) is
one the new and most challenging frontiers in Computer Science. The
innovation that is needed to support complex searches that scale to
large and heterogeneous information spaces, has to come from the data
management research community. This project makes contributions to
broaden our ability to search large information spaces. If
successful, the project will be one of the pieces that will help data
management technology undergo a new paradigm shift, from supporting
complex queries with deterministic semantics, to supporting complex
explorations with probabilistic semantics.
Principal Investigator:
Dan Suciu
Students:
-
Christopher Re (works on representations and approximations of
views on probabilistic databases)
-
Abhay Jha (develops query evalution techniques over tables
with soft key constraints)
-
Vibhor Rastogi (works on complexity aspects of query evaluation
over soft key constraints)
-
Nodira
Koussainova
(applications of materialized views to RFID data)
-
Chris Re
(works
on the view materialization problem and on probabilistic constraints)
Related
Project:
MystiQ
Supported by:
NSF IIS-0713576 and a Gift from Microsoft
Members
Dan Suciu, Vibhor Rastogi, Chris Re, Nilesh Dalvi, Magdalena Balazinska, Abhay Jha, Evan Welbourne, Nodira Khoussainova, Gaetano Borriello,
Publications
Dan Suciu,
Embracing Uncertainty in Large-Scale Computational Astrophysics
In MUD, 2009
Vibhor Rastogi, Chris Re, Dan Suciu,
Query Containment of Tier-2 Queries over a Probabilistic Database
In MUD, 2009
Chris Re, Dan Suciu,
General Database Statistics Using Entropy Maximization
In DBPL, 2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Queries and Materialized Views on Probabilistic Databases
Unpublished ,2009
Note: to appear in JCSS
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt (Extended Version)
Unpublished ,2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt
Published in CACM, vol. 52 , no. 7 , pp. 86-96 , 2009
Chris Re, Dan Suciu,
Approximate Lineage for Probabilistic Databases
In VLDB, 2008
Chris Re, Dan Suciu,
Management of Data with Uncertainties
In CIKM, pp. 3-8, 2008
Chris Re, Magdalena Balazinska, Dan Suciu,
Event Queries on Correlated Probabilistic Streams
In SIGMOD, 2008
Abhay Jha, Vibhor Rastogi, Dan Suciu,
Evaluating Queries in the Presence of Soft Key Constraints
In PODS, 2008
Evan Welbourne, Nodira Khoussainova, Magdalena Balazinska, Gaetano Borriello, Dan Suciu,
Cascadia: A System for Specifying, Detecting, and Managing RFID Events
In MobySYS, 2008
Dan Suciu,
Probabilistic Databases
Published in SIGACT News, vol. 39 , no. 2 , pp. 111-124 , June , 2008
Chris Re, Nilesh Dalvi, Dan Suciu,
Efficient Top-k Query Evaluation on Probabilistic Data
In ICDE, 2007
Nilesh Dalvi, Dan Suciu,
Management of Probabilistic Data: Foundations and Challenges
In PODS, pp. 1-12, 2007
Note: (invited talk)
Chris Re, Dan Suciu,
Efficient Evaluation of HAVING Queries on a Probabilistic Database
In Proceedings of DBPL, 2007
Dan Suciu,
Tutorial at the EDBT Summer School: Techniques for managing probabilistic data
Unpublished ,2007
Nilesh Dalvi, Chris Re, Dan Suciu,
Query Evaluation on Probabilistic Databases
Published in IEEE Data Engineering Bulletin, vol. 29 , no. 1 , pp. 25-31 , 2006
Nodira Khoussainova, Magdalena Balazinska, Dan Suciu,
Towards correcting input data errors probabilistically using integrity constraints
In MobiDB, pp. 43-50, 2006
Nilesh Dalvi, Dan Suciu,
Efficient Query Evaluation on Probabilistic Databases
In VLDB, 2004