I am a graduate student in
the department of Computer
Science and Engineering at
the University of Washington
advised by Dan
Suciu. My interests are theoretical and practical problems in
data management. Details of my work can be
found here. I recently accepted a
position as an assistant professor at
the University of
Wisconsin–Madison. I believe that the future of computing is in
data management. If you agree, are an outstanding student, and are
looking to begin graduate work, please send me an email.
Ongoing Project Descriptions
MystiQ is a probabilistic
relational database designed to handle imprecision resulting both from newer
applications such as information
extraction and social
networking data and classical applications such as object reconciliation
and data cleaning. The central theme is processing complex SQL queries on
large amounts of probabilistic relational data. This work has developed
techniques such as extensional
plans for aggregates, multisimulation,
materialized
views of probabilistic data, processing of NOT EXISTS predicates,
and approximate
lineage. A recent overview of the system is in our upcoming SUM 2008 paper. For a broader, biased
look at the state of the art, see our tutorial (PowerPoint, parts I & II), which was
delivered at VLDB 2008 in Auckland, New Zealand, or the extended version of
our upcoming CACM paper.
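As a rough illustration of what extensional evaluation over probabilistic relational data can look like, here is a minimal Python sketch. It is not MystiQ's implementation: the relation names, the data, and both helper functions are hypothetical, and it assumes every input tuple is an independent event.

```python
from math import prod

def independent_project(rows, key):
    """Group rows by key(tuple); a group is present if at least one of its
    rows is present. Assumes the rows within a group are independent."""
    groups = {}
    for tup, p in rows:
        groups.setdefault(key(tup), []).append(p)
    return [((k,), 1.0 - prod(1.0 - p for p in ps)) for k, ps in groups.items()]

def independent_join(left, right, left_key, right_key):
    """Pair up matching rows; a joined row is present only if both inputs
    are. Assumes the two inputs are independent of each other."""
    return [(lt + rt, lp * rp)
            for lt, lp in left
            for rt, rp in right
            if left_key(lt) == right_key(rt)]

# Hypothetical uncertain facts, e.g. produced by information extraction.
lives_in = [(("alice", "seattle"), 0.5),
            (("bob", "seattle"), 0.5),
            (("carol", "madison"), 0.75)]
has_office = [(("seattle",), 0.5)]

# Which cities have an office and at least one resident?
city_has_resident = independent_project(lives_in, key=lambda t: t[1])
answer = independent_join(city_has_resident, has_office,
                          left_key=lambda t: t[0], right_key=lambda t: t[0])
```

The probabilities are computed by the operators themselves rather than by tracking how each answer was derived. The order of the operators matters: projecting before joining keeps the independence assumptions valid for this query, but not every plan for every query has that property.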
Lahar is a successor to the Peex project, which is a part
of the larger Markovian Streams
Project. The goal of both projects is to manage data from the RFID ecosystem, a building-wide
RFID deployment at the Paul Allen Center at the University of
Washington. The technical contribution of this work is a suite of algorithms and access
methods to manage data in both near real-time and historical streams. This
project is joint work with Julie Letchner and
Prof. Magdalena Balazinska.
For an overview, please see our article in the IEEE Journal of Internet Computing, Challenges for Event
Queries over Markovian Streams. For a more detailed account, see
our ICDE 2009 research paper Access Methods for
Markovian Streams, or check out our upcoming demo at VLDB 2009 in Lyon, France.
NB: We plan
to publish the data from the RFID ecosystem soon; please check http://lahar.cs.washington.edu for
details. We will also be donating this data to the pdbench project. If you
have probabilistic/uncertain data, I encourage you to donate it to this great
project!
Completed Project Descriptions
Dedupalog is a
declarative language for specifying deduplication tasks. In our upcoming ICDE
2009 paper, Large-Scale
Deduplication with Constraints using Dedupalog, we define a syntax
and semantics for our new language. Further, we provide algorithms that can
cluster massive datasets extremely fast, e.g., all of CiteSeer
in a minute or two. The technical key is an extremely
scalable algorithm that we prove is a constant-factor approximation of the
optimum for a large fragment of Dedupalog programs. This is joint work with
Dr. Arvind Arasu and
Prof. Dan Suciu that was done while visiting the DMX group at Microsoft
Research. This paper has been invited to a special issue of TKDE for the best papers in ICDE 2009.
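To give a flavor of clustering for deduplication, here is a minimal Python sketch of randomized pivot clustering, a well-known technique from the correlation clustering literature. It is an assumption that this resembles the algorithm in the paper; it is not Dedupalog itself, it handles no constraints, and the names `pivot_cluster`, `normalize`, and `same_person` and the sample data are hypothetical.

```python
import random

def pivot_cluster(items, similar, seed=0):
    """Visit items in random order; each unassigned item becomes a pivot
    and absorbs every still-unassigned item that is similar to it."""
    rng = random.Random(seed)
    order = list(items)
    rng.shuffle(order)
    assigned = set()
    clusters = []
    for pivot in order:
        if pivot in assigned:
            continue
        cluster = [pivot]
        assigned.add(pivot)
        for other in order:
            if other not in assigned and similar(pivot, other):
                cluster.append(other)
                assigned.add(other)
        clusters.append(cluster)
    return clusters

def normalize(name):
    # Hypothetical matching rule: ignore case and periods.
    return name.lower().replace(".", "").strip()

def same_person(a, b):
    return normalize(a) == normalize(b)

names = ["J. Smith", "j smith", "A. Jones", "a jones", "B. Lee"]
clusters = pivot_cluster(names, same_person)
```

This naive version compares each pivot against every remaining item, so it is quadratic in the worst case and only illustrates the idea, not the scalability claimed above.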
Galax is an open-source implementation
of XQuery 1.0, the W3C XML Query Language. My work on Galax included the
design of the algebraic compiler, which recovered classical optimizations,
notably join optimizations, inside the full XQuery language. This project has
continued without me and produced some very cool work
at SIGMOD 2008.
XQuery! (read: XQuery-Bang) is
a fully compositional update language that extends XQuery 1.0, the W3C XML Query
Language. The contribution is recovering classical database optimizations
(joins, cursors and indices) while at the same time providing imperative
features (variable assignment).
SilkRoute is a platform to
translate XQuery to SQL in a performant and largely complete way. It allows
users to publish their relational data effectively and easily. XBrain is a
web-based application built on SilkRoute designed to allow researchers to
query SIG’s Brain Mapping Database. The query language used is XQuery, and
the resulting XML can be viewed directly, automatically transformed into
HTML or CSV, or visualized on an image of brain regions.