Graduate Student
Computer Science & Engineering
University of Washington abhaykj@cs.washington.edu
About Me
I am a final year PhD student at the CSE department, University of Washington. I work in Probabilistic Databases and my advisor is Dan Suciu. Before this, I got my undergraduate degree at CSE, IIT Bombay. I am currently looking for postdoc/full-time positions in industry/academia.
Research Interests
My research lies at the intersection of Logic and Probabilistic Inference. Our goal is to (i) allow users to encode uncertainty in databases in the form of soft logical assertions, and then (ii) to support expressive, SQL-like queries over the resulting probabilistic database. Since both queries and the model are defined logically, my work has identified how different algorithms for probabilistic inference, viz. Binary Decision Diagrams, d-DNNF, Tree/Path-decomposition, compare in the setting where probabilistic models are defined logically.
Publications
Abhay Jha and Dan Suciu. Probabilistic Databases with MarkoViews. Under Submission[pdf]
Most of the work on query evaluation in probabilistic databases has
focused on the simple tuple-independent data model, where all tuples
are independent random events. Several efficient query evaluation
techniques exists in this setting, such as safe plans, algorithms
based on OBDDs, tree-decomposition and a variety of approximation
algorithms. However, complex data analytics tasks often require
complex correlations between tuples, and here query evaluation is
significantly more expensive, or more restrictive.
In this paper, we propose MVDBs as a framework both for
representing complex correlations and for efficient query
evaluation. An MVDB specifies correlations by views, called MarkoViews, on the
probabilistic relations and declaring the weights of the view's
outputs. An MVDB is a (very large) Markov Logic
Network. We make two sets of contributions. First, we show that query evaluation
on an MVDB is equivalent to evaluating a Union of Conjunctive Query(UCQ) over
a tuple-independent database. The translation is
exact (thus allowing the techniques developed for tuple independent
databases to be carried over to MVDB), yet it is
novel and quite non-obvious (some resulting probabilities may be
negative!).
This translation in itself though may not lead to much gain since the translated query gets complicated as we try to capture more correlations. Our second contribution is
to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of
UCQ. We
validate experimentally our techniques on a large probabilistic
database with MarkoViews inferred from the DBLP data.
Vibhav Gogate and Abhay Jha. Advances in Lifted Importance Sampling. In AAAI 2012 (accepted for Oral Presentation)
Abhay Jha and Dan Suciu. On the tractability of Query Compilation and Bounded Treewidth. In ICDT 2012[pdf]
Probabilistic Inference is intractable in general, since it subsumes
hard problems like #SAT. While #SAT is known to be tractable when
the treewidth of the formula is bounded, the existing notions of
treewidth fail to capture even the simplest tractable class of
Boolean formulas: read-once formulas. We introduce here a new
notion of treewidth of a Boolean formula, called the expression
treewidth, as the smallest treewidth of any DAG-expression
representing the Boolean formula. This is strictly stronger than
the existing definitions of primal and incidence treewidth, and, in
particular every read-once formula has an expression treewidth = 1.
Our first major result is to show that if the expression treewidth
of a Boolean formula is bounded, then the formula has an OBDD of
polynomial size. At a more refined level, if the expression
pathwidth of the Boolean formula is bounded, then it has OBDD of
bounded width (and, hence, of linear size).
Our second result is to prove an interesting converse:
if the Boolean formula has an OBDD of bounded width, then its
expression pathwidth is constant. Hence, for any boolean formula : bounded expression pathwidth is
equivalent to having a constant-width OBDD.
In general we prove the hierarchy : bounded pathwidth $\subsetneq$ bounded treewidth $\subsetneq$ polynomial OBDD. But over UCQ, we show that the three classes collapse and hence the notions
of bounded pathwidth, bounded treewidth, polynomial-size OBDD are equivalent. But over $UCQ^{\neq}$(UCQ Queries with $\neq$), we show that the class of bounded treewidth is strictly contained in polynomial OBDD, while we conjecture that bounded pathwidth and bounded treewidth are still equivalent.
Abhay Jha and Dan Suciu. Knowledge Compilation meets Database Theory : Compiling Queries to Decision Diagrams. In ICDT 2011 [pdf]
The goal of Knowledge Compilation is to represent a Boolean
expression in a format in which it can answer a range of
"online-queries" in PTIME. The online-query of main interest to
us is "model counting", because of its application to query
evaluation on probabilistic databases, but other online-queries can
be supported as well s.a. testing for equivalence, testing for
implication, etc. In this paper we study the following problem.
Given a database query q, decide whether its lineage can be
compiled efficiently into a given target language. We consider four
target languages, of strictly increasing expressive power (when the
size of compilation is restricted to be polynomial in the data
size): Read-Once Boolean formulae, OBDD, FBDD and d-DNNF. For
each target, we study the class of database queries that admit
polynomial size representation: these queries can also be evaluated
in PTIME over probabilistic databases. When queries are restricted
to conjunctive queries without self-joins, it was known that these
four classes collapse to the class of hierarchical queries, which is
also the class of PTIME queries over probabilistic databases. Our
main result in this paper is that, in the case of Unions of
Conjunctive Queries (UCQ), these classes form a strict hierarchy.
Thus, unlike conjunctive queries without self-joins, the expressive
power of UCQ differs considerably w.r.t. these target compilation
languages. Moreover, we give a complete characterization of the
first two target languages, based on the query's syntax.
Abhay Jha, Vibhav Gogate, Alexandra Meliou and Dan Suciu. Lifted Inference Seen from the Other Side : The Tractable Features . In NIPS 2010[pdf]
Lifted Inference algorithms for representations that combine first-order logic and graphical models have been the focus of much recent research. All lifted algorithms developed to date are based on the same underlying idea: take a standard probabilistic inference algorithm (e.g., variable elimination, belief propagation etc.) and improve its efficiency by exploiting repeated structure in the first-order model. In this paper, we propose an approach from the other side in that we use techniques from logic for probabilistic inference. In particular, we define a set of rules that look only at the logical representation to identify models for which exact efficient inference is possible. Our rules yield new tractable classes that could not be solved efficiently by any of the existing techniques.
Wolfgang Gatterbauer, Abhay K. Jha and Dan Suciu. Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases. In MUD 2010[pdf]
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. We propose a new approach by which every query is evaluated inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for self-join-free conjunctive queries and prove that it is always an up-per bound for query reliability, and that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a highly efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
Abhay Jha, Dan Olteanu and Dan Suciu. Bridging the gap between Intensional and Extensional Query Evaluation in Probabilistic Databases. In EDBT 2010[pdf]
There are two broad approaches to query evaluation over probabilistic databases: (1) Intensional Methods proceed by manipulating expressions over symbolic events associated with uncertain tuples. This approach is very general and can be applied to any query, but requires an expensive postprocessing phase, which involves some general-purpose probabilistic inference. (2) Extensional Methods, on the other hand, evaluate the query by translating operations over symbolic events to a query plan; extensional methods scale well, but they are restricted to safe queries.
In this paper, we bridge this gap by proposing an approach that can translate the evaluation of any query into extensional operators, followed by some post-processing that requires probabilistic inference. Our approach uses characteristics of the data to adapt smoothly between the two evaluation strategies. If the query is safe or becomes safe because of the data instance, then the evaluation is completely extensional and inside the database. If the query/data combination departs from the ideal setting of a safe query, then some intensional processing is performed, whose complexity depends only on the distance from the ideal setting.
Abhay Jha, Vibhor Rastogi and Dan Suciu. Query evaluation with Soft-Key Constraints. In PODS 2008[pdf]
Key Violations often occur in real-life datasets, especially in those integrated from different sources. Enforcing constraints strictly on these datasets is not feasible. In this paper we formalize the notion of soft-key constraints on probabilistic databases, which allow for violation of key constraint by penalizing every violating world by a quantity proportional to the violation. To represent our probabilistic database with constraints, we define a class of markov networks, where we can do query evaluation in PTIME. We also study the evaluation of conjunctive queries on relations with soft keys and present a dichotomy that separates this set into those in PTIME and the rest which are #P-Hard.
Abhay Kumar Jha, Ming Xiong and Krithi Ramamritham. Mutual Consistency in Real-Time Databases. In RTSS 2006[pdf]
A real-time database is composed of real-time objects whose values remain valid only within their validity intervals. Each object in the database models a real world entity. The freshness of these objects is maintained by update transactions that sample the real world entities. The literature proposes various ways to derive a schedule of transactions that preserves the freshness (also known as absolute consistency) of these objects. But these approaches do not take care of the mutual consistency of the objects, i.e., whether together they represent a logical state of the system. We investigate the problem of checking whether, given an update transaction schedule, a periodic query would be able to read mutually consistent values. We propose solutions for both single- and multiple-query cases in the presence of non-preemptable query executions. Specifically, we first investigate formulas that give the maximal value of mutual gaps among a set of data read at a certain point in time. (A mutual gap for two object values read from the database refers to the difference between the times at which the two objects were updated.) We then propose design approaches to (1) decide the period and relative deadline of a query so that it would guarantee mutual consistency; (2) decide if a given set of queries with relative deadlines and periods can guaranteemutual consistency. Finally, we suggest ways of reducing the complexity of our proposed approaches for both harmonic periods and general cases.
Qiang Wang, Abhay Kumar Jha and M. Tamer Ozsu. An XML Routing Synopsis for Unstructured P2P Networks. In WAIMW 2006[pdf]
Many emerging applications that use XML are distributed, usually over large peer-to-peer (P2P) networks on the Internet. The deployment of an XML query shipping system over P2P networks requires a specialized synopsis to capture XML data in routing tables. In this paper, we propose a novel graph-structured routing synopsis, called kd-synopsis, for deployment over unstructured super-peer based P2P networks. This synopsis is based on length-constrained FBsimulation relationship, which allows the balancing of the precision and size of the synopsis according to different space constraints on peers with heterogeneous capacity. We report comprehensive experiments to demonstrate the effectiveness of the kd-synopsis.