================================================================ CSE 344 -- Fall 2011 Lecture 10: Datalog ================================================================ As an alternative to abstract query languages based on algebra, one can use a form of logic to express queries. Just like SQL, these queries express the WHAT rather than the HOW. Today, we will discuss the Datalog ("database logic") logical query language. Next lecture, we will discuss the relational calculus, which is another declarative language for expressing queries. The relational calculus has influenced the design of the SQL query language. Why are we discussing Datalog and the Relational Calculus? For relational calculus, the answer is simple: it is the foundation behind SQL, so you need to understand it. For datalog, the key reason for studying it is that it is heavily used in the database theory literature. Many important and fundamental problems in database research require rigorous reasoning about queries and their properties. Datalog is much more convenient to use in those cases than SQL or even relational algebra. As one example, there has recently been great interest in managing imprecise information. Recall IMDB and how it contains dirty data. In practice, data is almost always dirty. Integrity constraints can help us catch when problems arise but it is not always possible to clean the data. For example, a QFC database can include two "Preferred customers" with the same name, same phone number, but two different addresses. One approach to handling such dirty data is to annotate the data with probabilities. Perhaps one address is more likely to be correct than the other one because it was more recently added to the database? Once we annotate base data with probabilities, we can ask queries, and receive results also annotated with probabilities and ranked according to their likelyhood of being correct. It is computationally intensive to execute queries with such probabilities. In fact, some queries are #P-hard to answer while others can be computed in polynomial time (Prof. Dan Suciu is behind this crucial and very well cited result). Reasoning about the complexity of various queries under such new data models is much easier to do in datalog rather than relational algebra or SQL. Hence, if you ever want to do research in theoretical data management, you must learn datalog. For everyone else, it is still very useful to know at least a little datalog because you need to understand whether theoreticians actually came up with something useful in practice ;-) For example, Joe Hellerstein from Berkeley who is a systems database faculty is using datalog-like languages to reason about network protocols and distributed systems. Others are using datalog to reason about massive-scale parallel query processing. Datalog is thus a veru useful, theoretical tool. You need to know about it! ----------------------------- Datalog consists of "if-then" rules. A datalog rule is: Q(args) :- R1(args1), R2(args2), ... sometimes also written as: Q(args) :- R1(args1) AND R2(args2) AND ... Terminology: the rule head = Q(args) the rule body = R1(args1), R2(args2), ... atom or subgoal = any one of Ri(argsi) head variables = the variables occurring in the head existential variables = all the other variables Ri(argsi) is also called a relational atom. Ri(argsi) evaluates to true when relation Ri contains the tuple described by argsi. For example, for relation R(A,B) containing tuples (1,2) and (3,4), R(1,2) is true, R(3,4) is true but R(5,6) is false. In addition to relational atoms, we can also have arithmetic atoms of the form x < y or 2x > 4. Subgoals may be negated with NOT. A datalog program is a collection of one or more rules: rule1 rule2 ... Each of the rules expresses the idea that from certain combinations of tuples in certain relations, we may infer that some other tuple must be in some other relation, or in the answer to a query. ----------------------- Examples of single rules: Schema: /*****************************************************/ /* Purchase(pid, product, price, quantity) */ /* Product(pname, manufacturer) */ /* (Purchase.product refers to Product.pname) */ /*****************************************************/ Find all products under 9.99: A(y) :- Purchase(x,y,z,u), z < 9.99 What is the meaning of this datalog program? Whenever we can find a tuple in Purchase that satisfies the predicate z < 9.99, then we have a tuple in the result relation A with attribute y equal to the value of the product attribute of the Purchase tuple. Find all manufacturers that manufacture some products under 9.99 A(v) :- Purchase(x,y,z,u), z < 9.99, Product(y, v) Find manfuacturers that manufacture products both < 9.99 and > 999.99 A(v) :- Purchase(x1,y1,z1,u1), z1 < 9.99, Product(y1, v), Purchase(x2,y2,z2,u2), z2 > 999.99, Product(y2, v), Change this query so it returns manfuacturer that sold THE SAME PRODUCT both under 9.99 and over 999.99 Examples of multiple rules: /************************/ /* Friend(name1, name2) */ /* Enemy(name1, name2) */ /************************/ Find Joe's friends, and Joe's friends of friends. A(x) :- Friend('Joe', x) A(x) :- Friend('Joe', z), Friend(z, x) What does this compute ? Mystery(x,y) :- Friend(x,z), Friend(z,y) What does this compute ? M1(x,y) :- Friend(x,z), Friend(z,y) M2(x,y) :- M1(x,z), M1(z,y) A(x,y) :- M2(x,z), M2(z,y) ---------------------- We allow only *non-recursive* datalog in this course, meaning the following: the relation in the head of rule k cannot appear in the body of rules ---------------------- Meaning of a datalog rule = a logical statement ! M(x,y) :- Friend(x,z), Friend(z,y) Means: forall x. forall y. forall z. (Friend(x,z) and Friend(z,y) ==> M(x,y)) and M is the smallest relation that has this property Note: logically equivalent to: forall x. forall y. (exists z. Friend(x,z) and Friend(z,y)) ==> M(x,y) That's why the variables not in the head are called "existential variables". ---------------------- Can we express everything in the Relational Algebra in datalog ? In class: 1. Union R(A,B) U S(A,B) 2. Intersection 3. Selection 4. Join 5. Projections 6. Difference For union we need two rules: U(x,y) :- R(x,y) U(x,y) :- S(x,y) Intersection I(x,y) :- R(x,y),S(x,y) Selection x>100 AND y='some string' S(x,y) :- R(x,y), x > 100, y='some string' Alternatively S(x,y) :- R(x,'some string) Selection x>100 OR y='some string' Need two rules S(x,y) :- R(x,y), x > 100 S(x,y) :- R(x,y), y='some string' Cross-product C(x,y,u,v) :- R(x,y)S(u,v) Natural join J(x,y,z) :- R(x,y)S(y,z) Projections P(x) :- R(x,y) To express difference, we add negation D(x,y) :- R(x,y) NOT S(x,y) Another example with negation: Find Joe's friends' of friends that are not Joe's friends. A(x) :- Friend('Joe',y), Friend(y,x), not Friend('Joe',x) --------------------- what's wrong ? A(x,y) :- Friend('Joe',y), Enemy(y,z) A(z) :- Friend('Joe', x), x != z A(x) :- Friend('Joe', x), not (Friend(x,y)) These rules generate infinite numbers of answers! --------------------- "Safe datalog rule": every variable occurs in a positive atom. Every rule must be safe ! --------------------- Extra examples. 1) Translate the following relational algebra expressions into datalog: R(A,B,C) S(D,E,F) T(G,H) 1.1) project_{A} ( select_{B=3} (R) ) A(a) :- R(a,3,c) Can also write: A(a) :- R(a,3,_) Underscore is used to denote an "anonymous variable", a variable that appears only once. 1.2) project_{A} ( select_{B=3} (R) join_{R.A=S.D} select_{E=5}(S) ) A(a) :- R(a,3,c), S(a,5,f) 1.3) project_{A} ( select_{B=3 and C > 100} (R) ) A(a) :- R(a,3,c), c > 100 2) Translate the following datalog queries into SQL A(a) :- R(a,_,c), S(c,_,e), T(e,f,_), f > 100 SELECT R.a FROM R, S, T WHERE R.A = S.D AND S.F = T.G AND T.H > 100 --------------------- Comparison between datalog and relational algebra 1) Every expression in the basic relational algebra can be expressed as a Datalog query. 2) But operations in the extended relational algebra (grouping, aggregation, and sorting) have no corresponding features in the version of datalog that we discussed today. 3) Similarly, datalog can express recursion, which relational algebra cannot.