Keys for XML
Peter Buneman* Susan Davidson* Wenfei Fan# Carmem Hara% Wang-Chiew Tan*
[This document is also available in .ps and .pdf format.]
Abstract: We discuss the definition of keys for XML
documents, paying particular attention to the concept of a relative
key, which is commonly used in hierarchically structured documents and
scientific databases.
Keywords: Keys, Relative Keys.
1 Introduction
Keys are an essential part of
database design [2,
14]: they
are fundamental to data models and conceptual design; they provide the means by
which one tuple in a relational database may refer to another tuple; and they
are important in update, for they enable us to guarantee that an update will
affect precisely one tuple. More philosophically, if we think of a tuple as
representing some real-world entity, the key provides an invariant connection
between the tuple and entity.
If XML documents are to do double duty as
databases, then we shall need keys for them. In fact, a cursory examination1 of existing DTDs reveals a number of cases in which
some element or attribute is specified -- in comments -- as a "unique
identifier". Moreover a number of scientific databases, which are typically
stored in some special-purpose hierarchical data format which is ripe for
conversion to XML, have a well-organized hierarchical key
structure.
Various forms of key specification for XML are to be found in
the XML standard [7], XML Data
[13],
XML Schema [17].
Through the use of ID attributes in a DTD [7], one can
uniquely identify an element within an XML document. However, it is not clear
that ID attributes are intended to be used as keys rather than internal
"pointers". For example, ID attributes are not scoped. In contrast to keys, they
are unique within the entire document rather than among a designated set of
elements. As a result, one cannot, for example, allow a student (element) and a
person (element) to use the same SSN as an ID. Moreover using ID attributes as
keys means that we are limiting ourselves to unary keys and, of course, to using
attributes rather than elements. Finally, one can specify at most one ID
attribute for an element type, while in practice one may want more than one key.
XML Data introduces a notion of keys explicitly. However, its keys can only be
specified in types and moreover, can only be defined for element types rather
than for certain collections of elements.
XML Schema has a more elaborate
proposal, which is the starting point of this paper. The proposal extends the
key specification of XML Data by allowing one to specify keys in terms of XPath
[11]
expressions. There are a number of technical problems in connection with XPath.
XPath is a relatively complex language in which one can not only move down the
document tree, but also sideways or upwards, not to mention that predicates and
functions can be embedded as well. The main problem with XPath is that questions
about equivalence or inclusion of XPath expressions are, as far as the authors
are aware, unresolved; and these issues are important if we want to reason about
keys as we do in relational databases. Yet until we know how to determine the
equivalence of XPath expressions, there is no general method of saying whether
two such specifications are equivalent. Another technical issue is value
equality. XML Schema restricts equality to text, but the authors have
encountered cases in which keys are not so restricted. A more detailed
discussion can be found in section 7.1.
However,
the main reason for writing this note is that none of the existing key proposals
address the issue of hierarchical keys, which appear to be ubiquitous in
hierarchically structured databases, especially in scientific data formats. A
top-level key may be used to identify components of a document, and within each
component a secondary key is used to identify sub-components, and so on.
Moreover, the authors believe that the use of keys for citing parts of a
document is sufficiently important that it is appropriate to consider key
specification independently of other proposals for constraining the structure of
XML documents.
How then, are we to describe keys for XML or, more
generally, for semistructured data? From the start, how we identify components
of XML documents is very different from the way we identify components of
relational databases. Consider the following two structures:
<db>
<student>
<name> Smith </name>
<course> Math2 </course>
<grade> B </grade>
</student>
<student>
<name> Jones </name>
<course> Math2 </course>
<grade> A+ </grade>
</student>
<student>
<name> Brown </name>
<course> Phil5 </course>
<grade> A- </grade>
</student>
</db>
|
name |
course |
grade |
Smith |
Math2 |
B |
Jones |
Math2 |
A+ |
Brown |
Phil5 |
A- |
|
To
identify a tuple in the relation we need to know, say, that name and
course constitute a key. In the absence of a key the only way we can be
sure of uniquely identifying a tuple is to give the entire tuple. For relational
databases, the way we specify a key constraint is to say that if two tuples
agree on their key attributes they agree everywhere. By contrast, XML documents
are, first of all, documents and we can therefore use the position in the
document (say a byte offset) to identify some part of it, therefore the way we
might constrain the XML document is to say that if two elements agree on the
name and course subelements then they are the same element.
Put in the contrapositive: two distinct student elements must differ on a
name or course subelement. This raises two issues that precede
any discussion of the structure of keys: that of node identification and that of
equality. The latter is a thorny topic, but needs some attention.
Organization. The rest of the paper is organized as follows.
Section 2
introduces the notion of node addresses and value equality. Node addresses are
used in node equality testing, i.e., testing whether two nodes are the same node
and value equality is used for testing whether two nodes have the same value.
Section 3
introduces our path expression language which is used in the definition of keys
discussed in section 4. Section 5 addresses issues
in connection with reasoning about XML keys. The concept of relative or
hierarchical keys together with its alternative notation is discussed in
section 6.
In section 7, we examine the
XML-Schema proposal in some detail, discuss an alternative form of keys and
various issues concerning keys.
2 Node addresses and equality
The Document Object Model (DOM) [3] provides
some insight into a semantics for XML documents. According to the DOM, a
document is a hierarchical structure of nodes. Nodes are of several types, but
there are three types that are important to this discussion: element nodes,
attribute nodes, and text nodes. As illustrated in Figure 1 text nodes (T) have
no name but carry text, attribute nodes (A) have both a name and carry text, and
element nodes (E) have a name. Element nodes may have children; attribute and
text nodes are terminal. In addition the DOM specifies how to reach the children
of an element node. Text and element children are held in what is essentially an
array, the index in the array being determined by the order of the subelements
in the document. Attribute children are held in a dictionary. The name of the
attribute, which must be unique within an element, is used as the index. These
indexes, an integer for an element or text child, or the name prefixed by an "@"
for attributes, are shown as edge labels in Figure 1. The important
point here is that the edge labels uniquely identify children.
A
consequence of this model is that a path of edge labels from the root uniquely
identifies a node. We shall call such paths node addresses and write
them ál1#...#lnñ, for example á1#2#1ñ and á1#3#@numñ. Node addresses will be our basic means of identifying
nodes. Note that an attribute name can only occur at the end of a node address.
We can also talk about the address of a subnode relative to a node. For
example any subnode of a node with address áañ will have a node address
of the form áa# bñ where ábñ is the address of the subnode relative to áañ. By a subnode of
a node x we mean any node in the subtree rooted at x, not
necessarily a child node of x.
<db>
<composer>
<name> J.S. Bach </name>
<born> 1685 </born>
<work num="BWV82">
<title> Ich habe genug </title>
</work>
<work num="BWV552">
</work>
</composer>
<composer period="baroque">
<name> G.F. Handel </name>
<work num="HWV19">
<title> Art Thou Troubled? </title>
</work>
</composer>
</db>
|
Figure 1: Some XML and its representation as a
tree
Value equality. Equality is essential to the
definition of keys, and in order to define keys we need first to define equality
of the "values" associated with nodes. XML-Schema restricts equality to text
nodes, but the authors have encountered cases in which keys are not so
restricted. An immediate example is that when one treats name as a key
for person nodes, name may have a complex structure consisting
of first-name and last-name subelements. A more general way of
describing equality is to use tree equality. The value of a node is specified by
giving (1) a set S of relative addresses of its subnodes, (2) a partial
function from S to names and (3) a partial function from S to
strings. Two nodes are value-equal if they agree on (1), (2) and (3).
With respect to the textual representation of an XML element, this definition
states that the order of attributes is unimportant in defining equality. Observe
that the order of subelements is specified and preserved by their indexes
(integers).
Notation. We shall use =v for value equality.
It should be
pointed out that neither equality of text nodes nor tree equality is entirely
satisfactory in the presence of types. XML-Schema does a thorough job of
defining base types, and one might want to use this to define a coarser form of
equality. For example, áid
type="int"ñ 0 á/idñ and
áid type="int"ñ -0 á/idñ should
probably be treated as value-equal. Also, there are types such as real numbers
for which equality is problematic. A complete specification of keys would have
to take account of these issues.
3 Path Expressions
A
path expression is an expression involving node names (tags and attribute names)
that describes a set of paths in the document tree.
The choice of what
language we use to define path expressions is important to the expressive power
of keys, and there are a number of choices. In XML-Schema, XPath [11]
expressions are used, while in semistructured data regular expressions [1] have been
used. Neither subsumes the other. In the following analysis we shall assume two
properties of path expressions:
- There should be a concatenation operation: P.Q is the result
of following first the path P and then the path Q.
- A path should move down the tree. That is if we start at a node
n1 and, by following a path described by
P, we reach a node n2 then
n2 is a subnode of n1 (the address n1 is
a prefix of the address n2.)
The
second property is not enjoyed by XPath. We shall discuss the choice of a
language of path expressions later, but in the meantime adopt for illustrative
purposes a simple language that is certainly a subset of both XPath and regular
expressions. Our language for path expressions has the following syntax:
- The empty path, "e".
- A node name (a tag or attribute name).
- A wild card "_", matching any single node name.
- An arbitrary path "_*".
- The concatenation of paths P.Q, where P and Q
are paths defined by these rules.
We have chosen an alternative syntax
to that of XPath because the concatenation operation, which is central to our
understanding of keys, does not have a uniform representation in XPath. However,
the translation to XPath is straightforward: Any path meant to start from the
root is prefixed with "/". In XPath, "/" itself denotes the root node. "." is
used as the empty path in place of "e", "*" in place of
"_" and "//" in place of "_*". Also, "/" is used as the path concatenator in
place of ".". In XPath, "/" is used as a separator between location steps.
Therefore, we have to disallow certain concatenations. If for example we
concatenate a/b with /c/d we get
a/b//c/d with an entirely different
meaning.
We shall use the notation n[[P]] to denote the set
of nodes (node addresses) reached by starting at node n and following a
path that conforms to (is in the language of) P. We shall sometimes use
[[P]] as an abbreviation for root[[P]]. The syntax is
borrowed from Wadler's [18]
description of semantics for patterns in XSL. Examples (from Figure 1):
á2#2ñ[[title]] |
= |
{ á2#2#1ñ } |
[[composer._]] |
= |
{ á1#1ñ, á1#2ñ, á1#3ñ, á1#4ñ,á2#1ñ, á2#2ñ
} |
á2#2ñ[[_*]] |
= |
{ á2#2ñ, á2#2#1ñ, á2#2#1#1ñ, á2#2#@numñ} |
[[composer.work]] |
= |
{ á1#3ñ, á1#4ñ, á2#2ñ
} |
[[_*.num]] |
= |
{ á1#3#@numñ,
á1#4#@numñ, á2#2#@numñ } |
In some cases,
it will be useful to restrict the path expression language so that paths are
merely sequences of labels and do not contain _ or _*. Such paths are called
simple paths. For example, composer.work is a simple path.
4 Definition of Keys
In
defining a key we specify two things: a set on which we are defining the key (in
relational databases this is a relation -- the set of tuples identified by a
relation name) and the "attributes" (relational terminology for a set of column
names) which together uniquely identify elements in the set. This is the
motivation for our central definition of a key specification, which is
a pair (Q,{P1, ... ,
Pn}) where Q is a path
expression and {P1, ... ,
Pn} is a set of simple path
expressions. The idea is that the path expression Q identifies a set of
nodes, which we refer to as the target set, on which the key constraint
is to hold. Let us refer to Q as the target path, and the set
{P1, ... , Pn} as the key paths. These correspond to the
absolute and relative location paths respectively in XPath terminology. Observe
that for any node n Î [[Q]] there is a
set of nodes n[[Pi]] found
by following Pi from n.
There is no restriction on the size of n[[Pi]]; in particular it may be empty. The key paths
constrain the target set as follows: Take any two nodes (n1,n2) Î [[Q]] and consider the pair of sets of nodes found
by following the key path Pi from
n1 and n2, (n1[[Pi]],
n2[[Pi]]). If there is a non-empty intersection with
respect to value equality for all such pairs of sets of nodes then the nodes
n1 and n2 are the same node. For example, consider the following key
definition:
(person.employees,
{name.firstname, name.lastname})
The target path person.employees identifies a set of nodes in the
document. This
is the target set. Each of these nodes will define a subtree with an employees label at the root. Within such a subtree
we will find zero or more key paths name.firstname and zero or more key paths name.lastname. Two nodes n1, n2 in the target
set are distinct if either they do not agree on any of the nodes reachable via
key path name.firstname or they do not agree
on any of the nodes reachable via name.lastname.
As another example, observe
that the document in Figure 1 satisfies the key
(composer, {name}):
There are two nodes at the end of the target path composer. For each node, there is one element in
the set of nodes found by following the key path name, "J.S.Bach" and "G.F.Handel". These elements
are not value-equal. Less intuitively, the document also satisfies the key
(composer, {born})
since the subelement <born> only appears in the first
composer and is absent from the second composer.
We are now ready to
give the formal definition of a key. For reasons which will emerge shortly, it
is useful to define a key with respect to a given node in the document rather
than assuming that the target path starts at the root.
Definition.
A node n satisfies a key specification
(Q,{P1,... , Pk}) iff for any n1, n2 in
n[[Q]], if for all i, 1 £ i
£ k, there exist z1 Î n1[[Pi]] and
z2 Î
n2[[Pi]] such that z1 =v
z2, then n1 = n2. That is,
|
" n1 , n2 Î
n[[Q]] (( |
|
$ z1 Î
n1[[Pi]] $ z2
Î n2[[Pi]] (z1 =v
z2)) ® n1 = n2). | | |
Note
that both forms of equality are used in the definition of a key. The first deals
with value-equality (=v) while the second
is node equality (=). Two nodes are node equal if they have the same node
address.
When we talk about document satisfying a key specification we
mean that the root of the document satisfies the key specification. The key has
no impact on those nodes at which some key path is missing, i.e. nodes
n such that n[[Pi]]
is empty for some Pi. Observe that
for any n1, n2 in [[Q]], if Pi is missing at either n1 or n2 then
n1[[Pi]] and n2[[Pi]] are by
definition disjoint. This is similar to unique constraints introduced
in XML-Schema. In contrast to unique constraints, however, our notion of key
specification is capable of comparing nodes at which a key path may lead to
multiple nodes. As an example, consider a key (A, {B}) expressed with
respect to the root of the following document:
<db>
<A> <B> 1 </B> </A>
<A> <B> 1 </B> <B> 2 </B> </A>
</db>
This key asserts that an A element is
uniquely identified by the values of its B
subelements. The document does not satisfy the key because the B subelement in the first A element and the first B subelement of the second A element have the same value. And with our
definition of keys, these two A elements are required to be the same element.
Here
are some further examples of keys, expressed with respect to the root of a
document.
(_*.person, {id}) |
Any person
element, if it has id subelements, is
uniquely identified by the values of the id's. In other words, any two person elements are disjoint on their id fields up to value-equality. |
(person, {e}) |
Any two person nodes immediately under the root have
different values (e is the empty path). |
(employees, {}) |
An empty key. This means that the path employees, if it exists, is unique at the
root. That is, there is at most one employees node immediately under the
root. |
(_*,{id}) |
Any element that has id subelements is uniquely identified by the
values of the id's. That is, any two
nodes are disjoint on their id fields
up to value-equality. Note that an id element does not have to
have an id itself. This key captures
the semantics of an ID attribute in the XML standard in that id
is unique within the entire document. |
As with keys
in relational databases, this definition of a key asserts that the values
associated with key paths uniquely identify a node in the target set. However
since one cannot require XML documents to be in some kind of first normal form,
there are important differences between the two definitions. First, the paths
that define keys need not exist 2 and do not have to be unique. In contrast, in
relational databases since key values cannot be null, the key must exist.
Moreover, first normal form requires attribute values to be atomic values, not
sets. Second, our key paths specify a set of addresses within a document, unlike
the relational case in which keys specify a value.
There
are, of course, other ways of defining keys, both more and less restrictive than
what we have described. Some justification of the choices is in order.
- We have used a set of key paths to define a key. In order to talk
about a set (as opposed to a tuple or list) of path expressions we need to be
able to talk about equality of path expressions. The equivalence of two path
expressions in our language of path expressions is decidable, as it is for the
more general class of regular expressions.
- Given that we have defined equality on trees, do we need to have more than
one key path in a key specification? We could always design our documents so
that all the key "attributes" are represented as subnodes of some node. The
problem here is that we would have to constrain the node to contain only these
subnodes for tree equality to have the desired effect. This seems to be too
restrictive and constitutes unnecessary interference between key
specifications and data models.
- The definition of key satisfaction differs significantly from the
relational case by allowing a (possibly empty) set of nodes at the end of each
key path. We shall examine a more restrictive definition in which key
satisfaction requires each of the key paths to exist and to be unique from any
node in n[[Q]] in Section 7.
- The language of path expressions may be regarded both as too weak and too
powerful. Consider the key (Q,{P1,..., Pk}):
For now, we have allowed Q to be an arbitrary path expression but have
restricted the Pi to be simple
paths. Would one ever want an arbitrary path (_*) in one of the
Pi? Also, it is not hard to come
up with examples in which one would like something more powerful to express
Q, e.g., (person.(mother|father)*,
{id}). This means a person element followed by zero or more father or mother elements. Our emphasis is that the
language of path expressions is provisional, and that allowing arbitrary path
expression for the Pi merely
complicates the definition of key but does not change much in the way of the
theory.
5 Key Inference
In relational
databases one can infer some keys from the presence of others. Indeed, if a set
S of attributes is a key for a relation R, then any superset of
S is also a key for R. This obvious fact is of great importance in
query optimization. Keys are typically used as physical indexes, and this simple
inference rule tells us when we have enough information to use such an index.
For XML keys as we have presented them so far, the inference rules are far from
obvious. These rules are fully discussed in a companion paper [
8]. Here are some examples.
Fact. If
(Q, S) is a key and S Í
S', then so is (Q, S').
This is the counterpart of
the relational inference rule. Below are two examples that have no such
counterpart.
Fact. If (Q.Q',{P}) is a key
then so is (Q,{Q'.P}).
This is sound because in a
document with a tree-like structure, sharing of nodes is not allowed. As a
result, if a node is identified in a tree then its ancestors are also
determined. In other words, if a key path P uniquely identifies a node
n in [[Q.Q']] then Q'.P is a key path for the
ancestor of n in [[Q]].
Fact. If (Q,
S) is a key and Q' is contained in Q (i.e., the path
language defined by Q' is included in the one defined by Q), then
(Q', S) is also a key.
This fact is sound because any key of
the set [[Q]] is also a key for any subset of [[Q]]. Observe that
[[Q']] is a subset of [[Q]] if Q' is contained in
Q.
The last fact requires one to reason about the inclusion of
path expressions.
Key inference is closely related to the question of
key implication: suppose it is known that an XML document satisfies certain
keys, does it follow that the document must necessarily satisfy some other key?
We have developed algorithms for reasoning about the inclusion of certain
classes of path expressions as well as for determining implication of XML keys.
A detailed discussion of these algorithms as well as finite axiomatization and
complexity results in connection with our key languages can be found in
[8].
Another
natural question to ask is whether key constraints are finitely satisfiable. In
relational databases, all keys are finitely satisfiable: given any schema
S and any finite set S of keys, one can always
construct a finite database instance of S that satisfies S. The same holds for XML documents under our definition of a
key.
Fact. For any finite set S of keys,
there exists an (finite) XML document satisfying S.
This last fact only holds because key paths may be
missing. Recall the (_*, id) example: if key paths were required to exist
at all nodes specified by the target path the XML document would have to be
infinite to satisfy the key (see strong keys in section 7.)
Also, we
note that the last fact only holds in the absence of DTDs. To illustrate this,
let us consider a simple key
j = (X, { })
and a
simple DTD D: <!ELEMENT foo (X, X)>
Obviously, there exists a finite XML document that conforms to the DTD
D (see, e.g., Fig. 2 (a)), and there is a
finite XML document that satisfies the key j (e.g.,
Fig. 2 (b)).
However, there is no XML document that both conforms to D and satisfies
j. This is because D requires an XML tree to
have two distinct X elements, whereas j requires
that there is at most one X node immediately under the root. This shows
that DTDs interact with XML key constraints. It should be mentioned that keys
defined in other proposals for XML, such as those introduced in XML Schema
[17],
also interact with DTDs or other type systems for XML. For a study of the
interaction between constraints such as keys and DTDs see [12].
Figure 2: An XML tree conforming to D, and an XML
tree satisfying j
6 Relative Keys
The need for relative keys is partly motivated by scientific data formats.
Many scientific databases do not use conventional database technology, and even
those that do transmit their data in one of a variety of data formats. Some of
these data formats are general purpose (such as ASN.1, used in GenBank [6],
and ACeDB [16])
while others are domain specific (such as EMBL [4]).
These data formats have easy translations to XML. XML itself is also emerging as
a standard for data exchange, especially with micro-array data (see for example
the DTDs GEML [20] and
MAML [21]).
All of these specifications have a hierarchical structure, and typically at the
top level consist of a large set of entries (the order of which is usually
unimportant). Molecular biology databases contain particularly rich structures
of metadata. In the protein sequence database Swissprot [5] there is an
accession number (a key) for each entry. Within each entry there is a sequence
of citations, each of which is identified by a number 1,2,3... within the entry.
Thus to identify a citation fully, we need to provide both an accession number
for the entry and the number of the citation within the entry.
Another
intriguing example is to be found in linguistic databases.3 In this case the data sets (typically recordings of
speech) are held in files, but the metadata is provided in part by the directory
structure [19]: /timit/train/dr1/fcjf0/sa1.wav
(TIMIT corpus, training set, dialect region 1, female speaker, speaker-ID
"cjf0", sentence text "sa1", speech waveform file.) It would be quite reasonable
to represent such metadata in XML, but it is immediately obvious that it
requires a non-trivial hierarchical key structure.
In
relational database design we also find the notion of a hierarchical key
structure in weak entities. The key of a weak entity consists of the
parent key and some additional identification of the dependent entity [10]
(e.g. course Math120, section B).
To describe hierarchical key
structures we introduce the notion of a relative key, which consists of
a pair (Q, K) where Q is a path expression and K is
a key.
Definition.
A document satisfies a relative key specification (Q, (Q',S)) iff for all nodes
n in [[Q]], n satisfies the key (Q',S).
In other words (Q, K) is a relative key if K is a
key for every "sub-document" rooted at a node in [[Q]]. Examples:
(bible.book.chapter, (verse, {number})) |
A verse number uniquely identifies a verse
within a chapter. |
(bible.book, (chapter, {number})) |
Chapter numbers uniquely identify a chapter
within a book. |
(bible,
(book, {name})) |
If there is only one bible node immediately under the root, this is
the same as specifying a key (bible.book,{name}). |
Observe that in a
relative key (Q, (Q', S)), Q starts from the root
whereas Q' starts at a node in [[Q]]. It is for this reason that
we defined key satisfaction at arbitrary nodes.
Transitivity of
relative keys. The purpose of keys is to uniquely specify certain components
of a document. Obviously, a relative key such as (bible.book.chapter, (verse, {number})) alone does not uniquely identify a particular
verse in the bible. However we believe that if we give a book name, a chapter
number, and a verse number, we have specified a verse. It is this intuition that
we need to formalize.
First
observe that the relative key (e, (Q',S))
is equivalent to the key (Q',S). Thus keys defined in
section 4. are a
special case of relative keys. To distinguish these two notions we refer to the
former as absolute keys or simply keys. Now consider two
relative keys. We say that (Q1,
(Q1',S1)) immediately precedes (Q2, (Q2',S2)) if
Q2= Q1.Q1'. Also, any
absolute key immediately precedes itself. Define the precedes relation
as the transitive closure of the immediately precedes relation.
Definition. A set S of relative keys is
transitive if for any relative key (Q1, (Q1',S1))Î S there is a key (e, (Q2',S2))Î S which precedes
(Q1, (Q1',S1)).
As an
example, this set of keys is transitive:
(e, (bible.book, {name})) |
(bible.book, (chapter, {number})) |
This set
is not:
(e, (bible.book, {name})) |
(bible.book.chapter, (verse, {number})) |
Any
transitive set of relative keys must contain some absolute key.
Updatable relative keys. Consider the following (transitive) key
specification:
|
(e, (university, {name})) |
(university,
(dept.employee, {emp-id})) | |
To
identify an employee node in this
database, we need only to specify a university name and an emp-id within that university. However, to add
a new employee to the database, we clearly need to specify a department for the
employee. However, although this key specification is transitive, there is no
way to identify a department and hence there could be many ways to add an
employee. This motivates our final definition of updatability as shown
below: With updatability, one can always insert an element in the "keyed" part
of the document unambiguously by specifying where to insert the element using
keys.
Definition. A set S of relative
keys is updatable if it is transitive and whenever (Q1, (Q2.n,S1))Î S there is a relative key
(Q1, (Q2,S2))Î S where |Q2|>0. Here n is a node
name.
Informally, this definition gives us the property that every
element with a prefix along the path Q1.Q2 can be identified
through some keys. Therefore, it is easy to see that the addition of the
following key will make the previous example updatable. In particular, to insert
an employee, we now can specify which department they are in (in addition to the
university).
|
(university,
(dept, {dept-name})) | |
Even
though we can now add new employees, there is still something anomalous:
Although employees are nested under departments, nothing about the department is
necessary to identify them. This is reminiscent of the anomalies that occur in
non-second normal form of relational databases. There is something wrong with
the design of this document in that employees should not be children of
department nodes, but only of university nodes. The linkage between employees
and departments should be expressed through a foreign key. Formalizing the
concept of a well-designed document with respect to its key specification is
beyond the scope of this paper.
6.1 A notation for relative keys
If a system
of relative keys is transitive, it forms a hierarchical structure. We can
therefore create a compressed syntax for such systems. The basic syntactic form
is
Q1{P |
|
,...,P |
|
}.Q2{P |
|
,...,P |
|
}. ... .Qn{P |
|
,...,P |
|
} |
This describes a system of relative
keys: a relative key (Q1. ...
.Qi-1, (Qi,{P1i,...,Pkii})) is
defined for each i, 1£ i£ n. It should be noted that the first of these is of
the form (e, (Q1,{P11,...,Pk11})) and is a
key.
For example
bible{}.book{name}.chapter{number}.verse{number}
specifies the updatable system of keys:
(e, (bible,{})) |
(bible, (book, {name})) |
(bible.book, (chapter, {number})) |
(bible.book.chapter, (verse, {number})) |
So far the
key hierarchies we have specified are linear. Consider the following two
specifications:
company{name}.employee{id} |
company{name}.department{name}. |
It is helpful
to fold these into a single specification:
company{name}[.employee{id}, .department{name}]
This is simply a syntactic shorthand:
R[R1,...,Rn] for RR1,
..., RRn. As a further example,
consider
university{name}.school[{name}, .department[{name}, .student{id}]]
This is another example of a transitive
set of relative keys. It
is worthwhile to remark again that for identifying student nodes, one does not
need to be aware of which school the student belongs to. However, to insert a
new student into the document, one needs specify under which school (in addition
to which university) to insert the student element so as to avoid ambiguity.
Specifications such as these are reasonably compact and understandable.
Their importance is not only to ensure the internal consistency of a document,
but also to tell others how to cite a component of our document. This is
especially important if the document is subject to change. Even though we have
constructed a minimal system for describing hierarchical key structures, it
turns out that this takes us some way towards describing a data model. Contrast
relational database specification student(snum, name, major) and enroll(snum,cnum,grade) with a key
specification
[student{snum}[.name{}, .major{}], enroll{snum,cnum}.grade{}]
They describe closely related
structures. The specification [.name{}, .major{}] ensures that under a student node
there is at most one name and at most one
major node. However the key specification
allows other unspecified nodes to occur under a student node and, of course, it does not require any
kind of first normal form. Nevertheless, we can specify that our documents have
a structured "core" somewhat akin to the complex object or nested relational
structures that have been studied in databases [2]. Not
surprisingly there is close interaction between key constraints and data models
which requires much further study.
7 Discussion
Our
main reason for writing this document was to clarify the notion of a relative
key and to understand the hierarchical key structure that appears to occur
naturally in a variety of data formats. What we have described here is a
proposal for a key definition, and there are a number of variations on this
definition which should be considered. This section contains a brief review of
those alternatives, starting with the proposals in XML-Schema.
7.1 XML-Schema
XML-Schema includes a syntax for specifying keys which is
related to our definition, but there are some substantive differences, even if
we ignore the issue of relative keys. Possibly the most important of these is
that the language for path expressions is XPath. As
mentioned before, XPath is a language used for accessing parts of XML documents.
XPath
supports a variety of axes that allows one not only to move down an XML document
tree from a node, but also to move to its ancestors and siblings. Moreover, one
can embed predicates or even functions in XPath. Moreover, one can embed
predicates or even functions in XPath. For example
/A/B[last()]/C/D/E/ancestor::* selects all ancestor nodes along the
path A.B.C.D.E starting from the root. Observe that a predicate
(qualifier) is specified in the expression: B must be the last
B child of A. With such complex functionality, questions about
the equivalence or inclusion of XPath expressions remains open. As demonstrated
by examples in Section [5],
these issues are important if we want to reason about keys as we do -- for quite
practical purposes -- in relational databases. Here is a brief summary of the
other salient differences between our definitions and the XML-Schema
proposal.
- Equality.
- We have used a more general form of equality than that in XML-Schema.
However, as pointed out in Section 2 a full treatment of
equality might involve types or even some form of user-defined equality.
- Definition of the target set.
- In XML-schema the path expression that defines the target set is taken to
start at arbitrary nodes. Recall
that in a key (Q, (Q', S)) of our notation, the target path
Q always starts from the root (also recall that an absolute key (Q',
S) is equivalent to (e, (Q', S))). But it
is straightforward to let Q start from arbitrary node: one needs simply
to substitute _*.Q for Q in our notation. More specifically, we
write (_*.Q, (Q', S)) (observe that _*.Q starts from the
root). It is, of course, possible to "root" a path expression XML-Schema.
- Definition of key paths.
- XML-Schema talks about a list (not a set) of key paths. While
this avoids issues of equivalence of XPath expressions, one can construct keys
that are, presumably, equivalent, but have different or anomalous
presentations. For example (temporarily using [...] for lists): (person,[firstname, lastname]), (person,[lastname, firstname]), (person,[lastname,lastname, firstname])
impose the same constraint. Since the issue of equivalence of XPath
expressions is unresolved, there is no general method of checking whether two
such specifications are equivalent.
- Relative keys.
- While there is no direct notion of a relative key in XML-Schema, in
certain circumstances one can achieve a related effect. Consider for example:
book{name}.chapter{name}.verse{number}
In XML-Schema one can specify a key
for verse (this is not XML-Schema syntax)
as
(book.chapter.verse, [number, up.name, up.up.name])
Here
"up" is the XPath instruction to move up one
node. Thus part of the key is outside of the value of a verse node. One of the inferences one could make for
such a specification is that (book.chapter, [name,up.name]) is a key
provided the nodes in the target set all contain at least one
verse child node. Again, it is not clear how to reason generally
about such specifications.
7.2 Some stronger definitions of keys
The
definition of keys we have adopted in this paper is quite weak, which we believe
is in keeping with the semi-structured nature of XML. However, intuitively it
does not mirror the requirements imposed by a key in relational databases, i.e.
the uniqueness of a key and equality of key values. We now
explore a definition which captures both these requirements.
Strong
Keys. In a strong key definition, we require that the keys paths exist and
are unique, i.e. n[[Pi]] contains exactly one node for 1 £ i £ n. The key paths
constrain the target set as follows: Take any two nodes (n1,n2)Î [[Q]] and consider the pairs of nodes found by
following a key path Pi from
n1 and n2. If all such pairs of nodes are value-equal, then the
nodes n1 and n2 are the same node.
As an example of what it means
for a path expression to be unique, consider Figure 1: name is
unique at á1ñ, but work and num are not unique at
this node.
The definition of satisfaction for strong keys now becomes the
following.
Definition. A node n satisfies a key
specification (Q,{P1,... ,
Pk}) if
- For all n' in n[[Q]] and for all Pi (1£ i £ k), Pi is unique at n'.
- For any n1, n2 in n[[Q]], if n1[[Pi]]
=vn2[[Pi]]
(1£ i £ k)
then n1 = n2.
To distinguish the two definitions of keys
let us refer to keys defined above as strong keys and the keys defined
in Section 4 as weak keys, Given this strong notion of keys, let us
re-examine some examples given before.
(_*.person, {id}) |
Any two person elements, no matter where they occur, have
unique id subelements and differ on those elements. |
(person,
{e}) |
The interpretation of this key remains unchanged
under a strong key semantics. |
(employees, {}) |
Again, the semantics of this key is the same
with respect to the strong and weak key specifications. |
(_*,{k}) |
This requires that every element has a key
k, including any element whose name is
k. |
The last example illustrates that under a
strong key semantics, finite satisfiability (finite model property) does not
hold for all keys: The key (_*, {k})
imposes an infinite chain of k nodes and therefore, there is no finite
document satisfying it. The problem arises because we require that key paths
must exist. It should be mentioned that the corresponding key in XML-Schema,
(//*, [id]), is not meaningful either,
because an id node cannot have a base type if
it is to have an id subelement
itself.
Due to the existence requirement on key paths in the definition
of strong keys, a strong key imposes certain structural (typing) constraints
which are typically found in schema specifications in a traditional database
system. For example, the following document does not satisfy the strong key
(A, {B}) since the
key requires that B elements must exist under every A element
(and be unique). In other words, it does not allow keys paths to have a "null"
value. In contrast, the same document satisfies the weak key (A, {B}) as a weak key permits
"null" value. Observe, however, the weak key clearly does not allow one to
distinguish between these A elements.
<ROOT>
<A> 1 </A>
<A> 2 </A>
</ROOT>
It should be mentioned that the distinction between (traditional)
structural constraints (types) and (traditional) integrity constraints is not
always well-defined. It is dictated largely by what conventional programming
languages treat as types. See [9]
for detailed discussion on this topic.
The concept of relative keys can
be naturally adapted for strong keys as well. We say a document satisfies a
strong relative key specification (Q,(Q',S)) iff for all
nodes n in [[Q]], n satisfies the strong key
(Q',S).
The strong notion and weak notion of keys impose
different restrictions on key paths. At one end of the spectrum, all key paths
must exist and be unique (strong keys). At the other end, no structural
constraints are imposed on key paths (weak keys). There are also possibilities
in between; for example, adopting a slightly stronger notion of weak keys which
substitutes equality for value intersection of the node sets
reachable by a simple key path.
7.3 Choice of a path expression language
We
have used a language for path expressions that contains just enough to
illustrate most of the issues that occur in connection with keys for XML. In
order to reason about keys, it is essential that equivalence and inclusion of
path expressions are decidable. This is the case for the more expressive
language of regular expressions, and we could equally well have used this
language; none of the results would be affected. However the examples we found
that used the added expressive power were somewhat contrived, and it is not
clear whether this larger language is of practical use.
An interesting
issue is whether, in defining a key (Q,{P1, ..., Pn}),
the language used to describe the target path Q needs to be the same as
the language used to define the key paths P1, ..., Pn. One
could choose a simpler language for key paths that is a sublanguage of
the language for target paths. In fact, we only require that the composition
Q.Pi of a target path and a
key path should be in the language of target paths.
To simplify the
discussion, so far we have required key paths to be simple paths. However, we
could see no other benefit to simplifying the language of key paths. Below we
extend the current proposal by allowing key paths to include _ and _*, i.e., to
be expressed in the same language that defines target paths. To do so, we first
define a notion of value intersection. Observe that the regular
language defined by a path expression is a set of simple paths. Let us use r to range over simple paths. Given a path expression
P, we use r Î P
to denote the simple path r in the language defined by
P.
Value intersection. Let n1 and n2 be two nodes
in an XML tree T and P be a path expression in the language
defined in Section 3. The value
intersection of n1[[P]] and
n2[[P]], denoted by
n1[[P]] Çv n2[[P]], is defined as follows:
n1[[P]] Çv n2[[P]] = { (z, z') | $ r Î
P, zÎn1[[r]], z'În2[[r]], z=v
z'}
Intuitively, n1[[P]] Çv n2[[P]] consists of pairs of nodes that are value
equal and are reachable by following the same simple path in the language
defined by P starting from n1 and
n2, respectively.
Using this
notation, we extend our key specification as follows.
Key
specification. A key is a pair (Q,{P1, ..., Pn}),
where Q and Pi's are path
expressions in the language defined in Section 3. A node n
satisfies the key iff for any n1,
n2 in n[[Q]], if for all
i, 1 £ i £
k, the value intersection of n1[[Pi]] and
n2[[Pi]] is not empty, then n1 = n2. That is,
" n1 n2
Î n[[Q]] (( |
|
(n1[[Pi]] Çv n2[[Pi]]
¹ Ø) ®
n1 = n2). |
It
should be mentioned that the complexity results of [12] were
developed for this general definition of keys.
7.4 Node names as key values
The choice of an
appropriate definition for keys for XML will ultimately be determined by
practice. The aim of setting out a key specification is to cover the practical
cases without using definitions that are too complex to allow any kind of
reasoning about keys. Have the proposals in this paper covered the practical
cases?
There is one issue that may arise in "unconstrained" XML. Consider
the database
<db>
<parts>
<widget>
<id> 123 </id> <weight> 1.5 </weight>
</widget>
<widget>
<id> 234 </id> <weight> 2.5 </weight>
</widget>
<gadget>
<id> 123 </id> <weight> 3.2 </weight>
</gadget>
</parts>
</db>
The type of a part -- widget or
gadget -- is expressed in the tag. In
alternative XML representations it might be expressed as an attribute or
subelement of a part element. The key for a
part is to be taken as its type together with its id. With our current machinery, the key constraint can
be expressed as parts{}[.widget{id}, .gadget{id}]. However, if we introduce a new part
type, a thingy, the key specification will
have to be changed to include a key path involving thingy. No change would be needed in the alternative
representations. The problem arises because we are interchanging structure (the
names) with data (their values); but the ability to do this is supposed to be
one of the strong points of semistructured data and XML.
Our definition
of a key (weak or strong) can be extended to express this by adding a "virtual"
subelement, node-name to each named node,
whose value consists of the node name. With this extension, the key for our
example can be expressed as parts{}._{node-name, id}.
This
does not alter any of the properties we expect to hold for keys and appears to
account for any practical use of tag names in keys.
Acknowledgements. We are grateful to Byron Choi,
Hartmut Liefke, Arnaud Sahuguet, Keishi Tajima, Chris Brew and Henry Thompson
for a number of useful comments and discussions.
References
- [1]
- Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. From
Relations to Semistructured Data and XML. Morgan Kaufman, 2000.
- [2]
- Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of
Databases. Addison-Wesley, 1995.
- [3]
- Vidur Apparao, Steve Byrne, Mike Champion, Scott Isaacs, Ian Jacobs,
Arnaud Le Hors, Gavin Nicol, Jonathan Robie, Robert Sutor, Chris Wilson, and
Lauren Wood. Document Object Model (DOM) Level 1 Specification. W3C
Recommendation, October 1998. http://www.w3.org/TR/REC-DOM-Level-1
- [4]
- Baker, Wendy and van den Broek, Alexandra and Camon, Evelyn and Hingamp,
Pascal and Sterk, Peter and Stoesser, Guenter and Tuli, Mary Ann. The EMBL
Nucleotide Sequence Database In Nucleic Acids Research, 28(1):19-23,
2000.
- [5]
- A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database
and its supplement TrEMBL. In Nucleic Acids Research, 28:45-48, 2000.
- [6]
- D. Benson and I. Karsch-Mizrachi and D. Lipman and J. Ostell and B.A. Rapp
and D. Wheeler. GenBank. In Nucleic Acids Research, 28(1):15-18,
2000.
- [7]
- Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. Extensible
Markup Language (XML) 1.0. World Wide Web Consortium (W3C), Feb 1998. http://www.w3.org/TR/REC-xml
- [8]
- Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew
Tan. Reasoning about keys for XML. University of Pennsylvania.
Technical Report MS-CIS-00-26, 2000. http://db.cis.upenn.edu/DL/absolute-full.ps
- [9]
- Peter Buneman and Wenfei Fan and Jerome Simeon and Scott Weinstein.
Constraints for Semistructured Data and XML. SIGMOD Record 30(1),
March 2001.
- [10]
- Byron Choi and Arnaud Sahuguet. DTD Inquisitor Demonstration. http://xml.cis.upenn.edu/DTDi/
- [11]
- James Clark and Steve DeRose. XML Path Language (XPath). W3C
Working Draft, November 1999. http://www.w3.org/TR/xpath
- [12]
- Wenfei Fan and Leonid Libkin. On XML Integrity Constraints in the
Presence of DTDs. In PODS 2001.
- [13]
- Andrew Layman, Edward Jung, Eve Maler, and Henry S. Thompson.
XML-Data. W3C Note, January 1998. http://www.w3.org/TR/1998/NOTE-XML-data
- [14]
- Raghu Ramakrishnan and Johannes Gehrke. Database Management
Systems. McGraw-Hill Higher Education, 2000.
- [15]
- Arnaud Sahuguet. Everything You Ever Wanted to Know About DTDs, But
Were Afraid to Ask. In WebDB, 2000.
- [16]
- J. Sulston and Z. Du and K. Thomas and R. Wilson and L. Hillier and R.
Staden and N. Halloran and P. Green and J. Thierry-Mieg and L. Qiu. The C.
elegans genome sequencing project: {A} beginning. In Nature,
356(6364):37-41, 1992.
- [17]
- Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn.
XML Schema Part 1: Structures. W3C Working Draft, April 2000.
http://www.w3.org/TR/xmlschema-1
- [18]
- Philip Wadler. A Formal Semantics for Patterns in XSL. Technical
report, Computing Sciences Research Center, Bell Labs, Lucent Technologies,
2000. http://www.cs.bell-labs.com/who/wadler/topics/xml.html
- [19]
- TIMIT. CDROM TIMIT Directory and File Structure. http://www.ldc.upenn.edu/readme_files/timit.readme.html
- [20]
- GEML. http://www.geml.org/
- [21]
- MAML. http://www.oasis-open.org/cover/maml.html
- *
- University of Pennsylvania. Email: {peter,susan,wctan}@saul.cis.upenn.edu. Supported in part
by Digital Libraries 2 grant DL-2 IIS 98-17444 and NSF DBI99-75206.
- #
- Temple University. Email: fan@joda.cis.temple.edu. Supported by RIF fund.
- %
- Universidade Federal do Parana, Brazil. Email: carmem@inf.ufpr.br
- 1
- We used the "DTD Inquisitor" of Byron Choi and Arnaud Sahuguet [15,
10].
- 2
- This might be taken as allowing null-valued keys, but whether we
should equate missing key paths with null values is arguable and depends on
the semantics of the languages we use to query XML documents.
- 3
- We are grateful to Mark Liberman and Steven Bird of the Linguistic Data
Consortium at the University of Pennsylvania for providing us with this
example.
This document was translated from LATEX
by HEVEA.