In her note, Jennifer Widom does an excellent job of laying out the background of the XML revolution and some of the research issues XML raises for the database community. She spells out several important research directions we must address in order to enable efficient manipulation of XML data: storage and indexing, efficient processing, the definition of a data model and query language(s), and the handling of integrity constraints, views and triggers in the context of XML. I especially liked her mention of the area of Databases and Information Retrieval, which is likely to get much more attention in the XML world, and I wholly support her call for a common XML benchmark and for significant performance evaluation.
As she notes, many of these issues are already being studied in the community. In particular, some recent work on storage and indexing of XML data has significantly raised the bar in this area, and the definition of a data model and query language(s) for XML is being pursued by several influential committees. Nonetheless, the points raised in her document can keep us busy for quite a while to come.
In this short response I make two points based on what I found missing from Jennifer's note. The first point concerns what I believe will be one of the main applications enabled by XML, namely data integration. XML provides a platform for organizations to share data in a common syntax (not common semantics yet!). In a sense, XML is making the large-scale data integration envisioned in systems such as TSIMMIS and Information Manifold an imminent reality. However, as I discuss below, the context of XML raises several challenges for database research.
Still, this is not satisfying (surely, nobody is surprised to hear that I'm raising new issues in data integration). The second point concerns whether the context of XML raises any fundamentally new issue. My answer is no and yes. I agree with the conclusion that emerges from Jennifer's note: while we have to reconsider many of the issues we face in database systems, there is no fundamentally new issue. However, I do think there is a new way we should be thinking about these problems. Specifically, I argue that we need to reconsider the issues under different measures of performance and complexity (and I do not mean this solely for theoretical analysis).
In database research we are accustomed to measuring our performance in terms of scale-up in the size of the data (a.k.a. data complexity) and in the size of the queries (query complexity). While the first measure is clearly the hallmark of our field, the second is key for tasks such as query optimization. I argue that we should validate our work against two new measures, described below.
Like Jennifer, I will resist the temptation to insert links to related work throughout this document (largely because it's a pain in HTML).
Given this state of affairs, the prospect of integrating data from a large number of sources on the WWW, as envisioned in some of our research prototypes, is becoming a real possibility. However, several issues need to be addressed to enable such integration. I mention just the main ones here, and again, some of these are already being addressed by current research.
Languages for source descriptions: A key element of a data integration system is a language for describing the contents and capabilities of data sources. These descriptions provide the semantic mapping between the data in the source and the relations in the mediated schema. We need to develop such languages for describing XML sources. The main novel challenges are that (1) the kinds of data restructuring that occur in XML are richer than in relational data, (2) we need to scale up to a very large number of sources, and (3) we need to exploit the knowledge conveyed by accompanying DTDs. (A minimal sketch of source descriptions, and of reformulation over them, follows this list.)
Query reformulation algorithms: We need to develop algorithms for efficiently reformulating user queries (posed over a mediated schema) into queries that refer to the data sources. The known reformulation techniques from the relational case do not extend easily to languages for querying XML.
Translation among DTDs: There will be a significant need for tools that translate XML data conforming to one DTD into data conforming to a different DTD (presumably with semantically related content); a toy translation is sketched after this list.
Obtaining source descriptions: As the number of data sources grows, we can no longer write their content descriptions manually. We must develop methods for automatically (or at least semi-automatically) computing source descriptions for newly introduced XML data sources.
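To make the first two items concrete, here is a minimal sketch (in Python) of LAV-style source descriptions over a mediated schema, together with the bucket-creation step of reformulation. Everything is hypothetical and drastically simplified: the mediated relations and source names are invented, and sources are matched by relation name alone, whereas real descriptions are views with variables, and real reformulation must unify subgoals and verify containment of the candidate rewriting.

```python
# A toy sketch of LAV-style (local-as-view) source descriptions and the
# bucket step of query reformulation. All names here are invented.

# Mediated schema: paper(id, title), author(id, name).
# Each source is described by the mediated relations it can contribute to.
SOURCE_DESCRIPTIONS = {
    "src_dblp":   ["paper", "author"],   # exports both paper and author facts
    "src_arxiv":  ["paper"],             # exports paper facts only
    "src_people": ["author"],            # exports author facts only
}

def build_buckets(query_subgoals):
    """For each subgoal of a mediated-schema query, collect the sources
    whose descriptions can contribute tuples to that subgoal."""
    return {
        subgoal: [src for src, rels in SOURCE_DESCRIPTIONS.items()
                  if subgoal in rels]
        for subgoal in query_subgoals
    }

# Query: q(t, n) :- paper(i, t), author(i, n)
print(build_buckets(["paper", "author"]))
# {'paper': ['src_dblp', 'src_arxiv'], 'author': ['src_dblp', 'src_people']}
```

A full reformulation algorithm would then pick one source per bucket, form a candidate rewriting, and keep it only if it is contained in the original query.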
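The third item can be illustrated just as briefly. The sketch below translates a document conforming to one invented DTD into a document conforming to another, using Python's standard xml.etree.ElementTree; a real tool would be driven by a declarative mapping between the DTDs rather than by hard-coded traversal.

```python
# A toy DTD-to-DTD translation: <book> with <title> and <writer> children
# is mapped to <publication name="..."> with <author> children. Both DTDs
# are invented for illustration.
import xml.etree.ElementTree as ET

SOURCE_XML = """
<book>
  <title>Data on the Web</title>
  <writer>Abiteboul</writer>
  <writer>Buneman</writer>
</book>
"""

def translate(book):
    # Target DTD: <!ELEMENT publication (author*)>, with the title carried
    # as an attribute rather than a child element.
    pub = ET.Element("publication", name=book.findtext("title", default=""))
    for writer in book.findall("writer"):
        ET.SubElement(pub, "author").text = writer.text
    return pub

source = ET.fromstring(SOURCE_XML)
print(ET.tostring(translate(source), encoding="unicode"))
# <publication name="Data on the Web"><author>Abiteboul</author>...
```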
My answer is that for the most part, we will work on the same problems (and, as a community that is comfortable with SQL, we'll grow to love the XML syntax). However, as we do that, I argue that we need to consider two new measures of complexity of the techniques we develop (in addition to the standard data and query complexities). These two measures of complexity should drive both our experimental validation and our theoretical analyses.
By far, the most prevalent type of irregularity is that objects have differing subsets of attributes from a possibly large set of attributes. A different kind of irregularity (which is harder to quantify) is that the attributes are structured differently in a tree rooted by the object. This is one of the reasons that languages for querying XML (or semi-structured data) allow regular path expressions.
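A small example may help. Below, two "book" objects nest the name attribute differently, yet a single regular path expression retrieves it from both; the tuple-based tree encoding and the compilation of the path expression into a Python regular expression are purely illustrative.

```python
import re

# Two books structure the same information differently: one name hangs off
# an author, the other off an editor nested inside a committee.
db = ("book", [
    ("author", [("name", "Abiteboul")]),
    ("committee", [("editor", [("name", "Buneman")])]),
])

def paths(node, prefix=""):
    """Yield (label-path, value) pairs for every leaf of the tree."""
    label, children = node
    path = prefix + label
    if isinstance(children, str):          # a leaf holds an atomic value
        yield path, children
    else:
        for child in children:
            yield from paths(child, path + ".")

# The regular path expression book.(_.)*(author|editor).name, with [^.]+
# playing the role of the wildcard label _.
rpe = re.compile(r"book\.([^.]+\.)*(author|editor)\.name$")

for path, value in paths(db):
    if rpe.match(path):
        print(path, "->", value)
# book.author.name -> Abiteboul
# book.committee.editor.name -> Buneman
```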
The degree of irregularity in the data has major implications for how the data is most efficiently stored and indexed, for how it is described (e.g., by DataGuides), and subsequently for query processing.
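For tree-shaped data, the intuition behind a DataGuide can be sketched in a few lines: merge the database's nodes label by label, so that every label path of the data appears exactly once in the summary. This is a deliberate simplification; actual DataGuides (as in the Lore system) are built over graph-structured data, where the construction is considerably subtler.

```python
# A toy DataGuide-style structural summary for tree-shaped data, with nodes
# encoded as dicts from child labels to child nodes.

def build_guide(nodes):
    """Merge a set of nodes label by label, then recurse on the children."""
    merged = {}
    for node in nodes:
        for label, child in node.items():
            merged.setdefault(label, []).append(child)
    return {label: build_guide(kids) for label, kids in merged.items()}

# Irregular data: the two books carry different subsets of attributes.
books = [
    {"title": {}, "author": {"name": {}}},
    {"title": {}, "editor": {"name": {}, "email": {}}},
]
print(build_guide(books))
# {'title': {}, 'author': {'name': {}}, 'editor': {'name': {}, 'email': {}}}
```

Every label path of the data (e.g., editor.email) occurs exactly once in the resulting summary, which is what makes such summaries useful for indexing and for pruning queries.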
We are currently addressing the following issues:
Query execution for XML: We are adapting our query execution engine to handle XML data. In addition to designing a suitable algebra for manipulating XML, we are developing efficient algorithms for implementing the operators of the algebra.
Query reformulation: Building on our previous work in query reformulation, we are developing reformulation algorithms that scale up to a large number of data sources, and testing these algorithms experimentally.
Learning source descriptions: We are developing algorithms for automatically learning source descriptions (i.e., mappings between the schemas of data sources and a mediated schema). Our approach is based on adapting algorithms recently developed in Machine Learning; a toy matcher is sketched after this list.
Intensional queries: We are developing algorithms for answering intensional queries posed to data integration systems (e.g., containment, equivalence and satisfiability queries); a containment test is also sketched after this list.
Expressive power: We are studying the relative expressive power of a family of languages for querying semi-structured data (focusing on the effect of restructuring primitives on the expressive power).
Web-site management: We have developed the Tiramisu web-site management system. Like the Strudel system, Tiramisu takes a declarative approach to building web-sites. In contrast to Strudel, Tiramisu allows the implementation of a web-site to be carried out by an ensemble of cooperating web-site building tools, thus leveraging their specific advantages.
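As a toy version of learning source descriptions, the sketch below matches a source attribute to a mediated-schema attribute by token overlap over attribute names and sample values; real approaches train a classifier per mediated attribute, and all names and data here are invented.

```python
# A toy schema matcher: score each mediated attribute by the overlap of
# name tokens and sample values with those of the source attribute.

def tokens(s):
    return set(s.lower().replace("_", " ").split())

# Hypothetical mediated attributes, each with a few sample values.
MEDIATED = {
    "author_name": {"alon", "jennifer"},
    "paper_title": {"data", "integration", "xml"},
}

def match_attribute(src_name, src_values):
    def score(med_name):
        return (len(tokens(src_name) & tokens(med_name))
                + len(src_values & MEDIATED[med_name]))
    return max(MEDIATED, key=score)

print(match_attribute("name_of_author", {"alon", "dan"}))  # author_name
```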
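Finally, to give a flavor of the intensional queries mentioned above, here is a brute-force sketch of the classical containment test for conjunctive queries: Q1 is contained in Q2 iff there is a homomorphism from Q2's body to Q1's body that maps Q2's head onto Q1's head. The query encoding is invented, all terms are treated as variables, and the exhaustive search is exponential (the problem is NP-complete), so this is for illustration only.

```python
from itertools import product

# A query is (head_vars, body_atoms); an atom is (relation, args).

def contained_in(q1, q2):
    """Is every answer to q1 also an answer to q2 (q1 contained in q2)?"""
    head1, body1 = q1
    head2, body2 = q2
    terms1 = sorted({t for _, args in body1 for t in args})
    vars2 = sorted({t for _, args in body2 for t in args})
    atoms1 = set(body1)
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))        # candidate homomorphism
        if [h[v] for v in head2] != list(head1):
            continue                       # it must map head to head
        if all((rel, tuple(h[t] for t in args)) in atoms1
               for rel, args in body2):
            return True
    return False

# q1(x) :- edge(x, y), edge(y, x)   -- x lies on a 2-cycle
q1 = (("x",), [("edge", ("x", "y")), ("edge", ("y", "x"))])
# q2(x) :- edge(x, y)               -- x merely has an outgoing edge
q2 = (("x",), [("edge", ("x", "y"))])

print(contained_in(q1, q2))  # True: every 2-cycle node has an outgoing edge
print(contained_in(q2, q1))  # False
```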