Introduction

XML query processing is a fascinating field, using ideas from database theory, graph theory, and compiler theory. A few of the outstanding practical problems are:

This list is by no means exhaustive. Notably missing are topics such as full-text search and data mining, which will be included in future versions of this document as appropriate.

None of these are problems are new to XML, although XML does introduce constraints with sometimes far-reaching impact (especially the type system, node identity, and namespace scope). This document addresses each of the topics above to some degree of detail, with references for additional information, and highlights special aspects of XML when appropriate.

[On a personal note, I first became interested in this problem space while working on virtual collaborative environments, which typically have a large object database and at least two hierarchical views over it: the usual object inheritance hierarchy, and an additional "location" hierarchy.]

XML Abstractions and Realities

Most people who work on XML query processing eventually end up thinking about tree or graph models in the abstract, and then apply those results to XML data models as a special case of the more generic graph models. That's generally a fine approach, but it's important to understand which aspects of XML are significant to these problems, and which aspects can be ignored (easily addressed later, if desired).

Many research papers begin by simplifying the XML data model to element and text nodes only. This oversimplification can lead to many problems down the road. Most of the time, it's entirely appropriate to ignore DTD information, entities, CDATA sections, comments, and processing instructions; however, attributes and namespaces must be a part of any serious XML query processing research.

Attribute nodes are unordered and have important constraints (such as name uniqueness per element). As any implementor will tell you, namespace nodes cause no end of trouble -- in most XML query standards to date, namespaces are a strange blend of syntax (necessary to XML serialization) and semantics (part of the data model). Each namespace is scoped to an element, but appears later, among its attributes. It doesn't help that some XML standards and implementations treat namespace prefixes as significant or that there is no standard syntax for the triple of values that make up each XML name during parsing (the prefix, local, and namespace parts).

Some aspects of XML vary from one query language to the next. Some query data models allow consecutive text nodes, others merge them; some preserve syntactic information (such as character entities and CDATA sections), others do not.

On the flip side, XML imposes some constraints that can be inconvenient, such as requiring a single root element node. For most query languages, the impact is that not every query result can be (directly) serialized as XML.

Some published papers use incorrect testing methodologies, such as attempting to test XML parsing within a host application like a web brower (when the underlying XML parser is available) or using parsers (especially Java ones) in suboptimal ways. Make sure you test XML parsers directly, and with appropriate properties set. Some implementations have default settings that are non-conformant or designed for naive use (for example, non-caching), and consequently require a little work to achieve numbers characteristic of real-world use.

Query and View Definition Language Design

Understanding the design decisions made in existing XML query languages will help greatly in the design of new languages.

XML currently has three well-known and widely implemented query languages: XPath, XSLT, and XQuery. The simplest of the three, XPath provides a path-like syntax for referring to nodes in an existing document. XSLT is an XML-based syntax with an inherently recursive processing model ("template matching"), can copy nodes from multiple input documents and/or construct new XML nodes from scratch. XQuery is a mostly SQL-inspired syntax with a (first-order) functional processing model and functionality similar to that of XSLT.

XML has a few update and data management languages, of which SiXDML and XUpdate are probably the most widely-known (neither is standardized). Except for the occasional isolated research project, XML currently has no widely used full-text query language, no "query by example" language, and no equivalent of regular expressions (or graph grammars). The XQuery committee has considered both full-text and data manipulation features, but neither appears in XQuery 1.0.

XML query language designers face several difficult tasks and choices, in addition to the usual challenges facing programming language designers. You must first choose a data model and type system, choices that permeate the rest of the language.

The data model you choose will almost certainly be more expressive than XML itself (for example, if you allow integers, then an integer constant is not by itself valid XML; if you allow sequences of nodes, then a sequence of two documents or two attributes, is not by itself valid XML). Consequently, you must also decide how to handle queries that evaluate to values that are not XML.

If you choose XML Schema as the basis for your type system, you must understand that it already defines conversions from the textual representation to all simple types and from all simple types to a canonical textual representation -- which may or may not match your idea of string conversions -- but XML Schema does not define conversions between types. XML Schema also has concepts such as derivation by list, derivation by extension, substitution groups, nil, and local types that will pose difficulties for you. If you don't choose XML Schema as the basis for your type system, then you must decide how to handle XML that has a schema associated with it or contains meta-data attributes (such as xsi:nil="true"), and users may expect some level of interoperability with existing schema-based applications and formats.

Either way (assuming you allow manipulation of arbitrary XML), you must decide how to handle DTDs, entity references, and inline schemas.

If your language allows users to construct new XML nodes, you must decide what syntax you'll use to construct XML, and ensure that syntax handles all the nuances of XML that are important to your users (which may include whitespace, character references, CDATA sections, etc.).

Namespace scope and namespace nodes in the data model pose problems no matter how you elect to handle them. Personally, I think the best approach is to work with expanded names (local-name and namespace-uri parts), to ignore namespace nodes in the data model, and treat namespace declarations as serialization information only (required by the XML serialization format to asscoiate prefixes with namespace uris). However, sometimes users request fine-grained control over how namespaces are serialized.

If you want to allow sorting using arbitrary expressions as sort keys, your semantics probably requires second-order functions. If you want to allow simultaneous joining and sorting, then your semantics probably requires matrices ("tuples" in the relational sense). If you want to allow sequences of nodes, then you have to decide whether to allow the same node to appear more than once in a sequence ("duplicates"). If you decide not to allow duplicates, you have to decide how and when they're removed from expressions that may result in them.

Algebra (Query Representation and Semantics)

You realize pretty quickly that you need a query algebra before you can get anywhere in query processing research. Some examples that motivate this need are as follows:

Many people start by examining XPath, thinking that because its syntax is simple, its semantics must be simple too. There are plenty of examples that reveal the lie in this assumption, but first consider a simple kind of optimization that cannot be expressed in XPath 1.0:

Consider the query a | b. This can be optimized to *[name() = 'a' or name() = 'b'] so that the node kind is tested only once. Now consider optimizing the query a | @a. You might like to rewrite this into something like node()[name() = 'a' and (kind() = 'element' or kind() = 'attribute')] so that the name is tested only once -- but XPath doesn't have a kind() function and the node() test doesn't select attributes anyway. XPath has no way to express this rewrite.

As another example, consider the XPath a[b[. > 1]/@c > b[. < 1]/@c]. The meaning of this query can perhaps best be explained using an equivalent XQuery:

a[some $j in b[. > 1], $k in b[. < 1]
      satisfies $j/@c = $k/@c]

This XPath can be rewritten to eliminate the self axis, but doing so results in a query that is no longer expressible as XPath:

a[some $j in b, $k in b
  satisfies $j > 1 and $k < 1 and $j/@c = $k/@c]

Consequently, having an algebra that is more expressive than the available query languages is critical to their study.

As with query language design, a good starting point is to look at existing query algebras. Unfortunately, there currently doesn't exist a satisfactory standard XML query algebra. XQuery 1.0 and XSLT 2.0 are currently working on defining a common "formal semantics" (an algebra with normalization and type inferencing rules), but so far their formal semantics has no way to describe key concepts such as node identity and sorting, to name just two of its deficiencies.

There are some other XML algebra efforts such as XAT, and proprietary commercial algebras such as IBM's XQGM and Microsoft's QIL. There are also some tree and graph algebras that could be extended to XML.

One of the most important choices when designing an XML algebra (aside from all the choices listed in the previous section) is whether to choose set-based or iterative operators (or both). For example, will your algebra use OP(a, b) or for $i in a, $j in b return OP($i, $b) for any given operator OP? The former is usually easier to reason about algebraically, but the latter is generally easier for compilers of existing XML query languages to generate.

A well-designed XML algebra starts with the primitive data types of the language (such as node kinds, "atomic" types, and sequences of these) and defines simple operators for manipulating values of these types. If the language includes dynamic type operators (such as conversions) then some operands are themselves types (i.e., types are values in the algebra's data model) or else there is a combinatorial explosion of operators (to handle every destination type).

The algebra should take care to have small operators for every part of a larger language operation. For example, in XPath 1.0 the string value of a node is a complex operation equivalent to the XQuery concat(for $i in .//text() return data($i)). Even if you choose to have a single "string value of" operator, it's a good idea to also have the operators needed to represent this expanded form, so that an optimizer could perform certain reductions (like the ones considered previously in this section). This principle also implies that having single operators for node construction is rarely a good idea, unless you also provide algebraic operators for all the implied operations of the construction -- especially namespace scope and node copying.

View Composition

To my knowledge, very little work has been done on XML query/view composition. The current XML query standards barely mention it, and few implementations support it.

An XML view takes existing data and transforms it into a (possibly different) XML shape. The simplest view is the identity transformation. Another simple (but nontrivial) view is validation, which amounts to an identity transformation plus type conversions.

A slightly more interesting view example is the following XQuery:

declare function local:view() {
  for $i in doc('foo.xml')//foo
  return {$i/@*}
};

This view flattens out the foo element hierarchy, renames them as bar elements, and keeps (projects) only their attributes.

Now a query such as local:view()/bar[@baz = 'quux'] can be executed without materializing this view. Instead, the query can be composed with the view, and propoagated all the way back to the original document. This particular example would reduce to:

for $i in doc('foo.xml')//foo
where $i/@baz = 'quux'
return {$i/@*}

Query/view composition is already hard for the read-only case, due to construction "side-effects" such as copying, value extraction, and namespace scope -- all of which happen transparently in this example, giving the false impression that it's not very complicated algebraically. As it exists today, XQuery is not a very nice language for view definition, precisely due to these effects (many of which are irreversible).

The more interesting cases are DML queries that try to propagate changes through a view back to the original source data. These suffer from all the usual complications encountered with relational views (such as the "Halloween" problem) as well as new problems special to hierarchical data and operators.

Query Optimization

There are several ways to do XML query optimization: optimizations involving the query only, optimizations making use of metadata (schema information), optimizations involving data storage and statistics (including indexing). Many of these optimizations are already standard operating procedures in language compilers and databases -- loop-invariant code-motion, common subexpression elimination, index covers, etc. -- but XML complicates matters somewhat. XML is ordered, which gets in the way of loop rewrites. XML query languages generally have a concept of node identity, which gets in the way of eliminating temporary construction and variable substitution. Also, XML is not simple -- its data models and query languages tend to be quite large and filled with special-cases, all of which complicate query optimization.

There have been a number of published, peer-reviewed papers on XML query optimization, but surprisingly many of them are riddled with obvious errors. I'll not single out any by name here, but some of the top journals contained the following errors: one paper claims that expressions like a[count(b) > 1] should be reduced into a[b] (the actual comparison should be against 0 for that to work) and another claims that expressions like let $v := expr return f($v) can be reduced to just f(expr) (clearly false -- consider let $v := return $v is $v). Yet another paper claims that a[expr1][expr2] can be rewritten as a[expr1 and expr2] (clearly false if expr2 involves position()).

Of the papers that do not commit these blatant errors, many stop well short of the full query language or data model, limiting their usefulness in practice. One paper demonstrates how to eliminate all the reverse axes from paths ... but only if the path doesn't contain the union operator, absolute paths in predicates, etc. Another paper works on XML ... without namespaces. These are nice fantasies, but such results are essentially unusable in real-world XML query processors.

Progress in this area is severely hampered not only by the relatively poor quality of existing work, but also by the lack of a standard algebra and notation for describing XML queries and query rewrites.

Streaming Query

I think streaming XML query is one of the most appealing problems in all of XML query processing, in part because it's so algebraic in nature. It also has clear practical uses in lots of industrial XML software applications, especially XML messaging.

The basic idea with streaming query is to avoid buffering data as much as possible -- to execute the XML query in a streaming fashion. Every existing XML parser buffers all attributes of each element, in order to extract namespace declarations, so some buffering seems inevitable. Also, some queries and data -- such as those involving ID values -- should (or must) buffer.

Some researchers (and products) have attempted to solve streaming query by limiting the query language. However, this approach is doomed to fail, because the minute you allow the three operators /, name tests, and predicates, you already have non-streaming queries like /a/b[/a/c]. It's also not what end-users want -- for them, streaming is just a performance benefit; they don't usually want prohibitions imposed on their queries.

Victor Vianu has made some excellent progress in streaming XML query research, but there's plenty more work to be done (especially in implementating streaming query and defining a formal semantics for it).

Data Models

When I first got started with XML, I made the same mistake almost everyone makes -- I thought "XML" was synonymous with its data model. After awhile, I learned that part of the reason XML is so successful is that everyone has their own interpretation of what it actually is.

I wrote several pages about the more popular XML data models in chapter 3 of XQuery: The XML Query Language, but briefly: some people regard XML as markup tags in text. Others regard it as an untyped tree of text. Others regard it as a typed tree of values. Yet others regard it as a graph of labelled edges. Given that every XML query language so far has defined a new data model different from all the ones before it, I can safely predict that we've not yet found "the One" XML data model that will fit the majority of XML applications (if such a model even exists).

So XML data models are in a predicament similar to XML query algebras, except that they're further along in their thinking. Given the diversity of models in wide use today, researchers' best bets are to either pick one and stick with that, or else to try to handle the widespread diversity by coming up with their own models that encompass the others.

Storage and Indexing

TBD

Data Layout Optimization

TBD

Locks and Transactions

TBD

Mapping

TBD

Data Integration/Aggregation

TBD

Tools

TBD


Bibliography

[Brundage] XQuery: The XML Query Language, Michael Brundage, Addison-Wesley, 2004.
[Formal Semantics] XQuery 1.0 and XPath 2.0 Formal Semantics, World Wide Web Consortium (W3C) Working Draft, 2003.
[Full Text] XQuery Full-Text Requirements, World Wide Web Consortium (W3C) Working Draft, 2003.
[Halloween] Anecdotes, Anne Fizpatrick, IEEE Annals of the History of Computing, 2002.
[MUD] The MUD FAQ
[QIL] System.Xml.Query namespace, MSDN Longhorn Preview Documentation, Microsoft.
[Vianu] Victor Vianu's Home Page
[XAT] XAT: XML Algebra for the Rainbow System, X. Zhang and E. Rundensteiner, WPI Technical Report, 2002.
[XML] Extensible Markup Language (XML) 1.0, 3rd edition, World Wide Web Consortium (W3C) Recommendation, 2004.
[XPath] XML Path Language (XPath) Version 1.0, World Wide Web Consortium (W3C) Recommendation, 1999.
[XQBE] XQuery by Example, Braga et al., 2003. 12th International World Wide Web Conference Poster Session.
[XQGM] Querying XML Views of Relational Data, Shanmugasundaram, et al., The VLDB Journal, 2001.
[XQuery] , World Wide Web Consortium (W3C) Working Draft, 2003.
[XML Schema] XML Schema, World Wide Web Consortium (W3C) Recommendation, 2001.
[XSLT] XSL Transformations (XSLT) Version 1.0, World Wide Web Consortium (W3C) Recommendation, 1999.