*Note: This book excerpt is from XQuery from the Experts: A Guide to the W3C XML Query Language by Howard Katz, Don Chamberlin, Denise Draper, Mary Fernandez, Michael Kay, Jonathan Robie, Michael Rys, Jerome Simeon, Jim Tivy, and Philip Wadler (ISBN 0-321-18060-7), copyright 2004. All rights reserved. Material posted with permission from Addison-Wesley.*
XML (Extensible Markup Language) is an extremely versatile data format that has been used to represent many different kinds of data.
This chapter uses bibliography data to illustrate the basic features of XQuery.
XQuery is defined in terms of a formal data model, not in terms of XML text.
XQuery uses "smiley faces" to begin and end comments. This cheerful notation was originally suggested by Jeni Tennison.
XQuery uses input functions to identify the data to be queried.
In XQuery, path expressions are used to locate nodes in XML data. XQuery's path expressions are derived from XPath 1.0 and are identical to the path expressions of XPath 2.0.
Now we will learn how to create nodes. Elements, attributes, text nodes, processing instructions, and comments can all be created using the same syntax as XML.
Queries in XQuery often combine information from one or more sources and restructure it to create a new result. This section focuses on the expressions and functions most commonly used for combining and restructuring XML data.
Like most languages, XQuery has arithmetic operators and comparison operators, and because sequences of nodes are a fundamental datatype in XQuery, it is not surprising that XQuery also has node sequence operators.
XQuery has a set of built-in functions and operators, including many that are familiar from other languages, and some that are used for customized XML processing.
When a query becomes large and complex, it is often much easier to understand if it is divided into functions, and these functions can be reused in other parts of the query.
A query can define a variable in the prolog. Such a variable is available at any point after it is declared.
Functions can be put in library modules, which can be imported by any query. Every module in XQuery is either a main module, which contains a query body to be evaluated, or a library module, which has a module declaration but no query body.
XQuery implementations are often embedded in an environment such as a Java or C# program or a relational database. The environment can provide external functions and variables to XQuery.
XQuery is not only a query language, but also a language that can do fairly general processing of XML. It is a strongly typed language that works well with data that may be strongly or weakly typed.
XML (Extensible Markup Language) is an extremely versatile data format that has been used to represent many different kinds of data, including web pages, web messages, books, business and accounting data, XML representations of relational database tables, programming interfaces, objects, financial transactions, chess games, vector graphics, multimedia presentations, credit applications, system logs, and textual variants in ancient Greek manuscripts.
In addition, some systems offer XML views of non-XML data sources such as relational databases, allowing XML-based processing of data that is not physically represented as XML. An XML document can represent almost anything, and users of an XML query language expect it to perform useful queries on whatever they have stored in XML. Examples illustrating the variety of XML documents and queries that operate on them appear in [XQ-UC].
However complex the data stored in XML may be, the structure of XML itself is simple. An XML document is essentially an outline in which order and hierarchy are the two main structural units. XQuery is based on the structure of XML and leverages this structure to provide query capabilities for the same range of data that XML stores. To be more precise, XQuery is defined in terms of the XQuery 1.0 and XPath 2.0 Data Model [XQ-DM], which represents the parsed structure of an XML document as an ordered, labeled tree in which nodes have identity and may be associated with simple or complex types. XQuery can be used to query XML data that has no schema at all, or that is governed by a World Wide Web Consortium (W3C) XML Schema or by a Document Type Definition (DTD). Note that the data model used by XQuery is quite different from the classical relational model, which has no hierarchy, treats order as insignificant, and does not support identity. XQuery is a functional language: instead of executing commands as procedural languages do, every query is an expression to be evaluated, and expressions can be combined quite flexibly with other expressions to create new expressions.
This chapter gives a high-level introduction to the XQuery language by presenting a series of examples, each of which illustrates an important feature of the language and shows how it is used in practice. Some of the examples are drawn from [XQ-UC]. We cover most of the language features of XQuery, but also focus on teaching the idioms used to solve specific kinds of problems with XQuery. We start with a discussion of the structure of XML documents as input and output to queries and then present basic operations on XML: locating nodes in XML structures using path expressions, constructing XML structures with element constructors, and combining and restructuring information from XML documents using FLWOR expressions, sorting, conditional expressions, and quantified expressions. After that, we explore operators and functions, discussing arithmetic operators, comparisons, some of the common functions in the XQuery function library, and how to write and call user-defined functions. Finally, we discuss how to import and use XML Schema types in queries.
Many users will learn best if they have access to a working implementation of XQuery. Several good implementations can be downloaded for free from the Internet; a list of these appears on the W3C XML Query Working Group home page, which is found at http://www.w3.org/xml/Query.html.
This chapter is based on the May 2003 Working Draft of the XQuery language. XQuery is still under development, and some aspects of the language discussed in this chapter may change.
This chapter uses bibliography data to illustrate the basic features of XQuery. The data used is taken from the XML Query Use Cases, Use Case "XMP," and originally appeared in [EXEMPLARS]. We have modified the data slightly to illustrate some of the points to be made. The data used appears in Listing 1.1.
Listing 1.1 Bibliography Data for Use Case "XMP"
<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the UNIX Environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>65.95</price>
  </book>
  <book year="1999">
    <title>The Economics of Technology and Content for Digital TV</title>
    <editor>
      <last>Gerbarg</last>
      <first>Darcy</first>
      <affiliation>CITI</affiliation>
    </editor>
    <publisher>Kluwer Academic Publishers</publisher>
    <price>129.95</price>
  </book>
</bib>
The data for this example was created using a DTD, which specifies that a bibliography is a sequence of books. Each book has a title, a publication year (as an attribute), one or more authors or one or more editors, a publisher, and a price. Each author or editor has a first and a last name, and an editor also has an affiliation. Listing 1.2 provides the DTD for our example.
Listing 1.2 DTD for the Bibliography Data
<!ELEMENT bib (book* )>
<!ELEMENT book (title, (author+ | editor+ ), publisher, price )>
<!ATTLIST book year CDATA #REQUIRED >
<!ELEMENT author (last, first )>
<!ELEMENT editor (last, first, affiliation )>
<!ELEMENT title (#PCDATA )>
<!ELEMENT last (#PCDATA )>
<!ELEMENT first (#PCDATA )>
<!ELEMENT affiliation (#PCDATA )>
<!ELEMENT publisher (#PCDATA )>
<!ELEMENT price (#PCDATA )>
XQuery is defined in terms of a formal data model, not in terms of XML text. Every input to a query is an instance of the data model, and the output of every query is an instance of the data model. In the XQuery data model, every document is represented as a tree of nodes. The kinds of nodes that may occur are: document, element, attribute, text, namespace, processing instruction, and comment. Every node has a unique node identity that distinguishes it from other nodes, even from other nodes that are otherwise identical.
In addition to nodes, the data model allows atomic values, which are single values that correspond to the simple types defined in the W3C Recommendation, "XML Schema, Part 2" [SCHEMA], such as strings, Booleans, decimals, integers, floats and doubles, and dates. These simple types may occur in any document associated with a W3C XML Schema. As we will see later, we can also represent several simple types directly as literals in the XQuery language, including strings, integers, doubles, and decimals.
An item is a single node or atomic value. A series of items is known as a sequence. In XQuery, every value is a sequence, and there is no distinction between a single item and a sequence of length one. Sequences can only contain nodes or atomic values; they cannot contain other sequences.
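The flattening rule can be seen in a small query. This is a minimal sketch that assumes the syntax and built-in functions of XQuery 1.0 as finalized, which may differ in detail from the Working Draft this chapter describes; the values are made up for illustration.

```xquery
(: Each member of the outer sequence is one check; expected results are in comments. :)
(
  count((1, (2, 3), (), 4)),            (: 4: inner sequences are flattened, () adds nothing :)
  (5) eq 5,                             (: true: a sequence of length one is just that item :)
  deep-equal((1, (2, 3)), (1, 2, 3))    (: true: both denote the same flat sequence :)
)
```

Because nesting is never preserved, a query that needs grouped data has to express the grouping with elements rather than with nested sequences.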
The first node in any document is the document node, which contains the entire document. The document node does not correspond to anything visible in the document; it represents the document itself. Element nodes, comment nodes, and processing instruction nodes occur in the order in which they are found in the XML (after expansion of entities). Element nodes occur before their children, that is, before the element nodes, text nodes, comment nodes, and processing instructions they contain. Attributes are not considered children of an element, but they have a defined position in document order: They occur after the element in which they are found, before the children of the element. The relative order of attribute nodes is implementation-dependent. In document order, each node occurs precisely once, so sorting nodes in document order removes duplicates.
An easy way to understand document order is to look at the text of an XML document and mark the first character of each element start tag, attribute name, processing instruction, comment, or text node. If the first character of one node occurs before the first character of another node, it will precede that node in document order. Let's explore this using the following small XML document:
<!-- document order -->
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
</book>
The first node of any document is the document node. After that, we can identify the sequence of nodes by looking at the sequence of start characters found in the original document, that is, the first character of each comment, start tag, attribute name, and text node. The second node is the comment, followed by the book element, the year attribute, the title element, the text node containing TCP/IP Illustrated, the author element, the last element, the text node containing Stevens, the first element, and the text node containing W.
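Document order can also be checked from inside a query. The sketch below is an illustration only: it assumes the `<<` node-order comparison and the `|` union operator as defined in XQuery 1.0, neither of which has been introduced yet at this point in the chapter, and it builds the sample book with an element constructor so that it does not depend on any file.

```xquery
let $book :=
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
  </book>
return (
  $book/title << $book/author,          (: true: title starts before author :)
  $book/@year << $book/title,           (: true: attributes precede the element's children :)
  count($book/author | $book/author)    (: 1: union returns each node once, in document order :)
)
```

The last line shows the consequence stated above: combining a node with itself yields a single node, because duplicates are identified by node identity rather than by content.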
Literals and Comments in XQuery
XQuery uses "smiley faces" to begin and end comments. This cheerful notation was originally suggested by Jeni Tennison. Here is an example of a comment:
(: Thanks, Jeni! :)
Note that XQuery comments are comments found in a query. XML documents may also have comments, like the comment found in an earlier example:
<!-- document order -->
XQuery comments do not create XML comments; XQuery has a constructor for this purpose, which is discussed later in the section on constructors.
XQuery supports three kinds of numeric literals. Any number may begin with an optional + or - sign. A number that has only digits is an integer, a number containing only digits and a single decimal point is a decimal, and any valid floating-point literal containing an e or E is a double. These correspond to the XML Schema simple types xs:integer, xs:decimal, and xs:double.
1         (: An integer :)
-2        (: An integer :)
+2        (: An integer :)
1.23      (: A decimal :)
-1.23     (: A decimal :)
1.2e5     (: A double :)
-1.2E5    (: A double :)
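The mapping from literal form to type can be confirmed with the instance of operator. This is a hedged sketch: it assumes XQuery 1.0 as finalized, where the xs prefix is predeclared, and it is not taken from the original text.

```xquery
(
  1 instance of xs:integer,       (: true: digits only :)
  1.23 instance of xs:decimal,    (: true: digits and a decimal point :)
  1.2e5 instance of xs:double,    (: true: the exponent makes it a double :)
  1.23 instance of xs:double      (: false: without an exponent it is a decimal :)
)
```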
String literals are delimited by quotation marks or apostrophes. If a string is delimited by quotation marks, it may contain apostrophes; if a string is delimited by apostrophes, it may contain quotation marks:
"a string"
'a string'
"This is a string, isn't it?"
'This is a "string"'
If the literal is delimited by apostrophes, two adjacent apostrophes within the literal are interpreted as a single apostrophe. Similarly, if the literal is delimited by quotation marks, two adjacent quotation marks within the literal are interpreted as one quotation mark. The following two string literals are identical:
"a "" or a ' delimits a string literal"
'a " or a '' delimits a string literal'
A string literal may contain predefined entity references. The entity references shown in Table 1.1 are predefined in XQuery.
Here is a string literal that uses two of the predefined entity references, &lt; and &gt;:
'&lt;bold&gt;A sample element.&lt;/bold&gt;'
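The effect of entity references and doubled delimiters on the resulting string value can be checked directly. The following is a small sketch, assuming the string-length function and the eq comparison of XQuery 1.0; the literals are shortened versions of the ones above.

```xquery
(
  string-length('&lt;bold&gt;'),    (: 6: each entity reference stands for one character :)
  "a "" or a ' delimits" eq 'a " or a '' delimits'    (: true: both literals denote the same string :)
)
```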
Input Functions in XQuery
XQuery uses input functions to identify the data to be queried. There are two input functions:
1. doc() returns an entire document, identifying the document by a Uniform Resource Identifier (URI). To be more precise, it returns the document node.
2. collection() returns a collection, which is any sequence of nodes that is associated with a URI. This is often used to identify a database to be used in a query.
TABLE 1.1 Entity References Predefined in XQuery
Entity Reference | Character Represented
&lt; | <
&gt; | >
&amp; | &
&quot; | "
&apos; | '
If our sample data is in a file named books.xml, then the following query returns the entire document:
doc("books.xml")
A dynamic error is raised if the doc() function is not able to locate the specified document or the collection() function is not able to locate the specified collection.
Locating Nodes with Path Expressions in XQuery
In XQuery, path expressions are used to locate nodes in XML data. XQuery's path expressions are derived from XPath 1.0 and are identical to the path expressions of XPath 2.0. The functionality of path expressions is closely related to the underlying data model. We start with a few examples that convey the intuition behind path expressions, then define how they operate in terms of the data model.
The most commonly used operators in path expressions locate nodes by identifying their location in the hierarchy of the tree. A path expression consists of a series of one or more steps, separated by a slash, /, or double slash, //. Every step evaluates to a sequence of nodes. For instance, consider the following expression:
doc("books.xml")/bib/book
This expression opens books.xml using the doc() function and returns its document node, uses /bib to select the bib element at the top of the document, and uses /book to select the book elements within the bib element. This path expression contains three steps. The same books could have been found by the following query, which uses the double slash, //, to select all of the book elements contained in the document, regardless of the level at which they are found:
doc("books.xml")//book
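The difference between / and // shows up when an element name occurs at more than one place in the tree. The sketch below assumes that books.xml contains exactly the data of Listing 1.1 and uses a let clause and the count function, which the chapter has not yet introduced; the expected counts in the comments follow from that listing.

```xquery
let $doc := doc("books.xml")
return (
  count($doc/bib/book),               (: 4 :)
  count($doc//book),                  (: 4: the same four books :)
  count($doc/bib/book/author/last),   (: 5: last names of authors only :)
  count($doc//last)                   (: 6: also finds the last element inside editor :)
)
```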
Predicates are Boolean conditions that select a subset of the nodes computed by a step expression. XQuery uses square brackets around predicates. For instance, the following query returns only authors for which last="Stevens" is true:
doc("books.xml")/bib/book/author[last="Stevens"]
If a predicate contains a single numeric value, it is treated like a subscript. For instance, the following expression returns the first author of each book:
doc("books.xml")/bib/book/author[1]
Note that the expression author[1] will be evaluated for each book. If you want the first author in the entire document, you can use parentheses to force the desired precedence:
(doc("books.xml")/bib/book/author)[1]
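Counting the results makes the effect of each predicate concrete. As before, this is only a sketch that assumes books.xml holds the data of Listing 1.1 and uses the count function for illustration.

```xquery
let $doc := doc("books.xml")
return (
  count($doc/bib/book/author[last = "Stevens"]),   (: 2: one author in each of the first two books :)
  count($doc/bib/book/author[1]),                  (: 3: one first author per book that has authors :)
  count(($doc/bib/book/author)[1])                 (: 1: the first author in the whole document :)
)
```

The fourth book contributes nothing to the second count because it has an editor rather than an author.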
Now let's explore how path expressions are evaluated in terms of the data model. The steps in a path expression are evaluated from left to right. The first step identifies a sequence of nodes using an input function, a variable that has been bound to a sequence of nodes, or a function that returns a sequence of nodes. Some XQuery implementations also allow a path expression to start with a / or //.