XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2

Get the most out of this SQL/XML function

Vitor Rodrigues ( [email protected]), DB2 pureXML Technical Enablement, IBM

Vitor Rodrigues is a software developer at the IBM Silicon Valley Lab. He graduated from University of Minho, Portugal, in Computer Science and Systems Engineering. He joined IBM in 2005 as an intern working on DB2 Everyplace and DB2 9 pureXML. Vitor was a part of the DB2 9 QA team for pureXML, where he acquired deep knowledge of the XML features in DB2 9. After his internship, Vitor became a regular employee, now working for the DB2 9 XML Enablement team.

Matthias Nicola ( [email protected]), DB2/XML Performance, IBM Silicon Valley Laboratory

Dr. Nicola is the technical lead for XML database performance at IBM's Silicon Valley Lab. His work focuses on all aspects of XML performance in DB2, including XQuery, SQL/XML, and all native XML features in DB2. Dr. Nicola works closely with the DB2 XML development teams as well as with customers and business partners who are using XML, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Dr. Nicola worked on data warehousing performance for Informix Software. He also worked for four years in research and industry projects on distributed and replicated databases. He received his doctorate in computer science in 1999 from the Technical University of Aachen, Germany.

Summary: XMLTABLE is one of the most powerful functions in the SQL/XML standard and is available in IBM® DB2® 9 for Linux®, Unix®, and Windows® as well as DB2 9 for z/OS®. In part 1 of this two-part series on XMLTABLE you learned how to use XMLTABLE to retrieve XML data in relational format, how to deal with repeating or missing XML elements, and how to handle namespaces in the XMLTABLE function. Part 2 describes common XMLTABLE usage scenarios, such as shredding XML into relational tables, splitting large documents into smaller ones, producing hybrid XML-relational storage of your XML documents, and using relational views over XML data. These techniques and samples will help you develop powerful XML applications with DB2 9 pureXML.

View more content in this series

Tag this!

Update My dW interests ( Log in | What's this?) Skip to help for Update My dW interests

Date: 13 Sep 2007
Level: Intermediate

Activity: 4114 views
Comments: 0 (Add comments)

Sample data

For the discussion of XMLTABLE scenarios in this article, the same sample table and data is used as in part 1 of this series. The table contains one XML column and two rows with one XML document per row. The first document describes a department with two employees, the second document a department with just one employee.

Table 1. Sample table and data

create table emp (doc XML);

<dept bldg="101">

                        <employee id="901">

                        <name>

                        <first>John</first>

                        <last>Doe</last>

                        </name>

                        <office>344</office>

                        <salary currency="USD">55000</salary>

                        </employee>

                        <employee id="902">

                        <name>

                        <first>Peter</first>

                        <last>Pan</last>

                        </name>

                        <office>216</office>

                        <phone>905-416-5004</phone>

                        </employee>

                        </dept>

<dept bldg="114">

                        <employee id="903">

                        <name>

                        <first>Mary</first>

                        <last>Jones</last>

                        </name>

                        <office>415</office>

                        <phone>905-403-6112</phone>

                        <phone>647-504-4546</phone>

                        <salary currency="USD">64000</salary>

                        </employee>

                        </dept>

XMLTABLE can be used for a variety of purposes other than the basic usage of returning XML data in relational format.

Split big XML documents into smaller documents

Most programmers find it convenient and efficient to work with an XML document granularity that matches the logical business objects of the application and the predominant granularity of access. For example, a single document per purchase order, per trade, per contract, per tax return, per customer is usually a good idea. Smaller documents can be manipulated more efficiently than large ones, especially with XML design and authoring tools. Also, indexed access and data retrieval is faster for smaller documents.

However, for a bulk transfer of XML data outside the database, such as FTP, it is often not convenient to handle thousands or millions of separate documents. Therefore, it is common to receive large XML documents, often multiple hundred Megabytes per file, which contain many repeating blocks that represent independent objects. Many external tools fail, or have severe problems, when you try to open such large XML documents, typically due to document object model (DOM) parsing and memory limitations.

DB2 can ingest XML documents up to 2GB. Optionally, you can split them into smaller documents using the XMLTABLE function. As an example, use the data in Table 1, where employee data is aggregated by department, with multiple employees per documents. Query 1 is embedded in an INSERT statement and extracts each employee from a department document. Each employee is inserted into the table employee as a separate XML document in a separate row.

Query 1. Split documents into smaller pieces using XMLTABLE

                

            create table employee(doc XML)

            

            INSERT INTO employee

            SELECT X.* FROM 

            XMLTABLE ('$d/dept/employee' passing cast(? As XML) as "d" 

            COLUMNS 

            doc XML PATH 'document{.}') AS X

The XMLTABLE function in this query produces one row per employee and a single column of type XML. The "?" in the passing clause denotes a parameter marker that provides the XML document as input to the statement. The cast (? as XML) is used to cast the parameter marker to type XML. Instead of using a parameter marker, you can also use literal XML documents as input to a statement, but that is usually clumsy and impractical for all but very small documents. Parameter markers are preferred.

The XML data model requires a parsed, well-formed XML document to have one document node (as a parent of the single root element of the document). This document node is not visible in the textual (serialized) representation of an XML document. The employee sub-trees extracted from an input document does not have document nodes, and hence cannot be inserted as well-formed documents. Therefore, the document{} constructor creates a document node for each extracted employee.

Decompose XML documents into relational data

Using XMLTABLE-queries in INSERT statements is also a convenient way of decomposing, or shredding, XML data into a relational table. For example, sometimes XML is only used as a message format and does not need to be retained once the message is received. If the XML payload of the message can easily be mapped to existing relational tables to feed your SQL applications, then shredding XML to relational can be the right approach. However, in scenarios with complex and variable XML schemas this is a lot harder, so that pureXML storage is a better option than shredding.

Following the previous example, Query 2 shows how XMLTABLE is used to shred the employee data into a new relational table, employeeRel, with one row per employee.

Query 2. Use XMLTABLE to decompose XML data into relational format

                

            create table employeeRel(bldg integer, id integer, fname varchar(15),

            lname varchar(15), office integer)

            

            INSERT INTO employeeRel

            SELECT X.* FROM 

            XMLTABLE ('$d/dept/employee' passing cast (? as XML) as "d" 

            COLUMNS 

            building	INTEGER	PATH '../@bldg'

            empID	INTEGER	PATH '@id',

            firstname	VARCHAR(15)	PATH 'name/first',

            lastname	VARCHAR(15)	PATH 'name/last',

            office	INTEGER	PATH 'office') AS X

In this scenario, you are only shredding a subset of the elements in your documents. Phone and salary elements were not shredded, but that can easily be added. For now, a SELECT statement from the table employeeRel returns the following output:

BLDG   ID    FNAME                LNAME               OFFICE

            ----- ------ -------------------- ------------------- -----

            101    901   John                 Doe                 344

            101    902   Peter                Pan                 216

            114    903   Mary                 Jones               415

            

            3 record(s) selected.

It is also possible to shred XML documents from an existing XML column. In Query 2, you would simply pass a column name instead of the parameter marker into the row-generating expression, as you see in Query 5 below as well as in most of the queries in part 1 of this series.

Hybrid approach: XML and relational

There are situations in which it makes sense to have some of your data in relational format and some in XML format. For structured and fixed data items that are unlikely to change over time, relational columns may be a good choice. Other parts of your data may be semi-structured, variable, or inherently hierarchical and therefore better stored in an XML column. A common way to achieve this hybrid approach at insert time is to extract selected elements or attributes from each XML document and place them into relational columns in the same row where you insert the full document. On the relational columns, you can then define multi-column indexes (composite keys), referential integrity constraints, or triggers, since these concepts are not or only partially available for XML columns.

There are various ways to populate the relational columns in your table along with the XML column. You can do the value extraction in your application and feed both the XML document and the extracted values into an SQL INSERT statement. But, it's often easier and simpler to just use XMLTABLE in an INSERT statement. This avoids XML parsing in your application and keeps the logic in the INSERT statement. If any changes are needed in the future, only this statement -- possibly encapsulated in a stored procedure -- needs to be modified instead of altering the application code.

Query 3 shows how the department documents in Table 1 can be inserted with extraction of the bldg attribute into a separate column of type INTEGER. Note that the row-generating expression $d does not include any navigation, meaning no XPath steps. This means that the context of the COLUMNS clause is always a full document including its document node.

Query 3. Insert XML data and extract one attribute into a relational column

                

            create table hybridDept(bldg integer, doc XML);

            

            INSERT INTO hybridDept

            SELECT X.* FROM

            XMLTABLE ('$d' passing cast(? as XML) as "d" 

            COLUMNS 

            building        INTEGER         PATH ' @bldg',

            doc             XML             PATH '.'

            ) AS X

Query 4 inserts the same XML documents as Query 3, but it extracts the employee ID, first and last name, and office number into relational columns. It splits up the department documents and produces one row per employee with one XML document per employee. The new documents are fragments of the input documents and therefore require the construction of document nodes.

Query 4. Insert XML data and store multiple elements in relational columns

                

            create table hybridEmp(id integer, fname varchar(25), 

            lname varchar(25), doc XML);

            

            INSERT INTO hybridEmp

            SELECT X.* FROM

            XMLTABLE ('$d/dept/employee' passing cast(? as XML) as "d" 

            COLUMNS 

            id 	INTEGER		PATH ' @id',

            first 	VARCHAR(25) 	PATH 'name/first',

            last 	VARCHAR(25) 	PATH 'name/last',

            doc 	XML		PATH 'document{.}'

            ) AS X

Create a relational view over XML data

In the previous sections, you saw how to use XMLTABLE to shred XML data to relational tables or to produce a hybrid representation of your XML data. The same XMLTABLE expressions can be embedded in view definitions. This allows you to provide applications with a relational or hybrid view of the XML data without actually storing the data in that respective format.

Query 5 is embedded in a view definition and exposes a subset of the XML elements and attributes from the column doc in a relational format. Applications can use this view to query the data with traditional SQL statements, even though the actual underlying data is in XML format.

Query 5. Use a view to expose your XML data in relational format

                

            CREATE VIEW emp_rel (id, first_name ,last_name ,office) 

            AS

            SELECT X.* FROM emp,

            XMLTABLE ('$d/dept/employee' passing doc as "d" 

            COLUMNS 

            empID 		INTEGER 	PATH '@id',

            firstname		VARCHAR(5) 	PATH 'name/first',

            lastname		VARCHAR(5) 	PATH 'name/last',

            office		INTEGER 	PATH 'office') AS X

If you are familiar with relational views, you will easily recognize the first line of the statement that defines the view name and its columns. The query expression in the AS clause defines the content of the view. In this example, XMLTABLE is used to produce one row for each employee in the XML data. Each row is composed of four relational columns containing the employee ID, first name, last name, and office number, respectively.

Query 6 shows how you can run a simple SQL statement (with no XML extensions) against this view to obtain the first name and ID of the employee in office 344.

Query 6. Use SQL to query the XMLTABLE view

                

            select id, first_name from emp_rel where office = 344;

            

            ID          FIRST_NAME

            ----------- ----------

            901 John

            

            1 record(s) selected.

Be aware that when you run such SQL queries with SQL predicates against an XMLTABLE view, DB2 first generates all the relational rows for the view and then evaluates the SQL predicate(s). This means that the XMLTABLE function is applied to all documents in the underlying table, even though the SQL query requires data from only one particular document. This can adversely affect performance.

In particular, DB2 does not use an XML index to evaluate the SQL predicate because the predicate is expressed on a relational column of the view and not on the underlying base table. It is, in general, not straight forward for a database system to convert relational predicates into XML predicates. In the next section, you'll see how to work around this restriction and still achieve good performance.

Performance considerations for XMLTABLE views

The key to obtaining good runtime performance for queries over XMLTABLE views is the proper use of indexes to reduce the number of XML documents that are input to the XMLTABLE function. There are two ways of doing this:

Using XML predicates with XML indexes
Using relational predicates on relational columns with indexes

Since Query 6 has a predicate on office, which is ultimately an element in the underlying XML data, you may want to speed up the query by creating an index on /dept/employee/office for the underlying XML column. A query does not use this index unless you express the predicate as an XML predicate. This requires the XML column to be included in the view definition, as shown in Query 7.

Query 7. Include the XML column in the XMLTABLE view

                

            CREATE VIEW emp_rel (id, first_name ,last_name ,office, doc) 

            AS

            SELECT X.* FROM emp,

            XMLTABLE ('$d/dept/employee' passing doc as "d" 

            COLUMNS 

            empID	INTEGER		PATH '@id',

            firstname	VARCHAR(5)	PATH 'name/first',

            lastname	VARCHAR(5)	PATH 'name/last',

            office	INTEGER	PATH 'office'

            doc      	XML	PATH 'document{.}'

            ) AS X

Then you can express Query 6 with an XML predicate that can use the XML index to provide a significant performance improvement. This is shown in Query 8. Although an XML predicate is used, the SELECT list still references only relational view columns to return the query result in relational format.

Query 8. Query the XMLTABLE view with an XML predicate

                

            select id, first_name from emp_rel 

            where xmlexists('$d/employee/[office = 344]' passing doc as "d")

            

            ID          FIRST_NAME

            ----------- ----------

            901 John

            

            1 record(s) selected.

You can also speed up queries against XMLTABLE views if the underlying base table has relational columns in addition to the XML column, which is actually quite common. In that case, you should include these extra relational columns in the view definition. In Query 9 the columns deptID and unit have been included in the view.

Query 9. Include relational column in the XMLTABLE view

                

            CREATE TABLE emp(deptId integer, unit char(20), doc XML)

            

            CREATE VIEW emp_rel (deptId, unit, empid, first_name ,last_name ,office) 

            AS

            SELECT E.deptId, E.unit, X.* 

            FROM emp E,

            XMLTABLE ('$d/dept/employee' passing doc as "d" 

            COLUMNS 

            empID		INTEGER	PATH '@id',

            firstname	VARCHAR(5)	PATH 'name/first',

            lastname	VARCHAR(5)	PATH 'name/last',

            office	INTEGER	PATH 'office'

            ) AS X

Chances are that your queries for employee information will include predicates on the deptID, or on the unit, or both. If these predicates are sufficiently selective, then the cost of the XMLTABLE processing is significantly reduced. In Query 10, a predicate on deptID has been added. DB2 evaluates this predicate first so that the XMLTABLE function is only applied to the employees in department 9473. This is a small set compared to the total size of the table. Among those, the second predicate finds the employee in office 344 quickly.

Query 10. Query the XMLTABLE view with relational predicates

                

            select id, first_name from emp_rel 

            where deptID = 9473 and office = 344

            

            ID          FIRST_NAME

            ----------- ----------

            901 John

            

            1 record(s) selected.

Number the rows produced for a document

In some cases, you may want to generate a column that numbers the rows which XMLTABLE produces for any given document. This can help your application to remember the order in which the values appeared in each document. You can achieve this with the "for ordinality" clause that is supported in DB2 9.5. Query 11 demonstrates this functionality. Note that the numbering starts with 1 for each document that is input to the XMLTABLE function.

Query 11. An XMLTABLE query with a sequence number produced by "for ordinality"

                

            SELECT X.* 

            FROM emp, 

            XMLTABLE ('$d/dept/employee' passing doc as "d" 

            COLUMNS 

            seqno       for ordinality,

            empID 	INTEGER 	PATH '@id',

            firstname 	VARCHAR(20) 	PATH 'name/first',

            lastname 	VARCHAR(25) 	PATH 'name/last') AS X

Running this query in DB2 returns the following result:

seqno   empID       firstname            lastname

            ------- ----------- -------------------- -------------------------

            1         901 John                 Doe

            2         902 Peter                Pan

            1         903 Mary                 Jones

List field-value pairs

Another real-world XML application requirement involves listing all field/value pairs that exist in a given XML document -- without making assumptions about its structure or the attribute and element names. This can be achieved with the XMLTABLE function.

To produce a relational table that contains two columns (tagname, value), you need to traverse all elements and attributes in your XML documents and extract the name and value of each element and attribute. Query 12 shows a generic way to return all attribute/value and element/value pairs from the XML data. This generic query can be modified to meet more specific needs.

Note the row-generating expression $d//(*, @*) in Query 12. The double-slash (//) recursively traverses the XML document to every level. Expressions such as $d//* or $d//@* would return either all elements or all attributes, respectively, but not a mixed sequence of both. The expression $d//(*, @*) returns all elements and attributes.

Query 12. List all attributes and elements with their values

                

            SELECT X.* FROM emp,

            XMLTABLE ('$d//(*, @*)' passing doc as "d" 

            COLUMNS 

            tagname VARCHAR(20)  PATH  'name()',

            value   VARCHAR(60)  PATH  'data(.)'

            ) AS X

For each node in your XML data, Query 12 returns a relational row with the name and value of the node. The XML data model defines that the value of a non-leaf node is the concatenation of all descendant text nodes. This is visible in the result of Query 12:

TAGNAME              VALUE

            -------------------- -----------------------------------------------

            dept                 JohnDoe34455000PeterPan216905-416-5004

            bldg                 101

            employee             JohnDoe34455000

            id                   901

            name                 JohnDoe

            first                John

            last                 Doe

            office               344

            salary               55000

            currency             USD

            employee             PeterPan216905-416-5004

            id                   902

            name                 PeterPan

            first                Peter

            last                 Pan

            office               216

            phone                905-416-5004

            dept                 MaryJones415905-403-6112647-504-454664000

            bldg                 114

            employee             MaryJones415905-403-6112647-504-454664000

            id                   903

            name                 MaryJones

            first                Mary

            last                 Jones

            office               415

            phone                905-403-6112

            phone                647-504-4546

            salary               64000

            currency             USD

            

            29 record(s) selected.

Since the concatenated values are not very useful and redundant with other entries, you can write smarter expressions for the COLUMNS clause to produce a more organized result set.

Query 13. List all triplets of name-type-value for all XML nodes in your documents

                

            SELECT X.* FROM emp ,

            XMLTABLE ('$d//(*, @*)' passing doc as "d" 

            COLUMNS 

            tagname  VARCHAR(16) PATH  'name()',

            type     VARCHAR(12) PATH 'let $i := (if  (self::element()) 

            then (if (./*) then (''ELEMENT'') else (''LEAF-ELEMENT'')) 

            else (''ATTRIBUTE'')) return $i',

            value    VARCHAR(16) PATH '( .[self::element()]/text(),

            .[self::attribute()]/data(.) )'

            ) AS X

The difference between Query 12 and Query 13 is a new column-generating expression for the column value and the new column named type. Just like the row-generating expression can be any XQuery expression, the same is true for the column-generating expressions as long as they each return a scalar value that is castable to assigned SQL data type.

In the value column, you now only return each element's immediate text node children without concatenation of its descendant text nodes. This is achieved with the function text(). For attribute nodes, use the function data() instead of text, since attribute values don't live in separate text nodes. This article doesn't use the function data() for element nodes because it would also return the element's descendant text nodes. The predicate (.[self::element()]/text(),.[self::attribute()]/data(.)) is used to achieve your needs. If the current node is an element, its text node is returned using the text() function. If the current node is an attribute, the data() function is used to return its value.

The other major difference from Query 12 introduced in Query 13 is the new column type. In this column, you return the type of the current node (meaning the node kind, not the data type). Consider three types: element, leaf element, and attribute. The distinction between these three types is done based on the following rules:

If the node is of type element() and has child nodes (./* evaluates to true), then it is of type ELEMENT.
If the node is of type element() and has no child nodes (./* evaluates to false), then it is a LEAF-ELEMENT.
In the remaining cases, the node is of type ATTRIBUTE.

Note that these rules are only correct because your XML documents contain no other node types. If our XML data contained comments or processing instructions, then the above rules would have to be extended.

Executing Query 13, you get as a result a relational table containing the triplets name-type-value for all XML nodes in your documents:

TAGNAME          TYPE             VALUE

            ---------------- ---------------- ------------

            dept             ELEMENT          -

            bldg             ATTRIBUTE        101

            employee         ELEMENT          -

            id               ATTRIBUTE        901

            name             ELEMENT          -

            first            LEAF-ELEMENT     John

            last             LEAF-ELEMENT     Doe

            office           LEAF-ELEMENT     344

            salary           LEAF-ELEMENT     55000

            currency         ATTRIBUTE        USD

            employee         ELEMENT          -

            id               ATTRIBUTE        902

            name             ELEMENT          -

            first            LEAF-ELEMENT     Peter

            last             LEAF-ELEMENT     Pan

            office           LEAF-ELEMENT     216

            phone            LEAF-ELEMENT     905-416-5004

            dept             ELEMENT          -

            bldg             ATTRIBUTE        114

            employee         ELEMENT          -

            id               ATTRIBUTE        903

            name             ELEMENT          -

            first            LEAF-ELEMENT     Mary

            last             LEAF-ELEMENT     Jones

            office           LEAF-ELEMENT     415

            phone            LEAF-ELEMENT     905-403-6112

            phone            LEAF-ELEMENT     647-504-4546

            salary           LEAF-ELEMENT     64000

            currency         ATTRIBUTE        USD

            

            29 record(s) selected.

The type column in the result from Query 13 also allows you to count how many nodes of each kind are in your XML documents. This is possible because the result of XMLTABLE is a relational table and can be input to any SQL operations. Adding a count() and a group by clause to Query 13, you get the following query:

Query 14. Count the total number of XML nodes of each type

                

            SELECT X."type", count(X."tagname") as count FROM test ,

            XMLTABLE ('$d//(*, @*)' passing doc as "d" 

            COLUMNS 

            "tagname" VARCHAR(20) PATH  'name()',

            "type" VARCHAR(22) PATH '(if  (self::element()) 

            then (if (./*) then (''NON-LEAF ELEMENT'') else (''LEAF-ELEMENT'')) 

            else (''ATTRIBUTE''))',

            "value" VARCHAR(12)  PATH  '(.[self::element()]/text(),.[self::attribute()]/data(.))'

            ) AS X group by X."type"

Query 14 will generate the following summary table:

type         COUNT

            ------------ -----------

            ATTRIBUTE              7

            NON-LEAF ELEMENT       8

            LEAF-ELEMENT          14

            

            3 record(s) selected.

This makes it easy to get detailed statistics about the XML documents stored in your database. Just like you used the SQL count() and group by clause on top of XMLTABLE, you can apply any SQL capabilities to the rowset that XMLTABLE produces. For example, you use the SQL order by clause to sort the output of the XMLTABLE function.

Summary

The XMLTABLE function is a very versatile tool for processing XML data in SELECT and INSERT statements. In this article, you have seen how XMLTABLE can be deployed to split big XML documents into small ones, to perform simple XML shredding, or to produce a hybrid representation of your XML data. This two-part series on XMLTABLE contains a wide range of SQL/XML statement samples that you may be able to use as templates for some of your own application development.

Acknowledgements

Thanks to Cindy Saracco for her review and helpful comments on this article.

Resources

Learn

"XMLTABLE by example - Part 1: Retrieving XML data in relational format" (developerWorks, September 2007): Learn to use XMLTABLE to retrieve XML data in relational format. Discover how to manage repeating or missing XML elements, and how to handle namespaces in the XMLTABLE function.
"Native XML Support in DB2 Universal Database": Learn more about this important topic.
"Query DB2 XML Data with SQL" (developerWorks, March 2006): Query data stored in XML columns using SQL/XML.
"Query DB2 XML data with XQuery" (developerWorks, April 2006): Query data stored in XML columns using XQuery.
"Querying XML data with namespaces" (developerWorks, November 2006): Walk through several common scenarios to help you learn how to query XML data that contains namespaces.
"pureXML in DB2 9: Which way to query your XML Data?" (developerWorks, June 2006): Find more examples and guidelines for querying XML data.
DB2 pureXML Enablement Wiki: Find a wealth of pureXML resources.
developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
Stay current with developerWorks technical events and webcasts.

Get products and technologies

Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edtion and provides a solid base to build and deploy applications.
Build your next development project with IBM trial software, available for download directly from developerWorks.
Download a free trial version of DB2 Enterprise 9.

Discuss

Participate in developerWorks blogs and get involved in the developerWorks community.

About the authors

XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2

你可能感兴趣的:(XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2)