Managing XML data: Native XML databases
When your only tool is a hammer, everything looks like a nail. When your only tool is a relational database, everything looks like a table. Reality, however, is more complicated than that. Data often isn't tabular and can benefit from a tool that more closely fits its natural structure. When that data is XML, the appropriate tool for managing it might well be a native XML database. For many classes of applications with significant XML processing needs, a native XML database is a very powerful tool. Explore the nature of native XML databases and get some general ideas about what to expect from this new tool in the developer's toolbox.
Relational databases in general, and SQL databases in particular, have been so incredibly successful that they've almost completely eliminated the competition, at least in mind share if not always in actual installations. (A lot of data is still locked up in hierarchical, big iron databases like IMS™, and quite a bit more is stored in lower-end, non-SQL databases like FileMaker.) However, although relational databases fit a lot of problems very well, they don't really fit XML documents, at least not in their full generality. While you can shred an XML document enough to stuff it into a relational table or just treat it as one big blob, neither approach really lends itself to indexing and fast queries. In practice, shredding also tends to lose information such as element order, processing instructions, comments, white space, and other details that are important in the many applications in which XML documents don't look exactly like serialized tables in the first place. Field and record boundaries just don't match the boundaries of an XML document. Applications such as publishing systems that care about these details need to look beyond the relational database for their information storage needs.
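To make the shredding problem concrete, here is a minimal sketch in Python using only the standard library. The `shred` function and the sample paragraph are hypothetical and are not taken from any real product; they simply map each child element to a column, which is roughly what a naive XML-to-table mapping does.

```python
import xml.etree.ElementTree as ET

# A document-style fragment with mixed content, a comment, and a repeated element.
SOURCE = (
    "<para>Use <code>append</code> first,<!-- check this --> "
    "then <code>remove</code>.</para>"
)

def shred(element):
    """Naively shred an element into a flat record: one column per child tag."""
    record = {"_text": (element.text or "").strip()}
    for child in element:
        # A repeated tag overwrites the earlier value, and the text that
        # follows each child (child.tail) has no column to live in.
        record[child.tag] = child.text
    return record

row = shred(ET.fromstring(SOURCE))
print(row)
```

The resulting record keeps only the leading text and the last `code` value. The first `code` element, the comment, the interleaved text, and the order of everything are gone, which is exactly the kind of loss a publishing application cannot tolerate.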
Traditionally, information that doesn't naturally fit into tables has been stored in a file system. However, that approach is showing its age and probably should have been abandoned years ago. A great deal of data is now being encoded in XML, and more is being created every day. However, many people dump these XML documents into file systems without giving much thought to managing the superstructures formed by the document collections (as distinct from the internal structures of each separate document). It's time for something better.
Consequently, various vendors have released native XML databases. A native XML database is one that treats XML documents and elements as the fundamental structures rather than tables, records, and fields. Such a database enables developers to use tools and languages that more naturally fit the structure of the documents they're working with, thereby enhancing productivity. It is also widely believed (if not exactly proven) that native XML databases can significantly outperform traditional relational databases for tasks that involve heavy document processing, such as newspaper publishing, Web site management, and Web services.
Database models
Relational databases have a well-understood mathematical theory behind them as laid down 30 years ago by E. F. Codd, and expanded and expounded upon in the decades since by C. J. Date and others. Implementations don't always (okay, never) follow the theory precisely. However, the theory does provide the community with a reasonably shared understanding of what the phrase "relational database" means. The understanding is even clearer if you say "SQL database," because an ISO standard lays out exactly what such a database must provide.
The situation in the world of native XML databases is much murkier, in part because they're still in development. Standards are just being developed, and they cover only part of what's needed to interface with such a database. In fact, most of what I say here about native XML databases won't apply to all products called "native XML databases." Nonetheless, the smoke from the initial volleys is beginning to clear, and I can begin to make some general statements that are at least mostly true about most XML databases, even if exceptions aren't hard to find.
I'll begin with a comparison of the XML model and the relational model, as Table 1 shows. I should say that this is a comparison of an XML model to the relational model, because although the relational model is fairly well defined (even if not always precisely implemented), the XML model has no such standard, de facto or de jure. Still, Table 1 is a reasonable, rough outline of what you can expect.
Table 1. Relational databases compared to XML databases
Relational database | XML database |
A relational database contains tables. | An XML database contains collections. |
A relational table contains records with the same schema. | A collection contains XML documents with the same schema. |
A relational record is an unordered list of named values. | An XML document is a tree of nodes. |
A SQL query returns an unordered set of records. | An XQuery returns an ordered sequence of nodes. |
Implementations differ on each of these points. Some native XML databases don't really have a notion of collection. Some databases allow a collection to support several schemas. A few low-end databases don't support schemas at all. (Such databases are more useful than you might expect -- after all, you tend to care more about the instance documents than the schemas.) Other, mostly early, products supported only DTDs. Currently, W3C XML Schema is the most commonly supported schema language among native XML databases. Indeed, the needs of databases -- both traditional relational and native XML -- were major drivers in the design of the W3C XML Schema language. However, widespread dissatisfaction with that language is causing a few vendors to start thinking about RELAX NG, though I've yet to see it implemented in any actual products.
In most XML databases, the fundamental unit is the XML document, which roughly corresponds to a record in a traditional database. One big advantage of a native XML database is that it can run queries that combine (or join, in SQL parlance) information contained in multiple XML documents. The need to query multiple documents explains the design of XQuery, the developing query language for native XML documents, which is in turn based on XPath 2. In fact, the ability to query multiple documents is probably the single most fundamental difference between XPath 1 and XPath 2/XQuery. What SQL is to relational databases, XQuery is to native XML databases.
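As a rough illustration of what joining across documents means, the sketch below performs such a join by hand in Python. The two documents, their element names, and the `author`/`id` attributes are made up for this example. A native XML database would express the same thing as a single XQuery and would use its indexes rather than a dictionary built on the fly.

```python
import xml.etree.ElementTree as ET

# Two separate documents, as they might sit in one collection (made-up data).
BOOKS = ET.fromstring(
    "<Books>"
    "<Book author='a1'><Title>Title One</Title></Book>"
    "<Book author='a2'><Title>Title Two</Title></Book>"
    "<Book author='a1'><Title>Title Three</Title></Book>"
    "</Books>"
)
AUTHORS = ET.fromstring(
    "<Authors>"
    "<Author id='a1'><Surname>Harold</Surname></Author>"
    "<Author id='a2'><Surname>Smith</Surname></Author>"
    "</Authors>"
)

def join_titles_to_surnames(books, authors):
    """Pair every book title with the surname of the author it points to."""
    # Build a lookup from the second document first ...
    surname_by_id = {a.get("id"): a.findtext("Surname") for a in authors.iter("Author")}
    # ... then walk the first document in document order.
    return [
        (b.findtext("Title"), surname_by_id.get(b.get("author")))
        for b in books.iter("Book")
    ]

pairs = join_titles_to_surnames(BOOKS, AUTHORS)
print(pairs)
```

Note that the result is an ordered list that follows the document order of the books, in line with the last row of Table 1.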
However, XQuery does not do as much as SQL does. Whereas SQL has four fundamental operations -- SELECT, INSERT, UPDATE, and DELETE -- as well as some lesser commands for creating and dropping tables and users, XQuery really starts and stops with SELECT. XQuery lets you retrieve information from an XML database, but that's it. It can't add documents to the database, delete documents from the database, modify existing documents, or do anything else, which is a pretty gaping hole in its capabilities.
For the moment, most native XML databases fill this hole in various, proprietary ways, often implemented as an XQuery extension. The closest thing to a standard in this space (close only in the sense that the moon is closer to Brooklyn than Jupiter is) is XUpdate. XUpdate is implemented by dbXML, eXist, and X-Hive/DB, among other products. For example, here's a simple XUpdate that adds a <MiddleName>Rusty</MiddleName> element to every Author element that has a Surname child element with the value "Harold":
<xupdate:append select="//Author[Surname='Harold']">
  <xupdate:element name="MiddleName">Rusty</xupdate:element>
</xupdate:append>
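For readers without an XUpdate-capable database at hand, the effect of the update described above can be approximated in plain Python. This is only a sketch of the semantics, not an XUpdate processor: the `append_middle_name` function and the sample document are hypothetical, and the selection relies on the limited XPath subset that the standard library's ElementTree supports.

```python
import xml.etree.ElementTree as ET

def append_middle_name(root, surname, middle_name):
    """Append <MiddleName> as the last child of every matching Author.

    Returns the number of Author elements that were modified.
    Assumes `surname` contains no quote characters.
    """
    matches = root.findall(f".//Author[Surname='{surname}']")
    for author in matches:
        ET.SubElement(author, "MiddleName").text = middle_name
    return len(matches)

# Made-up sample document for illustration.
doc = ET.fromstring(
    "<Authors>"
    "<Author><GivenName>Elliotte</GivenName><Surname>Harold</Surname></Author>"
    "<Author><GivenName>Chris</GivenName><Surname>Date</Surname></Author>"
    "</Authors>"
)
changed = append_middle_name(doc, "Harold", "Rusty")
print(changed, ET.tostring(doc, encoding="unicode"))
```

Only the first Author gains a MiddleName child, placed after Surname; the second is left untouched.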
However, XUpdate is just one possibility, and the native XML databases that don't implement it outnumber those that do. A lot of work remains before XUpdate becomes a serious contender, and it really hasn't advanced in the past four years. Longer term, the W3C XQuery working group is expected to add update facilities to XQuery. However, work on this has just begun. So far, only requirements have been published. The group hasn't even published a proposal for the actual syntax of the language. Given that just implementing the XML equivalent of SELECT has taken the same group five years and counting, I'm not holding my breath.
Benefits of native XML databases
Given that the current state of native XML databases is so unsettled, why might you consider using one? Well, possibly for the same reasons you might have considered using relational databases in the early 1980s. Twenty-five years ago, relational databases were slow, buggy, nonstandard memory hogs. Nonetheless, they still had a lot of advantages compared to traditional systems, and they only got better over time.
Today's native XML databases are certainly nonstandard. Some of them, perhaps most, are also slow memory hogs -- though 25 years of Moore's law have made that particular problem less noticeable. How buggy they are varies a lot from one product to the next. Some are ready to go into production today, and others I wouldn't trust to manage a grocery list. However, if you've got a lot of XML data to manage, the technology has some real advantages that may make it worth your time to evaluate the current crop of products.
Everything is in one place
The most important (and most often overlooked) advantage of a native XML database is simply that it keeps all your content in one easily searched, easily managed place. You don't need to worry about file-naming conventions or directory structures -- everything's in the database. All you have to do to get the information out is make a query. File systems are adequate (barely) for single-user systems, and even for those systems, traditional file systems are showing their age. Companies like Apple and Microsoft® are slowly moving the foundations of their operating systems to more database-like structures. For data that is accessed and edited by many different users with varying levels of privilege across heterogeneous systems, a database of some kind is the only option. Today, too much critical data is stored in Microsoft Word files and Excel spreadsheets on the Chief Executive Officer's laptop or in the lead programmer's personal CVS repository. Some (not all) of this information can plausibly be stored in a centralized, database-backed repository. Besides making it possible to find the information when you need it, doing this also enables centralized, professionally managed redundant systems and backups. By storing content in a database, you can avoid losing every draft of a seven-figure proposal and all its supporting documents when your boss leaves his unbacked-up laptop in a taxi.
Multiple views of the same data
A related advantage of storing data in a database is that doing so enables multiple views of the same content. For instance, the version of a proposal you show to the internal team might contain content about anticipated cost structures and profit margins that you might not want to make available to the company whose business you're bidding for. Of course, this advantage is hardly unique to native XML databases. Relational databases do this very well, too. However, it's still worth mentioning. Perhaps the special advantage of a native XML database in this case is that the final report itself becomes just another database query, rather than something produced by a nonstandard tool such as Crystal Reports operating over the output of the query.
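The idea of a view as just another query over the same stored document can be sketched as follows. The `audience` attribute and the `external_view` function are assumptions made for this illustration; a real database would define views through its own query language and access controls.

```python
import copy
import xml.etree.ElementTree as ET

# One stored proposal; the audience='internal' marker is a made-up convention.
PROPOSAL = ET.fromstring(
    "<Proposal>"
    "<Summary>Scope of work</Summary>"
    "<Costs audience='internal'>Anticipated cost structure</Costs>"
    "<Schedule>Twelve weeks</Schedule>"
    "</Proposal>"
)

def external_view(root):
    """Return a copy of the document with internal-only elements removed."""
    view = copy.deepcopy(root)  # the stored document itself is never modified
    for parent in list(view.iter()):
        for child in list(parent):
            if child.get("audience") == "internal":
                parent.remove(child)
    return view

client_copy = external_view(PROPOSAL)
print(ET.tostring(client_copy, encoding="unicode"))
```

The internal team reads the stored document as is, while the client sees only the Summary and Schedule elements.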
Beyond the advantages that are inherent in any database system -- relational, native XML, or otherwise -- using a database specifically for processing XML has several advantages.
Performance
The first advantage is performance. Queries over a well-designed, well-implemented native XML database are simply faster than queries over documents stored in a file system, and for several reasons. First, the database can do all sorts of indexing tricks to operate quickly. For instance, it can maintain a table of all the ID values in a document so that it can jump right to the element with a certain ID rather than having to walk the tree looking for it, as a non-database tool such as the Jaxen XPath engine does. The database can assign sequence numbers to each node so that it knows the position of each node and can compare the document order of two nodes in constant time.
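The two indexing tricks just mentioned can be sketched in a few lines of Python. This is a toy, in-memory illustration and not how any particular product works: the `IndexedDocument` class is hypothetical, and it treats any attribute literally named `id` as an ID, whereas a real database would take ID types from the DTD or schema.

```python
import xml.etree.ElementTree as ET

class IndexedDocument:
    """Pays a one-time cost at load time to make two lookups cheap later."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        self.by_id = {}   # ID value -> element
        self.order = {}   # element -> sequence number in document order
        # iter() visits elements in document order, so one pass builds both indexes.
        for seq, node in enumerate(self.root.iter()):
            self.order[node] = seq
            node_id = node.get("id")
            if node_id is not None:
                self.by_id[node_id] = node

    def element_by_id(self, node_id):
        """Jump straight to an element instead of walking the tree."""
        return self.by_id.get(node_id)

    def precedes(self, a, b):
        """Compare document order of two elements with two dictionary lookups."""
        return self.order[a] < self.order[b]

doc = IndexedDocument(
    "<book><chapter id='c1'><para id='p1'/></chapter><chapter id='c2'/></book>"
)
c1, p1, c2 = (doc.element_by_id(i) for i in ("c1", "p1", "c2"))
print(doc.precedes(p1, c2))
```

Both indexes are built once, when the document is loaded, which is the same trade of insertion cost for query speed that a database makes on disk.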
The next reason is that the database has essentially pre-parsed each document when storing it. Therefore, it doesn't need to check each document that the query has accessed for well-formedness, or build an object model representing that document. All these details are already inside the database in a form the query engine can use.
XML databases use a lot of other tricks to optimize performance. A few of these (smart query rewriting, for example) are available to tools that aren't backed by a database. However, the biggest performance wins come from trading insertion and update speed for query speed. The database does more work when it adds or modifies a document, stores the result of that work, and then uses the results to run lightning-fast queries. If queries are significantly more frequent than insertions and updates -- as they tend to be in many applications -- then the extra cost paid to put documents into the database is more than earned back when retrieving them.
Very large documents
The second advantage that native XML databases have over non-database systems is document size. Because databases can be disk backed, they can essentially process arbitrarily large documents. Streaming tools like SAX and System.Xml.XmlReader can do this too, but tree-based tools like XSLT, XPath, and DOM tend to self-limit when documents hit approximately 100 megabytes. Native XML databases allow XSLT, XQuery, DOM, and so forth, to process arbitrarily large documents.
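For comparison, this is roughly what the streaming approach looks like outside a database, sketched with Python's standard `iterparse`. The `count_elements` function and the sample data are hypothetical. The point is that the code never holds the full tree in memory, but it also cannot run arbitrary tree-based queries the way XPath or XSLT can.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count elements named `tag` while reading the document incrementally."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's text, attributes, and children once it is finished.
        # The emptied elements still hang off their parent, so a production
        # version would also detach them to keep memory flat.
        elem.clear()
    return count

# Made-up data; a real source would be a large file opened in binary mode.
data = b"<log>" + b"<entry>x</entry>" * 1000 + b"<other/></log>"
total = count_elements(io.BytesIO(data), "entry")
print(total)
```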
Not one bit is lost
A final advantage of some (though not all) native XML databases is worth mentioning. They can retrieve the original, unparsed document, character for character or even byte for byte. This functionality is critical in certain legal situations where you need to reproduce the exact, original document down to the last byte. This functionality can also be important in software development, particularly in bug tracking and performance optimization. In these cases, seemingly irrelevant details that shouldn't matter sometimes do. It's important to make sure that the database doesn't change the two bytes in a 10-megabyte document that actually trigger the bug. Parser-based solutions, including systems that shred XML documents before storing them in relational databases, tend to lose some things like white space inside tags, numeric character references, and other normally irrelevant details. Ninety-nine percent of the time, you don't care about these arcana, but if you find yourself in the one percent of cases in which this stuff matters, it's worth looking for a database that preserves it.
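The loss is easy to demonstrate with any parser-based round trip. The sketch below uses Python's standard ElementTree on a made-up one-line document; other parsers and serializers differ in the details, but the general effect is the same.

```python
import xml.etree.ElementTree as ET

# Extra white space inside the start tag, a numeric character reference,
# and a comment: all legal XML, all normally irrelevant.
ORIGINAL = '<doc  id = "d1" >&#65;BC<!-- draft --></doc>'

# Parse, then serialize again, as a parser-based store would on retrieval.
round_tripped = ET.tostring(ET.fromstring(ORIGINAL), encoding="unicode")

print(ORIGINAL)
print(round_tripped)
print(round_tripped == ORIGINAL)
```

The round-tripped text is an equivalent document for most purposes, but it is not the same sequence of characters: the white space in the tag has been normalized, the character reference has been resolved, and the comment has been dropped by the default parser.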
Looking forward
The more data you have, the more important it becomes to use some sort of database system to manage it. If the data is XML, a solid native XML database is an obvious choice. The question then becomes where you can find such a system. Given that some of the foundation technologies like XQuery are at least a year away from completion and others are barely getting started, you might question whether you really can find a stable system at this time. Still, the benefits are enough for you to consider moving to a native XML database anyway, as long as you adopt it with the full knowledge that you're going to pay the upgrade costs in the future, either in time, in money, or both.
Resources
- Read Chris Date's book, An Introduction to Database Systems, the standard introduction to the relational model. Date is probably the best advocate of the relational-is-the-one-true-data-model position. The book's eighth edition now includes (grudgingly) a chapter about XML written by IBM's Nick Tindall.
- Get a solid introduction to using XML with various types of database systems at Ronald Bourret's site.
- James Gosling describes the double bump adoption curve now being experienced by native XML databases in "Phase Relationships in the Standardization Process". An alternate way of looking at this is that native XML databases are now "crossing the chasm," as described in the book of the same name by Geoffrey A. Moore.
- Read the XUpdate specification.
- Printed out, the W3C's XQuery specs run to hundreds of pages. I recommend starting with the XML Query Use Cases.
- Check out the open source eXist, which is probably the most widely deployed native XML database, though it has some performance issues.
- Find out more about the Mark Logic Content Interaction Server, probably the hottest closed source XML database right now. Whether it's the most worthy remains to be seen.
- Take a look at dbXML 2.0 from the dbXML Group and Berkeley DB XML 2.1 from Sleepycat Software. Although their names are confusingly similar, they are probably the most robust open source native XML databases available today.
- Read the previous installments of Elliotte Rusty Harold's Managing XML data column here on developerWorks.