Parsing XML... backwards?
Written by Michael Day   
Thursday, 15 March 2007


Okay, I’ve heard jokes about people parsing XML files backwards, starting at the end of the file and reporting SAX events in reverse document order, but it seems that someone has actually gone and done it.


 

The justification sounds almost plausible: Adium, an instant messaging client for the Mac, writes out XML message log files and parses them backwards to retrieve the last N messages in roughly constant time for a given N, regardless of how many messages the file contains in total.
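To make the idea concrete, here is a minimal sketch of tail-reading a message log. It is not Adium's code. The log format (a `<chat>` root holding flat `<message>` elements), the function name `last_n_messages`, and the fixed UTF-8 encoding are all assumptions made for illustration, and it simply ignores comments, CDATA and processing instructions.

```python
import io

def last_n_messages(f, n, chunk_size=4096):
    """Return the last n <message> elements of a binary file as strings.

    Reads fixed-size chunks from the end of the file until enough opening
    tags have been seen, so the work depends on the size of the last n
    messages rather than on the size of the whole file.
    """
    # Hypothetical log format: <chat><message ...>...</message>...</chat>
    open_tag, close_tag = b"<message", b"</message>"
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    buf = b""
    while pos > 0 and buf.count(open_tag) < n:
        step = min(chunk_size, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf  # prepend, so tags split across chunks rejoin

    # Collect the offsets of every opening tag seen so far.
    starts = []
    i = buf.find(open_tag)
    while i != -1:
        starts.append(i)
        i = buf.find(open_tag, i + 1)

    messages = []
    for s in (starts[-n:] if n > 0 else []):
        e = buf.find(close_tag, s)
        if e != -1:
            # Assumes the file is UTF-8; slicing at '<' is safe in UTF-8.
            messages.append(buf[s:e + len(close_tag)].decode("utf-8"))
    return messages

log = b'<?xml version="1.0" encoding="UTF-8"?>\n<chat>\n' + b"".join(
    b'<message id="%d">hi %d</message>\n' % (i, i) for i in range(10)
) + b"</chat>\n"
print(last_n_messages(io.BytesIO(log), 2, chunk_size=7))
```

Note how much the sketch has to take on faith about the file: the encoding, the element name and the absence of anything tricky in the content.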

However, it’s crazy to think of doing this for XML in general.

First problem: the document encoding. You don’t know what it is unless you sniff the beginning of the file and read the XML declaration, if present. A specific application may always write out XML in the same encoding and thus not bother to check, but this is not good enough for the general case.
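As a rough illustration of why the encoding has to come from the front of the file, here is a simplified sniffing sketch. The function name `sniff_xml_encoding` is made up for this example, and it only handles byte order marks and ASCII-compatible XML declarations. It does not cover BOM-less UTF-16 or EBCDIC.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(head):
    """Guess the encoding from the first bytes of an XML document."""
    # UTF-32 first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    m = _DECL.match(head)
    if m:
        return m.group(1).decode("ascii")
    # With no BOM and no declaration, XML defaults to UTF-8.
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```

Everything this function looks at sits in the first few dozen bytes, which is exactly the part a backwards parser reaches last.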

Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it’s at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it.

A specific application may decide it’s just not going to handle entities, but that won’t work for XML in general.
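The entity problem is easy to reproduce with an ordinary forward parser by taking the DOCTYPE away, which is roughly the position a backwards parser is in when it meets the reference. The sketch below uses Python's standard `xml.etree.ElementTree`. The `smile` entity, the `<log>` document and the helper `message_text` are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same content, with and without the declaration that defines &smile;.
with_dtd = (
    '<!DOCTYPE log [<!ENTITY smile ":-)">]>'
    "<log><message>hello &smile;</message></log>"
)
without_dtd = "<log><message>hello &smile;</message></log>"

def message_text(doc):
    """Return the text of the first <message>, or None if parsing fails."""
    try:
        return ET.fromstring(doc).find("message").text
    except ET.ParseError:
        # An entity reference with no definition in scope is a fatal error.
        return None

print(message_text(with_dtd))
print(message_text(without_dtd))
```

With the definition in view the reference expands; without it the parser has nothing to go on and gives up.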

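Third problem: comments. Say you’re parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? Not necessarily. Hyphens are legal name characters, so it can just as well be the end of a tag for an element whose name ends in two hyphens, such as <a-->.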

This is a killer for efficient parsing as it means you need potentially unbounded look-ahead (or look-behind, in this case) to decide what something is. (This problem could be avoided if comments were symmetrical and ended with --!>, but XML just wasn’t designed to be parsed backwards).
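A quick way to check that --> really is ambiguous is to feed a forward parser some documents that contain it in different roles. The three sample documents below are made up for illustration: one uses it to close a comment, one to close a tag for an element named `a--`, and one as plain text.

```python
import xml.etree.ElementTree as ET

comment_doc = "<r><!-- note --></r>"  # "-->" closes a comment
tag_doc = "<r><a-->x</a--></r>"       # "-->" closes a tag named "a--"
text_doc = "<r>x --></r>"             # "-->" is ordinary character data

for doc in (comment_doc, tag_doc, text_doc):
    # fromstring raises ParseError if the document is not well-formed.
    root = ET.fromstring(doc)
    print(doc, [child.tag for child in root], repr(root.text))
```

All three parse without complaint, so the three characters on their own tell a backwards parser nothing about which construct it has just entered.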

Fourth problem: processing instructions. As with comments, how can you tell what this is: ?> The problem this time is that text can contain unescaped > characters (as long as they don’t follow ]]), so a backwards parser may need to look a very long way back to tell whether this is the end of a processing instruction or just some text.
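The cost of that look-behind can be sketched with a deliberately naive classifier. The function `classify_pi_close` is hypothetical. It assumes that the nearest preceding < settles the question, which is wrong whenever processing instruction data contains a <, or when the ?> sits inside a comment or CDATA section. Even this optimistic version has to scan back across the whole run of text.

```python
def classify_pi_close(doc, end):
    """Classify the '?>' at index `end` by scanning back to the nearest '<'.

    Returns (kind, distance): kind is "pi" or "text", and distance is the
    number of characters between that '<' and the '?>'.
    """
    if not doc.startswith("?>", end):
        raise ValueError("no '?>' at the given index")
    start = doc.rfind("<", 0, end)
    if start == -1:
        return "text", end
    # Naive rule: a nearest '<' followed by '?' means a processing instruction.
    kind = "pi" if doc.startswith("<?", start) else "text"
    return kind, end - start

pi_doc = "<r><?t d?></r>"
long_text_doc = "<r>" + "x" * 1000 + "?></r>"
print(classify_pi_close(pi_doc, pi_doc.index("?>")))
print(classify_pi_close(long_text_doc, long_text_doc.index("?>")))
```

The distance returned for the second document grows with the length of the text, which is the unbounded look-behind in miniature.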


Last Updated ( Friday, 08 June 2007 )
 