Parsing XML... backwards?
Written by Michael Day
Thursday, 15 March 2007
The justification for parsing XML backwards sounds almost plausible: an instant messaging client (Adium on the Mac) writes out XML message log files and uses backwards parsing to retrieve the last N messages in constant time, regardless of how many messages the file contains in total. However, it’s crazy to think of doing this for XML in general.

First problem: the document encoding. You don’t know what it is unless you sniff the beginning of the file and read the XML declaration, if present. A specific application may always write out XML in the same encoding and thus not bother to check, but this is not good enough for the general case.

Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it’s at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it. A specific application may decide it’s just not going to handle entities, but that won’t work for XML in general.

Third problem: comments. Say you’re parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? Wrong, it could just as well be the end of a tag, because XML element names may contain hyphens. This is a killer for efficient parsing, as it means you need potentially unbounded look-ahead (or look-behind, in this case) to decide what something is. (This problem could be avoided if comments were symmetrical and ended with --!>, but XML just wasn’t designed to be parsed backwards.)

Fourth problem: processing instructions. As with comments, how can you tell what a trailing ?> is? The problem this time is that text can contain unescaped > characters (as long as they don’t follow ]]), so a backwards parser may need to look a very long way ahead to tell whether this is the end of a processing instruction or just some text.
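The comment ambiguity can be made concrete with a small sketch. It is not taken from the article: the two sample documents and the helper name `common_suffix` are made up for illustration, and Python's standard `xml.etree.ElementTree` stands in for an ordinary forwards parser. Both documents are well-formed and end in exactly the same characters, yet in one the --> closes a comment and in the other it closes an end tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this illustration.
# In the first, "-->" closes a comment.
COMMENT_DOC = "<log><!-- status: ok --></log>"
# In the second, "-->" closes the end tag of an element named "status--"
# (hyphens are legal name characters after the first character).
ELEMENT_DOC = "<log><status--></status--></log>"


def common_suffix(a: str, b: str) -> str:
    """Return the longest string that both a and b end with."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


# What a backwards reader sees first is identical in both files.
shared_tail = common_suffix(COMMENT_DOC, ELEMENT_DOC)  # "--></log>"

# A forwards parser has no trouble telling the two apart.
comment_root = ET.fromstring(COMMENT_DOC)   # comments are dropped by default
element_root = ET.fromstring(ELEMENT_DOC)

if __name__ == "__main__":
    print("shared tail:", shared_tail)
    print("children in comment doc:", len(comment_root))
    print("child tag in element doc:", element_root[0].tag)
```

A backwards parser that has read only the shared tail cannot tell which case it is in. It has to keep reading towards the start of the file until it finds an opening delimiter, and since a comment or a name can be arbitrarily long, that distance has no fixed bound.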
Last updated: Friday, 08 June 2007







