Parsing Invalid Markup with Ruby

Written by JOHN
Friday, 21 September 2007

What do you do if you need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup?

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.

```ruby
require 'rubygems'
require 'rubyful_soup'

invalid_html = 'A lot of <b class=1>tags are <i class=2>never closed.'
soup = BeautifulSoup.new(invalid_html)
puts soup.prettify
# A lot of
# <b class="1">tags are
# <i class="2">never closed.
# </i>
# </b>

soup.b.i                                  # => <i class="2">never closed.</i>
soup.i                                    # => <i class="2">never closed.</i>
soup.find(nil, :attrs=>{'class' => '2'})  # => <i class="2">never closed.</i>
soup.find_all('i')                        # => [<i class="2">never closed.</i>]
soup.b['class']                           # => "1"
soup.find_text(/closed/)                  # => "never closed."
```

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web contain invalid markup.

You can also subclass BeautifulStoneSoup and implement your own heuristics.

Rubyful Soup builds a densely linked model of the entire document, which uses a lot of memory. If you only need to process certain parts of the document, you can implement the SGMLParser hooks yourself and get a faster parser that uses less memory.

```ruby
# LinkGrabber (an SGMLParser subclass) and html are assumed to be defined elsewhere.
extractor = LinkGrabber.new
extractor.feed(html)
```

The equivalent Rubyful Soup program is quicker to write and easier to understand, but it runs more slowly and uses more memory.
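The hook-based approach described above can be sketched without either gem. The example below is only an illustration under stated assumptions: `TinyTagScanner` is a hypothetical stand-in written for this sketch, not the htmltools `SGMLParser` API, and the `do_<tagname>` hook naming is likewise an assumption. It scans for start tags with a forgiving regular expression, never requires closing tags, and calls a hook only for the tags a subclass cares about, so no document model is ever built.

```ruby
require 'set'

# Hypothetical stand-in for an event-based parser (NOT the htmltools API).
# It finds start tags with a regular expression and, for each one, calls
# do_<tagname>(attrs) on itself if a subclass defines that method.
class TinyTagScanner
  START_TAG = %r{<([a-zA-Z][a-zA-Z0-9]*)((?:\s+[^<>]*)?)/?>}
  ATTRIBUTE = /([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/

  def feed(markup)
    markup.scan(START_TAG) do |name, raw_attrs|
      hook = "do_#{name.downcase}"
      next unless respond_to?(hook, true)

      # Build [name, value] pairs; values may be double-quoted,
      # single-quoted, or bare.
      attrs = raw_attrs.to_s.scan(ATTRIBUTE).map do |key, dq, sq, bare|
        [key.downcase, dq || sq || bare]
      end
      send(hook, attrs)
    end
    self
  end
end

# Collects the href of every <a> tag and ignores everything else.
class LinkGrabber < TinyTagScanner
  attr_reader :urls

  def initialize
    @urls = Set.new
  end

  def do_a(attrs)
    pair = attrs.find { |key, _| key == 'href' }
    @urls << pair[1] if pair
  end
end

# Invalid markup: nothing is closed, and quoting and case are inconsistent.
html = '<p>Unclosed <b>bold <a href="http://example.com/">one ' \
       "<a href='/two'>two <A HREF=three>three <a name=anchor>no link"

extractor = LinkGrabber.new
extractor.feed(html)
p extractor.urls.to_a
# => ["http://example.com/", "/two", "three"]
```

A regular-expression scanner like this one is deliberately crude (for instance, a `>` inside a quoted attribute value would confuse it), which is the kind of trade-off a custom parser makes in exchange for speed and low memory use.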

Last Updated: Friday, 21 September 2007







