NAME
   Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or
   XPath expressions

SYNOPSIS
     use URI;
     use Web::Scraper;
     use Encode;

     # First, create your scraper block
     my $authors = scraper {
          # Parse all TDs inside 'table[width="100%"]', store them into
         # an array 'authors'.  We embed other scrapers for each TD.
         process 'table[width="100%"] td', "authors[]" => scraper {
           # And, in each TD,
           # get the URI of "a" element
           process "a", uri => '@href';
           # get text inside "small" element
           process "small", fullname => 'TEXT';
         };
     };

     my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

     # iterate the array 'authors'
     for my $author (@{$res->{authors}}) {
         # output is like:
         # Andy Adler      http://search.cpan.org/~aadler/
         # Aaron K Dancygier       http://search.cpan.org/~aakd/
         # Aamer Akhter    http://search.cpan.org/~aakhter/
         print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
     }

    The structure would resemble this (visually):

      {
        authors => [
          { fullname => $fullname, uri => $uri },
          { fullname => $fullname, uri => $uri },
        ],
      }

DESCRIPTION
   Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent
   Scrapi. It provides a DSL-ish interface for traversing HTML documents
   and returning a neatly arranged Perl data structure.

    The *scraper* and *process* blocks provide a way to define what segments
    of a document to extract. They understand CSS selectors as well as XPath
    expressions.

METHODS
 scraper
     $scraper = scraper { ... };

    Creates a new Web::Scraper object by wrapping the DSL code that will be
    fired when the *scrape* method is called.

 scrape
     $res = $scraper->scrape(URI->new($uri));
     $res = $scraper->scrape($html_content);
     $res = $scraper->scrape(\$html_content);
     $res = $scraper->scrape($http_response);
     $res = $scraper->scrape($html_element);

    Retrieves the HTML from a URI, HTTP::Response, HTML::Tree object or text
    string, creates a DOM object, then fires the scraper callback code to
    retrieve the data structure.

    If you pass a URI or HTTP::Response object, Web::Scraper automatically
    guesses the encoding of the content by looking at the Content-Type
    header and META tags. Otherwise you need to decode the HTML to Unicode
    before passing it to the *scrape* method.
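
    For example, a minimal sketch of decoding raw bytes before scraping
    (the variable names here are illustrative):

      use Encode;

      # decode the raw bytes to a Perl Unicode string first
      my $html = Encode::decode("utf-8", $raw_bytes);
      my $res  = $scraper->scrape($html);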

    You can optionally pass the base URL when you pass the HTML content as a
    string instead of a URI or HTTP::Response object.

     $res = $scraper->scrape($html_content, "http://example.com/foo");

   This way Web::Scraper can resolve the relative links found in the
   document.
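
    For instance, given the base URL above, a relative link in the document
    would come back as an absolute URI (an illustrative sketch):

      # <a href="/baz">baz</a>
      # link => URI->new("http://example.com/baz")
      process "a", link => '@href';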

 process
     scraper {
         process "tag.class", key => 'TEXT';
         process '//tag[contains(@foo, "bar")]', key2 => '@attr';
         process '//comment()', 'comments[]' => 'TEXT';
     };

    *process* is the method to find matching elements in the HTML with a CSS
    selector or XPath expression, then extract text or attributes into the
    result stash.

    If the first argument begins with "//" or "id(" it is treated as an
    XPath expression, otherwise as a CSS selector.

     # <span class="date">2008/12/21</span>
     # date => "2008/12/21"
     process ".date", date => 'TEXT';

     # <div class="body"><a href="http://example.com/">foo</a></div>
     # link => URI->new("http://example.com/")
     process ".body > a", link => '@href';

     # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
     # comment => " HTML Comment here "
     #
      # NOTE: Comment nodes are only accessible when
      # HTML::TreeBuilder::XPath (version >= 0.14) and/or
      # HTML::TreeBuilder::LibXML (version >= 0.13) is installed
     process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

     # <div class="body"><a href="http://example.com/">foo</a></div>
     # link => URI->new("http://example.com/"), text => "foo"
     process ".body > a", link => '@href', text => 'TEXT';

     # <ul><li>foo</li><li>bar</li></ul>
     # list => [ "foo", "bar" ]
     process "li", "list[]" => "TEXT";

     # <ul><li id="1">foo</li><li id="2">bar</li></ul>
     # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ];
     process "li", "list[]" => { id => '@id', text => "TEXT" };

 process_first
   "process_first" is the same as "process" but stops when the first
   matching result is found.

     # <span class="date">2008/12/21</span>
     # <span class="date">2008/12/22</span>
     # date => "2008/12/21"
     process_first ".date", date => 'TEXT';

 result
   "result" allows to return not the default value after processing but a
   single value specified by a key or a hash reference built from several
   keys.

     process 'a', 'want[]' => 'TEXT';
     result 'want';
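
    For example, a sketch of returning a hash reference built from several
    keys (the keys here are illustrative):

      process 'h1', title     => 'TEXT';
      process 'a',  'links[]' => '@href';
      # return only { title => ..., links => [...] } from the stash
      result 'title', 'links';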

EXAMPLES
    There are many examples in the "eg/" directory of this distribution. It
    is recommended to look through them.

NESTED SCRAPERS
    Scrapers can be nested, allowing you to scrape inside data that has
    already been captured.

     # <ul>
     # <li class="foo"><a href="foo1">bar1</a></li>
     # <li class="bar"><a href="foo2">bar2</a></li>
     # <li class="foo"><a href="foo3">bar3</a></li>
     # </ul>
      # friends => [ {href => 'foo1'}, {href => 'foo2'}, {href => 'foo3'} ];
      process 'li', 'friends[]' => scraper {
        process 'a', href => '@href';
      };

FILTERS
   Filters are applied to the result after processing. They can be declared
   as anonymous subroutines or as class names.

     process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
     process $exp, $key => [ 'TEXT', 'Something' ];
     process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

    Filters can be stacked.

     process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \&baz ];

    You can find more about filters in the Web::Scraper::Filter
    documentation.
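
    As a rough sketch, a filter class subclasses Web::Scraper::Filter and
    implements a "filter" method that receives the extracted value (the
    class name matches the "+MyApp::Filter::Foo" example above; the
    substitution is just illustrative):

      package MyApp::Filter::Foo;
      use base qw( Web::Scraper::Filter );

      # called with the extracted value; the return value replaces it
      sub filter {
          my($self, $value) = @_;
          $value =~ s/foo/bar/;
          return $value;
      }

      1;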

XML BACKENDS
    By default HTML::TreeBuilder::XPath is used; this can be replaced with
    an XML::LibXML backend by using the Web::Scraper::LibXML module.

     use Web::Scraper::LibXML;

     # same as Web::Scraper
     my $scraper = scraper { ... };

AUTHOR
   Tatsuhiko Miyagawa <[email protected]>

LICENSE
   This library is free software; you can redistribute it and/or modify it
   under the same terms as Perl itself.

SEE ALSO
   <http://blog.labnotes.org/category/scrapi/>

   HTML::TreeBuilder::XPath