=head1 NAME

WWW::Crawler::Mojo - A web crawling framework for Perl


=head1 SYNOPSIS

    use strict;
    use warnings;
    use utf8;
    use WWW::Crawler::Mojo;
    use 5.10.0;

    my $bot = WWW::Crawler::Mojo->new;

    $bot->on(res => sub {
        my ($bot, $discover, $job, $res) = @_;
        $discover->();
    });

    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        $enqueue->();
    });

    $bot->enqueue('http://example.com/');
    $bot->crawl;


=head1 DESCRIPTION

L<WWW::Crawler::Mojo> is a web crawling framework for those who are familiar
with L<Mojo>::* APIs.

Note that this module is aimed at trivial use cases of crawling within a
moderate range of web pages, so DO NOT use it for persistent crawler jobs.


=head1 ATTRIBUTES

L<WWW::Crawler::Mojo> inherits all attributes from L<Mojo::EventEmitter> and
implements the following new ones.


=head2 ua

A L<Mojo::UserAgent> instance.

    my $ua = $bot->ua;
    $bot->ua(Mojo::UserAgent->new);


=head2 ua_name

Name of the crawler for the User-Agent header.

    $bot->ua_name('my-bot/0.01 (+https://example.com/)');
    say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'


=head2 active_conn

The number of currently active connections.

    $bot->active_conn($bot->active_conn + 1);
    say $bot->active_conn;


=head2 active_conns_per_host

The number of currently active connections per host.

    $bot->active_conns_per_host($bot->active_conns_per_host + 1);
    say $bot->active_conns_per_host;


=head2 depth

The maximum depth to crawl. Note that the depth is the number of HTTP requests
needed to reach a URI, starting from the first job; it is not the depth of the
URI path as delimited by slashes.

    $bot->depth(5);
    say $bot->depth; # 5


=head2 max_conn

The maximum number of concurrent connections.

    $bot->max_conn(5);
    say $bot->max_conn; # 5


=head2 max_conn_per_host

The maximum number of concurrent connections per host.

    $bot->max_conn_per_host(5);
    say $bot->max_conn_per_host; # 5


=head2 peeping_max_length

Maximum length of the peeping API content.

    $bot->peeping_max_length(100000);
    say $bot->peeping_max_length; # 100000


=head2 queue

A FIFO array containing L<WWW::Crawler::Mojo::Job> objects.

    push(@{$bot->queue}, WWW::Crawler::Mojo::Job->new(...));
    my $job = shift @{$bot->queue};


=head1 EVENTS

L<WWW::Crawler::Mojo> inherits all events from L<Mojo::EventEmitter> and
implements the following new ones.


=head2 res

Emitted when the crawler receives a response from a server. Invoke the
C<$discover> callback to collect new links out of the response.

    $bot->on(res => sub {
        my ($bot, $discover, $job, $res) = @_;
        if (...) {
            $discover->();
        } else {
            # DO NOTHING
        }
    });

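
As an illustration, a handler might follow links only from successful HTML
responses. This is a minimal sketch of one possible policy, not part of the
API; C<code> and C<content_type> are standard L<Mojo::Message::Response> and
L<Mojo::Headers> accessors.

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $discover, $job, $res) = @_;

    # Only collect links from successful HTML responses.
    my $type = $res->headers->content_type // '';
    if (($res->code // 0) == 200 && $type =~ m{^text/html}) {
        $discover->();
    }
});
```
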

=head2 refer

Emitted when a new URI is found. You can enqueue the URI conditionally with
the callback.

    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        if (...) {
            $enqueue->();
        } elsif (...) {
            $enqueue->(...); # maybe a different URL
        } else {
            # DO NOTHING
        }
    });

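
For example, a crawler commonly stays on the starting host and skips obvious
binary downloads. The sketch below assumes C<< $job->url >> returns a
L<Mojo::URL> for the newly found job; the host and extension checks are
illustrative policy, not part of the API.

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $bot        = WWW::Crawler::Mojo->new;
my $start_host = 'example.com';    # hypothetical starting host

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    my $url = $job->url;    # assumed to be a Mojo::URL

    return if ($url->host // '') ne $start_host;             # stay on one host
    return if $url->path =~ m/\.(?:zip|pdf|jpe?g|png)$/i;    # skip downloads
    $enqueue->();
});
```
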

=head2 empty

Emitted when the queue becomes empty. The length is checked every 5 seconds.

    $bot->on(empty => sub {
        my ($bot) = @_;
        say "Queue is drained out.";
    });


=head2 error

Emitted when the user agent returns no status code for a request, typically
because of a network error or an unresponsive server.

    $bot->on(error => sub {
        my ($bot, $error, $job) = @_;
        say "error: $error";
        if (...) { # e.g. until the failure occurs 3 times
            $bot->requeue($job);
        }
    });

Note that server errors such as 404 or 500 cannot be caught with this event.
Consider the res event for that use case instead.


=head2 start

Emitted right before the crawl starts.

    $bot->on(start => sub {
        my $self = shift;
        ...
    });


=head1 METHODS

L<WWW::Crawler::Mojo> inherits all methods from L<Mojo::EventEmitter> and
implements the following new ones.


=head2 crawl

Starts the crawling loop.

    $bot->crawl;


=head2 init

Initializes crawler settings.

    $bot->init;


=head2 process_job

Processes a job.

    $bot->process_job;


=head2 say_start

Displays a starting message to STDOUT.

    $bot->say_start;


=head2 peeping_handler

Peeping API dispatcher.

    $bot->peeping_handler($loop, $stream);


=head2 discover

Parses a web page and discovers the links in it. Each link is appended to the
FIFO queue.

    $bot->discover($res, $job);


=head2 enqueue

Appends a job, given as a URI or a L<WWW::Crawler::Mojo::Job> object.

    $bot->enqueue($job);


=head2 requeue

Appends a job for retry.

    $bot->on(error => sub {
        my ($bot, $error, $job) = @_;
        if (...) { # e.g. until the failure occurs 3 times
            $bot->requeue($job);
        }
    });


=head2 collect_urls_html

Collects URLs out of HTML.

    WWW::Crawler::Mojo::collect_urls_html($dom, sub {
        my ($uri, $dom) = @_;
    });


=head2 collect_urls_css

Collects URLs out of CSS.

    WWW::Crawler::Mojo::collect_urls_css($dom, sub {
        my $uri = shift;
    });


=head2 guess_encoding

Guesses the encoding of HTML or CSS from a given L<Mojo::Message::Response>
instance.

    $encode = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';


=head2 resolve_href

Resolves a URL against a base URL.

    WWW::Crawler::Mojo::resolve_href($base, $uri);


=head1 EXAMPLE

L<https://github.com/jamadam/WWW-Flatten>


=head1 AUTHOR

Sugama Keita, E<lt>[email protected]E<gt>


=head1 COPYRIGHT AND LICENSE

Copyright (C) jamadam

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=cut