    use Labrador::Crawler::CrawlerHandler;

    my $manager = Labrador::Crawler::CrawlerHandler->new(
        config     => $config,
        dispatcher => 'sibu.dcs.gla.ac.uk',
        port       => 2680);
    $manager->run();
    exit;
Communicates with the dispatcher, generally managing this crawler.
This module is affected by the following configuration file directives:
Automatically called from the constructor.
Commences crawling; does not return until crawling ends. Fetches the next URL from $manager->next_url(), checks it for robots.txt compliance, and crawls it. Syncs statistics with the dispatcher when enough time has elapsed. Also implements a retry mechanism: each time it finds no work to do, it backs off for 5 seconds before trying again, up to work_retries times.
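The retry loop described above can be sketched as follows. This is illustrative only: the robots_allowed() and crawl() method names, and the $backoff parameter (added here so the delay is configurable), are assumptions, not the module's actual API, and the periodic stats sync is omitted for brevity.

```perl
use strict;
use warnings;

# A minimal sketch of run(): pull URLs until there is repeatedly no work.
sub run_sketch {
    my ($manager, $work_retries, $backoff) = @_;
    $backoff = 5 unless defined $backoff;   # seconds to wait when idle

    my $retries = 0;
    while ($retries < $work_retries) {
        my $url = $manager->next_url();
        unless (defined $url) {
            $retries++;                     # no work: back off and retry
            sleep $backoff;
            next;
        }
        $retries = 0;                       # found work, reset the counter
        next unless $manager->robots_allowed($url);  # hypothetical check
        $manager->crawl($url);                       # hypothetical fetch
    }
}
```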
Probably the most important subroutine in the entire crawler. Event called by the agent when it has successfully retrieved $url. All sorts of useful information is wrapped up in $document; for more information, see the documentation for Labrador::Crawler::Document.
Event method called when the Agent finds that $url redirects to another URL. More information can be found in $HTTPresponse.
The Manager treats a redirection as a page with only one URL on it.
Event method called when the Agent fails to retrieve $url. More information can be found in $HTTPresponse.
Checks whether $url is allowed by the robots.txt file on that host (if any). Will check its own disk cache, then the dispatcher's disk cache, and finally retrieve the robots.txt itself if necessary.
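The lookup order described above can be sketched as a simple cascade. The three lookup names are placeholders for the module's local disk cache, the dispatcher's cache, and a live fetch; they are not the module's real internals.

```perl
use strict;
use warnings;

# Try each robots.txt source in order, returning the first hit.
sub robots_rules_for {
    my ($host, %lookup) = @_;
    for my $source (qw(local_cache dispatcher_cache live_fetch)) {
        my $rules = $lookup{$source}->($host);
        return $rules if defined $rules;
    }
    return undef;   # no robots.txt anywhere: nothing is disallowed
}
```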
Runs $url through each of the loaded URLFilters. Returns 1 if the URL is allowed, 0 otherwise.
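A filter chain like the one described above can be sketched as below. The accept() method on each filter is a hypothetical interface, assumed here for illustration.

```perl
use strict;
use warnings;

# Return 1 if every loaded URLFilter accepts the URL, 0 otherwise.
sub filter_url {
    my ($url, @filters) = @_;
    for my $filter (@filters) {
        return 0 unless $filter->accept($url);   # first rejection wins
    }
    return 1;
}
```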
Runs the document through all loaded content filters, altering the values of $privs as it progresses. Note that this short-circuits: it terminates early once all privilege values are zero.
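The short-circuit described above might look like the following sketch; the filters are modelled here as code references and the %$privs keys are hypothetical.

```perl
use strict;
use warnings;

# Apply each content filter in turn, stopping as soon as every
# privilege value has been reduced to zero.
sub apply_content_filters {
    my ($document, $privs, @filters) = @_;
    for my $filter (@filters) {
        $filter->($document, $privs);
        # no point running further filters once all privileges are revoked
        return unless grep { $_ } values %$privs;
    }
}
```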
Shuts down the crawler. Simplistically undefs each of the object's fields.
As usual, these are only documented for completeness, and should not be directly called.
Returns 1 if all items of @array are false. (Remember that false means 0, the empty string, or undef.)
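One way to implement the predicate described above (the subroutine's actual name and body inside the module may differ):

```perl
use strict;
use warnings;

# True (1) when no element of @array is true; 0 otherwise.
sub all_false {
    my @array = @_;
    return grep({ $_ } @array) ? 0 : 1;   # any true element => not all false
}
```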
Load the module called $name.
$Revision: 1.12 $