NAME

Labrador::Crawler::CrawlerHandler - communicates with the dispatcher and manages this crawler

SYNOPSIS

	use Labrador::Crawler::CrawlerHandler;
	my $manager = Labrador::Crawler::CrawlerHandler->new(
		config => $config,
		dispatcher => 'sibu.dcs.gla.ac.uk',
		port => 2680);
	$manager->run();
	exit;

DESCRIPTION

Communicates with the dispatcher and manages the overall operation of this crawler instance.

CONFIGURATION

This module is affected by the following configuration file directives (an illustrative snippet follows the list):

SpiderSyncStats
SpiderName
FollowTagLinks
AllowedProtocols
ContentHandler
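
For illustration only, these directives might appear in a configuration file along the following lines. The one-directive-per-line syntax and every value shown here are assumptions, not taken from a real Labrador configuration.

	# Hypothetical example; syntax and values are illustrative only
	SpiderName         ExampleCrawler/1.0
	SpiderSyncStats    30
	FollowTagLinks     a frame area
	AllowedProtocols   http https
	ContentHandler     Labrador::Crawler::ContentHandler::Example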

METHODS

new(%params)

Pure instantiation: constructs the object and then calls init(%params), where all the real setup happens.

init(%params)

Automatically called from the constructor. Performs the following setup
(a regex-building sketch follows the list):

Instantiates Labrador::Crawler::Agent
Instantiates WWW::RobotRules
Instantiates Labrador::Common::RobotsCache
Loads the file name extension blacklist into a regex
Loads the content type whitelists into regexes
Loads the follow tag whitelist into a regex
Loads the protocol scheme whitelist into a regex
Instantiates all ContentHandlers
Sets the default number of retries
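
As a rough sketch of the regex-building steps, assuming the extension blacklist arrives from the configuration as a simple list (the list contents below are hypothetical):

	# Hypothetical sketch: collapse a list of banned extensions into one
	# compiled, case-insensitive regex anchored at the end of the path.
	my @blacklist = qw(exe zip gz iso);
	my $alternation = join '|', map { quotemeta } @blacklist;
	my $blacklist_regex = qr/\.(?:$alternation)$/i;

	# A candidate URL path can then be tested cheaply:
	print "skipping\n" if '/files/setup.exe' =~ $blacklist_regex;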

run

Commence crawling. Doesn't return until crawling ends. Fetches the next URL from $manager->next_url(), checks it for robots.txt compliance, and crawls it. Syncs stats with the dispatcher if enough time has elapsed. Also implements a retry mechanism: it backs off for 5 seconds each time it finds no work to do, up to work_retries times.
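
A simplified sketch of that loop is shown below. next_url() and url_robots_allowed() are documented in this file; crawl(), sync_stats() and the default values are assumed names chosen for illustration.

	sub run {
		my $self = shift;
		my ($work_retries, $sync_interval) = (5, 30);   # assumed defaults
		my ($retries, $last_sync) = (0, time);
		while ($retries < $work_retries) {
			if (defined(my $url = $self->next_url())) {
				$retries = 0;
				$self->crawl($url) if $self->url_robots_allowed($url);
			} else {
				$retries++;
				sleep 5;   # back off when there is no work to do
			}
			if (time - $last_sync > $sync_interval) {
				$self->sync_stats();   # assumed name
				$last_sync = time;
			}
		}
	}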

agent_success($url, $document)

Probably the most important subroutine in the entire crawler. Event method called by the Agent when it has successfully retrieved $url. All sorts of useful information is wrapped up in $document; for more information, see the documentation for Labrador::Crawler::Document.
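
A heavily simplified, hypothetical handler body is sketched below. $document->links(), queue_url() and the $privs layout are assumptions; run_content_filters() and run_url_filters() are documented later in this file.

	sub agent_success {
		my ($self, $url, $document) = @_;
		my %privs = (follow => 1, index => 1);   # assumed privilege layout
		$self->run_content_filters($document, \%privs);
		return unless $privs{follow};
		foreach my $link ($document->links()) {          # assumed accessor
			$self->queue_url($link) if $self->run_url_filters($link);
		}
	}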

agent_redirect($url, $HTTPresponse)

Event method called when the Agent finds that $url redirects to another URL. More information can be found in $HTTPresponse.

The Manager treats a redirection as a page with only one URL on it.
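
Assuming $HTTPresponse is an HTTP::Response object, the target can be read from its Location header; treating it as the page's single link might look like the following sketch, where queue_url() is an assumed name.

	use URI;

	sub agent_redirect {
		my ($self, $url, $HTTPresponse) = @_;
		my $target = $HTTPresponse->header('Location');
		return unless defined $target;
		# Resolve a relative Location against the original URL
		$target = URI->new_abs($target, $url)->as_string;
		$self->queue_url($target) if $self->run_url_filters($target);
	}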

agent_failure($url, $HTTPresponse)

Event method called when the Agent fails to retrieve $url. More information can be found in $HTTPresponse.
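
A minimal sketch of what such a handler might do, again assuming $HTTPresponse is an HTTP::Response object; the retry bookkeeping and requeue_url() are purely hypothetical.

	sub agent_failure {
		my ($self, $url, $HTTPresponse) = @_;
		warn 'Failed to fetch ', $url, ': ', $HTTPresponse->status_line, "\n";
		# A transient (5xx) error might be worth retrying later
		$self->requeue_url($url) if $HTTPresponse->code >= 500;   # assumed name
	}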

url_robots_allowed($url)

Checks whether $url is allowed by the robots.txt file on that host (if any). Will check its own disk cache, the dispatcher's disk cache, and finally retrieve the robots.txt if necessary.
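
A rough sketch of that tiered lookup, using the WWW::RobotRules object created in init(); the field names and cache/fetch methods shown are assumptions rather than the real Labrador::Common::RobotsCache API.

	use URI;

	sub url_robots_allowed {
		my ($self, $url) = @_;
		my $host = URI->new($url)->host;
		my $robots_txt = $self->{robotscache}->fetch($host)            # local disk cache (assumed API)
			// $self->dispatcher_robots_txt($host)                 # dispatcher's cache (assumed API)
			// $self->{agent}->fetch("http://$host/robots.txt");   # last resort (assumed API)
		$self->{robotrules}->parse("http://$host/robots.txt", $robots_txt)
			if defined $robots_txt;
		return $self->{robotrules}->allowed($url);
	}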

run_url_filters($url)

Runs $url through each of the loaded URLFilters. Returns 1 if the URL is allowed, 0 otherwise.
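
A sketch of that check: the URL must pass every loaded filter to be crawled. The urlfilters field and the filter_url() method name are assumptions, not the real URLFilter interface.

	sub run_url_filters {
		my ($self, $url) = @_;
		foreach my $filter (@{ $self->{urlfilters} }) {
			return 0 unless $filter->filter_url($url);   # assumed method
		}
		return 1;
	}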

run_content_filters($document, $privs)

Runs the document through all loaded content filters, altering the values of $privs as it progresses. Note that this short-circuits: it terminates as soon as all privilege values are zero.
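
A sketch of the short-circuiting loop, assuming $privs is a hash reference and that each filter exposes a filter_content() method (both assumptions); _all_zero() is documented below.

	sub run_content_filters {
		my ($self, $document, $privs) = @_;
		foreach my $filter (@{ $self->{contentfilters} }) {
			$filter->filter_content($document, $privs);   # assumed method
			# No point continuing once every privilege is gone
			last if _all_zero(values %$privs);
		}
	}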

shutdown()

Shuts down the crawler. Simplistically undefs each of the object's fields.
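
A sketch of the described behaviour: drop every field so the contained objects (agent, caches, handlers) can be garbage collected.

	sub shutdown {
		my $self = shift;
		$self->{$_} = undef for keys %$self;
	}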

PRIVATE METHODS

As usual, these are documented only for completeness and should not be called directly.

_all_zero(@array)

Returns 1 if all items of @array are false. (Remember false means 0 or "" or undef).
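
One possible implementation: true only when grep finds no true value in the list.

	sub _all_zero {
		my @array = @_;
		return (grep { $_ } @array) ? 0 : 1;
	}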

_load_module($name)

Load the module called $name.
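
A minimal sketch of loading a module at run time from its package name; the error handling shown is illustrative.

	sub _load_module {
		my $name = shift;
		eval "require $name; 1"
			or die "Could not load module $name: $@";
	}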

REVISION

	$Revision: 1.12 $