NAME

Labrador::Crawler::Dispatcher_Client

SYNOPSIS

	use Labrador::Crawler::Dispatcher_Client;

	my $client = Labrador::Crawler::Dispatcher_Client->new('localhost', 2460);
	$client->connect;
	my $forkcount = $client->WORK;
	$client->disconnect;

DESCRIPTION

This object provides an easy way for each subcrawler to talk to the dispatcher. It encapsulates all possible VERBs (questions).

For more information about how this class talks to the dispatcher, please refer to docs/protocol.txt.

METHODS

Methods are documented below. Most methods are actually VERB invocations and are described under VERBS.

new($dispatcher_hostname, $dispatcher_port)

Instantiates the object but does NOT open a connection to the dispatcher; call connect() separately to do that. The object has two states: connection open and connection closed.

connect

Opens a (TCP) connection to the dispatcher. Returns 1 if the connection was established and the protocol handshake succeeded, 0 otherwise.

disconnect

Sends QUIT and closes the (TCP) connection with the dispatcher.

VERBS

This object supports the following verbs, which may be called as described. For more protocol information, refer to docs/protocol.txt.

WORK

Queries the dispatcher to find out how many crawler clients this host should fork.

CONF

Obtains the configuration file from the dispatcher. Returns an array of the lines of the configuration text file, or an empty array on failure.

NEXT($n)

Obtain the next $n URLs to process from the dispatcher.

FINISHED($url, @links)

Ask the dispatcher to mark $url as finished, and add @links to the master queue.

ALLOWED($url)

Returns the result of applying the dispatcher's URL filters to $url. The result for each $url is cached locally. NB: this cache could grow extremely large over prolonged usage.
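The local caching described for ALLOWED can be pictured with a plain hash keyed by URL. The following is only a minimal sketch, not the module's code: ask_dispatcher() is a hypothetical stand-in for the real round trip, and its filter rule is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the round trip to the dispatcher; the real
# class would send the ALLOWED verb over its socket instead.
my $round_trips = 0;
sub ask_dispatcher {
    my ($url) = @_;
    $round_trips++;
    return $url =~ m{^http://} ? 1 : 0;   # illustrative filter rule only
}

my %allowed_cache;   # url => cached verdict; never pruned in this sketch

sub allowed {
    my ($url) = @_;
    # Only contact the dispatcher the first time a URL is seen.
    $allowed_cache{$url} = ask_dispatcher($url)
        unless exists $allowed_cache{$url};
    return $allowed_cache{$url};
}

allowed('http://example.com/') for 1 .. 3;
print "round trips: $round_trips\n";   # prints "round trips: 1"
```

Because nothing is ever removed from the hash, memory use grows with the number of distinct URLs checked, which is the growth the note above warns about.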

ROBOTS($hostnameport)

Asks the dispatcher for the robots.txt file for the given hostname and port (joined by ':'). Returns ('#') for an empty file, or blank if the dispatcher has not cached the file.
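A caller has to tell the three possible outcomes of ROBOTS apart. The sketch below shows one way to do that; robots_state() and its state names are hypothetical, and it assumes "blank" means either an empty list or a single empty string, which the text above does not pin down.

```perl
use strict;
use warnings;

# Classify the list returned by ROBOTS, following the convention described
# above. The function and the state names are labels made up for this sketch.
sub robots_state {
    my (@lines) = @_;
    # Assumption: "blank" is an empty list or one empty/undefined element.
    return 'not_cached'
        if !@lines
        || (@lines == 1 && (!defined($lines[0]) || $lines[0] eq ''));
    return 'empty_file' if @lines == 1 && $lines[0] eq '#';
    return 'has_rules';
}

print robots_state(), "\n";                                    # not_cached
print robots_state('#'), "\n";                                 # empty_file
print robots_state('User-agent: *', 'Disallow: /tmp'), "\n";   # has_rules
```

In the not_cached case a crawler would presumably fetch robots.txt itself and hand it to the dispatcher with ROBOTSFILE.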

ROBOTSFILE($hostnameport, @file)

Submits a robots.txt (@file) for the server running on the specified $hostnameport to the dispatcher, so that other crawlers can access it.

STATS(%stats)

Submit the stats of this subcrawler to the dispatcher where they can be aggregated.

FAILED($url, $reason)

Inform the dispatcher that the retrieval of $url failed because of $reason.

NOOP

Just checks a reply can be obtained from the dispatcher. Useful for checking connectivity with a client.

MONITOR

Obtain the stats hash from the dispatcher. Mainly used for monitoring the progress of a crawl.

FINGERPRINT($md5, $url)

Checks the fingerprint with the dispatcher. Returns the URL at which the fingerprint was already seen; otherwise the dispatcher notes the fingerprint for future reference and 0 is returned.
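The return convention of FINGERPRINT can be illustrated without a dispatcher. This is a sketch under assumptions: the %seen hash stands in for the dispatcher's record, and hashing the page body with md5_hex from Digest::MD5 is only one plausible way to produce $md5 (the expected digest format is not stated here).

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;   # fingerprint => first URL; local stand-in for the dispatcher

# Mimics the documented convention: the earlier URL if the fingerprint
# is known, otherwise 0 after recording it.
sub fingerprint {
    my ($md5, $url) = @_;
    return $seen{$md5} if exists $seen{$md5};
    $seen{$md5} = $url;
    return 0;
}

my $body   = "<html>same page</html>";
my $md5    = md5_hex($body);   # assumed digest format: 32 hex characters
my $first  = fingerprint($md5, 'http://example.com/a');
my $second = fingerprint($md5, 'http://example.com/b');
print "first: $first, second: $second\n";
# prints "first: 0, second: http://example.com/a"
```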

Private methods

These are documented for completeness, but should only be used internally.

_command($command, $arg1, @args)

Sends a command with verb $command to the dispatcher. $arg1 is appended to the verb (following a space). @args are appended as separate lines to the request. Returns ($status_code, @all_returned_lines).
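The request layout described for _command can be sketched as a function that only assembles the lines. request_lines() is a hypothetical helper, not part of the class; mapping FINISHED's $url to $arg1, and omitting the trailing space when there is no $arg1, are assumptions. Line terminators and any end-of-request marker are defined in docs/protocol.txt and are not shown.

```perl
use strict;
use warnings;

# Build the lines of one request: the verb and first argument share the
# first line, every further argument goes on a line of its own.
sub request_lines {
    my ($command, $arg1, @args) = @_;
    my $first = (defined($arg1) && length($arg1))
        ? "$command $arg1"
        : $command;          # assumption: no trailing space for bare verbs
    return ($first, @args);
}

my @request = request_lines(
    'FINISHED', 'http://example.com/',
    'http://example.com/a', 'http://example.com/b',
);
print "$_\n" for @request;
# prints:
# FINISHED http://example.com/
# http://example.com/a
# http://example.com/b
```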

_sendlines(@lines)

Code copied from Net::Cmd. Uses syswrite to write to the socket.

_getline()

Reads lines using sysread. Handles partial reads! Code shamelessly stolen from Net::Cmd, which is the basis for Net::FTP among other modules.
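Handling partial reads means buffering whatever sysread returns until a full line is available. The sketch below shows the general idea only; it is not the code from Net::Cmd or from this class. read_line() is a hypothetical name, the CRLF line ending and the reply texts are made up, and the 4-byte read size is deliberately tiny so that every line needs several reads.

```perl
use strict;
use warnings;

my $pending = '';   # bytes read from the handle but not yet returned

sub read_line {
    my ($fh) = @_;
    # Keep reading until the buffer holds at least one complete line.
    while ($pending !~ /\n/) {
        my $got = sysread($fh, $pending, 4, length $pending);
        return unless $got;   # read error (undef) or end of file (0)
    }
    $pending =~ s/\A([^\n]*)\n//;          # cut the first line off the buffer
    (my $line = $1) =~ s/\r\z//;           # drop the CR of a CRLF ending
    return $line;
}

# A pipe stands in for the dispatcher socket.
pipe(my $reader, my $writer) or die "pipe failed: $!";
syswrite($writer, "200 O");          # first part of a line
syswrite($writer, "K\r\n100 ");      # the rest, plus the start of the next
syswrite($writer, "more\r\n");
close $writer;

my $first  = read_line($reader);
my $second = read_line($reader);
print "$first\n$second\n";
# prints:
# 200 OK
# 100 more
```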

name

Returns the name of this crawler. Currently set to hostname:processid.
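A name of the form hostname:processid can be built from core Perl alone. This is a minimal sketch assuming Sys::Hostname is an acceptable source for the host name; crawler_name() is a hypothetical function, and the class may obtain the host name differently.

```perl
use strict;
use warnings;
use Sys::Hostname qw(hostname);

# Join the local host name and the current process id with a colon.
sub crawler_name {
    return hostname() . ':' . $$;
}

print crawler_name(), "\n";
```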

last_result_code()

Returns the status code from the last command executed. Returns 0 if no command has yet been executed.

REVISION

	$Revision: 1.21 $