NAME

Labrador::Crawler::Document

SYNOPSIS

	my $doc = Labrador::Crawler::Document->new($HTTPresponse);
	my @links = $doc->links;
	my $content_type = $doc->content_type;
	print ${ $doc->content };	# content returns a reference to the data

DESCRIPTION

This module wraps HTTP::Response so that generic methods can be called on a Document object while the appropriate child class supplies the required data. Child classes will be created for common document types, including HTML, PDF and PostScript.
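
A minimal sketch of how the child-class selection described above could work, assuming a lookup keyed on MIME type. The child class names, the contents of %CONTENT_TYPE_MAP and the helper handler_class_for are illustrative assumptions, not the module's actual code; only the name CONTENT_TYPE_MAP appears in this document (under TODO).

```perl
use strict;
use warnings;

# Hypothetical mapping from MIME type to child class name (an assumption).
my %CONTENT_TYPE_MAP = (
    'text/html'              => 'Labrador::Crawler::Document::HTML',
    'application/pdf'        => 'Labrador::Crawler::Document::PDF',
    'application/postscript' => 'Labrador::Crawler::Document::PS',
);

# Pick a handler class for a Content-Type header value,
# falling back to the generic base class when no child matches.
sub handler_class_for {
    my ($content_type) = @_;
    # Drop parameters such as "; charset=utf-8" and normalise case.
    my ($mime) = split /\s*;\s*/, lc($content_type // ''), 2;
    $mime = '' unless defined $mime;
    return $CONTENT_TYPE_MAP{$mime} // 'Labrador::Crawler::Document';
}

print handler_class_for('text/html; charset=UTF-8'), "\n";
print handler_class_for('image/png'), "\n";
```

Falling back to the base class means an unrecognised content type still yields a usable Document object, only without type-specific handling.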

METHODS

new($response): Constructs a new Document object from the HTTP::Response object $response. If a specific child implementation exists for the given type of document (e.g. PDF or PostScript), it will be used.
init: Initialises the handler. This should be called from child handlers to ensure that any common functionality is initialised.
content_type: Returns the original content type of the response.
header($name, [$value]): Passes through to HTTP::Headers->header.
response: Returns the original HTTP::Response object returned by the request. NB: the content() method of HTTP::Response should not be used, as the data may have been compressed.
url: Returns the URI object used during this request.
links: Returns an array of links found in this document.
content
contents: Returns the downloaded content for this Document. Note that this content may have been transformed or converted by the document handler. NB: To limit stack size, this returns a reference to the data.
fingerprint: Returns an MD5 digest (Base64-encoded) of this document's content. This can be used to ignore identical content served by different hosts.
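
The following sketch shows one way a Base64 MD5 fingerprint could be computed and used to skip duplicate content. It relies on the core Digest::MD5 module; the helpers fingerprint_of and is_duplicate and the %seen hash are hypothetical names for illustration and are not part of this module. It assumes the content is a byte string held behind a scalar reference, as the content method is described as returning.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Hypothetical helper: fingerprint the data behind a scalar reference.
# Assumes the data is bytes; md5_base64 croaks on wide characters.
sub fingerprint_of {
    my ($content_ref) = @_;
    return md5_base64($$content_ref);
}

# Fingerprints already encountered during this run.
my %seen;

# Returns 1 if content with the same fingerprint was seen before, else 0.
sub is_duplicate {
    my ($content_ref) = @_;
    return $seen{ fingerprint_of($content_ref) }++ ? 1 : 0;
}

my $page_a = "<html>same body</html>";
my $page_b = "<html>same body</html>";
for my $ref (\$page_a, \$page_b) {
    print is_duplicate($ref) ? "duplicate\n" : "new\n";
}
```

Passing references rather than copies keeps large documents from being duplicated on each call, which matches the stated reason for content returning a reference.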

PRIVATE METHODS

_load_module($name)

Load the module named $name.
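
As a rough illustration of loading a module by name at run time, the sketch below uses Perl's built-in require with a computed file name. The function load_module is a hypothetical stand-in and may differ from how _load_module is actually implemented; the name check is an added precaution, not something this document specifies.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _load_module: load a module by name at run time.
# Returns 1 on success and 0 if the name is invalid or loading fails.
sub load_module {
    my ($name) = @_;
    # Accept only plain package names before turning the name into a path.
    return 0 unless defined $name && $name =~ /\A\w+(?:::\w+)*\z/;
    # Convert Foo::Bar into Foo/Bar.pm, the form require expects for a string.
    (my $file = "$name.pm") =~ s{::}{/}g;
    # eval traps the exception that require throws when loading fails.
    return eval { require $file; 1 } ? 1 : 0;
}

print load_module('Digest::MD5')            ? "loaded\n" : "missing\n";
print load_module('No::Such::Module::Here') ? "loaded\n" : "missing\n";
```

Returning a status instead of dying lets the caller fall back to generic handling when an optional child class is not installed.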

TODO

Look at file extensions to match file types
Peruse modules in BEGIN{} to generate CONTENT_TYPE_MAP etc
More Child classes, eg for RTF, RSS/RDF?, MS Word, Powerpoint
Generic link abstraction using URI::Find, if available

REVISION

	$Revision: 1.5 $