NAME

Labrador::Crawler::Document

SYNOPSIS

	my $doc = Labrador::Crawler::Document->new($HTTPresponse);
	my @links = $doc->links;
	my $content_type = $doc->content_type;
	print ${ $doc->content };	# content returns a reference

DESCRIPTION

This module is a wrapper around HTTP::Response. Generic methods can be called on a Document object, and the appropriate child class supplies the data required. Child classes will be created for common types of document, including HTML, PDF and PostScript.

METHODS

new($response)

Constructs a new Document object from the HTTP::Response object $response. If a specific child implementation exists for the given type of document (e.g. PDF or PostScript), it will be used.
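
The choice of child class described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the map contents, the Example::Document::* class names and the handler_class_for function are hypothetical and are not taken from this module.

```perl
use strict;
use warnings;

# Hypothetical map from MIME type to handler class. The real module's
# map and class names may differ.
my %CONTENT_TYPE_MAP = (
    'text/html'              => 'Example::Document::HTML',
    'application/pdf'        => 'Example::Document::PDF',
    'application/postscript' => 'Example::Document::PS',
);

# Pick a handler class for a Content-Type header value, falling back
# to a generic class when no specific child exists.
sub handler_class_for {
    my ($content_type) = @_;
    $content_type = '' unless defined $content_type;

    # Drop parameters such as "; charset=utf-8" and normalise case.
    my ($type) = split /\s*;\s*/, lc $content_type;
    $type = '' unless defined $type;
    $type =~ s/^\s+|\s+$//g;

    return $CONTENT_TYPE_MAP{$type} || 'Example::Document';
}

print handler_class_for('text/html; charset=UTF-8'), "\n";
print handler_class_for('image/png'), "\n";
```

Falling back to a generic class means that unknown document types can still be handled through the common methods.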

init

Initialises the handler. This should be called from child handlers to ensure that any common functionality is initialised.

content_type

Returns the original content type of the response.

header($name, [$value])

Passes through to the header method of HTTP::Headers. With only $name it returns the value of that header; with $value it sets the header.

response

Returns the original HTTP::Response object returned by the request. NB: the content() method of HTTP::Response should not be used, as the data may have been compressed.

url

Returns the URI object used during this request.

links

Returns an array of links found in this document.

content
contents

Returns the downloaded content of this Document. Note that this content may have been transformed or converted by the document handler. NB: To limit stack size, this returns a reference to the data, not the data itself.
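
Because a reference is returned, callers need to dereference it before using the data. A minimal sketch, using a stand-in class (Example::Doc is hypothetical and exists only so the example runs on its own):

```perl
use strict;
use warnings;

# Stand-in for a Document object, for illustration only: content
# returns a reference to the body rather than a copy of it.
package Example::Doc;
sub new     { my ($class, $body) = @_; return bless { body => $body }, $class }
sub content { my ($self) = @_; return \$self->{body} }

package main;

my $doc = Example::Doc->new("<html><body>hello</body></html>");
my $content_ref = $doc->content;

# Dereference with $$ to read the data without copying it first.
my $length = length $$content_ref;
print "Fetched $length bytes\n";
print "Looks like HTML\n" if $$content_ref =~ /<html\b/i;
```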

fingerprint

Returns a base64-encoded MD5 digest of the content of this document. This can be used to ignore identical content found on different hosts.
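
One way a crawler might use such a fingerprint to skip duplicate content is sketched below. It assumes the digest is computed with md5_base64 from the core Digest::MD5 module, which may or may not be what this module does internally. The %seen hash and the is_duplicate function are hypothetical names used for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

my %seen;    # fingerprint => first URL that produced it

# Returns 1 if this content has already been seen, 0 otherwise.
# Takes a reference to the content, as returned by content.
sub is_duplicate {
    my ($url, $content_ref) = @_;
    my $fingerprint = md5_base64($$content_ref);
    return 1 if exists $seen{$fingerprint};
    $seen{$fingerprint} = $url;
    return 0;
}

my $page = "<html>same page</html>";
print is_duplicate('http://a.example/', \$page) ? "duplicate\n" : "new\n";
print is_duplicate('http://b.example/', \$page) ? "duplicate\n" : "new\n";
```

Note that md5_base64 returns a 22-character string without trailing "=" padding, so the fingerprints are compact enough to use directly as hash keys.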

PRIVATE METHODS

_load_module($name)

Load the module named $name.

TODO

Look at file extensions to match file types
Peruse modules in BEGIN{} to generate CONTENT_TYPE_MAP etc
More Child classes, eg for RTF, RSS/RDF?, MS Word, Powerpoint
Generic link abstraction using URI::Find, if available

REVISION

	$Revision: 1.5 $