NAME

Labrador::Crawler::ContentFilter

SYNOPSIS

	use Labrador::Crawler::ContentFilter;
	my $filter = Labrador::Crawler::ContentFilter->new(
		'Binary', $config, $dispatcher_client);
	my $privs = {'index' => 1, 'follow' => 1};
	$filter->filter($document, $privs);
	print "May not index\n" unless $privs->{'index'};
	print "May not follow\n" unless $privs->{'follow'};

DESCRIPTION

Abstract class. Child classes must implement filter().

Content filters examine content and determine two things: a) whether the content should not be indexed, and b) whether the links in the content should not be followed.
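
As an illustration of how a crawler might run several filters over one document, here is a minimal sketch. It assumes that $privs starts with both privileges granted and that filters only revoke them. The My::* packages and apply_filters() are hypothetical stand-ins, not part of Labrador.

```perl
use strict;
use warnings;

# Hypothetical stand-in filters: each offers filter($document, $privs)
# and may only revoke privileges, mirroring the interface described above.
package My::NoIndexFilter;
sub new    { return bless {}, shift }
sub filter { my ($self, $document, $privs) = @_; $privs->{index} = 0; return; }

package My::PassFilter;
sub new    { return bless {}, shift }
sub filter { return; }    # leaves $privs untouched

package main;

# Run every filter in order; stop early once nothing is left to revoke.
sub apply_filters {
    my ($filters, $document, $privs) = @_;
    for my $filter (@$filters) {
        $filter->filter($document, $privs);
        last unless $privs->{index} || $privs->{follow};
    }
    return $privs;
}

# Assumption: both privileges start out granted.
my $privs = { index => 1, follow => 1 };
apply_filters([ My::PassFilter->new, My::NoIndexFilter->new ],
    'some content', $privs);
print "May not index\n"  unless $privs->{index};
print "May not follow\n" unless $privs->{follow};
```

In this sketch only the index privilege is revoked, so only "May not index" is printed.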

KNOWN CHILDREN

Binary: Detects binary content by looking for the null character (\0) in the document.
ContentTypes: Only indexes or follows content types allowed in the configuration file.
MetaRobots: Examines any meta robots tag in HTML documents.
Fingerprint: Takes a fingerprint of the document and asks the dispatcher whether it has seen that fingerprint before.
WhitelistLanguages: Only indexes documents that contain stopwords of the desired languages.

METHODS

new($name, $config, $dispatcher_client)

Constructs a new content filter object.

name()

Returns the name of this filter - useful for debugging warnings.

filter($document, $privs)

Abstract - each child class must provide this method, which alters the filter settings of $privs ('follow', 'index') according to some heuristic on the content.
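
A child class might look roughly like the sketch below. My::ContentFilter is a stand-in base class so that the example runs on its own; the real base class may construct and load children differently. The 'content' key on $document is an assumption, since the document API is not described here.

```perl
use strict;
use warnings;

# Stand-in for Labrador::Crawler::ContentFilter (hypothetical, simplified).
package My::ContentFilter;
sub new {
    my ($class, $name, $config, $dispatcher_client) = @_;
    my $self = {
        name   => $name,
        config => $config,
        client => $dispatcher_client,
    };
    return bless $self, $class;
}
sub name   { return $_[0]->{name} }
sub filter { die ref($_[0]) . " must implement filter()\n" }

# Hypothetical child: revoke both privileges if the content has a NUL byte.
package My::ContentFilter::NullByte;
our @ISA = ('My::ContentFilter');
sub filter {
    my ($self, $document, $privs) = @_;
    # Assumption: $document is a hash ref with the raw body under 'content'.
    if (index($document->{content}, "\0") >= 0) {
        $privs->{index}  = 0;
        $privs->{follow} = 0;
    }
    return;
}

package main;

my $filter       = My::ContentFilter::NullByte->new('NullByte', {}, undef);
my $text_privs   = { index => 1, follow => 1 };
my $binary_privs = { index => 1, follow => 1 };
$filter->filter({ content => "<html>hello</html>" }, $text_privs);
$filter->filter({ content => "GIF89a\0\0" },         $binary_privs);
```

The filter leaves $privs untouched for ordinary text and clears both flags when a NUL byte is present; it never grants a privilege that an earlier filter has revoked.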

init()

An optional method that is called when the class is started, so that any child module can be initialised.
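
One way a child could use init() is to prepare data once rather than on every call to filter(). The sketch below assumes that the hook is invoked from the constructor; the source does not say exactly when it runs. The 'allowed_types' config key and the 'content_type' document key are hypothetical.

```perl
use strict;
use warnings;

# Stand-in base class (hypothetical): calls init() once after construction.
package My::ContentFilter;
sub new {
    my ($class, $name, $config) = @_;
    my $self = bless { name => $name, config => $config }, $class;
    $self->init();    # assumption: the hook runs at construction time
    return $self;
}
sub init { return }   # default: nothing to set up

# Hypothetical child that builds a lookup table in init().
package My::ContentFilter::Types;
our @ISA = ('My::ContentFilter');
sub init {
    my ($self) = @_;
    my @types   = @{ $self->{config}{allowed_types} || [] };
    my %allowed = map { lc($_) => 1 } @types;
    $self->{allowed} = \%allowed;
    return;
}
sub filter {
    my ($self, $document, $privs) = @_;
    # Assumption: $document is a hash ref with a 'content_type' key.
    my $type = lc($document->{content_type});
    unless ($self->{allowed}{$type}) {
        $privs->{index}  = 0;
        $privs->{follow} = 0;
    }
    return;
}

package main;

my $config = { allowed_types => ['text/html'] };
my $filter = My::ContentFilter::Types->new('Types', $config);
my $html_privs = { index => 1, follow => 1 };
my $png_privs  = { index => 1, follow => 1 };
$filter->filter({ content_type => 'TEXT/HTML' }, $html_privs);
$filter->filter({ content_type => 'image/png' }, $png_privs);
```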

_load_module($name)

Load the module named $name.
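
Loading a module whose name is only known at run time is commonly done by converting the package name to a file path and passing it to require. The following is a sketch of that general technique, not necessarily what _load_module does internally; load_module() is a hypothetical helper, and List::Util is used only because it ships with Perl.

```perl
use strict;
use warnings;

# Load a module by name at run time.
# Returns true on success; dies with the loader's error otherwise.
sub load_module {
    my ($name) = @_;
    die "Invalid module name: $name\n"
        unless $name =~ /\A\w+(?:::\w+)*\z/;
    (my $file = "$name.pm") =~ s{::}{/}g;    # Foo::Bar becomes Foo/Bar.pm
    require $file;
    return 1;
}

load_module('List::Util');                   # core module, for illustration
my $total = List::Util::sum(1, 2, 3);

# A missing module makes require die, so callers can trap it with eval.
my $ok = eval { load_module('No::Such::Module::Here'); 1 };
warn "Could not load module: $@" unless $ok;
```

Validating the name before building the path keeps arbitrary strings from being passed to require.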

REVISION

	$Revision: 1.4 $