NAME

Labrador::Crawler::ContentFilter

SYNOPSIS

	use Labrador::Crawler::ContentFilter;
	my $filter = Labrador::Crawler::ContentFilter->new(
		'Binary', $config, $dispatcher_client);
	my $privs = {'index' => 0, 'follow' => 0};
	$filter->filter($document, $privs);
	print "May not index\n" unless $privs->{'index'};
	print "May not follow\n" unless $privs->{'follow'};

DESCRIPTION

Abstract class. Child classes must implement the filter() method.

Content filters examine a document's content and decide two things: a) whether the content should not be indexed, and b) whether the links in the content should not be followed.

KNOWN CHILDREN

Binary
Detects binary content by looking for the null character (\0) in the document.

ContentTypes
Only indexes or follows content types allowed in the configuration file.

MetaRobots
Examines any meta robots tag in HTML documents.

Fingerprint
Takes a fingerprint of the document and asks the dispatcher whether it has seen that fingerprint before.

WhitelistLanguages
Only indexes documents that contain stopwords of our desired languages.
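
The null-character check described for the Binary filter can be sketched as below. This is only an illustration, not the module's actual code: the helper name looks_binary is hypothetical, the content is assumed to be available as a plain string, and clearing both privileges for binary content is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the content contains at least one NUL byte.
sub looks_binary {
    my ($content) = @_;
    return index($content, "\0") >= 0 ? 1 : 0;
}

# Assumption: privileges start enabled and a filter only clears them.
my $privs   = { index => 1, follow => 1 };
my $content = "GIF89a\0\0\x01";    # sample bytes containing NUL characters

if (looks_binary($content)) {
    $privs->{index}  = 0;    # binary data is not worth indexing
    $privs->{follow} = 0;    # and holds no links to follow
}

print "May not index\n"  unless $privs->{index};
print "May not follow\n" unless $privs->{follow};
```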

METHODS

new($name, $config, $dispatcher_client)

Constructs a new content filter object.

name()

Returns the name of this filter, which is useful in debugging output and warnings.

filter($document, $privs)

Abstract. Each child class must provide this method, which alters the settings in $privs ('follow', 'index') according to some heuristic on the content.
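
A child class overriding filter() might look roughly like the sketch below. To keep it runnable on its own, it uses a stand-in base class (My::ContentFilter) instead of the real module. The TooShort filter, the 20-character limit, and the representation of $document as a hash ref with a 'content' key are all hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the abstract base class, so the sketch runs on its own.
package My::ContentFilter;

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

sub name { return $_[0]->{name} }

# Abstract: child classes must override this.
sub filter { die ref($_[0]) . " must implement filter()\n" }

# Hypothetical child: refuses to index documents with very little text.
package My::ContentFilter::TooShort;
our @ISA = ('My::ContentFilter');

sub filter {
    my ($self, $document, $privs) = @_;
    # Assumption: the document is a hash ref with its body under 'content'.
    if (length($document->{content}) < 20) {
        $privs->{index} = 0;    # too little text to be worth indexing
    }
    return;
}

package main;

my $filter = My::ContentFilter::TooShort->new('TooShort');
my $privs  = { index => 1, follow => 1 };
$filter->filter({ content => 'tiny page' }, $privs);
print $filter->name(), ": may not index\n" unless $privs->{index};
```

Here the filter only clears a privilege and never grants one, so several filters can be applied in sequence to the same $privs hash without undoing each other's decisions. That convention is a design assumption of this sketch.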

init()

An optional method that is called when the class is started, so that any child module can be initialised.

_load_module($name)

Loads the module named $name.
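
Loading a module by name at run time is commonly done with require. The sketch below shows one way to do it and is not necessarily how _load_module works internally. The helper name load_module is hypothetical, and File::Spec merely stands in for a filter module because it ships with Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: load a module by its package name at run time.
sub load_module {
    my ($module) = @_;
    # Accept only plain package names, so nothing unexpected reaches require.
    die "Bad module name '$module'\n"
        unless $module =~ /\A\w+(?:::\w+)*\z/;
    # Turn Foo::Bar into Foo/Bar.pm, the form require expects for a file.
    (my $file = "$module.pm") =~ s{::}{/}g;
    require $file;    # dies if the module cannot be found or compiled
    return $module;
}

my $class = load_module('File::Spec');
print $class->catfile('a', 'b'), "\n";
```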

REVISION

	$Revision: 1.4 $