DESIRE Toolkit Components
2.3 Combine
Description
The aim of the design and architecture of Combine is to provide a
harvesting system which can be used for building moderately large
indexes; however, no attempt has been made to compete with the
worldwide commercial search engines. Rather, Combine could be used for
building an index covering a small country or all universities in a
region. Combine can be used not only for expanding a collection, but
also simply for harvesting text and some other information from a list
of resources (as in the European Link Treasury toolkit:
http://www.lub.lu.se/combine/ELT/).
The harvesting policies can be formulated flexibly using rules for
inclusion and exclusion, allowing distributed data collection; this
implies that a number of servers will each have the responsibility for
one or more regions or domains in a broad sense. These areas of
responsibility can be assigned on the basis of actual network domains,
organisations or geographical domains, as well as domains of human
knowledge.
An important part of the architecture is an easy way of filtering the
sets of URLs to be indexed according to subject or domain. Before an
arbitrary set of URLs is loaded into the scheduler for processing, it
is filtered through an external policy filter. This filter, which is
localised for each installation, determines which URLs are to be
harvested, given the policy adopted by the installation. It thus
defines the region or domain which a particular installation will
cover.
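As an illustration of the filtering step described above, the following is a
minimal sketch in Python of a policy filter built from inclusion and
exclusion rules. It is not Combine's own code or configuration format: the
names INCLUDE_HOST_SUFFIXES, EXCLUDE_PATTERNS, allowed and policy_filter, as
well as the example domains and patterns, are assumptions made purely for
illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical policy for one installation. These rules are illustrative
# assumptions, not Combine's actual configuration format.
INCLUDE_HOST_SUFFIXES = (".lu.se", ".dtu.dk")
EXCLUDE_PATTERNS = [re.compile(p) for p in (r"/cgi-bin/", r"\.(gif|jpe?g|zip)$")]


def allowed(url):
    """Return True if the URL falls inside this installation's policy."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Inclusion: the host must be a listed domain or one of its subdomains.
    if not any(host == s.lstrip(".") or host.endswith(s)
               for s in INCLUDE_HOST_SUFFIXES):
        return False
    # Exclusion: any pattern matching the path rejects the URL.
    return not any(p.search(parts.path) for p in EXCLUDE_PATTERNS)


def policy_filter(urls):
    """Keep only the URLs the policy admits, preserving input order."""
    return [u for u in urls if allowed(u)]


candidates = [
    "http://www.lub.lu.se/combine/ELT/",
    "http://www.lub.lu.se/cgi-bin/search",
    "http://www.example.com/index.html",
    "ftp://ftp.dtu.dk/pub/file.txt",
    "http://www.dtu.dk/logo.gif",
]
for_scheduler = policy_filter(candidates)
```

In this sketch only the first candidate would be passed on to the scheduler;
the others are rejected by the exclusion patterns, the host rule or the
scheme check. Keeping the rules in data rather than in code reflects the
idea that the filter is localised for each installation.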
Combine was originally developed during DESIRE I, and in phase II of
DESIRE its source code has been maintained and its configuration system
improved in the following ways:
- a single robot installation can now be used to maintain multiple
  resource discovery databases;
- the databases may be configured as collections, and the robot can
  dynamically assign a given record to a collection based on content
  analysis (e.g. automatic classification, various kinds of string
  matching, metadata requirements) or host domain;
- the parsing/summarising subsystem is now capable of handling
  PostScript and PDF documents. It can also parse XHTML (HTML
  defined using XML) documents, but not XML/SGML documents in general.
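
The dynamic assignment of records to collections mentioned in the list above
could look roughly like the following Python sketch. It covers only two of
the criteria named there, simple string matching and host domain. The
collection names, keywords, the COLLECTIONS table and the function
assign_collection are all hypothetical and do not reflect Combine's actual
configuration or code.

```python
from urllib.parse import urlsplit

# Hypothetical collection rules. Names, suffixes and keywords are
# illustrative assumptions, not Combine's configuration.
COLLECTIONS = [
    # (collection name, host suffixes, keywords to look for in the text)
    ("engineering", (), ("turbine", "circuit")),
    ("lund-university", (".lu.se",), ()),
]
DEFAULT_COLLECTION = "general"


def assign_collection(url, text):
    """Return the first collection whose keyword or host rule matches."""
    host = (urlsplit(url).hostname or "").lower()
    lowered = text.lower()
    for name, suffixes, keywords in COLLECTIONS:
        # Content rule: case-insensitive substring match on the record text.
        if any(k in lowered for k in keywords):
            return name
        # Host rule: the record's host ends with a listed domain suffix.
        if any(host.endswith(s) for s in suffixes):
            return name
    return DEFAULT_COLLECTION
```

Rules are tried in order and the first match wins, so in this sketch a
record from a .lu.se host whose text mentions "circuit" would be placed in
the "engineering" collection rather than "lund-university".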