DESIRE Toolkit Components

2.3 Combine

Description

Combine is designed as a harvesting system for building moderately large indexes; no attempt has been made to compete with the worldwide commercial search engines. Rather, Combine could be used to build an index covering a small country or all universities in a region. It can be used not only for expanding a collection but also simply for harvesting text and some other information from a list of resources (as in the European Link Treasury toolkit: http://www.lub.lu.se/combine/ELT/).

The harvesting policies can be formulated flexibly using rules for inclusion and exclusion. This allows distributed data collection, in which a number of servers each take responsibility for one or more regions or domains in a broad sense. These areas of responsibility can be assigned on the basis of actual network domains, organisations or geographical domains, as well as domains of human knowledge.

An important part of the architecture is an easy way of filtering the sets of URLs to be indexed according to subject or domain. Before an arbitrary set of URLs is loaded into the scheduler for processing, it is passed through an external policy filter. This filter, which is localised for each installation, determines which URLs are to be harvested, given the policy adopted by the installation. It thus defines the region or domain which a particular installation will cover.
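
The filtering step can be illustrated with a minimal sketch in Python. It is not Combine's own filter or configuration format: the function names, the rule lists and the patterns in them are all assumptions made for this example. It only shows the general idea of applying inclusion and exclusion rules to candidate URLs before they reach the scheduler.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules for one installation. The patterns are illustrative
# only and are not taken from Combine's configuration format.
INCLUDE_HOST_PATTERNS = [r"(^|\.)se$"]            # hosts under the .se domain
EXCLUDE_URL_PATTERNS = [r"/cgi-bin/", r"\.(gif|jpe?g|zip)$"]


def allowed(url):
    """Return True if the URL falls inside this installation's coverage."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Inclusion: the host must match at least one include pattern.
    if not any(re.search(p, host) for p in INCLUDE_HOST_PATTERNS):
        return False
    # Exclusion: the URL must not match any exclude pattern.
    return not any(
        re.search(p, url, re.IGNORECASE) for p in EXCLUDE_URL_PATTERNS
    )


def policy_filter(candidate_urls):
    """Keep only the URLs that should be handed to the scheduler."""
    return [u for u in candidate_urls if allowed(u)]
```

In this sketch, changing the two rule lists is all that is needed to make a different installation cover a different region or domain.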

Combine was originally developed during DESIRE I, and in phase II of DESIRE its source code has been maintained and its configuration system improved in the following ways:

  • a single robot installation can now be used to maintain multiple resource discovery databases
  • the databases may be configured as collections, and the robot can dynamically assign a given record to a collection based on content analysis (e.g. automatic classification, various kinds of string matching, metadata requirements), or host domain
  • the parsing/summarising subsystem is now capable of handling PostScript and PDF documents. It is also able to parse XHTML (HTML defined using XML) documents, but not XML/SGML documents in general
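
As a rough illustration of the second point, the following Python sketch assigns a record to a collection by host domain first and by simple string matching on its text otherwise. It is a sketch under stated assumptions, not Combine's implementation: the collection names, host domains, keywords and function names are all hypothetical, and automatic classification or metadata checks are left out.

```python
from urllib.parse import urlsplit

# Hypothetical collection definitions; names, hosts and keywords are
# invented for this example and do not come from Combine.
COLLECTIONS = [
    {"name": "engineering",
     "hosts": ("eng.example.org",),
     "keywords": ("engineering", "robotics")},
    {"name": "medicine",
     "hosts": ("med.example.org",),
     "keywords": ("clinical", "medicine")},
]
DEFAULT_COLLECTION = "general"


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def assign_collection(url, text):
    """Pick a collection for a harvested record."""
    host = (urlsplit(url).hostname or "").lower()
    lowered = text.lower()
    # First rule: host domain.
    for collection in COLLECTIONS:
        if any(host_matches(host, d) for d in collection["hosts"]):
            return collection["name"]
    # Second rule: plain substring matching on the record text.
    for collection in COLLECTIONS:
        if any(k in lowered for k in collection["keywords"]):
            return collection["name"]
    return DEFAULT_COLLECTION
```

Checking the host rule before the content rule is an arbitrary choice made for the sketch; an installation could equally give content analysis priority.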


    demonstrator:
    http://safari.hsv.se/index.html.en

    home page:
    http://www.lub.lu.se/combine/

    installation:
    http://www.lub.lu.se/combine/dist/combine-v1.1-src.tar.gz

    installation documentation:
    http://www.lub.lu.se/combine/docs/uguide.html

    technical documentation:
    http://www.lub.lu.se/combine/docs/tguide.html