Project Number:

RE 4004 (RE)

Project Title:

DESIRE II - Development of a European Service for Information on Research and Education II

Deliverable Type:

PU

Deliverable Number:

D3.7

Contractual Date of Delivery:

Month 21

Actual Date of Delivery:

30/3/2000

Title of Deliverable:

Upgraded harvesting system including multiple retrieval protocols

Workpackage(s) contributing to the Deliverable:

WP3

Nature of the Deliverable:

TO

Author:

Sigfrid Lundberg

Contact Details:

email:siglun@munin.lub.lu.se

phone: +46 46 222 36 83

fax: +46 46 222 36 82

URL: http://www.lub.lu.se/~siglun/

Other Authors:

Fredrick Rybarczyk, Tomas Schönthal

URL

http://www.desire.org/html/research/deliverables/D3.7/

Abstract

The DESIRE Harvesting system, COMBINE, was developed during the first phase of the DESIRE project. During phase II of the project we have actively maintained its source code and improved the product in many ways:

  • The configuration system has been vastly improved.
  • A single robot installation can now be used to maintain multiple resource discovery databases.
  • The databases may be configured as Collections, and the robot can dynamically assign a given record to a collection based on content analysis (e.g., automatic classification, various kinds of string matching, metadata requirements) or on host domain.
  • The parsing/summarizing subsystem is now capable of handling PostScript and PDF documents. We are also able to parse XHTML (HTML defined using XML) documents, but not XML/SGML documents in general: summarizing software has to be produced for each document type definition.

The upgraded Combine software is available from: http://www.lub.lu.se/combine/

Forthcoming features that were not completed at time of delivery:

  • An improved XML-based record format built upon an internal DTD.
  • A parser extracting information from UNIX tar archives and possibly zip archives as well.
  • The latter parser should be used in conjunction with support for the FTP protocol using an HTTP proxy.

Some functional requirements cannot be met at present:-

  • Since the World Wide Web Consortium has not released RDF Schema as a recommendation, and since the Dublin Core Metadata Initiative (DCMI) has issued neither recommendations on how DC should be encoded in RDF nor a set of interoperability qualifiers, we are not yet prepared to support RDF metadata.

Keywords

Metadata, Harvesting robot, Collections, Resource discovery databases

Distribution List:

DESIRE Project Team; European Commission, DESIRE Public Web site

Issue:

V1.0

Reference:

deliverd37.rtf

Total Number of Pages:

4

Document Control

Issue Number    Issue Date    Reason for Change
v0.1            06/3/2000     Initial version for internal review
v0.2            08/3/2000     Revised in light of internal comments and sent to peer reviewers
v0.3            30/3/2000     Final version sent to Project Coordinator
v1.0            31/3/2000     Final version sent to Commission and posted on DESIRE site

Executive Summary

The DESIRE Harvesting system, COMBINE, was developed during the first phase of the DESIRE project.

The COMBINE harvester is a metadata-aware harvesting system: while harvesting, it stores embedded meta tags in the harvesting database, together with textual data extracted from other parts of text, HTML, PostScript or PDF documents.
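To illustrate what storing embedded meta tags alongside other extracted text can look like for an HTML document, here is a minimal sketch in Python. It is not COMBINE's actual code: the class name MetaTagCollector, the function summarize and the shape of the returned record are all hypothetical, and only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Hypothetical summarizer: collects <meta name=... content=...> pairs
    and the text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}         # lower-cased meta name -> list of content values
        self.text_parts = []   # text fragments in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            return
        if tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content:
                # A name may occur several times (e.g. repeated DC elements).
                self.meta.setdefault(name.lower(), []).append(content)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def summarize(html):
    """Return an illustrative record with metadata and extracted text."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "text": " ".join(parser.text_parts)}
```

Calling summarize on a page yields a dictionary whose "meta" entry holds the embedded meta tags and whose "text" entry holds the remaining text (including the title); a real harvester would write such a record to its database rather than return it.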

The system consists of a control unit, the "cabin", a harvesting database unit and harvesters. These three units communicate with each other using IP-domain sockets and may be distributed across several machines in a local area network.

The software is well suited for what we call regional WWW indexing, that is, for building resource discovery databases covering a geographical domain or a subject area. During phase II of the DESIRE project we have actively maintained its source code and improved the product in many ways.

The upgraded Combine software is available from: http://www.lub.lu.se/combine/

Forthcoming features that were not completed at time of delivery:

Some functional requirements cannot be met at present:

Scope Statement

The harvesting policies can be formulated flexibly using allow and exclude rules. This allows distributed data collection, in which a number of servers each have responsibility for one or more regions or domains in a broad sense. These areas of responsibility can be assigned based on actual network domains, organizations or geographical domains just as easily as on domains of human knowledge.

An important part of the architecture is an easy way to filter the sets of URLs to be indexed according to some subject or domain. Before an arbitrary set of URLs is loaded into the scheduler for processing, it is passed through an external policy filter. This filter, which is localised for each installation, determines which URLs are to be harvested given the policy adopted by the installation. It thus defines the region or domain a particular installation will cover.
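The policy filter described above can be sketched as follows. This is a minimal illustration in Python, not COMBINE's own filter or its configuration syntax: the class PolicyFilter, the use of regular expressions for the allow and exclude rules, the choice to let exclude rules take precedence, and the restriction to HTTP(S) URLs are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit


class PolicyFilter:
    """Hypothetical allow/exclude filter applied before scheduling URLs."""

    def __init__(self, allow, exclude):
        # Rules are regular expressions matched against "host + path".
        self.allow = [re.compile(p) for p in allow]
        self.exclude = [re.compile(p) for p in exclude]

    def accepts(self, url):
        parts = urlsplit(url)
        # Assumption: only HTTP(S) URLs with a host name are harvestable.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        target = parts.hostname.lower() + parts.path
        # Assumption: an exclude rule overrides any allow rule.
        if any(p.search(target) for p in self.exclude):
            return False
        return any(p.search(target) for p in self.allow)

    def filter(self, urls):
        """Keep only the URLs this installation is responsible for."""
        return [u for u in urls if self.accepts(u)]


# Example policy: cover hosts under .se, but never CGI scripts.
policy = PolicyFilter(allow=[r"^[^/]*\.se(/|$)"], exclude=[r"/cgi-bin/"])
candidates = [
    "http://www.lub.lu.se/combine/",
    "http://www.lub.lu.se/cgi-bin/search",
    "http://www.example.org/",
    "ftp://ftp.sunet.se/pub/",
]
to_schedule = policy.filter(candidates)
```

With this policy only the first candidate URL is passed on to the scheduler. An installation covering a subject area instead of a network domain would swap the allow rules for a content-based test, which this sketch does not attempt.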

The project aimed at building a system which:


Title:
Issue: V1.0
Date: 30/3/2000