Project Number: | RE 4004 (RE) |
Project Title: | DESIRE II - Development of a European Service for Information on Research and Education II |
Deliverable Type: | PU |
Deliverable Number: | D3.7 |
Contractual Date of Delivery: | Month 21 |
Actual Date of Delivery: | 30/3/2000 |
Title of Deliverable: | Upgraded harvesting system including multiple retrieval protocols |
Workpackage(s) contributing to the Deliverable: | WP3 |
Nature of the Deliverable: | TO |
Author: | Sigfrid Lundberg |
Contact Details: | email:siglun@munin.lub.lu.se phone: +46 46 222 36 83 fax: +46 46 222 36 82 |
Other Authors: | Fredrick Rybarczyk, Tomas Schönthal |
URL | http://www.desire.org/html/research/deliverables/D3.7/ |
Abstract |
The DESIRE Harvesting system, COMBINE, was developed during the first phase of the DESIRE project. During phase II of the project we have actively maintained its source code and improved the product in many ways:
- The configuration system has been vastly improved.
Summarizing software has to be produced for each document type definition.
The upgraded Combine software is available from: http://www.lub.lu.se/combine/
Forthcoming features that were not completed at time of delivery:
Some functional requirements cannot be met at present:-
|
Keywords | Metadata, Harvesting robot, Collections, Resource discovery databases |
Distribution List: | DESIRE Project Team; European Commission, DESIRE Public Web site |
Issue: | V1.0 |
Reference: | deliverd37.rtf |
Total Number of Pages: | 4 |
Issue Number | Issue Date | Reason for Change |
v0.1 | 06/3/2000 | Initial version for internal review |
v0.2 | 08/3/2000 | Revised in light of internal comments and sent to peer reviewers |
v0.3 | 30/3/2000 | Final version sent to Project Coordinator |
v1.0 | 31/3/2000 | Final version sent to Commission and posted on DESIRE site |
The DESIRE Harvesting system, COMBINE, was developed during the first phase of the DESIRE project.
The COMBINE harvester is a metadata-aware harvesting system: while harvesting, it stores embedded meta tags in the harvesting database, together with textual data extracted from other parts of text, HTML, PostScript or PDF documents.
The system consists of a control unit (the "cabin"), a harvesting database unit, and harvesters. These three kinds of unit communicate with each other over Internet-domain (IP) sockets and may be distributed across several machines in a local area network.
The software is well suited for what we call regional WWW indexing, that is, for building resource discovery databases covering a geographical domain or a subject area. During phase II of the DESIRE project we have actively maintained its source code and improved the product in many ways.
The upgraded Combine software is available from: http://www.lub.lu.se/combine/
Forthcoming features that were not completed at time of delivery:
Some functional requirements cannot be met at present:
Harvesting policies are formulated flexibly using allow and exclude rules. This supports distributed data collection, in which a number of servers each take responsibility for one or more regions or domains in a broad sense. These areas of responsibility can be assigned by actual network domain, organization or geographical area just as easily as by domain of human knowledge.
An important part of the architecture is an easy way to filter the sets of URLs to be indexed according to some subject or domain. Before any set of candidate URLs is loaded into the scheduler for processing, it is passed through an external policy filter. This filter, which is localised for each installation, determines which URLs are to be harvested under the policy adopted by that installation. It thus defines the region or domain that a particular installation will cover.
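The idea of a policy filter built from allow and exclude rules can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the Combine source code. The rule patterns, the names `ALLOW_RULES`, `EXCLUDE_RULES`, `in_policy` and `filter_urls`, and the choice to match allow rules against the host and exclude rules against the path are all assumptions made for this illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical policy for one installation (illustrative patterns only).
# Allow rules are matched against the host name, exclude rules against the path.
ALLOW_RULES = [r"(^|\.)lu\.se$", r"(^|\.)example\.org$"]
EXCLUDE_RULES = [r"/cgi-bin/", r"\.(gif|jpe?g|zip)$"]


def in_policy(url, allow=ALLOW_RULES, exclude=EXCLUDE_RULES):
    """Return True if the URL falls inside the installation's policy."""
    parts = urlsplit(url)
    # Only HTTP(S) resources are considered in this sketch.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # The host must match at least one allow rule ...
    if not any(re.search(pattern, host) for pattern in allow):
        return False
    # ... and the path must match no exclude rule.
    path = parts.path or "/"
    return not any(re.search(pattern, path, re.IGNORECASE) for pattern in exclude)


def filter_urls(urls, allow=ALLOW_RULES, exclude=EXCLUDE_RULES):
    """Keep only the URLs that should be handed on to the scheduler."""
    return [url for url in urls if in_policy(url, allow, exclude)]


candidates = [
    "http://www.lub.lu.se/combine/",
    "http://www.lub.lu.se/cgi-bin/search",
    "http://www.example.com/",
    "ftp://ftp.lu.se/pub/",
    "http://www.lu.se/logo.GIF",
]
accepted = filter_urls(candidates)
```

Because the rules are plain data, each installation in a distributed setup could carry its own rule lists and thereby cover a different region or domain, which is the property the text above relies on.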
The project aimed at building a system which: