Research: Deliverables: D3.7 Upgraded harvesting system including multiple retrieval protocols

The report is available in the following formats:

Comprehensive peer reviews of this deliverable are also available:

  • Review by Eliot Christian, U.S. Geological Survey (HTML), (RTF)
  • Review by David Beckett, Computing Laboratory, University of Kent at Canterbury, UK (HTML), (RTF)

Abstract

The DESIRE harvesting system, COMBINE, was developed during the first phase of the DESIRE project. During phase II of the project we have actively maintained its source code and improved the product in many ways:

  • The configuration system has been vastly improved.
  • A single robot installation can now be used to maintain multiple resource discovery databases.
  • The databases may be configured as Collections, and the robot can dynamically assign a given record to a collection based on content analysis (e.g., automatic classification, various kinds of string matching, metadata requirements) or on host domain.
  • The parsing/summarizing subsystem is now capable of handling PostScript and PDF documents. We are also able to parse XHTML (HTML defined using XML) documents, but not XML/SGML documents in general, because summarizing software has to be produced for each document type definition.
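To make the idea of dynamic collection assignment concrete, the following is a minimal sketch in Python, not Combine's actual implementation or configuration format. The rule structure, the collection names, the `example.org` domains and the `assign_collection` function are all hypothetical and chosen only for illustration; the sketch covers host-domain and simple string matching, and leaves out automatic classification and metadata requirements.

```python
from urllib.parse import urlparse

# Hypothetical rules: each collection lists host domains and keywords.
# This layout is an assumption for illustration, not Combine's config syntax.
COLLECTION_RULES = [
    {"name": "engineering", "domains": ("eng.example.org",), "keywords": ()},
    {"name": "medicine", "domains": (), "keywords": ("clinical", "patient")},
]


def host_in_domain(host, domain):
    """True if host equals the domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def assign_collection(url, text, rules=COLLECTION_RULES, default="general"):
    """Return the name of the first collection whose rule matches the record."""
    host = (urlparse(url).hostname or "").lower()
    lowered = text.lower()
    for rule in rules:
        # Host-domain test first, then plain substring matching on the content.
        if any(host_in_domain(host, d) for d in rule["domains"]):
            return rule["name"]
        if any(keyword in lowered for keyword in rule["keywords"]):
            return rule["name"]
    return default


if __name__ == "__main__":
    print(assign_collection("http://www.eng.example.org/a.html", "Bridge design"))
    print(assign_collection("http://other.example.com/", "A clinical trial"))
```

Rules are tried in order and the first match wins; a real robot would probably need a policy for records that match several collections.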

The upgraded Combine software is available from: http://www.lub.lu.se/combine/

Forthcoming features that were not completed at the time of delivery:

  • An improved XML-based record format built upon an internal DTD.
  • A parser that extracts information from UNIX tar archives, and possibly from zip archives as well.
  • Support for the FTP protocol through an HTTP proxy, to be used in conjunction with the archive parser.
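As a rough illustration of what the planned archive parser would need to do first, the sketch below lists the regular files inside a tar archive using Python's standard `tarfile` module. It is not part of Combine; the function name `list_archive_members` is hypothetical, and the step of passing each member on to a summarizer is left out.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return (name, size) pairs for the regular files in a tar archive."""
    # "r:*" lets tarfile detect compression (none, gzip, bz2, xz) by itself.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


if __name__ == "__main__":
    # Build a small archive in memory so the example is self-contained.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        data = b"hello"
        info = tarfile.TarInfo("docs/readme.txt")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    print(list_archive_members(buffer))
```

Working from a file object rather than a path means the same function could be applied to an archive held in memory after retrieval, without writing it to disk.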

Some functional requirements cannot be met at present:

  • Since the World Wide Web Consortium (W3C) has not released RDF Schema as a Recommendation, and since the Dublin Core Metadata Initiative (DCMI) has issued neither recommendations on how Dublin Core (DC) should be encoded in RDF nor a set of interoperability qualifiers, we are not yet prepared to support RDF metadata.

Keywords

Metadata, Harvesting robot, Collections, Resource discovery databases