Project Number: | RE 4004 (RE) | ||
Project Title: | DESIRE II - Development of a European Service for Information on Research and Education II | ||
Deliverable Number: | D3.7 | ||
Deliverable Title: | Upgraded harvesting system including multiple retrieval protocols | ||
Deliverable Type: | PU | ||
Deliverable Kind: | TO | ||
Principal Reviewer: | Name | Eliot Christian | |
Address | U.S. Geological Survey, 802 National Center, Reston, VA 20192, USA | ||
echristi@usgs.gov | |||
Telephone | +1-703-648-7245 | ||
Fax | +1-703-648-7112 | ||
Credentials | (Qualifications/Relevant experience/short CV) Eliot Christian is helping develop and promote a global vision for information access that enhances the free flow of information through a decentralized "Global Information Locator Service". He helped establish this approach in law, policy, standards, and technology at the United States Federal level, building consensus among government agencies and developing key support among libraries, information services, and corporations. He has been carrying these ideas to other levels of government and internationally, leading the ISO Metadata Working Group and consulting on a variety of initiatives supporting a Global Information Infrastructure. Since 1990, Mr. Christian has pursued issues of data and information management primarily from the perspective of environment and earth science at the interagency and international levels. He joined the United States Geological Survey in 1986, as a manager of data and information systems with a focus on strategic planning, standards, and new technologies. From 1975 to 1986, he worked as a computer resources manager in the Veterans Administration, helping direct six nationwide data processing centers, overseeing data management for all VA corporate databases, and operating a large, state-of-the-art distributed database transaction system. | ||
Summary: | Relevant | (1 = poor, 5 = excellent) 5 The new version of the Combine harvester is very relevant to the overall goals of the Desire project and the specific task to upgrade the harvesting system. Over the past two years, there has been a growing awareness that gathering information over the Internet is a complex and demanding challenge that is absolutely crucial if distributed information resources are to be exploited effectively. Commercial search engines often compete now on the speed, reliability, and ease-of-use of their Web harvesting component. | |
State-of-Art | 5 The Combine harvester has features that distinguish it as at the forefront among available harvesting software, such as multiple concurrent and distributed operation, an adjustable revisiting algorithm, and a way to determine whether content has changed since the prior retrieval. I believe there is a significant innovation here represented by the Combine catalog record, presently defined as the internal record format of the hdb database. | ||
Meets Objectives | 4 I did not see that the Combine harvester has been extended to cover other protocols such as FTP, NNTP, and mailing lists. Perhaps this objective is envisioned to be addressed by alternate harvesters? | ||
Clarity | 4 I am afraid readers may be overwhelmed when first confronted by a detailing of Combine features. As now presented, your audience may be limited to system administrators already convinced they require a full-featured and complicated harvesting facility. | ||
Value to Users | 5 The upgraded harvester should be of great value to those who need to harvest complex Web information resources and to those who need to find information resources through the intermediary of a harvesting service. | ||
Specific Criticisms | 1 | It strikes me that a harvesting service can be regarded as a kind of selective filter of information resources. This being so, I believe it would be good practice for operators of a harvesting facility to make it as "transparent" as possible. In the case of Combine, it would be useful for the software to emit a human-readable description of all of its default as well as operator-specified behaviors. In effect, these behaviors are a kind of "collections policy" that affect available content at least as much as do more overt filtering mechanisms like PICS labels. | |
2 | Following on the "transparency" theme above, I believe it would be quite useful to expose the Combine catalog record rather than leaving it as merely an internal record format. Perhaps this record could be represented in a format such as that used in the Advanced Search Facility, itself based on a GILS XML representation described in an RDF schema at <http://www.gils.net/xml-en.rdfs>. In such a format, these records could be not only searched interoperably with libraries and other catalogs, but also readily reviewed, edited, and freely interchanged. | ||
3 | A more approachable presentation might begin with an almost trivial example of an application of Combine, and proceed to additional features by expanding the example application. For instance, Combine might be introduced for a one-time harvest of a single set of Web pages on a local disk. This example could be expanded stepwise to include: scheduled harvesting, multiple host harvesting, multiple harvesters, etc. | ||
4 | I would be interested to see more development of the ability to exploit contextual or implied metadata (e.g., pointers from elsewhere, user evaluations, Internet registration records, domain assertions, etc.), although the "$get_meta" directive is a start. | ||
Developer Response: | 1 | There is in fact a script which will report its current configuration, and which parameters are configurable. The "filtering" mechanisms in Combine are very much hard coded into the extraction software. This could be more transparent and configurable. | |
2 | We have made an experimental implementation of an RDF record for Combine, rather than the current "pseudo sgml". We also have a SOIF export facility. The problem with the former is that it was hard to parse back into its internal representation, called XWI, which is an object-oriented internal software representation of a record. We have evaluated the GILS XML for the purpose, and our conclusion is that it is more suitable for our purposes. However, our current bet is to provide an extensible XWI object which, using a "schema" module, can cross-walk between XWI and possibly many different XML representations by means of a configuration file, very much like the Collections configuration file. | ||
3-4 | Anders Ardö is the one who has experimented with adding inferential capabilities to Combine. Some of this work has been performed within the Desire projects; for example, there exist Combine modules for automatic classification. This capability is already running as an instance of Combine that analyzes English texts to see whether they have biomedical content. It further assigns relevant Medical Subject Headings (MeSH) to them, and follows links only in pages that have biomedically relevant content. There is also a module which infers the language of a resource. | ||
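
The link-following behavior described in the response to points 3-4 (following links only in pages judged relevant) can be illustrated with a minimal sketch. It is not the Combine implementation: the term list, the `is_relevant` threshold test, the `focused_crawl` function, and the in-memory `PAGES` table are all hypothetical stand-ins, and the real module classifies texts with its own method rather than simple term matching.

```python
from collections import deque

# Hypothetical term list, used only for illustration; the actual Combine
# classifier and its MeSH assignment are not reproduced here.
BIOMEDICAL_TERMS = {"protein", "clinical", "patient", "disease", "therapy"}


def is_relevant(text, terms=BIOMEDICAL_TERMS, threshold=2):
    """Return True when at least `threshold` distinct terms occur in the text."""
    words = set(text.lower().split())
    return len(words & terms) >= threshold


def focused_crawl(pages, start):
    """Visit pages breadth-first, expanding links only from relevant pages.

    `pages` maps a URL to a (text, links) tuple and stands in for fetching.
    Returns the relevant URLs in the order they were visited.
    """
    seen = {start}
    queue = deque([start])
    relevant = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue
        text, links = pages[url]
        if not is_relevant(text):
            continue  # irrelevant page: record nothing and follow no links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


# Toy link graph (assumed data): "e" is reachable only through the
# irrelevant page "b", so the crawl never reaches it.
PAGES = {
    "a": ("clinical trial of a new therapy for the disease", ["b", "c"]),
    "b": ("football results and league tables", ["e"]),
    "c": ("patient records and clinical guidelines", ["d"]),
    "d": ("protein folding and disease models", []),
    "e": ("patient therapy outcomes", []),
}

result = focused_crawl(PAGES, "a")
```

In this toy graph, page "e" has relevant text but is linked only from the irrelevant page "b", so it is never visited. That trade-off is inherent in the approach: pruning at irrelevant pages keeps the harvest focused at the cost of missing relevant pages that lie behind them.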