Project Number: | RE 4004 (RE) |
Project Title: | DESIRE II - Development of a European Service for Information on Research and Education II |
Deliverable Type: | PU |
Deliverable Number: | D3.3 |
Contractual Date of Delivery: | April 2000 |
Actual Date of Delivery: | April 2000 |
Title of Deliverable: | DESIRE Integrated Toolkit |
Workpackage(s) contributing to the Deliverable: | WP3 |
Nature of the Deliverable: | PR |
Author: | Tracey Hooper |
Contact Details: | Institute for Learning and Research Technology |
Other Authors: | Tim Dixon, TDC Networking Consultancy Limited |
URL |
Abstract | This is the final report describing the public release of the DESIRE Toolkit, available in April 2000. During the course of the DESIRE project, a number of different software components have been produced. In many cases, these build upon previous work (such as the ROADS toolkit), or on work undertaken in phase one of the project (eg Combine). Phase two of DESIRE includes a number of strands that share common software requirements. The purpose of this toolkit is to provide a supporting environment and framework to allow collective and ongoing development of this software, and to provide both an environment and framework for their maintenance beyond the lifetime of DESIRE-II. |
Keywords | Software |
Distribution List: | DESIRE Project Team; European Commission |
Issue: | V0.2 |
Reference: | DESIRE Integrated Toolkit |
Total Number of Pages: | 31 |
Issue Number | Issue Date | Reason for Change |
0.1 | 12/4/00 | (Initial Draft) |
This deliverable marks the public release of the DESIRE toolkit in April 2000.
During the course of the DESIRE project, a number of different software components have been produced. In many cases, these build upon previous work (such as the ROADS toolkit), or on work undertaken in phase one of the project (eg Combine). Phase two of DESIRE includes a number of strands that share common software requirements. The purpose of this toolkit is to provide a supporting environment and framework to allow collective and ongoing development of this software, and to provide both an environment and framework for their maintenance beyond the lifetime of DESIRE-II.
There are already a number of 'resource discovery', metadata and Web indexing themed software toolkits in existence. There are also projects such as the Advanced Search Framework (ASF) which draw together a number of smaller components to produce services and applications. ROADS and Combine are good examples of toolkits which themselves include a number of sub-components, while also usefully serving as components of larger applications.
Many of the developments in DESIRE phase two build upon these software distributions. Rather than risk duplication by creating a monolithic 'DESIRE software distribution' for the DESIRE toolkit, we instead adopt a decentralised model. ROADS and Combine have an existence beyond the scope of the DESIRE project. Through the creation of the DESIRE software toolkit, we aim to bring these diverse components together under a broad umbrella.
In order to provide a resource that will have a continued applicability beyond the lifetime of the project, the tools themselves will be used to construct an online system that will allow developers to contribute, select and check the compatibility of a wide variety of software tools, initially drawn from the DESIRE environment, but extensible to other relevant tools in the future.
This is a final document, covering the high-level design of the toolkit and introducing some of the tools developed in various Work Packages of the DESIRE II project.
Its intended audience comprises the developers of electronic information services who are looking for tools to assist them in their work. This document gives an overview of the tools and guidance for selecting appropriate tools in a variety of common service scenarios.
Centroid
A summary of the information contained in a whois++ index used to guide the routing of information queries to the likely source of the information being sought.
CIP
Common Indexing Protocol (RFC 2651, RFC 2652, RFC 2653)
IAFA
Internet Anonymous File Archive
LDAP
Lightweight Directory Access Protocol (RFC 1777)
RDF
Resource Description Framework
TIO
A Tagged Index Object for use in the Common Indexing Protocol (RFC 2654)
URL
Uniform Resource Locator (RFC 1738)
Whois++
A network information lookup service and associated protocol (RFC 1834, RFC 1835, RFC 1913, RFC 1914)
Although, ultimately, the value of the work done in DESIRE II is to be found in the quality and usefulness of the software tools themselves, their worth can only be exploited if the developers of information systems can readily locate the tools they need to build their applications. Since the tools developed in the DESIRE projects cover a wide variety of different application areas, and are designed to complement tools available from other sources, the construction of a specific application may require the selection of several tools from different sources.
The toolkit comprises this document (which introduces the range of tools available and guidelines for selecting appropriate tools from the kit for a variety of applications) and a publicly-available software repository which contains a variety of source code and installation kits for the tools. In a small number of cases, code will not be made available online (for example, for privacy reasons, the code to index LDAP directory servers will have restricted availability); in these cases, the developers can be contacted directly for further details of availability.
To assist in the selection of tools, all components of the toolkit will be described and categorised in a standardised fashion which captures information about the nature and availability of the component and the possible ways in which it may be combined with other components. Each relevant characteristic of a component is named an Attribute. Each attribute may occur exactly once (indicated by 1 in the table below) or may occur zero or more times (indicated by * in the table below) or may occur 1 or more times (indicated by +). Each attribute may have a single value (such as the name of the component) or may consist of a list, each item of the list describing a different aspect of the attribute’s value. In the table below, lists are shown in the conventional manner – items grouped in parentheses and separated by commas.
The value of each attribute is interpreted according to the syntax shown below. Most attributes consist of text, but the values of others are constrained. Syntax constraints noted below include:
Category: A high-level classification of the nature of the software tool: (resource description, resource search, resource retrieval, metadata conversion, metadata management, metadata inference, protocol conversion, user interface)
URL: The value must be a URL according to RFC 1738
Input: The value encodes the identification of a data representation or network protocol which the tool accepts as input and must be one of (whois++, CIP, LDAP, LDIF, Z39.50, HTTP, FTP, IAFA, RDF, ROADS, centroid, DVT Tree, DVT Database, RUDOLF-RDF, TIO)
Output: The value encodes the identification of a data representation or network protocol which the tool produces as output and must be one of (whois++, CIP, LDAP, LDIF, Z39.50, HTTP, FTP, IAFA, RDF, ROADS, centroid, HDB, DVT Tree, DVT Database, RUDOLF-RDF, TIO)
SupportType: The value encodes an indication of the level of support available for the tool and must be one of (Unsupported, Community, Developer, Commercial)
LicenceType: The value encodes an indication of the terms on which the software is made available and must be one of (Public Domain, Open Source, Educational, Commercial, Restricted Distribution)
Attribute | Occurrence | Syntax | Explanation |
Name | 1 | Text | The name of the component |
Nature | + | Category | The category or categories of function performed by the tool |
Function | 1 | Text | A description of the component’s main function |
Application | 1 | Text | A description of the application (or applications) in which the component might be used |
Example | * | (URL, Text) | Examples of online services that have been built with this tool – the URL at which the example can be found and a text description of the key features of the example |
Origin | 1 | Text | A description of the individual or organisation providing the component |
Licence | 1 | (LicenceType, Text) | The legal terms relating to use of the component |
Support | + | SupportType or (SupportType, URL) | The type of support available and, optionally the URL of a Web page or mailing list from which support may be obtained |
InstallAndDocs | * | URL | Location(s) at which information can be found relating to the installation and documentation of the tool |
HandbookXref | * | URL | Cross reference to sections of the DESIRE Information Gateways Handbook where additional information can be found. |
Consumes | * | (Input, Text) | For each input type supported by the tool, a pair of values specifying the input type accepted and describing the use made of the input |
Produces | * | (Output, Text) | For each type of output produced by the tool, a pair of values specifying the output type produced and describing the output |
Requirements | * | Text | Supplementary text describing the system requirements of the toolkit component |
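The occurrence rules in the table above lend themselves to a mechanical check. The following is a minimal sketch in Python, not part of the toolkit itself: the function names validate and can_feed, the representation of a description as a list of (attribute, value) pairs, and the abbreviated sample descriptions are all assumptions made for illustration.

```python
from collections import Counter

# Occurrence codes from the attribute table: "1" = exactly once,
# "+" = one or more times, "*" = zero or more times.
OCCURRENCE = {
    "Name": "1", "Nature": "+", "Function": "1", "Application": "1",
    "Example": "*", "Origin": "1", "Licence": "1", "Support": "+",
    "InstallAndDocs": "*", "HandbookXref": "*", "Consumes": "*",
    "Produces": "*", "Requirements": "*",
}

CATEGORIES = {
    "resource description", "resource search", "resource retrieval",
    "metadata conversion", "metadata management", "metadata inference",
    "protocol conversion", "user interface",
}


def validate(description):
    """Return a list of problems in a list of (attribute, value) pairs."""
    problems = []
    counts = Counter(name for name, _ in description)
    for name in counts:
        if name not in OCCURRENCE:
            problems.append(f"unknown attribute: {name}")
    for name, code in OCCURRENCE.items():
        found = counts.get(name, 0)
        if code == "1" and found != 1:
            problems.append(f"{name} must occur exactly once, found {found}")
        elif code == "+" and found < 1:
            problems.append(f"{name} must occur at least once")
    for name, value in description:
        if name == "Nature" and value.lower() not in CATEGORIES:
            problems.append(f"unknown category: {value}")
    return problems


def can_feed(producer, consumer):
    """Return the formats that producer emits and consumer accepts."""
    produced = {value[0] for name, value in producer if name == "Produces"}
    consumed = {value[0] for name, value in consumer if name == "Consumes"}
    return produced & consumed


# Abbreviated sample descriptions (illustrative values only).
COMBINE = [
    ("Name", "Combine"), ("Nature", "resource retrieval"),
    ("Function", "Web harvesting robot"),
    ("Application", "Regional Web indexes"),
    ("Origin", "Lund University Library NetLab"),
    ("Licence", ("Open Source", "")), ("Support", "Developer"),
    ("Consumes", ("HTTP", "pages to harvest")),
    ("Produces", ("HDB", "harvested records")),
]
MATCHER = [
    ("Name", "Matcher"), ("Nature", "Metadata inference"),
    ("Function", "Subject classification"),
    ("Application", "Automatic classification of WWW pages"),
    ("Origin", "Lund University Library NetLab"),
    ("Licence", ("Open Source", "")), ("Support", "Developer"),
    ("Consumes", ("HTTP", "documents")), ("Consumes", ("HDB", "records")),
    ("Produces", ("RDF", "classifications")),
    ("Produces", ("HDB", "classified records")),
]
```

With these samples, validate(COMBINE) reports no problems, and can_feed(COMBINE, MATCHER) returns the single shared format HDB, which is the kind of compatibility check the Consumes and Produces attributes are intended to support.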
To aid information system builders in their choice of appropriate tools, the latter section of this document introduces a number of common service scenarios and describes the way in which different tools may be deployed to support those services.
The DESIRE Information Gateways Handbook describes good practice in setting up and running information gateways. The software in the DESIRE II toolkit is the result of new research and development and extends the possibilities available beyond what is described in the current edition of the handbook, although some developments are anticipated in the handbook. Future editions of the handbook will take account of the new possibilities raised by the toolkit.
The preliminary toolkit consists of the tools listed in this section. Each has been categorised according to the description scheme set out above.
ROADS is a system which stores metadata describing information resources in the form of IAFA templates. Resource descriptions can be searched via locally-generated Web pages, or remotely using the whois++ protocol. Resource descriptions can be amalgamated and exchanged with other retrieval systems in the form of centroids or using the Common Indexing Protocol. ROADS is particularly suited to the construction of manually-maintained catalogues of information resources which require a consistent approach to categorisation or rating. ROADS is commonly used to build Subject-Based Information Gateways.
Useful for: Manually maintaining information gateways, especially those requiring consistent cataloguing or rating.
Attribute | Value |
Name | ROADS |
Nature | resource description, resource search, resource retrieval, metadata conversion |
Function | ROADS is a system which stores metadata describing information resources in the form of IAFA templates. Resource descriptions can be searched via locally-generated Web pages, or remotely using the whois++ protocol. Resource descriptions can be amalgamated and exchanged with other retrieval systems in the form of centroids or using the Common Indexing Protocol. |
Application | ROADS is particularly suited to the construction of manually-maintained catalogues of information resources which require a consistent approach to categorisation or rating. ROADS is commonly used to build Subject-Based Information Gateways. |
Example | http://www.sosig.ac.uk/ |
Origin | Department of Computer Science at Loughborough University of Technology |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | whois++, IAFA |
Produces | whois++, centroid, IAFA |
Requirements | UNIX, an HTTP server supporting CGI, Perl (see http://www.roads.lut.ac.uk/v2/Manual/manual-1.html#ss1.4) |
HandbookXref |
An extension to the ROADS software toolkit to allow the creation and storage of Dublin Core metadata. Dublin Core elements are mapped to ROADS IAFA attributes for storage to allow interoperability with other ROADS-based services and so that various ROADS scripts continue to work without modification. The cataloguer and end user see only the mapped views of the metadata represented as a DC element set. Included is a script to generate ‘on-the-fly’ Dublin Core in RDF representations of the stored records (wpp2qualdc.pl) and also scripts to inform the content providers of the metadata. The content providers are encouraged to create a relationship between their resource and the corresponding metadata record using the HTML ‘link’ tag to call the wpp2qualdc.pl script.
The Dublin Core Metadata Repository is intended for use as a substitute for the IAFA templates normally used in ROADS. This will not be suitable for all kinds of data, since the IAFA templates can support a much wider range of attributes than the Dublin Core elements, but the use of Dublin Core standards will allow a high degree of interoperability with other services which use Dublin Core, whether ROADS-based or not.
Useful for: Maintaining subject gateways, if DC is a suitable format for their metadata.
Other applications for storing metadata
Attribute | Value |
Name | ROADS Dublin Core Metadata Repository (‘DC in a box’) |
Nature | resource description, resource search, resource retrieval, metadata conversion |
Function | An extension to the ROADS software toolkit to allow the creation and storage of Dublin Core metadata. |
Application | Dublin Core elements are mapped to ROADS IAFA attributes for storage to allow interoperability with other ROADS-based services and so that various ROADS scripts continue to work without modification. The cataloguer and end user see only the mapped views of the metadata represented as a DC element set. Included is a script to generate ‘on the fly’ Dublin Core in RDF representations of the stored records (wpp2qualdc.pl) and also scripts to inform the content providers of the metadata. The content providers are encouraged to create a relationship between their resource and the corresponding metadata record using the HTML ‘link’ tag to call the wpp2qualdc.pl script. |
Example | |
Origin | ILRT |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | whois++, IAFA |
Produces | whois++, centroid, IAFA, Dublin Core RDF |
Requirements | UNIX, an HTTP server supporting CGI, Perl (see http://www.roads.lut.ac.uk/v2/Manual/manual-1.html#ss1.4) |
HandbookXref |
The aim of the design and architecture of Combine is to provide a harvesting system which can be used for building moderately large indexes; however, no attempt has been made to compete with the worldwide commercial search engines. Rather, Combine could be used for building an index covering a small country or all universities in a region. Combine can be used not only for expanding a collection, but also simply for harvesting text and some other information from a list of resources (as in the European Link Treasury toolkit: http://www.lub.lu.se/combine/ELT/).
The harvesting policies can be formulated flexibly using rules for inclusion and exclusion, allowing distributed data collection; this implies that a number of servers will each have the responsibility for one or more regions or domains in a broad sense. These areas of responsibility can be assigned on the basis of actual network domains, organisations or geographical domains, as well as for domains of human knowledge.
An important part of the architecture is an easy way of filtering the sets of URLs to be indexed according to subject or domain. Before a random set of URLs is loaded into the scheduler for processing, they are filtered through an external policy filter. This filter, which is localised for each installation, determines which URLs are to be harvested, given the policy adopted by the installation. It thus defines the region or domain which a particular installation will cover.
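The idea of an external policy filter can be illustrated with a short Python sketch. This is not Combine's own filter or configuration syntax: the rule lists INCLUDE_HOSTS and EXCLUDE_PATHS, the function names, and the example policy (UK academic hosts only, no CGI scripts or images) are all hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical local policy: harvest hosts under ac.uk, but skip
# CGI scripts and image files.
INCLUDE_HOSTS = [re.compile(r"(^|\.)ac\.uk$")]
EXCLUDE_PATHS = [
    re.compile(r"^/cgi-bin/"),
    re.compile(r"\.(gif|jpe?g|png)$", re.IGNORECASE),
]


def allowed(url):
    """Return True if the URL falls inside this installation's policy."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not any(rule.search(host) for rule in INCLUDE_HOSTS):
        return False
    return not any(rule.search(parts.path) for rule in EXCLUDE_PATHS)


def policy_filter(urls):
    """Keep only the candidate URLs that the policy allows."""
    return [url for url in urls if allowed(url)]


CANDIDATES = [
    "http://www.sosig.ac.uk/",
    "http://www.sosig.ac.uk/cgi-bin/search",
    "http://www.lub.lu.se/combine/",
    "ftp://ftp.bris.ac.uk/pub/",
    "http://www.ilrt.bris.ac.uk/logo.GIF",
]
```

Here policy_filter(CANDIDATES) keeps only the first URL. Because the rules live in one place outside the harvester, each installation can define its own region or domain without changing the harvesting code.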
Combine was originally developed during DESIRE I, and in phase II of DESIRE its source code has been maintained and its configuration system improved in the following ways:
Useful for: Creating a large harvested database, for example one covering a particular subject area, geographical area and/or domain such as higher education.
Not useful if: A precisely focused and catalogued database is needed
Attribute | Value |
Name | Combine |
Nature | resource retrieval |
Function | Combine is a robot for harvesting Web resources, and is designed to be distributable, parallel and flexible. It is distributable in the sense that different parts of Combine can run on separate computers. It is parallel in the sense that some parts of a Combine system can exist in several instances to increase performance. It is flexible because the system is assembled from small and relatively simple building blocks in a way that is modifiable by the user. |
Application | Combine can be used to produce distributed, federated and regional Web indexes. |
Example | http://safari.hsv.se/index.html.en |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | HTTP |
Produces | HDB |
Requirements | Linux version 1.2 (or higher), or Solaris 2.5 (or higher) |
HandbookXref |
As searching for information on the Internet becomes more reliant upon structured meta-information, there will be a need for new types of tools for interfacing between controlled vocabularies (such as terms lists, classification systems and thesauri) and HTML forms in applications such as metadata creators and search engines. The DVT (DESIRE Vocabulary Toolkit) is such a tool. The vocabulary database tools allow the construction and manipulation of databases of specialist vocabularies which can then be used to facilitate browsing in user interfaces to search services. The vocabulary browser tools allow interaction between a search interface and a vocabulary database constructed with the DESIRE vocabulary database tools, and consist of two parts:
Useful for: Databases which could be usefully browsed using a specialist vocabulary, classification scheme or thesaurus. The vocabulary tools may not be used if conditions of use of the vocabulary, classification scheme or thesaurus forbid its display and use independently of classifying records.
Attribute | Value |
Name | DESIRE Vocabulary Database |
Nature | Metadata management |
Function | The vocabulary database tools allow the construction and manipulation of databases of specialist vocabularies which can then be used to facilitate browsing in user interfaces to search services. |
Application | The vocabulary tools can be used to build systems for searching across a number of Internet data stores while traversing a vocabulary database. |
Example | |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | http://www.desire.org/toolkit/vocab.html |
Consumes | DVT Tree, DVT Database |
Produces | DVT Tree, DVT Database |
Requirements | WWW server supporting CGI, Perl |
HandbookXref | (none) |
For descriptions, see section 2.4.
Attribute | Value |
Name | DESIRE Vocabulary Browser |
Nature | User interface |
Function | The vocabulary browser tools allow interaction between a search interface and a vocabulary database constructed with the DESIRE vocabulary database tools. |
Application | The vocabulary tools can be used to build systems for searching across a number of Internet data stores while traversing a vocabulary database. |
Example | |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | DVT Database |
Produces | |
Requirements | WWW server supporting CGI, Perl |
HandbookXref | (none) |
For descriptions, see section 2.4.
Attribute | Value |
Name | DESIRE Vocabulary Browser for Z39.50 |
Nature | User interface |
Function | The vocabulary browser tools allow interaction between a Z39.50 search service and a vocabulary database constructed with the DESIRE vocabulary database tools. |
Application | The vocabulary tools can be used to build systems for searching across a number of Z39.50 servers while traversing a vocabulary database. |
Example | |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | DVT Database |
Produces | Z39.50 |
Requirements | WWW server supporting CGI, Perl |
HandbookXref | (none) |
The RDF data model presents a flexible, expressive approach to representing structured data for the Web. This has both advantages and drawbacks. A key difference between RDF storage systems and the traditional relational approach is that in the case of generalised RDF stores it is necessary to anticipate the need to manage data drawing on metadata vocabularies that were unknown at the time when the database was initialised. This tool provides a generalised data storage engine for RDF, using the Berkeley Database system for on-disk representation and indexing, and defines an API which may be used to access it. It can be used by any application which needs to store or retrieve metadata in RDF format.
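The point about unknown vocabularies can be made concrete with a small sketch. The class below is an in-memory illustration in Python and is not the toolkit's API, which is not reproduced here; the class name TripleStore, its methods, and the ratings property URI are assumptions. The real tool keeps its data on disk in Berkeley Database files rather than in Python sets.

```python
from collections import defaultdict


class TripleStore:
    """A toy store of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)
        self._by_predicate = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Store one statement; no schema has to be declared first."""
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)
        self._by_predicate[predicate].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return statements matching the given parts (None = wildcard)."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        elif predicate is not None:
            candidates = self._by_predicate.get(predicate, set())
        else:
            candidates = self._triples
        return sorted(
            t for t in candidates
            if (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Hypothetical vocabulary that was unknown when the store was created.
RATING = "http://example.org/ratings#score"

store = TripleStore()
store.add("http://www.sosig.ac.uk/", DC_TITLE, "SOSIG")
store.add("http://www.sosig.ac.uk/", RATING, "5")
```

The second add call uses a property from a vocabulary the store has never seen, and nothing needs to be altered to accept it. A relational design with one column per attribute would instead require a schema change at this point.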
Useful for: Storing structured data, including but not only metadata, for retrieval over the web.
Examples of particular applications are storage of thesauri and of quality ratings.
Attribute | Value |
Name | RDF Data Store |
Nature | Metadata management |
Function | The RDF data model presents a flexible, expressive approach for representing structured data for the Web. This has both advantages and drawbacks. A key difference between RDF storage systems and the traditional relational approach is that with generalised RDF stores, it is necessary to anticipate the need to manage data drawing on metadata vocabularies that were unknown at the time the database was initialised. This tool provides a generalised data storage engine for RDF and defines an API which may be used to access it. |
Application | Any application which needs to store or retrieve metadata in RDF format. |
Example | http://www.grapevine.sosig.ac.uk/grapevine/recommender.htm |
Origin | University of Bristol Institute for Learning and Research Technology |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | RUDOLF-RDF |
Produces | RUDOLF-RDF |
Requirements | Berkeley Database (http://www.sleepycat.com/), Perl (or Java) |
HandbookXref | (none) |
A simple RDF rating and recommendation server built on top of a graph-oriented RDF API, which solicits, manages and stores ratings and recommendations using the RDF graph APIs. The Opinion Server application provides mechanisms for constructing simple RDF statements about Web resources and for writing those statements into an RDF store. The Java version provides a mechanism for displaying the results.
As a further example, a UKOLN demonstrator, using JavaScript, allows ratings to be made available at the click of a personal toolbar button:
http://www.ukoln.ac.uk/metadata/desire/qualityratings/displayquality/.
Useful for: Gathering ratings and recommendations from a knowledgeable user community
Attribute | Value |
Name | Opinion Server |
Nature | Metadata management |
Function | A simple RDF rating and recommendation server built on top of a graph-oriented RDF API. |
Application | Solicits, manages and stores ratings and recommendations using the RDF graph APIs. The Opinion Server application provides mechanisms for constructing simple RDF statements about Web resources and writing those statements into an RDF store. |
Example | http://www.grapevine.sosig.ac.uk/grapevine/recommender.htm |
Origin | University of Bristol Institute for Learning and Research Technology |
Licence | Open Source |
Support | Developer |
InstallAndDocs | http://www.desire.org/toolkit/opinionJ.html (Java) |
Consumes | RUDOLF-RDF |
Produces | RUDOLF-RDF |
Requirements | Java/Perl as appropriate |
HandbookXref | http://www.desire.org/handbook/3-6.html |
The LDAP crawler is an LDAPv2/v3 directory robot that produces LDIF (LDAP Interchange Format) dumps of data in a specified Directory Information Tree (DIT). It can be used to feed centralised or distributed indexing services that, e.g., offer a single entry-point to a set of LDAP servers in an organisation or country.
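To show what an LDIF dump looks like, the following Python sketch serialises directory entries in the text format defined by RFC 2849. It is not the crawler's code: the directory search itself is omitted, the function names are assumptions, the sample entry is invented, and the folding of long lines is left out for brevity.

```python
import base64


def needs_base64(value):
    """True if RFC 2849 requires or recommends base64 for this value."""
    if not value:
        return False
    if value[0] in " :<" or value[-1] == " ":
        return True
    return any(ord(ch) > 127 or ch in "\0\r\n" for ch in value)


def ldif_line(name, value):
    """Format one attribute line, using '::' for base64-encoded values."""
    if needs_base64(value):
        encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
        return f"{name}:: {encoded}"
    return f"{name}: {value}"


def to_ldif(entries):
    """Serialise (dn, {attribute: [values]}) pairs as LDIF text."""
    lines = ["version: 1"]
    for dn, attributes in entries:
        lines.append("")  # a blank line separates records
        lines.append(ldif_line("dn", dn))
        for name, values in attributes.items():
            lines.extend(ldif_line(name, value) for value in values)
    return "\n".join(lines) + "\n"


# Invented entry standing in for the result of a directory search.
ENTRIES = [
    ("cn=Jan Jansen,o=Example,c=NL",
     {"objectClass": ["person"], "cn": ["Jan Jansen"], "l": ["Curaçao"]}),
]
```

Calling to_ldif(ENTRIES) yields a version line followed by one record. The l attribute contains a non-ASCII character, so it is written with a double colon and a base64 value, while the other attributes appear as plain text.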
Useful for: Collecting data from a group of LDAP servers.
Attribute | Value |
Name | LDAP crawler |
Nature | Resource retrieval |
Function | Gathering of LDAPv2/v3 objects |
Application | The LDAP crawler is an LDAPv2/v3 directory robot that produces LDIF (LDAP Interchange Format) dumps of data in a specified Directory Information Tree (DIT). It can be used to feed centralised or distributed indexing services that, e.g., offer a single entry-point to a set of LDAP servers in an organisation or country. |
Example | 1. ldap://search.surfnet.nl/c=NL 2. http://search.surfnet.nl/naam/index.html |
Origin | SURFnet |
Licence | Copyright (c) 1998, SURFnet bv, the Netherlands. All rights reserved. This program may currently only be distributed among DESIRE project participants. |
Support | Developer: |
InstallAndDocs | |
Consumes | LDAP |
Produces | LDIF |
Requirements | Contact developer for further information |
HandbookXref |
The Generic Distributed Indexing Server collects, indexes and makes available forward knowledge about resources, based on the IETF Common Indexing Protocol (CIP). The stored CIP forward-knowledge objects are searchable for clients using (possibly various) search protocols.
The software includes:
The Generic Index Server provides a referral-based distributed indexing service that can (for example) offer a single entry point to a set of LDAP servers in an organisation or country, or a set of resource discovery services.
The architecture is explained in http://www.surfnet.nl/innovatie/surf-ace/search/ldap/d2_ldap_tio/
Useful for: large scale distributed directories, based on LDAP, where privacy of data is an issue. It requires work to make it function in distributed indexing systems based on other (non-LDAP) protocols.
Not useful if: the amount of data indexed is very limited and/or protection of data is less important, for example in the case of a corporate directory on an already closed intranet - in those cases, an off-the-shelf LDAP server is probably sufficient.
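The role of forward knowledge in referral-based searching can be sketched briefly. The Python below uses a centroid-like summary (the unique words seen for each attribute) as a simplified stand-in for the index objects exchanged with CIP; it does not implement CIP or the TIO format, and the function names, server URLs and records are invented for illustration.

```python
import re
from collections import defaultdict


def build_centroid(records):
    """Summarise records as the set of unique words seen per attribute."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].update(re.findall(r"\w+", value.lower()))
    return dict(centroid)


def route(attribute, word, centroids):
    """Return the servers whose summary says they may hold a match."""
    word = word.lower()
    return sorted(
        server for server, centroid in centroids.items()
        if word in centroid.get(attribute, ())
    )


# Invented servers and records.
CENTROIDS = {
    "ldap://a.example/": build_centroid(
        [{"cn": "Jan Jansen", "o": "Example University"}]),
    "ldap://b.example/": build_centroid(
        [{"cn": "Anna Smith", "o": "Example Institute"}]),
}
```

A query for the word Jansen in the cn attribute is referred only to ldap://a.example/. The index holds word lists rather than complete entries, which is one reason a design of this kind suits directories where privacy of data matters: the full record is released only by the server that owns it.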
Attribute | Value |
Name | DESIRE Generic Distributed Index Server |
Nature | resource search, resource retrieval, metadata conversion, metadata management, protocol conversion |
Function | Collects, indexes and makes available forward knowledge about resources, based on the IETF Common Indexing Protocol (CIP). The stored CIP forward knowledge objects are searchable for clients using (possibly various) search protocols. The software includes:
|
Application | The Generic Index Server provides a referral based distributed indexing service that can, e.g., offer a single entry-point to a set of LDAP servers in an organisation or country, or a set of resource discovery services. The architecture is explained in http://www.surfnet.nl/innovatie/surf-ace/search/ldap/d2_ldap_tio/. |
Example | http://www.sec.nl/persons/henny/desire/ldap/d2demo.html Demonstrator page on a Distributed LDAP-index service |
Origin | SURFnet, The Netherlands; University of Tuebingen, Germany. |
Licence | Copyright (c) 1999, SURFnet bv, the Netherlands. All rights reserved. The software is currently under development and will be made available to DESIRE project participants, initially. |
Support | Developers: |
InstallAndDocs | |
Consumes | LDIF, TIO, Centroid |
Produces | LDIF, LDAP, HTTP, TIO |
Requirements | Contact developers |
HandbookXref | (none) |
The Matcher tool implements a subject classification process using a subject-specific thesaurus by which terms are intellectually mapped to categories or subject classes. The classification process is made up of several steps. First, the document to be classified is fetched. Text is extracted from this document, and all thesaurus terms are matched to it. Some heuristic processing rules are applied to the results from the matching process. Finally, the outcome is formatted either for presentation or for storing in a database.
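The steps described above can be sketched in a few lines of Python. This is an illustration of the general technique and not the Matcher's implementation: the thesaurus entries, the weights, the single threshold heuristic and the function names are assumptions, and the fetching step is replaced by passing the document text in directly.

```python
import re
from collections import Counter

# Hypothetical thesaurus: term -> (subject class, weight).
THESAURUS = {
    "heat exchanger": ("Mechanical engineering", 3),
    "turbine": ("Mechanical engineering", 2),
    "transistor": ("Electrical engineering", 2),
    "circuit": ("Electrical engineering", 1),
}


def extract_text(html):
    """Strip tags, collapse whitespace and lower-case the text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).lower()


def classify(html, threshold=3):
    """Return the subject classes whose weighted term score is high enough."""
    text = extract_text(html)
    scores = Counter()
    for term, (subject, weight) in THESAURUS.items():
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        scores[subject] += hits * weight
    # Heuristic rule (an assumption): ignore classes with weak evidence.
    return sorted(s for s, score in scores.items() if score >= threshold)


PAGE = ("<html><title>Turbine design</title><body>A turbine drives "
        "the heat exchanger. One circuit.</body></html>")
```

For PAGE, the two mechanical engineering terms score well above the threshold while the single mention of a circuit does not, so classify(PAGE) assigns only the mechanical engineering class. The threshold plays the part of the heuristic processing rules: it keeps an incidental term from putting a document into the wrong class.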
Useful for: Classifying documents (such as those harvested by a robot).
Not useful if: qualified cataloguers are available to catalogue documents.
Attribute | Value |
Name | Matcher |
Nature | Metadata inference |
Function | The tool implements a subject classification process using a subject-specific thesaurus by which terms are intellectually mapped to categories or subject classes. The classification process is made up of several steps. First, the document to be classified is fetched. Text is extracted from this document, and all thesaurus terms are matched to it. Some heuristic processing rules are applied to the results from the matching process. Finally, the outcome is formatted either for presentation or for storing in a database. |
Application | Automatic subject classification of WWW-pages |
Example | |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | HTTP, HDB |
Produces | RDF, HDB |
Requirements | Perl |
HandbookXref | (none) |
This part of the toolkit links Combine and ROADS. It creates the vocabulary required for Matcher using the contents of a ROADS catalogue. The vocabulary can then be used to autoclassify a Combine catalogue.
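One way such a vocabulary could be derived is sketched below in Python. How the tool actually builds its vocabulary is not described in this document, so everything here is an assumption: the record layout (a list of keywords and one subject class per record), the min_count rule and the function name.

```python
from collections import Counter, defaultdict


def build_vocabulary(records, min_count=2):
    """Map each keyword to the subject class it most often appears under."""
    counts = defaultdict(Counter)
    for record in records:
        for keyword in record["keywords"]:
            counts[keyword.lower()][record["subject"]] += 1
    vocabulary = {}
    for keyword, by_subject in counts.items():
        subject, seen = by_subject.most_common(1)[0]
        # Keep a keyword only when the catalogue supports it often enough.
        if seen >= min_count:
            vocabulary[keyword] = subject
    return vocabulary


# Invented records standing in for a manually-maintained catalogue.
CATALOGUE = [
    {"keywords": ["Turbine", "pump"], "subject": "Mechanical engineering"},
    {"keywords": ["turbine"], "subject": "Mechanical engineering"},
    {"keywords": ["circuit"], "subject": "Electrical engineering"},
    {"keywords": ["circuit", "turbine"], "subject": "Electrical engineering"},
]
```

With this catalogue, build_vocabulary(CATALOGUE) maps turbine to mechanical engineering and circuit to electrical engineering, and drops pump because it occurs only once. The benefit of the approach is that the cataloguers' existing decisions are reused, so no separate thesaurus has to be compiled by hand.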
Useful for: Integrating ROADS, Combine and Matcher if all of these are being used.
Attribute | Value |
Name | Combine Auto-classification |
Nature | Metadata inference |
Function | Creates the vocabulary required for Matcher using the contents of a ROADS catalogue |
Application | Automatic classification of WWW-pages |
Example | |
Origin | Lund University Library NetLab |
Licence | Open Source |
Support | Developer |
InstallAndDocs | |
Consumes | HTTP, HDB |
Produces | RDF, HDB |
Requirements | Perl |
HandbookXref | (none) |
In practice, services are unlikely to need all components of the toolkit, or they may use parts of the toolkit with other software such as Z39.50. The following scenario illustrates how some parts of the toolkit might be used in conjunction with one another.
ROADS supplies the framework for manually creating and maintaining a catalogue of records in the chosen subject area. ROADS was part of DESIRE I, but further general development work has been done on it under DESIRE II, including the enhancements in the release of ROADS 2.3 in July 1999, and those in subsequent development versions of the software. Using ROADS, resources can be catalogued online and then searched or browsed by users. The browsing software now supports hierarchical subject classifications. ROADS is in itself sufficient to set up and run an information gateway; however, its functionality can be extended by using other components of the toolkit in conjunction with ROADS.
An expanded collection of resources can be created by sending the Combine robot to ‘harvest’ resources automatically by following links in specified starting points, such as a manually-catalogued collection which forms part of the same service. Combine can be configured in many different ways; for example, it can be asked to harvest to a particular depth from its starting point, or to confine itself to certain sites or domains. The Combine collection is likely to be an order of magnitude larger than a corresponding manually-catalogued collection, but less focused, with some of its content being outside the intended subject or other area. Moreover, Combine does not attempt to derive metadata except by extracting it (where present) from metadata and title fields in the resources themselves, and using the opening words of the text of each resource as a description; the records in the harvested collection are not classified in any way. Without further processing, therefore, the harvested collection can be searched but not browsed.
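Harvesting to a fixed depth within a set of permitted sites amounts to a bounded breadth-first traversal of the link graph. The Python sketch below shows the idea only; it is not Combine's scheduler, and the function name, its parameters and the small in-memory link table (used in place of real HTTP requests) are assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def harvest(start_urls, get_links, max_depth, allowed_hosts):
    """Visit pages breadth-first, up to max_depth links from a start URL."""
    seen = set()
    order = []
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in seen or urlsplit(url).hostname not in allowed_hosts:
            continue
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            for link in get_links(url):
                queue.append((link, depth + 1))
    return order


# Invented link table standing in for pages fetched over HTTP.
LINKS = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}


def links_of(url):
    """Return the outgoing links of a page in the sample table."""
    return LINKS.get(url, [])
```

Starting from http://a.example/ with a depth of 1 and only a.example permitted, the traversal collects the start page and http://a.example/1, skipping both the other site and the deeper pages. Raising the depth widens the collection, which is why a harvested collection tends to be larger but less focused than the catalogue it started from.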
A collection harvested in this way can be offered as an ‘advanced’ service, so that (for example) a search which produces no results on the manually-catalogued collection can be tried more productively on the collection harvested by Combine.
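The fallback behaviour described here might be sketched as follows. The record layout, the naive substring matching and the function names are assumptions made for illustration only; a real gateway would use the search facilities of ROADS or another indexing tool.

```python
def search(collection, query):
    """Return URLs of records whose text contains every query term (naive match)."""
    terms = query.lower().split()
    return [r["url"] for r in collection if all(t in r["text"].lower() for t in terms)]


def search_with_fallback(catalogue, harvested, query):
    """Search the hand-built catalogue first; use the harvested set only if it is empty."""
    hits = search(catalogue, query)
    if hits:
        return "catalogue", hits
    return "harvested", search(harvested, query)


# Hypothetical sample records.
catalogue = [{"url": "http://a.example/", "text": "Introduction to social policy"}]
harvested = [
    {"url": "http://a.example/", "text": "Introduction to social policy"},
    {"url": "http://b.example/stats", "text": "Regional housing statistics"},
]

print(search_with_fallback(catalogue, harvested, "social policy"))
print(search_with_fallback(catalogue, harvested, "housing statistics"))
```

Returning the name of the collection alongside the results lets the interface tell the user that the hits came from automatically harvested, unreviewed material.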
Part of the process of manually cataloguing resources is classifying them under an appropriate classification scheme. It would be very labour-intensive to do the same for the much larger collection of resources which harvesting by robot typically produces, but unless these resources are classified in some way, they cannot be browsed by subject. The Matcher matches the text of these resources against a subject-specific vocabulary linking keyword terms to a classification scheme, then applies some heuristic processing rules to the results to classify them, after which they can be added to a browsing interface.
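The general approach (term matching followed by heuristic weighting) can be shown in a simplified sketch. It is not the Matcher's actual algorithm: the vocabulary entries, class labels, weights, title boost and threshold below are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary: keyword term -> (class label, weight).
VOCABULARY = {
    "heat transfer": ("THERMO.TRANSFER", 3.0),
    "thermodynamics": ("THERMO.GENERAL", 3.0),
    "heat": ("THERMO", 1.0),
    "bridge": ("CIVIL.BRIDGES", 2.0),
}


def classify(title, body, threshold=4.0, title_boost=2.0):
    """Score each class by weighted term hits; keep classes at or above the threshold."""
    scores = defaultdict(float)
    for term, (label, weight) in VOCABULARY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        body_hits = len(re.findall(pattern, body.lower()))
        title_hits = len(re.findall(pattern, title.lower()))
        # Assumed heuristic: a hit in the title counts for more than a hit in the body.
        scores[label] += weight * (body_hits + title_boost * title_hits)
    ranked = sorted(((s, l) for l, s in scores.items() if s >= threshold), reverse=True)
    return [label for _, label in ranked]


classes = classify(
    "Notes on heat transfer",
    "Heat transfer by conduction. Heat transfer by radiation. A bridge is mentioned once.",
)
print(classes)
```

The threshold is what keeps a single passing mention (here, the word "bridge") from placing a resource under an unrelated heading.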
The classification software can also be used as a topic filter in a WWW harvester to select resources in a particular subject area on the fly during harvesting. This is a slow process, but produces a higher level of relevance in the pages harvested than using Combine without this filter (see http://dtv25.dtv.dk/CP/cp.html for a demonstration).
When a ROADS catalogue and a Combine-generated collection cover the same subject area, Combine Autoclassification can make classifying the latter even easier, by using the ROADS catalogue to create the vocabulary which the Matcher requires.
The elements of the toolkit described so far have been used for collecting and processing data about resources. The Opinion Server allows users to add value to the data in the gateway by recommending and rating particular resources. Such recommendations are particularly useful for resources harvested by Combine, which will not necessarily have been scrutinised by a cataloguer.
The RDF data store API is a generalised data storage engine. Its particular applications in a subject gateway environment could include storing the recommendation information created by the Opinion Server or storing a thesaurus. As the data is held in the Resource Description Framework (RDF), it can be made accessible to other software which uses RDF.
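RDF represents data as statements of the form subject, predicate, object. The sketch below shows how recommendation data could be held and queried in that form. It is an in-memory illustration only and does not reproduce the DESIRE RDF data store API; the class `TripleStore`, its methods and the `ex:` property names are hypothetical.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return statements matching the given parts; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Hypothetical recommendation data; all values are kept as strings.
store.add("http://res.example/page1", "ex:rating", "4")
store.add("http://res.example/page1", "ex:recommendedBy", "user:alice")
store.add("http://res.example/page2", "ex:rating", "2")

ratings = store.match(predicate="ex:rating")
print(ratings)
```

Because every statement has the same shape, the same store could equally hold thesaurus relationships (for example, a broader-term predicate linking two terms) without any change to the code.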
The collections included within the gateway could stand on their own, but their usefulness can be further extended by allowing them to be searched or browsed together with one another or with other gateways or sources of information.
Provided their records are in an appropriate format, ROADS catalogues are well suited to cross-browsing and cross-searching with other services, whether or not those services are ROADS-based (cf. http://www.desire.org/handbook/3-7.html: Interoperability). This spares the user the effort of searching or browsing the services sequentially. The immediate source of the information returned can be made clear by use of miniature logos or other branding.
A demonstration of cross-searching may be seen at http://www.rdn.ac.uk/xsearch.html. This links the gateways SOSIG and BIOME (which both use ROADS) and EMC (which does not).
The Generic Distributed Indexing Server collects content from servers (both LDAP and whois++) in the form of ‘centroids’. On receiving a search request, it can see from the presence or absence of an appropriate centroid whether the search terms occur in the database held on a particular server and hence whether to interrogate that server; this eliminates the need to connect to servers which do not have the requested information. The subject gateway outlined in this scenario could be one of a number of services which contribute to an index on the Generic Distributed Indexing Server.
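The routing decision described above can be sketched in a few lines. The sketch assumes that a centroid is the set of unique words occurring under each attribute in a server's records; the server names, sample records and function names are hypothetical, and the code does not reproduce the Generic Distributed Indexing Server's implementation or wire protocols.

```python
def build_centroid(records):
    """Collect the unique (attribute, word) pairs found in a server's records."""
    centroid = set()
    for record in records:
        for attribute, value in record.items():
            for word in value.lower().split():
                centroid.add((attribute, word))
    return centroid


def servers_to_query(centroids, attribute, terms):
    """Return the servers whose centroid contains every search term."""
    wanted = {(attribute, term.lower()) for term in terms}
    return sorted(name for name, centroid in centroids.items() if wanted <= centroid)


# Hypothetical servers and records.
SERVERS = {
    "server-a": [{"title": "Social policy research", "keywords": "welfare policy"}],
    "server-b": [
        {"title": "Bridge engineering handbook", "keywords": "civil engineering"},
        {"title": "Energy policy for engineers", "keywords": "energy"},
    ],
    "server-c": [{"title": "Marine biology", "keywords": "ecology"}],
}
centroids = {name: build_centroid(records) for name, records in SERVERS.items()}

print(servers_to_query(centroids, "title", ["policy"]))
print(servers_to_query(centroids, "title", ["engineering", "policy"]))
print(servers_to_query(centroids, "title", ["genetics"]))
```

A centroid records only that each word occurs somewhere on the server, not that the words occur together in one record. In the second query, server-b is selected although no single title contains both terms, so the centroid can rule servers out but the selected servers must still be interrogated to obtain actual results.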
Cross-browsing is potentially more complicated than cross-searching. Firstly, it requires the classification of the resources in all the collections involved; therefore cross-browsing involving a collection of resources generated by Combine can only be done after they have been classified, for example by using the Matcher described above. If the resources in different collections have been classified using different schemes, cross-browsing requires a reconciliation of the schemes used, either by mapping one scheme on to the other, or by mapping both to a third scheme.
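Mapping one scheme on to another can be sketched as a lookup table combined with a longest-prefix rule, so that a detailed notation with no exact entry falls back to its nearest mapped parent. The table, notations and headings below are illustrative assumptions and are not taken from any mapping used in the project.

```python
# Hypothetical table: notation in the source scheme -> heading in the target scheme.
SCHEME_MAP = {
    "3": "Social sciences",
    "33": "Economics",
    "338": "Production and industry",
}


def map_code(code, mapping):
    """Return the target heading for the longest mapped prefix of code, or None."""
    digits = code.replace(".", "")
    for end in range(len(digits), 0, -1):
        if digits[:end] in mapping:
            return mapping[digits[:end]]
    return None


def merge_for_browsing(own_records, foreign_records, mapping):
    """Group both collections under the target scheme's headings."""
    tree = {}
    for title, heading in own_records:
        tree.setdefault(heading, []).append(title)
    for title, code in foreign_records:
        # Records whose notation cannot be mapped are kept visible, not dropped.
        heading = map_code(code, mapping) or "Unmapped"
        tree.setdefault(heading, []).append(title)
    return tree


own = [("Guide to economics", "Economics")]
foreign = [("Industry statistics", "338.9"), ("Art history", "700")]
tree = merge_for_browsing(own, foreign, SCHEME_MAP)
print(tree)
```

Mapping both schemes to a third would use the same function twice, once with a table for each source scheme.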
The two scenarios below describe services which are using several parts of the toolkit, integrated with one another and with tools developed earlier in DESIRE I. Other services which are already using part of the toolkit include DutchESS (http://www.konbib.nl/dutchess/) and Biz/ed (http://www.bized.ac.uk/). The European Link Treasury toolkit (http://www.lub.lu.se/combine/ELT/) uses Combine to harvest text and other information from resources submitted by editors and adds a Z39.50 search interface to the results.
SOSIG (the Social Sciences Information Gateway, http://www.sosig.ac.uk/) uses as its primary data source a set of IAFA templates describing information resources in the social sciences (the ‘SOSIG Internet Catalogue’), which may be searched and browsed using ROADS. SOSIG has made use of recent enhancements to the ROADS software; indeed, it has been used as a testing ground for some of these new features, such as classification under hierarchical categories (now possible in ROADS 2.3). SOSIG records can also now be cross-browsed with those on Biz/ed. This has been achieved by mapping the Dewey classifications in Biz/ed to the UDC classification used for most of the SOSIG resources.
The technology behind SOSIG is frequently upgraded in the light of new developments, to offer new services to users or to enhance existing ones. The SOSIG gateway was relaunched in February 2000, offering new features which rely on other components of the DESIRE toolkit, and others will be added in the near future.
In addition to the templates in the SOSIG Internet Catalogue, which have been prepared by a team of subject editors, there is now a much larger collection of resources (the ‘Social Science Search Engine’, http://sosig.ac.uk/harvester.html). These resources have been harvested using the Combine robot, which follows links found in the resources in the SOSIG Internet Catalogue (and other collections of pages) to other pages on the same sites. They have also been converted into IAFA templates so that the ROADS software can be used to index and search them. The Social Science Search Engine may be used for wider-ranging but less discriminating searches than the SOSIG Internet Catalogue; the user is advised that the ‘descriptive’ information returned with the search results has been extracted automatically from the text on the page.
As a further service to researchers in the social sciences, SOSIG now incorporates the Grapevine service (http://www.sosig.ac.uk/gv/). Users may consult the bulletin board on Grapevine for details of conferences and courses, or create personal accounts which enable them to register their CVs, research interests and needs, and to view information from SOSIG customised according to their preferences. Grapevine will soon include an RDF opinion server so that Grapevine users can also submit their ratings and recommendations of resources on SOSIG for other Grapevine users to see. This opinion server is in turn built on an RDF Application Programming Interface which stores the recommendations submitted.
EELS (Engineering Electronic Library, Sweden, http://eels.lub.lu.se/) is a subject gateway for resources in engineering and related areas; EELS is based at the University of Lund and is a co-operative project of the six Swedish University of Technology Libraries. EELS uses several components of the DESIRE II toolkit. The resources are catalogued by subject specialists using ROADS, although Z39.50 software is used in the user interface for searching. The Ei engineering classification system is used to classify the resources and the Ei thesaurus is used to index all records by subject.
As with SOSIG, there is a parallel collection which covers the same subject area but which has been created using the Combine robot: "All" Engineering resources on the Internet (AE, http://eels.lub.lu.se/ae/index.html). This indexes the full text of 14 important engineering Internet collections and of pages outside those collections which they link to (two levels deep). The information collected by Combine has also been used to generate lists of URLs of the most frequently cited pages and directories, and lists of URLs sorted by title and country (domain).
The Matcher developed in DESIRE II has been tested by automatically classifying the resources in AE (currently over 250,000 in number). This is done by matching terms in the Ei thesaurus against the texts of the resources and then applying various heuristic processing rules to give appropriate weightings to the results. Classification of the resources in AE will eventually permit cross-browsing of the EELS catalogue and AE. A demonstration of the classified resources in AE, based on an earlier version of the collection, may be viewed at http://www.lub.lu.se/desire/demonstration.html (see ‘Outcome of the automatic classification of engineering full-text documents’).