Project Number:

RE 4004 (RE)

Project Title:

DESIRE II - Development of a European Service for Information on Research and Education II

Deliverable Type:

(PU/LI/RP)

Deliverable Number:

(N3.3b1/2)

Contractual Date of Delivery:


Actual Date of Delivery:

Mar 2000

Title of Deliverable:

Guidelines on overcoming the problems of user interface design for large and/or distributed collection access. Guidelines for gateways on strategies to cope with the maintenance and evolution of large collections of resources

Workpackage(s) contributing to the Deliverable:

WP3

Nature of the Deliverable:

RE

Author:

Phil Cross

Contact Details:

ILRT,
8-10 Berkeley Square, Bristol, BS8 1HH
Tel.: +44(0)117-928-7113; Fax: +44(0)117-928-7112
http://www.ilrt.bris.ac.uk/

Other Authors:

Andy Powell

URL

http://www.desire.org/html/research/deliverables/D3.3/

Abstract

At present subject gateways tend to consist of no more than a few thousand records due to the manual effort required to select and catalogue Internet resources. The likely method of growth for subject gateways will be via collaborative efforts, or through the use of harvesting software, seeded from the gateways themselves. The increase in size of the database presented to the end-user (virtual or otherwise) and the ability for a single search to be passed to a number of different databases produce new problems that need to be addressed. One part of the problem is concerned with interface and usability issues; a second part is concerned with the management of such collections; and finally there are issues relating to the computer systems used to run the subject gateway services, such as the need for databases that can handle much larger collections of data. This report examines each of these issues in the light of the experiences gained by partners within the DESIRE II project.

Keywords

Browsing
Cross-browsing
Cross-searching
Information gateways
Ranking
Scalability
Searching
User interface
Usability


Distribution List:

Project, Public Usage

Issue:

1.0

Reference:

N3.3b

Total Number of Pages:

9


Document Control

Issue Number:

V0.1

Issue Date:

9/3/2000

Reason for Change:

Initial document for discussion within team








1 Introduction

At present subject gateways tend to consist of no more than a few thousand records due to the manual effort required to select and catalogue Internet resources. Even a ‘large’ subject gateway typically has only around 6,000-7,000 records. This is very small in comparison with traditional online bibliographic databases. Consequently, the problems associated with storing and retrieving large collections of bibliographic data, such as recall and precision in searches and search engine functionality, have not yet been significant.

It seems unlikely that individual subject gateways are capable of growing significantly in size given current funding models. Only directories that have limited or no quality criteria, high levels of funding or possibly voluntary effort - such as Yahoo! or the Open Directory Project - seem capable of producing manually created databases of the order of hundreds of thousands of records. Another example is the vast collaborative effort produced by OCLC for NetFirst (with approximately 120,000 records), which requires considerable funding.

The likely method of growth for subject gateways is seen instead to be via collaborative efforts. There are two approaches to building a collaborative subject gateway: the first is for a number of different organisations to contribute records to a central database; the second approach is for each organisation to maintain their own database, allowing the end-user to search across one or more of these depending on the nature of their query. In some cases a combination of both approaches may be appropriate. These methods allow a real or virtual increase in size of the collection of resources presented to the end-user.

We have also begun to see the use of harvesting software, which enables the automated indexing of Internet resources whilst retaining a degree of quality, because the seed URIs given to the robot can be chosen by the gateway. The first phase of the DESIRE project developed some harvesting tools that can be used in conjunction with the ROADS and Zebra software. Such mechanisms have the potential to create databases at least one order of magnitude larger than those of current gateways. This increase in size of the database presented to the end-user and the ability for a single search to be passed to a number of different databases produce new problems that need to be addressed.

One part of the problem is concerned with interface and usability issues. These include the presentation of large result-sets to the user; the means by which the cross-search paradigm is presented; and the ranking or filtering of any results produced. Browsing of collections of resources can also pose problems when large numbers of resources have to be contained within the browsing structure. A second part of the problem is concerned with the management of such collections: for example, there is a need for automated mechanisms for link checking and perhaps for detecting changes to sites that require updates to their descriptions. Finally there are issues relating to the computer systems used to run the subject gateway service, such as the need for databases that can handle much larger collections of data.

2 User Interface and Usability Issues

We are concerned here with two issues facing the user: searching large sets of data with the gateway’s search engine and displaying the results; and browsing through large sets of records via a particular browsing or classification scheme structure. Both issues revolve around the need for the user to be able to home in quickly on the most relevant resources.

One particular approach to dealing with the problems of cross-searching and browsing several different collections of information has been developed within the DESIRE project. This uses RDF/XML to create a ‘folders’ view of retrieved data and is detailed in N3.3b.3: Report on the scaleable development of information gateways.

2.1 Searching and displaying large collections

With a relatively small database, such as that of a typical subject gateway, the issue of precision in searching is not of great importance. This is because the user can scroll quickly through a result-set and use their own judgement as to which are the most useful records. However, as the size of the database increases, so does the average number of records retrieved, and it consequently becomes much more difficult to pick out the most relevant and useful results. This problem can be approached in two ways:

2.1.1 Mechanisms for increasing precision of searches

Below are some of the ways in which the precision of searches can be increased:

2.1.2 Displaying large result-sets

Typically, large result-sets should not be displayed on a single Web page, both because of the time taken to retrieve and display the data and because of the scrolling burden placed on the end-user. The ROADS software limits the total number of records returned by a search, but as the size of the database increases, the proportion of searches resulting in 'too many hits' will increase. In addition to reducing the number of hits returned by increasing search precision, it may also be sensible to investigate mechanisms for improving the way records are displayed. These might include:
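As an illustration of the point that a large result-set should not be placed on a single Web page, the following minimal sketch splits a list of hits into pages and applies an overall cap of the kind ROADS imposes. The function name, the cap of 200 hits and the page size of 20 are assumptions made for the example; they are not taken from ROADS.

```python
from math import ceil

MAX_HITS = 200   # hypothetical cap on the records returned by one search
PAGE_SIZE = 20   # hypothetical number of records shown per Web page


def paginate(hits, page, page_size=PAGE_SIZE, max_hits=MAX_HITS):
    """Return one page of a capped result-set plus summary information."""
    capped = hits[:max_hits]                       # enforce the overall limit
    total_pages = max(1, ceil(len(capped) / page_size))
    page = min(max(page, 1), total_pages)          # clamp to a valid page number
    start = (page - 1) * page_size
    return {
        "records": capped[start:start + page_size],
        "page": page,
        "total_pages": total_pages,
        "truncated": len(hits) > max_hits,         # lets the interface say 'too many hits'
    }
```

The `truncated` flag allows the interface to tell the user that hits were dropped and to suggest a more precise search.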

2.2 Browsing large collections (including cross-browsing)

Most subject gateways provide a browsing interface to their data in addition to a search interface. Many of the issues raised above apply equally to the browse interface. For example, as the number of records in the database grows, the lists of records presented in the browse interface are likely to become too long to be shown on a single Web page.

The browse interface is typically based upon an existing classification system with which the end-user is familiar, such as Dewey or UDC. As the database size increases, the number of records per section will also increase unless the specificity of the classification scheme is improved. Therefore, there are design decisions that need to be taken about the depth and complexity of the classification scheme used. This needs to be balanced against possible navigation problems caused by an overly complex hierarchy, which need to be taken into account in the interface design.

The number of records within each section of the browsing hierarchy may be reduced by providing options to display subsets of the records in the database, such as those from different countries. It is also possible to subdivide the records displayed by criteria such as resource type as an alternative to a simple alphabetical display.
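The two options described above, displaying a country subset and subdividing a section by resource type, might be sketched as follows. The record fields `title`, `type` and `country` are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def browse_section(records, country=None):
    """Group the records of one browse section by resource type.

    If country is given, only records from that country are kept.
    Titles within each group are sorted alphabetically, ignoring case.
    """
    selected = [r for r in records if country is None or r["country"] == country]
    groups = defaultdict(list)
    for record in selected:
        groups[record["type"]].append(record["title"])
    # Return the groups in alphabetical order of resource type.
    return {rtype: sorted(titles, key=str.lower)
            for rtype, titles in sorted(groups.items())}
```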

It is worth noting that a combination of browse and search interface may further help the end-user. This can be achieved by embedding a restricted search interface into each sub-section of the browse interface, returning results that are only found within that sub-section and those below.
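A restricted search of this kind can be sketched under the assumption that each record carries a path-like section identifier (for example `economics/labour`), so that 'this sub-section and those below' becomes a prefix test. The identifier format and the field names are assumptions for the example, not a description of any particular gateway.

```python
def search_within(records, section, term):
    """Return records matching term in the given section or any section below it."""
    section = section.rstrip("/")
    prefix = section + "/"          # the trailing slash avoids matching sibling names
    term = term.lower()
    return [
        r for r in records
        if (r["section"] == section or r["section"].startswith(prefix))
        and term in (r["title"] + " " + r["description"]).lower()
    ]
```

Appending the separator before the prefix test means that a section such as `economics-history` is not treated as lying below `economics`.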

Many of the above techniques are now being made available within the Social Science Information Gateway (SOSIG). A further technique being explored by this gateway is the ability for cataloguers to indicate that particular records are especially recommended, so that such records are highlighted and displayed at the top of the relevant browse sections.
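Placing recommended records at the top of a browse section amounts to a two-level sort. The sketch below assumes a hypothetical boolean field `recommended` set by the cataloguer; it is not a description of how SOSIG implements the feature.

```python
def order_for_browse(records):
    """Sort records so that recommended ones come first, then by title."""
    return sorted(
        records,
        # False sorts before True, so 'not recommended' puts recommended records first.
        key=lambda r: (not r.get("recommended", False), r["title"].lower()),
    )
```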

3 Administration and Management Issues

As the number of records in a subject gateway database increases, the techniques used to manage it may need to change. Manual checking of records is likely to be feasible only for a small database.

Some areas where automated checking of records may be possible are:

4 Systems Issues

Clearly, as a database grows the amount of disk space it requires will also grow. It is likely that memory and CPU power requirements will also increase. Database software that copes with 10,000 records may not cope efficiently with 100,000 records. For example, there is some evidence that the default, filesystem-based database supplied with ROADS does not cope well with databases larger than about 50,000 records. One solution is to use an alternative ‘back-end’ for record storage, such as a Z39.50 server or an SQL database.
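To make the SQL option concrete, the sketch below stores records in SQLite using Python's standard `sqlite3` module. SQLite is chosen here only because it needs no server; the table layout, column names and function names are assumptions for the example and do not reflect the ROADS record format.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a record store with an index on the browse section."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "url TEXT NOT NULL, section TEXT NOT NULL)"
    )
    # The index keeps per-section lookups fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_section ON records (section)")
    return conn


def add_records(conn, rows):
    """Insert (title, url, section) tuples in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO records (title, url, section) VALUES (?, ?, ?)", rows
        )


def titles_in_section(conn, section, limit=20, offset=0):
    """Return one page of titles for a section, in title order."""
    cursor = conn.execute(
        "SELECT title FROM records WHERE section = ? "
        "ORDER BY title LIMIT ? OFFSET ?",
        (section, limit, offset),
    )
    return [row[0] for row in cursor]
```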

There may also be performance problems associated with cross-searching large numbers of remote databases. With the whois++ protocol used by ROADS, the system has to wait for results to come back from each remote database in turn. Work is under way within the DESIRE project on parallel searching and on results interfaces that return results to the user as and when they become available (see N3.3b.3 mentioned above).
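The difference between querying each database in turn and querying them in parallel can be illustrated with a short sketch. It does not implement whois++ and is not the DESIRE implementation described in N3.3b.3; each remote database is represented by a hypothetical function that takes a query string and returns a list of hits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def cross_search(backends, query):
    """Query all backends at once, yielding (name, hits) as each one answers.

    backends maps a database name to a function taking the query string.
    A backend that raises an exception contributes an empty list of hits.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {pool.submit(search, query): name
                   for name, search in backends.items()}
        for future in as_completed(futures):   # completion order, not submission order
            name = futures[future]
            try:
                yield name, future.result()
            except Exception:
                yield name, []                 # one failing database should not stop the rest
```

Because results are yielded as they arrive, an interface built on this generator could display hits from fast databases while slower ones are still responding. The total waiting time is then governed by the slowest database rather than by the sum of all response times.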


5 References

OCLC Online Computer Library Center, Inc. CORC. http://www.netfirst.org/oclc/corc/index.htm
Yahoo! http://www.yahoo.com/
Netscape. Open Directory Project. http://dmoz.org/
OCLC Online Computer Library Center, Inc. NetFirst. http://www.oclc.org/oclc/netfirst/netfirst.htm
Social Science Information Gateway. http://www.sosig.ac.uk/

