Research: Publications
Introduction

Building a personal or company web site is now a commonplace activity, with an increasing array of commercial tools available to make the task easier and an increasing number of search facilities available to help users locate information on these sites. Mostly, though, the model for information access is based on the notion of a single provider and a single consumer. The research community has a very different model of the information service: information is drawn from a wide range of providers and distributed to a large and discerning collaborative community.

The task of creating and operating continental-scale information services brings a whole new set of challenges: assessing the quality of information; integrating information from many sources and in many languages; optimising the location of information to make best use of network bandwidth; and maintaining service quality given a widely distributed set of service components.

In this brief overview, we investigate the developing technologies which can be applied to address these needs and the extent to which they have been successfully deployed in the DESIRE project, undertaken by 23 partner organisations as part of the Telematics Application Programme supported financially by the European Commission's DG XIII. A complete list of partners and further information about the project can be found on the project web site.

Today's Information Environment

Researchers have been the prime movers in the development of the Internet: they were the first to adopt the technology and have continued to push at technological limits as they discover new possibilities for its use and encounter new problems. Researchers typically have more exacting requirements and greater awareness of technical potential than general network users. We have identified several aspects of Internet information services where such requirements and potential are unfulfilled:
Resource Discovery

All Internet users will be familiar with the public search services such as Alta Vista and Yahoo! that respectively demonstrate two approaches to finding information on the World Wide Web - exhaustive indexing by trawling every accessible on-line document and categorised listings of selected (or self-selected) material. Without such services, information on the Web would be effectively inaccessible. However, anyone who regularly uses them will be aware of the "needle-in-a-haystack" problem; they will also be aware of the time lag that can occur between information becoming available and its appearance in search results.

Researchers are professional information users. They need prompt access to current information that is relevant to their field of research and of scholarly quality. Many existing search services fall short of these demands, for a variety of different reasons:
Efficient Retrieval

Even assuming that researchers can find the information they are looking for, actually getting it can be a slow process. Available network bandwidth never exceeds demand, and considerable amounts of researchers' time can be wasted if a potentially promising document takes half an hour to retrieve, only to be found to be less promising than at first thought. It is unreasonable to expect the provider of a frequently consulted information resource to provide the large amounts of network capacity required to support its efficient retrieval by those wishing to exploit it.

Reliability

Even if in principle information can be efficiently retrieved, it is of little use if the information cannot be retrieved reliably and repeatably. An academic information service is likely to comprise a multiplicity of distributed information sources coming from different countries, institutions and disciplines; the technical infrastructure for that service may be built on disparate equipment under the control of a wide variety of system and network managers. As a consequence, a typical distributed information service will suffer from avoidable faults - broken links between information items, performance bottlenecks and intermittent failures.

Security

Researchers frequently wish to collaborate on work-in-progress away from the prying eyes of their competitors. Publishers of academic information may also wish to restrict its circulation to those who have paid to receive it. The weak password-based authentication mechanisms of the Web are not suitable for these purposes.

DESIRable Goals

The DESIRE project was created to develop and deploy techniques for improving the quality of international research information services, adopting and contributing to existing and emerging standards.
Resource Discovery

A European Web Index (EWI) is being built which, instead of collecting all the data in the Internet together at one point, allows a user to make searches simultaneously across a number of smaller, easier-to-maintain collections of data whose maintenance can be devolved to appropriate communities of interest - subject specialists, national or linguistic groups for example. Using a standard search protocol (in the case of EWI, Z39.50), it is straightforward to incorporate new information collections into the scope of the search without adding massive overhead at one point in the system. The architecture of the EWI is shown in the figure below.
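The idea of searching several devolved collections at once, rather than one central index, can be illustrated with a minimal sketch. It does not use Z39.50: the `Collection` class, its word-count scoring and the sample titles are all hypothetical stand-ins for independently maintained indexes that answer the same query interface.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    collection: str  # name of the collection that returned the record
    title: str
    score: float     # relevance score reported by that collection


class Collection:
    """Hypothetical stand-in for one independently maintained index."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # list of document titles

    def search(self, term):
        # Toy relevance: share of the title's words that match the term.
        term = term.lower()
        hits = []
        for title in self.documents:
            words = title.lower().split()
            count = words.count(term)
            if count:
                hits.append(Hit(self.name, title, count / len(words)))
        return hits


def federated_search(collections, term):
    """Send the same query to every collection in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        per_collection = list(pool.map(lambda c: c.search(term), collections))
    merged = [hit for hits in per_collection for hit in hits]
    # Highest score first; title breaks ties so the order is deterministic.
    return sorted(merged, key=lambda h: (-h.score, h.title))


collections = [
    Collection("social-science", ["Survey methods in sociology", "Election survey data"]),
    Collection("medicine", ["Orthopaedic training cases"]),
    Collection("norwegian", ["Survey"]),
]
results = federated_search(collections, "survey")
for hit in results:
    print(f"{hit.score:.2f}  {hit.collection:15s} {hit.title}")
```

Bringing a new collection into the scope of the search only means appending another object to the `collections` list; no central index has to be rebuilt.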
Categorised listings are being provided in the form of Subject-Based Information Gateways (SBIGs). One of the main components of their operation is to associate meaningful and accurate metadata with information resources. Metadata is descriptive data which provides details of the source, content and potential relevance of a document - such as the name of the author, title and publication date. If metadata is provided consistently, reliably and in a documented form, search requests can be framed in a way which will provide predictable results. There are several metadata standards in current use, so tools are needed to convert between them. Crucially, SBIGs rely on the skills of the same communities of interest that maintain EWI information to select and rate the relevance of information to specific subject areas.

Subject-based gateways have been set up in the UK and the Netherlands, mainly in the area of Social Science, using ROADS software and tools. As part of a validation exercise for the system, an orthopaedic information server has also been created to help train medical students. A network of correspondents is being recruited to develop new information gateways across Europe, and to contribute material to existing gateways.

There is much further potential. Use of standard search protocols means that each system has access to the data stored in the other. Use of standard formats for metadata means that the automatic harvesting system can collect existing metadata from information sources (or even potentially infer metadata in some cases) and pass likely candidate documents and their associated metadata to an appropriate subject expert for classification.

Efficient Retrieval

Every Web browser keeps a local cache of recently-fetched documents, and experience shows that the potential benefits increase as caching is made available systematically to larger groups of users.
However, the shifting patterns of access which occur when progressively larger user groups are involved (departments, campuses, or even entire countries) make the logistics of planning, configuring and operating caching services very complex. Decisions must be made on the topological siting of caching servers, which documents should be cached and how servers can interact with each other to back each other up (to avoid single points of failure) or to act as secondary or tertiary caches for less-frequently-accessed information which nonetheless has a high enough cost of retrieval to be worthwhile caching. National caching hierarchies have been set up in Norway and The Netherlands, and the core of a European caching network has been created from them. A representation of the national caching mesh in the Netherlands is shown in the figure below.
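How a hierarchy of caches shields the original server can be sketched in a few lines. This is a deliberately simplified model, not the configuration used in the pilots: the class names, the two-level topology and the sample URL are assumptions, and expiry, sibling queries via ICP and fail-over are left out.

```python
class Origin:
    """Stands in for the remote web server that owns the documents."""

    def __init__(self, documents):
        self.documents = documents
        self.requests = 0  # number of fetches that reached the origin

    def fetch(self, url):
        self.requests += 1
        return self.documents[url]


class Cache:
    """A cache that asks its parent (another cache or the origin) on a miss."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent
        self.store = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, url):
        if url in self.store:
            self.hits += 1
            return self.store[url]
        self.misses += 1
        body = self.parent.fetch(url)  # escalate one level up the hierarchy
        self.store[url] = body
        return body


# Hypothetical topology: two campus caches share one national cache.
origin = Origin({"/report.html": "<html>report</html>"})
national = Cache("national", origin)
campus_a = Cache("campus-a", national)
campus_b = Cache("campus-b", national)

campus_a.fetch("/report.html")  # misses at both levels, reaches the origin
campus_a.fetch("/report.html")  # served from the campus cache
campus_b.fetch("/report.html")  # campus miss, served from the national cache

print("origin requests:", origin.requests)
print("national hits/misses:", national.hits, national.misses)
```

In this model, three user requests from two campuses cost the origin a single transfer, which is the effect that grows as larger user groups share the upper levels of the hierarchy.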
The practical experiences of establishing these pilots have been documented and the results are being used to promote good practice. In the process, a number of important enhancements to the basic Internet protocols involved, HTTP (the Hypertext Transfer Protocol) and ICP (the Internet Cache Protocol), have been suggested and incorporated into operational services. Some of the recommendations that have emerged are summarised here:
Provided these guidelines are met, web traffic can be reduced by 30%-50% - depending on the cache configuration and the homogeneity of the user community. Even accounting for the cost of setting up and operating a caching mesh, the pilot system is saving the Norwegian research network around 50 000 ECU per year. If caching could be employed across the major part of their user community, savings could amount to 500 000 ECU per year. Although the main purpose of installing caches is to reduce the demand on bandwidth, a substantial benefit has been demonstrated in reducing the latency between requesting a document and its retrieval, as shown in the figure below. Users experience a sub-second response time for documents that can be returned from the cache without further checking, with substantially longer delays if the original source must be consulted. Note that a substantial part of the additional time taken is the "transaction cost" of establishing a connection to the remote server: an "if-modified-since" (IMS) check is not substantially less time-consuming than retrieving the document unconditionally.
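The three cases described above (a copy returned without checking, an IMS check, and a full retrieval) can be made concrete with a small in-memory sketch. The `OriginServer` and `ValidatingCache` classes, the 60-second freshness window and the explicit `now` timestamps are illustrative assumptions; only the 200 and 304 status codes and the if-modified-since idea come from HTTP itself.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    body: str
    last_modified: float  # modification time reported by the origin
    fetched_at: float     # when the cache last confirmed this copy


class OriginServer:
    """Hypothetical origin; every call to get() costs one connection."""

    def __init__(self):
        self.documents = {}  # url -> (body, last_modified)
        self.connections = 0

    def publish(self, url, body, now):
        self.documents[url] = (body, now)

    def get(self, url, if_modified_since=None):
        """Return (status, body, last_modified); a 304 reply carries no body."""
        self.connections += 1
        body, modified = self.documents[url]
        if if_modified_since is not None and modified <= if_modified_since:
            return 304, None, modified
        return 200, body, modified


class ValidatingCache:
    def __init__(self, origin, max_age):
        self.origin = origin
        self.max_age = max_age  # seconds a copy is served without checking
        self.entries = {}

    def fetch(self, url, now):
        entry = self.entries.get(url)
        if entry is not None and now - entry.fetched_at < self.max_age:
            return entry.body, "fresh-hit"  # no connection to the origin
        if entry is not None:
            # Stale copy: ask the origin whether it has changed (IMS check).
            status, body, modified = self.origin.get(url, entry.last_modified)
            if status == 304:
                entry.fetched_at = now
                return entry.body, "validated"
        else:
            status, body, modified = self.origin.get(url)
        self.entries[url] = Entry(body, modified, now)
        return body, "fetched"


origin = OriginServer()
origin.publish("/a", "v1", now=0)
cache = ValidatingCache(origin, max_age=60)

outcomes = [
    cache.fetch("/a", now=10),   # first request: full retrieval
    cache.fetch("/a", now=20),   # still fresh: served locally
    cache.fetch("/a", now=100),  # stale but unchanged: IMS check, 304
]
origin.publish("/a", "v2", now=150)
outcomes.append(cache.fetch("/a", now=200))  # stale and changed: full retrieval
print(outcomes, origin.connections)
```

The connection counter rises for the 304 reply just as it does for a full retrieval, which mirrors the observation that the IMS check still pays the transaction cost of contacting the remote server.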
Reliability

The tools available to manage and monitor the performance of web servers are proprietary and do not integrate well with the tools used to manage and monitor other crucial components of an on-line information system: local and wide-area networks, databases and many other components may be vital to its operation and the quality of service perceived by users. It is necessary to bring together management information from each of these components and analyse it in a way which gives operators the same perception of system performance as that experienced by users. Using the Web to transport operational information of this sort also makes it possible to diagnose (and anticipate) faults in widely distributed services.

Such a system has been built, using Java to achieve platform independence and reduce costs. The system is built in several layers. At the lowest level, management information relevant to the performance of the service is collected from agents. A classification procedure derives a quality value, using various service-level assessment criteria, which can be used to alert operators to performance problems. A history of performance indicators (stored in a relational database management system) can be retrieved and assessed in the same way to chart performance over time and assist in capacity planning. Information can be aggregated by site (a collection of management entities contributing to a specific service) or across an entire management domain (for example, a collection of web sites).

One of the major achievements has been to define performance indicators and metrics which reflect the service users' perception of quality, yet which can be derived from the management information available through standard management interfaces (such as SNMP).
In the context of information server performance, a web-server MIB has been developed and an agent produced for the Apache web server, using existing standards where possible and augmenting the information where necessary in liaison with international interest groups and with the IETF. An example of a management display is shown in the figure below.
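The classification and aggregation layers described above can be sketched as follows. The project's system was written in Java and its actual indicators and criteria are not given here, so everything in this Python sketch is an assumption made for illustration: the two indicators, the threshold values and the site names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    site: str
    response_seconds: float  # time to answer a probe request
    error_rate: float        # fraction of failed requests, 0.0 to 1.0


# Hypothetical service-level criteria: (label, max seconds, max error rate),
# ordered from the strictest level to the most lenient.
THRESHOLDS = [
    ("good", 1.0, 0.01),
    ("degraded", 5.0, 0.05),
]


def classify(response_seconds, error_rate):
    """Derive a quality value from two performance indicators."""
    for label, max_seconds, max_errors in THRESHOLDS:
        if response_seconds <= max_seconds and error_rate <= max_errors:
            return label
    return "poor"


def site_quality(samples):
    """Aggregate samples by site, then classify each site's averages."""
    by_site = {}
    for sample in samples:
        by_site.setdefault(sample.site, []).append(sample)
    return {
        site: classify(
            mean(s.response_seconds for s in group),
            mean(s.error_rate for s in group),
        )
        for site, group in by_site.items()
    }


samples = [
    Sample("gateway", 0.4, 0.0),
    Sample("gateway", 0.8, 0.01),
    Sample("cache", 2.0, 0.02),
    Sample("cache", 4.0, 0.04),
    Sample("index", 9.0, 0.2),
]
quality = site_quality(samples)
print(quality)
```

Running the same `classify` function over stored historical samples, rather than live ones, gives the over-time view used for capacity planning.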
Security

Although there have been many attempts to establish universal security infrastructures, most have foundered on mistrust or legal obstacles. Although existing weak authentication mechanisms are inadequate for anything other than casual use, universally available strong authentication may be a long time coming. For applications that require a better level of security, an architecture must be devised which allows the specific mechanism to be replaced as necessary in different environments.

For the purposes of DESIRE, it was decided to exploit the smartcards that have been widely distributed to university students in the Netherlands. Since the distribution infrastructure is already in place, the usual key-management problems with authentication systems were avoided. A system was built which would allow students access to a server containing commercial software distribution kits, provided they were licensed (through their educational institution) to download and install the packages. The system was supplied as part of a ready-configured network-access kit (providing dial-up, ISDN or cable-modem connections), packaged with a low-cost smartcard reader as shown below.
Once installed and configured, students had access to software using the "smart server" system, of which a sample screen is shown in the following figure.
The authentication protocol, jointly developed with IBM, is in principle extensible to a wide variety of smartcards, and to three-party authentication (in the case that the client does not trust the identity of the server). This will open up the possibility of a wide range of other applications. Usage of the system has, unsurprisingly, been heaviest amongst those students having cable-modem access at their student accommodation. In a one-month period in 1997, over 1.5 GB of software was downloaded by over 250 students.

Conclusions

Building information systems to serve large, international groups of sophisticated users is a complex task that cannot be achieved solely with the aid of commercially available tools. By developing (or encouraging the development of) tools based on existing standards and which help form new standards, we can pursue the twin goals of serving users and improving the level of service available to everyone.

Technology in itself is not the answer to all of our problems. It is as important to build up networks of people - users, information specialists, network operators and technologists - to ensure the cost-effectiveness, scalability and integration of services at a European level which accurately match users' needs and aspirations.

The rapid growth and fast-changing environment of the Internet have left many information providers and information users struggling to keep up. There is little point in implementing advanced systems if the information providers cannot maintain them and information consumers cannot use them effectively or are simply intimidated. Basic training in fundamental principles must be available, backed up by systems that permit the development of training materials and their rapid delivery to the user community. We are, at least, fortunate in being able to use the very information systems we are developing to deliver training to every user's desktop.
Not surprisingly, training has been an important component of the DESIRE project. Not only have training materials been developed to assist users to make the most of the services on offer, but a new system for delivering on-line training, TONIC-NG, has been created and used to help deliver a number of training workshops, principally for the developers of information gateways. A sample training module is shown in the next figure.
One of the fundamental principles of DESIRE is that the systems and techniques developed as part of the project should be generally available to the European research community. We are actively encouraging the co-operation of users and network operators across Europe to use our results where they can help improve the range and quality of information services available. A follow-on project, DESIRE II, will develop further the most encouraging results of DESIRE I and will be providing workshops, in co-operation with TERENA, open to the whole networking community where anyone who is interested will be able to get in-depth information. In the meantime, the project web site contains pointers to all the project results and public documentation.
© 1998-2000 DESIRE Consortium