Services: Web Indexing

The European Web Index

The information needs of European researchers are being met increasingly from on-line sources provided by governments, fellow researchers and commercial organisations. The types of resource now available on the Internet include data archives, reports, scholarly discussion lists, journal articles, directories and electronic books. However, the sheer quantity of information available, together with a lack of coherent organisation, has prevented many researchers from maximising the potential of the Internet in their work.

DESIRE is helping researchers navigate the increasingly dense information jungle, leading them quickly to high-quality Internet resources. Although part of the DESIRE project is building catalogues of research resources, there will always be a need to locate on-line resources which are not catalogued: resources which are too new to have been reviewed, or too transient or esoteric to be collected for general use. To search out this kind of information, researchers turn to automatic systems which trawl the network, harvesting and indexing all the information they find.

There are already some extremely comprehensive services which index a large proportion of the World Wide Web and are currently free of charge to users, but these are mainly aimed at casual searching and do not necessarily meet the more stringent needs of academic research. Existing search engines may index only part of each document they encounter (perhaps only the first few lines) and can be indiscriminate in their selection of material. The search facilities and schemes for ranking results vary from one search engine to another, and may not be specified well enough for users to be confident of the outcome. Many services have the primary goal of achieving the broadest coverage, which leads to longer and longer delays (perhaps several months) between publication and harvesting.

DESIRE is providing tools for building a service better suited to the needs of European researchers: the European Web Index (EWI). The EWI system consists of the Combine harvesting robot and a search system using the standard Z39.50 protocol. Because the search system uses a standard protocol, EWI can support a variety of user interfaces and can also be integrated with library systems. The Combine harvesting robot can easily be configured either to perform surgical indexing of particular servers or domains that provide Internet documents of research relevance, or to create medium-sized to large regional WWW indexes covering virtually everything in that region.
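The two harvesting modes described above differ mainly in how broadly the robot's scope is drawn. The following is a minimal sketch of such a scope check, not Combine's actual configuration format or code: the function names, the host names and the choice of matching on domain suffixes are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def host_in_domain(host, domain):
    """Return True if host equals domain or is a subdomain of it."""
    host = host.lower().rstrip(".")
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def in_scope(url, allowed_domains):
    """Return True if the URL's host falls under any allowed domain."""
    host = urlsplit(url).hostname
    if host is None:
        # Relative or malformed URLs carry no host and are rejected.
        return False
    return any(host_in_domain(host, d) for d in allowed_domains)


# Hypothetical "surgical" scope: a short list of specific servers.
SURGICAL_SCOPE = ["physics.example.org", "library.example.net"]

# Hypothetical "regional" scope: whole country-code domains.
REGIONAL_SCOPE = ["se", "no", "dk", "fi", "is"]
```

With these assumed scopes, `in_scope("http://www.lub.lu.se/page.html", REGIONAL_SCOPE)` accepts the page because its host ends in `.se`, while `in_scope("http://chem.example.org/", SURGICAL_SCOPE)` rejects a server that is not on the list. A real harvester would also need politeness rules, robots exclusion handling and revisit scheduling, which this sketch leaves out.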

The EWI system can take advantage of all kinds of metadata, whether embedded in today's documents (in particular metadata following the emerging Dublin Core standard) or made available by other means in the future. Close collaboration with the Information Gateways work in DESIRE will ensure a shared approach to the implementation of new technologies and, in the longer term, closer integration of EWI and the subject gateways.
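As an illustration of what embedded metadata can look like to an indexer, the sketch below collects Dublin Core elements from HTML `meta` tags, assuming the common convention of naming them with a `DC.` prefix (for example `DC.Title`). It is not the EWI or Combine implementation; the class and function names are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        # Maps a lower-cased element name (e.g. "title") to its values.
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Assumption: Dublin Core elements are marked with a "DC." prefix.
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            self.fields.setdefault(element, []).append(content)


def extract_dublin_core(html_text):
    """Return a dict of Dublin Core element names to lists of values."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fields
```

An indexer could store the extracted fields alongside the full text so that a search can be restricted to, say, titles or creators. Elements are kept as lists because a document may repeat an element, such as several creators.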

EWI is a development of the Nordic Web Index project, which covers Norway, Sweden, Denmark, Finland and Iceland. EWI is specifically intended to harvest information in different languages and to provide user interfaces which match national needs. To ensure that the system will scale to include resources from across Europe, EWI uses a distributed model in which additional resources can be deployed as needed.