Project Number:

RE 4004 (RE)

Project Title:

DESIRE II - Development of a European Service for Information on Research and Education II

Deliverable Type:

PU

Deliverable Number:

D3.6

Contractual Date of Delivery:

29th of February 2000

Actual Date of Delivery:

29th of February 2000

Title of Deliverable:

Automatic classification

Workpackage(s) contributing to the Deliverable:

WP3

Nature of the Deliverable:

PR

Author:

Traugott Koch

Contact Details:

email: traugott.koch@lub.lu.se

phone: +46 46 222 92 33

fax: +46 46 222 36 82

URL: http://www.lub.lu.se/person_tk.html

Other Authors:

Anders Ardö and Lars Noodén

URL

Overview of results

http://www.lub.lu.se/desire/DESIRE36a-overview.html

Automatic classification demonstration page

http://www.lub.lu.se/desire/demonstration.html

The construction of a robot-generated subject index

http://www.lub.lu.se/desire/DESIRE36a-WP1.html

Automatic classification of full-text HTML-documents from one specific subject area

http://www.lub.lu.se/desire/DESIRE36a-WP2.html (this document contains several pointers to figures and tables important to the context)

Abstract

Automatic methods of gathering and knowledge organization are necessary in order to improve the discovery of Internet resources. Not even a large co-operative effort can cope with the quantities and the amount of changes to the documents for a service or subject area of some size. Our EU-project, DESIRE, developed an approach, using cross-browsing and cross-searching, to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index in the same subject area.

We started exploring different methods of gathering an Engineering index from the web and creating a stable database for the project. In order to provide a subject-based browsing interface to this index, some kind of automatic classification is needed. The goal was to structure the index using the same Ei (Engineering Information Inc.) classification which is used in the quality service Engineering Electronic Library, Sweden (EELS). This will allow cross-browsing between both services. We explored and evaluated different methods of automatic classification based on this established classification system, each using different heuristics for matching, weighting and display. In co-operation with other projects we started studying options which might result from the usage of a universal classification system. After the project, the automatic classification and cross-browsing functionality will be added to the EELS service.

Keywords

Automatic classification, Harvesting, Robot-generated subject index, Subject gateways, Web resource discovery, Metadata, Engineering, Dewey Decimal Classification, Ei thesaurus and classification

Distribution List:

DESIRE Project Team; European Commission, DESIRE Public Web site

Issue:

V1.0

Reference:

deliverd36.rtf

Total Number of Pages:

11

Document Control

Issue Number

Issue Date

Reason for Change

v0.1

10/2/2000

Initial version for internal review

v0.2

14/2/2000

Revised in light of internal comments and sent to peer reviewers

v0.3

25/2/2000

Final version sent to Project Coordinator

v1.0

01/3/2000

Final version sent to Commission and posted on DESIRE site

Executive Summary

Goal

Automatic methods of gathering and knowledge organization are necessary in order to improve the discovery of Internet resources. Not even a large co-operative effort can cope with the quantities and the amount of changes to the documents for a service or subject area of some size. Our EU-project, DESIRE, developed an approach, using cross-browsing and cross-searching, to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index in the same subject area.

We started exploring different methods of gathering an Engineering index from the web and creating a stable database for the project. In order to provide a subject-based browsing interface to this index, some kind of automatic classification is needed. The goal was to structure the index using the same Ei (Engineering Information Inc.) classification which is used in the quality service Engineering Electronic Library, Sweden (EELS). This will allow cross-browsing between both services. We explored and evaluated different methods of automatic classification based on this established classification system, each using different heuristics for matching, weighting and display. In co-operation with other projects we started studying options which might result from the usage of a universal classification system. After the project, the automatic classification and cross-browsing functionality will be added to the EELS service.

Gathering

Among a number of theoretical options we tested two approaches for harvesting a robot-generated Engineering index. Starting from a number of quality collections on the Internet, these approaches are:

1. follow all links from the top-pages of these collections in three steps recursively

2. harvest a few collections completely and follow all links in two steps recursively.
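The two link-following strategies can be illustrated with a minimal sketch of depth-limited harvesting. It is not the robot used in the project: the function name `harvest`, the in-memory link graph `LINKS` and all page names are hypothetical and stand in for real fetching and link extraction.

```python
from collections import deque

def harvest(start_pages, links, max_depth):
    """Breadth-first link following, at most max_depth steps from the start pages.

    `links` is a hypothetical in-memory link graph (page -> linked pages)
    that replaces real HTTP fetching and HTML link extraction.
    """
    seen = set(start_pages)
    queue = deque((page, 0) for page in start_pages)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Toy link graph, invented for illustration only.
LINKS = {
    "top": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
}

# Approach 1: start from the top page only and follow links in three steps.
pages = harvest(["top"], LINKS, max_depth=3)
print(sorted(pages))
```

Under the same assumptions, approach 2 would pass every page of a completely harvested collection as `start_pages` and use `max_depth=2`.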

We found a surprisingly low overlap between the Engineering link collections used as start sites for the robot and in addition a low overlap between the resulting databases from the two approaches. An intellectual evaluation of the contents of both databases showed the same percentage of relevant documents (77%) for each. Both methods combined are therefore required if one wishes to achieve a more complete index. We showed that a combination of different harvesting methods and a careful and broad selection of starting sites is necessary to reach a satisfactory level of coverage in the subject index.

An advanced gathering method, that we started to test with a demonstrator, is to use a thesaurus on-the-fly during the harvesting process. This method is utilized to determine whether a page should be included and its links followed, thus allowing the harvesting depth to be adapted to the distribution of relevant material. It improved the level of relevance of the collected resources to the subject area. Such focused crawling in robot-generated subject indices has the potential to overcome several of the most disturbing weaknesses of general web search engines in completeness, frequency of updating and topical coherence of the resources.
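The idea of thesaurus filtering on-the-fly can be sketched as follows. This is only an illustration under stated assumptions: the relevance test (a simple count of matching terms), the `min_hits` parameter, the function names and the toy data are all hypothetical, and the demonstrator may decide relevance quite differently.

```python
def is_relevant(text, thesaurus_terms, min_hits=2):
    """Crude relevance test: how many thesaurus terms occur in the text?"""
    lowered = text.lower()
    hits = sum(1 for term in thesaurus_terms if term in lowered)
    return hits >= min_hits

def focused_harvest(start_pages, pages, thesaurus_terms):
    """Include a page, and follow its links, only if it passes the filter.

    `pages` maps a page id to (text, outgoing links); it is a stand-in
    for fetching and parsing real web pages.
    """
    kept, seen, stack = [], set(start_pages), list(start_pages)
    while stack:
        page = stack.pop()
        text, out_links = pages.get(page, ("", []))
        if not is_relevant(text, thesaurus_terms):
            continue  # page is not included and its links are not followed
        kept.append(page)
        for target in out_links:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return sorted(kept)

# Toy data, invented for illustration only.
PAGES = {
    "start": ("Welding and heat transfer resources", ["p1", "p2"]),
    "p1": ("Heat transfer in welding processes", ["p3"]),
    "p2": ("Holiday photos", ["p4"]),
    "p3": ("Recipes", []),
    "p4": ("Welding heat transfer tutorial", []),
}
TERMS = ["welding", "heat transfer"]

print(focused_harvest(["start"], PAGES, TERMS))
```

In this toy run the page `p4` is relevant but is never reached, because the only link to it comes from an irrelevant page. This is the trade-off of adapting the harvesting depth to the distribution of relevant material.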

For usage in the automatic classification work in this project, we created a stable database containing 132 000 English language engineering documents from the web.

Automatic classification

To provide a subject based browsing interface for the robot index we carried out an automatic classification and generated an Engineering Information Inc. (Ei) structure equivalent to the one used in the quality service EELS.

We used the Ei thesaurus which contains more than 16 000 terms, intellectually mapped to more than 800 classification categories, to accomplish a rough subject classification. Some pre-processing was applied both to the vocabulary and to the full-text of the documents.

For each document in the index database, all metadata, headings and plain text were extracted. Then each of the vocabulary terms from the thesaurus (and the captions from the classification categories) was matched against this text. If a match was found, the corresponding list of classification codes was associated with the record, with a weighting score that depends on several factors: term complexity (single word, Boolean expression or phrase), type of classification (master or optional), match location (metadata, headings or plain text) and match frequency.
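A minimal sketch of such a matching and weighting step is given below. The factors are those named above, but the numeric weights, the way they are multiplied, the class codes and all function names are assumptions made for illustration. Boolean expressions are left out to keep the sketch short.

```python
import re

# Hypothetical weights: the factors come from the report, the values do not.
LOCATION_WEIGHT = {"metadata": 4.0, "headings": 2.0, "text": 1.0}
TERM_TYPE_WEIGHT = {"word": 1.0, "phrase": 2.0}
CLASS_TYPE_WEIGHT = {"master": 1.0, "optional": 0.5}

def count_matches(term, text):
    """Count whole-word, case-insensitive occurrences of a term in a text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def match_document(doc, thesaurus):
    """Return (class code, score) pairs, one per term, location and class.

    `doc` maps a location name to its text; each thesaurus entry is
    (term, term type, [(class code, class type), ...]).
    """
    hits = []
    for term, term_type, classes in thesaurus:
        for location, text in doc.items():
            freq = count_matches(term, text)
            if freq == 0:
                continue
            for code, class_type in classes:
                score = (freq * LOCATION_WEIGHT[location]
                         * TERM_TYPE_WEIGHT[term_type]
                         * CLASS_TYPE_WEIGHT[class_type])
                hits.append((code, score))
    return hits

# Toy document and thesaurus; "C1" and "C2" are placeholder class codes.
DOC = {
    "metadata": "",
    "headings": "Heat transfer",
    "text": "Heat transfer in welding. Welding of steel.",
}
THESAURUS = [
    ("heat transfer", "phrase", [("C1", "master")]),
    ("welding", "word", [("C2", "master"), ("C1", "optional")]),
]

print(match_document(DOC, THESAURUS))
```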

In the end, all scores were summed for each class, and scores from classes directly above in the Ei hierarchy were added to the most specific classes below, following the rule of assigning the most specific classification to a document. For every document, a list of classification suggestions in decreasing order of score was generated. To decide how many classifications to assign to each record for display in a browsing system, we experimented with different heuristics for cut-off points.
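The summation and the use of the hierarchy can be sketched as follows. This is one possible reading of the step, not the project's code: the `parent_of` mapping, the placeholder class codes and the decision to keep the parent class in the list are assumptions.

```python
from collections import defaultdict

def sum_scores(hits):
    """Sum all match scores per class code."""
    totals = defaultdict(float)
    for code, score in hits:
        totals[code] += score
    return dict(totals)

def add_parent_scores(totals, parent_of):
    """Add the score of the class directly above to each scored class below it.

    `parent_of` maps a class code to its parent in the hierarchy. Whether
    the parent itself stays in the list is not stated in the report; in
    this sketch it is kept with its own score.
    """
    boosted = dict(totals)
    for code in totals:
        parent = parent_of.get(code)
        if parent in totals:
            boosted[code] += totals[parent]
    return boosted

def ranked(totals):
    """Classification suggestions in decreasing order of score."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Toy data with placeholder class codes; "C1.1" is a subclass of "C1".
HITS = [("C1", 4.0), ("C1.1", 3.0), ("C2", 2.0), ("C1", 1.0)]
PARENT_OF = {"C1.1": "C1"}

suggestions = ranked(add_parent_scores(sum_scores(HITS), PARENT_OF))
print(suggestions)
```

In the toy data the more specific class `C1.1` overtakes its parent `C1` once the parent's score is added, which is the intended effect of the rule.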

We applied several different approaches, heuristics, weighting schemes and cut-off points to improve the classification process and the resulting browsing structure (cf. the demonstration page and working paper 2). Among other actions we were forced to omit the most frequently matched very general and ambiguous single word thesaurus terms. This is one of the problems when using a vocabulary system constructed for human use in an automatic process. Fortunately, the remaining terminology still placed the documents correctly. The result of all variants was analyzed as to its distributional effect across the browsing system and to its effects on the placement and ranking of individual documents. We could prove that our different heuristics did not change the basic classification of the documents but only influenced the number and ranks of documents displayed in the browsing structure.

The best outcome of this rough classification process, our main solution, does not apply stemming, but uses an expanded stopword list and normalized weights with relative matching frequency. A cut-off threshold of 3% of the total page scores and an absolute score of 2 is used to decide how many and which documents to display in a certain class in the browsing system. The average number of classifications per page is slightly less than 5. Since we are not using stemming, 11% of the documents do not receive a classification. With stemming, 98% of the documents would be classified, but at the expense of many more erroneous classifications.
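The cut-off rule can be sketched as below, under one reading of it: a class is kept for a page only if its score reaches both 3% of the sum of all class scores of that page and the absolute value 2. The function name, the use of "greater than or equal" and the toy scores are assumptions.

```python
def apply_cutoff(class_scores, relative=0.03, absolute=2.0):
    """Keep classes that reach both the relative and the absolute threshold.

    `class_scores` maps class code -> score for one page. The relative
    threshold is taken as a share of the page's total score (an assumption).
    """
    total = sum(class_scores.values())
    return {
        code: score
        for code, score in class_scores.items()
        if score >= absolute and score >= relative * total
    }

# Toy scores with placeholder class codes. The total is 102, so 3% is 3.06.
SCORES = {"C1": 60.0, "C2": 38.0, "C3": 2.5, "C4": 1.5}
print(apply_cutoff(SCORES))
```

Here `C3` passes the absolute threshold but fails the relative one, and `C4` fails both, so only `C1` and `C2` would be displayed for this page.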

A side product of this work is a classification service for individual engineering pages using the Ei classification.

Evaluation

We carried out a couple of different evaluations of the outcomes of our automatic Ei classification.

First was an evaluation of the distribution effect which was as expected: the relative importance of the different levels of depth of the classification system and the editorial mapping of terms to the classes was closely mirrored. The distribution of documents across the main subject categories represented the topical distribution of documents on the web rather than the classification system. Thus, the structure and depth of the classification and its relationship to the size and topical content of the collection is crucial for a good result.

Next, we compared the classifications intellectually assigned to close to 1000 web pages, made independently for resources already included in the EELS service, with the automatic classification of the same pages. One reason for disagreement that we discovered is a loss of context: the automatic classification treats each page on its own, whereas the human classifiers have a contextual view of it. The comparison nevertheless shows an identical or more specific automatic classification in 57-66% of the cases. This is a comparatively good result, most probably owing to the rather good vocabulary system available, which provides a large number of terms intellectually mapped to the classes.
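The notion of an "identical or more specific" automatic classification can be made concrete with a small sketch. It simplifies the comparison to one automatic and one manual class per page and uses a hypothetical `parent_of` mapping with placeholder codes; the actual evaluation procedure is not described at this level of detail.

```python
def is_same_or_more_specific(auto_code, manual_code, parent_of):
    """True if auto_code equals manual_code or lies below it in the hierarchy."""
    code = auto_code
    while code is not None:
        if code == manual_code:
            return True
        code = parent_of.get(code)  # move one level up, None at the top
    return False

def agreement_rate(pairs, parent_of):
    """Share of (automatic, manual) pairs that are identical or more specific."""
    agreed = sum(is_same_or_more_specific(a, m, parent_of) for a, m in pairs)
    return agreed / len(pairs)

# Toy data with placeholder class codes.
PARENT_OF = {"C1.1": "C1", "C1.2": "C1"}
PAIRS = [
    ("C1", "C1"),      # identical
    ("C1.1", "C1"),    # automatic class is more specific
    ("C2", "C1"),      # different branch
    ("C1", "C1.1"),    # automatic class is less specific
]

print(agreement_rate(PAIRS, PARENT_OF))
```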

As an "ultimate" qualitative evaluation, we submitted the results of the classification to expert users. We found rather large differences in the correctness of the classification between different subject categories. The mean correctness value of about 59%, although not very useful in this case, lies within the range of agreement with the EELS classification staff (see above, 57-66%).

We identified three main causes of errors in the automatic classification related to the need to further disambiguate terminology. Two of them could be reduced considerably by future improvements to the process.

Experiments with universal classification systems

We started a co-operation with two other automatic classification projects, GERHARD and OCLC's Knowledge Organization Group, to study the effects of different linguistic approaches, other heuristics and above all of the usage of universal classification systems rather than subject specific ones. The universal classifications UDC and DDC offer several thousand engineering related categories each, as opposed to the 800 in Ei. They are, however, spread across tens of thousands of categories in the universal system.

A classification of our engineering database with UDC according to the methods from the German GERHARD project revealed a distribution of the documents across far too many classes to offer a useful browsing system. In addition, many classes do not indicate an obvious connection to engineering. We assume that the main reason for this result is the very limited vocabulary available to describe the content of a certain UDC class.

The co-operation with the Knowledge Organization group at OCLC aims at exploring the strengths and weaknesses of the Dewey Decimal Classification system (DDC) and its advanced developments. As with UDC, we will classify our engineering database with DDC, but in this case enriched with mapped terminology (LC Subject Headings and Engineering terms). With linguistic methods we have already identified high level topical keyphrases from the full-text documents in our database, which will be used for providing keyphrase browsing in our Ei classification. The DDC classification will use the extracted words and phrases instead of the complete text of the documents to carry out the classification.

In a couple of projects and co-operations we will continue to develop the methods of automatic classification and harvesting of a subject index, after the end of DESIRE.

Only a few tests will be run to improve the methods of creation of a robot-generated subject index, in the context of the EELS service with its "All" Engineering index and for a Danish subject index in Medicine. We wish to find out whether it is necessary to actively include citing references, in other words web pages that point to documents already in the database, to improve the subject coverage of the index. In addition, we intend to apply the method of thesaurus filtering on-the-fly in a full scale project.

After the end of the project, the existing EELS/"All" Engineering service will be upgraded with the most suitable solution for automatic classification. A cross-browsing feature in both directions between the quality service and the robot-generated database will be introduced as well as the option to use classification as a search filter. A keyphrase browser will be added to improve the browsing in the subject index.

We will concentrate on trying more elaborate heuristics and methods of automatic classification in order to further improve the results of the classification. Especially, we want to include the context of documents into the classification and to find methods to disambiguate the terminology to reduce classification errors. A combination with solutions from alternative approaches like clustering or Neural Network techniques could prove useful. We will compare and evaluate different solutions and work in different subject areas in order to offer good recommendations to other Internet subject services.

As far as the advanced classification methods and the use of universal classifications are concerned we will at least explore the following:

We want to find out the potential improvements of using advanced linguistic software (e.g. noun phrase extractors and morphological analyzers), especially in cases where the classification process only has the very limited vocabulary of captions in the classification system available for matching and no connection to a thesaurus.

Together with the OCLC Office of Research Knowledge Organization group we will compare the performance of their classification assignment software Scorpion to our simple methods. A next step towards more advanced methods is to extend the Dewey Decimal Classification knowledge base in different ways with Engineering vocabulary and to check out the performance of this combination between a broad universal classification system and a deep subject-specific one, adapted to a specific subject area like Engineering. Navigation support has to be developed when a broad universal classification scheme is spreading the documents across thousands of categories. Distributed usage of vocabulary systems and methods of vocabulary mapping need to be explored, among others in the NKOS effort and the EU project Renardus. Finally, we want to explore how established classification systems and thesauri need to change in order to become suitable knowledge organization tools for large distributed digital collections and services.

The work to improve the knowledge organization in the digital world has just begun.

Scope Statement

This is a final report outlining work towards the creation of a prototype service providing automatic classification of Engineering resources. Other work within this Deliverable includes a demonstrator and report for a prototype thesaurus-based interface. This report will be made available from the DESIRE Web site shortly.

Technical Summary

Software

All software and tools developed in this work package are freely available, under the GNU Open Software model, among all the other DESIRE tools.

