|
Project Number: |
RE 4004 (RE) |
|
Project Title: |
DESIRE II - Development of a European Service for Information on Research and Education II |
|
Deliverable Type: |
PU |
|
Deliverable Number: |
D3.6 |
|
Contractual Date of Delivery: |
29th of February 2000 |
|
Actual Date of Delivery: |
29th of February 2000 |
|
Title of Deliverable: |
Automatic classification |
|
Workpackage(s) contributing to the Deliverable: |
WP3 |
|
Nature of the Deliverable: |
PR |
|
Author: |
Traugott Koch |
|
Contact Details: |
email:traugott.koch@lub.lu.se phone: +46 46 222 92 33 fax: +46 46 222 36 82 |
|
Other Authors: |
Anders Ardö and Lars Noodén |
|
URL |
Overview of results: http://www.lub.lu.se/desire/DESIRE36a-overview.html; Automatic classification demonstration page: http://www.lub.lu.se/desire/demonstration.html; The construction of a robot-generated subject index: http://www.lub.lu.se/desire/DESIRE36a-WP1.html; Automatic classification of full-text HTML-documents from one specific subject area: http://www.lub.lu.se/desire/DESIRE36a-WP2.html (this document contains several pointers to figures and tables important to the context) |
|
Abstract |
Automatic methods of gathering and knowledge organization are necessary in order to improve the discovery of Internet resources. Not even a large co-operative effort can cope with the quantities and the amount of changes to the documents for a service or subject area of some size. Our EU-project, DESIRE, developed an approach, using cross-browsing and cross-searching, to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index in the same subject area. We started exploring different methods of gathering an Engineering index from the web and creating a stable database for the project. In order to provide a subject based browsing interface to this index some kind of automatic classification is needed. The goal was to structure the index using the same Ei (Engineering Information Inc.) classification which is used in the quality service Engineering Electronic Library, Sweden (EELS). This will allow cross-browsing between both services. We explored and evaluated different methods of automatic classification based on this established classification system, each using different heuristics for matching, weighting and display. In co-operation with other projects we started studying options which might result from the usage of a universal classification system. After the project, the automatic classification and cross-browsing functionality will be added to the EELS service. |
|
Keywords |
Automatic classification, Harvesting, Robot-generated subject index, Subject gateways, Web resource discovery, Metadata, Engineering, Dewey Decimal Classification, Ei thesaurus and classification |
|
Distribution List: |
DESIRE Project Team; European Commission, DESIRE Public Web site |
|
Issue: |
V1.0 |
|
Reference: |
deliverd36.rtf |
|
Total Number of Pages: |
11 |
|
Issue Number | Issue Date | Reason for Change |
v0.1 | 10/2/2000 | Initial version for internal review |
v0.2 | 14/2/2000 | Revised in light of internal comments and sent to peer reviewers |
v0.3 | 25/2/2000 | Final version sent to Project Coordinator |
v1.0 | 01/3/2000 | Final version sent to Commission and posted on DESIRE site |
Goal
Automatic methods of gathering and knowledge organization are necessary in order to improve the discovery of Internet resources. Not even a large co-operative effort can cope with the quantities and the amount of changes to the documents for a service or subject area of some size. Our EU-project, DESIRE, developed an approach, using cross-browsing and cross-searching, to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index in the same subject area.
We started exploring different methods of gathering an Engineering index from the web and creating a stable database for the project. In order to provide a subject based browsing interface to this index some kind of automatic classification is needed. The goal was to structure the index using the same Ei (Engineering Information Inc.) classification which is used in the quality service Engineering Electronic Library, Sweden (EELS). This will allow cross-browsing between both services. We explored and evaluated different methods of automatic classification based on this established classification system, each using different heuristics for matching, weighting and display. In co-operation with other projects we started studying options which might result from the usage of a universal classification system. After the project, the automatic classification and cross-browsing functionality will be added to the EELS service.
Gathering
Among a number of theoretical options we tested two approaches for harvesting a robot-generated Engineering index. Starting from a number of quality collections on the Internet, these approaches are:
1. Follow all links from the top-pages of these collections in three steps recursively.
2. Harvest a few collections completely and follow all links in two steps recursively.
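The two approaches differ only in their starting set and in the number of link steps followed. A minimal sketch in Python, assuming a toy in-memory link graph instead of live web pages; the page names, the graph and the function name `harvest` are hypothetical and are not taken from the project's robot software:

```python
from collections import deque

def harvest(link_graph, start_pages, max_depth):
    """Collect every page reachable within max_depth link steps of the start pages."""
    seen = set(start_pages)
    queue = deque((page, 0) for page in start_pages)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in link_graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Hypothetical link graph: "top" is the top page of a quality collection.
LINKS = {"top": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"]}

# Approach 1: start from the top page only, follow links in three steps.
approach_1 = harvest(LINKS, ["top"], max_depth=3)

# Approach 2: assume pages "top", "a" and "c" make up the completely
# harvested collection, then follow links in two steps from all of them.
approach_2 = harvest(LINKS, ["top", "a", "c"], max_depth=2)

overlap = approach_1 & approach_2
```

In this toy graph the second approach reaches one page that the first misses, which illustrates why the two resulting databases need not coincide.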
We found a surprisingly low overlap between the Engineering link collections used as start sites for the robot and in addition a low overlap between the resulting databases from the two approaches. An intellectual evaluation of the contents of both databases showed the same percentage of relevant documents (77%) for each. Both methods combined are therefore required if one wishes to achieve a more complete index. We showed that a combination of different harvesting methods and a careful and broad selection of starting sites is necessary to reach a satisfactory level of coverage in the subject index.
A more advanced gathering method, which we started to test with a demonstrator, is to use a thesaurus on-the-fly during the harvesting process. The thesaurus is used to decide whether a page should be included and its links followed, so that the harvesting depth adapts to the distribution of relevant material. This improved the relevance of the collected resources to the subject area. Such focused crawling in robot-generated subject indices has the potential to overcome several of the most disturbing weaknesses of general web search engines in completeness, frequency of updating and topical coherence of the resources.
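The thesaurus-guided decision can be sketched as follows. This is only an illustration under stated assumptions: the pages, links and terms are invented, relevance is reduced to a simple substring test with a hypothetical `min_hits` threshold, and the function `focused_harvest` is not the demonstrator's actual code.

```python
from collections import deque

def focused_harvest(pages, links, start, thesaurus_terms, min_hits=1):
    """Include a page, and follow its links, only if it matches enough thesaurus terms."""
    accepted = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text = pages.get(url, "").lower()
        # Simplification: plain substring matching, no pre-processing of the text.
        hits = sum(1 for term in thesaurus_terms if term in text)
        if hits < min_hits:
            continue  # page judged irrelevant: excluded, and its links are not followed
        accepted.append(url)
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return accepted

# Hypothetical pages and links.
PAGES = {
    "start": "Index of heat transfer and fluid mechanics resources",
    "p1": "Lecture notes on heat transfer in turbines",
    "p2": "Holiday photos from the coast",
    "p3": "Fluid mechanics tutorial",
    "p4": "More holiday photos",
}
LINKS = {"start": ["p1", "p2"], "p1": ["p3"], "p2": ["p4"]}
TERMS = ["heat transfer", "fluid mechanics"]

accepted = focused_harvest(PAGES, LINKS, "start", TERMS)
```

Because the irrelevant page `p2` is rejected, its link to `p4` is never followed, while the relevant branch through `p1` is harvested one step deeper.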
For usage in the automatic classification work in this project, we created a stable database containing 132 000 English language engineering documents from the web.
Automatic classification
To provide a subject based browsing interface for the robot index we carried out an automatic classification and generated an Engineering Information Inc. (Ei) structure equivalent to the one used in the quality service EELS.
We used the Ei thesaurus which contains more than 16 000 terms, intellectually mapped to more than 800 classification categories, to accomplish a rough subject classification. Some pre-processing was applied both to the vocabulary and to the full-text of the documents.
For each document in the index database all metadata, headings and plain text were extracted. Then each of the vocabulary terms from the thesaurus (and the captions from the classification categories) was matched against this text. If a match was found, the corresponding list of classification codes was associated with the record, with a weighting score that depends on several factors: term complexity (single word, Boolean expression or phrase), type of classification (master or optional), match location (metadata, headings or plain text) and match frequency.
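A minimal sketch of such a matching and weighting step is given here. The report names the weighting factors but not their values, so all weights, the multiplicative combination, the thesaurus entries and the class codes in this example are assumptions made for illustration (the codes are invented, not real Ei codes), and Boolean expression terms are left out for brevity.

```python
import re

# Hypothetical weights: the factors come from the report, the numbers do not.
LOCATION_WEIGHT = {"metadata": 4.0, "headings": 2.0, "text": 1.0}
TERM_TYPE_WEIGHT = {"phrase": 3.0, "single": 1.0}
CLASS_TYPE_WEIGHT = {"master": 1.0, "optional": 0.5}

def count_matches(term, text):
    """Count case-insensitive whole-word occurrences of a term in a text."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

def score_document(fields, thesaurus):
    """Return {class_code: score} for one document split into location fields."""
    scores = {}
    for entry in thesaurus:
        for location, text in fields.items():
            freq = count_matches(entry["term"], text)
            if freq == 0:
                continue
            for code, class_type in entry["classes"]:
                weight = (LOCATION_WEIGHT[location]
                          * TERM_TYPE_WEIGHT[entry["type"]]
                          * CLASS_TYPE_WEIGHT[class_type])
                scores[code] = scores.get(code, 0.0) + weight * freq
    return scores

# Invented thesaurus entries mapped to invented class codes.
THESAURUS = [
    {"term": "heat transfer", "type": "phrase",
     "classes": [("100.1", "master")]},
    {"term": "turbines", "type": "single",
     "classes": [("200.3", "master"), ("100.1", "optional")]},
]
DOCUMENT = {
    "metadata": "heat transfer",
    "headings": "Heat transfer in turbines",
    "text": "Turbines convert energy. Heat transfer limits turbines.",
}

scores = score_document(DOCUMENT, THESAURUS)
```

With these assumed weights, a phrase found in the metadata contributes far more than a single word found in the plain text, which is the general intent of the factors listed above.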
In the end all scores were summed for each class, and scores from classes directly above in the Ei hierarchy were added to the most specific classes below, following the rule of assigning the most specific classification to a document. For every document a list of classification suggestions was generated in decreasing order of score. To decide how many classifications to assign to each record for display in a browsing system, we experimented with different heuristics for cut-off points.
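One possible reading of this step, sketched in Python: the score of a parent class is added to each of its scored child classes, and the classes are then ranked. Whether the parent keeps its own entry in the list is not stated in this report, so retaining it here is an assumption, as are the class codes and the `parent_of` mapping.

```python
def propagate_and_rank(scores, parent_of):
    """Add each parent's score to its scored children, then rank by descending score."""
    boosted = dict(scores)
    for code, score in scores.items():
        parent = parent_of.get(code)
        if parent in scores:
            boosted[code] = score + scores[parent]
    # Ties are broken by class code so the order is deterministic.
    return sorted(boosted.items(), key=lambda item: (-item[1], item[0]))

# Invented summed scores for one document and an invented hierarchy.
summed = {"100": 5.0, "100.1": 4.0, "200.3": 6.0}
parent_of = {"100.1": "100", "200.3": "200"}

ranking = propagate_and_rank(summed, parent_of)
```

The specific class `100.1` overtakes `200.3` because it inherits the evidence collected for its parent class `100`.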
We applied several different approaches, heuristics, weighting schemes and cut-off points to improve the classification process and the resulting browsing structure (cf. the demonstration page and working paper 2). Among other actions we were forced to omit the most frequently matched very general and ambiguous single word thesaurus terms. This is one of the problems when using a vocabulary system constructed for human use in an automatic process. Fortunately, the remaining terminology still placed the documents correctly. The result of all variants was analyzed as to its distributional effect across the browsing system and to its effects on the placement and ranking of individual documents. We could prove that our different heuristics did not change the basic classification of the documents but only influenced the number and ranks of documents displayed in the browsing structure.
The best outcome of this rough classification process, our main solution, does not apply stemming, but uses an expanded stopword list and normalized weights with relative matching frequency. A cut-off threshold of 3% of the total page scores and an absolute score of 2 is used to decide how many and which documents to display in a certain class in the browsing system. The average number of classifications per page is slightly less than 5. Since we are not using stemming, 11% of the documents do not receive a classification. With stemming, 98% of the documents would be classified, but at the expense of many more erroneous classifications.
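The cut-off rule can be illustrated with a short sketch. It assumes that a class is displayed only if its score reaches both thresholds (at least 3% of the sum of the page's scores and at least 2 in absolute terms); the report does not spell out how the two conditions are combined, and the scores and class codes in this example are invented.

```python
def apply_cutoff(scores, relative=0.03, absolute=2.0):
    """Keep classes whose score meets both the relative and the absolute threshold."""
    total = sum(scores.values())
    return {
        code: score
        for code, score in scores.items()
        if score >= absolute and score >= relative * total
    }

# Invented class scores for one page; they sum to 100.
page_scores = {"100.1": 60.0, "200.3": 36.0, "300": 2.5, "400": 1.5}

kept = apply_cutoff(page_scores)
```

Class `300` passes the absolute threshold but falls below 3% of the page total, and class `400` fails both, so only the two dominant classes are kept.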
A side product of this work is a classification service for individual engineering pages using the Ei classification.
Evaluation
We carried out several different evaluations of the outcomes of our automatic Ei classification.
The first evaluation examined the distribution effect, which was as expected: the relative importance of the different levels of depth of the classification system and the editorial mapping of terms to the classes was closely mirrored. The distribution of documents across the main subject categories represented the topical distribution of documents on the web rather than the classification system. Thus, the structure and depth of the classification and its relationship to the size and topical content of the collection are crucial for a good result.
Next, we compared the intellectually assigned classifications of close to 1000 web pages, made independently for resources already included in the EELS service, with the automatic classifications of the same pages. One reason for disagreement was lost context: the automatic classification treats each page in isolation, whereas the human classifiers view it in its context. The comparison nevertheless shows an identical or more specific automatic classification in 57-66% of the cases. This is a comparatively good result, most probably due to the rather good vocabulary system available, which provides many terms intellectually mapped to the classes.
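How such an agreement figure could be computed is sketched below. The sketch assumes that class codes are hierarchical strings in which a more specific class extends the code of its parent, so that a prefix test captures "identical or more specific"; this encoding, the page identifiers and the codes are assumptions for illustration and do not reproduce the evaluation data.

```python
def agreement_rate(manual, automatic):
    """Share of pages where some automatic class equals or refines the manual class."""
    agree = 0
    for page, manual_code in manual.items():
        auto_codes = automatic.get(page, [])
        # Prefix test: an identical code or a more specific code below it counts.
        if any(code.startswith(manual_code) for code in auto_codes):
            agree += 1
    return agree / len(manual)

# Invented manual and automatic classifications for four pages.
manual = {"p1": "641", "p2": "722", "p3": "901", "p4": "531"}
automatic = {"p1": ["641.2"], "p2": ["722"], "p3": ["402"], "p4": []}

rate = agreement_rate(manual, automatic)
```

Here `p1` counts as agreement through a more specific class and `p2` through an identical one, while `p3` (different class) and `p4` (no automatic classification) count as disagreement.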
As an "ultimate" qualitative evaluation, we submitted the results of the classification to expert users. We found rather large differences in the correctness of the classification between different subject categories. The mean correctness value of about 59%, although not very useful in this case, lies within the range of agreement with the EELS classification staff (see above, 57-66%).
We identified three main causes of errors in the automatic classification related to the need to further disambiguate terminology. Two of them could be reduced considerably by future improvements to the process.
Experiments with universal classification systems
We started a co-operation with two other automatic classification projects, GERHARD and OCLC's Knowledge Organization Group, to study the effects of different linguistic approaches, other heuristics and above all of the usage of universal classification systems rather than subject specific ones. The universal classifications UDC and DDC offer several thousand engineering related categories each, as opposed to the 800 in Ei. They are, however, spread across tens of thousands of categories in the universal system.
A classification of our engineering database with UDC according to the methods from the German GERHARD project revealed a distribution of the documents across far too many classes to offer a useful browsing system. In addition, many classes do not indicate an obvious connection to engineering. We assume that the main reason for this result is the very limited vocabulary available to describe the content of a certain UDC class.
The co-operation with the Knowledge Organization Group at OCLC aims at exploring the strengths and weaknesses of the Dewey Decimal Classification system (DDC) and its advanced developments. As with UDC, we will classify our engineering database with DDC, but in this case enriched with mapped terminology (LC Subject Headings and Engineering terms). Using linguistic methods, we have already identified high-level topical keyphrases from the full-text documents in our database, which will be used to provide keyphrase browsing in our Ei classification. The DDC classification will use the extracted words and phrases instead of the complete text of the documents to carry out the classification.
The work to improve the knowledge organization in the digital world has just begun.
This is a final report outlining work towards the creation of a prototype service providing automatic classification of Engineering resources. Other work within this Deliverable includes a demonstrator and report for a prototype thesaurus-based interface. This report will be made available from the DESIRE Web site shortly.
Software
All software and tools developed in this work package are freely available, under the GNU Open Software model, among all the other DESIRE tools.