|
Project Number: |
RE 4004 (RE) |
||
|
Project Title: |
DESIRE II - Development of a European Service for Information on Research and Education II |
||
|
Deliverable Number: |
D3.6a |
||
|
Deliverable Title: |
Automatic classification |
||
|
Deliverable Type: |
PU |
||
|
Deliverable Kind: |
PR |
||
|
Principal Reviewer: |
Name |
Kjell Jansson |
|
|
|
Address |
KTHB SE-100 44 Stockholm Sweden |
|
|
|
|
kj@lib.kth.se |
|
|
|
Telephone |
+46 8 790 8971 |
|
|
|
Fax |
+46 8 790 8954 |
|
|
|
Credentials |
PhD in Physics 1977. Information specialist at the Royal Institute of Technology Library (KTHB) from 1980, online searching – SDI – teaching. Co-ordinator of EELS http://eels.lub.lu.se/ since its start 1994. |
|
|
Summary: |
Relevant |
5 (1 = poor, 5 = excellent) Very relevant. The authors have explored this issue in great detail and with several methods. |
|
|
|
State-of-Art |
5 The authors have several international co-operations in this field with leading organizations. |
|
|
|
Meets Objectives |
4 Many applications of the methods are indicated. However, there are some criticisms regarding the integration with EELS; see below. |
|
|
|
Clarity |
4 Good clarity - sometimes some ambiguity regarding what has been done and what is to be done. |
|
|
|
Value to Users |
5 Users of the Internet still need something better than the search engines and the SBIGs that exist today. A good automatic classification should be of great value, and this work is surely a good contribution to that. Libraries are also showing a growing interest in including automated classification procedures. |
|
|
Specific Criticisms |
1 |
It would be useful to have links to some text explaining the concepts used in the demonstrator http://mother.lub.lu.se/anders/cl.html and in the result page shown after an automatic classification has been performed: for example, Porter's algorithm, the @ notation, and perhaps briefly how the weight is calculated. |
|
|
|
2 |
One of the goals is to get a browsable list joined with the EELS service http://eels.lub.lu.se/ . However, even if the precision achieved seems good compared with what others have reached, one should aim for much higher precision in this application. This will mean lower recall, of course, but that should not really be a problem. On the contrary, very long browsing lists with thousands of links would probably confuse users, and if many false hits also appear in a long list, users might react somewhat negatively. Some ideas to increase precision:
- Avoid counting hits on single words in plain text. They seem to introduce false hits even though they get a low weight, at least in the material on Ei class 931.3, which was also examined although it is not published in working paper 2. One way to keep them might be to raise the cut-off level considerably. A few hits on single words in plain text often mean that the document deals mainly with other subjects. Would it be possible to use cut-off values that depend on the type of match (single words, combinations and phrases)? At present they only have different weights.
- Avoid using acronyms, especially if they come from the UF field in the Ei Thesaurus. Checking for capital letters is not sufficient, since headings and titles, for example, are often in capitals.
- Would it be possible to use mutual citing between important web sites in the selection/weighting process? Scientific groups working in the same field often have links that point to each other ("clustering"). |
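The question about type-dependent cut-off values could be sketched as follows. This is only an illustration under stated assumptions: the match-type names, the threshold values and the function `classes_passing` are hypothetical and are not taken from the DESIRE II software.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One match: (class code, match type, weight). The type names are assumptions.
Match = Tuple[str, str, float]

# Hypothetical thresholds: single words need far more evidence than phrases.
TYPE_CUTOFFS: Dict[str, float] = {"single": 10.0, "combination": 4.0, "phrase": 2.0}


def classes_passing(matches: Iterable[Match],
                    cutoffs: Dict[str, float] = TYPE_CUTOFFS) -> List[str]:
    """Return class codes for which at least one match type reaches its own cut-off."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for code, match_type, weight in matches:
        # Sum the weights separately for every (class, match type) pair.
        totals[(code, match_type)] += weight
    passing = {code for (code, match_type), total in totals.items()
               if total >= cutoffs[match_type]}
    return sorted(passing)
```

With these illustrative thresholds, a document with only a few low-weight single-word hits for a class would not be assigned to it, while a single phrase match could be enough.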
|
|
|
3 |
The evaluation by subject specialists of some 200 sites per classification, drawn from the database collected by the robot for the study, is very interesting. The result showed large differences between the classes 903, 801.2 and 412. There may be many reasons for this, perhaps the subject itself and the corresponding Ei classification. To avoid an unnecessary source of systematic error, the selection of these 200 sites should have been more random. For the unpublished 931.3 I noticed that the material often appeared to have been collected in groups from the same servers. In this way some features had a large impact on this limited material. |
|
|
|
4 |
_ |
|
|
Developer Response: |
1 |
To both pages mentioned we have added explanatory texts and links to the relevant parts of the documentation in working paper 2. In the classification demonstrator pages the effects of changing the default parameters are now described and Porter's algorithm is characterised. The result pages are now easier to understand and carry a short description of the weighting scores and the cut-off method with a link to the documentation. Done. |
|
|
|
2 |
The reviewer is right to express a preference for very high "precision" (compared with the subject in a certain class) among the documents displayed in a real end-user service like EELS, even at the expense of high recall. We imagine that a service provider using our methods will weigh many practical considerations related to the document collection, the target audience and other conditions. Our methodology makes it perfectly possible to increase precision or to increase recall, both in a controlled way. We decided upon our general default solution for the DESIRE II project after checking the outcome of applying different weighting algorithms and cut-off points. In the demonstrator page we display (and analyse) a more precision-oriented solution (with a relative cut-off of 9% and an absolute cut-off score of 5). When we apply our methods to the EELS service in the future, we will support the service providers' preference for very high relevance of the displayed pages by choosing high cut-off levels. In addition, we have indicated that we intend to use the key phrases identified with OCLC's software (cf. demonstration page) to offer high-precision access to the many documents on every level of the classification structure through a key phrase browser. Other features that work towards this goal are the ranked order of documents in every class and the browser option to keyword search the complete class. As proposed by the reviewer, we could use the web links (citing references) between documents in our database to cluster documents, or to upgrade the ranking of documents that are frequently cited or associated with frequently cited documents. It must be made clear, however, that this feature belongs to citation indexing and/or clustering and not to topical classification. The first idea referred to by the reviewer, to omit hits on single words in the plain text, seems less useful to us. 
We will evaluate even lower weights for single-word hits, but the fact that 80% of the matches are of this type indicates the danger in doing so. We will try, in cooperation with OCLC, to apply methods of linguistic processing and to evaluate the effect of using key terms and phrases as the basis of classification rather than all terms in the document (as described in working paper 2). Regarding the second idea, we do not think that acronyms in general are more prone to be homonyms than ordinary words. Apart from choosing stricter cut-offs and offering the ranking, clustering and key phrase browsing features mentioned above, we believe that tackling the context and homonym problems discovered in our analysis of the expert evaluations is the most promising way forward when it comes to further improvements of relevance and precision. Some of these efforts probably need to be tailor-made for every real document collection, service and subject because of the characteristics of the specific vocabulary involved. |
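As a rough illustration of how a relative and an absolute cut-off can be combined, consider the minimal sketch below. It assumes that the relative cut-off is measured as a class's share of the document's total score; the response does not define it, so this interpretation, the function name `select_classes` and the example scores are all hypothetical. Only the default values (9% and 5) come from the text.

```python
from typing import Dict, List, Tuple


def select_classes(scores: Dict[str, float],
                   relative_cutoff: float = 0.09,
                   absolute_cutoff: float = 5.0) -> List[Tuple[str, float]]:
    """Keep classes that pass both cut-offs, ranked by descending score.

    Assumption: the relative cut-off is the class score divided by the
    sum of all class scores for the document.
    """
    total = sum(scores.values())
    if total <= 0:
        return []
    kept = [(code, score) for code, score in scores.items()
            if score >= absolute_cutoff and score / total >= relative_cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Example with made-up scores for one document.
example = {"931.3": 40.0, "801.2": 6.0, "412": 4.0, "903": 50.0}
print(select_classes(example))
```

Raising either parameter removes weakly supported classes first, which is the controlled trade of recall for precision that the response describes.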
|
|
|
3 |
This remark concerns the process of expert evaluation of the outcome of our classification methods. We are convinced that our process resulted in the selection of random pages for evaluation and that no specific kinds of pages or sites are overrepresented. All pages were treated equally, i.e. site information was not used at all when extracting pages. We used the following procedure to extract a random sample for evaluation. First, all pages with the chosen classification were extracted into a list. From this list entries were chosen randomly (using a random number generator). The chosen entry was then deleted from the list. Each entry was submitted to two tests before it was actually accepted as a member of the evaluation selection. These tests were designed to minimize the impact of the dynamic nature of the web, which renders our static test database progressively out of date. The first test checked that the page still exists at the web site. The second test checked that our classification algorithm still classifies this page with the chosen classification code. Only pages passing these two tests are actually included in the random sample. This procedure might favour large stable servers. We have a total of 6960 pages from 2103 sites classified as 931.3. In the test sample we have 200 pages from 127 sites. These numbers do not indicate that we have any serious over-representation of sites in the random selection. However, it is hard to say without a detailed investigation of the site distributions. One possible reason for the varying levels of classification correctness, which we did not discuss in working paper 2, is the behaviour of the evaluators. Initial checks show that some evaluators, despite rather clear written instructions, consciously or not, paid attention to the usefulness and quality of the page and its contribution to the body of knowledge in the subject area. 
For example, in the subject area of "concrete", pages concerning the construction of concrete canoes or concrete piping were judged to be erroneously classified by the evaluator, probably for the reasons mentioned. When we apply our methods to the EELS service (and other application areas) we will tune the heuristics with further expert evaluations. We will try to make sure that the evaluations are carried out more consistently than this time. This might improve the measure of the quality of our classification effort. The improvement of our methods and heuristics, however, is guided by the real cases of, and reasons for, erroneous classification that we discover and that we reported on in working paper 2. |
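The sampling procedure described in this response (random draw, deletion from the list, two acceptance tests) can be sketched roughly as below. The function `draw_sample` and the two callables `still_exists` and `still_classified` are hypothetical stand-ins for the project's own checks, which are not shown in this document.

```python
import random
from typing import Callable, List, Optional


def draw_sample(candidates: List[str],
                still_exists: Callable[[str], bool],
                still_classified: Callable[[str], bool],
                sample_size: int = 200,
                rng: Optional[random.Random] = None) -> List[str]:
    """Draw pages at random without replacement, keeping those that pass both tests."""
    rng = rng or random.Random()
    pool = list(candidates)          # work on a copy of the extracted list
    sample: List[str] = []
    while pool and len(sample) < sample_size:
        index = rng.randrange(len(pool))
        page = pool.pop(index)       # the chosen entry is deleted from the list
        # Test 1: the page still exists. Test 2: it still gets the chosen class code.
        if still_exists(page) and still_classified(page):
            sample.append(page)
    return sample
```

Because rejected pages are discarded rather than redrawn, the sample can only contain pages that still pass both tests, which is where a bias towards large stable servers could enter. A per-site tally of the returned URLs would be one way to start the site-distribution check mentioned above.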
|
|
|
4 |
Clarifications regarding all criticisms above have been added to the documentation. Done. Answer to the remark under "Clarity" ("… sometimes some ambiguity regarding what has been done and what is to be done"): We changed a couple of ambiguous formulations in the documentation. Done. |
|