Project Number: | RE 4004 (RE) | ||
Project Title: | DESIRE II - Development of a European Service for Information on Research and Education II | ||
Deliverable Number: | D3.7 | ||
Deliverable Title: | Upgraded harvesting system including multiple retrieval protocols | ||
Deliverable Type: | PU | ||
Deliverable Kind: | TO | ||
Principal Reviewer: | Name | David Beckett | |
Address | Computing Laboratory, University of Kent at Canterbury, Canterbury, Kent, CT2 7NF, UK | ||
Email | D.J.Beckett@ukc.ac.uk | ||
Telephone | +44 1227 823552 | ||
Fax | +44 1227 762811 | ||
Credentials | David Beckett has been working at the University of Kent at Canterbury since 1990 and has published research on Internet, Web and metadata technologies. He has participated in the Dublin Core workshops and International Web conferences since 1995 and has developed several web systems using metadata. David currently works on the UK Mirror Service project (http://www.mirror.ac.uk/), which is a large distributed service providing replicated content for UK Higher Education. He developed the collections system and the search systems, both implemented using Dublin Core in RDF/XML format, indexed by Zebra (Z39.50) with a qualified DC query interface. David has also operated a web crawler/indexing service for UK academic web sites using Harvest technologies since 1998. | ||
Other Reviewers: | (if relevant) | ||
Summary: | Relevant | 4 The work package is pre-release, but despite the installation problems (see attached detailed report) the harvesting system works reasonably well. This score and the others represent the current state of the Combine software as a pre-release version. | |
State-of-Art | 4 Combine uses a flexible model for web crawling that has not been used elsewhere and is a good, clean design. However, for larger web crawling systems, the work of Larry Page and Sergey Brin in the MIDAS project at Stanford, now commercialised as the Google search system, is more representative of the state of the art. | ||
Meets Objectives | 3 [Assuming the objectives are those described on http://www.lub.lu.se/desire/desireIIindex.html] There is flexibility in the software to provide enhanced support for metadata gathering, but this does not seem to be exploited at present, for example in web pages containing Dublin Core metadata. This project aims to cover harvesting, indexing and searching, but the current pre-release contains little help with integrating the harvesting with the indexing and searching side of the work. This reviewer managed to use experience with Zebra and the information contained on http://www.lub.lu.se/combine/ to create a searchable index of the harvested results, with searching via a Z39.50 gateway. The current harvester only works for the HTTP protocol, whereas the aims indicate that it should handle others such as FTP and NNTP. | ||
Clarity | 4 The system is pre-release, but the system model is clean and simple to understand. There are problems with the installation, and the user documentation has errors and omissions and is out of date; for example, the Technical Guide is for V1.0 of the system. | ||
Value to Users | 3 The system will be a valuable and useful web harvesting system when the criticisms listed here are addressed, and the indexing and searching parts completed. | ||
Specific Criticisms | 1 | The user and technical documentation needs updating, correcting and improving. | |
2 | The harvesting does not seem to make use of existing Dublin Core embedded web metadata. It should also provide support for RDF/XML embedded metadata in web pages. | ||
3 | The web crawling may not be scalable beyond a few million web pages due to the centralised job control mechanism. | ||
4 | There is some information on indexing the pages with Zebra but searching is suggested to be done via the Europagate Z39.50 gateway which is not practical for a real system. | ||
5 | Searching via Europagate does not exploit all the richness of the harvested pages in the searching and results: a) there should be suggestions on how to improve the indexing using the suggested Zebra filter files; b) the gateway-ed search results should return URLs to the original web pages. | ||
6 | Harvesting of non-HTTP protocol URLs is not implemented. | ||
Developer Response: | 1 | David's comments on the documentation are fair; work is under way to revise the documentation. | |
2 | The harvesting system does store embedded meta tags in the harvesting database. We use the search engine's input filter to load these into the appropriate search fields. This is a documentation flaw, not a problem with the software. Extraction of metainformation from XML tagging has to be customized for each application, so general XML extraction software that needs no configuration cannot be produced. What could be done to meet this criticism is a "tag-pattern matching configuration file" which permits users to configure start and end tags and where in the resulting record their content should go. This idea had not occurred to us, and we will try to implement it at a later stage. RDF is not implemented for two reasons: (1) the World Wide Web Consortium has not released RDF Schema as a recommendation, and (2) the Dublin Core Metadata Initiative (DCMI) has issued neither recommendations on how DC should be encoded in RDF nor a set of interoperability qualifiers. Therefore we are not yet prepared to support RDF metadata. | ||
3 | It is correct that the scheduling software has not been tested beyond 5 million records. The scheduler has two algorithms that are frequently used: round-robin and slightly-sorted. The first algorithm implies that the scheduler will visit the first host which is visitable given the time limit (configurable, with a default of 60 seconds). Slightly-sorted means that there is a preference for large sites. Both algorithms have their pros and cons. In our experience, the size distribution of WWW servers is highly skewed. As a rule of thumb, one may say that 10% of the servers carry about 90% of the documents. As a consequence, round-robin becomes painfully slow when only a few big sites are left in the queue. This problem is addressed by the slightly-sorted algorithm, which has the problem that the loading of millions of URLs can take a lot of time. What we do in large jobs is to use round-robin while ensuring that fresh URLs are loaded into the system at regular intervals. We seldom exceed a processing speed of 10,000 documents per hour when using the default latency period of 60 seconds on a single-machine installation. But we have not found the scheduler to be the rate-limiting step. The bottleneck seems rather to be that harvesters idle while waiting for responses from remote servers and, occasionally, that the parsers are not always able to keep pace with the rest of the system. The Combine harvesting system can be configured to run on a cluster of machines, and in such a cluster it would be possible to have more than one scheduler, provided that software is written to split the input of new URLs by server domain into mutually exclusive streams, one for each scheduler in the cluster. We do not, however, have the hardware needed to test such a setup. | ||
4-5 | We do not use Europagate for searching; rather, we use the Zebril gateway, which is much more flexible. We have also successfully used the Zap gateway. The reason that Europagate is still mentioned is that the text hasn't been updated since Zebril became free (GPL) software. The vendor has not yet prepared a release, so while we are waiting we need to point to something. These two points are more related to documentation than anything else. | ||
6 | In preparation. | ||
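
The round-robin behaviour described in Developer Response 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Combine's scheduler: the function name `simulate_round_robin`, the single-harvester model and the fixed one-second fetch time are hypothetical; only the per-host latency period (default 60 seconds) and the "first visitable host" rule are taken from the response.

```python
from collections import deque


def simulate_round_robin(urls_by_host, latency=60.0, fetch_time=1.0):
    """Simulate one harvester; return (finish_time, idle_time) in seconds.

    urls_by_host maps a host name to the list of URLs queued for it.
    A host may not be visited again until `latency` seconds after the
    previous fetch from it has finished.
    """
    queue = deque((host, deque(urls)) for host, urls in urls_by_host.items() if urls)
    next_allowed = {host: 0.0 for host, _ in queue}
    now = 0.0
    idle = 0.0
    while queue:
        # Rotate through the queue looking for the first host visitable now.
        for _ in range(len(queue)):
            if next_allowed[queue[0][0]] <= now:
                break
            queue.rotate(-1)
        else:
            # No host is visitable yet: idle until the earliest one is.
            wake = min(next_allowed[host] for host, _ in queue)
            idle += wake - now
            now = wake
            continue
        host, urls = queue.popleft()
        urls.popleft()                      # "fetch" one URL from this host
        now += fetch_time
        next_allowed[host] = now + latency
        if urls:
            queue.append((host, urls))      # host goes to the back of the queue
    return now, idle


if __name__ == "__main__":
    many_small = {"host%d" % i: ["/"] for i in range(10)}
    one_big = {"big": ["/%d" % i for i in range(10)]}
    print(simulate_round_robin(many_small))
    print(simulate_round_robin(one_big))
```

In this model, ten URLs spread over ten hosts never make the harvester wait, while ten URLs on a single host leave it idle for 540 of the 550 simulated seconds. That is consistent with the remark that round-robin becomes slow once only a few big sites are left in the queue.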
Starting with http://www.lub.lu.se/combine/dist/combine-b1.3-src.tar.gz
I extracted this on a standard Linux machine - Red Hat 6.0 with the latest GNU tool chain - so it was a fair test of a typical Unix configuration.
I followed the installation instructions in the README to compile bits with libdb V2. The Linux configuration was wrong for the latest compiler: -lg++ is not needed.
maxwell 63$ bin/config.pl
bash: bin/config.pl: No such file or directory
maxwell 64$ head -1 bin/config.pl
#! /usr/local/bin/perl5
i.e. this only works on certain systems. You should be using the perl-runs-anywhere trick:
-------------------
#!/bin/sh -- # -*- perl -*- -p
eval 'exec perl -wS $0 ${1+"$@"}'
if $running_under_some_shell;
-------------------
All the other Perl programs fail in the same way, since they have the same path. I then hand-edited all the Perl scripts to have the correct path.
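
The hand edit could also be automated. The following is a minimal sketch, not part of the Combine distribution: the function name `fix_perl_shebangs` is hypothetical, and it assumes that every script to be fixed sits directly in one directory (such as bin/) and that a usable perl is either on the PATH or passed in explicitly.

```python
import re
import shutil
from pathlib import Path

# Matches a first line such as "#! /usr/local/bin/perl5" or "#!/usr/bin/perl".
PERL_SHEBANG = re.compile(rb"^#!\s*\S*perl\S*")


def fix_perl_shebangs(directory, perl_path=None):
    """Rewrite the interpreter path of every Perl script in `directory`.

    Returns the sorted names of the files that were changed. Any options
    after the interpreter (for example " -w") are kept as they are.
    """
    # Assumption: fall back to /usr/bin/perl if perl is not found on PATH.
    perl_path = perl_path or shutil.which("perl") or "/usr/bin/perl"
    new_prefix = b"#!" + perl_path.encode()
    changed = []
    for script in sorted(Path(directory).iterdir()):
        if not script.is_file():
            continue
        first, sep, rest = script.read_bytes().partition(b"\n")
        new_first = PERL_SHEBANG.sub(lambda m: new_prefix, first, count=1)
        if new_first != first:
            # Overwriting in place keeps the file's executable permission.
            script.write_bytes(new_first + sep + rest)
            changed.append(script.name)
    return changed
```

A call such as `fix_perl_shebangs("bin")` would then replace the manual editing step. Scripts that start with `#!/bin/sh` or another interpreter are left untouched.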
Back to the instructions in the README file:
maxwell 73$ bin/config.pl
WARNING: COMBINE no longer uses this file: bin/config.pl
and looking at the script it in fact does nothing. The web page http://www.lub.lu.se/combine/ mentions a file etc/config_collections replacing etc/combine.conf but this isn't mentioned in the instructions.
make onedir
installed the files in the right place eventually, although mkdir complained about creating directories that already existed
----------------------------------
Following the 'Running the Robot' instructions
[combine@maxwell combine]$ bin/start-cabin
starting id server at port 10012 ...
starting tellogd server at port 10014 ...
[combine@maxwell combine]$ bin/start-hdb
starting id server at port 10022 ...
starting rd server at port 10023 ...
[combine@maxwell combine]$ bin/start-harvester-local 5
[combine@maxwell combine]$
The instructions in '4. Control The Robot' refer to the etc/setenv* files, but these are not used, as the bin/config.pl program said earlier.
[combine@maxwell combine]$ bin/sd-ctrl.pl open
COMBINE-SD/0.0 200 OK
Stat: OPENED
[combine@maxwell combine]$ echo http://acdc.hensa.ac.uk | bin/jcf-builder.pl | bin/sd-load.pl
1 JCFs have been added to SD at maxwell.mirror.ac.uk:10013...
now the SD at maxwell.mirror.ac.uk:10013 holds 1 JCFs.
which seemed to work fine. Later on in the README it mentions the config.sh file - this does not exist.
Ran
[combine@maxwell combine]$ bin/new-url.pl
which emitted a bunch of URLs to stdout as well as creating a .new file - why? What program gets rid of the .new files?
The user guide then said use:
sort < log/url.####.new | uniq | \ bin/selurl.pl | bin/jcf-builder.pl | \ bin/sd-load.pl
This doesn't work unless I remove the '\'s so I did:
[combine@maxwell combine]$ sort < log/url.*new | uniq | bin/selurl.pl | bin/jcf-builder.pl | bin/sd-load.pl
10 JCFs have been added to SD at maxwell.mirror.ac.uk:10013...
now the SD at maxwell.mirror.ac.uk:10013 holds 55 JCFs.
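
One likely reason the command fails as printed: in a POSIX shell a backslash only continues a line when it is the last character on that line, and a backslash followed by a space quotes the space instead. The sketch below uses Python's `shlex` module, which approximates POSIX shell word splitting, to show the effect on a fragment of the pipeline. It assumes the backslashes in the user guide were meant as line continuations; the helper name `command_words` is hypothetical.

```python
import shlex


def command_words(fragment):
    """Split a command fragment roughly the way a POSIX shell would."""
    return shlex.split(fragment)


# As printed in the user guide: the backslash is followed by a space.
as_printed = command_words(r"uniq | \ bin/selurl.pl")

# With the backslash removed, as in the command that worked.
as_typed = command_words("uniq | bin/selurl.pl")

if __name__ == "__main__":
    print(as_printed)
    print(as_typed)
```

In the first case the last word is " bin/selurl.pl" with a leading space, so the shell looks for a command of that name and does not find it.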
later user guide typos:
'bin/sd-ctrl.pl hots ' should be 'bin/sd-ctrl.pl hosts '
'etc/setenv.sh bin/idb2hrs.pl bin/hrs2jcf.pl < hrs/idb.hrs | bin/sd-load.pl'
should probably be:
'bin/idb2hrs.pl; bin/hrs2jcf.pl < hrs/idb.hrs | bin/sd-load.pl'
README typo: etc/config_dissalow is wrong, should be etc/config_exclude
----------------------------------
Later on after lots of attempts to add new URLs using the sequence above, I couldn't get it to discover any new URLs or follow them. It always just does 1 round of work and stops. It took me a while to understand why that happens.
idb2hrs failed to work sometimes and gave this error (the server was running):
[combine@maxwell combine]$ bin/idb2hrs.pl
Client 400 Can't connect to server
The idb daemon seems to be doing something wrong: tracing it shows it repeatedly trying and failing to do something:
read(5, "method: PUT\r\npassword: XXXXX\r"..., 1024) = 105
read(5, "\r\n", 1024) = 2
alarm(0) = 20
write(5, "COMBINE-IDBD/0.0 200 OK - put\r\n"..., 31) = 31
close(5) = 0
munmap(0x40014000, 4096) = 0
close(5) = -1 EBADF (Bad file descriptor)
accept(4, {sin_family=AF_INET, sin_port=htons(1874), sin_addr=inet_addr("212.219.56.162")}, [16]) = 5
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat(5, {st_mode=S_IFCHR|S_ISVTX|013, st_rdev=makedev(136, 190), ...}) = 0
mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
_llseek(0x5, 0, 0, 0xbffff6c0, 0x1) = -1 ESPIPE (Illegal seek)
SYS_174(0xe, 0xbffff5dc, 0xbffff550, 0x8, 0xe) = 0
alarm(20) = 0
read(5, "method: GET\r\npassword: XXXXX\r"..., 1024) = 69
read(5, "\r\n", 1024) = 2
alarm(0) = 20
write(5, "COMBINE-IDBD/0.0 500 Error - dat"..., 51) = 51
close(5) = 0
munmap(0x40014000, 4096) = 0
close(5) = -1 EBADF (Bad file descriptor)
accept(4, {sin_family=AF_INET, sin_port=htons(1876), sin_addr=inet_addr("212.219.56.162")}, [16]) = 5
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat(5, {st_mode=S_IFCHR|S_ISVTX|013, st_rdev=makedev(136, 190), ...}) = 0
mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
_llseek(0x5, 0, 0, 0xbffff6c0, 0x1) = -1 ESPIPE (Illegal seek)
SYS_174(0xe, 0xbffff5dc, 0xbffff550, 0x8, 0xe) = 0
alarm(20) = 0
read(5, "method: PUT\r\npassword: XXXXX\r"..., 1024) = 107
read(5, "\r\n", 1024) = 2
alarm(0) = 20
write(5, "COMBINE-IDBD/0.0 200 OK - put\r\n"..., 31) = 31
close(5) = 0
munmap(0x40014000, 4096) = 0
close(5) = -1 EBADF (Bad file descriptor)
accept(4,
and running bin/stop-all caused everything to stop except idbd which core dumped and died. I recompiled it and restarted everything and it seemed to work properly - I don't know why.
I then had trouble working out exactly what sequence of things to do regularly, since the examples given weren't correct for the current release of software. I ended up with this nightly work:
... PATH configuration etc ...
# Add New URLs
rm -f log/*.new
bin/new-url.pl | sort | uniq | bin/selurl.pl | bin/jcf-builder.pl | bin/sd-load.pl
# Retry unavailable URLs
bin/retry-unavailable.pl
-------------------
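
In a plain shell pipeline like the "Add New URLs" step above, the caller normally sees only the exit status of the last stage (bin/sd-load.pl), so a failure earlier in the chain can pass unnoticed. Given the intermittent errors reported earlier, a wrapper that checks every stage may be useful. This is a hedged sketch, not part of Combine: `run_pipeline` and `NIGHTLY_STAGES` are hypothetical names, the program names are the ones used in the nightly commands above, and it assumes the job is started from the Combine installation directory.

```python
import subprocess
import sys

# The "Add New URLs" pipeline from the nightly job, one argv list per stage.
NIGHTLY_STAGES = [
    ["bin/new-url.pl"],
    ["sort"],
    ["uniq"],
    ["bin/selurl.pl"],
    ["bin/jcf-builder.pl"],
    ["bin/sd-load.pl"],
]


def run_pipeline(stages, cwd=None):
    """Run `stages` as a pipeline and return the last stage's stdout (bytes).

    Raises subprocess.CalledProcessError if any stage exits non-zero,
    not just the last one. `stages` must contain at least one command.
    """
    procs = []
    prev_out = None
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=prev_out, stdout=subprocess.PIPE, cwd=cwd)
        if prev_out is not None:
            prev_out.close()  # the child now owns the read end of the pipe
        prev_out = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    for argv, proc in zip(stages, procs):
        if proc.wait() != 0:
            print("stage failed:", " ".join(argv), file=sys.stderr)
            raise subprocess.CalledProcessError(proc.returncode, argv)
    return output
```

A nightly driver could call `run_pipeline(NIGHTLY_STAGES)` after removing the old log/*.new files and then run bin/retry-unavailable.pl, stopping at the first stage that reports an error.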