

The Combine Harvester
http://www.lub.lu.se/combine/

Combine is a software package for gathering web documents, parsing them and collecting them in a database.

Combine is intended for anyone setting up some kind of web index, e.g., one covering a country or a university. It can be used in any scenario where you can specify a rule for the URLs or servers that should be harvested, and it can easily be extended with any content-based selection mechanism that might be required, such as an automatic classification tool.
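To make the idea of "a rule for the URLs or servers that should be harvested" concrete, here is a minimal sketch in Python. It is not Combine's own configuration or code: the names `ALLOWED_HOSTS`, `ALLOWED_SUFFIXES` and `in_scope` are hypothetical, and the values are only examples of a server-level and a country-level rule.

```python
from urllib.parse import urlsplit

# Hypothetical scope rule, for illustration only (not Combine configuration).
ALLOWED_HOSTS = {"www.lub.lu.se"}   # specific servers to harvest
ALLOWED_SUFFIXES = (".se",)         # e.g. a country-level index


def in_scope(url: str) -> bool:
    """Return True if the URL matches the harvesting rule above."""
    parts = urlsplit(url)
    # Only web documents are of interest here.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if host in ALLOWED_HOSTS:
        return True
    # Otherwise accept any host under one of the allowed domain suffixes.
    return any(host.endswith(suffix) for suffix in ALLOWED_SUFFIXES)
```

A content-based selection mechanism, such as a classifier, could be applied as a second check after a document that passes this rule has been fetched.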

Combine has the following features:

  • Flexible and extensible. Combine consists of a number of relatively small parts that communicate through specified protocols, so the user can combine these parts to make the system perform the required tasks. Combine is also designed to be easy to extend with, e.g., a parser for a new kind of file format, or a new database format.

  • Parallel and distributable. The different parts of Combine can be run in multiple instances on one or more computers, taking full advantage of any multiple processors or networks of computers that are available for the harvesting process.

  • Metadata aware. Combine can handle metadata embedded in the web documents, and can also fetch metadata from special metadata registries or databases.
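As an illustration of what handling "metadata embedded in the web documents" can involve, the sketch below collects the `<meta name="..." content="...">` elements of an HTML page using Python's standard `html.parser` module. It is an assumption-laden example rather than Combine's parser: the class `MetaTagCollector` and the function `extract_metadata` are hypothetical names, and only the `name`/`content` form of embedded metadata is considered.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            # A name may occur several times, so keep a list per name.
            self.metadata.setdefault(name.lower(), []).append(content)


def extract_metadata(html: str) -> dict:
    """Return a dict mapping lower-cased meta names to lists of values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.metadata
```

For example, calling `extract_metadata` on a page whose head contains `<meta name="DC.Title" content="Combine">` returns a dictionary with the key `dc.title`. Fetching metadata from an external registry or database would be a separate step and is not shown here.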


See also
Combine Harvester Documentation
Combine Harvester: Examples of Use