As illustrated in Figure 1, Harvest consists of several subsystems. The Gatherer subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at Provider sites (such as FTP and HTTP servers). The Broker subsystem retrieves indexing information from one or more Gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it. The Replicator subsystem efficiently replicates Brokers around the Internet. Users can efficiently retrieve located information through the Cache subsystem. The Harvest Server Registry (HSR) is a distinguished Broker that holds information about each Harvest Gatherer, Broker, Cache, and Replicator on the Internet.
Figure 1: Harvest Software Components
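To make the Broker's role more concrete, the following Python sketch models a Broker that pulls records from two Gatherers, suppresses duplicates, updates its index incrementally, and answers keyword queries. It is only an illustration and is not Harvest code. The names Record and ToyBroker, the record fields, and the choice to treat two records with the same URL as duplicates are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical indexing record; not Harvest's actual record format."""
    url: str
    title: str
    keywords: tuple


class ToyBroker:
    """Illustrative stand-in for a Broker: deduplicate, index, query."""

    def __init__(self):
        self.records = {}              # url -> Record
        self.index = defaultdict(set)  # lowercased keyword -> set of urls

    def collect(self, gatherer_output):
        """Add the records from one Gatherer; return how many were new."""
        added = 0
        for rec in gatherer_output:
            if rec.url in self.records:
                continue  # assumption: same URL means duplicate record
            self.records[rec.url] = rec
            for kw in rec.keywords:
                # Incremental update: only the new record touches the index.
                self.index[kw.lower()].add(rec.url)
            added += 1
        return added

    def query(self, keyword):
        """Return the records matching a keyword, sorted by URL."""
        urls = self.index.get(keyword.lower(), set())
        return [self.records[u] for u in sorted(urls)]


# Two hypothetical Gatherers whose output overlaps on one URL.
gatherer_a = [
    Record("ftp://example.org/pub/a.txt", "A", ("harvest", "index")),
    Record("http://example.org/b.html", "B", ("cache",)),
]
gatherer_b = [
    Record("http://example.org/b.html", "B", ("cache",)),
    Record("http://example.org/c.html", "C", ("harvest",)),
]

broker = ToyBroker()
broker.collect(gatherer_a)
broker.collect(gatherer_b)  # the overlapping record is skipped
for rec in broker.query("harvest"):
    print(rec.url, rec.title)
```

In this sketch the second call to collect adds only one record, because the Broker has already seen the other URL. A real Broker would also need to handle updated and expired records, which this example leaves out.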
The Harvest software distribution contains a large amount of functionality, in approximately 160,000 lines of code. You don't need to install all of the software to have a useful system. Three common configurations are:
We recommend that you start by running a Gatherer plus a Broker, which is the standard setup created by the binary software distribution. If your Broker becomes so popular that it creates bottlenecks, you can run a Replicator (see Section 7). You may also want to run an object cache (see Section 6), to reduce network traffic for popular data. Finally, you can distribute the gathering and brokering processes to optimize CPU and network use. We discuss this in the next subsection.