Department of Mathematics and Computing Science
World Wide Web Organization
Paul De Bra

Robots: algorithm for finding WWW documents

In the WWW, step 1 of the information retrieval process (finding the documents) is the most difficult. In order to create an "index-database", a list of pointers to documents needs to be built. Since the WWW uses hypertext, the pointers to documents are "hidden" inside the documents themselves. Almost all search facilities for the WWW use the following approach (sketched in code after the list):

  1. A robot (or spider) program retrieves documents whose names appear in an initial list.
  2. From each document some information is extracted and put in the index-database.
  3. From each document the links to other documents are extracted and added to the list.
  4. The process is repeated until no more new documents are found, or until some limit is exceeded (time or disk space).
  5. The index database is then offered to the public, with some query interface.
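
The following Python sketch illustrates this loop. It is only a rough illustration, not a description of any particular robot: the seed list, the limit on the number of documents as the stopping criterion, and the choice to extract the document <title> as the "information" of step 2 are all assumptions made for the example.

    # A minimal sketch of the robot loop described above.
    # Seed URLs, the document limit, and the "extract the <title>" step
    # are illustrative assumptions, not taken from the text.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen

    class LinkAndTitleParser(HTMLParser):
        """Collects href attributes and the document title."""
        def __init__(self):
            super().__init__()
            self.links = []
            self.title = ""
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)
            elif tag == "title":
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data

    def crawl(seed_urls, max_documents=100):
        """Steps 1-4: retrieve documents, index them, follow their links."""
        to_visit = list(seed_urls)        # the initial list of pointers (step 1)
        seen = set(to_visit)
        index = {}                        # the "index-database" (step 2)

        while to_visit and len(index) < max_documents:   # step 4: stop at a limit
            url = to_visit.pop(0)
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # unreachable document: skip it

            parser = LinkAndTitleParser()
            parser.feed(html)
            index[url] = parser.title.strip()            # step 2: extract information

            for link in parser.links:                    # step 3: extract the links
                absolute, _fragment = urldefrag(urljoin(url, link))
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    to_visit.append(absolute)

        return index                      # step 5: the index-database, to be offered with a query interface

    if __name__ == "__main__":
        for url, title in crawl(["https://www.example.com/"], max_documents=5).items():
            print(url, "-", title)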



The problem of deciding when to stop, or when not to retrieve a document, is inherently undecidable. When retrieving a document from a WWW server it is impossible to detect whether it is a "static" document, or whether it is generated by a program and may therefore change even over short periods of time. Often, such dynamic documents contain a link to a document with the same name (but possibly with different contents). There is an inherent danger of going into an infinite loop, retrieving (different) documents that all have the same name. This happens, for instance, with a document that shows the current time and that has an 'update' button.
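
Because an exact solution is impossible, robots rely on heuristics to avoid such traps. The small Python sketch below shows two common defensive checks; the specific tests (refusing URLs with query strings or "cgi-bin" in the path, and a limit of 500 documents per server) are illustrative assumptions, not rules given in the text.

    # Two illustrative guards against dynamically generated documents
    # and runaway crawls of a single server (assumed heuristics, not
    # prescribed by the text).
    from urllib.parse import urlparse

    def looks_dynamic(url):
        """Heuristic: query strings and cgi-bin paths often indicate generated pages."""
        parsed = urlparse(url)
        return bool(parsed.query) or "cgi-bin" in parsed.path

    def within_server_budget(url, per_server_count, limit=500):
        """Refuse to fetch more than `limit` documents from any single server."""
        host = urlparse(url).netloc
        per_server_count[host] = per_server_count.get(host, 0) + 1
        return per_server_count[host] <= limit

A robot would apply such checks before adding a URL to its list of documents to visit; neither check is foolproof, since a generated document can have a perfectly ordinary-looking name.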