

Robots face contradictory goals: they must retrieve as much
information as possible in as little time as possible, yet they must not
overload the network or the servers they visit. The sketches after the
list below illustrate several of these guidelines in code.
- Spread the load
- Avoid loading many documents from a single server in a short period of
time. Spread the load over many servers.
- Use HEAD and Conditional-GET
- Use HEAD if you only need meta-information, and a Conditional GET
(an If-Modified-Since request) for documents you already have, to avoid
retrieving an unmodified document again.
- Use Accept
- Indicate which data types you can handle, using the HTTP
Accept header. There is no point in retrieving GIFs when you are
scanning for words in text.
- Predict document types
- If your robot can only handle text, links to files whose names end
in .gif, .mpeg, etc. are probably not worth following.
- Check results
- Check return codes that may indicate that many other URLs still in
your queue are useless. When 5 documents in the same directory are
refused because they require a password, chances are all documents in
that directory will be refused.
- Avoid duplicates and loops
- Different server names may point to the same machine.
Different URLs on a single machine may point to the same document.
- Avoid sensitive information
- Don't go looking for everything you can get at. Some servers may
accidentally give access to files such as /etc/passwd; don't use them.
- Share results
- Webmasters will be less upset about robots roaming their site
if the purpose is clear. Provide feedback on the purpose of your robot,
and if possible make the results publicly available through the Web.
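
Spreading the load can be as simple as remembering when each host was
last contacted and sleeping until enough time has passed. A minimal
Python sketch, assuming a single-threaded fetch loop and an arbitrary
60-second per-host delay:

    import time
    from urllib.parse import urlsplit

    HOST_DELAY = 60.0    # seconds between requests to the same host (arbitrary choice)
    last_contact = {}    # host name -> time of the previous request

    def wait_politely(url):
        """Sleep until HOST_DELAY seconds have passed since the last
        request to this URL's host, then record the new contact time."""
        host = urlsplit(url).hostname
        earliest = last_contact.get(host, 0.0) + HOST_DELAY
        now = time.time()
        if now < earliest:
            time.sleep(earliest - now)
        last_contact[host] = time.time()

Interleaving URLs from many different hosts in the queue keeps the robot
busy while any single server only sees an occasional request.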
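
The HEAD, Conditional-GET and Accept advice fits together in one
fetching helper. The sketch below uses Python's standard urllib; the
robot name, the Accept value and the last_modified cache are
illustrative assumptions:

    import urllib.request
    from urllib.error import HTTPError

    HEADERS = {"User-Agent": "ExampleRobot/0.1",        # assumed robot name
               "Accept": "text/html, text/plain"}       # only ask for what we can handle
    last_modified = {}   # url -> Last-Modified value from an earlier fetch

    def head(url):
        """Retrieve only the meta-information for a URL."""
        req = urllib.request.Request(url, headers=HEADERS, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return dict(resp.headers)

    def conditional_get(url):
        """Re-retrieve a document only if it changed since we last saw it."""
        headers = dict(HEADERS)
        if url in last_modified:
            headers["If-Modified-Since"] = last_modified[url]
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req) as resp:
                if "Last-Modified" in resp.headers:
                    last_modified[url] = resp.headers["Last-Modified"]
                return resp.read()
        except HTTPError as err:
            if err.code == 304:   # Not Modified: keep the copy we already have
                return None
            raise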
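
Predicting document types from the URL alone only needs a small
extension check before a link is queued. The extension list below is an
assumed, non-exhaustive example:

    from urllib.parse import urlsplit

    SKIP_EXTENSIONS = (".gif", ".jpeg", ".jpg", ".mpeg", ".mpg", ".zip", ".tar")

    def probably_text(url):
        """Guess from the URL alone whether a document is worth retrieving."""
        path = urlsplit(url).path.lower()
        return not path.endswith(SKIP_EXTENSIONS)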
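
Checking results can mean counting refusals per directory and dropping
queued URLs under a prefix that keeps being refused. A sketch using the
threshold of 5 mentioned above:

    from urllib.parse import urlsplit

    REFUSAL_LIMIT = 5
    refusals = {}    # directory prefix -> number of 401/403 responses seen

    def directory_of(url):
        """The scheme, host and directory part of a URL, used as a key."""
        parts = urlsplit(url)
        return "%s://%s%s/" % (parts.scheme, parts.netloc,
                               parts.path.rsplit("/", 1)[0])

    def record_result(url, status):
        """Remember refusals so whole directories can be skipped later."""
        if status in (401, 403):
            key = directory_of(url)
            refusals[key] = refusals.get(key, 0) + 1

    def worth_trying(url):
        """False once REFUSAL_LIMIT documents in the same directory were refused."""
        return refusals.get(directory_of(url), 0) < REFUSAL_LIMIT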
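
Avoiding duplicates and loops requires normalising URLs before
comparing them and keeping a set of everything already visited. The
sketch below lowercases the host and drops fragments and the default
port; mapping different host names that resolve to the same machine
onto one key is a further step, left out here because it can wrongly
merge servers that merely share an address.

    from urllib.parse import urlsplit, urlunsplit

    seen = set()    # canonical forms of URLs already fetched or queued

    def canonical(url):
        """Normalise a URL so trivially different forms compare equal."""
        parts = urlsplit(url)
        host = parts.hostname or ""           # urlsplit lowercases the host
        if parts.port is not None and parts.port != 80:
            host = "%s:%d" % (host, parts.port)
        # Drop the fragment and the default http port (80); keep path and query.
        return urlunsplit((parts.scheme, host, parts.path or "/",
                           parts.query, ""))

    def is_new(url):
        """True only the first time a canonical form of the URL is seen."""
        c = canonical(url)
        if c in seen:
            return False
        seen.add(c)
        return True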