
Robots and Network Resources

Robots face conflicting goals: they must retrieve as much information as possible in as little time as possible, yet they must not overload the network or individual servers.

Spread the load
Avoid loading many documents from a single server in a short period of time. Spread the load over many servers.
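A minimal scheduling sketch, assuming Python and an arbitrary 30-second per-server delay (both are illustrative choices, not prescribed here): keep a separate queue per server and only hand out a URL for a server that has not been contacted recently.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    MIN_DELAY = 30.0                 # assumed politeness delay per server, in seconds

    queues = defaultdict(deque)      # one URL queue per server (host name)
    last_hit = defaultdict(float)    # time of the last request sent to each server

    def enqueue(url):
        queues[urlsplit(url).netloc].append(url)

    def next_url():
        """Return a URL from a server that has not been contacted recently."""
        now = time.time()
        for host, pending in queues.items():
            if pending and now - last_hit[host] >= MIN_DELAY:
                last_hit[host] = now
                return pending.popleft()
        return None  # every server was contacted too recently; the robot should wait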

Use HEAD and Conditional-GET
Use HEAD if you only need meta-information, and a conditional GET (with an If-Modified-Since header) for documents you already have, so that an unmodified document is not transferred again.
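A sketch of both techniques with Python's standard urllib module (the function names and the stored Last-Modified date are illustrative assumptions):

    import urllib.request
    import urllib.error

    def head(url):
        """Ask for meta-information only; the document body is not transferred."""
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return resp.headers          # Content-Type, Content-Length, Last-Modified, ...

    def get_if_modified(url, last_modified):
        """Conditional GET: re-fetch only if the document changed since
        last_modified, an HTTP date string saved from an earlier response."""
        req = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()       # 200: the document was modified
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None              # 304 Not Modified: keep the copy you have
            raise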

Use Accept
Indicate which data types you can handle using the HTTP Accept header. Downloading GIF images is of no use to a robot that scans text for words.
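For example, a text-indexing robot might announce that it only wants HTML or plain text. A Python sketch (the URL is a hypothetical placeholder; servers are free to ignore Accept, so the returned Content-Type is checked as well):

    import urllib.request

    req = urllib.request.Request(
        "http://www.example.com/some/page",              # hypothetical URL
        headers={"Accept": "text/html, text/plain"},     # types this robot can handle
    )
    with urllib.request.urlopen(req) as resp:
        content_type = resp.headers.get("Content-Type", "")
        if content_type.split(";")[0].strip() in ("text/html", "text/plain"):
            body = resp.read()                           # only index what we can parse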

Predict document types
If your robot can only handle text, links to files with names ending in .gif, .mpeg, etc. are probably not worth following.
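A simple filter on the URL itself might look like the Python sketch below; the list of extensions is an assumption, and a file name only hints at the document type, it does not guarantee it.

    from urllib.parse import urlsplit

    # Assumed extensions a text-only robot would skip.
    SKIP_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".mpeg", ".mpg", ".zip", ".tar", ".gz")

    def probably_useful(url):
        """Guess from the URL alone whether the document is worth fetching."""
        path = urlsplit(url).path.lower()
        return not path.endswith(SKIP_EXTENSIONS)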

Check results
Check return codes: they may indicate that many other URLs still in your queue are useless. When 5 documents in the same directory are refused because they require a password, chances are that all documents in that directory will be refused.
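One possible bookkeeping scheme, sketched in Python (the refusal threshold of 5 follows the example above; the data structures and function names are assumptions): count refusals per directory and stop trying that directory once the threshold is reached.

    import posixpath
    from collections import Counter
    from urllib.parse import urlsplit

    REFUSAL_LIMIT = 5       # matches the example above

    refusals = Counter()    # 401/403 responses seen per (server, directory)
    blocked = set()         # directories assumed to be protected entirely

    def directory_of(url):
        parts = urlsplit(url)
        return parts.netloc, posixpath.dirname(parts.path)

    def record_result(url, status):
        """Call this with the HTTP status code of every completed request."""
        if status in (401, 403):
            key = directory_of(url)
            refusals[key] += 1
            if refusals[key] >= REFUSAL_LIMIT:
                blocked.add(key)

    def worth_trying(url):
        """Skip queued URLs in directories where requests keep being refused."""
        return directory_of(url) not in blocked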

Avoid duplicates and loops
Different server names may point to the same machine, and different URLs on a single machine may point to the same document. Detect these aliases, or the robot will fetch the same documents repeatedly and may loop forever.
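One partial way to detect such aliases, sketched in Python: reduce every URL to a canonical form before queueing it. Resolving the server name to an IP address (an assumption made here) catches different names for the same machine; it does not catch different paths to the same file, for which comparing document checksums would be needed.

    import socket
    from urllib.parse import urlsplit, urlunsplit

    seen = set()

    def canonical(url):
        """Map aliases of the same document to the same key where possible."""
        parts = urlsplit(url)
        host = parts.hostname or ""
        try:
            host = socket.gethostbyname(host)    # different names, same machine
        except OSError:
            pass                                 # keep the name if it cannot be resolved
        return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

    def is_new(url):
        key = canonical(url)
        if key in seen:
            return False
        seen.add(key)
        return True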

Avoid sensitive information
Don't try to retrieve everything you can find. Some servers may accidentally give access to sensitive files such as /etc/passwd. Don't retrieve or use them.

Share results
Webmasters will be less upset about robots roaming their site if the purpose is clear. Provide feedback on the purpose of your robot, and if possible make the results publicly available through the Web.

