

Robots face contradictory goals: they must retrieve as much
information as possible in as little time as possible, yet they must not
overload the network or the servers they visit. The sketches after the
list below illustrate several of these guidelines in code.
- Spread the load
- Avoid loading many documents from a single server in a short period of
time. Spread the load over many servers.
- Use HEAD and Conditional-GET
- Use HEAD if you only need meta-information, and a Conditional GET
(an If-Modified-Since request) for documents you already have, to avoid
retrieving an unmodified document again.
- Use Accept
- Indicate which data types you can handle, using the HTTP
Accept header. There is no point in retrieving GIFs when you are
scanning for words in text.
- Predict document types
- If your robot can only handle text, links to files whose names end
in .gif, .mpeg, etc. are probably not worth following.
- Check results
- Check return codes that may indicate that many other URLs still in
your queue are useless. When 5 documents in the same directory are
refused because they require a password, chances are all documents in
that directory will be refused.
- Avoid duplicates and loops
- Different server names may point to the same machine.
Different URLs on a single machine may point to the same document.
- Avoid sensitive information
- Don't go looking for everything you can get at. Some servers may
accidentally give access to files such as /etc/passwd; don't use them.
- Share results
- Webmasters will be less upset about robots roaming their site
if the purpose is clear. Provide feedback on the purpose of your robot,
and if possible make the results publicly available through the Web.
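
Spreading the load can be as simple as remembering when each host was
last contacted and sleeping until enough time has passed. A minimal
Python sketch, assuming a single-threaded fetch loop and an arbitrary
60-second per-host delay:

    import time
    from urllib.parse import urlsplit

    HOST_DELAY = 60.0    # seconds between requests to the same host (arbitrary choice)
    last_contact = {}    # host name -> time of the previous request

    def wait_politely(url):
        """Sleep until HOST_DELAY seconds have passed since the last
        request to this URL's host, then record the new contact time."""
        host = urlsplit(url).hostname
        earliest = last_contact.get(host, 0.0) + HOST_DELAY
        now = time.time()
        if now < earliest:
            time.sleep(earliest - now)
        last_contact[host] = time.time()

Interleaving URLs from many different hosts in the queue keeps the robot
busy while any single server only sees an occasional request.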
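
The HEAD, Conditional-GET and Accept advice fits together in one
fetching helper. The sketch below uses Python's standard urllib; the
robot name, the Accept value and the last_modified cache are
illustrative assumptions:

    import urllib.request
    from urllib.error import HTTPError

    HEADERS = {"User-Agent": "ExampleRobot/0.1",        # assumed robot name
               "Accept": "text/html, text/plain"}       # only ask for what we can handle
    last_modified = {}   # url -> Last-Modified value from an earlier fetch

    def head(url):
        """Retrieve only the meta-information for a URL."""
        req = urllib.request.Request(url, headers=HEADERS, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return dict(resp.headers)

    def conditional_get(url):
        """Re-retrieve a document only if it changed since we last saw it."""
        headers = dict(HEADERS)
        if url in last_modified:
            headers["If-Modified-Since"] = last_modified[url]
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req) as resp:
                if "Last-Modified" in resp.headers:
                    last_modified[url] = resp.headers["Last-Modified"]
                return resp.read()
        except HTTPError as err:
            if err.code == 304:   # Not Modified: keep the copy we already have
                return None
            raise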
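
Predicting document types from the URL alone only needs a small
extension check before a link is queued. The extension list below is an
assumed, non-exhaustive example:

    from urllib.parse import urlsplit

    SKIP_EXTENSIONS = (".gif", ".jpeg", ".jpg", ".mpeg", ".mpg", ".zip", ".tar")

    def probably_text(url):
        """Guess from the URL alone whether a document is worth retrieving."""
        path = urlsplit(url).path.lower()
        return not path.endswith(SKIP_EXTENSIONS)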
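
Checking results can mean counting refusals per directory and dropping
queued URLs under a prefix that keeps being refused. A sketch using the
threshold of 5 mentioned above:

    from urllib.parse import urlsplit

    REFUSAL_LIMIT = 5
    refusals = {}    # directory prefix -> number of 401/403 responses seen

    def directory_of(url):
        """The scheme, host and directory part of a URL, used as a key."""
        parts = urlsplit(url)
        return "%s://%s%s/" % (parts.scheme, parts.netloc,
                               parts.path.rsplit("/", 1)[0])

    def record_result(url, status):
        """Remember refusals so whole directories can be skipped later."""
        if status in (401, 403):
            key = directory_of(url)
            refusals[key] = refusals.get(key, 0) + 1

    def worth_trying(url):
        """False once REFUSAL_LIMIT documents in the same directory were refused."""
        return refusals.get(directory_of(url), 0) < REFUSAL_LIMIT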
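
Avoiding duplicates and loops requires normalising URLs before
comparing them and keeping a set of everything already visited. The
sketch below lowercases the host and drops fragments and the default
port; mapping different host names that resolve to the same machine
onto one key is a further step, left out here because it can wrongly
merge servers that merely share an address.

    from urllib.parse import urlsplit, urlunsplit

    seen = set()    # canonical forms of URLs already fetched or queued

    def canonical(url):
        """Normalise a URL so trivially different forms compare equal."""
        parts = urlsplit(url)
        host = parts.hostname or ""           # urlsplit lowercases the host
        if parts.port is not None and parts.port != 80:
            host = "%s:%d" % (host, parts.port)
        # Drop the fragment and the default http port (80); keep path and query.
        return urlunsplit((parts.scheme, host, parts.path or "/",
                           parts.query, ""))

    def is_new(url):
        """True only the first time a canonical form of the URL is seen."""
        c = canonical(url)
        if c in seen:
            return False
        seen.add(c)
        return True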