

- Robots may monitor several netnews groups, scanning messages for
  embedded URLs (a sketch of such URL extraction follows this list).
- Robots may scan the directories of FTP servers for HTML files
  (which may contain links to WWW documents).
- Robots may manipulate URLs to generate new URLs of documents that
  may or may not exist. If the document http://host/dir/subdir/file.html
  exists, then http://host/dir/subdir/, http://host/dir/, and
  http://host/ should also exist (see the second sketch after this list).
- Robots may manipulate URLs to retrieve directory listings.
  If the document http://host/dir/subdir/file.html exists, then
  http://host/dir/subdir/., http://host/dir/. and
  http://host/. may generate directory listings containing
  further HTML files.
- Robots may try a limited number of coordinates in clickable images.
- Robots may try to fill out forms that contain only one text field.
- Robots may retrieve the same URL more than once and check whether
  the returned document is always the same (see the last sketch after
  this list).
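As a rough sketch of the first heuristic, the following Python
fragment scans a block of text (such as a netnews article body) for
embedded URLs. The regular expression is a deliberate simplification,
and extract_urls is an illustrative name, not part of any existing
robot:

    import re

    # Simplified pattern; real URLs permit more characters and schemes.
    URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

    def extract_urls(text):
        """Return all URL-like strings found in a block of text."""
        return URL_PATTERN.findall(text)

    body = "The index lives at http://host/dir/subdir/file.html now."
    print(extract_urls(body))  # ['http://host/dir/subdir/file.html']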
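For the URL-manipulation heuristic, here is a minimal sketch (again in
Python, using only the standard urllib.parse module; ancestor_urls is
a hypothetical helper) that derives the parent URLs of a known
document. The directory-listing trick from the next item is the same
manipulation with a "." appended to each result:

    from urllib.parse import urlsplit, urlunsplit

    def ancestor_urls(url):
        """Yield every parent URL of a document, deepest first."""
        parts = urlsplit(url)
        segments = parts.path.split('/')[1:-1]  # drop leading '' and the file name
        for i in range(len(segments), -1, -1):
            path = '/' + '/'.join(segments[:i])
            if not path.endswith('/'):
                path += '/'
            yield urlunsplit((parts.scheme, parts.netloc, path, '', ''))

    print(list(ancestor_urls('http://host/dir/subdir/file.html')))
    # ['http://host/dir/subdir/', 'http://host/dir/', 'http://host/']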
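Finally, a sketch of the repeated-retrieval check from the last item:
fetch the same URL several times and compare digests of the responses
(is_stable is an illustrative name; a changing body suggests
dynamically generated content):

    import hashlib
    import urllib.request

    def is_stable(url, tries=3):
        """Fetch a URL several times; report whether the body is
        byte-for-byte identical on every retrieval."""
        digests = set()
        for _ in range(tries):
            with urllib.request.urlopen(url) as response:
                digests.add(hashlib.sha256(response.read()).hexdigest())
        return len(digests) == 1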
Robots that try to be too smart about finding documents may run into
trouble. Not only may there be infinite loops (as in the time
example), but they may also find the same server more than once,
under a different name. (Checking the actual IP address would solve
this; a sketch follows.)
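A minimal sketch of that IP check, assuming Python's standard socket
module (the host names are placeholders): resolve each host name
before visiting it, and skip hosts whose address has already been
seen.

    import socket

    seen_addresses = set()

    def already_visited(host):
        """Resolve a host name; report whether its IP address was seen
        before, so aliases of one server are crawled only once."""
        address = socket.gethostbyname(host)
        if address in seen_addresses:
            return True
        seen_addresses.add(address)
        return False

    for host in ('example.com', 'www.example.com'):  # placeholder names
        print(host, 'already visited:', already_visited(host))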
Also, robots may stumble upon documents protected by user/password
combinations. Guessing the password is not acceptable behavior.