In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to tell the robot what it should do. This is achieved through the following mechanism:
The Robots Protocol
A Web site administrator can indicate which parts of the site a robot should and should not visit by providing a specially formatted file on their site, at http://.../robots.txt.

The Robots.txt example file
This is a zipped file that webmasters can use on their sites. To use it, download the archive, unzip it, and upload the robots.txt file to the root directory of your server.
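As a rough sketch of what such a specially formatted file can contain, the Python snippet below assembles a single record that asks every robot to stay out of a few directories. The function name `build_robots_txt` and the paths `/cgi-bin/` and `/private/` are hypothetical and chosen only for illustration; they do not describe the contents of the zipped example file.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text holding one record for user_agent.

    disallowed_paths is a list of URL path prefixes (hypothetical here)
    that the named robot is asked not to visit.
    """
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        lines += [f"Disallow: {path}" for path in disallowed_paths]
    else:
        # An empty Disallow value means nothing is off limits.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative paths only; substitute the directories on your own site.
    print(build_robots_txt(["/cgi-bin/", "/private/"]), end="")
```

The resulting text would be saved as robots.txt in the root directory of the server, so that it is served from the top-level /robots.txt URL of the site.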
Note that this mechanism relies on cooperation from the Robot, and is by no means guaranteed to work for every Robot. The main search engines, however, do respect the robots protocol, and will follow your directions for indexing your Web site.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

    User-agent: *
    Disallow: /

to see if it is allowed to retrieve the document. If it is allowed to crawl and index the site, it will do so, and the pages it has crawled will then be added to a search engine's index. The precise details on how these rules can be specified, and what they mean, are defined by the Robots Protocol.
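The check that a cooperating robot performs can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration under stated assumptions, not how any particular search engine implements it: the helper names `robots_url` and `is_allowed` and the robot name `ExampleBot` are hypothetical, and the rules are parsed from a string rather than downloaded, so the sketch runs without network access.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(robots_txt: str, user_agent: str, page_url: str) -> bool:
    """Check page_url against the rules of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# The record quoted in the text: every robot is barred from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    page = "http://www.foobar.com/docs/index.html"
    print(robots_url(page))                         # where the robot looks first
    print(is_allowed(BLOCK_ALL, "ExampleBot", page))  # may it fetch the page?
```

A real robot would download the file from the address returned by `robots_url` before each crawl of the site and skip any page for which the check fails.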