How to use the robots meta tag or the robots.txt file.
As a search engine optimizer, most of the time you want your pages indexed by the robot (search engine spider). But sometimes you want to advise SE's not to access a page or pages, for example if you have:
- Pages under construction
- Pages in a language relevant to only some SE's
- Pages optimized for specific search engines
- Downloadable multimedia files, documents & other files you don’t want SE’s to access
- Temporary pages.
The robot meta tag and the robots.txt file are used to advise search engine robots on what areas of your site you want them to access.
There are literally hundreds of search engine spiders, robots, bots, whatever term you prefer. Many of the robots are country-specific, so if your site does not exist within a domain for that country you will never see them.
Some global robots also have local country variants. There are about 200 main robots, but most web sites will only ever see a few of them, from the major search engines.
For the site owner and optimizer, robots are welcome because they collect the information about web pages that lets those pages be indexed by the search engine.
We can then start down that frequently long road toward recognition and a top-ranking placement in our chosen search engines.
To a web server and its administrator, the robot is frequently not a friend but a foe, an unwanted nuisance that consumes valuable resources. It’s one of the main reasons for wanting to control access to a robot.
When they come calling they can make thousands of page requests that can tie up processors, slow down access times to the server for its customers and also consume valuable bandwidth.
Of course on a modern, well-designed and well-managed web server the impact on performance is minimal, but nonetheless it's an additional burden or overhead.
As an optimizer and web site owner you will need to develop an understanding of your weblog files or web statistics. That I cover in another section, but for now let’s just say that when you analyze these log files there will be entries for user access to the site.
Among the names & domains from which your visitors come you will find names like Slurp, MantraAgent, Scooter, and Architextspider. These are the search engine spiders, some call daily, others weekly, some just once a month. As already mentioned two ways exist to allow or disallow a robot.
The robot meta tag advises the search engine robot about an individual web page and its links.
The robots.txt file advises the search engine robot about what it should not access on the server: an individual file (page), several files, or entire directories of files.
What should you use, robot meta tag on a page, or robots.txt file on the webserver?
The answer can depend on your web hosting: some accounts do not support individual users having a robots.txt file.
Check with technical support at your hosting provider whether you can use one on your account.
Personally I think it's a good idea to use both the robots meta tag and the robots.txt file. If you only want to use one, then use robots.txt, if you can.
The reason you need a robots.txt file is that most search engine spiders will look for it, and if it's not there some get confused and leave. The missing file also generates a lot of 404 (file not found) entries in your weblog files.
The robots.txt file
The robots.txt file, as the name suggests, is a simple text file that contains instructions for the search engine robot (spider) as to what files it is not allowed to access.
The file can be created in a simple text editor like notepad then uploaded to the root directory on your web host where your web site resides. Check with your web host that you can use a robots.txt file.
It must be called robots.txt in lower case letters only and be saved in the root directory.
If you put it anywhere else the spider will not find it.
If your web site is at http://www.yourdomain.com then with the robots.txt file in the root directory on your web server space the spider will find it as http://www.yourdomain.com/robots.txt
A simple robots.txt file might look something like:
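For example (the directory and file names here are those discussed in the next paragraphs; the name of the form directory is an assumption, shown here as /form/):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /form/
Disallow: /contact.html
```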
In the above example, Disallow: /cgi-bin/ tells the robot to stay out of the cgi-bin directory and away from your Perl scripts, etc.
The next Disallow tells the robot to stay out of a form directory where there are form templates and pages stored that we don’t want to be indexed.
The final line tells the robot not to index the single web page contact.html.
Each directory or file to be disallowed needs a separate Disallow: command line.
The * in User-agent: * means any robot; it is not a wildcard character as one might use in a search query. So you cannot write Disallow: /*.jpg to stop the robot accessing every image file with the .jpg extension, whatever its filename.
To exclude all robots from the entire site would require:
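A sketch of such a file:

```
User-agent: *
Disallow: /
```

The single Disallow: / covers every path on the server.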
There is no Allow command; instead, to allow all robots full access you would use an empty Disallow:
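A sketch, with the Disallow value left empty:

```
User-agent: *
Disallow:
```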
You can also disallow only particular robots by giving the name of the search engine robot in the User-agent command line.
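A sketch, naming one robot (Marvin, as discussed in the next sentence):

```
User-agent: Marvin
Disallow: /
```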
This would allow all robots access except Marvin which happens to be Infoseek.de (Germany).
NOTE: Not all robots observe the robots.txt file.
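The rules described above can also be checked programmatically; as a sketch, Python's standard-library urllib.robotparser applies the same matching (the domain and file names are the illustrative ones used earlier):

```python
# Check robots.txt rules with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /contact.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved robot skips the disallowed paths:
print(rp.can_fetch("AnyBot", "http://www.yourdomain.com/cgi-bin/form.pl"))  # False
print(rp.can_fetch("AnyBot", "http://www.yourdomain.com/index.html"))       # True
```

This is the same logic a compliant spider applies: the User-agent: * block matches any robot, and each request path is tested against the Disallow prefixes.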
The robots meta tag
The Robots Meta tag is a simple way to tell a visiting robot if a page should be indexed, or links on the page followed.
It has four basic commands, INDEX, NOINDEX, FOLLOW and NOFOLLOW, plus two shorthand values: ALL (equivalent to INDEX,FOLLOW) and NONE (equivalent to NOINDEX,NOFOLLOW).
These commands are not case sensitive so can be written in either upper or lower case.
Examples would be:
<meta name="robots" content="index,follow"> or <meta name="robots" content="all">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow"> or <meta name="robots" content="none">
As a META tag it should be placed in the HEAD section of an HTML page:
<head>
<meta name="robots" content="index,nofollow">
<meta name="description" content="This page ….">
</head>