The robots.txt file

When it comes to SEO, most people understand that a Web site must have content, “search engine friendly” site architecture/HTML, and metadata such as title tags, image alt tags and so on.

However, some Web sites totally disregard the robots.txt file. When optimizing a Web site, don’t underestimate the power of this little text file.

What is a Robots.txt File?

Simply put, if you go to www.domain.com/robots.txt, you should see a list of directories of the Web site that the site owner is asking the search engines to “skip” (or “disallow”). However, if you’re not careful when editing the file, you could put directives in it that really hurt your business.

There’s plenty of information about the robots.txt file available at the Web Robots Pages, including the proper usage of the Disallow directive and how to block “bad bots” from crawling your Web site.

The general rule of thumb is to make sure a robots.txt file exists at the root of your domain (e.g., www.domain.com/robots.txt). To exclude all robots from indexing part of your Web site, your robots.txt file would look something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

The above syntax tells all robots not to crawl the /cgi-bin/, /tmp/, and /junk/ directories on your Web site.
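If you want to double-check how a crawler would actually read those rules, Python’s built-in urllib.robotparser module can fetch and parse a robots.txt file and answer “may I fetch this URL?” questions. Here is a minimal sketch; www.domain.com and the URLs are just placeholders matching the example above:

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder domain)
parser = RobotFileParser()
parser.set_url("http://www.domain.com/robots.txt")
parser.read()  # fetch and parse the file

# Ask whether a generic robot ("*") may fetch specific URLs
print(parser.can_fetch("*", "http://www.domain.com/cgi-bin/search.cgi"))   # False: /cgi-bin/ is disallowed
print(parser.can_fetch("*", "http://www.domain.com/articles/index.html"))  # True: not listed, so allowed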

There are also situations where the robots.txt file can cause issues with your site optimization. For instance, if you include Disallow: / under User-agent: * in your robots.txt file, you are telling the search engines not to crawl any part of the Web site, giving you no Web presence – not what you want.
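To make the difference concrete, here is what the two forms look like. The first blocks the entire site; the second, with an empty Disallow line, allows everything:

Block everything (usually a mistake):

User-agent: *
Disallow: /

Allow everything:

User-agent: *
Disallow: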

Another point to watch out for: if you modify your robots.txt file to disallow old legacy pages and directories, you should also set up 301 permanent redirects to pass the value from the old Web pages to the new ones.
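How you set up the 301 depends on your server. As one example, on an Apache server you could add a line like this to your .htaccess file (the page names here are only placeholders, not pages from any real site):

Redirect 301 /old-page.html http://www.domain.com/new-page.html

This sends both visitors and search engine robots from the old address to the new one, so the old page’s value isn’t simply thrown away.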

Robots.txt Dos and Don’ts

For SEO purposes, there are many good reasons to stop the search engines from indexing certain directories on a Web site while allowing others.

Here’s what you should do with robots.txt:

* Take a look at all of the directories in your Web site. Most likely, there are directories that you’d want to disallow the search engines from indexing, including directories like /cgi-bin/,  /wp-admin/,  /cart/,  /scripts/,  and others that might include sensitive data (see the example after this list).
* Stop the search engines from indexing certain directories of your site that might include duplicate content. For example, some Web sites have “print versions” of Web pages and articles that allow visitors to print them easily. You should only allow the search engines to index one version of your content.
* Make sure that nothing stops the search engines from indexing the main content of your Web site.
* Look for certain files on your site that you might want to disallow the search engines from indexing, such as certain scripts, or files that might contain e-mail addresses, phone numbers, or other sensitive data.
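Putting those points together, a robots.txt for a site with an admin area, a shopping cart and “print version” pages might look something like this (the directory names are only examples; use the ones that actually exist on your site):

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /scripts/
Disallow: /print/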

Here’s what you should not do with robots.txt:

* Don’t use comments in your robots.txt file.
* Don’t list all your files in the robots.txt file. Listing the files allows people to find files that you don’t want them to find.
* The original robots.txt specification has no “allow” command, so there’s no need to add one; anything you don’t disallow is crawled by default.

By taking a good look at your Web site’s robots.txt file and making sure the syntax is set up correctly, you’ll avoid search engine ranking problems. And by disallowing the search engines from indexing duplicate content on your Web site, you can head off duplicate content issues that might hurt your rankings.

Test a robots.txt file

Google provides a facility as part of its Webmaster Tools system that enables you to test a robots.txt file.

To test a site’s robots.txt file:

1. On the Webmaster Tools Home page, click the site you want.
2. Under Health, click Blocked URLs.
3. If it’s not already selected, click the Test robots.txt tab.
4. Copy the content of your robots.txt file and paste it into the first box.
5. In the URLs box, list the URLs on your site that you want to test against.
6. In the User-agents list, select the user-agents you want.

Any changes you make in this tool will not be saved. To save any changes, you’ll need to copy the contents and paste them into your robots.txt file.
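If you’d rather test a draft before publishing it at all, the same urllib.robotparser module shown earlier can parse the text directly, without fetching anything from the Web. For example, assuming a draft like the one above:

from urllib.robotparser import RobotFileParser

# A draft robots.txt, held as a list of lines instead of being fetched
draft = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /print/",
]

parser = RobotFileParser()
parser.parse(draft)  # parse the draft text; no network access needed
print(parser.can_fetch("*", "http://www.domain.com/print/article1.html"))  # False: /print/ is disallowed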
