Meta tags, robots.txt and Better Search Results with WordPress


Everton, from Connected Internet, posed an interesting question today concerning robots.txt and WordPress. It got me thinking about what exactly I do and don't want indexed by Google. It also got me reviewing my Google Webmaster Tools account to see how Google was doing while indexing my site.

What followed was an afternoon of reviewing the way robots.txt works as well as how the meta tag for robots works. I had originally planned on just implementing the changes and leaving a comment on Connected Internet but ultimately it was more deserving of a full post.

Getting Started with Robots.txt and Meta Tags

Here's the thing: the robots that index your content are either designed to obey the rules or to crawl blindly. Short of blocking a robot's access you can only control it so much. Google, Yahoo, MSN and other big-name search engines are likely to have robots that follow the rules. Other spiders, like feed scrapers, are going to do what they please until they're blocked.

Your robots.txt file is just a text file too. It's not encoded in any way and can be quickly read by any visitor. Don't believe me? Just append robots.txt to my homepage's URL and you'll get a peek at what I'm telling robots visiting this site to do.
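If you'd rather work out that address in a script, here's a minimal Python sketch. The example.com URL is a placeholder and robots_url is just a name I made up for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    # The leading slash makes urljoin resolve against the site root,
    # no matter how deep the path in page_url is.
    return urljoin(page_url, "/robots.txt")


# Placeholder URL, not a real site from this post.
print(robots_url("https://example.com/2007/03/some-post/"))
```

Robots look for the file at the root of the site, which is why the sketch drops the rest of the path.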

What does this mean? Because bad robots, and bad people, have as much access to your robots.txt file as good robots and people, you've got to be sure that you only reveal what you have to. You want it easy to read and easy to follow too. Ideally your robots.txt file should include only what it has to. Now it's just a question of what to disallow.

What Should You Put in your Robots.txt?

Perhaps a better question is what shouldn't you put in your Robots.txt file. There are two rules here:

  1. include anything that you specifically don't want indexed
  2. don't include anything you don't want people to know about.

If you have a secret directory that only you know about, don't mention it in your robots.txt file. It's like putting up a big sign for all the bad robots that says "hey look here!"

If you have a directory or page that you don't want indexed, mention it, because that's what robots.txt is for. Generally speaking, if you're using WordPress, just about everything should be fair game in your robots.txt file. I've seen people suggest that you disallow your "wp-" directories but because you'll never link to any content in those areas I don't really see the point.

Sample Robots.txt

Here's a sample Robots.txt for WordPress using the ideas above.
User-agent: *
Disallow: /wp-content/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/

These lines would disallow the "wp-" directories, the feeds and trackbacks. Disallowing the feeds and trackbacks is done mostly to prevent duplicate content. It also prevents your RSS feeds from turning up as Google search results - I don't know if this would block Google Blogsearch.
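If you'd like to check what a rule-following robot would make of the sample above, Python's standard urllib.robotparser module can read the rules and answer yes or no for a given path. This is only a sketch: the test paths are made up, is_allowed is a helper name of my own, and real search engines may interpret the rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from this post, copied as-is.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
"""


def is_allowed(path: str, robots_txt: str = SAMPLE_ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if this parser would let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical paths, chosen only to exercise each rule.
for path in ("/2007/03/some-post/", "/wp-admin/options.php", "/feed/", "/some-post/feed/"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

This parser matches rules by path prefix, so /feed/ only covers the top-level feed and a per-post path such as /some-post/feed/ still gets through. The same goes for /trackback/.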

The next thing that is popularly disallowed is the archives. The idea here is that when your archive pages are indexed they show duplicate content. Because your recent archives change frequently (you add to them every time you post in a given time period), there's a solid chance that visitors will arrive on an indexed archive page whose content is outdated. The thing is, while you want to prevent your archives from being indexed, you don't want to prevent robots from spidering through them. That's where meta tags come in.

The Difference Between Meta Tags and Robots.txt

While you can choose between allow and disallow in robots.txt you can't choose between follow and index and that's a major shortcoming. Telling a robot that it's disallowed is the equivalent of putting a locked door between the robot and any data on the page - that means that any links on that page won't be spidered. Because your archives are a significant part of your site navigation you really don't want to put them under lock and key.

While you probably don't want them indexed it's safe to let robots crawl through the content looking for links. That's where the Robots Meta tag comes in. The meta tag allows you to tell robots that they can follow the links in a page but that they shouldn't index it. What this does is allow the robot to crawl through your archives collecting link data and indexing your posts while at the same time ignoring things that could result in duplicate content.
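To make the difference concrete, here's a toy simulation in Python. Everything in it is an assumption for illustration: the four-page site, the paths, and the crawl function are invented, and a real search engine is far more involved. It only models a robot that skips disallowed paths entirely and otherwise honours noindex and nofollow.

```python
from collections import deque

# Hypothetical toy site: path -> (robots meta content, outgoing links).
# The posts are reachable only through the monthly archive page.
SITE = {
    "/": ("index, follow", ["/2007/03/"]),
    "/2007/03/": ("noindex, follow", ["/post-a/", "/post-b/"]),
    "/post-a/": ("index, follow", []),
    "/post-b/": ("index, follow", []),
}


def crawl(site, start="/", disallowed=()):
    """Return the set of paths a rule-following robot would index."""
    indexed, seen, queue = set(), {start}, deque([start])
    while queue:
        path = queue.popleft()
        # A robots.txt Disallow means the page is never fetched,
        # so neither its meta tag nor its links are ever seen.
        if any(path.startswith(prefix) for prefix in disallowed):
            continue
        meta, links = site[path]
        directives = {part.strip() for part in meta.lower().split(",")}
        if "noindex" not in directives:
            indexed.add(path)
        if "nofollow" not in directives:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return indexed


# Meta tag approach: the archive stays out of the index, the posts get in.
print(sorted(crawl(SITE)))
# robots.txt approach: the archive is a locked door, so the posts are never found.
print(sorted(crawl(SITE, disallowed=("/2007/",))))
```

In this toy setup the meta tag run indexes the homepage and both posts, while the disallow run indexes only the homepage.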

The Robots Meta Tag in Action

You basically have four potential meta tags that can be placed in the head section of any given page.

  • <meta name="robots" content="index, follow" />
  • <meta name="robots" content="index, nofollow" />
  • <meta name="robots" content="noindex, follow" />
  • <meta name="robots" content="noindex, nofollow" />

The fourth and first can pretty much be ignored. If you don't want a page indexed and you don't want the connecting links followed then just add it to your robots.txt file and disallow it. If you want a page indexed and all its links followed just leave it be - the robot will do all the work. Now it's just a question of what you think you want accessible.

Thoughts on Content

Right now I allow just about everything to be indexed. The only things I've specifically disallowed are my tag lists. I did this primarily because I have no idea how to add a meta tag to the tag pages generated by Simple Tagging. When I figure it out I'll move to the Robots Meta tag.

My Archives and Categories are given "noindex, follow" values in the robots meta for the reasons I've outlined above. The Category pages and sub-pages are constantly being updated so at any given time the posts listed on them are different - by blocking the robots from indexing those pages I'm hoping to have my actual posts appear before my category pages when search results are displayed.
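As a rough sketch of that policy, here it is in Python rather than in a WordPress theme. The page type labels and the robots_meta_tag function are hypothetical names for illustration, not anything WordPress provides.

```python
# Page types that get "noindex, follow" under the policy described above.
# The labels are hypothetical; a theme would detect these its own way.
NOINDEX_FOLLOW_TYPES = {"archive", "category"}


def robots_meta_tag(page_type: str) -> str:
    """Build the robots meta tag for a page of the given type."""
    if page_type in NOINDEX_FOLLOW_TYPES:
        content = "noindex, follow"
    else:
        # Posts and everything else are left open to indexing.
        content = "index, follow"
    return f'<meta name="robots" content="{content}" />'


for page_type in ("post", "archive", "category"):
    print(page_type, robots_meta_tag(page_type))
```

Keeping the list of page types in one place makes it easy to add tag pages later, once there's a way to put a meta tag on them.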

Like I said, this is an afternoon's worth of reading and researching the topic. Anyone out there have any more suggestions about handling robots.txt and meta tags?


2 Responses to “Meta tags, robots.txt and Better Search Results with WordPress”

  1. March 25th, 2007 at 6:41 pm

    I had originally planned on just implementing the changes and leaving a comment on Connected Internet but ultimately it was more deserving of a full post.

    Don’t forget to leave a comment though!

    I started doing the research as well today but ran out of time that’s why I threw the question out in a post.

    - Everton
  2. March 25th, 2007 at 7:04 pm

    Hi Everton, thanks for stopping by.

    I actually did leave a comment too - this full post just kind of expands on some of what I found.

    - WildBil