Google Not Interpreting robots.txt Consistently

I had an issue where Googlebot was spidering parts of my site that were not allowed in the robots.txt file…

My old robots.txt file…

User-agent: *
Disallow: /tools/suggestion/?
Disallow: /search.php
Disallow: /go.php
Disallow: /scripts/
Disallow: /ads/

User-agent: Googlebot
Disallow: /ebay_

Hmmmm… that’s weird… Googlebot is still spidering stuff it shouldn’t be…

www.digitalpoint.com 66.249.66.138 - - [14/Mar/2006:06:21:07 -0800] "GET /ads/ HTTP/1.1" 302 38 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.digitalpoint.com 66.249.66.138 - - [14/Mar/2006:10:26:18 -0800] "GET /ads/ HTTP/1.1" 302 38 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.digitalpoint.com 66.249.66.138 - - [14/Mar/2006:14:29:35 -0800] "GET /ads/ HTTP/1.1" 302 38 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.digitalpoint.com 66.249.66.138 - - [14/Mar/2006:17:47:21 -0800] "GET /ads/ HTTP/1.1" 302 38 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So I made an inquiry to Google about this, and I actually heard back (nice!)…

While we normally don’t review individual sites,
we did examine your robots.txt file. Please be advised that it appears
your Googlebot entry in your robots.txt file is overriding your generic
User-Agent listing. We suggest you alter your robots.txt file by
duplicating the forbidden paths under your Googlebot entry:

User-agent: *
Disallow: /tools/suggestion/?
Disallow: /search.php
Disallow: /go.php
Disallow: /scripts/
Disallow: /ads/

User-agent: Googlebot
Disallow: /ebay_
Disallow: /tools/suggestion/?
Disallow: /search.php
Disallow: /go.php
Disallow: /scripts/
Disallow: /ads/

Once you’ve altered your robots.txt file, Google will find it
automatically after we next crawl your site.
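
The override Google describes can be reproduced locally. The following is a minimal sketch using Python's standard-library urllib.robotparser, which also applies only the group that names the crawler and falls back to the generic group otherwise. It is not Googlebot's actual parser; the helper name allowed() and the test agent "SomeOtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The old robots.txt from this post.
OLD_ROBOTS = """\
User-agent: *
Disallow: /tools/suggestion/?
Disallow: /search.php
Disallow: /go.php
Disallow: /scripts/
Disallow: /ads/

User-agent: Googlebot
Disallow: /ebay_
"""

# Google's suggested fix: repeat the generic paths inside the Googlebot group.
NEW_ROBOTS = OLD_ROBOTS + """\
Disallow: /tools/suggestion/?
Disallow: /search.php
Disallow: /go.php
Disallow: /scripts/
Disallow: /ads/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Old file: the Googlebot group wins, so the generic /ads/ rule is ignored.
print(allowed(OLD_ROBOTS, "Googlebot", "/ads/"))     # True
# A crawler with no group of its own still gets the generic rules.
print(allowed(OLD_ROBOTS, "SomeOtherBot", "/ads/"))  # False
# New file: /ads/ is now listed under Googlebot as well.
print(allowed(NEW_ROBOTS, "Googlebot", "/ads/"))     # False
```

Under this reading, the old file never blocked Googlebot from /ads/ at all, which matches the log entries above.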

Okay… I can live with that… easy fix. But check this out… Google’s own robots.txt testing tool within Google Sitemaps shows the old robots.txt blocking Googlebot as expected.

So how about some consistency here? And more importantly, if anyone at Google is reading this, how about someone tell me why my blog is banned in your index… 🙂

11 thoughts on “Google Not Interpreting robots.txt Consistently”

  1. Looks like the Googlebot Spider algorithm is doing a match on the user-agent first, and if it passes that test, then don’t bother checking for “*” – tsk, tsk …

  2. Looks like GoogleBot needs to pick it up and start recognizing * and not overwriting the new entries. That would be stupid if you needed to make separate entries just for Googlebot like that.

  3. In general, I believe that if there is
    – a generic User-Agent entry
    – a more specific Googlebot entry

    then we’ll go with the more specific entry for Googlebot. I’ll ask the Sitemaps team to check out the apparent difference between Sitemaps and Googlebot.

    Matt

  4. Could be that they’re using an older version of the file parser, and broke something in an update. Wouldn’t be the first time it’s happened to programmers ;P. Hmm, found your site by searching Google too… Maybe it was a temporary removal to immediately flush the offending information, pending the next crawl?

  5. Matt Cutts said: In general, I believe that if there is
    – a generic User-Agent entry
    – a more specific Googlebot entry
    then we’ll go with the more specific entry for Googlebot. I’ll ask the Sitemaps team to check out the apparent difference between Sitemaps and Googlebot.

    Actually, the correct way to do this is to check BOTH of them. I.e. if you are disallowed in either the generic OR the specific case, then the spider should not poke around … which is another way of echoing my first comment.
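
The two readings argued over in the comments can be put side by side. This is a rough sketch only: the GROUPS table and both function names are hypothetical, and matching is reduced to a plain prefix check on the paths from the old robots.txt. The first function follows the "most specific entry wins" behavior Matt describes; the second follows the "check both" behavior proposed in comments 1 and 5.

```python
# Rule groups copied from the old robots.txt in the post (keys lowercased).
GROUPS = {
    "*": ["/tools/suggestion/?", "/search.php", "/go.php", "/scripts/", "/ads/"],
    "googlebot": ["/ebay_"],
}


def blocked_most_specific(agent: str, path: str) -> bool:
    """Use only the group naming the agent; fall back to '*' if there is none."""
    rules = GROUPS.get(agent.lower(), GROUPS["*"])
    return any(path.startswith(prefix) for prefix in rules)


def blocked_either(agent: str, path: str) -> bool:
    """Block if the generic group OR the agent's own group disallows the path."""
    rules = GROUPS["*"] + GROUPS.get(agent.lower(), [])
    return any(path.startswith(prefix) for prefix in rules)


print(blocked_most_specific("Googlebot", "/ads/"))  # False
print(blocked_either("Googlebot", "/ads/"))         # True
```

Only the second reading would have kept Googlebot out of /ads/ with the old file, which is why the fix was to copy the generic paths into the Googlebot entry.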
