Wednesday, January 21, 2009

Analyzing robots.txt: The White House and Search Engine Indexing

Jason Kottke's kottke.org has an interesting post comparing the robots.txt files on whitehouse.gov before and after the inauguration. The previous administration used a robots.txt file of more than 2,400 lines, which prohibited automated indexing of a wide variety of pages. The new administration's robots.txt has two lines.

The Bush White House's site robots.txt included lines like this:

Disallow: /911/911day/text
Disallow: /911/heroes/text

While the text of those pages isn't sensitive, a site may still prefer that the content it denies access to never becomes generally indexed and searchable.

This is a great reminder to security-minded administrators: relying on robots.txt to keep your content obscure only works with well-mannered robots. Anyone who reads their server logs closely will notice that many spiders do not actually heed robots.txt.
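The point is worth making concrete: robots.txt compliance happens entirely on the client side. A minimal sketch with Python's standard-library robotparser, using the Disallow lines quoted above as sample directives, shows that a "polite" crawler is simply one that chooses to check before fetching:

```python
from urllib.robotparser import RobotFileParser

# Sample directives modeled on the lines quoted from the old
# whitehouse.gov robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /911/911day/text
Disallow: /911/heroes/text
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-mannered crawler checks before each request...
print(parser.can_fetch("*", "/911/911day/text"))  # False: polite bots skip it
print(parser.can_fetch("*", "/briefing-room"))    # True: not disallowed

# ...but nothing on the server enforces this. A rude crawler simply
# never calls can_fetch() and requests the disallowed URL anyway.
```

Nothing stops a spider from skipping the `can_fetch()` check; the file is a request, not an access control.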

What would reconnaissance of your site reveal if someone reviewed your robots.txt, then pointed a web spider at your site with instructions to ignore it and index your data? Would that expose more information than you would like? Some sites rely on the obscurity of keeping their data out of search indexes for a modicum of security or privacy. Do you?
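In fact, robots.txt can actively aid reconnaissance: it hands a visitor a ready-made list of paths the site considers worth hiding. A minimal sketch of that first step (the sample text and the /secret/reports path are hypothetical; a real scan would start by fetching https://yoursite.example/robots.txt):

```python
# Reconnaissance sketch: extract every Disallow path from a robots.txt
# body. Parsing a sample string here keeps the snippet self-contained.
SAMPLE = """\
User-agent: *
Disallow: /911/911day/text
Disallow: /911/heroes/text
Disallow: /secret/reports
"""

def disallowed_paths(robots_txt):
    """Return the value of every non-empty Disallow directive."""
    paths = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

# Each extracted path is a candidate target for a spider that has been
# told to ignore the rules.
for path in disallowed_paths(SAMPLE):
    print(path)
```

Feeding that list straight into a crawler is trivial, which is exactly why a Disallow line is better thought of as a signpost than a lock.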
