
So You Think You Can Crawl

I have written about crawlers before, ranting about the bad behavior of some of them, especially those run by “marketing sites” that try to find out information about your site to resell to other companies (usually under the idea that they’ll find out who’s talking about their products).

Now, I don’t have a website with so many users that it could be taken down by crawlers, but I just don’t like wasting time, disk space and bandwidth on software that does neither me nor anybody else any good.

I don’t usually have trouble with newly-created crawlers and search engines, but I do have a problem when their crawlers hit my websites without following at least some basic rules of decency:

  • give me a way to find out who the heck you are, like a link to a description of the bot — *in English, please*; like it or not, it is the international language;
  • let me know what your crawler feeds: give me a sample search or something; even if your business is reselling services, it shouldn’t be impaired by letting everybody run the same search;
  • if your description explicitly states that robots.txt is supported, make sure you’re actually fetching it; just the other day I had one crawler trying to fetch each and every article on my blog without ever having fetched robots.txt, even though the crawler’s website stated that it was supported;
  • support deflate compression! XHTML is an inherently redundant language, and deflate compression works miracles on it; even more so on pages that contain RDF information (as you’re likely to repeat the content of other tags in a semantic context); the crawler above claimed to be dedicated to fetching RDF information and yet didn’t support deflate;
  • don’t be a sore loser: if you state (again, that’s the case for the crawler above) that you always wait at least two seconds between requests, don’t start fetching without any delay at all as soon as I start rejecting your requests with 403;
  • provide a support contact: I might be interested in allowing your crawler, but I want it to behave first;
  • support proper caching: too many feed fetchers seem to ignore the ETag and If-Modified-Since headers, which gets pretty nasty, especially if you publish a full-content feed; worse, if you support neither these nor deflate, your software is likely to get blacklisted (see the fetcher sketch after this list for how these points fit together);
  • make yourself verifiable via Forward-confirmed reverse DNS (FCrDNS); as I said in another post of mine, most search engine crawlers already follow this idea, it’s easy to implement with ModSecurity, and it’s something that even Google suggests webmasters do; now, a few people misunderstand this as a security protection for the websites, but that couldn’t be farther from the truth: the real reason is that by making your crawler verifiable, you won’t risk being hit by sanctions meant for crawlers trying to pass themselves off as you; this is of course nearly impossible if your crawler runs on EC2 (a sketch of the check itself also follows the list).
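
To make the fetching points above concrete, here is a minimal sketch (in Python, assuming the third-party requests library) of a crawler loop that fetches robots.txt before anything else, advertises deflate/gzip support, reuses ETag and If-Modified-Since validators on revisits, and honours the delay it advertises. The bot name, URLs and two-second delay are made-up placeholders for illustration, not anything a real crawler mandates.

```python
import time
from urllib import robotparser

import requests  # assumed third-party HTTP library

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot.html)"  # hypothetical bot
CRAWL_DELAY = 2  # the delay the bot description claims to honour (seconds)


def polite_fetch(url, cache, robots):
    """Fetch url only if robots.txt allows it, using conditional requests."""
    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed: skip the page instead of hammering the server

    headers = {
        "User-Agent": USER_AGENT,
        # XHTML (and RDF-heavy pages) compress very well, so accept compressed bodies
        "Accept-Encoding": "gzip, deflate",
    }
    # Reuse the validators from the previous visit so unchanged pages cost almost nothing
    etag, last_modified = cache.get(url, (None, None))
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=30)
    time.sleep(CRAWL_DELAY)  # wait the delay you advertise, even after errors
    if resp.status_code == 304:
        return None  # not modified: nothing to re-process
    if resp.status_code == 403:
        raise RuntimeError("rejected by the site; back off, don't retry faster")

    cache[url] = (resp.headers.get("ETag"), resp.headers.get("Last-Modified"))
    return resp.text


if __name__ == "__main__":
    robots = robotparser.RobotFileParser("https://example.com/robots.txt")
    robots.read()  # fetch robots.txt *before* any article
    cache = {}
    print(polite_fetch("https://example.com/articles/1", cache, robots))
```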
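
And here is a sketch of the FCrDNS check itself, the way a webmaster (or a crawler author testing their own setup) might run it against a client IP: reverse-resolve the address, check the hostname against the suffixes the crawler operator documents, then resolve it forward again and confirm the original IP comes back. The googlebot.com/google.com suffixes are just an illustrative example.

```python
import socket


def fcrdns_ok(client_ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    """Forward-confirmed reverse DNS: reverse lookup, suffix check, forward confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    if not hostname.endswith(allowed_suffixes):
        return False  # hostname doesn't belong to the claimed crawler's domain
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward (A) lookup
    except socket.gaierror:
        return False
    return client_ip in forward_ips  # the forward lookup must confirm the reverse
```

This is also why the check is out of reach for crawlers hosted on EC2: the reverse records for those addresses point at generic amazonaws.com hostnames the crawler operator doesn’t control.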

Maybe rather than SEOs – by the way, is it just me who dislikes this term and finds that most people describing themselves that way are just trying to put themselves on the same level as CEOs? – we should have Crawler Experts running around to fix all the crappy crawlers that people write for their own startups.
