For the past few days the web server at the business I contract for has been taking a pounding. Why, I don't know. The bandwidth used on Tuesday was 20MB, plus 7MB for known spiders. Whilst this may not sound like much over the space of a day, a lot of it happened within a small timeframe. Our server also runs its own connection, so down the same ADSL line go site visitors as well as the staff's own internet browsing and email, and another office connecting to other servers.

I spent a lot of yesterday analysing the raw logs, trying to find out why the server slowed so much that even internal connections weren't being made and it had to be rebooted. Besides the MSNbot racking up 5MB of bandwidth, I found a few bots in the visitor stats that were running page views at about 10 per second, which isn't good. Couple that with the spiders, NTL constantly caching the site with their bots, and human visitors getting frustrated and hitting refresh, and I'm not surprised the server died.
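For anyone wanting to do the same kind of digging, here is a rough sketch of how the peak hit rate per bot could be pulled out of a raw log. It assumes the Apache "combined" log format, which may not match your server, and the sample lines and the ExampleBot name are made up purely for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def peak_hits_per_second(lines):
    """Return {user_agent: highest number of requests seen in any one second}."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            # Log timestamps have one-second resolution, so the raw
            # timestamp string works as the bucket key.
            per_second[(match["agent"], match["time"])] += 1
    peaks = {}
    for (agent, _second), hits in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), hits)
    return peaks


# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [03/Jan/2006:10:15:01 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '10.0.0.1 - - [03/Jan/2006:10:15:01 +0000] "GET /b.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '10.0.0.1 - - [03/Jan/2006:10:15:01 +0000] "GET /c.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '10.0.0.2 - - [03/Jan/2006:10:15:02 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Busiest agents first.
for agent, peak in sorted(peak_hits_per_second(sample).items(), key=lambda kv: -kv[1]):
    print(f"{peak:3d} hits/sec at peak  {agent}")
```

To run it against a real log you would pass an open file instead of the sample list, since the function only needs something it can loop over line by line.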

So I started to look at the robots.txt file and what else I could put into it. For those of you who don't know, the robots.txt file is a simple text file that sits in the root directory of your web site. If you don't have one then some spiders may not spider your site (not the major ones, but possibly some smaller ones), and every request for the missing file will add to the 404 errors in your web stats. Even if you don't want to control the spiders at all, I always recommend uploading, as I do myself, a file containing the following:

User-agent: *
Disallow:

This tells all spiders that they can spider the entire web site.
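If you want to convince yourself that an empty Disallow really does mean "everything is allowed", one way is to feed the file to Python's standard urllib.robotparser module. This is only a sketch: www.example.com, the paths and the ExampleBot name are placeholders, not anything from the real site.

```python
from urllib.robotparser import RobotFileParser

# The allow-everything file from above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Placeholder host, paths and bot name, used only to exercise the rules.
for path in ("/", "/pdfs/brochure.pdf", "/cgi-bin/form.pl"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Every path should come back as allowed, whichever bot name you ask about.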

However, this can be expanded on. Yesterday I noticed that the PDFs on the web site were over 1MB in size, so I figured that I'd prevent the spiders from trying to cache those. Along with blocking the cgi-bin directory, this is what I now have:

User-agent: *
Disallow: /cgi-bin/
Disallow: /pdfs/

Note that each directory name is given relative to the root directory, hence the leading forward slash. This tells any spider that adheres to the robots.txt file not to enter those directories or index anything in them.
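The same standard-library parser can be used to check what those two Disallow lines actually block. Again this is just a sketch, with www.example.com, ExampleBot and the file names as made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /pdfs/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Placeholder paths: two inside the blocked directories, two outside them.
checks = ["/index.html", "/pdfs/brochure.pdf", "/cgi-bin/form.pl", "/pdfs"]
for path in checks:
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(f"{path:22} {'allowed' if allowed else 'blocked'}")
```

The rules are simple prefix matches, which is why the trailing slash matters: with this parser, /pdfs/brochure.pdf is blocked but a request for /pdfs without the slash is not.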

To me the main problem has been seeing statistics where a spider was hitting different pages at over 10 pages per second. MSNbot is a major culprit, and whilst MSN have slowed their bot down a little (yes, it was worse at first!) it's still quite intensive. So what can be done? Well, I found a little extra that can go into a robots file: the Crawl-delay line. It isn't part of the original robots.txt standard, so some spiders will follow the rule and some won't. It lets you ask an individual spider to slow down by giving the number of seconds it should wait between hits to the server. Whilst 1 second would probably be fine, I figured that 10 seconds was good enough. From what I've read this has to be done per spider, and unfortunately you cannot rely on the asterisk to cover them all.

To determine which spiders/bots you need to target you will need to check your statistics. I found two extras in the raw logs yesterday and have added them to the file. I know the OmniBot will follow the rule, but from what I've read I'm not sure about the DTAAgent. Its visit was very resource intensive though, so I figured it was worth a shot.

User-agent: msnbot
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: Googlebot
Crawl-delay: 10

User-agent: DTAAgent
Crawl-delay: 10

User-agent: OmniExplorer_Bot
Crawl-delay: 10
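As a quick sanity check that the records are well formed, the file can be run through Python's urllib.robotparser, which reads Crawl-delay values (Python 3.6 and later). This is only a sketch: it shows what a parser reads out of the file, not whether any of these bots will actually honour it.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: Googlebot
Crawl-delay: 10

User-agent: DTAAgent
Crawl-delay: 10

User-agent: OmniExplorer_Bot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Look up the delay each listed spider would be asked to observe.
agents = ["msnbot", "Slurp", "Googlebot", "DTAAgent", "OmniExplorer_Bot"]
delays = {agent: parser.crawl_delay(agent) for agent in agents}

for agent, delay in delays.items():
    print(f"{agent:18} {delay} seconds between hits")
```

A well-behaved crawler would then wait that many seconds between requests; a badly behaved one will simply ignore the line.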

On Tuesday the site used 20MB of bandwidth. On Wednesday, with the same number of visitors but more page views and hits, it used 9MB. I doubt the file had much to do with that, considering those statistics do not include the search engine spiders, but I'm still happy that I added what I have to slow some of the spiders down.

Trackback URL for this post: http://www.stuffbysarah.net/2006/01/05/bots-spiders-and-bandwidth/trackback/
  • Sarah Comments:

    Just an update as I can see a lot of people coming to this page also searching for information on the DTAAgent. DTA stands for "Distributed Traffic Agent". One page I've found on it is from a bank that uses DTAs to pass images back and forth. I'm still not sure how this is connected to 192.com, but apparently it is. I'll keep digging however, to see what I can come up with.

  • Richard Comments:

    I have been ravaged by the Janet Bot. Not funny
