Spam Fighting
Wed, 22 November, 2006 – 6:41 pm
This month has seen a major increase in visitors yet no increase in referrals from other sites. Whilst I'd love to be getting all of these daily visitors, I felt that I needed to look into the figures as things looked a bit fishy, especially as the most viewed page on the site has been a redundant photo gallery which I installed to review for someone.
So I installed a copy of Trace Watch, which was suggested for another job by Matt last month. Whilst I don't think it'll do the job I need for my work's sites, it's done a good job for me in tracking the activity on this site. I started to see similar user agents and proxies being used to access various parts of the site, posts that are months old and not that specific being constantly accessed, and unnatural paths being taken through the site, i.e. pages that aren't linked together being visited one after the other. Typical bot activity that knows exactly where to go.
So first up was a common user agent that a lot of them had. This was identified as Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90). Whilst this could be a human, it would be a small minority as it's MSIE 5.5 and Windows 98, both of which are very uncommon these days, so sorry to anyone who does use it! Anyhow, after a search around the web I came across a post on blocking spambots using htaccess. Unfortunately my htaccess skills aren't very good, so with the help of Khalid I now have an additional line of
RewriteCond %{HTTP_USER_AGENT} MSIE\ 5\.5;\ Windows\ 98;\ Win\ 9x\ 4\.90\)$ [NC]
in my htaccess, placed before the RewriteRule that returns a 403 (Forbidden). If anyone wants to see the whole code let me know.
I then continued to look and discovered that this redundant photo gallery (which was deleted at the start of the week) was actually full of spam comments. It was stupid of me not to look at it once in a while, and it had been filled with pretty dire comments (I've only just worked this out after looking at a cached version of a page in a search engine). Anyway, with WordPress, if you go to a page that doesn't exist it returns the WordPress 404 page. However, the only things going to this removed gallery are bots, so I'd rather they just go away. To combat this I've simply used a 410, which will also tell search engine spiders that the pages have gone.
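For the user agent block mentioned above, here is a minimal sketch of how that condition can be paired with a rule that sends the 403. It assumes mod_rewrite is enabled, and the RewriteRule line is an illustration rather than the exact rule from this site's htaccess:

```apache
# Assumes mod_rewrite is available; place this above the WordPress rewrite block
RewriteEngine On

# Match the suspicious user agent string (spaces, dots and the closing bracket escaped)
RewriteCond %{HTTP_USER_AGENT} MSIE\ 5\.5;\ Windows\ 98;\ Win\ 9x\ 4\.90\)$ [NC]

# Illustrative rule: leave the URL alone (-) and answer 403 Forbidden ([F])
RewriteRule ^ - [F]
```

The condition only applies to the RewriteRule directly after it, so the order matters. If it sits below the WordPress rules, WordPress will most likely handle the request first and the bot never sees the 403.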
410 is similar to 404 in that it returns a page with a notice on it. 404 of course says the page cannot be found, whereas 410 is a lot less well known and informs the visitor or bot that the page used to be there and has now been permanently removed. Note this is different to using a 301, which tells the bot that the page has been permanently moved to another location. A 410 simply says this page no longer exists, end of story. You can use a 410 in much the same way as a 301, except you don't specify a second URL to redirect to, i.e.
RedirectMatch 410 ^/scooch$
So if you now go to http://www.stuffbysarah.net/blog/scooch you'll get a return of 410 Gone. It stops the bots in their tracks and will also start to ease the bandwidth on this site a little.
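To make the 301 versus 410 difference a bit more concrete, here is a rough side by side sketch using mod_alias. The /old-gallery and /new-gallery paths are made up for illustration, and the last line is an assumed variation for catching anything underneath the gallery folder, not something taken from this site's htaccess:

```apache
# 301: permanently moved, so a target URL is required (hypothetical paths)
RedirectMatch 301 ^/old-gallery$ /new-gallery

# 410: gone for good, so there is no target URL
RedirectMatch 410 ^/scooch$

# Assumed variation: also return 410 for anything below the folder,
# e.g. /scooch/ or /scooch/some-page
RedirectMatch 410 ^/scooch(/.*)?$
```

The word gone can be used in place of the 410 status code if you find that easier to read.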

