
Automated Web Robots Revealed

Automated web robots are programs that automate some specific function on the web. For example, search engines use robots (also called internet bots) to visit your web pages and index them, or to update their index with the latest additions. Unfortunately, there may also be automated web robots that steal your content, interfere with interactive elements and consume large amounts of bandwidth.

How do I know if my site is being attacked by a robot?
There is no hard and fast rule for detecting one, but the following signs indicate a robotic presence (a rough log-scanning sketch follows the list):

1. Large numbers of requests coming from a single IP address, or from IP addresses within the same subnet.
2. Most requests are for data-driven content.
3. Requests made from browsers that do not support ASP sessions.
4. A sudden, large increase in traffic volume with no corresponding increase in business/transactions.
5. Large amounts of spam.
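As a rough illustration of the first sign, the short Python sketch below counts requests per client IP in an access log. The log format is an assumption (the IP address is taken to be the first whitespace-separated field, as in the common log formats); adjust it for your own server.

    import collections

    # Count requests per client IP in a web server access log.
    # Assumes the IP address is the first whitespace-separated field.
    def requests_per_ip(log_path):
        counts = collections.Counter()
        with open(log_path) as log:
            for line in log:
                fields = line.split()
                if fields:
                    counts[fields[0]] += 1
        return counts

    # Print the ten busiest client addresses; one address (or several from
    # the same subnet) dominating the log is a hint of a robot at work.
    for ip, hits in requests_per_ip("access.log").most_common(10):
        print(ip, hits)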

Do these robots really cause harm?

Yes, they do, at times quite considerably. These robots are often responsible for:
1. IP redirects, i.e. redirecting your users from your site to another site.
2. Increased web traffic, and thus bandwidth consumed by your site, which increases your costs.
3. Distorted statistics, stolen content or, worse, damaged content and broken interactive features on your site.
4. Republished content used to benefit from pay-per-click advertising.

What can I do to prevent such atrocities?

1. There is a standard for Robot Exclusion, the details of which are at http://www.robotstxt.org/wc/norobots.html. This standard proposes that a web server wishing to change the behavior of robots visiting the site should control that behavior through a robots.txt text file placed in the root of the web server. The problem with this is that harmful robots will not oblige.
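For example, a minimal robots.txt that asks all compliant robots to stay out of one part of the site might look like this (the directory name is only illustrative):

        User-agent: *
        Disallow: /private/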

2. You can use the robots meta tag in individual pages of the website. Place one of the following tags within the <head> element of the document.


 <meta name="robots" content="noindex">
                or
 <meta name="robots" content="nofollow">
                or
 <meta name="robots" content="noindex, nofollow">

3. Another useful approach is to make registration mandatory for accessing the valuable content of the site. But again, this might stop a search engine's web robots from indexing your site's content in their catalog. You can also include some mechanism for distinguishing between human and robotic visitors. A common means of achieving this is a graphical sequence of characters that a user has to type into the form before submission (i.e. a Captcha, see http://www.captcha.net/).
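As a minimal sketch of the registration idea, assuming a hypothetical in-memory store of session IDs issued after sign-up, a page handler could simply refuse to serve valuable content to anyone without a registered session:

    # Hypothetical store of session IDs handed out after registration.
    registered_sessions = set()

    def serve_article(session_id, article_html):
        # Unregistered visitors, human or robot, are sent to the sign-up
        # page instead of receiving the valuable content.
        if session_id not in registered_sessions:
            return "302 Found: redirect to /register"
        return article_html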

4. The following lines can be added to the robots.txt file to slow a robot down instead of stopping it altogether.

        User-agent: Slurp
        Crawl-delay: 10

The Slow Down Manager ASP.NET component within VAM: Visual Input Security can slow down any robot that makes repeated requests for pages, and it can be configured to deny access to those pages once a robot exceeds a certain number of requests. Further details about the Slow Down Manager are available from http://www.peterblum.com/VAM/VISETools.aspx#SDM.
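The same idea can be sketched in a few lines of generic code. This is not the Slow Down Manager's API, only an illustration of per-IP throttling; the thresholds and window are arbitrary examples.

    import time

    REQUEST_LOG = {}        # ip -> timestamps of recent requests
    WINDOW_SECONDS = 60     # look at the last minute of traffic
    SLOW_THRESHOLD = 30     # above this, delay the response
    DENY_THRESHOLD = 100    # above this, refuse to serve the page

    def throttle(ip):
        now = time.time()
        recent = [t for t in REQUEST_LOG.get(ip, []) if now - t < WINDOW_SECONDS]
        recent.append(now)
        REQUEST_LOG[ip] = recent
        if len(recent) > DENY_THRESHOLD:
            return "deny"       # e.g. answer with 403 Forbidden
        if len(recent) > SLOW_THRESHOLD:
            time.sleep(2)       # make the robot wait before it gets the page
        return "serve"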

5. Robots are rarely able to execute JavaScript, so making the registration or login process rely on the execution of a particular JavaScript function can also help.
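One possible sketch of that idea, assuming a hypothetical hidden form field named js_check that a small script on the page fills in just before submission:

    # The login page would carry markup along these lines; a client that
    # never runs the script submits an empty js_check field.
    LOGIN_FORM = """
    <form method="post" action="/login"
          onsubmit="document.getElementById('js_check').value = 'human';">
      <input type="hidden" name="js_check" id="js_check" value="">
      <input type="submit" value="Log in">
    </form>
    """

    def looks_like_robot(form_fields):
        # Robots that did not execute the JavaScript leave the field empty.
        return form_fields.get("js_check", "") != "human"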

This list is not exhaustive; many more valuable suggestions are available on the net. You should choose a solution that suits your needs. This article merely tries to draw your interest and attention to a problem that has been a nightmare for many webmasters.