Identifying Search Engine Spiders : Banning spiders and agents

Banning spiders and agents

If you notice entries like Teleport Pro and WebStripper in your traffic reports, someone has been trying to download your entire web site. You don't have to sit back and let this happen. If your site is commercially hosted, you can usually add entries to your robots.txt file to discourage repeat offenders from stripping your site.
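If your traffic reports don't list agent names directly, a short script can tally them from the raw access log. The following is a minimal sketch in Python, assuming each log line ends with the user agent as its last double-quoted field (as in the common "combined" log layout); the function name and the sample lines are made up for illustration, and your server's log format may differ.

```python
import re
from collections import Counter

# Assumption: the last double-quoted field on each line is the user agent,
# as in the common "combined" log layout. Adjust the pattern for your server.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_agents(log_lines):
    """Tally how many requests each user-agent string made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [01/Jan/2003:10:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Teleport Pro/1.29"',
    '192.0.2.1 - - [01/Jan/2003:10:00:01 +0000] "GET /a.html HTTP/1.0" 200 734 "-" "Teleport Pro/1.29"',
    '192.0.2.7 - - [01/Jan/2003:10:00:05 +0000] "GET / HTTP/1.0" 200 512 "-" "Mozilla/4.0 (compatible)"',
]

# Print the busiest agents first.
for agent, hits in count_agents(sample).most_common():
    print(hits, agent)
```

An agent that requests a large number of pages in quick succession is a likely candidate for a robots.txt entry.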

The robots.txt file gives search engine spiders and other agents direction by telling them which directories and files they may examine and retrieve. This set of rules is known as the Robots Exclusion Standard. Compliance is voluntary: well-behaved agents honor the file, but nothing forces them to.

To ask a particular agent or spider to stay out of every part of your web site, enter the following lines in the robots.txt file:

User-agent: NameOfAgent
Disallow: /
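One way to sanity-check a rule like this before uploading it is to run it through the urllib.robotparser module in Python's standard library. This is only a sketch: WebStripper stands in for NameOfAgent, SomeBrowser is a hypothetical agent with no entry of its own, and real agents may interpret the file differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, with WebStripper standing in for NameOfAgent.
RULES = [
    "User-agent: WebStripper",
    "Disallow: /",
]

def is_allowed(agent: str, path: str) -> bool:
    """Return True if RULES let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(agent, path)

# The banned agent, then a hypothetical agent that has no entry.
print("WebStripper allowed:", is_allowed("WebStripper", "/index.html"))
print("SomeBrowser allowed:", is_allowed("SomeBrowser", "/index.html"))
```

The banned agent should be refused everywhere, while an agent with no matching entry stays unrestricted.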

Enter the agent's name as it appears in your reports or logs (e.g. Teleport Pro), and give each agent its own entry, with a blank line between entries. The standard recommends that robots match on the name without version information, so a version suffix such as the /1.29 in Teleport Pro/1.29 is best left out. You could exclude search engine spiders the same way, but somehow I don't think you'll really want to do this :0). The "/" in the above example disallows access to every path on the site. You can also keep spiders and agents out of particular directories only, e.g.

User-agent: *
Disallow: /cgi-bin/

In this example the asterisk (wildcard) in the User-agent line means "all agents". Don't use an asterisk in the Disallow line to mean "everything"; use the forward slash instead.
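The effect of the directory rule can be checked the same way with Python's standard urllib.robotparser module. In this sketch AnyAgent and both paths are made-up values, and the result reflects how Python's parser reads the rule, which may not match every spider.

```python
from urllib.robotparser import RobotFileParser

# The wildcard rule from the example above.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# AnyAgent and the two paths are hypothetical, for illustration only.
cgi_allowed = parser.can_fetch("AnyAgent", "/cgi-bin/form.pl")
page_allowed = parser.can_fetch("AnyAgent", "/index.html")

print("cgi-bin script allowed:", cgi_allowed)
print("ordinary page allowed:", page_allowed)
```

Only paths that begin with /cgi-bin/ are refused; the rest of the site remains open to every agent.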

If you don't have a robots.txt file, create one in a plain-text editor such as Notepad and upload it to the docs directory (or whichever directory is the root of your web site), so that it is served as /robots.txt. Never use a blank robots.txt file, as some search engines may see this as an indication that you don't want your site spidered at all! Have at least one entry in the file.
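If you have several agents to list, the file can also be generated rather than typed by hand. The sketch below is one possible approach in Python; build_robots_txt is a hypothetical helper, and the agent names are simply the two mentioned earlier, so substitute the ones from your own logs.

```python
def build_robots_txt(agents: list[str]) -> str:
    """Build robots.txt text that bans each named agent from the whole site."""
    if not agents:
        # Refuse to produce a blank file.
        raise ValueError("need at least one agent so the file is not blank")
    entries = [f"User-agent: {name}\nDisallow: /" for name in agents]
    # One entry per agent, separated by a blank line.
    return "\n\n".join(entries) + "\n"

# Example agent names; replace with the ones found in your own logs.
text = build_robots_txt(["Teleport Pro", "WebStripper"])
print(text)

# To save the result for uploading (the location is up to you):
# from pathlib import Path
# Path("robots.txt").write_text(text)
```

The helper raises an error for an empty list so that it never writes out a file with no entries.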

Unfortunately, listing web stripper agents and spiders in your robots.txt file won't work in every case: some mirroring applications can mimic web browser identifiers, and others simply ignore robots.txt. Still, it offers some protection and may save you valuable bandwidth.

If you can't create a robots.txt file, which is usually the case on a free hosting service, this article may be useful:

http://www.tamingthebeast.ne...


Michael Bloch
www.tamingthebeast.net
Taming the Beast.net
http://www.tamingthebeast.ne...
Tutorials, web content, tools and software.
Web Marketing, eCommerce & Development solutions.
_____________________________________________

Copyright information.... This article is free for reproduction but must be reproduced in its entirety & this copyright statement must be included. Visit http://www.tamingthebeast.ne... to view great articles, tutorials and tools for site owners, web developers and Internet marketers! Subscribe for free to our popular ecommerce/web design ezine!
