How to Control Web Crawlers' or Web Robots' Access to Your Site


There are many reasons why you should control the access of web robots or web crawlers to your site. As much as you want Googlebot to visit your site, you do not want spam bots to come and gather personal information from it. Not to mention that every time a robot crawls your site, it also uses the website's bandwidth! In this post I explain how you can control the access of web robots to your site using a simple ‘robots.txt’ file.

What are Web Robots or Web Spiders? 

Web Crawlers

Web robots (also called bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated way. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages and make them available in their search databases.

Why use the ‘robots.txt’ file?

Googlebot crawls your site so it can provide better search results, but at the same time other spam bots may crawl it to collect personal information, such as email addresses, for spamming purposes. If you want to control the access of web crawlers to your site, you can do so with a ‘robots.txt’ file.


How do I create a ‘robots.txt’ file?

‘robots.txt’ is a plain text file, so you can use any text editor to create it. Note that the file name must be all lowercase: ‘robots.txt’, not ‘Robots.txt’.

‘robots.txt’ file format

Entries (rules) in the robots.txt file are written as ‘field’: ‘value’ pairs:

<field>: <value>

A simple robots.txt file uses the following three fields (a combined example is shown after the list):

User-agent: the web robot the following rule applies to.
Disallow: the URL you want to block the robot from accessing.
Allow: the URL you want to allow the robot to access.
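
Put together, a rule group looks like the example below (the ‘/admin/’ paths are made up for illustration): it blocks every robot from an ‘admin’ directory but still allows one page inside it. Allow is an extension honoured by Googlebot and most major crawlers; the original robots.txt standard only defines Disallow, so very old bots may ignore the Allow line.

User-agent: *
Disallow: /admin/
Allow: /admin/help.html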



The following prevents all robots from crawling your site (‘*’ means all robots and ‘/’ means the root directory):

User-agent: *

Disallow: /

The following blocks all robots from crawling the ‘private’ directory:

User-agent: *

Disallow: /private

The following prevents Googlebot-Image from indexing your images for Google Image Search. Use it if you do not want your images to appear in Google Image Search, or simply to save bandwidth.


User-agent: Googlebot-Image

Disallow: /

The following prevents all robots except Googlebot from crawling your site:

User-agent: *

Disallow: /

User-agent: Googlebot

Allow: /
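
If you want to check how a crawler will interpret your rules, Python's standard urllib.robotparser module can read a robots.txt file and answer whether a given user-agent may fetch a given URL. Below is a minimal sketch that feeds it the example rules above; www.yoursite.com is just a placeholder domain.

from urllib.robotparser import RobotFileParser

# The example rules from above, exactly as they would appear in robots.txt
rules = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse the rules without fetching anything

# Ask whether a given user-agent may fetch a given URL
print(parser.can_fetch("Googlebot", "https://www.yoursite.com/"))     # True
print(parser.can_fetch("SomeOtherBot", "https://www.yoursite.com/"))  # False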

Where to place the robots.txt file?

Place the robots.txt file in the root directory of your website, for example at www.yoursite.com/robots.txt, not in a sub-directory like www.yoursite.com/sub-directory/. On most hosts the root is the ‘public_html’ directory of your site.
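
A quick way to confirm the file is in the right place is to request it from the site root. This short Python check just fetches and prints it (www.yoursite.com is a placeholder for your own domain):

import urllib.request

# Placeholder domain: the file must be reachable directly under the site root
with urllib.request.urlopen("https://www.yoursite.com/robots.txt") as response:
    print(response.read().decode("utf-8"))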

You can verify that a bot visiting your site is really Googlebot by following the instructions on this page.
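
In short, Google's recommended check is a reverse DNS lookup on the requesting IP address, confirming that the host name ends in googlebot.com or google.com, followed by a forward DNS lookup to confirm that the name resolves back to the same IP. Here is a minimal sketch of that check in Python; the is_googlebot helper and the IP address in the example are just placeholders for illustration.

import socket

def is_googlebot(ip_address):
    """Reverse DNS lookup, domain check, then forward-confirm the result."""
    try:
        # Reverse DNS: what host name does this IP resolve to?
        host, _, _ = socket.gethostbyaddr(ip_address)
        if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
            return False
        # Forward DNS: does that host name resolve back to the same IP?
        _, _, forward_ips = socket.gethostbyname_ex(host)
        return ip_address in forward_ips
    except (socket.herror, socket.gaierror):
        # No reverse record, or the forward lookup failed
        return False

# Placeholder IP address copied from a server access log
print(is_googlebot("66.249.66.1"))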


