How to Control Web Crawlers’ or Web Robots’ Access to Your Site
There are many reasons why you should control the access of web robots or web crawlers to your site. As much as you want Googlebot to visit your site, you do not want spam bots to come and harvest personal information from it. Not to mention that when a robot crawls your site, it also uses the website’s bandwidth! In this post I explain how you can control the access of web robots to your site using a simple ‘robots.txt’ file.
What are Web Robots or Web Spiders?
Web robots (also called bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages and keep their data up to date.
Why use the ‘robots.txt’ file?
Googlebot crawls your site to provide better search results, but at the same time other bots, such as spam bots, may collect personal information like email addresses for spamming purposes. If you want to control the access of web crawlers to your site, you can do this by using the “robots.txt” file.
How do I create a ‘robots.txt’ file?
‘robots.txt’ is a plain text file. Use any text editor to create a ‘robots.txt’ file.
‘Robots.txt’ file format
Entries (rules) in the robots.txt file are written as ‘field: value’ pairs.
A simple robots.txt file uses the following three fields:
User-agent: the web robot the following rule applies to.
Disallow: the URL you want to block the robot from accessing.
Allow: the URL you want to allow the robot to access.
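To make the ‘field: value’ format concrete, here is a minimal Python sketch that splits robots.txt lines into pairs. The helper name parse_rule and the sample lines are made up for illustration; it assumes that field names are case-insensitive and that ‘#’ starts a comment.

```python
def parse_rule(line):
    """Split one robots.txt line into a (field, value) pair, or return None."""
    # Drop any trailing comment and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
        return None  # blank line, comment-only line, or not a rule
    field, value = line.split(":", 1)
    # Lowercase the field name so 'User-Agent' and 'user-agent' compare equal.
    return field.strip().lower(), value.strip()


# Hypothetical sample lines using the three fields described above.
sample = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /private/public.html  # one page stays open",
]
pairs = [parse_rule(line) for line in sample]
print(pairs)
```

Each rule becomes a simple pair such as ('disallow', '/private/'), which is all a robot needs in order to decide whether a URL is off limits.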
The following rules prevent all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory).
User-agent: *
Disallow: /
The following rules block all robots from crawling the ‘private’ directory.
User-agent: *
Disallow: /private/
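Before uploading a rule that blocks a directory, you may want to check it locally. The sketch below uses Python's standard urllib.robotparser module. The bot name AnyBot and the www.yoursite.com URLs are placeholders, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules that block every robot from the 'private' directory.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "AnyBot" is a made-up robot name used only for this check.
private_ok = parser.can_fetch("AnyBot", "https://www.yoursite.com/private/data.html")
home_ok = parser.can_fetch("AnyBot", "https://www.yoursite.com/index.html")
print("private page allowed:", private_ok)
print("home page allowed:", home_ok)
```

If the private page is reported as allowed, the path in the Disallow line probably does not match the directory name exactly.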
The following rules prevent Googlebot-Image from indexing your images for image search.
User-agent: Googlebot-Image
Disallow: /
Use them to save bandwidth if you do not want your images to be available in Google Image Search. Read the post on reducing bandwidth usage to learn more.
The following rules prevent all robots except Googlebot from crawling your site.
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
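Rules that treat one robot differently from the rest are easy to get wrong, so a quick local check can help. This sketch again relies on Python's standard urllib.robotparser module. SomeOtherBot is a made-up name standing in for any robot that is not Googlebot, and the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Rules that allow Googlebot everywhere and block every other robot.
rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "https://www.yoursite.com/index.html"  # placeholder URL
googlebot_ok = parser.can_fetch("Googlebot", page)
other_ok = parser.can_fetch("SomeOtherBot", page)  # hypothetical robot name
print("Googlebot allowed:", googlebot_ok)
print("SomeOtherBot allowed:", other_ok)
```

Keep in mind that robots.txt is only a request: well-behaved robots follow it, but a spam bot can simply ignore the file.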
Where to place the robots.txt file?
Place the robots.txt file in the root directory of your website. For example, place the file at www.yoursite.com, not in a sub-directory like www.yoursite.com/sub-directory. In most cases the root will be the “public_html” directory of your site.
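Because robots look for the file only at the root of the site, the robots.txt address can be worked out from any page address. Here is a small Python sketch of that idea; the function name robots_url and the example address are illustrative only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("https://www.yoursite.com/sub-directory/page.html")
print(location)
```

Whatever page you start from, the result points at the root, which is why a robots.txt file saved inside a sub-directory is never found.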
You can verify that a bot visiting your site is actually Googlebot by following the instructions on this page.
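A commonly described way to do this check is a reverse DNS lookup on the visiting IP address, followed by a forward lookup on the host name it returns. The Python sketch below assumes that approach. The function name is_googlebot and the list of accepted domain suffixes are illustrative choices, and the forward lookup used here handles IPv4 addresses only.

```python
import socket

# Assumed suffixes for host names that belong to Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        host = reverse(ip)[0]  # step 1: reverse DNS lookup, IP to host name
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # step 2: the host name must be in a Google domain
        return ip in forward(host)[2]  # step 3: forward lookup must return the same IP
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

The reverse and forward parameters exist only so the lookups can be swapped out when testing. In normal use you would call is_googlebot with just the IP address taken from your server log.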