We all learned that the robots.txt file is used to block search engines from crawling any page, file, or resource of a website. But in this article, I will explain that robots.txt does not completely keep a page out of search results. In the previous article, you got details about how to solve the 404 Not Found error in Google Search Console. You can consider this article the second part of that one.
First, we need to understand the basic functionality of the robots.txt file and how it works.
The file uses the Robots Exclusion Standard, a protocol with a small set of commands that can be used to indicate access to your site by section and by specific kinds of web crawlers (such as mobile crawlers vs. desktop crawlers). The robots.txt file, present in the root directory of your website, indicates which parts of the website (pages, images, resources) you want to hide from search engine crawlers.
That means when a search engine bot is about to crawl your website, it first visits the robots.txt file and takes instructions from it about which files it should not visit. Most reputable crawlers follow the instructions given in the robots.txt file, but not all crawlers do.
Let's take an example: I want to block access to the about-us.html page for all search engines. The robots.txt file will look as given below.
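Reconstructing the file from the two directives explained below, a minimal robots.txt for this example would be:

```
User-agent: *
Disallow: /about-us.html
```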
When a search engine crawler accesses this robots.txt file, it will get the following instructions:
- User-agent: * means the given rules apply to all search engine crawlers.
- Disallow: /about-us.html means all user agents (bots) are not allowed to visit the about-us.html page.
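You can check how a well-behaved crawler interprets these rules with Python's standard `urllib.robotparser` module. This is just a sketch using the example rules from this article; www.xyz.com is the placeholder domain used throughout.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt rules from this article
rules = """\
User-agent: *
Disallow: /about-us.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# about-us.html is disallowed for every user agent
print(parser.can_fetch("*", "https://www.xyz.com/about-us.html"))  # False

# Any other page is still allowed
print(parser.can_fetch("*", "https://www.xyz.com/contact.html"))   # True
```

This mirrors what a respectful crawler does: it fetches robots.txt first and asks "may I fetch this URL?" before crawling each page.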
Robots.txt file doesn't hide a web page from Google search results
For website pages, the robots.txt file only instructs the crawler not to crawl a given page. It just controls crawling traffic, typically because you don't want your server to be overwhelmed by Google's crawler or to waste crawl budget on unimportant or similar pages on your site. It is advised never to use the robots.txt file to hide website pages from Google search results, because although the crawler will not crawl the blocked page, the search engine can still index your page via other pages that link to it. Let's understand this with an example.
I want to block access to the about-us.html page of my website www.xyz.com for all search engines. There is another website, www.abc.com, which links to my about-us page. My website has a robots.txt file with the following code.
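Based on the directives described in this article, the robots.txt on www.xyz.com would be:

```
User-agent: *
Disallow: /about-us.html
```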
When the search engine crawler visits my website, it will not crawl the about-us.html page due to the instructions given in the robots.txt file. But when it visits www.abc.com (which links to my about-us page) and finds the link, the crawler can still index my about-us.html page.
To completely keep your website page out of search results, you can use the following two methods:
- Use a noindex tag (<meta name="robots" content="noindex" />) on every page which you don't want to be indexed.
- Make the website page password-protected.
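As a sketch, the first method places the meta tag inside the page's head section (the page title here is just an illustrative placeholder):

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Tell crawlers not to index this page -->
  <meta name="robots" content="noindex" />
  <title>About Us</title>
</head>
<body>
  ...
</body>
</html>
```

Note that for the noindex tag to work, the page must not be blocked in robots.txt: the crawler has to be able to fetch the page in order to read the tag.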
When you should use the robots.txt file
- Image files: The robots.txt file works fine with image files and prevents images from appearing in Google search results. So you can use the robots.txt file to block unwanted images on your website.
- Resource files: Resource files such as scripts or style files can be blocked with the robots.txt file. But you need to be very careful while blocking resource files: block only unimportant files whose absence will not significantly affect how the pages load. If you block a resource file that makes it harder for the Google crawler to understand your page, your page ranking can be affected.
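As a sketch, rules for blocking images and resource files might look like the following (the paths /temp-images/ and /assets/drafts.js are hypothetical examples, not from this article):

```
User-agent: *
Disallow: /temp-images/
Disallow: /assets/drafts.js
```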
Limitations of Robots.txt file
This file comes with the following limitations.
- Robots.txt instructions are directives only: it totally depends on the crawler whether it follows the given instructions or not. The robots.txt file can't enforce its instructions on every search engine crawler. Almost every reputable search engine crawler follows the instructions in the robots.txt file, but other crawlers might not.
- Robots.txt file can't prevent references to your URLs from other sites: if a URL is blocked in the robots.txt file, the Google crawler will not crawl it directly, but if a link to that URL is found on another website, the URL can still be indexed by Google.
I suggest a robots.txt file for every website, because with this file you can instruct search engine crawlers which files and resources they don't need to crawl, which makes crawling of your website more efficient. It is also a decent idea to block a website page with the robots.txt file, but you need to take some additional steps for those unwanted pages which you really want to hide from search engine results. Please use the comment box to ask questions about this topic or provide any suggestions.