What is a Robots.txt File?
Robots.txt is a file that contains instructions for search engine crawlers, sometimes known as bots. It tells search engines whether or not to crawl specific pages or sections of a website. Major search engines such as Google, Bing, and Yahoo typically follow these instructions. In short, a robots.txt file tells Google and other web crawlers which pages you do not want crawled.
How Does a Robots.txt File Work?
Search engines have two primary functions:
- Crawling the web to discover content.
- Indexing that content so that searchers looking for information can find it in the search results.
Search engines crawl sites by following links from one website to another, eventually scanning billions of links and web pages.
After arriving at a website, a search crawler looks for a robots.txt file. If one is found, the crawler reads it before proceeding through the site's pages. The robots.txt file specifies how the search engine should crawl the site and can contain instructions that prevent a user-agent from crawling certain parts of it.
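For example, a crawler visiting a site might request the file at the site's root and find something like the following. This is only an illustrative sketch; example.com, the blocked path, and the sitemap URL are placeholders:
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml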
What is the Significance of the Robots.txt File?
The robots.txt file is quite significant for SEO in general. Larger websites in particular need robots.txt to give search engines specific instructions about which content they should not access.
There are three primary reasons why you should utilize a robots.txt file.
Block Non-Public and Duplicate Pages
You might not want some pages on your website to be indexed. For instance, you may have a staging version of a page, internal search results pages, duplicate pages, or even a login page. These pages need to exist, but they do not need to be discovered and indexed by search engines. This is an ideal situation for robots.txt: it keeps crawlers and bots away from these pages, as in the sketch below.
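A rough sketch of a robots.txt file for this situation might look like the following; the /staging/, /search/, and /login/ paths are hypothetical and would need to match your own site's structure:
User-agent: *
Disallow: /staging/
Disallow: /search/
Disallow: /login/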
Crawl Budget
The number of pages Google will crawl on your site at any given time is referred to as the “crawl budget.” The amount may vary depending on your site’s size, health, and number of backlinks. If you’re having difficulty getting all of your pages indexed, you may have a crawl budget issue, and pages that are not indexed will not rank for anything. The robots.txt file helps Googlebot (Google’s web crawler) skip irrelevant pages, so more of your crawl budget is spent on the pages that are important to you.
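For example, a shop with faceted navigation might keep crawlers away from filtered and sorted URL variations so the budget goes to the product pages themselves. The /shop/ path and the parameter names here are hypothetical:
User-agent: *
Disallow: /shop/*?sort=
Disallow: /shop/*?filter=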
Prevent Resource Indexing
You may wish Google to omit multimedia items such as PDFs, videos, and photos from search results. Perhaps you want to keep some resources private, or you want Google to prioritize more important content. In that scenario, the simplest way to keep these resources out of search results is to use robots.txt.
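A sketch of how that might look, assuming your documents live in a hypothetical /downloads/ directory:
User-agent: *
Disallow: /downloads/
Disallow: /*.pdf$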
Robots.txt File Syntax
A typical robots.txt file might consist of several components and aspects, such as:
- User-agent
- Command to “allow” or “disallow”
- Crawl-delay
- Sitemap
- Comments
User-Agent
Each search engine identifies itself with its own user agent. Website administrators can give special instructions to particular bots by writing separate groups of directives for each user agent in the robots.txt file.
For example, if you want a specific section of a website to appear in Bing search results but not in Google, you can put two instruction sets in the robots.txt file: one group starting with “User-agent: Bingbot” and the other with “User-agent: Googlebot.”
User-agent: bingbot
Allow: /Folder1/
User-agent: Googlebot
Disallow: /Folder1/
We can also use the asterisk (*) wildcard to give directives to all user-agents at once.
User-agent: *
Disallow: /
There are hundreds of different user agents; here are a few that are important for SEO:
| Search Engine | Type | User-agent |
| --- | --- | --- |
| Google | Desktop | Googlebot |
| Google | Smartphone | Googlebot |
| Google | News | Googlebot-News |
| Google | Videos | Googlebot-Video |
| Google | Images | Googlebot-Image |
| Google | AdSense Mobile | Mediapartners-Google |
| Google | AdSense Desktop | Mediapartners-Google |
| Bing | Desktop and Mobile | Bingbot |
| Bing | Predecessor of Bingbot | MSNBot |
| Bing | Images and Videos | MSNBot-Media |
| Bing | Bing Ads | AdIdxBot |
| Bing | Page Snapshots | BingPreview |
| Baidu | Desktop | Baiduspider |
| Baidu | Advertisements | Baiduspider-ads |
| Baidu | Image Search | Baiduspider-image |
| Baidu | News Search | Baiduspider-news |
| Baidu | Video Search | Baiduspider-video |
| Yandex | Desktop | YandexBot |
| Yandex | Mobile | YandexMobileBot |
| Yandex | All Crawling | Yandex |
| Yandex | Advertising | YandexDirect |
| Yandex | Media | YandexMedia |
| Yandex | Images | YandexImages |
| Yandex | Advertising | YaDirectFetcher |
| Yandex | News | YandexNews |
| Yandex | Calendar | YandexCalendar |
| Yahoo! | All Search | Slurp |
| DuckDuckGo | Search | DuckDuckBot |
| AOL | Search | aolbuild |
Disallow Directive
In the robots.txt file, Disallow is the most commonly used directive. It instructs a search engine not to crawl a given path or set of URLs. A file can contain one or more Disallow lines if you want to restrict access to several parts of a website.
Disallowed pages aren’t technically “hidden” – they are simply not crawled, so ordinary Google or Bing users are unlikely to ever see them.
If you want your entire website blocked from crawling by all search engines, you may use the following “disallow all” syntax in the robots.txt file.
User-agent: *
Disallow: /
The following tells all search engines that they may crawl the entire website; an empty Disallow line means nothing is restricted.
User-agent: *
Disallow:
The following tells Google’s crawler not to crawl your website.
User-agent: googlebot
Disallow: /
All search engines are directed not to visit the /wp-admin/ directory in this case.
User-agent: *
Disallow: /wp-admin/
In this case, all search engine bots except Bingbot are denied access to /wp-admin/. Bingbot follows its own, more specific group of rules instead, so it is only denied access to /Disclaimer/.
User-agent: *
Disallow: /wp-admin/
User-agent: bingbot
Disallow: /Disclaimer/
Allow Directive
The Allow directive can counteract a Disallow directive, for example when you disallow a directory but want a single file inside that directory to remain accessible. The robots.txt file would be:
User-agent: *
Disallow: /photos/
Allow: /photos/123.jpg
Here search engines can access /photos/123.jpg, but they can’t access any other file inside /photos/. Note that robots.txt paths are case-sensitive, so the Allow path must match the real URL exactly.
Be careful when combining Allow and Disallow directives with wildcards, since this may result in conflicting rules. The Allow directive is supported by both Google and Bing.
Crawl-Delay Directive
The Crawl-delay directive slows the rate at which a bot crawls your website. Because it is an unofficial directive, not all search engines will follow it.
Google and Baidu do not follow this directive, while Bing and Yahoo support it. Basic syntax:
User-agent: bingbot
Disallow: /wp-admin/
Crawl-delay: 5
Sitemap Protocol in Robots.txt File
The sitemap directive in the robots.txt file informs search engines where to find your XML sitemap and helps them easily identify the URLs on your website. It is an alternative to submitting sitemaps to search engines through their webmaster tools. However, we still strongly recommend submitting your sitemap to search engines directly.
The sitemap directive should be included at the bottom of your robots.txt file.
As an example, consider the following:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://itcheats.com/sitemap_index.xml
Comments
A robots.txt file can contain comments; however, they are only meaningful to humans, not bots, because crawlers ignore everything that follows a hash character (#).
Comments can be valuable for a variety of reasons, including:
- Explaining why certain rules exist.
- Indicating which sections of a website the rules apply to.
- Describing what the rules do.
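For instance, a commented robots.txt file might look like this; the paths and notes are purely illustrative:
# Keep internal search results out of crawlers' reach (applies to all bots)
User-agent: *
Disallow: /search/
# The staging area should never be crawled
Disallow: /staging/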
How to Create or Edit a Robots.txt File?
Create and Edit Robots.txt File via cPanel
Using cPanel, you can create a robots.txt file. Navigate to the cPanel of your hosting provider. Locate the file manager and go to the website’s root directory (public_html). There, you can either create or edit the file.
Create and Edit Robots.txt File on WordPress
Creating a robots.txt file for WordPress is easy. One convenient way to create or edit it is with the Yoast SEO plugin. Go to the Yoast dashboard, then to Tools and File Editor. The robots.txt file can now be viewed and edited from the WordPress dashboard.
Robots.txt File Generator
You can write your website’s robots.txt file by hand in a plain text editor such as Notepad, or use a free online robots.txt generator to create one for you.
How to Audit your Robots.txt File for Errors
Robots.txt errors can easily slip through the cracks, so keep a lookout for them. Check the “Coverage” report in Search Console regularly for robots.txt problems. Some of the errors you may encounter, what they mean, and how to resolve them are listed below.
Submitted URL Blocked by Robots.txt
This signifies that robots.txt has blocked at least one of the URLs in your submitted sitemap. If your sitemap was created properly and excludes canonicalized, noindexed, and redirected pages, then no submitted page should be blocked by robots.txt. If any are, find out which pages are affected and adjust your robots.txt file to remove the block for those pages.
You can use Google’s robots.txt tester to determine which directive is preventing access to the content. Just be cautious when doing so. It’s simple to make errors that affect other sites and files.
Blocked by Robots.txt
This signifies that you have robots.txt-blocked content that Google isn’t presently indexing. Remove the crawl block in robots.txt if this content is significant and should be indexed. If you used robots.txt to exclude content from Google’s index, remove the crawl block and replace it with a robots meta tag or an X-Robots-Tag HTTP header. That is the only way to ensure the content is kept out of Google’s index.
Indexed, though Blocked by Robots.txt
This implies that some of the robots.txt-blocked content is still indexed by Google. Again, if you want to remove this content from Google’s search results, robots.txt isn’t the way to go. To prevent indexing, remove the crawl block and use a meta robots tag or an X-Robots-Tag HTTP header instead.
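For reference, the two mechanisms mentioned above look roughly like this: a meta tag placed in the page’s HTML head, or a raw HTTP response header set by the server.
<meta name="robots" content="noindex">
X-Robots-Tag: noindex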
If you accidentally blocked this content and wish to retain it in Google’s index, remove the crawl block in robots.txt. This may help to increase the visibility of the content in Google searches.
What are Regular Expressions and Wildcards in the Robots.txt?
The robots.txt standard does not “officially” support regular expressions or wildcards, yet all major search engines recognize them. They can be used to match URL patterns, and the two primary characters that both Google and Bing follow are:
- The dollar sign ($), which marks the end of a URL
- The asterisk (*), a wildcard that can represent any sequence of characters
Examples of wildcards in the robots.txt file are included in the best-practices section below.
Robots.txt File Best practices
- Check that you are not blocking any content or sections of your website that you want crawled.
- Each directive should be used on a new line. Otherwise, it can confuse search engines.
Not Recommended
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Recommended:
User-agent: *
Disallow: /images/
Disallow: /directory/
- Place the robots.txt file in the root directory of a website and name it robots.txt, for example, https://domainname.com/robots.txt. If the robots.txt file is not in its default location, search engines will assume there are no instructions and will crawl your entire website.
- You can use the same User-agent multiple times. However, declaring it only once keeps things simple and lowers the chance of human error.
Not Recommended
User-agent: Googlebot
Disallow: /about
User-agent: Googlebot
Disallow: /Terms & Conditions
Recommended:
User-agent: Googlebot
Disallow: /about
Disallow: /Terms & Conditions
- To apply a directive to all user-agents and match URL patterns, use wildcards (*). If you wanted to block search engines from accessing URLs with parameters, for example, you could potentially list them one by one.
Not Recommended:
User-agent: *
Disallow: /List/shirts?
Disallow: /List/pants?
However, this is inefficient. It would be preferable to use a wildcard to simplify things.
Recommended:
User-agent: *
Disallow: /List/*?
The above example prevents all search engine bots from crawling all URLs containing a question mark in the /List/ subdirectory.
- You can use “$” to indicate the end of a URL. If you wish to prevent search engines from crawling all .png files on your site, for example, you could list them one by one.
Not Recommended:
User-agent: *
Disallow: /a.png
Disallow: /b.png
Recommended:
Instead, you can use this:
User-agent: *
Disallow: /*.png$
In this case, search engines are unable to access any URL that ends in .png. That means they can’t access /example.png, but they can reach /example.png?id=1234567 because that URL doesn’t end with “.png”.
- Use “#” to add comments. Search engine crawlers ignore anything that follows “#”, so you can use comments to leave notes about why a rule exists.
- Use different Robots.txt files for different subdomains.
Common Questions about Robots.txt Files
Maximum Size of a Robots.txt File
Google currently enforces a file size limit of 500 kibibytes (KiB) for robots.txt files.
Where is Robots.txt in WordPress?
In the same place as on every other website: yourdomain.com/robots.txt.
Is the Ordering of Rules Important?
The order of the rules in the file does not matter. When rules conflict, the rule with the longest matching path is applied, taking priority over less specific, shorter rules. If conflicting rules match paths of the same length, the least restrictive rule (Allow) is used.
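As a small illustration of this precedence, assuming a hypothetical /folder/ path:
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
Here /folder/page.html can still be crawled, because the Allow rule matches a longer path than the Disallow rule, even though the Disallow line appears first in the file.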
Can a Robots.txt File Prevent the Indexing of Content?
A robots.txt file cannot completely prevent content from being indexed; a page blocked by robots.txt can still end up in the index, for example when other sites link to it. The only reliable way to prevent your content from being indexed is to include a noindex meta tag on the page.
Robots.txt is a simple yet effective file. If used correctly, it can have a positive influence on SEO; if used carelessly, you’ll live to regret it.