Robots.txt For SEO: The Ultimate Guide

Have you ever wondered what a robots.txt file is? Well, have no fear, as we explore all the components of this key SEO feature.

It’s a feature that lives on your website and is definitely worth having, as it comes recommended by one of the most important visitors to your website: Google.


And if your robots.txt file is misconfigured, Google and other crawlers may not crawl your website the way you want, and you’ll be at a major disadvantage against competitors in your niche who know what they are doing.

In fact, one single incorrect character can break an entire site’s crawlability, so it’s definitely worth understanding this fully before your SEO becomes unusable.

Expect to learn the best practices surrounding the robots.txt file, why you should care about having it in the first place, the limitations of using a robots.txt file, and what it looks like.

But first, let’s look at what it is in more detail, so you can better understand its functions and why it will boost your performance.

What Is A Robots.txt File?

A robots.txt file is a plain text file that tells search engines what your website’s rules of engagement are going to be.

SEO is a complicated topic that would take more than one article to explain fully, but the simple principle is that you want to send the correct signals to search engines.

Using a robots.txt file is one of the best ways to let search engines know your crawling preferences.

Website Crawling

For those who don’t know, website crawling is the automatic fetching of web pages by software, which indexes the content it finds so it can be searched for by people using the internet.

A crawler analyzes a page’s content, looks for links to other pages, and then fetches and indexes those as well.

There are two common types of crawling.

The first is known as a ‘site crawl’, which attempts to gather an entire site at one time, typically starting with the home page and following links from that page to discover the rest of the site’s content.

This is more commonly known as Spidering.

The second most common type is a page crawl, which is an attempt to crawl a single page, such as a blog post.

Different Types Of Crawlers

The method used to fetch a website’s content ultimately depends on the search engine and the technology behind it.

Some crawlers take a snapshot of a site, crawling it at a specific point in time. This is known as the brute-force approach because it attempts to crawl the entire site each time.

One of the downsides of this method is that it is slow and takes up a lot of resources, but the upside is that it captures the most accurate, up-to-date information on pages, so if your website is consistently updated, the changes become searchable sooner.

You can also get single-page crawls which, as the name suggests, are much more efficient and only crawl new content or content that has been updated recently.

There are a few ways to find new or updated content, including RSS feeds, sitemaps, ping services, and crawling algorithms that can detect updated information.

Google Updates

In 2019, there were a few updates to the robots.txt standard, which allowed Google’s interpreter to become more flexible and a lot more forgiving.

Another important change is that Google now errs on the side of caution and will assume that ambiguous sections should be classified as restricted rather than unrestricted.

Directives

Search engines check whether a website has a robots.txt file and decipher whether it contains any instructions about how they should crawl the website.

These instructions are known as directives.

If the search engine finds no robots.txt file or no applicable directives, it will simply crawl the entire website automatically.

You might be wondering: which search engines use this process?

All major search engines will respect the robots.txt file; however, some search engines may choose to ignore it, or at least parts of it, to some extent.

Robots.txt files are a powerful signal for search engines, and both sides should ultimately work with each other, but it’s worth pointing out that this file is a set of options and directives as opposed to a mandatory feature.

Why Is It Important And Why Should You Care?

From an SEO perspective, the robots.txt file is a vital part of your website that should not be overlooked. Letting a search engine know how best to check over your website can be the difference between traffic and tumbleweeds.

Using this file will stop search engines from crawling certain pages and documents that you don’t want to be found, help prevent duplicate content issues on your website, and, generally speaking, give search engines more information about how to crawl your website.

Just one word of warning: make sure that if you ever make changes to your robots.txt file you are as careful as possible, as one error can potentially cause important parts of the website to be inaccessible to search engines.

It can also be useful for preventing certain areas or documents from being crawled and indexed.

For example, if you have PDFs on your website or your website contains some form of staging site, you should plan carefully and decide what should not be shown by search engines; there is often content on a website that you do not want to be searchable, especially areas that contain links.

What Are The Most Common Issues With Robots.txt File?

One of the most common issues with this file is that it is often overused in an attempt to reduce duplicate content. This effectively kills internal linking, which is one of the cornerstones of an effective website.

We would recommend using it for files or pages that search engines should never see.

Some of the most common pages that shouldn’t come up on search engines are log-in pages that generate plenty of URLs, and any test areas or sections where faceted navigation can produce multiple URL variations.

Another common issue is the incorrect use of wildcards.

A wildcard is a pattern-matching character (*) that lets a single directive cover a whole set of URLs, such as every URL in a folder on your site, which is handy after a rebrand or when a new CMS introduces a completely different URL structure.
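
As an illustrative sketch (the .pdf extension here is just an example), the following uses the * wildcard, together with the $ end-of-URL marker that Google and Bing support, to block every PDF on a site regardless of which folder it sits in:

User-agent: *

Disallow: /*.pdf$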

It’s not rare to see parts of a site blocked off that were never actually intended to be blocked, and oftentimes directives will simply conflict with one another.

You also need to be careful when working as a team. For example, a developer might make a change without anybody knowing, and in changing the code they may also change the robots.txt file without you even being aware of it.

Also, some directives simply do not belong in a robots.txt file, because the file follows a web standard and is limited in what it supports. Making up directives that do not work can be a harmless action, but it can sometimes cause massive issues for crawlers.

We often see issues with robots.txt files on larger websites, and it is easy to make mistakes like blocking out a whole site after a redesign or when a new CMS is integrated.

If you run a larger website, making sure search engines can crawl it efficiently means you need a well-structured robots.txt file in place, and a strategy for which sections of the website do not need to be shown on Google.

Where Are Robots.txt Files Located?

In WordPress, this can be found in the root directory of your website. To locate it, open your FTP client or cPanel File Manager, and you’ll find the file in your website directory under public_html.
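
For reference, assuming a standard cPanel setup where public_html is the document root, the path will look something like this:

/public_html/robots.txt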

The good news is that these files are small and should only be a few hundred bytes in size.

An Example Of A Robots.txt File In Action

Let’s take a website that sells printed t-shirts. Visitors have the opportunity to filter through different styles of t-shirts and the types of patterns provided.

Each filter generates pages that show largely the same content as a number of other pages.

This is super helpful from a user-friendliness perspective, and likely means customers will return because the website is easy to use; but it’s a bit of a nightmare for search engines because it generates duplicate content.

Something you don’t want is for search engines to index filter pages and use up bandwidth on URLs with filtered content. You could use a rule called ‘Disallow’, which means that search engines will not be able to access the filtered product pages.
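
As a minimal sketch, assuming the t-shirt filters are applied through a query parameter called filter (a hypothetical parameter name), the rule could use a wildcard like this:

User-agent: *

Disallow: /*?filter=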

This can also be achieved using a canonical URL. However, a canonical tag will not stop search engines from crawling those pages; it only controls which version of a page is shown in the search results.
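
For comparison, a canonical tag sits in the <head> of each filtered page and points to the preferred version of that page; the URL below is purely illustrative:

<link rel="canonical" href="https://www.abc.com/t-shirts/">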

One factor to consider is that search engines only have a finite amount of time to crawl a website, and you don’t want this time wasted. Instead, this time should be spent on pages that need to be shown in search engines to help your website.

What Does A Robots.txt File Look Like?

Here is a basic idea of what it will look like for a WordPress website:

User-agent: *

Disallow: /wp-admin/

Here is the full breakdown of how this operates:

  • The user-agent indicates which search engines the directives are meant for.
  • An * indicates that the directives are meant for all search engines and shouldn’t exclude any.
  • Disallow, as already discussed, is a directive that indicates what content should not be accessible to the user-agent.
  • And /wp-admin/ is the path that will be inaccessible to the user-agent.

So to summarize; the example above will tell all search engines to stay away from the /wp-admin/ directory.

There are many components of robots.txt files that need to be addressed in more detail.

Disallow

We’ll start with Disallow, as this is one directive we have already explored.

Simply tell search engines not to access certain files, sections, or pages by using the disallow directive. Make sure that the disallow directive is followed by the path that should not be accessed, because if no path is defined the directive will simply be ignored.

Here’s an example of disallow in action:

User-agent: *

Disallow: /media/

Now, all search engines are told not to access the /media/ directory.

User-agent

One point to note is that each search engine identifies itself with a user-agent. For example, Google identifies as Googlebot, Yahoo’s robot identifies as Slurp, and Bing’s is Bingbot.

User-agents define the start of a group of directives, and any directives that sit between the first user-agent and the next user-agent are treated as directives for the first user-agent.

Something that is often overlooked is that directives can be applied to all user-agents, in which case the * wildcard is used as the user-agent.
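
To illustrate how groups work (the directory names here are hypothetical), the directives beneath each user-agent line apply only to that robot:

User-agent: Googlebot

Disallow: /example-directory/

User-agent: Slurp

Disallow: /another-directory/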

Allow

The allow directive is used to effectively counteract a disallow directive, and is currently supported by search engines like Google and Bing.

When using the ‘allow’ and ‘disallow’ directives in unison, you can let a search engine know that it may access a file or page within a directory that is otherwise disallowed.

Here is an example of an Allow directive:

User-agent: *

Allow: /media/policy-terms.pdf

Disallow: /media/

As you can see, the allow directive is followed by the path that can be accessed, and if no path is defined the directive will simply be ignored.

No search engine is allowed to access the /media/ directory, except for the file called policy-terms.pdf.

One important point to note is that when using allow and disallow together, do not use wildcards, as these could potentially lead to conflicting directives.
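
As a hypothetical example of the kind of conflict to avoid, the wildcard rule and the directory rule below both match /media/policy-terms.pdf, and different search engines may not resolve that conflict in the same way:

User-agent: *

Allow: /media/*.pdf

Disallow: /media/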

Crawl-Delay

This directive is actually unofficial but is very helpful in preventing servers from being overloaded with too many requests.

When a search engine is overloading your server, using a crawl-delay can be a solid temporary fix. Remember that this is only temporary; if it is needed, it is likely that your website is running in a poor hosting environment.

It may also mean that your website is not configured correctly, which should be rectified immediately.

What makes this directive confusing is that each search engine handles it differently. Whilst we cannot cover all of them, here is how the most common search engines handle it:

  • Google and its Googlebot do not support the crawl-delay directive, so there is no point defining a crawl delay for Google. However, Google does support defining a crawl rate via Google Search Console.
  • Yahoo, Yandex, and Bing support the crawl-delay directive, but it will differ slightly between search engines, so make sure you check their documentation before proceeding.
  • Baidu also doesn’t support the crawl-delay directive; however, you can register a webmaster tools account with Baidu, which will let you control crawl frequency, just like Google Search Console.

When using crawl-delay with Bing, Yahoo, or Yandex, ensure that this directive is placed straight after the disallow or allow directive:

User-agent: Slurp

Disallow: /wp-admin/

Crawl-delay: 10

Sitemap

The main purpose of the robots.txt file is to tell search engines which pages not to crawl, but it can also be used to point search engines to your XML sitemap, which is supported by Bing, Yahoo, Ask, and of course, Google.

If you are referencing an XML sitemap, it should be specified as an absolute URL, and it does not have to be hosted on the same host as the robots.txt file.

This is always a good idea, even if you have already submitted your XML sitemap in Google Search Console or Bing Webmaster Tools; do not forget that there are many other search engines out there.

An example using multiple XML sitemaps looks like the following:

User-agent: *

Disallow: /wp-admin/

Sitemap: https://www.abcexample.com/sitemap1.xml

Sitemap: https://www.abcexample.com/sitemap2.xml

The above will tell all search engines not to access the /wp-admin/ directory, and that there are two XML sitemaps located at the two web addresses above.

To use a single XML sitemap, simply use the following:

User-agent: *

Disallow: /wp-admin/

Sitemap: https://www.abc.com/sitemap_index.xml

In this example, no search engine will have access to the /wp-admin/ directory, and the XML sitemap can be found at the above link.

When Is The Best Time To Use Robots.txt File?

This is a bit of a trick question because you should always use this file. As long as you avoid the common issues highlighted above, there’s no harm in having a robots.txt file, and it will make your directives to search engines more useful and crawling more effective.

As long as you follow the advice above carefully, there should be no reason for any hiccups that cause parts of your website to stop working.

Best Practices To Consider With Robots.txt Files

There are a number of best practices to follow and we will cover each of them individually.

Filename And Location

The file should always be placed at the root of a website and should always carry the filename robots.txt. An example would be:

https://www.abc.com/robots.txt

Note that the URL is case-sensitive just like any other URL, and if search engines are unable to find the file in its default location, they will assume there are no directives and crawl your entire website.

Groups Of Directives Per Robot & Specificity

You should only define a single group of directives for each individual search engine, because having multiple groups of directives for a single search engine just leads to confusion.

This also means you should be specific, because directives such as disallow will trigger partial matches as well. The key is to be as specific as possible, which will prevent you from unintentionally disallowing access to files you did not intend to block.
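
For example (using hypothetical paths), because matching is done on URL prefixes, a directive without a trailing slash also catches anything else that starts with the same characters:

User-agent: *

# Blocks /directory/, but also /directory-two/ and /directory.html

Disallow: /directory

# Blocks only the /directory/ folder and its contents

Disallow: /directory/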

Order Of Precedence

As we’ve already pointed out, search engines are individual by nature and will handle this type of file differently. As a rule of thumb, the first matching directive will always be the winner.

However, when it comes to Google and Bing, then going back to specificity is going to be your best play.


Directives For All Robots And Directives For A Specific Robot

For each robot, only one group of directives is valid. If directives meant for all robots are accompanied by directives that are specific to a particular robot, only the specific directives will be taken into consideration for that robot.

If you want a specific robot to also follow the directives meant for all robots, you need to repeat those directives within that robot’s own group.

This means that certain search engines can be filtered out and denied access. For example, you may not want most search engines to access a test section of your site or a page under construction, but still allow Googlebot access to the test page.
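
As a sketch using hypothetical section names, every bot except Googlebot is blocked from both areas below, while Googlebot, which only reads its own group, is blocked from the page under construction but can still reach the test section:

User-agent: *

Disallow: /test/

Disallow: /under-construction/

User-agent: Googlebot

Disallow: /under-construction/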

This comes with a warning: the robots.txt file is publicly available for anyone to view, which means that disallowing certain sections of your website can tip off people with malicious intent about which areas to attack.

Robots.txt File For Each Sub Domain

Robots.txt directives only apply to the subdomain on which the file is hosted.

You should only have one robots.txt file available per subdomain; if multiple robots.txt URLs exist, you need to ensure that the extras return an HTTP 404 status, or a 301 redirect to the correct robots.txt file.
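
For example, with an illustrative domain and a hypothetical shop subdomain:

https://www.abc.com/robots.txt only applies to www.abc.com

https://shop.abc.com/robots.txt only applies to shop.abc.com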

Monitoring Your Robots.txt File

It goes without saying that you need to monitor your file for any changes, as many issues arise from incorrect directives or from changes that cause problems with your SEO.

You’ll often find this cropping up when new features or a new version of a website have been tested in a test environment and then pushed live.

Guidelines And Conflicts

If your robots.txt file conflicts with settings defined in Google Search Console, Google will often choose to use the settings defined in Google Search Console over the directives in your file.

Noindex In Your Robots.txt File

Google has long suggested that you do not use the unofficial noindex directive in robots.txt, and as of September 1st, 2019, they stopped supporting it entirely. Other search engines do not appear to support it either.

Therefore you should apply noindex using meta robots tags or the X-Robots-Tag HTTP header instead.
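
For reference, the meta tag version goes in the page’s <head>, while the X-Robots-Tag version is sent as an HTTP response header (which is useful for non-HTML files such as PDFs):

<meta name="robots" content="noindex">

X-Robots-Tag: noindex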

Preventing UTF-8 BOM When Using Robots.txt File

BOM means byte order mark, which is an invisible character at the start of a file indicating the encoding of a text file.

Google has stated that it ignores the byte order mark found at the beginning of the robots.txt file, but we still recommend preventing the use of a UTF-8 BOM, as many people have reported issues with how search engines interpret the file.

Removing any ambiguity around your crawling preferences will make the process a lot smoother, and of course, there are many other search engines besides Google; even if Google is forgiving about the BOM, it’s still best to avoid it.

Are There Any Limitations To Using Robots.txt File?

As with anything, there are certain limitations to using robots.txt files, and you need to be aware of them so you are not caught short.

The first thing to consider is that the robots.txt file contains directives and is not technically a mandate.

That doesn’t mean the directives aren’t respected by search engines; they are certainly a step above advisory. It simply means that they are an instruction or guideline, as opposed to rules set in stone.

Something else to consider is that pages of your website can still appear in search results even if they are blocked from search engines, provided there are links to them from a page that is crawled.

You can technically remove these URLs from Google, and this can be done in the Google Search Console.

However, these URLs are only hidden temporarily, and if you want them kept out of Google’s search results you will need to submit a removal request every 180 days for each URL you want to stay hidden.

Another limitation of using robots.txt files is that according to Google, they are cached for up to 24 hours, so always consider this when you’re making changes to your file.

As of now, it is not clear how all search engines deal with caching of this file, but as a rule of thumb, you should avoid caching your robots.txt file yourself, so that search engines do not take even longer to pick up on changes.

You should bear in mind that file size plays an important role: Google currently supports a maximum file size of 500 kibibytes. Any content added after that point may be ignored.

However, this is just Google, and it is unclear whether other search engines impose a maximum file size for this type of file.

Examples of Robots.txt Files

Now that we’ve laid out all the information, it’s time to take a look at some examples of robots.txt files.

Below are some of the most common examples that cover a wide range of commands.

Access To Everything

If you want to allow access to everything, there are a few options. You can simply leave an empty robots.txt file, which does not restrict any pages, or you can input the following and leave Disallow blank:

User-agent: *

Disallow:

Disallow Everything

To disallow everything, you will tell search engines not to access the entire site:

User-agent: *

Disallow: /

Simply add a forward slash and access is now blocked (you can see how one misplaced character can make or break a website!)

No Access To Specific Bots

You can select characters to disallow certain bots to have access. For example:

User-agent: Slurp

Disallow: /

Blocking Robots From Directories Or Specific Files

To disallow robots access to certain directories:

User-agent: *

Disallow: /admin/

Disallow: /private/

You can also block robots from accessing one specific file:

User-agent: *

Disallow: /media/policy-terms.pdf

Set Up Your Robots.txt File Correctly Using Robots.txt Checker

A robots.txt file needs to take into consideration all of the content on your site, and a checker that misses any of that content is essentially useless.

Without any context, a checker will only be able to tell you whether you have any syntax mistakes or whether you have used deprecated directives.

This is why you should fully crawl your website and then analyze the robots.txt file, ensuring that all content is audited for mistakes or errors.

To do this, you need to know the exact moment that anything changes in your file, as even minor changes can have a wide-reaching impact on your SEO and performance.

Therefore, you should set up alerts so that you know immediately if any changes are made to your robots.txt file.

Should You Block SEMrush In Your Robots.txt File?

We often get asked whether you should block SEMrush. Its crawler is a bot, also commonly known as a web robot, web crawler, or web spider, that performs repetitive tasks in a structured manner so data can be collected more efficiently.

It will crawl your site using a list of your web page URLs, and it will save hyperlinks from those pages for potential crawling in the future. This software can discover new and updated web data.

For those who are not using it and want to preserve their bandwidth, or simply prevent it from auditing backlinks, you can block it.

To do this you can input the following:

User-agent: SemrushBot

Disallow: / 

Or, instead of completely blocking SemrushBot, you can tell it to crawl your site at a slower pace if it is using too many resources:

User-agent: SemrushBot

Crawl-delay: 60

You can also block their backlink audit tool while allowing their other tools access. To do this:

User-agent: SemrushBot-BA

Disallow: /

According to SEMrush, it will take around one hour or roughly 100 requests for the bot to register the changes and recheck your robots.txt file.

Final Thoughts

Robots.txt files are one of the most important files for search engines; their directives can stop a search engine from crawling parts of your website that you do not want to be seen. This means the file plays an integral role in SEO and shouldn’t be understated.

If you follow the best practices laid out in our guide, you will be cautious enough when making changes to avoid causing havoc by accidentally making big parts of your website inaccessible to search engines.

You should also understand how different search engines will operate with this directive, and that with some of the most popular search engines like Google and Bing, it’s always better to be specific. 

Frequently Asked Questions

How Can You Prevent Search Engines From Indexing Search Result Pages On WordPress?

You can do this by ensuring that they do not have access to them.

It is therefore imperative that you include the following directives in your file, in order to prevent all search engines from indexing the search result pages on your WordPress website.

This assumes that no changes have been made to the way the search results page functions, and that no search result pages have already been indexed.

Use the following directives to ensure that search engines are prevented from accessing these pages:

User-agent: *

Disallow: /?s=

Disallow: /search/

One final thing to note is that if your robots.txt file changes at any point and the search result pages become accessible to search engines again, you should ensure that there is a second line of defense.

This  is achieved by using the following on your search results page:

<meta name="robots" content="noindex">

Can You Still Use Noindex?

As the noindex rule in robots.txt is no longer supported, you need to ensure that your pages are not indexed by using noindex meta tags, which allow bots to access the page but prevent it from being indexed or appearing in SERPs.

While the disallow rule might not be as effective as the noindex tag for keeping pages out of search results, it can still block bots from crawling your pages.

What Content Should You Avoid Blocking?

You should avoid blocking content that is good and that you wish the public to see.

One of the biggest mistakes you can make is blocking content that the audience of your website enjoys and regularly checks out, especially those that run an eCommerce website where sales are made regularly.

This will inevitably hurt your SEO down the line, so you must always check your website pages for noindex tags and any disallow rules.

Justin Shaw