How to add robots.txt. How to edit the robots.txt file

21.03.2022

Most robots are well designed and do not cause any problems for site owners. But if a bot is written by an amateur, or if "something goes wrong", it can put a significant load on the site it crawls. By the way, spiders do not infiltrate the server like viruses - they simply request the pages they need remotely (essentially, they are analogues of browsers, but without the page-viewing function).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a very simple syntax, which is described in great detail, for example, in the Yandex help and the Google help. It usually specifies which search bot the directives that follow are intended for: the bot name ("User-agent"), allowing ("Allow") and forbidding ("Disallow") directives, and "Sitemap" is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created quite a long time ago, and some things were added later. There are directives and design rules that only the robots of certain search engines understand. In RuNet, only Yandex and Google are really of interest, which means you should study their help on compiling robots.txt in particular detail (I gave the links in the previous paragraph).

For example, earlier it was useful to tell the Yandex search engine which mirror of your web project is the main one in the special "Host" directive, which only this search engine understands (well, Mail.ru too, since its search is powered by Yandex). True, at the beginning of 2018 Yandex dropped Host, and its function, as with other search engines, is now performed by a 301 redirect.

Even if your resource does not have mirrors, it is useful to indicate which spelling (with www or without it) is the main one.

Now let's talk a little about the syntax of this file. Directives in robots.txt look like this:

<field>:<space><value><space>

The correct code should contain at least one "Disallow" directive after each "User-agent" entry. An empty file assumes permission to index the entire site.
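For example, a minimal record built from such field:value pairs (the /tmp/ path here is just a placeholder for illustration) looks like this:

User-agent: *
Disallow: /tmp/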

user-agent

"User-agent" directive must contain the name of the search bot. With it, you can set up rules of conduct for each specific search engine (for example, create a ban on indexing a separate folder only for Yandex). An example of writing a "User-agent", addressed to all bots that come to your resource, looks like this:

User-agent: *

If you want to set certain conditions in the "User-agent" for only one bot, for example, Yandex, then you need to write this:

User-agent: Yandex

The name of the search engine robots and their role in the robots.txt file

Each search engine's bot has its own name (for example, for Rambler it is StackRambler). Here I will list the most famous of them:

Google (http://www.google.com) - Googlebot
Yandex (http://www.ya.ru) - Yandex
Bing (http://www.bing.com/) - bingbot

Major search engines sometimes have, in addition to their main bots, separate instances for indexing blogs, news, images, and so on. You can find a lot of information about the bot types in the Yandex and Google help.

What to do in this case? If you need to write a no-indexing rule that all types of Google bots must follow, then use the name Googlebot, and all the other spiders of this search engine will obey it too. However, you can also prohibit only, say, the indexing of images by specifying the Googlebot-Image bot in the User-agent. This may not be very clear right now, but with examples I think it will be easier.
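For instance, here is a sketch of such a split (the /images/ path is only an assumption for illustration): the main Googlebot gets no restrictions, while Googlebot-Image is barred from the pictures folder.

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /images/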

Examples of using the Disallow and Allow directives in robots.txt

Let me give you a few simple examples of using these directives, with explanations of what they do.

  1. The code below allows all bots (indicated by the asterisk in User-agent) to index all content without any exceptions. This is done with an empty Disallow directive. User-agent: * Disallow:
  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index. This is done by Disallow with "/" in the value field. User-agent: * Disallow: /
  3. In this case, all bots will be prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory). User-agent: * Disallow: /image/
  4. To block a single file, it is enough to specify its absolute path. User-agent: * Disallow: /katalog1/katalog2/private_file.html

    Looking ahead a little, I’ll say that it’s easier to use the asterisk character (*) so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the "image" directory will be blocked, as well as all files and directories beginning with the characters "image", i.e. the files "image.htm" and "images.htm" and the directories "image", "images1", "image34", etc. User-agent: * Disallow: /image The point is that, by default, an asterisk is implied at the end of the entry, which stands for any characters, including none at all. Read about this below.
  6. The Allow directive permits access; it is a good complement to Disallow. For example, with this condition we forbid the Yandex search robot from downloading (indexing) everything except web pages whose address starts with /cgi-bin: User-agent: Yandex Allow: /cgi-bin Disallow: /

    Well, or this is an obvious example of using the Allow and Disallow bundle:

    User-agent: * Disallow: /catalog Allow: /catalog/auto

  7. When describing paths for Allow-Disallow directives, you can use the symbols "*" and "$", thus setting certain logical expressions.
    1. The "*" (asterisk) symbol means any (including empty) sequence of characters. The following example prevents all search engines from indexing files with the ".php" extension: User-agent: * Disallow: *.php$
    2. Why is the $ (dollar) sign needed at the end? The fact is that, by the logic of the robots.txt file, a default asterisk is added at the end of each directive (it is not written, but it is effectively there). For example, we write: Disallow: /images

      Assuming it's the same as:

      Disallow: /images*

      That is, this rule prohibits the indexing of all files (web pages, images, and other file types) whose address starts with /images, no matter what follows (see the example above). So the $ symbol simply cancels that default (unwritten) asterisk at the end. For example:

      Disallow: /images$

      Only disables indexing of the /images file, not /images.html or /images/primer.html. Well, in the first example, we prohibited indexing only files ending in .php (having such an extension), so as not to catch anything extra:

      Disallow: *.php$

  • In many engines, user-facing pages have human-readable URLs, while system-generated URLs contain a question mark "?" in the address. You can take advantage of this and write the following rule in robots.txt: User-agent: * Disallow: /*?

    The asterisk after the question mark suggests itself, but, as we found out a little higher, it is already implied at the end. Thus, we will prohibit the indexing of search pages and other service pages created by the engine, which the search robot can reach. It will not be superfluous, because the question mark is most often used by CMS as a session identifier, which can lead to duplicate pages getting into the index.

  • Sitemap and Host directives (for Yandex) in Robots.txt

    In order to avoid unpleasant problems with site mirrors, it was previously recommended to add the Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - specifies the main site mirror for Yandex

    For example, before, if you had not yet switched to the secure protocol, you had to specify in Host not the full URL but just the domain name (without http://, i.e. myhost.ru). If you have already switched to https, then you need to specify the full URL (like https://myhost.ru).
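    In other words, depending on whether the site had moved to https, one of the following lines was used (never both; the domain is just a placeholder):

    # before the move to https:
    Host: myhost.ru
    # after the move to https:
    Host: https://myhost.ru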

    Canonical is a wonderful tool for combating duplicate content - the search engine simply will not index a page if a different URL is specified in its Canonical. For example, for a paginated page of my blog, Canonical points to the main page of the blog, so there should be no problems with duplicated titles.

    But I digress...

    If your project is based on any engine, duplicate content is highly likely to appear, which means you need to fight it, including with a ban in robots.txt, and especially with the meta tag, because Google can ignore a ban in robots.txt, but it cannot bring itself to ignore the meta tag (that is how it was brought up).

    For example, in WordPress, pages with very similar content can get into the index if indexing is allowed for category content, tag archive content, and date archive content all at once. But if you use the Robots meta tag described above to ban the tag archives and date archives (or keep the tags but prohibit indexing of category contents), then no duplication of content will occur. How to do this is described at the link given just above (to the OlInSeoPak plugin).

    Summing up, I’ll say that the Robots file is designed to set global rules for denying access to entire site directories, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions a little higher.

    Now let's look at specific examples of robots.txt designed for different engines: Joomla, WordPress and SMF. Naturally, the three options created for different CMSs will differ significantly (if not radically) from each other. True, they will all have one thing in common, and it is connected with the Yandex search engine.

    Because Yandex carries considerable weight in RuNet, you need to take into account all the nuances of how it works, and here the Host directive will help us. It explicitly tells this search engine the main mirror of your site.

    For it, it is advised to use a separate User-agent block intended only for Yandex (User-agent: Yandex). This is because other search engines may not understand Host, and so including it in the record intended for all search engines (User-agent: *) could lead to negative consequences and incorrect indexing.

    It is hard to say how things really stand, because search algorithms are a black box, so it is better to do as advised. But in that case you have to duplicate in the User-agent: Yandex block all the rules that we set for User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, you will thereby allow Yandex to go anywhere and drag everything into the index.
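    A sketch of that duplication (the paths are purely illustrative):

    User-agent: *
    Disallow: /admin/
    Disallow: /tmp/

    User-agent: Yandex
    Disallow: /admin/
    Disallow: /tmp/
    Host: site.ru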

    Robots for WordPress

    I will not give an example of a file that the developers recommend. You can watch it yourself. Many bloggers do not limit Yandex and Google bots at all in their walks through the content of the WordPress engine. Most often on blogs you can find robots automatically filled with a plugin.

    But, in my opinion, one should still help the search in the difficult task of sifting the wheat from the chaff. Firstly, it will take a lot of time for Yandex and Google bots to index this garbage, and there may not be time at all to add web pages with your new articles to the index. Secondly, bots crawling through the junk files of the engine will create an additional load on the server of your host, which is not good.

    You can look at my version of this file yourself. It is old and has not changed for a long time, but I try to follow the principle "don't fix what isn't broken", and it is up to you to decide whether to use it, make your own, or borrow from someone else. Until recently I also had a ban on indexing pages with pagination there (Disallow: */page/), but I removed it, relying on Canonical, which I wrote about above.

    But in general, the one and only correct file for WordPress probably does not exist. You can, of course, build any requirements into it, but who is to say they will be correct. There are many variants of the "ideal" robots.txt on the web.

    I will give two extremes:

    1. You can find a megafile with detailed explanations (the # character separates comments, which would be better removed in a real file):

       User-agent: *      # general rules for robots, except Yandex and Google,
                          # because the rules for them are below
       Disallow: /cgi-bin # hosting folder
       Disallow: /?       # all query parameters on the main page
       Disallow: /wp-     # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
       Disallow: /wp/     # if there is a /wp/ subdirectory where the CMS is installed
                          # (if not, the rule can be removed)
       Disallow: *?s=     # search
       Disallow: *&s=     # search
       Disallow: /search/ # search
       Disallow: /author/ # author archive
       Disallow: /users/  # author archive
       Disallow: */trackback # trackbacks, notifications in comments when an open
                          # article link appears
       Disallow: */feed   # all feeds
       Disallow: */rss    # rss feed
       Disallow: */embed  # all embeds
       Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file
                          # (if not used, can be removed)
       Disallow: /xmlrpc.php # WordPress API file
       Disallow: *utm=    # links with utm tags
       Disallow: *openstat= # links with openstat tags
       Allow: */uploads   # open the uploads folder with files

       User-agent: GoogleBot # rules for Google (comments not duplicated)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Disallow: *utm=
       Disallow: *openstat=
       Allow: */uploads
       Allow: /*/*.js     # open js scripts inside /wp- (/*/ - for priority)
       Allow: /*/*.css    # open css files inside /wp- (/*/ - for priority)
       Allow: /wp-*.png   # images in plugins, cache folder, etc.
       Allow: /wp-*.jpg   # images in plugins, cache folder, etc.
       Allow: /wp-*.jpeg  # images in plugins, cache folder, etc.
       Allow: /wp-*.gif   # images in plugins, cache folder, etc.
       Allow: /wp-admin/admin-ajax.php # used by plugins so that JS and CSS are not blocked

       User-agent: Yandex # rules for Yandex (comments not duplicated)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Allow: */uploads
       Allow: /*/*.js
       Allow: /*/*.css
       Allow: /wp-*.png
       Allow: /wp-*.jpg
       Allow: /wp-*.jpeg
       Allow: /wp-*.gif
       Allow: /wp-admin/admin-ajax.php
       Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not closing such pages
                          # from indexing but removing the tag parameters instead;
                          # Google does not support such rules
       Clean-Param: openstat # similar

       # Specify one or more Sitemap files (no need to duplicate them for each User-agent).
       # Google XML Sitemap creates 2 sitemaps, as in the example below.
       Sitemap: http://site.ru/sitemap.xml
       Sitemap: http://site.ru/sitemap.xml.gz

       # Specify the main mirror of the site, as in the example below (with WWW / without WWW;
       # if HTTPS, write the protocol; if you need to specify a port, do so).
       # The Host command is understood by Yandex and Mail.RU, Google does not take it into account.
       Host: www.site.ru
    2. Here is an example of minimalism:

       User-agent: *
       Disallow: /wp-admin/
       Allow: /wp-admin/admin-ajax.php
       Host: https://site.ru
       Sitemap: https://site.ru/sitemap.xml

    The truth probably lies somewhere in the middle. Also, don't forget to set the Robots meta tag for "extra" pages, for example, using a suitable SEO plugin, which will also help set up Canonical.

    Correct robots.txt for Joomla

    User-agent: *
    Disallow: /administrator/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/

    In principle, almost everything is taken into account here and it works well. The only thing is that you should add a separate User-agent: Yandex rule to it to insert the Host directive that defines the main mirror for Yandex, as well as specify the path to the Sitemap file.

    Therefore, in the final form, the correct robots for Joomla, in my opinion, should look like this:

    User-agent: Yandex
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /component/tags*
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Host: vash_sait.ru (or www.vash_sait.ru)

    User-agent: *
    Allow: /*.css?*$
    Allow: /*.js?*$
    Allow: /*.jpg?*$
    Allow: /*.png?*$
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Sitemap: http://path to your sitemap in XML format

    Also note that the second variant contains Allow directives permitting the indexing of styles, scripts and images. This was written specifically for Google, because its Googlebot sometimes complains that indexing of these files is prohibited in robots.txt, for example, files from the folder of the theme in use. It even threatens to lower rankings for this.

    Therefore, we allow this whole thing to be indexed in advance using Allow. By the way, the same thing happened in the sample file for WordPress.

    Good luck to you! See you soon on the pages of this blog.


    The robots.txt file is one of the most important when optimizing any website. Its absence can lead to a high load on the site from search robots and to slow indexing and re-indexing, while an incorrect setup can cause the site to disappear from search completely or simply never get indexed. As a result, it will not be found in Yandex, Google and other search engines. Let's take a look at all the nuances of properly setting up robots.txt.

    First, a short video that will give you a general idea of ​​what a robots.txt file is.

    How robots.txt affects site indexing

    Search robots will index your site regardless of the presence of a robots.txt file. If such a file exists, then the robots can be guided by the rules that are written in this file. At the same time, some robots may ignore certain rules, or some rules may be specific only to some bots. In particular, GoogleBot does not use the Host and Crawl-Delay directives, YandexNews has recently begun to ignore the Crawl-Delay directive, and YandexDirect and YandexVideoParser ignore more general robots directives (but are guided by those specified specifically for them).

    More about exceptions:
    Yandex exceptions
    Robot Exception Standard (Wikipedia)

    The maximum load on a site is created by robots that download content from it. Therefore, by specifying what to index and what to ignore, as well as at what time intervals to download, you can, on the one hand, significantly reduce the load on the site from robots, and on the other hand, speed up crawling by prohibiting the crawling of unnecessary pages.

    Such unnecessary pages include ajax, json scripts responsible for pop-up forms, banners, captcha output, etc., order forms and a shopping cart with all the steps of making a purchase, search functionality, personal account, admin panel.
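    As a rough illustration (the section names are assumptions that depend on your CMS), rules for such pages might look like this:

    User-agent: *
    Disallow: /ajax/
    Disallow: /cart/
    Disallow: /order/
    Disallow: /search/
    Disallow: /account/
    Disallow: /admin/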

    For most robots, it is also desirable to disable indexing of all JS and CSS. But for GoogleBot and Yandex, such files must be left for indexing, as they are used by search engines to analyze the convenience of the site and its ranking (Google proof, Yandex proof).
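    A sketch of such a setup (the /assets/ folder is a placeholder): scripts and styles stay closed for everyone except GoogleBot and Yandex.

    User-agent: *
    Disallow: /assets/

    User-agent: Googlebot
    Allow: /assets/*.css
    Allow: /assets/*.js
    Disallow: /assets/

    User-agent: Yandex
    Allow: /assets/*.css
    Allow: /assets/*.js
    Disallow: /assets/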

    robots.txt directives

    Directives are rules for robots. There is a specification from January 30, 1994 and an extended standard from 1996. However, not all search engines and robots support every directive. For this reason, it is more useful for us to know not the standard itself, but how the main robots handle particular directives.

    Let's look at it in order.

    user-agent

    This is the most important directive; it determines which robots the rules that follow are intended for.

    For all robots:
    User-agent: *

    For a specific bot:
    User-agent: GoogleBot

    Note that directive names in robots.txt are case-insensitive, i.e. the user-agent for Google can just as well be written like this:
    user-agent: googlebot

    Below is a table of the main user agents of various search engines.

    Bot Function
    Google
    Googlebot Google's main indexing robot
    Googlebot-News Google News
    Googlebot-Image Google Images
    Googlebot-Video Google Video
    Mediapartners-Google Google AdSense, Google Mobile AdSense
    AdsBot-Google landing page quality check
    AdsBot-Google-Mobile-Apps Google Robot for Apps
    Yandex
    YandexBot Yandex's main indexing robot
    YandexImages Yandex.Images
    YandexVideo Yandex.Video
    YandexMedia multimedia data
    YandexBlogs blog search robot
    YandexAddurl robot accessing the page when it is added via the "Add URL" form
    YandexFavicons robot that indexes site icons (favicons)
    YandexDirect Yandex.Direct
    YandexMetrika Yandex.Metrica
    YandexCatalog Yandex.Catalog
    YandexNews Yandex.News
    YandexImageResizer mobile services robot
    Bing
    bingbot the main indexing robot Bing
    Yahoo!
    Slurp main indexing robot Yahoo!
    Mail.Ru
    Mail.Ru main indexing robot Mail.Ru
    Rambler
    StackRambler Formerly the main indexing robot Rambler. However, as of June 23, 2011, Rambler ceases to support its own search engine and now uses Yandex technology on its services. No longer relevant.

    Disallow and allow

    Disallow closes pages and sections of the site from indexing.
    Allow forcefully opens pages and sections of the site for indexing.

    But everything is not so simple here.

    First, you need to know additional operators and understand how they are used - these are *, $ and #.

    * is any number of characters, including none. You do not need to put an asterisk at the end of a line; it is implied there by default.
    $ - indicates that the character before it must be the last one.
    # - comment, everything after this character in the line is not taken into account by the robot.

    Examples of using:

    Disallow: *?s=
    Disallow: /category/$
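    And the # character simply starts a comment, for example (the path is a placeholder):

    Disallow: /cart/ # the shopping cart does not need to be in the index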

    Second, you need to understand how nested rules are executed.
    Remember that the order in which directives are written does not matter. Which rule wins when deciding what to open or close for indexing is determined by the paths specified: the more specific (longer) path takes precedence. Let's take an example.

    Allow: *.css
    Disallow: /template/

    http://site.ru/template/ - closed from indexing
    http://site.ru/template/style.css - closed from indexing
    http://site.ru/style.css - open for indexing
    http://site.ru/theme/style.css - open for indexing

    If you want all .css files to be open for indexing, you will have to additionally register this for each of the closed folders. In our case:

    Allow: *.css
    Allow: /template/*.css
    Disallow: /template/

    Again, the order of the directives is not important.

    Sitemap

    Directive for specifying the path to the Sitemap XML file. The URL is written in the same way as in the address bar.

    For example,

    Sitemap: http://site.ru/sitemap.xml

    The Sitemap directive is specified anywhere in the robots.txt file without being tied to a specific user-agent. You can specify multiple sitemap rules.
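    For example, if the site has several maps (the file names here are only an assumption):

    Sitemap: http://site.ru/sitemap.xml
    Sitemap: http://site.ru/sitemap-images.xml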

    Host

    Directive for specifying the main mirror of the site (in most cases: with www or without www). Please note that the main mirror is indicated WITHOUT http://, but WITH https://. Also, if necessary, the port is specified.
    The directive is only supported by Yandex and Mail.Ru bots. Other robots, in particular GoogleBot, will not take the command into account. Host is registered only once!

    Example 1:
    Host: site.ru

    Example 2:
    Host: https://site.ru

    Crawl-delay

    Directive for setting the time interval between downloading the site pages by the robot. Supported by Yandex robots, Mail.Ru, Bing, Yahoo. The value can be set in integer or fractional units (separator - dot), time in seconds.

    Example 1:
    Crawl-delay: 3

    Example 2:
    Crawl-delay: 0.5

    If the site has a small load, then there is no need to set such a rule. However, if the indexing of pages by a robot leads to the fact that the site exceeds the limits or experiences significant loads, up to server outages, then this directive will help reduce the load.

    The higher the value, the fewer pages the robot will download in one session. The optimal value is determined individually for each site. It is better to start with not very large values ​​- 0.1, 0.2, 0.5 - and gradually increase them. For search engine robots that are less important for promotion results, such as Mail.Ru, Bing and Yahoo, you can initially set higher values ​​than for Yandex robots.
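    A sketch of such a differentiated setup (the values are only a starting point, not a recommendation):

    User-agent: Yandex
    Disallow:
    Crawl-delay: 0.5

    User-agent: bingbot
    Disallow:
    Crawl-delay: 3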

    Clean-param

    This rule tells the crawler that URLs with the specified parameters should not be indexed. The rule is given two arguments: a parameter and a section URL. The directive is supported by Yandex.

    Clean-param: author_id http://site.ru/articles/

    Clean-param: author_id&sid http://site.ru/articles/

    Clean-Param: utm_source&utm_medium&utm_campaign

    Other Options

    In the extended robots.txt specification, you can also find the Request-rate and Visit-time parameters. However, they are currently not supported by the leading search engines.

    Meaning of directives:
    Request-rate: 1/5 - load no more than one page in five seconds
    Visit-time: 0600-0845 - Load pages only between 6 am and 8:45 GMT.

    Closing robots.txt

    If you need to configure your site to NOT be indexed by search robots, then you need to write the following directives:

    User-agent: *
    disallow: /

    Make sure these directives are present on the test (staging) versions of your site.

    Proper setting of robots.txt

    For Russia and the CIS countries, where Yandex's share is tangible, directives should be written for all robots and separately for Yandex and Google.

    To properly configure robots.txt, use the following algorithm:

    1. Close the site admin panel from indexing
    2. Close personal account, authorization, registration from indexing
    3. Close cart, order forms, shipping and order data from indexing
    4. Close ajax and json scripts from indexing
    5. Close cgi folder from indexing
    6. Close plugins, themes, js, css from indexing for all robots except Yandex and Google
    7. Close search functionality from indexing
    8. Close service sections from indexing that do not carry any value for the site in search (error 404, list of authors)
    9. Close technical duplicates of pages from indexing, as well as pages on which all content is duplicated in one form or another from other pages (calendars, archives, RSS)
    10. Close from indexing pages with filter, sort, compare options
    11. Stop indexing pages with UTM tags and sessions parameters
    12. Check what is indexed by Yandex and Google using the “site:” parameter (type “site:site.ru” in the search bar). If there are pages in the search that also need to be closed from indexing, add them to robots.txt
    13. Specify Sitemap and Host
    14. If necessary, write Crawl-Delay and Clean-Param
    15. Check the correctness of robots.txt using Google and Yandex tools (described below)
    16. After 2 weeks, check again if there are new pages in the SERP that should not be indexed. If necessary, repeat the above steps.

    robots.txt example

    # An example of a robots.txt file for setting up a hypothetical site https://site.ru

    User-agent: *
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: */?s=
    Disallow: *sort=
    Disallow: *view=
    Disallow: *utm=
    Crawl-Delay: 5

    User-agent: GoogleBot
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: */?s=
    Disallow: *sort=
    Disallow: *view=
    Disallow: *utm=
    Allow: /plugins/*.css
    Allow: /plugins/*.js
    Allow: /plugins/*.png
    Allow: /plugins/*.jpg
    Allow: /plugins/*.gif

    User-agent: Yandex
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: */?s=
    Disallow: *sort=
    Disallow: *view=
    Allow: /plugins/*.css
    Allow: /plugins/*.js
    Allow: /plugins/*.png
    Allow: /plugins/*.jpg
    Allow: /plugins/*.gif
    Clean-Param: utm_source&utm_medium&utm_campaign
    Crawl-Delay: 0.5

    Sitemap: https://site.ru/sitemap.xml
    Host: https://site.ru

    How to add and where is robots.txt

    After you have created the robots.txt file, it must be placed on your site at site.ru/robots.txt - i.e. in the root directory. The crawler always accesses the file at the URL /robots.txt

    How to check robots.txt

    Checking robots.txt is carried out at the following links:

    • In Yandex.Webmaster - on the Tools > Robots.txt analysis tab
    • In Google Search Console - on the Crawl tab > robots.txt testing tool

    Common mistakes in robots.txt

    At the end of the article, I will give some typical robots.txt file errors.

    • robots.txt is missing
    • in robots.txt the site is closed from indexing (Disallow: /)
    • the file contains only the most basic directives, there is no detailed study of the file
    • pages with UTM tags and session IDs are not blocked from indexing in the file
    • the file contains only directives
      Allow: *.css
      Allow: *.js
      Allow: *.png
      Allow: *.jpg
      Allow: *.gif
      while css, js, png, jpg, gif files are closed by other directives in a number of directories
    • Host directive is written multiple times
    • Host does not specify https protocol
    • the path to the Sitemap is incorrect, or the wrong protocol or site mirror is specified


    Useful video from Yandex (Attention! Some recommendations are only suitable for Yandex).

    Now we can move on to the practical part, or rather to preparing the site for promotion. Today we will analyze the question: how do you create robots.txt?

    robots.txt is a file that contains indexing parameters for search engines.

    Creating this file is one of the first steps to SEO promotion. And that's why.

    What is robots.txt for?

    After you add your site to Yandex and Google (we have not gotten to that yet), the search engines will start indexing everything, absolutely everything, that is in your site folder on the server. This is not very good for promotion, because the folder contains a lot of "garbage" that the search engines do not need, and it will negatively affect positions in the search results.

    It is the robots.txt file that prevents indexing of documents, folders and unnecessary pages. Among other things, the path to the sitemap (the topic of the next lesson) and the main site address are specified here; more about that a little later.

    I won't say much about the sitemap; I'll just say one thing: the sitemap improves the indexing of the site. But the main address is worth discussing in more detail. The fact is that each site initially has several mirrors (copies of the site) that are available at different addresses:

    • www.site.ru
    • site.ru
    • site.ru/
    • www.site.ru/

    With all these mirrors, the site becomes non-unique. Naturally, search engines do not like non-unique content and keep such sites from rising in the search results.

    How to fill in the robots.txt file?

    Any file designed to work with various external services, in our case, search engines, must have filling rules (syntax). Here are the rules for robots:

    • The robots.txt file name must be in lowercase. Do not name it Robots.txt or ROBOTS.TXT. Correct: robots.txt;
    • The file must be plain text (Unix text format). A regular text editor such as Notepad in Windows produces this format, so creating robots.txt is quite simple;

    robots operators

    And now let's talk, in fact, about the robots operators themselves. There are about 6 in total, in my opinion, but only 4 are necessary:

    1. user-agent. This operator is used to specify the search engine to which the indexing rules are addressed. With it, you can specify different rules for different PSs. Filling example: User-agent: Yandex;
    2. Disallow. An operator that prohibits the indexing of a particular folder, page, file. Fill example: Disallow: /page.html;
    3. Host. This operator specifies the main address (domain) of the site. Filling example: Host: site.ru;
    4. Sitemap. Points to the sitemap address. Filling example: Sitemap: http://site.ru/sitemap.xml. A combined example of all four operators is given right below this list;

    In the example above, I have forbidden Yandex to index the page "page.html". The Yandex search robot will now take these rules into account, and the page "page.html" will never end up in the index.

    user-agent

    As mentioned above, User-agent specifies the search engine to which the indexing rules will apply. Here is a small table:

    Search system User-agent parameter
    Yandex Yandex
    Google Googlebot
    Mail.ru Mail.ru
    Rambler StackRambler

    If you want the indexing rules to apply to all PS, then you need to make the following entry:

    User-agent: *

    That is, use, as a parameter, an ordinary asterisk.

    Disallow

    This operator is a little more complicated, so you need to be careful with filling it out. It is written after the "User-agent" operator. Any mistake can lead to very disastrous consequences.

    What do we ban? Parameter Example
    The entire site / Disallow: /
    A file in the root directory /filename Disallow: /page.html
    A file at a specific address /path/filename Disallow: /dir/page.html
    A folder /foldername/ Disallow: /folder/
    A folder at a specific address /path/foldername/ Disallow: /dir/folder/
    Documents starting with a specific set of characters /symbols Disallow: /symbols
    Documents starting with a specific set of characters at an address /path/symbols Disallow: /dir/symbols

    Once again I say: be extremely careful when working with this operator. It also happens that, purely by chance, a person prohibits the indexing of his site, and then is surprised that it is not in the search.

    It makes no sense to talk about other operators. What is written above is enough.

    Would you like to have an example robots.txt? Catch:

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: */trackback
    Disallow: */*/trackback
    Disallow: */*/feed/*/
    Disallow: */feed
    Disallow: /tag
    Host: site.ru
    Sitemap: http://site.ru/sitemap.xml

    By the way, this example can be used as a real robots.txt file by people whose sites run on WordPress. Well, those who have other kinds of sites will have to write their own, ha ha ha. Unfortunately, there is no single file that suits everyone; each site needs its own. But with the information I have given you, creating a robots.txt should not be too hard.

    Goodbye friends!


    Everything needs instructions to work with, and search engines are no exception to the rule, which is why a special file called robots.txt was invented. This file must be in the root folder of your site, or it can be virtual, but it must open at the request: www.yoursite.ru/robots.txt

    Search engines have long learned to distinguish the necessary html files from the internal sets of scripts of your CMS system, or rather, they have learned to recognize links to content articles and all sorts of rubbish. Therefore, many webmasters already forget to make robots for their sites and think that everything will be fine anyway. Yes, they are 99% right, because if your site does not have this file, then search engines are limitless in their search for content, but there are nuances that can be taken care of in advance.

    If you have any problems with this file on the site, write comments to this article and I will quickly help you with this, absolutely free. Very often, webmasters make minor mistakes in it, which brings the site to poor indexing, or even exclusion from the index.

    What is robots.txt for?

    The robots.txt file is created to set up correct indexing of the site by search engines. That is, it contains rules allowing and denying access to certain paths of your site or content types. But it is not a panacea. All the rules in the robots file are not strict instructions to follow exactly, but merely recommendations for search engines. Google, for example, writes:

    You cannot use a robots.txt file to hide a page from Google Search results. Other pages may link to it, and it will still be indexed.

    Search robots themselves decide what to index and what not to, and how to behave on the site. Each search engine has its own tasks and functions. However much we would like to, we cannot tame them this way.

    But there is one trick that does not directly relate to the subject of this article. To completely prevent robots from indexing a page and showing it in search results, you need to write in the page's <head>:

    <meta name="robots" content="noindex">

    Let's get back to robots.txt. The rules in this file can close or allow access to the following types of files:

    • Non-graphic files. Basically, these are html files that contain some information. You can close duplicate pages, or pages that don't provide any useful information (pagination pages, calendar pages, archive pages, profile pages, etc.).
    • Graphic files. If you want site images not to appear in searches, you can set this in the robots.
    • Resource files. You can also use robots.txt to block indexing of various scripts, CSS style files and other unimportant resources. But you should not block resources that are responsible for the visual part of the site for visitors (for example, if you close the site's css and js that render beautiful blocks or tables, the search robot will not see them and will complain about it).

    To visually show how robots works, look at the picture below:

    When the search robot arrives at the site, it looks at the indexing rules and then starts indexing according to the file's recommendations.
    Depending on the rule settings, the search engine knows what may be indexed and what may not.

    The syntax of the robots.txt file

    To write rules for search engines, the robots file uses directives with various parameters which the robots follow. Let's start with the very first and probably the most important directive:

    User-agent directive

    user-agent - with this directive you specify the name of the robot that should follow the recommendations in the file. There are officially 302 such robots on the Internet. Of course, you can write rules for each of them separately, but if you do not have time for that, just write:

    User-agent: *

    The * in this example means "all". That is, your robots.txt file should begin by stating exactly who the file is for. In order not to bother with the names of all the robots, just write an asterisk in the user-agent directive.

    I will give you detailed lists of robots of popular search engines:

    Google - Googlebot - the main robot

    Other Google robots

    Googlebot-News - news search robot
    Googlebot-Image - images robot
    Googlebot-Video - video robot
    Googlebot-Mobile - mobile version robot
    AdsBot-Google - landing page quality check robot
    Mediapartners-Google - AdSense robot

    Yandex - YandexBot - the main indexing robot;

    Other Yandex robots

    Disallow and Allow Directives

    Disallow - the most basic rule in robots.txt; with this directive you prohibit indexing of certain parts of your site. The directive is written like this:

    Disallow:

    Very often you can see an empty Disallow: directive, i.e. one that supposedly tells the robot that nothing on the site is forbidden - index whatever you want. Be careful! If you put / in Disallow, you will completely close the site to indexing.

    Therefore, the most standard version of robots.txt, which "allows the indexing of the entire site for all search engines" looks like this:

    User-agent: *
    Disallow:

    If you don't know what to write in robots.txt, but have heard about it somewhere, just copy the code above, save it to a file called robots.txt and upload it to the root of your site. Or don't create anything, because even without it, robots will index everything on your site. Or read the article to the end, and you will understand what to close on the site and what not.

    According to the robots.txt rules, at least one Disallow directive is required after each User-agent entry.

    With this directive, you can disable both a folder and a separate file.

    If you want to block a folder, you should write:

    Disallow: /folder/

    If you want to disable a specific file:

    Disallow: /images/img.jpg

    If you want to disallow certain types of files:

    Disallow: /*.png$

    Full regular expressions are not supported; such wildcard patterns are not understood by every search engine, but Google and Yandex do support them.

    Allow - the permissive directive in robots.txt. It lets the robot index a specific path or file inside a forbidden directory. Until recently it was used only by Yandex; Google has since caught up and uses it too. For example:

    Allow: /content
    Disallow: /

    These directives prohibit indexing of all site content except the /content folder. Here are some more directives that have become popular lately:

    Allow: /themplate/*.js
    Allow: /themplate/*.css
    Disallow: /themplate

    These values allow indexing of the CSS and JS files inside the template folder while keeping everything else in that folder closed to indexing. Over the past year, Google sent webmasters a lot of letters with the following content:

    Googlebot can't access CSS and JS files on website

    And the related comment: We have discovered an issue on your site that may prevent it from being crawled. Googlebot cannot process JavaScript code and/or CSS files due to restrictions in the robots.txt file. This data is needed to evaluate the performance of the site. Therefore, if access to resources is blocked, then this may worsen the position of your site in the Search.

    If you add the two allow directives that are written in the last code to your Robots.txt, then you will not see such messages from Google.

    Using special characters in robots.txt

    Now about the characters used in directives. The basic characters (special characters) for denying or allowing are /, * and $.

    About the slash "/"

    The slash is very deceptive in robots.txt. Several dozen times I have observed an interesting situation where, out of ignorance, people added this to robots.txt:

    User-agent: *
    Disallow: /

    They had read something about site structure somewhere and copied it onto their own site. But this way you disable indexing of the entire site. To prohibit indexing of a directory with all its contents, you must put / at the end. For example, if you write Disallow: /seo, then absolutely all links on your site containing the word "seo" will not be indexed: the /seo/ folder, the /seo-tool/ category, the /seo-best-of-the-best-soft.html article - none of it will be indexed.
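    The difference in a short sketch:

    Disallow: /seo # blocks /seo/, /seo-tool/, /seo-best-of-the-best-soft.html and so on
    Disallow: /seo/ # blocks only the contents of the /seo/ directory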

    Look carefully at every / in your robots.txt.

    Always put / at the end of directories. If you put just / in Disallow, you block the entire site from indexing; and if you leave Allow without /, you will likewise block indexing of the entire site. / essentially means "everything that follows the directive".

    About asterisks * in robots.txt

    The special character * means any (including empty) sequence of characters. You can use it anywhere in robots like this:

    User-agent: *
    Disallow: /papka/*.aspx
    Disallow: /*old

    This forbids all files with the .aspx extension in the /papka/ directory, and also forbids not just the /old folder but any path containing "old", such as /papka/old. Tricky? That is why I do not recommend playing around with the * symbol in your robots.txt.

    By default, an implicit * is appended to the end of every allow and deny rule in robots.txt!

    About the special character $

    The special character $ cancels the implied * at the end of a rule. For example:

    Disallow: /menu$

    This rule forbids '/menu' but does not forbid '/menu.html'. In other words, it closes only the exact /menu path to search engines; it cannot be used to block all files with the word "menu" in the URL.

    host directive

    The Host rule works only in Yandex, so it is optional. It determines the main domain among your site's mirrors, if there are any. For example, you have the domain dom.com, but the domains dom2.com, dom3.com and dom4.com have also been purchased and configured, and they redirect to the main domain dom.com.

    In order for Yandex to quickly determine which of them is the main site (host), add the Host directive to your robots.txt:

    Host: dom.com

    If your site has no mirrors, you do not have to write this rule. But first check your site by IP address: it may well be that your main page opens there too, and then you should specify the main mirror. Or perhaps someone has copied all the information from your site and made an exact copy; if robots.txt was stolen along with everything else, the entry in it will help you prove which site is the original.

    There must be only one Host entry; if necessary, specify the port as well (Host: site.ru:8080).

    Crawl-delay directive

    This directive was created to avoid overloading your server. Search bots can make hundreds of requests to your site at the same time, and if your server is weak, this can cause minor glitches. To prevent this, the Crawl-delay rule was invented: it is the minimum interval between page downloads from your site. The recommended default value for this directive is 2 seconds. In robots.txt it looks like this:

    Crawl-delay: 2

    This directive works for Yandex. In Google, you can set the crawl rate in the webmaster panel, in the Site Settings section, in the upper right corner with a "gear".

    Clean-param directive

    This parameter is also only for Yandex. If site page addresses contain dynamic parameters that do not affect their content (for example: session IDs, user IDs, referrer IDs, etc.), you can describe them using the Clean-param directive.

    The Yandex robot, using this information, will not repeatedly reload duplicate information. Thus, the efficiency of crawling your site will increase, and the load on the server will decrease.
    For example, the site has pages:

    www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123
    www.site.com/some_dir/get_book.pl?ref=site_2&book_id=123
    www.site.com/some_dir/get_book.pl?ref=site_3&book_id=123

    Parameter ref is used only to track from which resource the request was made and does not change the content, the same page with the book book_id=123 will be shown at all three addresses. Then if you specify the directive like this:

    User-agent: Yandex
    Disallow:
    Clean-param: ref /some_dir/get_book.pl

    the Yandex robot will reduce all page addresses to one:
    www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123,
    If a page without parameters is available on the site:
    www.site.com/some_dir/get_book.pl?book_id=123
    then everything will come down to it when it is indexed by the robot. Other pages on your site will be crawled more often as there is no need to refresh the pages:
    www.site.com/some_dir/get_book.pl?ref=site_2&book_id=123
    www.site.com/some_dir/get_book.pl?ref=site_3&book_id=123

    # for addresses like:
    # www.site1.com/forum/showthread.php?s=681498b9648949605&t=8243
    # www.site1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
    # robots.txt will contain:
    User-agent: Yandex
    Disallow:
    Clean-param: s /forum/showthread.php

    Sitemap Directive

    With this directive, you simply specify the location of your sitemap.xml. The robot remembers this, “thanks you”, and constantly analyzes it along the given path. It looks like this:

    Sitemap: http://site.ru/sitemap.xml

    And now let's look at the general questions that arise when compiling a robot. There are many such topics on the Internet, so we will analyze the most relevant and most frequent.

    Correct robots.txt

    There is a lot of “correct” in this word, because for one site on one CMS it will be correct, and on another CMS it will give errors. "Correctly configured" for each site is individual. In Robots.txt, you need to close from indexing those sections and those files that are not needed by users and do not carry any value for search engines. The simplest and most correct version of robots.txt

    User-agent: *
    Disallow:
    Sitemap: http://site.ru/sitemap.xml

    User-agent: Yandex
    Disallow:
    Host: site.ru

    This file contains the following: a block of rules for all search engines (User-agent: *) in which indexing of the entire site is fully allowed (an empty "Disallow:", or you could write "Allow: /"), the main mirror host for Yandex (Host: site.ru), and the location of your Sitemap.xml (Sitemap: http://site.ru/sitemap.xml).

    Robots.txt for WordPress

    Again, there are a lot of nuances: one site may be an online store, another a blog, a third a landing page, a fourth a company business-card site, and all of them can run on the WordPress CMS while the rules for robots will be completely different. Here is my robots.txt for this blog:

    User-Agent: *
    Allow: /wp-content/uploads/
    Allow: /wp-content/*.js$
    Allow: /wp-content/*.css$
    Allow: /wp-includes/*.js$
    Allow: /wp-includes/*.css$
    Disallow: /wp-login.php
    Disallow: /wp-register.php
    Disallow: /xmlrpc.php
    Disallow: /template.html
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content
    Disallow: /category
    Disallow: /archive
    Disallow: */trackback/
    Disallow: */feed/
    Disallow: /?feed=
    Disallow: /job
    Disallow: /?
    Sitemap: …net/sitemap.xml

    There are a lot of settings here, let's analyze them together.

    Allow in WordPress. The first, allowing rules are for content that users need (the pictures in the uploads folder) and for robots (the CSS and JS needed to render pages). It is css and js that Google often complains about, so we left them open. It would have been possible to cover all such files simply with "/*.css$", but the Disallow lines for the folders where these files live would not let them be indexed, so I had to write out the full path inside each blocked folder.

    Allow should always point to a path that is blocked in Disallow. If something is not forbidden, there is no point in writing an Allow for it in the hope of giving search engines a push, as if to say "come on, here is a URL for you, index it faster". That will not work.
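    A small sketch of a meaningful Allow/Disallow pair (the WordPress folder names are used only for illustration):

    Disallow: /wp-content/
    Allow: /wp-content/uploads/

    The Allow here matters only because /wp-content/ is closed; an Allow for a path that is not blocked by anything changes nothing.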

    Disallow in WordPress. A lot of things need to be banned in CMS WP. A lot of different plugins, a lot of different settings and themes, a bunch of scripts and various pages that do not carry any useful information. But I went further and completely forbade indexing everything on my blog, except for the articles themselves (posts) and pages (about the Author, Services). I even closed the categories on the blog, I will open them when they are optimized for queries and when a text description for each of them appears there, but now these are just duplicate post previews that search engines do not need.

    Host and Sitemap are standard directives. Strictly speaking, Host should have been moved into a separate block for Yandex, but I did not bother with that. That's it for Robots.txt for WP.

    How to create robots.txt

    It is not as difficult as it seems at first glance. You just need to take a regular notepad (Notepad) and copy the data for your site according to the settings from this article. But if this is difficult for you, there are resources on the Internet that allow you to generate robots for your sites:

    No one will tell more about your Robots.txt than these comrades. After all, it is for them that you create your “forbidden file”.

    Now let's talk about some of the little bugs that can be in robots.

    • "Empty line" - you must not put an empty line inside a record, i.e. between the User-agent directive and the rules that follow it.
    • When two directives with prefixes of the same length conflict, priority is given to the Allow directive.
    • Only one Host directive is processed per robots.txt file. If several are specified, the robot uses the first one.
    • The Clean-param directive is cross-sectional, so it can be listed anywhere in the robots.txt file. If there are several such directives, all of them will be taken into account by the robot.
    • Six Yandex robots do not follow the robots.txt rules (YaDirectFetcher, YandexCalendar, YandexDirect, YandexDirectDyn, YandexMobileBot, YandexAccessibilityBot). To keep them off the site, you should write separate User-agent blocks for each of them.
    • The User-agent directive must always be written above the deny directives.
    • One line per directory. You cannot write several directories on one line.
    • The file name must be exactly robots.txt. No Robots.txt, ROBOTS.txt and so on - only lowercase letters in the name.
    • In the Host directive you should write the path to the domain without http and without slashes. Incorrect: Host: http://www.site.ru/. Correct: Host: www.site.ru
    • When the site uses the secure https protocol, the Host directive (for the Yandex robot) must include the protocol, e.g. Host: https://www.site.ru

    This article will be updated as interesting questions and nuances come in.

    With you was, lazy Staurus.

    Robots.txt is a special file located in the site's root directory. The webmaster specifies in it which pages and data to close from indexing from search engines. The file contains directives that describe access to sections of the site (the so-called exclusion standard for robots). For example, it can be used to set various access settings for search robots designed for mobile devices and regular computers. It is very important to set it up correctly.

    Is robots.txt necessary?

    Option 2:

    This option assumes that your site already has robots.txt at the root of the site.

    On the left, select Tools > Robots.txt analysis

    Do not forget that all the changes that you make to the robots.txt file will not be available immediately, but only after some time.

    Checking robots.txt for the Google crawler

    1. In Google Search Console, select your site, go to the inspection tool, and view the contents of the robots.txt file. Syntax and logic errors in it will be highlighted, and their number will be shown under the editing window.
    2. At the bottom of the interface page, enter the desired URL in the corresponding window.
    3. From the drop-down menu on the right, select robot.
    4. Click the button VERIFY.
    5. Status will be displayed AVAILABLE or NOT AVAILABLE. In the first case, Googlebots can go to the address you specify, but in the second case, they cannot.
    6. If necessary, make changes to the menu and check again. Attention! These fixes will not be automatically added to the robots.txt file on your site.
    7. Copy the modified content and add it to the robots.txt file on your web server.

    In addition to the verification services from Yandex and Google, there are many other online robots.txt validators.

    robots.txt generators

    1. Service from SEOlib.ru.
      With this tool, you can quickly create and test the restrictions in your Robots.txt file.
    2. Generator from pr-cy.ru.
      As a result of the Robots.txt generator, you will receive text that must be saved to a file called Robots.txt and uploaded to the root directory of your site.
