Crawler directives
Coveo’s Web source crawler behaves similarly to bots of web search engines such as Google. The crawler only needs a starting URL and then discovers other web pages by following the site navigation and hyperlinks appearing on the pages. Bots (including Coveo’s Web source crawler) can have a negative impact on the performance of the targeted website. This is why mechanisms (that is, directives) were developed to let websites and web pages provide crawlers with indexing instructions.
By default, when indexing the content of a website, the Coveo crawler obeys all directives it encounters.
This can represent an obstacle if, for example, you want to use the Coveo Web source to index content that Google doesn’t.
Only as a last resort should you consider configuring the Web source to override a directive.
Instead, you should leverage the fact that directives can usually be given on a per-crawler basis and configure coveobot-specific directives.
This article describes the crawler directives the Web source can override and, for each, the recommended way to grant the Coveo crawler more freedom instead.
The "robots.txt" override setting
The Web source robots.txt directives override setting pertains to the robots.txt file.
This file specifies Allow/Disallow directives that tell a crawler which parts of the website it should or shouldn’t visit.
If you need to give greater access to the coveobot crawler, consider adding a User-agent: coveobot section to your robots.txt file.
See The User-Agent Line and Simple Example for further details.
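For instance, the following robots.txt entries, assuming a hypothetical /archive/ section, keep other crawlers out of that section while letting coveobot crawl the entire site:
# Other crawlers must stay out of the hypothetical /archive/ section.
User-agent: *
Disallow: /archive/

# coveobot may crawl the entire site.
User-agent: coveobot
Disallow: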
Note
The Web source crawler is coded to request no more than one page per second.
The "noindex" and "nofollow links" override settings
The Web source noindex and nofollow links directives override settings are grouped here because their related directives are often used together in a website implementation.
The noindex and nofollow directives apply to an entire HTML page.
They’re case-insensitive and can appear:
- In the <head> section of an HTML page, as the content attribute value of a <meta> tag (for example, <meta name="robots" content="noindex">).
- In the web server X-Robots-Tag HTTP response header following the request for a given page.
The noindex directive instructs crawlers not to index the page.
The nofollow directive instructs crawlers not to follow any of the links on the page.
Whether using <meta> tags or the X-Robots-Tag HTTP response headers, there are shorthand content values.
For example:
- <meta name="robots" content="none"> is equivalent to <meta name="robots" content="noindex, nofollow">.
- <meta name="robots" content="all"> means that there are no indexing and no link following restrictions.
Meta directives with name="robots" and HTTP response header directives without a specified crawler name apply to all crawlers.
However, if you want some directives to apply to the Coveo crawler only, you can set the name property to coveobot.
As a result, whenever your page contains directives specifically intended for the Coveo crawler, the crawler follows these instructions and ignores the general, all-robot directives.
With the following page <meta> tag, all robots are instructed not to index the page and not to follow the links in it, except the coveobot crawler:
<meta name="robots" content="nofollow, noindex" />
<meta name="coveobot" content="all" />
With the following X-Robots-Tag HTTP header, all robots are instructed not to index the page and not to follow the links in it, except the coveobot crawler:
HTTP/1.1 200 OK
Date: Mon, 10 Apr 2023 15:08:11 GMT
(…)
X-Robots-Tag: nofollow, noindex, coveobot: all
(…)
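How the X-Robots-Tag header is added depends on your web server. As a minimal sketch, assuming an Apache server with mod_headers enabled, the header above could be produced with:
# Apache (mod_headers): send the directives with every response.
Header set X-Robots-Tag "nofollow, noindex, coveobot: all"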
The "nofollow anchors" override setting
Whereas the <meta> tag directives apply to an entire page, the Web source nofollow anchors directive override setting pertains to nofollow directives specified on individual anchor tags (that is, <a rel="nofollow" …> tags).
For example, this link shouldn’t be followed by robots: <a href="signin.php" rel="nofollow">sign in</a>.
There’s no way to change the anchor tag to make the rel="nofollow" attribute target a specific crawler.
However, if you’re using the rel="nofollow" attribute to prevent the target page from being indexed, you may want to consider using the mechanisms mentioned above instead (see Evolving "nofollow" - new ways to identify the nature of links).