Crawler directives

The Coveo Web source crawler behaves similarly to bots of web search engines such as Google. The crawler only needs a Site URL and then discovers other web pages by following the site navigation and hyperlinks appearing on the pages. Bots (including the Coveo Web source crawler) can have a negative impact on the performance of the targeted website. This is why mechanisms (i.e., directives) were developed to let websites and web pages provide crawlers with indexing instructions.

By default, when indexing the content of a website, the Coveo crawler obeys all directives it encounters. This can represent an obstacle if, for example, you want to use the Coveo Web source to index content that Google doesn’t. Only as a last resort should you consider configuring the Web source to ignore a directive. Instead, you should leverage the fact that directives can usually be given on a per-crawler basis and configure coveobot-specific directives.

The goal of this article is to provide information on the crawler directives the Web source can override and, for each, the recommended way to give the Coveo crawler more freedom.

The "Respect Robots.txt Directives" setting

The Web source Respect Robots.txt Directives setting pertains to the robots.txt file. This file specifies Allow/Disallow directives that tell a crawler which parts of the website it should or shouldn’t visit.

If you need to give greater access to the coveobot crawler, consider adding a User-agent: coveobot section to your robots.txt file. See The User-Agent Line and Simple Example for further details.
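
To illustrate how a crawler-specific section changes the outcome, here is a minimal sketch that uses Python's standard `urllib.robotparser` module as a stand-in for a crawler. The robots.txt content, the `/private/` path, the example URL, and the `can_crawl` helper are all hypothetical, and the Coveo crawler's own matching logic may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# but a dedicated coveobot section lifts that restriction.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: coveobot
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/private/report.html"
    print("coveobot: ", can_crawl("coveobot", page))
    print("googlebot:", can_crawl("googlebot", page))
```

With this file, the generic `User-agent: *` rules still apply to every other crawler, so nothing changes for web search engines.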

The Web source crawler is coded to request no more than one page per second. If the site robots.txt file includes a Crawl-delay directive with a different value, the slower of the two crawling speeds applies.
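
The crawl-delay rule can be read as taking the larger of two delays. The sketch below assumes that interpretation; `COVEO_MIN_DELAY_SECONDS` and `effective_delay` are illustrative names, not part of any Coveo API.

```python
from typing import Optional

# From the text: the crawler requests at most one page per second.
COVEO_MIN_DELAY_SECONDS = 1.0


def effective_delay(crawl_delay: Optional[float]) -> float:
    """Return the delay between requests, in seconds.

    crawl_delay is the robots.txt Crawl-delay value, or None when the
    file has no such directive. The slower speed (longer delay) wins.
    """
    if crawl_delay is None:
        return COVEO_MIN_DELAY_SECONDS
    return max(COVEO_MIN_DELAY_SECONDS, float(crawl_delay))
```

Under this reading, a Crawl-delay of 5 slows the crawler to one page every five seconds, while a Crawl-delay of 0.2 has no effect because the built-in one-second limit is already slower.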

The "Respect Noindex Directives" and "Respect Nofollow Directives" settings

The Web source Respect Noindex Directives and Respect Nofollow Directives settings are grouped here because their related directives are often used together in a website implementation. The noindex and nofollow directives apply to an entire HTML page. They’re case-insensitive and can appear:

  • In the <head> section of an HTML page as the content attribute value of a <meta> tag (e.g., <meta name="robots" content="noindex">).

  • In the web server X-Robots-Tag HTTP response header following the request for a given page.

    [Figure: Response header that contains an X-Robots-Tag]

The noindex directive instructs crawlers not to index the page. The nofollow directive instructs crawlers not to follow any of the links of the page.

Whether using <meta> tags or the X-Robots-Tag HTTP response headers, there are shorthand content values. For example:

  • <meta name="robots" content="none"> is equivalent to <meta name="robots" content="noindex, nofollow">.

  • <meta name="robots" content="all"> means that there are no indexing and no link following restrictions.

Meta directives with name="robots" and HTTP response header directives without a specified crawler name apply to all crawlers. However, if you want some directives to apply to the Coveo crawler only, you can set the name property to coveobot. As a result, whenever your page contains directives specifically intended for the Coveo crawler, the crawler follows these instructions and ignores the general, all-robot directives.
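
The precedence described above can be sketched in a few lines of Python. This is an illustration of the rule under stated assumptions, not the Coveo crawler's implementation: `RobotsMetaParser`, `SHORTHANDS`, and `effective_directives` are hypothetical names, and only `<meta>` tags (not the X-Robots-Tag header) are handled.

```python
from html.parser import HTMLParser

# Shorthand content values and the directives they stand for.
SHORTHANDS = {"none": {"noindex", "nofollow"}, "all": set()}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="coveobot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name not in ("robots", "coveobot"):
            return
        found = self.directives.setdefault(name, set())
        # Directives are case-insensitive and comma-separated.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                found |= SHORTHANDS.get(token, {token})


def effective_directives(html: str) -> set:
    """Return the directives that apply to coveobot for this page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Crawler-specific directives replace the general ones entirely.
    if "coveobot" in parser.directives:
        return parser.directives["coveobot"]
    return parser.directives.get("robots", set())
```

An empty result means no restriction applies: the page can be indexed and its links followed.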

Example: Coveo-specific directives with a <meta> tag

With the following page <meta> tag, all robots are instructed not to index the page and not to follow the links in it, except the coveobot crawler:

<meta name="robots" content="nofollow, noindex" />
<meta name="coveobot" content="all" />

Example: Coveo-specific directives with an X-Robots-Tag response header

With the following X-Robots-Tag HTTP header, all robots are instructed not to index the page and not to follow the links in it, except the coveobot crawler:

HTTP/1.1 200 OK
Date: Mon, 10 Apr 2023 15:08:11 GMT
X-Robots-Tag: nofollow, noindex, coveobot: all

The "Respect Nofollow Anchors" setting

Whereas <meta> tag directives apply to an entire page, the Web source Respect Nofollow Anchors setting pertains to nofollow directives specified on individual anchor tags (i.e., <a rel="nofollow" …​> tags).


This link shouldn’t be followed by robots: <a href="signin.php" rel="nofollow">sign in</a>.
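
As a rough sketch of what respecting or ignoring anchor-level nofollow means during link discovery, the following Python collects the links a crawler would follow on a page. `LinkCollector`, `links_to_follow`, and the `respect_nofollow_anchors` flag are hypothetical names that mirror the setting for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, optionally skipping rel="nofollow"."""

    def __init__(self, respect_nofollow_anchors: bool = True):
        super().__init__()
        self.respect_nofollow_anchors = respect_nofollow_anchors
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel holds space-separated tokens; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if self.respect_nofollow_anchors and "nofollow" in rel_tokens:
            return
        self.links.append(href)


def links_to_follow(html: str, respect_nofollow_anchors: bool = True) -> list:
    """Return the links a crawler would follow on this page, in document order."""
    collector = LinkCollector(respect_nofollow_anchors)
    collector.feed(html)
    return collector.links
```

Note that skipping a link only stops discovery through that anchor. The target page can still be reached, and indexed, through any other link that points to it.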

There’s no way to change the anchor tag to make the rel="nofollow" attribute target a specific crawler. However, if you’re using the rel="nofollow" attribute to prevent the target page from being indexed, you may want to consider using the mechanisms mentioned above instead (see Evolving "nofollow" - new ways to identify the nature of links).