Crawler directives

Coveo’s Web source crawler behaves much like the bots of web search engines such as Google. The crawler only needs a starting URL; it then discovers other web pages by following the site navigation and the hyperlinks appearing on those pages. Because bots (including Coveo’s Web source crawler) can have a negative impact on the performance of the targeted website, mechanisms (that is, directives) were developed to let websites and web pages provide crawlers with indexing instructions.

By default, when indexing the content of a website, the Coveo crawler obeys all directives it encounters. This can represent an obstacle if, for example, you want to use the Coveo Web source to index content that Google doesn’t. Only as a last resort should you consider configuring the Web source to override a directive. Instead, you should leverage the fact that directives can usually be given on a per-crawler basis and configure coveobot-specific directives.

The goal of this article is to describe the crawler directives the Web source can override and, for each, the recommended way to give the Coveo crawler more freedom.

The "robots.txt" override setting

The Web source robots.txt directives override setting pertains to the robots.txt file. This file specifies Allow/Disallow directives that tell a crawler which parts of the website it should or shouldn’t visit.

If you need to give greater access to the coveobot crawler, consider adding a User-agent: coveobot section to your robots.txt file. See The User-Agent Line and Simple Example for further details.
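
For example, a robots.txt file along the following lines (the /internal/ path is purely illustrative) blocks other crawlers from a section of the site while leaving coveobot unrestricted:

# All other crawlers: stay out of /internal/
User-agent: *
Disallow: /internal/

# Coveo crawler: no restrictions
User-agent: coveobot
Disallow:

Because a crawler obeys only the record that best matches its user agent, coveobot ignores the User-agent: * record entirely.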

The Web source crawler is coded to request no more than one page per second. If the site robots.txt file includes a Crawl-delay directive with a different value, the slower of the two crawling speeds applies.
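
For instance, assuming the common interpretation of Crawl-delay as a minimum number of seconds between successive requests, the following hypothetical record would slow the Coveo crawler down to one page every 10 seconds, since that rate is slower than its default of one page per second:

User-agent: coveobot
Crawl-delay: 10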

The "noindex" and "nofollow links" override settings

The Web source noindex and nofollow links directives override settings are grouped here because their related directives are often used together in website implementations. The noindex and nofollow directives apply to an entire HTML page. They’re case-insensitive and can appear:

  • In the <head> section of an HTML page as the content attribute value of a <meta> tag (for example, <meta name="robots" content="noindex">).

  • In the X-Robots-Tag HTTP response header that the web server returns for a given page.

The noindex directive instructs crawlers not to index the page. The nofollow directive instructs crawlers not to follow any of the links on the page.

Whether you use <meta> tags or X-Robots-Tag HTTP response headers, shorthand content values are available. For example:

  • <meta name="robots" content="none"> is equivalent to <meta name="robots" content="noindex, nofollow">.

  • <meta name="robots" content="all"> means that there are no indexing or link-following restrictions.

Meta directives with name="robots" and HTTP response header directives without a specified crawler name apply to all crawlers. However, if you want some directives to apply to the Coveo crawler only, you can set the <meta> tag name attribute to coveobot or prefix the header directives with coveobot:. As a result, whenever your page contains directives specifically intended for the Coveo crawler, the crawler follows these instructions and ignores the general, all-robot directives.

Example: Coveo-specific directives with a <meta> tag

With the following page <meta> tags, all robots except the coveobot crawler are instructed not to index the page and not to follow the links in it:

<meta name="robots" content="nofollow, noindex" />
<meta name="coveobot" content="all" />

Example: Coveo-specific directives with an X-Robots-Tag response header

With the following X-Robots-Tag HTTP headers, all robots except the coveobot crawler are instructed not to index the page and not to follow the links in it:

HTTP/1.1 200 OK
Date: Mon, 10 Apr 2023 15:08:11 GMT
(…)
X-Robots-Tag: nofollow, noindex
X-Robots-Tag: coveobot: all
(…)

The "nofollow anchors" override setting

Whereas the <meta> tag directives apply to an entire page, the Web source nofollow anchors directive override setting pertains to nofollow directives specified on individual anchor tags (that is, <a rel="nofollow" …> tags).

Example

This link shouldn’t be followed by robots: <a href="signin.php" rel="nofollow">sign in</a>.

There’s no way to change the anchor tag to make the rel="nofollow" attribute target a specific crawler. However, if you’re using the rel="nofollow" attribute to prevent the target page from being indexed, you may want to consider using the mechanisms mentioned above instead (see Evolving "nofollow" - new ways to identify the nature of links).
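
For instance, continuing the hypothetical signin.php example, you could remove the rel="nofollow" attribute from the anchor and instead add page-level directives to signin.php itself, keeping the page out of other crawlers' indexes while letting the Coveo crawler process it:

<meta name="robots" content="noindex, nofollow">
<meta name="coveobot" content="all">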