The Coveo Web source crawler behaves similarly to bots of web search engines such as Google. The crawler only needs a Site URL and then discovers other web pages by following the site navigation and hyperlinks appearing on the pages. Bots (including the Coveo Web source crawler) can have a negative impact on the performance of the targeted website. This is why mechanisms (i.e., directives) were developed to let websites and web pages provide crawlers with indexing instructions.
By default, when indexing the content of a website, the Coveo crawler obeys all directives it encounters.
This can represent an obstacle if, for example, you want to use the Coveo Web source to index content that Google doesn’t.
Only as a last resort should you consider configuring the Web source to ignore a directive.
Instead, you should leverage the fact that directives can usually be given on a per-crawler basis and configure your website to give the Coveo crawler greater access.
This article describes the crawler directives the Web source can override and, for each, the recommended way to give the Coveo crawler more freedom.
The "Respect Robots.txt Directives" setting
The Web source Respect Robots.txt Directives setting pertains to the robots.txt file.
This file specifies Allow and Disallow directives that tell a crawler which parts of the website it should or shouldn’t visit.
If you need to give greater access to the coveobot crawler, consider adding a User-agent: coveobot section to your robots.txt file.
See The User-Agent Line and Simple Example for further details.
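The effect of a dedicated User-agent: coveobot section can be checked locally before deploying it. The following is a minimal sketch that uses Python's standard urllib.robotparser module, not the Coveo crawler itself, so it only approximates how a directive-respecting crawler would read the file. The robots.txt content, the /private/ path, the www.example.com URL, and the is_allowed helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# but a dedicated section gives coveobot access to the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: coveobot
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/private/report.html"
    print("coveobot:", is_allowed("coveobot", url))
    print("othercrawler:", is_allowed("othercrawler", url))
```

With this sample file, only coveobot is allowed to fetch pages under /private/, while other crawlers keep the restrictions of the general section.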
The Web source crawler is coded to request no more than one page per second.
If the site
The "Respect Noindex Directives" and "Respect Nofollow Directives" settings
The Web source Respect Noindex Directives and Respect Nofollow Directives settings are grouped here because their related directives are often used together in a website implementation.
The noindex and nofollow directives apply to an entire HTML page.
They’re case-insensitive and can appear:
In the <head> section of an HTML page as the content attribute value of a <meta> tag (e.g., <meta name="robots" content="noindex">).
In the web server X-Robots-Tag HTTP response header following the request for a given page.
The noindex directive instructs crawlers not to index the page.
The nofollow directive instructs crawlers not to follow any of the links of the page.
For both <meta> tags and X-Robots-Tag HTTP response headers, there are shorthand values:
<meta name="robots" content="none"> is equivalent to <meta name="robots" content="noindex, nofollow">.
<meta name="robots" content="all"> means that there are no indexing and no link-following restrictions.
Meta directives with name="robots" and HTTP response header directives without a specified crawler name apply to all crawlers.
However, if you want some directives to apply to the Coveo crawler only, you can set the name attribute to coveobot.
As a result, whenever your page contains directives specifically intended for the Coveo crawler, the crawler follows these instructions and ignores the general, all-robot directives.
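The precedence rule described above (crawler-specific directives replace the general ones) can be sketched in a few lines of Python. This is an illustrative model built on the standard html.parser module, not the Coveo crawler's actual implementation. The names MetaDirectiveParser, parse_content, and effective_directives are hypothetical, and the sketch assumes that several tags with the same name are simply merged.

```python
from html.parser import HTMLParser

# Shorthand values expand to the directives they stand for.
SHORTHAND = {"none": {"noindex", "nofollow"}, "all": set()}


def parse_content(content):
    """Turn a content attribute value into a set of lowercase directives."""
    directives = set()
    for token in content.split(","):
        token = token.strip().lower()
        if token:
            directives |= SHORTHAND.get(token, {token})
    return directives


class MetaDirectiveParser(HTMLParser):
    """Collect directives from <meta> tags whose name is in `names`."""

    def __init__(self, names):
        super().__init__()
        self.names = names
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").strip().lower()
        if name in self.names and "content" in attributes:
            found = parse_content(attributes["content"])
            self.directives.setdefault(name, set()).update(found)


def effective_directives(html, crawler="coveobot"):
    """Return the directives that apply to `crawler` for this page."""
    crawler = crawler.lower()
    parser = MetaDirectiveParser({"robots", crawler})
    parser.feed(html)
    # Crawler-specific directives replace the general "robots" ones.
    if crawler in parser.directives:
        return parser.directives[crawler]
    return parser.directives.get("robots", set())
```

In this model, an empty set means the crawler has no indexing or link-following restrictions on the page.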
With the following page <meta> tags, all robots are instructed not to index the page and not to follow the links in it, except the Coveo crawler:
<meta name="robots" content="nofollow, noindex" />
<meta name="coveobot" content="all" />
With the following X-Robots-Tag HTTP header, all robots are instructed not to index the page and not to follow the links in it, except the Coveo crawler:
HTTP/1.1 200 OK
Date: Mon, 10 April 2023 15:08:11 GMT
(…)
X-Robots-Tag: nofollow, noindex, coveobot: all
(…)
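One way to read such a header value is sketched below in Python. It is only an approximation under stated assumptions, not the Coveo crawler's parsing logic: the sketch assumes that tokens are comma-separated, that any "name:" prefix introduces a crawler name, and that the directives following a prefix belong to that crawler. Directives whose own syntax contains a colon are therefore not handled. The function names parse_x_robots_tag and directives_for are hypothetical.

```python
import re

# Shorthand values expand to the directives they stand for.
SHORTHAND = {"none": {"noindex", "nofollow"}, "all": set()}

# Assumption: a "name:" prefix on a token introduces a crawler name.
AGENT_PREFIX = re.compile(r"^([a-z0-9_-]+)\s*:\s*(.*)$")


def parse_x_robots_tag(value):
    """Group directives by crawler name; "*" holds the unnamed ones."""
    groups = {"*": set()}
    current = "*"
    for token in value.lower().split(","):
        token = token.strip()
        match = AGENT_PREFIX.match(token)
        if match:
            current, token = match.group(1), match.group(2).strip()
            groups.setdefault(current, set())
        if token:
            groups[current].update(SHORTHAND.get(token, {token}))
    return groups


def directives_for(value, crawler="coveobot"):
    """Return the directives that apply to `crawler`."""
    groups = parse_x_robots_tag(value)
    # Crawler-specific directives replace the general ones.
    return groups.get(crawler.lower(), groups["*"])


if __name__ == "__main__":
    header = "nofollow, noindex, coveobot: all"
    print(sorted(directives_for(header, "coveobot")))
    print(sorted(directives_for(header, "otherbot")))
```

Under these assumptions, the sample header leaves coveobot unrestricted while every other crawler receives both noindex and nofollow.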
The "Respect Nofollow Anchors" setting
Whereas <meta> tag directives apply to an entire page, the Web source Respect Nofollow Anchors setting pertains to nofollow directives specified on individual anchor tags (i.e., <a rel="nofollow" …> tags).
For example, this link shouldn’t be followed by robots: <a href="signin.php" rel="nofollow">sign in</a>.
There’s no way to change the anchor tag to make the rel="nofollow" attribute target a specific crawler.
However, if you’re using the rel="nofollow" attribute to prevent the target page from being indexed, you may want to consider using the mechanisms mentioned above instead (see Evolving "nofollow" - new ways to identify the nature of links).
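The anchor-level behavior covered in this section can be illustrated with a short Python sketch based on the standard html.parser module. It is not the Coveo crawler's code. The LinkCollector class, the links_to_follow function, and the respect_nofollow_anchors parameter (named after the setting purely for readability) are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values, optionally skipping rel="nofollow" anchors."""

    def __init__(self, respect_nofollow_anchors=True):
        super().__init__()
        self.respect_nofollow_anchors = respect_nofollow_anchors
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = {key: (value or "") for key, value in attrs}
        href = attributes.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. rel="nofollow noopener".
        rel_tokens = attributes.get("rel", "").lower().split()
        if self.respect_nofollow_anchors and "nofollow" in rel_tokens:
            return
        self.links.append(href)


def links_to_follow(html, respect_nofollow_anchors=True):
    """Return the links a crawler would follow on this page."""
    collector = LinkCollector(respect_nofollow_anchors)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = (
        '<a href="signin.php" rel="nofollow">sign in</a> '
        '<a href="docs.html">docs</a>'
    )
    print(links_to_follow(page))
    print(links_to_follow(page, respect_nofollow_anchors=False))
```

Because the decision depends only on the anchor's rel attribute, the sketch has no input that identifies the crawler, which mirrors the limitation described above.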