Understanding Crawling Performance

Crawling performance is defined as the number of items that a given connector processes per hour. Many factors influence crawling speed, so a low number of processed items is not necessarily a sign of a problem.

Transferred Data

Available Bandwidth

(Mostly relevant for Crawling Module and on-premises sources) Depending on your configuration, the network bandwidth available to the server hosting the connectors or to the targeted system can limit crawling performance. If the connector downloads large amounts of data, the available bandwidth can become the bottleneck. The same issue can occur when the crawler is physically far from the server hosting the data being indexed.

On a Windows server, you can check how much bandwidth the server is using in Task Manager (Performance tab > Network section).

For example, the effective bandwidth is typically lower when the connector is in New York and the server being indexed is in Los Angeles.
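
If you suspect that bandwidth is the bottleneck, one quick test is to time a download from the server hosting the content. The following is a minimal Python sketch, assuming an HTTP-accessible test file on or near that server; the URL is a placeholder.

```python
import time
import urllib.request

# Minimal sketch: estimate effective throughput between the connector host
# and the content server by timing the download of a reasonably large file.
# TEST_URL is a placeholder; point it at a file hosted on (or near) the
# server being crawled.
TEST_URL = "https://content-server.example.com/some-large-file.bin"

start = time.monotonic()
with urllib.request.urlopen(TEST_URL) as response:
    data = response.read()
elapsed = time.monotonic() - start

mbits = len(data) * 8 / 1_000_000
print(f"Downloaded {len(data)} bytes in {elapsed:.1f} s "
      f"(~{mbits / elapsed:.1f} Mbit/s effective)")
```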

Solutions

  • Improve your internet connection.

  • Host the connector closer to the data source or vice versa.

Size of Items

Some repositories contain very large items that the connector must download. If the connector retrieves videos, huge PDFs, and other files weighing several megabytes, it spends most of its time downloading these files rather than crawling items efficiently. In such cases, expect a much lower number of processed items per hour.
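
To estimate how much heavy content a repository contains before crawling it, you can inventory the files above a size threshold. Below is a minimal Python sketch for a file share; the root path and threshold are placeholders.

```python
from pathlib import Path

# Minimal sketch: list files above a size threshold in a repository folder
# to estimate how much heavy content the connector will have to download.
# ROOT is a placeholder; point it at the content you plan to crawl.
ROOT = Path(r"\\fileserver\share")
THRESHOLD_MB = 50

heavy = [
    (p, p.stat().st_size / 1_000_000)
    for p in ROOT.rglob("*")
    if p.is_file() and p.stat().st_size > THRESHOLD_MB * 1_000_000
]

# Print the 20 largest items, biggest first.
for path, size_mb in sorted(heavy, key=lambda x: -x[1])[:20]:
    print(f"{size_mb:8.1f} MB  {path}")
```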

Solutions

  • Where the connector supports it, exclude the heavy content you do not need (e.g., with file type or size filters) so that the connector skips those downloads.

Server Performance

Server Responsiveness and Latency

Connectors crawl live systems that are installed and maintained by a variety of companies. Two SharePoint 2016 servers installed at two different locations will most likely not offer the same performance.

If the server is not responsive, connector calls take much longer to execute, which significantly decreases crawling speed.

The following factors can influence server performance:

  • Load

    If many users are using the system while the connector crawls it, system performance can suffer. The server can also be under high load when it contains more data than recommended for its infrastructure.

  • (Only for on-premises servers) Infrastructure

    Most systems are best installed on dedicated Windows servers or virtual machines.

Solutions

  • Validate your server infrastructure.

    (For Coveo On-Premises Crawling Module servers only) Ensure that your server meets the hardware and software requirements (see Requirements).

  • Create source schedules to crawl outside of peak hours (see Edit a Source Schedule).

API Performance

Connector performance also relies heavily on the performance of the APIs of the systems it crawls, regardless of your server configuration. Certain APIs have limited capacity, and that limitation caps the number of items per hour the related connectors can process.

Solution

There is no direct solution; you can only mitigate the other factors affecting crawling performance to lessen the impact of the system APIs.

Throttling

Cloud services, and even on-premises servers, implement throttling to protect their APIs. Throttling means that some connector requests are refused or slowed down. Its effect on connectors is drastic, and there is generally no way to bypass it.

Exchange Online has strict throttling policies that are not configurable, which is one of the reasons why Coveo does not recommend crawling emails. The amount of email content is huge and keeps increasing, but crawling performance remains poor due to Exchange throttling policies.

Solution

Some systems have throttling policies that you can modify to allow Coveo connectors to bypass them. Contact Coveo Support for more information.

The Web connector self-throttles to one request per second to follow web crawling politeness norms. If you want to crawl your own website and do not mind increasing the load, you can unlock the crawling speed (see the Crawling Limit Rate parameter).
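
To illustrate the idea, here is a conceptual Python sketch of a one-request-per-second rate limiter. It is not Coveo's implementation, only an illustration of politeness throttling.

```python
import time

class RateLimiter:
    """Allow at most `rate` operations per second (politeness throttling)."""

    def __init__(self, rate: float = 1.0):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def wait(self):
        # Sleep just long enough to keep requests `min_interval` apart.
        now = time.monotonic()
        sleep_for = self.last_call + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last_call = time.monotonic()

# One request per second, as the Web connector does by default.
limiter = RateLimiter(rate=1.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()
    print(f"fetching {url}")  # the actual HTTP request would go here
```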

Frequency of Errors

Errors are also a source of slowness. When the connector hits an error, it determines whether it can retry the failed operation. If so, it usually waits a few seconds, then tries the operation again to get all the content requested by the source configuration. However, because of the factors mentioned in the Server Performance section, as well as limiting system APIs, a high number of errors can lead to a lot of “waiting for retry” time, which ultimately impacts crawling performance.
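
Conceptually, the retry behavior works like the following Python sketch. It is an illustration, not the connectors' actual code; the retryable error type and wait time are assumptions.

```python
import time

def crawl_with_retries(operation, max_retries: int = 3, wait_seconds: float = 5.0):
    """Run `operation`, retrying after a short wait on transient errors.

    Conceptual sketch of the retry behavior described above; real
    connectors decide per error type whether a retry is worthwhile.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:  # stand-in for a retryable error
            if attempt == max_retries:
                raise
            # Each retry adds "waiting for retry" time, which is why
            # frequent errors noticeably slow down crawling.
            time.sleep(wait_seconds)
```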

Solution

Depending on the type of errors:

  • If the errors persist or are internal to the Coveo Cloud platform, contact Coveo Support.

  • Otherwise, follow the instructions provided with the error codes to fix the errors in your source configuration or on the server side.

Source Configuration

The source configuration can also have a huge impact on total crawling time and crawling speed.

Number of Refresh Threads

Most connectors (sources) have a default value of 2 refresh threads. This default prevents the connector from being throttled. However, if you are crawling a system that can take more load (e.g., powerful infrastructure or light user load), consider increasing this parameter value to improve performance.

Add the NumberOfRefreshThreads hidden parameter in the source JSON configuration, and incrementally increase its value up to 8 threads, which is the value usually providing the most significant performance benefits (see Source JSON Modification Examples).
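
For illustration, the following Python sketch adds the parameter to a source JSON configuration saved locally as source.json. The exact layout of the parameters object varies by connector, so treat the shape below as an assumption and refer to Source JSON Modification Examples for the authoritative format.

```python
import json

# Minimal sketch: add the NumberOfRefreshThreads hidden parameter to a
# source JSON configuration exported from the Coveo Administration Console.
with open("source.json") as f:
    config = json.load(f)

# Assumed shape: each hidden parameter is an entry under "parameters".
# Increase incrementally (e.g., 2 -> 4 -> 8) and monitor the effect.
config.setdefault("parameters", {})["NumberOfRefreshThreads"] = {"value": "4"}

with open("source.json", "w") as f:
    json.dump(config, f, indent=2)
```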

Increasing the load on the server can lead to throttling and have an adverse effect. Monitor your change to confirm that its impact is beneficial.

Number of Sources

As with the number of threads, increasing the number of concurrent sources targeting a system is another way to improve performance. If the content you want to index can be split into multiple sources, having several sources crawl at the same time multiplies total throughput. However, it also multiplies the server load, the risk of throttling, and so on.

Splitting content into multiple sources also lets you configure source schedules with more granularity. For example, one portion of your content may change frequently and warrant a high refresh frequency, while other content almost never changes and can be refreshed less often.

Enabled Options

Some connectors have content options that can impact performance.

If you do not need certain content types, do not enable the related options.

For example, most source configurations allow you to Retrieve Comments (among other similar options). Enabling such features has a performance cost, because the connector may have to perform additional API calls to retrieve the extra content.

Differentiating Crawling Operations

Once the source is created, you can update it with different operations (see Refresh VS Rescan VS Rebuild for more details).

Rebuild and Rescan

Rebuild and rescan are operations that attempt to retrieve all the content requested by the source configuration. The difference is that a rebuild takes significantly more time to execute.

Avoid rebuilding a source without a good reason, since a rebuild takes longer than a rescan, which in turn takes longer than a refresh.

Refresh

Refresh is the fastest operation; it retrieves only the delta of item changes since the last source operation. Schedule refreshes at the highest frequency possible when the crawled system supports them. For sources that do not fully support refresh operations, a weekly scheduled rescan is recommended.
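
Conceptually, a refresh works like the following Python sketch: the connector remembers when the last operation ran and asks the system only for items modified since then. The in-memory item list is a toy stand-in for the crawled system's change-tracking API.

```python
from datetime import datetime, timezone

# Toy repository: (item_id, last_modified) pairs standing in for the
# crawled system's change-tracking API.
ITEMS = [
    ("doc-1", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    ("doc-2", datetime(2024, 3, 5, tzinfo=timezone.utc)),
]

def refresh(last_operation_time):
    """Conceptual refresh: process only the delta since the last operation."""
    delta = [item for item, modified in ITEMS if modified > last_operation_time]
    for item_id in delta:
        print(f"re-indexing {item_id}")
    return datetime.now(timezone.utc)  # becomes the next baseline

last_run = datetime(2024, 2, 1, tzinfo=timezone.utc)
last_run = refresh(last_run)  # only doc-2 changed since February 1
```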

Prepare Your Expectations

If you plan to configure a large source, set your expectations accordingly. As a rule of thumb, the more items there are to index, the longer the initial rebuild takes. For example, at 10,000 items per hour, a source of 1,000,000 items needs roughly 100 hours (about four days) for its initial rebuild.

Still Too Slow?

If crawling performance still seems too slow, consider opening a maintenance case so that the Coveo Cloud team can investigate your source configurations and Coveo On-Premises Crawling Module servers. Make sure to describe the actions you took to fine-tune crawling performance, since the first investigation steps are the ones listed in this article.