Understanding Crawling Speed

Crawling speed is defined as the number of items that a given connector processes per hour. Many factors influence crawling speed, so a low number of processed items per hour isn’t necessarily a sign that something is wrong.

Transferred Data

Available Bandwidth

Depending on your configuration, the network bandwidth available to the server hosting the connectors or to the targeted system can limit crawling performance. This is particularly relevant for Crawling Module and on-premises sources. If the connector has to download large amounts of data, the available bandwidth can become a significant performance bottleneck. The same issue can occur when the crawler is physically far from the server hosting the data it’s indexing.

To see how much of the server bandwidth you’re using on a Windows server, check the Network section in Task Manager (Task Manager > Performance tab > Network section).
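If you prefer a scriptable alternative to Task Manager, the following minimal sketch (assuming Python 3 and the psutil package are available on the server) samples overall network throughput:

```python
# Minimal sketch: sample server-wide network throughput with psutil.
# Assumes Python 3 and the psutil package (pip install psutil).
import time
import psutil

INTERVAL_S = 5  # sampling window in seconds

before = psutil.net_io_counters()
time.sleep(INTERVAL_S)
after = psutil.net_io_counters()

# Convert bytes over the interval to megabits per second.
recv_mbps = (after.bytes_recv - before.bytes_recv) * 8 / INTERVAL_S / 1_000_000
sent_mbps = (after.bytes_sent - before.bytes_sent) * 8 / INTERVAL_S / 1_000_000
print(f"Download: {recv_mbps:.1f} Mbps, Upload: {sent_mbps:.1f} Mbps")
```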

Effective network throughput is typically lower when the connector is in New York and the server being indexed is in Los Angeles, since every request also incurs cross-country latency.

Solutions

  • Improve your internet connection.

  • Host the connector closer to the data source or vice versa.

Size of Items

Some repositories contain large items that the connector has to download. For example, if the connector retrieves videos, large PDF files, and other files weighing several megabytes each, it can’t crawl items efficiently because most of its time is spent downloading these files. In such cases, you can expect a much smaller number of processed items per hour.
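To see why item size matters, consider a rough estimate (the numbers below are illustrative assumptions, not measured values):

```python
# Illustrative estimate: how item size caps download throughput.
# All numbers are assumptions for the sake of the example.
bandwidth_mbps = 100  # available bandwidth (megabits per second)
avg_item_mb = 5       # average item size (megabytes)

bandwidth_mb_per_s = bandwidth_mbps / 8               # = 12.5 MB/s
seconds_per_item = avg_item_mb / bandwidth_mb_per_s   # = 0.4 s of pure download time
max_items_per_hour = 3600 / seconds_per_item

# Prints 9000: the theoretical ceiling before any processing overhead,
# versus hundreds of thousands of items per hour for small documents.
print(f"{max_items_per_hour:.0f} items/hour at best")
```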

Solutions

  • Evaluate if all of the items that the connector is downloading are desirable and useful in the index. Adjust content types to filter out these items or index them by reference.

    Videos and images are probably not useful in the index unless you have particular use cases.

  • If you still have available bandwidth, you can increase the number of refresh threads to allow more simultaneous downloads (see Increase the Number of Refresh Threads below).

    You can monitor the total size of downloaded items in the Activity panel of the Administration Console.

Server Performance

Server Responsiveness and Latency

Connectors crawl live systems that are installed and maintained by a variety of companies. Two SharePoint 2019 servers installed at two different locations will most likely not offer the same performance.

If the server isn’t responsive, connector calls take much longer to execute, significantly decreasing the crawling speed. The following factors can influence server performance:

  • Load

    If a high number of users are using the system while the connector crawls it, the system performance can be impacted. The server can also be under high load when it contains more data than recommended for its infrastructure.

  • Infrastructure (only for on-premises servers)

    Most systems should be installed on a dedicated and complete Windows server or virtual machine.

Solutions

  • Validate your server infrastructure.

    (For Coveo On-Premises Crawling Module servers only) Ensure that your server meets the hardware and software requirements.

  • Create source schedules to crawl outside of peak hours.

API Performance

Connector performance also relies heavily on the performance of the APIs of the systems it crawls, regardless of your server configuration. Certain APIs have limited capacity, and that limitation caps the number of items per hour the related connectors can process.

Solution

There’s no direct solution; the only thing you can do is mitigate the other factors affecting crawling performance to lessen the impact of the system APIs.

Throttling

Cloud services, and even installed servers, implement throttling to protect their APIs. Throttling means that some connector requests are refused or slowed down. Its effect on crawling speed is drastic, and there’s no way to bypass it.

Exchange Online has strict throttling policies that aren’t configurable, which is one of the reasons why Coveo doesn’t recommend crawling emails. The amount of email content is huge and keeps increasing, but the crawling performance is poor due to Exchange throttling policies.
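To illustrate what throttling looks like from the client side, here’s a minimal sketch, not actual Coveo connector code, of a client honoring an HTTP 429 (Too Many Requests) response with the Python requests library:

```python
# Minimal sketch of how a well-behaved client reacts to throttling.
# Not Coveo connector code; assumes the requests package and any target URL.
import time
import requests

def fetch_with_throttle_handling(url: str) -> requests.Response:
    while True:
        response = requests.get(url)
        if response.status_code != 429:  # 429 Too Many Requests = throttled
            return response
        # Honor the server's Retry-After header (in seconds), defaulting to 10.
        wait_s = int(response.headers.get("Retry-After", 10))
        time.sleep(wait_s)  # every throttled call adds dead time to the crawl
```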

Solution

Some systems have throttling policies that you can modify to allow Coveo connectors to bypass them. Contact Coveo Support for more information.

The Web connector self-throttles to one request per second to respect web crawling politeness norms. If you want to crawl your own website and don’t mind increasing the load, you can unlock the crawling speed.
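Conceptually, this self-throttling is a simple politeness delay between requests, as in the following sketch (an illustration of the pattern, not the connector’s implementation):

```python
# Politeness throttling sketch: at most one request per second.
import time
import requests

MIN_INTERVAL_S = 1.0  # one request per second, the Web connector's default pace

def polite_crawl(urls):
    last_request = 0.0
    for url in urls:
        # Wait until at least MIN_INTERVAL_S has elapsed since the last request.
        elapsed = time.monotonic() - last_request
        if elapsed < MIN_INTERVAL_S:
            time.sleep(MIN_INTERVAL_S - elapsed)
        last_request = time.monotonic()
        yield requests.get(url)
```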

Frequency of Errors

Errors are also a source of slowness. When the connector hits an error, it determines whether it can retry the failed operation. If it can, it usually waits a few seconds, then tries the operation again to get all the content requested by the source configuration. However, because of the factors mentioned in the Server Performance section as well as limiting system APIs, a high number of errors can lead to a lot of “waiting for retry” time, which ultimately impacts the crawling performance.
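Retry waits add up quickly. The following sketch uses an illustrative fixed-wait retry loop (not the connectors’ actual retry logic) to show how:

```python
# Illustrative retry loop with a fixed wait; not Coveo's actual retry logic.
import time

RETRY_WAIT_S = 5   # assumed wait before each retry
MAX_RETRIES = 3    # assumed retry budget per operation

def crawl_item(fetch, item_id):
    for attempt in range(1 + MAX_RETRIES):  # 1 initial try + retries
        try:
            return fetch(item_id)
        except IOError:
            if attempt == MAX_RETRIES:
                raise  # budget exhausted, surface the error
            time.sleep(RETRY_WAIT_S)

# With these assumptions, 1,000 items that each fail once add
# 1,000 * 5 s, roughly 83 minutes, of pure waiting to the crawl.
```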

Solution

Depending on the type of errors:

  • When the errors persist or are internal to the Coveo Platform, contact Coveo Support.

  • Otherwise, follow the instructions provided with the error codes to fix errors in your source configurations or on the server side.

Source Configuration

The configuration of the source can also have a huge impact on the total crawling time as well as the crawling speed. A larger source size can cause the overall indexing process to slow down, especially if the source doesn’t support refreshes. If the source does support refreshes, it can still be slowed down if its content is constantly changing.

Manage Source Update Operations

A large source with many items to process can result in a rebuild that will take several hours or days. You can configure the source to stop the rebuild process after the necessary number of items have been indexed.

Increase the Number of Refresh Threads

Most connectors (sources) have a default value of 2 refresh threads. This conservative value prevents the connector from being throttled. However, if you’re crawling a system that can take more load (e.g., powerful infrastructure or light usage), you can consider increasing this parameter value to improve performance.

Add the NumberOfRefreshThreads hidden parameter in the parameters section of the source JSON configuration, and incrementally increase its value up to 8 threads, which is the value usually providing the most significant performance benefits (see Source JSON Modification Examples).
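As a sketch, the addition could look like the following excerpt (surrounding configuration abridged; verify the exact structure against Source JSON Modification Examples):

```json
{
  "parameters": {
    "NumberOfRefreshThreads": {
      "sensitive": false,
      "value": "4"
    }
  }
}
```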

Increasing the load on the server could lead to throttling and have an adverse effect. Thus, monitor your change to ensure its impact is beneficial.

Increase the Number of Sources

As with the number of threads, splitting a source to increase the number of concurrent sources targeting a system is also a good way to increase performance. This especially applies to sources whose content is constantly changing and that have become too large for refreshes to keep up.

If the content you want to index can be split, having many sources crawling at the same time will increase total performance. However, it also increases the server load and the risk of throttling.

A SharePoint source that sees thousands of new and modified documents every hour might not keep up, so you may decide to split it. However, if your SharePoint source is huge but rarely changes, a split isn’t necessary.

Splitting also lets you configure source schedules with more granularity. For instance, one part of your content may require a higher refresh frequency, while other content almost never changes and can be refreshed less often.

Enabled Options

Some connectors have content options that can impact performance.

If you don’t need certain content types, don’t enable the related options.

Most source configurations offer options such as Retrieve Comments. Enabling these features can have a performance cost, because the connector may have to perform additional API calls to retrieve this additional content.

Differentiating Crawling Operations

Once the source is created, you can update the source with different operations (see Refresh VS Rescan VS Rebuild for more details).

Rebuild and Rescan

Rebuild and rescan operations try to retrieve all of the content requested by the source configuration. A rebuild runs longer than a rescan, and you shouldn’t perform one without good reason.

Refresh

Refresh is the fastest operation; it retrieves only the delta of item changes since the last source operation. We recommend that you schedule this operation at the highest frequency possible when supported by the crawled system. For sources that don’t fully support refresh operations, a weekly scheduled rescan is recommended.

Set Your Expectations

If you’re planning to configure a large source, you have to set realistic expectations. The greater the number of items you want to index, the more time the initial build will take.

Still Too Slow?

If the performance you’re getting is still too slow, consider opening a maintenance case so that the Coveo Cloud team can investigate your source configurations and Coveo On-Premises Crawling Module servers. Be sure to include the actions you took to fine-tune the crawling performance, since the first steps of the investigation are listed in this article.
