About Crawling Speed
Depending on your configuration, the network bandwidth available to the server hosting the connectors or targeted system can limit crawling performance. This is relevant for Crawling Module and on-premises sources. If the connector has to download large amounts of data, the available bandwidth could become a significant performance bottleneck. The same issue can occur when the crawler is physically far from the server hosting the data it’s indexing.
To know how much of the server bandwidth you’re using, on a Windows server, see the Network section in Task Manager (Task Manager > Performance tab > Network section).
For example, the available network bandwidth could be lower when the connector is in New York and the server being indexed is in Los Angeles. To mitigate bandwidth limitations:
- Improve your internet connection.
- Host the connector closer to the data source, or vice versa.
Size of Items
Some repositories contain large items that the connector has to download. For example, if the connector has to retrieve videos, large PDF files, and other large files of several MBs, the connector won’t be able to crawl items efficiently as the majority of time is spent downloading the large files. In such cases, you can expect a much smaller number of processed items per hour.
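As a rough illustration of why large items dominate crawling time, the download cost can be estimated from total content size and available bandwidth. The function name and numbers below are hypothetical examples, not Coveo measurements; the estimate ignores API latency and processing time.

```python
def estimated_crawl_hours(item_sizes_mb, bandwidth_mbps):
    """Rough download-time estimate: total content size divided by bandwidth.

    Sizes are in megabytes; bandwidth is in megabits per second,
    so multiply by 8 to convert. Latency and processing are ignored.
    """
    total_megabits = sum(item_sizes_mb) * 8
    return total_megabits / bandwidth_mbps / 3600

# Hypothetical scenario: 10,000 videos of 50 MB each over a 100 Mbps link.
hours = estimated_crawl_hours([50] * 10_000, 100)  # ≈ 11.1 hours of pure download time
```

Even with ideal server responsiveness, a source dominated by multi-MB files spends most of its time downloading, which is why the processed-items-per-hour figure drops sharply.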
Server Responsiveness and Latency
Connectors crawl live systems that are installed and maintained by a variety of different companies. For instance, two SharePoint 2019 servers installed at two different locations will most likely not offer the same performance.
The following factors can influence server performance:
- Responsiveness: if the server isn't responsive, connector calls take longer to execute, significantly decreasing the crawling speed.
- Load: if a high number of users are using the system while the connector crawls it, system performance can suffer. The server can also be under high load when it contains more data than is recommended for its infrastructure.
- Infrastructure (on-premises servers only): most systems are recommended to be installed on dedicated Windows servers or virtual machines.
The connector performance also heavily relies on the API performance of the systems it crawls, regardless of your server configuration. Certain APIs have limited capacity, and that limitation impacts the number of items per hour the related connectors can process.
There's no direct solution; you can only reduce the other factors affecting crawling performance to lessen the impact of the system API's limitations.
Cloud services, and even on-premises servers, implement throttling to protect their APIs. Throttling means that some connector requests are refused or slowed down. Its effect on connectors is drastic, and there's no way to bypass it.
Exchange Online has strict throttling policies that aren’t configurable, and this is one reason why Coveo doesn’t recommend crawling emails. The amount of email content is large, and keeps increasing, but the crawling performance is poor due to Exchange’s throttling policies.
Some systems have throttling policies that you can modify to allow Coveo connectors to bypass them. Contact Coveo Support for more information.
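When a system throttles requests, it typically answers with an HTTP 429 (or 503) status and may include a standard Retry-After header indicating how long to wait. The helper below is an illustrative sketch of how a crawler can honor that hint; the function name and default delay are assumptions, not Coveo's actual implementation.

```python
def throttle_delay(status_code, headers, default_delay=5.0):
    """Return how long to wait before retrying a throttled request.

    Honors the standard Retry-After header on HTTP 429/503 responses;
    falls back to a default delay when the server gives no hint.
    """
    if status_code not in (429, 503):
        return 0.0  # not a throttling response, no need to wait
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)  # server told us exactly how long to back off
    return default_delay
```

Respecting the server-provided delay is usually faster overall than retrying immediately, because repeated throttled requests can extend the throttling window.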
The Web connector self-throttles to one request per second to follow internet politeness norms. If you want to crawl your own website and don't mind increasing the load, you can unlock the crawling speed.
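The one-request-per-second politeness behavior can be sketched as a minimal rate limiter. The class name is illustrative, and the interval is made configurable here only so the behavior is easy to demonstrate; it doesn't reflect the Web connector's actual code.

```python
import time

class PolitenessThrottle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0):  # one request per second by default
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        """Block until at least min_interval has elapsed since the last call."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `wait()` before each HTTP request caps the crawl at roughly `1 / min_interval` requests per second, which is exactly why an unthrottled crawl of your own site can be much faster.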
Frequency of Errors
Errors are also a source of reduced crawling speeds. When the connector encounters an error, the connector identifies whether it can retry the failed operation. If the connector can retry, it usually waits a few seconds, and then tries the operation again to get all the content requested by the source configuration.
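The wait-then-retry behavior described above is commonly implemented as exponential backoff. The sketch below is a generic illustration of that pattern, assuming a hypothetical `TransientError` class to mark retryable failures; it isn't Coveo's actual retry logic.

```python
import time

class TransientError(Exception):
    """A failure the caller considers safe to retry (illustrative)."""

def with_retries(operation, max_attempts=3, base_delay=2.0):
    """Run an operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # persistent error: surface it to the caller
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Each retry adds waiting time on top of the failed call itself, which is why frequent errors translate directly into a lower items-per-hour rate.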
Depending on the type of error:
- When the errors persist or are internal to Coveo, contact Coveo Support.
- Otherwise, follow the instructions provided with the error codes to fix the errors in your source configuration or on the server side.
The configuration of the source can also have a big impact on the total crawling time as well as the crawling speed. A larger source slows down the overall indexing process, especially if the source doesn't support refresh operations. Even when refreshes are supported, a source whose content is constantly changing can still be slowed down.
Manage Source Update Operations
Increase the Number of Sources
Splitting a source to increase the number of concurrent sources targeting a system is a good way to increase performance. This especially applies to sources whose content changes constantly and has grown too large for refresh operations to keep up.
If the content you want to index can be split, having many sources crawling at the same time will increase total performance. However, it will also increase the server load, throttling, etc.
For example, a SharePoint source that sees thousands of new and modified documents every hour might not be able to keep up; in that case, splitting it into multiple sources can help. However, if your SharePoint source content is large but rarely changes, a split isn't necessary.
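Splitting a source amounts to partitioning the crawlable content into several smaller sources that update concurrently. The round-robin helper below is a simplified illustration of that idea; in practice you'd split along natural boundaries such as SharePoint site collections, and the function name is an assumption.

```python
def split_source(items, n_sources):
    """Partition crawlable items into n roughly equal sources (round-robin)."""
    buckets = [[] for _ in range(n_sources)]
    for i, item in enumerate(items):
        buckets[i % n_sources].append(item)
    return buckets

# Hypothetical example: spread 10 site collections across 3 sources.
sources = split_source([f"site-{i}" for i in range(10)], 3)
```

Each resulting source crawls fewer items per update, so refreshes and rescans finish sooner, at the cost of more concurrent load (and potentially more throttling) on the target system.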
There are also added benefits to scheduling source updates with more granularity. For instance, a portion of your content may require a high refresh frequency, while other content almost never changes and can be refreshed less often.
Some connectors have content options that can impact performance.
If you don’t need certain content types, don’t enable the related options.
Most source configurations allow you to Retrieve Comments (and other similar options). Enabling these features can impact performance because the connector may have to perform additional API calls to retrieve this additional content.
Differentiating Crawling Operations
There are three source update operations you can run or schedule to keep the source content up to date.
Rebuild and Rescan
Rebuild and rescan operations try to retrieve all of the content requested by the source configuration. A rebuild runs longer than a rescan, and you shouldn't perform one without good reason.
Refresh
Refresh is the fastest indexing operation; it retrieves only the delta of item changes since the last source operation. We recommend scheduling refreshes at the highest frequency possible when the crawled system supports them. For sources that don't fully support refresh operations, a scheduled weekly rescan is recommended.
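Conceptually, a refresh filters the repository down to items modified since the previous run, instead of re-fetching everything. The sketch below illustrates that delta computation under simplifying assumptions (a plain mapping of item IDs to modification timestamps); a real connector would query the system's change API instead.

```python
def refresh_delta(last_run_at, repository_items):
    """Return only the items modified since the previous source operation.

    repository_items maps item IDs to last-modified timestamps (illustrative);
    a real connector would ask the crawled system for its change log.
    """
    return {
        item_id: modified
        for item_id, modified in repository_items.items()
        if modified > last_run_at
    }

# Only items changed after timestamp 200 are re-crawled.
delta = refresh_delta(200, {"a": 100, "b": 250, "c": 300})
```

Because the delta is usually a small fraction of the source, a refresh finishes far faster than a rescan, which must enumerate every item.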
Set Your Expectations
If you’re planning to configure a large source, you have to set realistic expectations. The greater the number of items you want to index, the more time the initial build will take.
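A realistic expectation can be set with simple arithmetic: divide the item count by the throughput you observe (or that your connector's documentation suggests). The function name and figures below are hypothetical, for illustration only.

```python
def initial_build_hours(item_count, items_per_hour):
    """Estimate initial build duration from source size and observed throughput."""
    return item_count / items_per_hour

# Hypothetical scenario: 2 million items at 50,000 processed items/hour.
hours = initial_build_hours(2_000_000, 50_000)  # 40 hours for the initial build
```

If the estimate is unacceptable, revisit the levers above: split the source, disable optional content retrieval, or reduce the crawled scope.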
Still Too Slow?
If you feel the performance you're getting is still too slow, consider opening a maintenance case so the Coveo team can investigate your source configurations and Coveo On-Premises Crawling Module servers. Make sure to include the actions you took to fine-tune crawling performance, since the first investigation steps are listed in this article.