Understanding Crawling Performance
Crawling performance is defined by the number of items that a given connector processes per hour. Many factors influence the crawling speed, so a small number of processed items is not necessarily a sign of slow crawling.
(Mostly relevant for Crawling Module and on-premises sources) Depending on your configurations, the network bandwidth available to the server hosting the connectors or targeted system can limit the crawling performance. If the connector is downloading large amounts of data, the bottleneck to performance could be the available bandwidth. The same issue can occur when the crawler is physically far from the server hosting the data it is indexing.
On a Windows server, if you want to know how much of the server bandwidth you are using, look at the Network section in Task Manager (Task Manager > Performance tab > Network section).
The network bandwidth could be lower when the connector is in New York and the server being indexed is in Los Angeles.
Improve your internet connection.
Host the connector closer to the data source or vice versa.
Size of Items
Some repositories contain huge items that the connector has to download. If the connector retrieves videos, huge PDFs, and other files weighing several MBs, it will not crawl items efficiently, because all its time will be spent downloading these files. In such cases, you can expect a much smaller number of processed items per hour.
Evaluate whether all items the connector is downloading are desirable and useful in the index. Consider adjusting content types to filter out these items or index them by reference (see Change Indexed Item Types and Indexing by Reference).
Videos and images are probably not useful in the index unless you have particular use cases.
If you still have available bandwidth, increasing the number of refresh threads will allow more simultaneous downloads (see Number of Refresh Threads).
You can monitor the ongoing total size of downloaded items in the Activity panel of the administration console (see Review Events Related to Specific Coveo Cloud Administration Console Resources).
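To get a feel for how item size caps throughput, the following minimal sketch computes an upper bound on items per hour when download bandwidth is the only limiting factor. The function name and all figures are hypothetical and for illustration only. Real crawls will be slower because of latency, API limits, and processing time.

```python
def max_items_per_hour(bandwidth_mbps: float, avg_item_size_mb: float) -> float:
    """Upper bound on items/hour when download bandwidth is the only limit."""
    if bandwidth_mbps <= 0 or avg_item_size_mb <= 0:
        raise ValueError("bandwidth and item size must be positive")
    megabytes_per_second = bandwidth_mbps / 8  # megabits/s -> megabytes/s
    seconds_per_item = avg_item_size_mb / megabytes_per_second
    return 3600 / seconds_per_item


# Hypothetical figures: a 100 Mbit/s link, small documents vs. large videos.
small_docs = max_items_per_hour(bandwidth_mbps=100, avg_item_size_mb=0.1)
large_videos = max_items_per_hour(bandwidth_mbps=100, avg_item_size_mb=500)
print(f"0.1 MB items: about {small_docs:,.0f} items/hour at most")
print(f"500 MB items: about {large_videos:,.0f} items/hour at most")
```

With these assumed numbers, the same link moves several thousand times fewer items per hour once the average item is a large video, which is why filtering or indexing by reference can matter so much.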
Server Responsiveness and Latency
Connectors are crawling live systems that are installed and maintained by a variety of different companies. Two SharePoint 2019 servers installed at two different locations will most likely not offer the same performance.
The following factors can influence server performance:
If the server is not responsive, connector calls will take much more time to execute, thus significantly decreasing the crawling speed.
If a high number of users are using the system while the connector crawls it, the system performance can be impacted. The server can also be under high load when it contains more data than recommended for its infrastructure.
(Only for on-premises servers) Infrastructure
It is recommended to install most systems on dedicated, complete Windows servers or virtual machines.
- Validate your server infrastructure.
(For Coveo On-Premises Crawling Module servers only) Ensure your server meets the hardware and software requirements (see Requirements).
- Create source schedules to crawl outside of peak hours (see Edit a Source Schedule).
Connector performance also relies heavily on the API performance of the systems it crawls, regardless of your server configuration. Certain APIs have limited capacity, and that limitation impacts the number of items per hour the related connectors can process.
There is no direct solution; the only thing you can do is reduce the other factors affecting the crawling performance to lessen the impact of the system APIs.
Cloud services and even installed servers have implemented throttling to protect their APIs. The effect of throttling on the connectors is drastic and there are no solutions to bypass it. Throttling means that some requests of the connectors are refused or slowed down.
Exchange Online has strict throttling policies that are not configurable, and this is one of the reasons why Coveo does not recommend crawling emails. The amount of email content is huge, and keeps increasing, but the crawling performance is poor due to Exchange throttling policies.
Some systems have throttling policies that you can modify to allow Coveo connectors to bypass them. Contact Coveo Support for more information.
The Web connector self-throttles to one request per second to follow internet politeness norms. If you want to crawl your own web site and do not mind increasing the load, you can unlock the crawling speed (see the Crawling Limit Rate parameter).
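As a rough illustration of what one request per second means in practice, the sketch below estimates the minimum crawl time for a site. It assumes one request per page and ignores download and processing time, so the function name and the page count are illustrative assumptions, not measurements.

```python
from datetime import timedelta


def min_crawl_time(page_count: int, requests_per_second: float = 1.0) -> timedelta:
    """Lower bound on crawl time, assuming one request per page."""
    if page_count < 0 or requests_per_second <= 0:
        raise ValueError("page_count must be >= 0 and requests_per_second > 0")
    return timedelta(seconds=page_count / requests_per_second)


# Hypothetical site of 50,000 pages at the default rate of 1 request/second.
print(min_crawl_time(50_000))  # prints 13:53:20
```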
Frequency of Errors
Errors are also a source of slowness. When the connector hits an error, the connector identifies whether it can retry the failed operation. If the connector can retry, it usually waits a few seconds, and then tries the operation again to get all the content requested by the source configuration. However, because of all the factors mentioned in the Server Performance section as well as limiting system APIs, a high number of errors can lead to a lot of “waiting for retry” time, which ultimately impacts the crawling performance.
Depending on the type of errors:
When the errors persist or are internal to the Coveo Cloud platform: contact Coveo Support.
Otherwise, follow the instructions provided with error codes to fix errors in source configurations or server-side.
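The retry behavior described in this section can be sketched as follows. This is not the connector's actual implementation: the exception class, the attempt limit, and the fixed wait are assumptions chosen to show how "waiting for retry" time accumulates.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, float]:
    """Run operation, retrying on RetryableError; return (result, seconds waited)."""
    waited = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), waited
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            sleep(wait_seconds)  # wait a few seconds before retrying
            waited += wait_seconds
    raise ValueError("max_attempts must be at least 1")


# Simulated operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("temporary failure")
    return "item content"


# A no-op sleep is injected so the example runs instantly.
result, waited = call_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, waited)
```

In this simulation a single item costs four seconds of waiting. Multiplied across thousands of failing calls, that idle time can dominate the crawl.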
The configuration of the source can also have a huge impact on the total crawling time as well as the crawling speed.
Number of Refresh Threads
Most connectors (sources) have a default value of 2 refresh threads. This value prevents the connector from being throttled. However, if you are crawling a system that can take more load (e.g., powerful infrastructure or light load), you can consider increasing this parameter value to increase performance.
Set the NumberOfRefreshThreads hidden parameter in the source JSON configuration, and incrementally increase its value up to 8 threads, which is the value usually providing the most significant performance benefits (see Source JSON Modification Examples).
If you increase the load on the server, this could lead to throttling and have an adverse effect. Monitor the effect of your change to ensure that it actually improves performance.
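As an illustration, the sketch below sets NumberOfRefreshThreads on a source configuration loaded as a Python dictionary. The layout used here (a top-level "parameters" object whose entries hold "sensitive" and "value" fields) is an assumption, as are the helper name and the sample configuration. Check Source JSON Modification Examples for the exact structure before editing a real source.

```python
import json

MAX_REFRESH_THREADS = 8  # upper value suggested in this article


def set_refresh_threads(source_config: dict, threads: int) -> dict:
    """Return a copy of the configuration with NumberOfRefreshThreads set."""
    if not 1 <= threads <= MAX_REFRESH_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_REFRESH_THREADS}")
    # Round-trip through JSON to deep-copy the JSON-compatible configuration.
    updated = json.loads(json.dumps(source_config))
    parameters = updated.setdefault("parameters", {})
    # Assumed parameter layout; verify it against your own source JSON.
    parameters["NumberOfRefreshThreads"] = {"sensitive": False, "value": str(threads)}
    return updated


# Hypothetical, heavily trimmed source configuration.
config = {"name": "Example source", "parameters": {}}
print(json.dumps(set_refresh_threads(config, 4), indent=2))
```

Raising the value one step at a time (for example 2, then 4, then 6) and observing the crawl after each change makes it easier to spot the point where throttling begins.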
Number of Sources
As with the number of threads, increasing the number of concurrent sources targeting a system is another way to increase performance. If the content you want to index can be split into many sources, having many sources crawling at the same time will multiply total performance. However, it will also multiply the server load, throttling, etc.
Splitting content into several sources also lets you configure source schedules with more granularity.
For example, a given part of your content may require a higher refresh frequency, while other content almost never changes and can be refreshed less often.
Some connectors have content options that can impact performance.
If you do not need certain content types, do not enable the related options.
Most source configurations allow you to Retrieve Comments (and other similar options). Enabling these features can have a performance cost, because the connector may have to perform additional API calls to retrieve this additional content.
Differentiating Crawling Operations
Once the source is created, you can update the source with different operations (see Refresh VS Rescan VS Rebuild for more details).
Rebuild and Rescan
Rebuild and rescan are operations that attempt to retrieve all the content requested by the source configuration. The difference is that a rebuild takes significantly more time to execute.
Avoid rebuilding a source without a good reason, since a rebuild takes longer than a rescan, which in turn takes longer than a refresh.
Refresh is the fastest operation, meant to retrieve only the delta of item changes since the last source operation. It is recommended to schedule this operation at the highest frequency when possible and supported by the crawled systems. For sources that do not fully support refresh operations, a scheduled rescan is recommended weekly.
Prepare Your Expectations
If you are planning to configure a big source, it is important to set your expectations. As a rule of thumb, the more items there are to index, the longer the initial build will take.
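Since crawling performance is expressed in items per hour, a first-order estimate of the initial build duration is the item count divided by that rate. The sketch below applies this to hypothetical rates. The numbers are placeholders, not benchmarks for any connector.

```python
def estimated_build_hours(item_count: int, items_per_hour: float) -> float:
    """First-order estimate of build duration at a constant crawl rate."""
    if item_count < 0 or items_per_hour <= 0:
        raise ValueError("item_count must be >= 0 and items_per_hour > 0")
    return item_count / items_per_hour


# Hypothetical source of one million items at three assumed crawl rates.
for rate in (5_000, 20_000, 50_000):
    hours = estimated_build_hours(1_000_000, rate)
    print(f"{rate:>6,} items/hour -> about {hours:.0f} hours")
```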
Still Too Slow?
If you feel the performance you are getting is still too slow, consider opening a maintenance case so the Coveo Cloud team can investigate your source configurations and Coveo On-Premises Crawling Module servers. Be sure to include the actions that you took to fine-tune the crawling performance, since the first steps of the investigation are listed in this article.