Coveo On-Premises Crawling Module

The Coveo On-Premises Crawling Module allows you to crawl on-premises content in order to make it searchable in a Coveo Cloud V2-powered search page (see Supported Content). Typical Crawling Module users are customers who cannot open an inbound port in their firewall for the Coveo cloud-hosted crawlers to access their on-premises content.

If you want to crawl on-premises content and can open a port in your firewall to let a Coveo cloud-hosted crawler access your content, refer to the Coveo Cloud V2 administration console documentation (see Available Coveo Cloud V2 Source Types).

The cloud-hosted Coveo Cloud V2 crawlers pull content from cloud and on-premises secured enterprise systems to make it searchable. The Coveo On-Premises Crawling Module, however, runs outside of the Coveo Cloud V2 environment. It pulls your content from your on-premises systems, and then sends it to a push-type source, which serves as an intermediary to index the data.

Once the Coveo On-Premises Crawling Module is installed on your Windows machine, all communications are outbound, and no inbound ports to your secured enterprise system are required. Nevertheless, a Coveo Cloud V2 source fed by the On-Premises Crawling Module supports the same features as a cloud source (see Sources Page).

Supported Content

The following on-premises content can be crawled using the Coveo On-Premises Crawling Module (see Creating a Crawling Module Source):

Your Coveo Cloud V2 license must include Crawling Module source types (see Requirements).

Workflow

The Coveo On-Premises Crawling Module indexing workflow is as follows:

  1. Maestro authenticates to a Coveo Cloud V2 organization and receives an API key.
  2. Maestro provides the API key to the Crawling Module workers.
  3. The Crawling Module workers periodically poll the Coveo Cloud V2 platform for source update tasks (see Refresh, Rescan, and Rebuild). When an update is due, the next available worker executes it.
  4. The MySQL Coveo Database provides the worker with information regarding the last update operation, such as the source state and the URI of indexed items. As a result, the worker knows what content was crawled during the last crawling operation, and it uses this information to perform refresh operations.
  5. The worker crawls the content and provides the Push API with the changes that have been made since the last update operation. The worker authenticates to the Push API with the API key received from Maestro.
  6. The Push API indexes the received changes so that Coveo Cloud V2 is up to date with your on-premises content.

  • To make deployment easier and increase worker scalability, your workers and local database run inside Docker containers. This ensures that they run smoothly, regardless of your environment configuration.
  • The workflow above does not cover the option to index the permissions associated with your secured content. If you want to index secured content and take access permissions into account, contact Coveo Support.
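
Steps 4 and 5 of the workflow amount to comparing what was crawled last time with what is found now, and sending only the differences. The following is a minimal sketch of that idea, not the Crawling Module's actual implementation. The Change class, the compute_changes function, the per-item fingerprint values, and the sample URIs are all hypothetical names introduced for illustration; the workflow above only states that the local database keeps the source state and the URIs of indexed items.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Change:
    """One change to send to the push-type source (hypothetical shape)."""
    action: str  # "add", "update", or "delete"
    uri: str


def compute_changes(previous: Dict[str, str], current: Dict[str, str]) -> List[Change]:
    """Compare two URI -> fingerprint snapshots and list what changed.

    `previous` stands in for what the local database recorded at the last
    update; `current` is what the worker sees during this crawl. The
    fingerprint (for example a version tag or content hash) is an assumption.
    """
    changes: List[Change] = []
    # Items seen in this crawl: new URIs are added, changed ones updated.
    for uri, fingerprint in sorted(current.items()):
        if uri not in previous:
            changes.append(Change("add", uri))
        elif previous[uri] != fingerprint:
            changes.append(Change("update", uri))
    # Items recorded last time but no longer found must be deleted.
    for uri in sorted(previous.keys() - current.keys()):
        changes.append(Change("delete", uri))
    return changes


# Snapshot stored after the last update (stands in for the local database).
last_update = {"file://share/a.docx": "v1", "file://share/b.pdf": "v1"}
# Snapshot gathered by the crawl in progress.
this_crawl = {"file://share/a.docx": "v2", "file://share/c.txt": "v1"}

for change in compute_changes(last_update, this_crawl):
    print(change.action, change.uri)
```

In this sketch, unchanged items produce no entry at all, which is why a refresh can be much cheaper than re-sending every item.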