Coveo On-Premises Crawling Module

This article applies to the new Crawling Module, which works without Docker. If you still use the Crawling Module with Docker, see Coveo On-Premises Crawling Module (Docker Version) instead. You might also want to read about the advantages of the new Crawling Module.

To identify which Crawling Module you’re currently using, check the Maestro version reported on the Crawling Modules page of the Coveo Cloud Administration Console:

  • Versions > 1: new Crawling Module

  • Versions < 1: Crawling Module with Docker

The Coveo On-Premises Crawling Module allows you to index on-premises content in order to make it searchable in a Coveo-powered search interface. Typical Crawling Module users are customers who can’t open an inbound port in their firewall for the Coveo cloud-hosted connectors to access their on-premises content.

If you can open a port in your firewall to let a cloud-hosted connector access your on-premises content, you don’t need to install the Crawling Module. Instead, you can create On-Premises sources in the Coveo Cloud Administration Console. See the connector documentation for detailed instructions.

The cloud-hosted Coveo Cloud connectors pull content from cloud and on-premises secured enterprise systems to make it searchable. The Coveo On-Premises Crawling Module, however, runs outside of the Coveo Platform. It pulls your content from your on-premises systems, and then sends it to a Coveo Cloud push-type source, which serves as an intermediary to index your data.

Once the Coveo On-Premises Crawling Module is deployed on your Windows server, all communications are outbound, and no inbound ports to your secured enterprise system are required. Nevertheless, you can manage a Coveo Cloud source fed by the Crawling Module just like you would manage a cloud source.

Supported Content

The following on-premises content can be indexed using the Coveo On-Premises Crawling Module:

Your Coveo Cloud license must include Crawling Module connectors. See the license requirements for details.

Components

The Coveo On-Premises Crawling Module has three components:

  • Maestro, software that manages and monitors your local workers and the State Store.

  • One or more workers, which are responsible for executing content update tasks requested by the Coveo Platform. Each worker can only handle one task at a time, so you may need more than one, depending on the content to index. See Number of Workers for details.

  • The State Store, which stores information regarding the last update operations, such as the source state and the URI of indexed items. As a result, the workers know what has been indexed during the last update operation, and therefore what needs to be indexed next time.
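To illustrate why keeping the state of the last update operation matters, here is a minimal sketch of how a worker could compare a fresh crawl with the stored state to decide what needs to be indexed. It is not the Crawling Module's actual implementation: the `StateStore` class, the `plan_update` function, and the idea of storing a fingerprint per URI are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Hypothetical in-memory stand-in for the State Store.

    Maps the URI of each indexed item to a fingerprint (for example, a
    content hash) recorded during the last update operation.
    """
    items: dict[str, str] = field(default_factory=dict)


def plan_update(store: StateStore, crawled: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current crawl with the stored state and list what changed."""
    previous = store.items
    # Items seen now but absent from the last update operation.
    added = sorted(uri for uri in crawled if uri not in previous)
    # Items present in both, whose fingerprint differs.
    modified = sorted(
        uri for uri in crawled if uri in previous and crawled[uri] != previous[uri]
    )
    # Items indexed last time that no longer exist on premises.
    deleted = sorted(uri for uri in previous if uri not in crawled)
    return {"added": added, "modified": modified, "deleted": deleted}


# Illustrative data only: URIs and fingerprints are made up.
store = StateStore(items={"file://a.txt": "v1", "file://b.txt": "v1"})
crawl = {"file://a.txt": "v2", "file://c.txt": "v1"}
changes = plan_update(store, crawl)
print(changes)
```

With this kind of comparison, only the added, modified, and deleted items need to be sent for indexing, rather than the whole content.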

Workflow

The Coveo On-Premises Crawling Module indexing workflow is the following:

Crawling Module Workflow

  1. The Crawling Module workers periodically poll the Coveo Platform for source update tasks. When an update is due, the next available content worker executes it.

  2. The State Store provides the worker with information regarding the last update operation. The content worker uses this information to execute the update task.

  3. The worker crawls the content and updates the State Store information.

  4. The worker provides the Push API with the changes that have been made since the last update operation. The worker authenticates with the API key Maestro received at the end of its installation process.

  5. The Push API indexes the received changes so that the content in your Coveo-powered search interface reflects your actual on-premises data. See Coveo Cloud Indexing Pipeline for details on the indexing process.

  6. Should you make changes to the Crawling Module configuration, Maestro applies them to the workers and the State Store.

If you have a source that indexes permissions, your security worker follows a similar workflow when indexing permissions.
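Steps 1 to 4 of the workflow above can be simulated with a short sketch. Everything in it is hypothetical and exists only to show the order of operations: `FakePlatform`, `FakePushClient`, `run_worker_once`, the task format, and the API key value are invented for illustration, and no real Coveo API is called.

```python
from collections import deque


class FakePlatform:
    """Hypothetical stand-in for the Coveo Platform task queue."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)

    def poll(self):
        # Step 1: return the next due source update task, or None if idle.
        return self._tasks.popleft() if self._tasks else None


class FakePushClient:
    """Hypothetical stand-in for a Push API client; it only records calls."""

    def __init__(self, api_key):
        self.api_key = api_key  # assumed to be the key Maestro received
        self.pushed = []

    def push(self, source_id, upserts, deletes):
        self.pushed.append((source_id, upserts, deletes))


def run_worker_once(platform, state_store, crawl, push_client):
    """Run one polling cycle; return True if a task was executed."""
    task = platform.poll()
    if task is None:
        return False
    source_id = task["source_id"]
    # Step 2: read what was indexed during the last update operation.
    previous = state_store.get(source_id, {})
    # Step 3: crawl the content, then record the new state.
    current = crawl(source_id)
    upserts = sorted(uri for uri, mark in current.items() if previous.get(uri) != mark)
    deletes = sorted(uri for uri in previous if uri not in current)
    state_store[source_id] = dict(current)
    # Step 4: send only the changes to the push-type source.
    push_client.push(source_id, upserts, deletes)
    return True


# Illustrative data only: source name, URIs, and fingerprints are made up.
platform = FakePlatform([{"source_id": "files"}])
state = {"files": {"file://a.txt": "v1", "file://b.txt": "v1"}}
client = FakePushClient(api_key="hypothetical-key")


def crawl_files(source_id):
    return {"file://a.txt": "v2", "file://c.txt": "v1"}


ran = run_worker_once(platform, state, crawl_files, client)
print(ran, client.pushed)
```

A second call to `run_worker_once` would return `False` here, because the simulated platform has no more tasks; a real worker would simply keep polling.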

What’s Next?

Deploy the Crawling Module.
