Coveo Crawling Module

The Coveo Crawling Module lets you index on-premises content to make it searchable in a Coveo-powered search interface. It's typically used by customers who can't open an inbound firewall port for the Coveo cloud-hosted connectors to access their on-premises content.

Note

If you can open a port in your firewall to let a cloud-hosted connector access your on-premises content, you don’t need to install the Crawling Module. Instead, you can create on-premises sources in the Coveo Administration Console. Similarly, to index cloud content, use the dedicated source for that system. See the connector documentation for detailed instructions.

The cloud-hosted Coveo connectors pull content from cloud and on-premises secured enterprise systems to make it searchable. The Coveo Crawling Module, however, runs outside of Coveo. It pulls your content from your on-premises systems, and then sends it to a Coveo push-type source, which serves as an intermediary for indexing your data.

Once the Coveo Crawling Module is deployed on a Windows server (either on your premises or running in Azure), all communications are outbound, and no inbound ports to your secured enterprise system are required. Nevertheless, you can manage a Coveo source fed by the Crawling Module just like you would manage a cloud source.

Supported content

When your Coveo license allows it, the Crawling Module can index the following content:

Components

The Coveo Crawling Module has three components:

  • Maestro, software that manages and monitors your local workers and the State Store.

  • One or more workers, which are responsible for executing content update tasks requested by Coveo. Each worker can only handle one task at a time, so you may need more than one, depending on the content to index. See Number of workers for details.

  • The State Store, which stores information regarding the last update operations, such as the source state and the URI of indexed items. As a result, the workers know what has been indexed during the last update operation, and therefore what needs to be indexed next time.
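The State Store’s role can be illustrated with a minimal sketch (the function and variable names here are hypothetical, not actual Crawling Module internals): by comparing the URIs recorded during the last update with the URIs found by the current crawl, a worker can determine which items to push and which to delete.

```python
def compute_delta(previously_indexed, currently_crawled):
    """Compare the URIs stored after the last update (State Store) with
    the URIs found by the current crawl to decide what to send to Coveo.

    Simplified: a real update would also re-push items whose content
    changed, for example based on modification timestamps.
    """
    to_add = currently_crawled - previously_indexed   # items new since last update
    to_delete = previously_indexed - currently_crawled  # items removed since last update
    return to_add, to_delete

# Example: one item was added and one was removed since the last update.
last_update = {"file://share/a.docx", "file://share/b.pdf"}
current_crawl = {"file://share/a.docx", "file://share/c.txt"}
added, deleted = compute_delta(last_update, current_crawl)
```

This is why the workers only need the State Store, not a full re-crawl comparison against the index, to know what work remains.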

Workflow

The Coveo Crawling Module indexing workflow is the following:

Crawling Module Workflow
  1. The Crawling Module workers periodically poll Coveo for source update tasks. When an update is due, it is assigned to the next available content worker.

  2. The State Store provides the worker with information regarding the last source update operation.

  3. Based on the information provided by the State Store, the worker crawls the source content.

  4. The worker provides the State Store with information on the task it just completed.

  5. If applicable, the worker applies a pre-push extension to the crawled content.

  6. The worker provides the Push API with the items that have changed since the last source update operation. The worker authenticates with the API key Maestro received at the end of its installation process.

  7. The Push API indexes the received items so that the content in your Coveo-powered search interface reflects your actual on-premises data. See Coveo indexing pipeline for details on the indexing process.

  8. Should you make changes to the Crawling Module configuration, Maestro applies them to the workers and the State Store.
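As an illustration of step 6, the sketch below builds (without sending) the kind of request a worker could make to the Coveo Push API to add or update one item. The endpoint shape follows the public Push API (`PUT …/push/v1/organizations/{orgId}/sources/{sourceId}/documents`), but the organization ID, source ID, API key, and document values are placeholders for illustration only.

```python
import json
import urllib.parse
import urllib.request

def build_push_request(org_id, source_id, api_key, document):
    """Build (without sending) a Push API request that adds or updates
    one item in a push-type source. The documentId query parameter
    identifies the item; the API key authenticates the call."""
    query = urllib.parse.urlencode({"documentId": document["documentId"]})
    url = (f"https://api.cloud.coveo.com/push/v1/organizations/{org_id}"
           f"/sources/{source_id}/documents?{query}")
    body = json.dumps({"title": document["title"]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            # The worker authenticates with the API key Maestro received
            # at the end of its installation process.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Placeholder identifiers for illustration only.
req = build_push_request(
    "my-org-id", "my-source-id", "xx-api-key",
    {"documentId": "file://share/c.txt", "title": "C"},
)
```

Note that this call, like all Crawling Module traffic, is outbound HTTPS from your server to Coveo; nothing connects inbound to your enterprise system.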

If you have a source that indexes permissions, your security worker follows a similar workflow to index them.

What’s next?