Coveo On-Premises Crawling Module

The Coveo On-Premises Crawling Module lets you crawl on-premises content to make it searchable in a Coveo Cloud V2 powered search page (see Supported Content). It is typically used by customers who cannot open an inbound port in their firewall to let the Coveo cloud-hosted crawlers access their on-premises content.

If you want to crawl on-premises content and can open a port in your firewall to let a Coveo cloud-hosted crawler access your content, refer to the Coveo Cloud administration console documentation (see Available Coveo Cloud V2 Connectors).

The cloud-hosted Coveo Cloud V2 crawlers pull content from cloud and on-premises secured enterprise systems to make it searchable. The Coveo On-Premises Crawling Module, however, runs outside of the Coveo Cloud V2 environment. It pulls your content from your on-premises systems, and then sends it to a push-type source, which serves as an intermediary to index the data.

Once the Coveo On-Premises Crawling Module is deployed on your Windows machine, all communications are outbound, and no inbound ports to your secured enterprise system are required (see Deployment Overview). A Coveo Cloud V2 source fed by the On-Premises Crawling Module still supports the same features as a cloud source (see Sources Page).

On January 1st, 2020, Python 2 will be deprecated. This means that you must port the Python 2 pre-push extension scripts you use with the Crawling Module to Python 3 by the end of 2019. For further information, see Python 2 End-Of-Life.

Supported Content

The following on-premises content can be crawled using the Coveo On-Premises Crawling Module (see Creating a Crawling Module Source):

Your Coveo Cloud V2 license must include Crawling Module connectors (see Requirements).

Components

The Coveo On-Premises Crawling Module consists of three components:

  • Maestro, software that manages your local workers and database. Maestro also acts as a bridge between the workers and database on one side and the Coveo Cloud platform on the other.
  • The MySQL Coveo Database, which stores information regarding the last update operations, such as the source state and the URI of indexed items. As a result, the workers know what content was crawled during the last crawling operation, and therefore what needs to be crawled during the next update operation.
  • One or more workers, which are responsible for executing content update tasks requested by the Coveo Cloud platform (see Refresh, Rescan, and Rebuild). Each worker can only handle one task at a time, so you may need more than one, depending on your content (see Number of Workers).
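
To illustrate why the database keeps the state of the last update operation, here is a minimal Python sketch of how a worker could work out what changed between two crawls. It is not the Crawling Module's actual implementation: the `compute_delta` function, the `Delta` class, and the idea of storing a fingerprint next to each URI are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Delta:
    """Hypothetical container for the changes found since the last crawl."""
    added: List[str] = field(default_factory=list)
    updated: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)


def compute_delta(previous: Dict[str, str], current: Dict[str, str]) -> Delta:
    """Compare two URI -> fingerprint maps.

    `previous` stands in for what the local database recorded during the
    last update operation; `current` is what the worker sees now.
    The fingerprint (for example a hash or a modified date) is an assumption.
    """
    delta = Delta()
    for uri in sorted(current):
        if uri not in previous:
            delta.added.append(uri)          # new item, never indexed
        elif previous[uri] != current[uri]:
            delta.updated.append(uri)        # item changed since last crawl
    # Items recorded last time that no longer exist must be removed.
    delta.deleted = sorted(uri for uri in previous if uri not in current)
    return delta
```

With this approach, only the items listed in the delta need to be sent for indexing, rather than the whole content set.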

Workflow

The Coveo On-Premises Crawling Module indexing workflow is the following:

  1. Maestro authenticates to a Coveo Cloud organization and receives an API key.
  2. Maestro provides the API key to the Crawling Module workers.
  3. The Crawling Module workers periodically poll the Coveo Cloud platform for source update tasks (see Refresh, Rescan, and Rebuild). When an update is due, the next available worker will execute it (see Number of Workers).
  4. The MySQL Coveo Database provides the worker with information regarding the last update operation. The worker uses this information to execute the update task.
  5. The worker crawls the content and provides the Push API with the changes that have been made since the last update operation. The worker authenticates with the API key received from Maestro.
  6. The Push API indexes the received changes so that the content in your Coveo Cloud powered search page reflects your actual on-premises data.
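
Steps 2 and 3 can be pictured with a small in-memory Python model. This is only a sketch under stated assumptions: the `Worker` class, the `dispatch` function, and the task labels are hypothetical, and no real Coveo Cloud API is called. It shows the one-task-per-worker rule, which is why pending update tasks wait when every worker is busy.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class Worker:
    """Hypothetical worker holding the API key it received from Maestro."""
    name: str
    api_key: str
    current_task: Optional[str] = None

    @property
    def is_idle(self) -> bool:
        return self.current_task is None


def dispatch(pending: Deque[str], workers: List[Worker]) -> Dict[str, str]:
    """Give each pending task to the next idle worker.

    A worker handles a single task at a time, so tasks stay in the
    queue when no worker is idle. Returns the assignments made.
    """
    assignments: Dict[str, str] = {}
    for worker in workers:
        if not pending:
            break
        if worker.is_idle:
            worker.current_task = pending.popleft()
            assignments[worker.name] = worker.current_task
    return assignments
```

For example, with two workers and three pending tasks, the first call assigns two tasks and leaves the third in the queue until a worker finishes and becomes idle again.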

Figure: Crawling Module Workflow

  • To make deployment easier and increase worker scalability, your workers and local database run inside Docker containers, so they behave consistently regardless of your environment configuration.
  • The workflow above does not take into account the option you have to index the permissions corresponding to your secured content. If you want to index secured content and take access permissions into account, contact Coveo Support.

What’s Next?

To deploy the Crawling Module, see Coveo On-Premises Crawling Module Deployment Overview.