Custom connector planning guide

If none of Coveo’s options fit your needs, you can build your own connector, that is, write a crawler that retrieves your content and provides it to Coveo for indexing. To do so, you must send your content items to a Push source using the Push API.

This article is a high-level guide for developers who want to build their own custom connector. Read this article thoroughly before you start writing your crawler code, as it will help you make important decisions that can heavily impact the scope of your project.

Project planning

A crawler is a program that automatically discovers content items in a repository. For instance, Coveo’s native connectors as well as the Crawling Module all rely on a crawler to discover your data.

Project guidelines:

Determine which capabilities you want your crawler to have. This will help you determine the scope of your project.
Explore the APIs or SDKs of the repository you want to index and validate that it can support the desired crawler features.
Create a test organization and a Push source, and then test the Push API requests. Use the resources at your disposal to streamline your testing and development process.
Start writing your crawler code. If applicable, begin with the authentication capability. See the Typical workflow of a crawler section below for details.
Write the other desired capabilities and features.

Technology

To choose the technology with which you will conduct your development project, you must understand how the repository you want to crawl works, and you must have established the capabilities you want to give your crawler.

First, consider the availability of libraries or SDKs in the programming language of your choice. For instance, Microsoft has some .NET SDKs for Microsoft Dynamics CRM, so it might be a good idea to use C# and .NET.

You can interact with the Push API directly or via the Coveo C# Platform SDK.

Many APIs are RESTful, so any language that lets you easily perform HTTP requests should be a good choice.

In addition, consider the availability of libraries for the operations you want to perform, that is, the capabilities of your crawler. For instance, to create a crawler that manipulates JSON, the Newtonsoft Json.NET library should be helpful.

Capabilities

Only implement the capabilities that are relevant to your use case for a custom crawler. By keeping your project as simple as possible, you will make it easier to develop, use, and maintain.

Typical projects implement the following capabilities:

Capability	Reason
Build	Essential for the crawler to work properly.
Configurability	Allows you to adapt your crawler as your repository settings change.
Authentication	Required if the content to index is secured in its original repository.
Item differentiation	The capability to recognize different types of items is typically useful and fairly easy to implement.
Metadata retrieval	If you plan on making your content searchable through a search interface using facets, metadata is crucial. Since keywords can also match metadata, retrieving it helps your content appear in the end users' search results.

A comprehensive list of typical crawler capabilities is provided in the following sections.

Build/rebuild

A crawler that can conduct a build or rebuild operation can obtain the expected content from the repository and provide it to Coveo. This is the only capability that all crawlers should have, as it is essential for the crawler to work properly.

For further information on the build operation, see Rebuild.

Configurability

It is often useful to have a configurable crawler, as it allows adapting to the settings of the repository without editing and testing the crawler code again.

Take a look at the options available in the source creation panels in the Coveo Administration Console to get an overview of the configuration capabilities Coveo offers.

Implementing some crawler options may require considerable work. Therefore, you could opt to start with a non-configurable crawler, and then add configuration options in a subsequent iteration of your project. However, if you choose to do so, group the hard-coded configuration values in one place in the original crawler so that they're easy to find and turn into options later.

Leading practice

Program your crawler so that you can securely enter crawling credentials. Do not hardcode your passwords.
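As a minimal sketch of the two points above, the following Python example groups every setting in one object and reads the credentials from environment variables at run time. The class name, field names, and variable names (CRAWLER_REPOSITORY_URL and so on) are hypothetical and chosen for illustration only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    """All tunable values in one place (hypothetical field names)."""
    repository_url: str
    username: str
    password: str
    page_size: int = 100


def load_config(env=os.environ) -> CrawlerConfig:
    """Build the configuration from environment variables.

    The password is read at run time, so it never appears in the code.
    """
    return CrawlerConfig(
        repository_url=env["CRAWLER_REPOSITORY_URL"],
        username=env["CRAWLER_USERNAME"],
        password=env["CRAWLER_PASSWORD"],
        page_size=int(env.get("CRAWLER_PAGE_SIZE", "100")),
    )
```

Because the rest of the crawler only sees a CrawlerConfig object, you could later replace the environment variables with a configuration file or a secret store without touching the crawling logic.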

Authentication

Depending on the content to index, the crawler may need to authenticate to access it. Some repositories may offer more than one authentication option for this crucial step. Determining how to connect early in your project will help you configure your crawler down the line.

Item differentiation

Selecting the item types you want to crawl and possibly making this selection configurable is another important decision to make. Create a crawler that retrieves only the content that will be relevant to your search interface end users. Not only does this make the crawler easier to develop, but it also makes its work more efficient.

For instance, if you crawl the content of a blog, you could choose to index posts only and to ignore comments.
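The blog example could be sketched as follows. This assumes each crawled item is a dictionary with a "type" key, which is a hypothetical shape; your repository will have its own way of identifying item types.

```python
from typing import Iterable, Iterator

# Hypothetical setting: the item types this crawler should index.
INCLUDED_TYPES = {"post"}


def select_items(items: Iterable[dict], included_types=INCLUDED_TYPES) -> Iterator[dict]:
    """Yield only the items whose type is in the included set."""
    for item in items:
        if item.get("type") in included_types:
            yield item
```

Passing the set of types as a parameter is one way to make the selection configurable later.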

Metadata retrieval

Retrieving metadata is likely to be a crucial task of your crawler. With Coveo, the crawled items are meant to become searchable, and search results can be filtered based on their metadata. It’s therefore important to index this information properly.

If you choose to retrieve metadata, you’ll also have to decide whether you want to retrieve everything or only specific pieces of information. The latter is more efficient, but retrieving all the metadata might make it easier to handle the retrieved items further down the line. In other words, if you don’t index all the available metadata, you might have fewer options when writing indexing pipeline extensions, for instance, and you might have to change the crawler code later.
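The choice between retrieving everything and retrieving specific pieces can be kept open with a single optional parameter. The sketch below assumes that the raw item is a flat dictionary of metadata, and the function name is hypothetical.

```python
from typing import Optional


def extract_metadata(raw_item: dict, wanted: Optional[set] = None) -> dict:
    """Return all metadata, or only the wanted keys when a set is given."""
    if wanted is None:
        return dict(raw_item)
    return {key: value for key, value in raw_item.items() if key in wanted}
```

For instance, extract_metadata(item, {"title", "author"}) keeps two fields, while extract_metadata(item) keeps them all.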

Permissions

Coveo can secure many content sources by replicating the repository permission system. As a result, through a Coveo-powered search interface, authenticated users only see the items that they’re allowed to access within the indexed repository.

Retrieving permissions and supporting secured search can be complex. If you choose to do so, determine in which format the crawler will retrieve permissions and how they will be used down the line.

For more information on how Coveo handles permissions, see Coveo management of security identities and item permissions.

Error handling

When programming your crawler, you must also determine how robust it should be, that is, how many of the exceptions that can occur while crawling it should handle.

If you intend to develop a simple crawler, a crawler for a very specific use case, or a crawler for a one-time use, it could make sense to handle fewer exceptions. However, to distribute a connector that can connect to multiple configurations of a system, you must plan time and tests to discover the exceptions that are likely to occur and adjust your crawler accordingly.
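One possible way to make robustness adjustable is a flag that decides whether an error stops the crawl or only skips the failing item. The sketch below is an illustration under that assumption; crawl_all and its strict parameter are hypothetical names.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("crawler")


def crawl_all(items: Iterable, process: Callable, strict: bool = False) -> list:
    """Process each item and return the items that failed.

    In strict mode the first error stops the crawl; otherwise the error is
    logged and the crawl continues with the next item.
    """
    failed = []
    for item in items:
        try:
            process(item)
        except Exception:
            if strict:
                raise
            logger.exception("Skipping item %r", item)
            failed.append(item)
    return failed
```

A one-time crawler might always run in strict mode, while a distributed connector would more likely skip, log, and report the failed items.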

State keeping

A few crawler capabilities depend on the capability to keep a state of the crawled items, that is, information regarding the last update operation. Upon the next update operation, the crawler uses this information to execute the task.

State keeping enables refresh operations and pausing and resuming the update process, among other things.

Refresh

A refresh operation only updates a fraction of the indexed content. Typically, the items targeted by a refresh are the items that have been added, modified, or deleted since the last update operation. The state kept after the last operation is compared to the most recent content, and the crawler ignores all items that haven’t changed since the last operation.

Since not all items are crawled, a refresh is often much faster than a rebuild operation. If your data is frequently updated, you should consider running refresh operations between rebuild operations to keep your index up to date. As a result, the content returned in your Coveo-powered search interface will be closer to your actual repository content.

However, to implement the refresh capability, ensure that the API of the crawled repository offers requests that support it and that the state you keep allows the crawler to leverage these requests. For instance, you could keep a timestamp of the last update operation if the API requests support using date filters to obtain new content.
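The timestamp approach could look like the sketch below. It assumes the state is stored in a local JSON file, and fetch_modified_since is a hypothetical callable standing in for a repository API request that accepts a date filter.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_last_run(state_file: Path):
    """Return the timestamp of the last update, or None on the first run."""
    if not state_file.exists():
        return None
    return datetime.fromisoformat(json.loads(state_file.read_text())["last_run"])


def save_last_run(state_file: Path, when: datetime) -> None:
    """Persist the timestamp of the update operation that just ran."""
    state_file.write_text(json.dumps({"last_run": when.isoformat()}))


def refresh(state_file: Path, fetch_modified_since, now=None) -> list:
    """Fetch only the items changed since the last run, then record this run.

    `fetch_modified_since` is hypothetical: it receives the last timestamp
    (None meaning "everything") and returns the matching items.
    """
    started = now or datetime.now(timezone.utc)
    items = list(fetch_modified_since(load_last_run(state_file)))
    # Store the start time so changes made during the crawl are seen next run.
    save_last_run(state_file, started)
    return items
```

Recording the start time rather than the end time means that an item modified while the refresh is running is picked up again next time, at the cost of occasionally crawling an item twice.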

For further information on the refresh operation, see Refresh.

Pause/resume

A crawler that has the capability to pause its work can pick up where it left off should an issue arise. Implementing the pause/resume capability therefore requires keeping a state.

The capability to pause a crawling task can be beneficial if you have a large number of items to crawl or if the crawler is running in a fragile environment. If your rebuild operation takes longer than a day, you might want your crawler to be able to resume its work after an unexpected server reboot, for instance, rather than to restart from the beginning.
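A simple form of pause/resume is a checkpoint written after each item. The sketch below assumes that the list of items to crawl is stable between runs and that a local JSON file is an acceptable place for the state; the function name is hypothetical.

```python
import json
from pathlib import Path


def crawl_with_checkpoint(item_ids: list, process, checkpoint: Path) -> None:
    """Process items in order, recording progress after each one.

    If the process stops, calling this again with the same list skips the
    items that were already done.
    """
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["done"]
    for index in range(done, len(item_ids)):
        process(item_ids[index])
        checkpoint.write_text(json.dumps({"done": index + 1}))
    # The crawl completed, so the next run starts from the beginning.
    checkpoint.unlink(missing_ok=True)
```

Writing the checkpoint after every item keeps the sketch short; a real crawler might write it every few hundred items to reduce disk access.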

Item deletion

Discovering content that has been deleted and updating the index accordingly can be a useful capability of your crawler project, especially if it’s done as part of a refresh operation.

However, most APIs don’t provide a list of deleted items, so you must derive it from other data, which can be complex. Keeping a state can help: the crawler tracks the items that it has crawled and lists those that are now gone, but this process may require a lot of computing.
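When the state is the set of item identifiers seen during the previous operation, deleted items can be derived with a set difference. This is a sketch under that assumption; detect_changes is a hypothetical name.

```python
def detect_changes(previous_ids, current_ids) -> dict:
    """Compare two crawls and return the identifiers that appeared or vanished."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "deleted": previous - current,  # indexed before, gone from the repository
        "added": current - previous,    # new since the previous crawl
    }
```

Note that building current_ids requires listing every item in the repository, which is where most of the computing cost comes from.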

Typical workflow of a crawler

The typical workflow of a crawler can be broken down into five steps:

Configuration retrieval
Authentication
Iteration
Content structure
Sending the content

Step 1: configuration retrieval

First, the crawler must obtain and validate the configuration data. It will likely need user credentials to authenticate to your content repository. Implement a mechanism that lets the user know if the configuration is invalid or incomplete. It is best to have the operation fail immediately rather than to crawl and fail later due to an invalid configuration.
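Failing immediately could be sketched as a validation function that runs before any request is sent and reports every problem at once. The required keys and the URL check below are hypothetical examples of rules, not a complete list.

```python
class ConfigurationError(Exception):
    """Raised before any crawling starts when the configuration is unusable."""


# Hypothetical required settings for this sketch.
REQUIRED_KEYS = ("repository_url", "username", "password")


def validate_config(config: dict) -> dict:
    """Return the config unchanged, or raise an error listing every problem."""
    problems = [
        f"missing value for '{key}'" for key in REQUIRED_KEYS if not config.get(key)
    ]
    url = config.get("repository_url", "")
    if url and not url.startswith(("http://", "https://")):
        problems.append("'repository_url' must start with http:// or https://")
    if problems:
        raise ConfigurationError("; ".join(problems))
    return config
```

Listing all problems in one message saves the user from fixing the configuration one error at a time.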

Step 2: authentication

Once the crawler has validated the configuration information, it should authenticate and connect to the content repository you want to crawl. Implementing the authentication step may be complex if there are multiple scenarios to handle, that is, multiple ways to log in to your repository. As a result, testing how the crawler authenticates early in the project is a good idea.

Step 3: iteration

Once the crawler has access to the content to index, it can operate iteratively, going through each item container, one after the other.
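The iteration can be expressed as a generator that walks containers one after the other. In this sketch, list_containers and list_items are hypothetical callables standing in for the repository API requests that enumerate containers and their items.

```python
from typing import Callable, Iterable, Iterator


def iter_items(
    list_containers: Callable[[], Iterable],
    list_items: Callable[[object], Iterable],
) -> Iterator[tuple]:
    """Yield (container, item) pairs, one container after the other."""
    for container in list_containers():
        for item in list_items(container):
            yield container, item
```

Because items are yielded one at a time, the crawler can start structuring and sending content before the whole repository has been listed.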

Step 4: content structure

You should also structure the content to index. For example, you might want to prepare the data to send to Coveo’s Push API. You might also need to aggregate metadata from the item’s parent container.
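Aggregating parent metadata could look like the sketch below. All field names (container_name, document_id, body, and so on) are illustrative assumptions rather than the names the Push API expects; map them to the format of your target.

```python
def build_document(item: dict, container: dict) -> dict:
    """Merge container metadata into an item and shape it for sending.

    Field names are illustrative only.
    """
    document = {
        # Values inherited from the parent container.
        "container_name": container.get("name"),
        "language": container.get("language"),
    }
    # Item values override inherited ones when both define the same key.
    document.update(item.get("metadata", {}))
    document["document_id"] = item["url"]
    document["body"] = item.get("body", "")
    return document
```

Here an item that defines its own language keeps it, while an item without one inherits the language of its container.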

Step 5: sending the content

The last task that your connector has to accomplish is to provide Coveo with the content to index. This can be done through Coveo’s Push API, a folder on the computer hosting your crawler, etc.
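As an illustration of the folder option, the sketch below writes each document as a JSON file. The FolderSender class and its send method are hypothetical; a sender that calls the Push API could expose the same method so that the rest of the crawler doesn't depend on the destination.

```python
import hashlib
import json
from pathlib import Path


class FolderSender:
    """Writes each document as a JSON file in a local folder."""

    def __init__(self, folder: Path):
        self.folder = folder
        self.folder.mkdir(parents=True, exist_ok=True)

    def send(self, document_id: str, document: dict) -> Path:
        # Hash the identifier so that any URL yields a safe, stable file name.
        name = hashlib.sha256(document_id.encode("utf-8")).hexdigest() + ".json"
        path = self.folder / name
        path.write_text(json.dumps(document), encoding="utf-8")
        return path
```

Sending the same document identifier twice overwrites the previous file, so a rebuild doesn't leave duplicates behind.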