Connector Building Best Practices
This article is a high-level guide for developers who want to build their own custom connector. We recommend thoroughly reading this article before starting to write your crawler code, as it will help you make important decisions that can heavily impact the scope of your project.
A crawler is a program that automatically discovers content items in a repository. For instance, Coveo’s native connectors as well as the Crawling Module all rely on a crawler to discover your data. The Easy File Pusher is the simplest example of a crawler.
1. Determine what capabilities you want your crawler to have. This will help you determine the scope of your project.
2. Explore the APIs or SDKs of the repository you want to index and validate that they can support the desired crawler features.
3. Write down the other desired capabilities and features.
To choose the technology with which you will conduct your development project, you must understand how the repository you want to crawl works, and you must have established the capabilities you want to give your crawler.
First, consider the availability of libraries or SDKs in the programming language of your choice. For instance, Microsoft provides .NET SDKs for Microsoft Dynamics CRM, so C# and .NET might be a good choice for that repository. Many APIs are RESTful, so any language that lets you easily perform HTTP requests should work well.
In addition, consider the availability of libraries for the operations you want to perform, i.e., the capabilities of your crawler. For instance, if you want to create a crawler that manipulates JSON, the Newtonsoft Json.NET library should be helpful.
The following are typical capabilities of a crawler. We recommend that you only implement the capabilities that are relevant to your use case for a custom crawler. By keeping your project as simple as possible, you will make it easier to develop, use, and maintain.
The Easy File Pusher has the following capabilities:
Build: it can retrieve all of the specified files.
Configurability: you can specify the files to send and the target Push source.
Item deletion: if the Easy File Pusher never deleted old content, unwanted documents would accumulate forever, cluttering your index. However, the C# Platform SDK lets you leverage a mechanism available in the Push API that removes old items without requiring any state keeping.
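As a sketch of that state-free aging mechanism (in Python rather than C#, with an illustrative endpoint path and base URL that may not match the actual Push API exactly), each pushed item carries an ordering ID, and a single call can then delete everything older than a given ordering ID:

```python
import time

# Hypothetical base URL, for illustration only.
PUSH_API = "https://api.cloud.coveo.com/push/v1"

def ordering_id() -> int:
    """An ordering ID is typically the current epoch time in milliseconds."""
    return int(time.time() * 1000)

def delete_older_than_url(org_id: str, source_id: str, oid: int) -> str:
    # Items pushed with an ordering ID lower than `oid` are aged out,
    # so the crawler never has to remember which items it pushed before.
    return (f"{PUSH_API}/organizations/{org_id}/sources/{source_id}"
            f"/documents/olderthan?orderingId={oid}")

url = delete_older_than_url("myorg", "mysource", 1700000000000)
```

The crawler records the ordering ID at the start of a run, pushes everything with it, then issues one "delete older than" request at the end.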
Typically, we recommend implementing the following capabilities:
|Capability|Description|
|---|---|
|Build|Essential for the crawler to work properly.|
|Configurability|Making your crawler configurable allows you to adapt it as your repository settings change.|
|Authentication|Required if the content to index is secured in its original repository.|
|Item differentiation|The capability to recognize different types of items is typically useful and fairly easy to implement.|
|Metadata retrieval|If you plan on making your content searchable through a search interface using facets, metadata is crucial. Since keywords can also match metadata, retrieving it ensures that your content appears in the end users' search results.|
A crawler that can conduct a build or rebuild operation can obtain the expected content from the repository and provide it to Coveo. This is the only capability that all crawlers should have, as it is essential for the crawler to work properly.
For further information on the build operation, see Rebuild.
It is often useful to have a configurable crawler, as it allows adapting to the settings of the repository without editing and testing the crawler code again.
Implementing some crawler options may require considerable work. Therefore, you could opt to start by making a non-configurable crawler, and then add configuration options in a subsequent iteration of your project. However, if you choose to do so, you must ensure that the hard-coded configuration values in the original crawler are appropriately grouped, so that they're easy to expose as options later.
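One common way to group those values is a single configuration object; here is a minimal sketch (field names and defaults are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    """Groups the values that would otherwise be hard-coded throughout
    the crawler, so they can be promoted to real configuration options
    in a later iteration without hunting through the code."""
    repository_url: str = "https://example.com/api"  # hypothetical repository
    page_size: int = 100            # items fetched per request
    item_types: tuple = ("post",)   # item types to index
    request_timeout: float = 30.0   # seconds

config = CrawlerConfig()
```

Even while everything is still hard-coded, the rest of the crawler reads only from `config`, so switching to a configuration file or command-line flags later touches a single place.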
Program your crawler so that you can securely enter crawling credentials. Do not hardcode your passwords.
Depending on the content to index, the crawler may need to authenticate to access it. Some repositories may offer more than one authentication option for this crucial step. Determining how to connect early in your project will help you configure your crawler down the line.
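A simple way to avoid hard-coded passwords is to read credentials from the environment; this is a minimal sketch, and the variable names are only an example:

```python
import os

def load_credentials():
    """Reads crawling credentials from environment variables instead of
    the source code. Raises immediately when they're missing, so the
    crawler fails fast rather than partway through a crawl."""
    user = os.environ.get("CRAWLER_USERNAME")
    secret = os.environ.get("CRAWLER_API_KEY")
    if not user or not secret:
        raise RuntimeError("Missing CRAWLER_USERNAME or CRAWLER_API_KEY")
    return user, secret

# Demo values only; in practice these come from the deployment environment.
os.environ.setdefault("CRAWLER_USERNAME", "alice")
os.environ.setdefault("CRAWLER_API_KEY", "s3cret")
user, secret = load_credentials()
```

A secrets manager or an encrypted configuration store works just as well; the point is that credentials never live in the crawler code itself.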
Selecting the item types you want to crawl and possibly making this selection configurable is another important decision to make. We recommend creating a crawler that retrieves only the content that will be relevant to your search interface end users. Not only does this make the crawler easier to develop, but it also makes crawling more efficient.
For instance, if you crawl the content of a blog, you could choose to index posts only and to ignore comments.
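Item differentiation often boils down to a simple filter over the types the repository reports; this sketch uses hypothetical blog items to illustrate:

```python
# Hypothetical items as returned by a blog API.
items = [
    {"id": 1, "type": "post", "title": "Welcome"},
    {"id": 2, "type": "comment", "body": "Nice article!"},
    {"id": 3, "type": "post", "title": "Release notes"},
]

# The selection of item types to index; making this a configuration
# option lets you change it without touching the crawler code.
INDEXED_TYPES = {"post"}

def keep(item: dict) -> bool:
    return item["type"] in INDEXED_TYPES

to_index = [item for item in items if keep(item)]
```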
Retrieving metadata is likely to be a crucial task of your crawler. With Coveo, the crawled items are meant to become searchable, and search results can be filtered based on their metadata. It is therefore important to index this information properly.
If you choose to retrieve metadata, you’ll also have to decide whether you want to retrieve everything or some specific pieces of information. The latter is more efficient, but retrieving all the metadata might make it easier to handle the retrieved items further down the line. In other words, if you don’t index all the available metadata, you might have fewer options when writing indexing pipeline extensions, for instance, which might lead you to change the crawler code.
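Selective retrieval can be as simple as keeping only the fields your facets will use; this sketch assumes a hypothetical raw item and field names:

```python
# A hypothetical raw item carrying all the metadata the repository exposes.
raw = {
    "title": "Release notes",
    "author": "alice",
    "created": "2023-05-01",
    "internal_revision": 17,   # repository bookkeeping, not useful in search
    "cache_key": "abc123",
}

# Keep only the metadata the search interface will actually facet on.
FACET_FIELDS = ("title", "author", "created")

def select_metadata(item: dict) -> dict:
    return {k: item[k] for k in FACET_FIELDS if k in item}

metadata = select_metadata(raw)
```

Retrieving everything instead means replacing the comprehension with a copy of `raw`; the trade-off is larger payloads in exchange for more flexibility in later indexing pipeline extensions.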
Coveo can secure many content sources by replicating the repository permission system. As a result, through a Coveo-powered search interface, authenticated users only see the items that they’re allowed to access within the indexed repository.
Retrieving permissions and supporting secured search can be complex. If you choose to do so, you must determine in which format the crawler will retrieve permissions and how they will be used down the line.
For more information on how Coveo handles permissions, see Coveo management of security identities and item permissions.
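To make the format decision concrete, here is a deliberately simplified sketch of one possible permission format, where each item carries the identities allowed or denied to see it (this is an illustration, not Coveo's actual permission model):

```python
# Hypothetical per-item permissions mirroring the repository's own rules.
item_permissions = {
    "doc-42": {"allowed": {"alice", "sales-group"}, "denied": {"bob"}},
}

def can_view(item_id: str, identity: str) -> bool:
    """Denied entries win over allowed entries; unknown items are hidden."""
    perms = item_permissions.get(item_id, {"allowed": set(), "denied": set()})
    return identity in perms["allowed"] and identity not in perms["denied"]
```

Whatever format you choose, it must capture enough of the repository's model (users, groups, nesting) for the index to reproduce the same access decisions.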
When programming your crawler, you must also determine how robust it should be, i.e., how many of the exceptions encountered while crawling it should handle.
If you intend to develop a simple crawler, a crawler for a very specific use case, or a crawler for a one-time use, it could make sense to handle fewer exceptions. However, if you want to distribute a connector that can connect to multiple configurations of a system, you need to plan time and tests to discover the exceptions likely to occur and adjust your crawler accordingly.
A few crawler capabilities depend on keeping a state of the crawled items, i.e., information regarding the last update operation. Upon the next update operation, the crawler uses this information to execute its task.
A refresh operation only updates a fraction of the indexed content. Typically, the items targeted by a refresh are the items that have been added, modified, or deleted since the last update operation. The state kept after the last operation is compared to the most recent content, and the crawler ignores all items that haven’t changed since the last operation.
Since not all items are crawled, a refresh is often much faster than a rebuild operation. If your data is frequently updated, you should consider running refresh operations between rebuild operations to keep your index up to date. As a result, the content returned in your Coveo-powered search interface will be closer to your actual repository content.
However, to implement the refresh capability, you must ensure that the API of the crawled repository offers calls that support it and that the state you keep allows the crawler to leverage these calls. For instance, you could keep a timestamp of the last update operation if the API calls support using date filters to obtain new content.
For further information on the refresh operation, see Refresh.
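The timestamp-based state described above can be kept in a small file; this is a minimal sketch (the file location and JSON layout are arbitrary choices):

```python
import json
import tempfile
from pathlib import Path

# Arbitrary location for the state file; a real crawler would make this
# configurable and durable.
STATE_FILE = Path(tempfile.gettempdir()) / "crawler_state.json"

def load_last_run() -> float:
    """Returns the timestamp of the last update operation (0 on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_run"]
    return 0.0

def save_last_run(ts: float) -> None:
    STATE_FILE.write_text(json.dumps({"last_run": ts}))

# A refresh would then request only items modified after load_last_run(),
# assuming the repository API supports such a date filter.
save_last_run(1700000000.0)
last = load_last_run()
```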
A crawler that has the capability to pause its work can pick up where it left off should an issue arise. Implementing the pause/resume capability therefore requires keeping a state.
The capability to pause a crawling task can be beneficial if you have a large number of items to crawl or if the crawler is running in a fragile environment. If your rebuild operation takes longer than a day, you might want your crawler to be able to resume its work after an unexpected server reboot, for instance, rather than to restart from the beginning.
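A minimal form of pause/resume is a checkpoint recording the last container processed, so a restarted crawl can skip what's already done; the container names here are hypothetical:

```python
# Checkpoint persisted between runs (in memory here for simplicity;
# a real crawler would write it to disk like any other state).
checkpoint = {"container": None, "processed": 0}

containers = ["folder-a", "folder-b", "folder-c"]

def crawl(resume_from=None):
    """Crawls all containers, skipping those before `resume_from`."""
    started = resume_from is None
    for name in containers:
        if not started:
            started = (name == resume_from)
            if not started:
                continue  # already processed in the interrupted run
        # ... crawl the container's items here ...
        checkpoint["container"] = name
        checkpoint["processed"] += 1

crawl(resume_from="folder-b")  # skips folder-a, resumes at folder-b
```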
Discovering content that has been deleted and updating the index accordingly can be a useful capability of your crawler project, especially if it’s done as part of a refresh operation.
However, most APIs don’t provide a list of deleted items, so deriving one from other data can be complex. Keeping a state can help you track the items that have been crawled and determine which are now gone, but this process may require a lot of computing.
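The state-based approach amounts to diffing the set of item IDs seen during the previous crawl against the current one:

```python
# Item IDs recorded during the previous crawl (the kept state) versus
# those discovered now; the IDs are hypothetical.
previous_state = {"doc-1", "doc-2", "doc-3"}
current_crawl = {"doc-1", "doc-3", "doc-4"}

deleted = previous_state - current_crawl  # items to remove from the index
added = current_crawl - previous_state    # new items to push
```

The cost is that the crawler must still enumerate every item ID, even during a refresh, which is why this process may be computationally expensive on large repositories.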
Typical Workflow of a Crawler
Step 1: Configuration Retrieval
First, the crawler must obtain and validate the configuration data. It will likely need user credentials to authenticate to your content repository. We recommend implementing a mechanism that lets the user know if the configuration is invalid or incomplete. It is best to have the operation fail immediately rather than to crawl and fail later due to an invalid configuration.
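A fail-fast validation step can report every problem at once before any crawling starts; this sketch uses illustrative setting names:

```python
def validate_config(config: dict) -> list:
    """Returns a list of human-readable problems; an empty list means
    the configuration is valid. Field names are illustrative only."""
    problems = []
    for required in ("repository_url", "username", "api_key"):
        if not config.get(required):
            problems.append(f"Missing required setting: {required}")
    if config.get("page_size", 1) <= 0:
        problems.append("page_size must be a positive integer")
    return problems

# Failing immediately, with every problem listed, beats failing mid-crawl.
errors = validate_config({"repository_url": "https://example.com", "page_size": 0})
```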
Step 2: Authentication
Once the crawler has validated the configuration information, it should authenticate and connect to the content repository you want to crawl. Implementing the authentication step may be complex if there are multiple scenarios to handle, i.e., multiple ways to log in to your repository. As a result, testing how the crawler authenticates early in the project is a good idea.
Step 3: Iteration
Once the crawler has access to the content to index, it can operate iteratively, going through each item container, one after the other.
Step 4: Content Structure
You should also structure the content to index. For example, you might want to format the data to send to Coveo’s Push API, or you might need to aggregate metadata from the item’s parent container.
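Structuring often means shaping each crawled item into one JSON body; in this sketch the field names resemble, but are not guaranteed to match, what a push endpoint expects:

```python
import json

def to_push_document(item: dict, parent: dict) -> dict:
    """Shapes a crawled item into a JSON-ready document. Field names
    are illustrative, not the exact Push API contract."""
    return {
        "documentId": item["url"],
        "title": item["title"],
        "data": item["body"],
        # Metadata aggregated from the item's parent container:
        "space": parent["name"],
    }

doc = to_push_document(
    {"url": "https://example.com/a", "title": "A", "body": "Hello"},
    {"name": "Blog"},
)
payload = json.dumps(doc)
```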
Step 5: Sending the Content
The last task that your connector has to accomplish is to provide Coveo with the content to index. This can be done through Coveo’s Push API, a folder on the computer hosting your crawler, etc.
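The folder option can be as simple as writing each structured item to a drop folder as a JSON file; the folder location and file naming here are arbitrary:

```python
import json
import tempfile
from pathlib import Path

def write_to_folder(doc: dict, folder: Path) -> Path:
    """Writes one item as a JSON file into a drop folder, a simple
    alternative to calling a push endpoint directly."""
    folder.mkdir(parents=True, exist_ok=True)
    # Derive a file name from the document ID; any unique scheme works.
    path = folder / f"{abs(hash(doc['documentId']))}.json"
    path.write_text(json.dumps(doc))
    return path

out = write_to_folder(
    {"documentId": "https://example.com/a", "title": "A"},
    Path(tempfile.gettempdir()) / "crawler_output",
)
```

A separate process (or a scheduled task) can then pick up the files and push them, which decouples crawling from delivery.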