Indexing pipeline extension overview

Your Coveo organization uses an indexing pipeline with various stages to process source items from enterprise repositories and make them searchable (see Coveo indexing pipeline).

An extension consists of a Python script used to customize the way source items are indexed. You must define the extension in your Coveo organization, and then apply it to one or more sources. When items are processed in the pipeline, your extension is applied at the pre-conversion or post-conversion stage.

Example

You make a site searchable using a Web source. The last modification date of each page is only available in a <meta> element, as a text string in local time (for example, March 12, 2017 03:11:32 PM). You want to convert this text string and store it in an index date field in UTC format.

You (or a developer) create an extension with a Python script that converts the date text to a UTC date value and sets a metadata with the converted value.
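The conversion at the heart of this example could look like the following minimal sketch. It assumes the site publishes times at a fixed UTC-5 offset (a real site would likely need daylight-saving-aware handling), and the names `SITE_OFFSET`, `DATE_FORMAT`, and `local_text_to_utc` are hypothetical. Reading the text from the item and writing the result back as metadata is left out, since that part depends on the Coveo `document` object.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site reports times at a fixed UTC-5 offset (no DST handling).
SITE_OFFSET = timezone(timedelta(hours=-5))

# Matches text such as "March 12, 2017 03:11:32 PM".
DATE_FORMAT = "%B %d, %Y %I:%M:%S %p"


def local_text_to_utc(date_text, local_tz=SITE_OFFSET):
    """Parse a local-time date string and return an aware UTC datetime."""
    naive = datetime.strptime(date_text.strip(), DATE_FORMAT)
    return naive.replace(tzinfo=local_tz).astimezone(timezone.utc)


converted = local_text_to_utc("March 12, 2017 03:11:32 PM")
print(converted.isoformat())  # 2017-03-12T20:11:32+00:00
```

In an extension, the returned value would then be set as the metadata that is mapped to the date field.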

As shown in the following diagram, when a crawled or pushed item enters the indexing pipeline, the Document Processing Manager (DPM) adds a pre-conversion or post-conversion stage for each extension that’s applied to a source.

[Diagram: Indexing pipeline extension schema]

An extension can be conditionally applied for each source item based on the item type (Common versus Specific extension application), or based on a condition expression.

Deploying an indexing pipeline extension

The following procedure outlines the steps required to deploy an indexing pipeline extension.

  1. Create or adapt a script to use in your extension.

    You need at least basic developer skills to write or adapt a sample script that does the custom processing you need for one or more of your Coveo organization sources.

    Note

    Indexing pipeline extensions provide great flexibility to process items. However, since executing a script for each item of a source can notably increase indexing time, use them as a last resort, when no other indexing customization tool can perform the specific task (see Indexing pipeline customization tools overview).

    Consult the following references to help you write your script:

  2. From the Coveo Administration Console, create an extension to host your script (see Add or edit an indexing pipeline extension).

  3. Test your indexing pipeline extension (see Indexing pipeline extension testing strategies and good practices).

    What you validate depends entirely on what the script does. Verify that the extension performed as expected for all applicable items.

    Extension testing suggestions:

  4. Apply your extension to your production source (see Apply an extension to a source).

  5. Rebuild your source (see Refresh, rescan, or rebuild sources).

  6. Validate that your extension processed the source items as expected.

    Verify that the extension performed as expected for all applicable items of your production source.

Managing credentials in indexing pipeline extensions

Your indexing pipeline extension may need to handle credentials, API keys, or other sensitive data (for example, to connect to external services or databases). To keep such information secure, store it in vault entries and retrieve it through vault parameters in your extension. Sensitive values such as passwords and API keys are then never hardcoded in your extension scripts, which reduces the risk of exposing them in logs or source code. See Create a vault entry for instructions.
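The following sketch illustrates the idea of reading a secret from parameters instead of hardcoding it. It assumes that vault parameters reach the script as a dictionary-like object; here a plain dictionary named `example_parameters` stands in for it. The key name `service_api_key` and the helpers `require_parameter` and `mask` are hypothetical.

```python
def require_parameter(params, name):
    """Return a non-empty parameter value, or fail without echoing any secret."""
    value = params.get(name)
    if not value:
        raise KeyError(f"Missing extension parameter: {name}")
    return value


def mask(secret, visible=4):
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Assumption: stand-in for the parameters an extension receives at run time.
example_parameters = {"service_api_key": "abcd1234efgh5678"}

api_key = require_parameter(example_parameters, "service_api_key")
print("Using API key", mask(api_key))
```

Masking the value before logging keeps the full secret out of extension logs even when the script reports which key it used.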

What are possible extension script purposes?

In short, an indexing pipeline extension script can perform the following types of actions for each item of a source:

  • Add, modify, or clear metadata.

  • Get metadata values and data streams from any indexing pipeline stage by using the origin attribute (identified by the stage or extension name).

  • Modify binary streams:

    • Body text

    • Body HTML

    • Thumbnail

    • Original file

  • Reject items (exclude them from the index).

  • Add, modify, or delete permissions.
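Two of the actions listed above, modifying metadata and rejecting items, can be sketched as follows. `StubDocument` is a hypothetical stand-in written so that the example runs outside the pipeline. Its method names are assumed to mirror those of the `document` object that Coveo exposes to scripts, and should be checked against the official API reference before use.

```python
class StubDocument:
    """Hypothetical stand-in for the object the pipeline passes to a script."""

    def __init__(self, meta_data):
        self._meta_data = {name: list(values) for name, values in meta_data.items()}
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumption: metadata values are returned as a list.
        return list(self._meta_data.get(name, []))

    def add_meta_data(self, values):
        for name, value in values.items():
            self._meta_data[name] = value if isinstance(value, list) else [value]

    def reject(self):
        self.rejected = True


def process(document):
    """Trim the title and record its length, or reject items with no title."""
    titles = document.get_meta_data_value("title")
    title = titles[0].strip() if titles else ""
    if not title:
        document.reject()
        return
    document.add_meta_data({"title": title, "titlelength": len(title)})


kept = StubDocument({"title": ["  Quarterly report  "]})
process(kept)
dropped = StubDocument({})
process(dropped)
print(kept.get_meta_data_value("title"), kept.rejected, dropped.rejected)
```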

What’s the extension script environment?

An indexing pipeline extension Python 3 script:

  • Runs in a separate non-persistent isolated OS instance for each item.

  • Can import common Python libraries (such as Requests) available in the OS instance (see Python modules available to indexing pipeline extensions).

  • Can read and write to the local folder, but without persistence between extension instances for each source item.

  • Can access the Internet.

Pre-conversion versus post-conversion

The following criteria indicate when each indexing pipeline stage is more appropriate. When in doubt, favor adding your extension as a post-conversion stage.

Pre-conversion

Use when:

  • The script purpose is to reject items, and therefore prevent wasting resources on further indexing pipeline stages.

    Example

    You want to create separate sources for the oldest and newest items of a repository. In the source for the newest items, you add a script that rejects items with a last modification date older than your splitting date.

  • The script modifies the original item data and you want the Processing stage to process your changes.

    Note

    Metadata added in the pre-conversion stage isn’t automatically mapped to a field with a matching name. You must add a mapping to the source(s) for which you want to leverage the metadata (see Manage source mappings).

Post-conversion

Use when:

  • The script needs to get the Body text or the Body HTML data stream processed by the Processing stage.

  • You want to ensure that your metadata changes won’t be altered by another stage.

    Note

    When more than one post-conversion extension is applied to a source, another extension could still execute afterward and alter your changes.

  • You want to create a script to discover all available metadata from all previous stages.
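The item-splitting example given for the pre-conversion stage relies on a date comparison that could be sketched as below. `SPLIT_DATE` and `is_older_than_split` are hypothetical names, and the sketch assumes the last modification date is available as an ISO 8601 string. An actual script would call the item rejection method where this sketch prints "reject".

```python
from datetime import datetime, timezone

# Hypothetical splitting date between the "oldest" and "newest" sources.
SPLIT_DATE = datetime(2020, 1, 1, tzinfo=timezone.utc)


def is_older_than_split(last_modified_text, split_date=SPLIT_DATE):
    """Return True when the item was last modified before the splitting date."""
    last_modified = datetime.fromisoformat(last_modified_text)
    if last_modified.tzinfo is None:
        # Assumption: dates without an offset are already in UTC.
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    return last_modified < split_date


for text in ("2019-06-30T12:00:00+00:00", "2021-03-15T08:30:00+00:00"):
    action = "reject" if is_older_than_split(text) else "keep"
    print(text, "->", action)
```

Running this check in pre-conversion means that rejected items never reach the later, more expensive stages.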

Usage limits

By default, the following indexing pipeline extension usage limits apply to all organizations:

  • Number of extensions per organization: 10

  • Number of extensions that can be applied to a source: 20

    Note

    You can apply the same extension two times to a source (that is, one time in pre-conversion and one time in post-conversion).

  • Extension execution timeout: 5 seconds

    Most common indexing pipeline extensions only modify item metadata and typically execute in well under a second. An extension can take significantly longer when it gets and processes an item’s Body text or calls an external service, particularly for large items. Extension execution can also significantly slow crawling for sources containing many items.
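One way to stay within the execution timeout is to have the script track its own time budget and stop optional work early. The following is a minimal sketch of that idea, not a Coveo feature: `TimeBudget`, `process_within_budget`, and the one-second safety margin are all hypothetical choices.

```python
import time

EXECUTION_LIMIT_SECONDS = 5.0  # default extension execution timeout
SAFETY_MARGIN_SECONDS = 1.0    # assumption: time reserved for finishing up


class TimeBudget:
    """Track how much of the execution limit is left."""

    def __init__(self, limit=EXECUTION_LIMIT_SECONDS,
                 margin=SAFETY_MARGIN_SECONDS, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + limit - margin

    def remaining(self):
        return max(0.0, self._deadline - self._clock())


def process_within_budget(chunks, handle, budget):
    """Handle chunks in order and stop once the budget is used up."""
    handled = 0
    for chunk in chunks:
        if budget.remaining() <= 0.0:
            break
        handle(chunk)
        handled += 1
    return handled


budget = TimeBudget()
count = process_within_budget(["a", "b", "c"], str.upper, budget)
print(count, "chunks handled")
```

The check is cooperative, so it can't interrupt a call that is already blocking. When calling an external service, passing `budget.remaining()` as the request timeout is one way to bound that call as well.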

Note

You can review your extension and other usage limits in the Coveo Administration Console Settings page, under License > Limits (platform-ca | platform-eu | platform-au).

Contact Coveo Support if you would like to upgrade your Coveo license to increase your extension limits.