Indexing Pipeline Extension Overview

Your Coveo Cloud organization uses an indexing pipeline with various stages to process source items from enterprise repositories and make them searchable (see Coveo Cloud V2 Indexing Pipeline).

An extension consists of a Python script used to customize the way source items are indexed. You must define the extension in your Coveo Cloud organization, and then apply it to one or more sources. When items are processed in the pipeline, your extension is applied at the pre-conversion or post-conversion stage.

For example, you make a website searchable using a Web source. The last modification date of the page content is only available in a <meta> element in the pages, and only as a text string in local time (e.g., March 12, 2017 03:11:32 PM). You want to use this date text string and properly include it in an index date field in UTC format.

You (or a developer) create an extension with a Python script that converts the date text to a UTC date value and stores the converted value in the item metadata.
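The conversion at the heart of this example can be sketched as follows. This is a minimal illustration, not the script of an actual extension: the function name `local_text_to_utc` and the fixed UTC offset of the website (`SITE_UTC_OFFSET_HOURS`) are assumptions made for the sketch, and a real site observing daylight saving time would need proper time zone handling. In an extension, the input text would come from the item metadata and the result would be written back to it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical: the website publishes dates at a fixed offset of UTC-5.
SITE_UTC_OFFSET_HOURS = -5

# Matches strings such as "March 12, 2017 03:11:32 PM" (English month names).
DATE_FORMAT = "%B %d, %Y %I:%M:%S %p"


def local_text_to_utc(date_text, utc_offset_hours=SITE_UTC_OFFSET_HOURS):
    """Parse a local-time date string and return an aware datetime in UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.strptime(date_text, DATE_FORMAT).replace(tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc)


utc_value = local_text_to_utc("March 12, 2017 03:11:32 PM")
print(utc_value.isoformat())
```

Keeping the conversion in a small pure function makes it easy to test outside the indexing pipeline before hosting it in an extension.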

As shown in the following diagram, when a crawled or pushed item enters the indexing pipeline, the Document Processing Manager (DPM) adds a pre-conversion or post-conversion stage for each extension that’s applied to a source.

[Diagram: indexing pipeline extension schema]

An extension can be conditionally applied for each source item based on the item type (Common versus Specific extension application), or based on a condition expression.

Deploying an Indexing Pipeline Extension

The following procedure outlines the steps to get an indexing pipeline extension to work its magic.

  1. Create or adapt a script to use in your extension.

    You need at least basic developer skills to write or adapt a sample script that does the custom processing you need for one or more of your Coveo Cloud organization sources.

    Indexing pipeline extensions provide great flexibility to process items. However, executing a script for each item of a source can notably increase indexing time, so use them as a last resort when other available indexing customization tools don’t allow you to perform the specific task (see Indexing Pipeline Customization Tools Overview).

    Consult the following references to help you write your script:

  2. From the Coveo Cloud Administration Console, create an extension to host your script (see Add or Edit an Indexing Pipeline Extension).

  3. Test your indexing pipeline extension (see Indexing Pipeline Extension Testing Strategies and Good Practices).

    What you validate depends entirely on what the script does. Verify that the extension performed as expected for all applicable items.

    Extension testing suggestions:

  4. Apply your extension to your production source (see Applying an Extension to a Source).

  5. Rebuild your source (see Refresh, Rescan, or Rebuild Sources).

  6. Validate that your extension processed the source items as expected.

    Verify that the extension performed as expected for all applicable items of your production source.

What Are Possible Extension Script Purposes?

In short, here are the types of actions an indexing pipeline extension script can perform for each item of a source:

  • Add, modify, or clear metadata.

  • Get the metadata value and data streams from any indexing pipeline stage by using the origin attribute (identified by the stage or extension name).

  • Modify binary streams:

    • Body text

    • Body HTML

    • Thumbnail

    • Original file

  • Reject items (exclude them from the index).

  • Add, modify, or delete permissions.
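As an illustration of the metadata and rejection actions listed above, here is a minimal sketch that runs outside the pipeline. `StubDocument` is a hypothetical stand-in written for this example. Its method names (`get_meta_data_value`, `add_meta_data`, `reject`) are modeled on the extension document object but should be treated as assumptions and checked against the API reference. The `title` and `normalizedtitle` metadata names are also made up.

```python
class StubDocument:
    """Hypothetical stand-in for the document object an extension receives."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumed behavior: return a list of values, empty when the name is unknown.
        value = self.metadata.get(name)
        return [] if value is None else [value]

    def add_meta_data(self, values):
        # Add new metadata or overwrite existing values.
        self.metadata.update(values)

    def reject(self):
        # Mark the item so that it's excluded from the index.
        self.rejected = True


def process(document):
    """Reject items without a title; otherwise add a normalized copy of it."""
    titles = document.get_meta_data_value("title")
    if not titles:
        document.reject()
        return
    document.add_meta_data({"normalizedtitle": titles[0].strip().lower()})


item = StubDocument({"title": "  Quarterly Report "})
process(item)
print(item.metadata["normalizedtitle"], item.rejected)
```

Writing the processing logic as a function that takes the document as a parameter lets you exercise it locally with a stub before running it against real items.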

What’s the Extension Script Environment?

An indexing pipeline extension Python 3 script:

  • Runs in a separate non-persistent isolated OS instance for each item.

  • Can import common Python libraries (such as Requests) available in the OS instance (see Python Modules Available to Indexing Pipeline Extensions).

  • Can read and write to the local folder, but without persistence between extension instances for each source item.

  • Can access the Internet.

Pre-Conversion Versus Post-Conversion

The following criteria indicate when each indexing pipeline stage is more appropriate. When in doubt, favor adding your extensions at the post-conversion stage.

Pre-Conversion

Use when:

  • The script purpose is to reject items, which prevents wasting resources on further indexing pipeline stages.

    For example, you want to create separate sources for the oldest and newest items of a repository. In the source for the newest items, you add a script that rejects items with a last modification date older than your splitting date.

  • The script modifies the original item data content and you want the Processing stage to process your changes.

Metadata added in the pre-conversion stage isn't automatically mapped to a field with a matching name. You must add a mapping to the source(s) for which you want to leverage the metadata (see Adding and Managing Source Mappings).

Post-Conversion

Use when:

  • The script needs to get the Body text or the Body HTML data stream processed by the Processing stage.

  • You want to ensure that your metadata changes won't be altered by another stage.

    Note that when more than one post-conversion extension is applied to a source, another extension could still execute after yours.

  • You want to create a script to discover all available metadata from all previous stages.
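The pre-conversion rejection scenario described above (splitting a repository by last modification date) comes down to a date comparison, sketched below. The splitting date, the function name `is_older_than_split`, and the assumption that the date arrives as an ISO 8601 string are all hypothetical. In an extension, a `True` result would lead the script to reject the item.

```python
from datetime import datetime, timezone

# Hypothetical splitting date between the "oldest" and "newest" sources.
SPLIT_DATE = datetime(2015, 1, 1, tzinfo=timezone.utc)


def is_older_than_split(last_modified_iso, split_date=SPLIT_DATE):
    """Return True when the item was last modified before the splitting date."""
    last_modified = datetime.fromisoformat(last_modified_iso)
    if last_modified.tzinfo is None:
        # Assumption for this sketch: dates without an offset are already in UTC.
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    return last_modified < split_date


for value in ("2014-06-30T12:00:00+00:00", "2016-02-01T08:30:00"):
    print(value, "-> reject" if is_older_than_split(value) else "-> keep")
```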

Usage Limits

By default, the following indexing pipeline extension usage limits apply to all organizations:

  • Number of extensions per organization: 10

  • Number of extensions per source: 20

    You can apply the same extension twice on a source (once in pre-conversion and once in post-conversion).

  • Extension execution timeout: 5 seconds

    Most common indexing pipeline extension applications only modify the item metadata and typically execute in significantly less than a second. An extension can take significantly longer when getting and processing an item's Body text or when calling an external service to process the item, particularly for large items. Extension execution can also have a significant impact on crawling performance for sources containing many items.
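When a script calls an external service, one way to stay within the execution timeout is to track how much of the time budget is left and use that value as the timeout of the call. The sketch below assumes the default 5-second limit listed above. The safety margin and the `remaining_budget` helper are hypothetical names introduced for illustration.

```python
import time

EXECUTION_TIMEOUT_SECONDS = 5.0  # default extension execution timeout
SAFETY_MARGIN_SECONDS = 1.0      # hypothetical margin kept for the rest of the script


def remaining_budget(started_at, now=None):
    """Return the seconds still available for slow operations, never below zero."""
    if now is None:
        now = time.monotonic()
    elapsed = now - started_at
    return max(0.0, EXECUTION_TIMEOUT_SECONDS - SAFETY_MARGIN_SECONDS - elapsed)


started_at = time.monotonic()
budget = remaining_budget(started_at)
if budget > 0:
    # A call to an external service would pass `budget` as its timeout here.
    print(f"Up to {budget:.1f}s available for the external call")
else:
    print("No time left; skip the external call")
```

Skipping the optional call when the budget is exhausted lets the script finish its remaining work instead of being interrupted by the timeout.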

You can review your extension and other usage limits by clicking the Settings icon in the upper-right corner of the Administration Console (see Content Limits).

Contact Coveo Support if you would like to upgrade your Coveo Cloud license to increase your extension limits.
