Indexing pipeline extension overview
Your Coveo organization uses an indexing pipeline with various stages to process source items from enterprise repositories and make them searchable (see Coveo indexing pipeline).
An extension consists of a Python script used to customize the way source items are indexed. You must define the extension in your Coveo organization, and then apply it to one or more sources. When items are processed in the pipeline, your extension is applied at the pre-conversion or post-conversion stage.
You make a site searchable using a Web source type.
The last modification date of the page content is only available in a <meta> element in the pages, but as a text string in the local time (for example, March 12, 2017 03:11:32 PM).
You want to use this date text string and properly include it in an index date field in UTC format.
You (or a developer) create an extension with a Python script that converts the date text to a UTC date value and sets a metadata with the converted value.
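The conversion described in this example could be sketched as follows. This is a minimal illustration, not the script Coveo ships: the fixed UTC-4 offset, the metadata names `lastmodifiedtext` and `lastmodifiedutc`, and the exact behavior of `document.get_meta_data_value` (assumed to return a list of values) and `document.add_meta_data` (assumed to accept a dictionary) are assumptions to verify against the extension API reference.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site publishes dates in a fixed UTC-4 local time.
LOCAL_UTC_OFFSET = timezone(timedelta(hours=-4))
DATE_FORMAT = "%B %d, %Y %I:%M:%S %p"  # matches: March 12, 2017 03:11:32 PM


def local_text_to_utc(date_text, local_tz=LOCAL_UTC_OFFSET):
    """Parse a local date string and return an aware datetime in UTC."""
    naive = datetime.strptime(date_text.strip(), DATE_FORMAT)
    return naive.replace(tzinfo=local_tz).astimezone(timezone.utc)


def set_utc_date(document, source_name="lastmodifiedtext",
                 target_name="lastmodifiedutc"):
    """Read the date text from one metadata and store the UTC value in another.

    Both metadata names are hypothetical. Returns the converted datetime,
    or None when the source metadata is missing.
    """
    values = document.get_meta_data_value(source_name)
    if not values:
        return None
    utc_date = local_text_to_utc(values[0])
    document.add_meta_data({target_name: utc_date.isoformat()})
    return utc_date
```

In an extension, the runtime is expected to supply the `document` object, so the script would end with a call such as `set_utc_date(document)`. A site whose local time observes daylight saving would need a real time zone definition rather than a fixed offset.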
As shown in the following diagram, when a crawled or pushed item enters the indexing pipeline, the Document Processing Manager (DPM) adds a pre-conversion or post-conversion stage for each extension that’s applied to a source.
An extension can be conditionally applied for each source item based on the item type (Common versus Specific extension application), or based on a condition expression.
Deploying an indexing pipeline extension
The following procedure outlines the steps to get an indexing pipeline extension to work its magic.
-
Create or adapt a script to use in your extension.
You need at least basic developer skills to write or adapt a sample script that does the custom processing you need for one or more of your Coveo organization sources.
Note: Indexing pipeline extensions provide great flexibility to process items. However, executing a script for each item of a source can notably increase indexing time, so use them as a last resort when the other available indexing customization tools can't perform the specific task (see Indexing pipeline customization tools overview).
Consult the following references to help you write your script:
-
From the Coveo Administration Console, create an extension to host your script (see Add or edit an indexing pipeline extension).
-
Test your indexing pipeline extension (see Indexing pipeline extension testing strategies and good practices).
What you validate depends entirely on what the script does. Verify that the extension performed as expected for all applicable items.
Extension testing suggestions:
-
Create and use a temporary source with a small number of representative items of your production source to test your extension.
-
Use logs in your script while debugging (see Logging messages from an indexing pipeline extension).
-
Apply your extension to your production source (see Apply an extension to a source).
-
Rebuild your source (see Refresh, rescan, or rebuild sources).
-
Validate that your extension processed the source items as expected.
Verify that the extension performed as expected for all applicable items of your production source.
What are possible extension script purposes?
In short, here are the types of actions an indexing pipeline extension script can perform for each item of a source:
-
Add, modify, or clear metadata.
-
Get the metadata value and data streams from any indexing pipeline stage by using the origin attribute (identified by the stage or extension name).
-
Modify binary streams:
-
Body text
-
Body HTML
-
Thumbnail
-
Original file
-
Reject items (exclude them from the index).
-
Add, modify, or delete permissions.
What’s the extension script environment?
An indexing pipeline extension Python 3 script:
-
Runs in a separate non-persistent isolated OS instance for each item.
-
Can import common Python libraries (such as Requests) available in the OS instance (see Python modules available to indexing pipeline extensions).
-
Can read and write to the local folder, but without persistence between extension instances for each source item.
-
Can access the Internet.
Pre-conversion versus post-conversion
The following table provides some criteria indicating when each indexing pipeline stage is more appropriate. When in doubt, add your extension at the post-conversion stage.
Pre-conversion | Post-conversion |
---|---|
Use when: the script purpose is to reject items, and therefore prevent wasting resources on further indexing pipeline stages. Example: You want to create separate sources for the oldest and newest items of a repository. In the source for the newest items, you add a script that rejects items with a last modification date older than your splitting date. | Use when: |
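The pre-conversion example in the table (rejecting items older than a splitting date) could look roughly like the sketch below. The splitting date is a hypothetical value, the way the item's last modification date is obtained is left to the caller, and `document.reject()` is assumed to be the call the extension runtime exposes for excluding an item.

```python
from datetime import datetime, timezone

# Hypothetical splitting date: older items belong to the other source.
SPLIT_DATE = datetime(2020, 1, 1, tzinfo=timezone.utc)


def is_older_than_split(last_modified, split_date=SPLIT_DATE):
    """Return True when the item was last modified before the splitting date."""
    return last_modified < split_date


def reject_if_old(document, last_modified, split_date=SPLIT_DATE):
    """Reject the item when it is older than the splitting date.

    Returns True if the item was rejected, False otherwise.
    """
    if is_older_than_split(last_modified, split_date):
        document.reject()  # assumed runtime call that excludes the item
        return True
    return False
```

Running this check in pre-conversion means a rejected item never reaches the later, more expensive stages, which is the reason the table recommends that stage for rejection scripts.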
Usage limits
By default, the following indexing pipeline extension usage limits apply to all organizations:
-
Number of extensions per organization: 10
-
Number of extensions that can be applied to a source: 20
Note: You can apply the same extension twice to a source (that is, once in pre-conversion and once in post-conversion).
-
Extension execution timeout: 5 seconds
Most common indexing pipeline extension applications only modify the item metadata and typically execute in significantly less than a second. An extension can take significantly longer when getting and processing an item's Body text or when calling an external service to process the items, particularly for large items. Extension execution can also have a significant impact on crawling performance for sources containing many items.
Note: You can review your extension and other usage limits in the Coveo Administration Console Settings page, under License > Limits (platform-ca | platform-eu | platform-au). Contact Coveo Support if you would like to upgrade your Coveo license to increase your extension limits.
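Because of the 5-second execution timeout listed above, a script that calls an external service may want to track how much time it has left. The sketch below is one possible approach, not a Coveo-provided mechanism: the one-second safety margin and the helper names are hypothetical, and it assumes the clock starts roughly when the script starts.

```python
import time

EXECUTION_TIMEOUT_SECONDS = 5.0  # default limit listed above
SAFETY_MARGIN_SECONDS = 1.0      # hypothetical margin kept in reserve


def remaining_budget(started_at, now=None):
    """Return the seconds left before the safety margin is reached (never negative)."""
    if now is None:
        now = time.monotonic()
    elapsed = now - started_at
    return max(0.0, EXECUTION_TIMEOUT_SECONDS - SAFETY_MARGIN_SECONDS - elapsed)


def call_with_budget(started_at, call, fallback=None):
    """Run call(timeout) only if time remains; otherwise return the fallback."""
    budget = remaining_budget(started_at)
    if budget <= 0:
        return fallback
    return call(budget)
```

A script could record `started_at = time.monotonic()` on its first line and later pass the remaining budget as the timeout of an HTTP request, for example `call_with_budget(started_at, lambda t: requests.get(url, timeout=t))`, where `url` stands for a hypothetical service endpoint.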