Using pre-push extensions

After you create a Crawling Module source, you may need to customize the way source items are indexed. One way to do this is to use an extension, a Python script that you write and that runs for every item crawled by your source.

Coveo lets you apply extensions at two distinct stages of the indexing process:

This article details how to create and apply a pre-push extension to a Crawling Module source.

Important
  • Before creating extensions, make sure the source configuration doesn’t already provide the functionality you need.

  • Allowlist https://pypi.org in your security solution to enable the download of required Python modules.

  • Applying an extension to a source can significantly slow down content crawling.

Apply a pre-push extension to a source

To apply a pre-push extension to a Crawling Module source:

  • Write a Python script that implements the logic you want to apply to crawled items.

  • Reference the script in the PrePushExtension parameter of your source’s JSON configuration.

    Instructions to set the PrePushExtension parameter
    1. On the Sources (platform-ca | platform-eu | platform-au) page, click the desired source, and then click More > Edit configuration with JSON in the Action bar.

    2. Click the Parameters tab located above the JSON configuration box.

    3. Add a comma (,) after the last parameter configuration, and then add the PrePushExtension parameter configuration with the value set to your script file name. For example, if your script file name is MyPrePushExtension.py, you must add the following to the parameters section of your source JSON configuration:

      "PrePushExtension": {
          "sensitive": false,
          "value": "MyPrePushExtension.py"
      }

      The bottom of the JSON configuration box should now be similar to the following:

      The pre-push configuration in the source JSON | Coveo
  • Add your script’s external dependencies to the requirements.txt file in the C:\ProgramData\Coveo\Maestro\Python3PrePushExtensions folder.

  • Implement logging in your script, ideally to a subfolder under C:\ProgramData\Coveo\Maestro\Logs, to help with debugging. See the provided script examples for logging logic.

  • Allowlist https://pypi.org in your security solution to enable the download of required Python modules.

Write the Python script

A pre-push extension script must meet these requirements:

  • It must be a Python 3 script.

  • Save it in the C:\ProgramData\Coveo\Maestro\Python3PrePushExtensions folder (ProgramData is hidden by default).

  • Define a do_extension function that accepts the body argument and returns the modified body. This argument is a JSON representation of the crawled item, for example:

    {
      "DocumentId": "file:///c:/tmp/testdata/sample.txt",
      "CompressionType": "ZLIB",  1
      "CompressedBinaryData": "eAELycgsVgCiRIWS1OISPQAplwUk",  1
      "clickableuri": "file:///C:/Tmp/TestData/sample.txt",
      "date": "2022-05-04 15:07:39",
      "Permissions": [{  2
          "PermissionSets": [{
              "AllowAnonymous": false,
              "AllowedPermissions": [{
                  "IdentityType": "GROUP",
                  "SecurityProvider": "Email Security Provider",
                  "Identity": "*@*",
                  "AdditionalInfo": {}
                }
              ]
            }
          ]
        }
      ],
      "fileextension": "txt",
      "connectortype": "FileCrawler",
      "source": "File",
      "collection": "File Collection",
      "generateexcerpt": true,
      "contenttype": "",
      "originaluri": "file:///c:/tmp/testdata/sample.txt",
      "printableuri": "file:///C:/Tmp/TestData/sample.txt",
      "filename": "sample.txt",
      "permanentid": "9a3a317a4e49c31962b969967c15e51477b0fd9ca33dceac76c94982593b",
      "size": 15,
      "compressedsize": 21,
      "creationdate": "2022-05-04 15:07:23",
      "lastaccessdate": "2023-07-04 19:07:49",
      "folder": "C:\\Tmp\\TestData",
      "fileowner": "COVEO\\Bob",
      "lastwritedate": "2022-05-04 15:07:39",
      "parents": "<?xml version=\"1.0\" encoding=\"utf-16\"?><parents><parent name=\"C:\" uri=\"file:///C:/\" /><parent name=\"Tmp\" uri=\"file:///C:/Tmp/\" /><parent name=\"TestData\" uri=\"file:///C:/Tmp/TestData/\" /><parent name=\"sample.txt\" uri=\"file:///C:/Tmp/TestData/sample.txt\" /></parents>",
      "coveo_metadatasampling": 1
    }
    1 To modify item data, base64-encode and compress the content, then set the CompressionType and CompressedBinaryData properties. See the Add item data example.
    2 Avoid modifying the Permissions property using a pre-push extension, as it may allow unauthorized access to content in your search interface.
Tip

The properties in the body JSON may vary by source type and configuration. To assist with script development, you can log the current input JSON to a file for review.

The following shows a simple pre-push extension script template:

# Import required Python libraries.
import sys
...

# Set up logging or other initial configurations.
log_folder = os.path.join(os.getenv('COVEO_LOGS_ROOT'), 'Extensions', os.getenv('SOURCE_ID','unknown')) 1
...

# ------------------------------------------------------------------------
# Entry point for the extension. The do_extension function must be defined.
# ------------------------------------------------------------------------
def do_extension(body):
    # Apply transformation logic and log actions.
    ...

    return body
1 The Coveo Crawling Module sets environment variables you can access in your script.

An extension script runs automatically during a source content update. You can apply only one pre-push extension script per source. However, that script can call multiple functions, including those in other Python files.

Coveo provides sample scripts covering common use cases. Use them as templates to build your own.

Coveo pre-push extension environment variables

The Crawling Module sets the following environment variables:

Variable Description

COVEO_LOGS_ROOT

Root folder for logging (usually C:\ProgramData\Coveo\Maestro\Logs).

ORGANIZATION_ID

The unique identifier of your Coveo organization. For example, contosovequep8c.

SOURCE_ID

The unique identifier of the source currently being processed. For example, contosovequep8c-rki3drt6rgruxenppqst5kydxq.

OPERATION_TYPE

Type of content operation currently being performed. For example, Rebuild.

OPERATION_ID

Unique identifier of the current operation. It’s displayed in the activity details on the Activity Browser (platform-ca | platform-eu | platform-au) page. For example, b21f6712-f630-4b45-az94-158c1a9a05e3.

CRAWLING_MODULE_ID

Unique identifier of the Crawling Module. It’s displayed on the Crawling Modules (platform-ca | platform-eu | platform-au) page. For example, contosovequep8c-47273014-aaa4-4fb3-a562-e08c2d761a31.

Precautions when using pre-push extensions

Extensions can affect the performance of your source crawling. If a script runs too long or encounters errors, items will be indexed without applying the script, which may cause unexpected results in your search interface.

Tip
Leading practice

Apply the extension to a duplicate of your production source with a name that clearly indicates it’s for testing purposes only. In this test source, crawl only a small subset of content for faster debugging and to limit the log file size.

Only after fully testing and validating the pre-push extension in the test source should you apply it to your production source.

What’s next?

Review the pre-push extension examples to start writing your own extensions.