Pre-Push Example for Merging Item Data and Metadata

This article applies to the new Crawling Module, which works without Docker. If you still use the Crawling Module with Docker, see Pre-Push Example for Merging Item Data and Metadata (Docker Version) instead. You might also want to read about the advantages of the new Crawling Module.

To identify which Crawling Module you’re currently using, check the version reported by Maestro on the Crawling Modules page of the Coveo Administration Console:

  • Versions > 1: new Crawling Module

  • Versions < 1: Crawling Module with Docker

If you want to index items whose metadata and data are stored in different locations, the best practice is to create a pre-push extension. This extension applies to every crawled item, merging its data and metadata before the item is pushed to the cloud.

For instance, you may be using a database (ODBC) connector, where the database contains item metadata, including a link to the file containing the item data.

To follow the link to an item, extract the item data, and then add that data to the item body, you would use a pre-push extension such as the following:

import base64
import datetime
import os.path
import subprocess
import zlib
# Pre-push logs are sent as the "prepushlog" metadata
log = []


def do_extension(body):
    global log
    log = ['BEGIN %s' % datetime.datetime.now().time()]
    # Sample path; replace it with the file link stored in the item metadata
    full_path = 'C:/Data/sample.pdf'
    if os.path.isfile(full_path):
        # Open and read the file in binary mode (`rb`)
        with open(full_path, 'rb') as f:
            file_data = f.read()
            if len(file_data) > 0:
                # Compress and encode the file using the `zlib` and `base64` modules
                body['CompressionType'] = 'ZLIB'
                body['CompressedBinaryData'] = base64.b64encode(zlib.compress(file_data)).decode()
                # Uncomment to also log the encoded payload (can be very large)
                #log.append('file_data: ' + base64.b64encode(zlib.compress(file_data)).decode())
                # Flag the item as processed by this extension
                body['prepushflag'] = 'true'
            else:
                log.append('file_data is empty for document: %s' % full_path)
    else:
        log.append('file not found: %s' % full_path)
    # Join all log entries into a single metadata value
    body['prepushlog'] = ';'.join(log)
    return body
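
The sample above reads from a fixed path. As a minimal sketch of how the path could instead come from the item metadata, the following variant assumes that the item body behaves like a dictionary and that the file link is available under a metadata key named `filepath`. Both the key name and the helper names (`attach_file_data`, `decode_file_data`, `run_demo`) are hypothetical and not part of the Crawling Module; adapt them to your own source configuration. The sketch also includes a small local check that decodes the payload again, which may help you confirm the compression and encoding steps before deploying an extension.

```python
import base64
import os
import tempfile
import zlib


def attach_file_data(body, path_field='filepath'):
    """Compress and encode the file referenced by body[path_field].

    `path_field` is a hypothetical metadata name (an assumption);
    use the field your source actually maps the file link to.
    """
    log = []
    full_path = body.get(path_field, '')
    if full_path and os.path.isfile(full_path):
        # Read the file in binary mode
        with open(full_path, 'rb') as f:
            file_data = f.read()
        if file_data:
            # Same compression and encoding steps as the sample extension
            body['CompressionType'] = 'ZLIB'
            body['CompressedBinaryData'] = base64.b64encode(
                zlib.compress(file_data)).decode()
            body['prepushflag'] = 'true'
        else:
            log.append('file_data is empty for document: %s' % full_path)
    else:
        log.append('file not found: %s' % full_path)
    body['prepushlog'] = ';'.join(log)
    return body


def decode_file_data(body):
    """Reverse the encoding; intended for local verification only."""
    return zlib.decompress(base64.b64decode(body['CompressedBinaryData']))


def run_demo():
    """Write a temporary file, attach it to a body, and decode it again."""
    with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
        tmp.write(b'sample item data')
        path = tmp.name
    try:
        body = attach_file_data({'filepath': path})
        return decode_file_data(body)
    finally:
        os.remove(path)
```

Calling `run_demo()` on your workstation should return the bytes that were written to the temporary file, which indicates that the encoded payload can be restored. Passing the metadata key as a parameter keeps the path lookup in one place if your field name differs.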