Pre-push example for merging item data and metadata

If you want to index items whose metadata and data are stored in different locations, the best practice is to create a pre-push extension. The extension is then applied to every crawled item, merging its data and metadata before the item is pushed to the cloud.

For example, you may be using a Database source, where the database contains item metadata, including a link to the file containing the item data.

To follow the link to an item file, extract the item data from that file, and then add the data to the item body, you would use a pre-push extension such as the following:

import base64
import datetime
import os.path
import zlib

# Prepush logs will be sent as "prepushlog" metadata
log = []

def do_extension(body):
    global log
    log = ['BEGIN %s' %(datetime.datetime.now().time())]

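    # The file path is hardcoded here for simplicity; in a real setup it would
    # typically come from the item's metadata (see the variant after this example).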
    full_path = 'C:/Data/sample.pdf'

    if os.path.isfile(full_path):
        # Open and read the file as a binary (`rb`)
        with open(full_path,'rb') as f:
            file_data = f.read()
            if len(file_data) > 0:
                # Compress and encode the file using `zlib` and `base64` modules
                body['CompressionType'] = 'ZLIB'
                body['CompressedBinaryData'] = base64.b64encode(zlib.compress(file_data)).decode()
                #log.append('file_data: ' + base64.b64encode(zlib.compress(file_data)).decode())
                body['prepushflag'] = 'true'
            else:
                log.append('file_data is empty for document: %s' % full_path)
    else:
        log.append('file not found: %s' % full_path)

    body['prepushlog'] = ';'.join(log)
    return body
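
In practice, with a Database source the file path usually isn't hardcoded: it comes from the metadata the crawler extracted for each item. Below is a minimal sketch of the same extension that reads the link from a metadata field, assuming that field is named `file_path` (a hypothetical name chosen for illustration) and that `body` supports dictionary-style access, as the example above suggests.

import base64
import datetime
import os.path
import zlib

# Prepush logs will be sent as "prepushlog" metadata
log = []

def do_extension(body):
    global log
    log = ['BEGIN %s' % datetime.datetime.now().time()]

    # Assumption: the Database source stored the file link in a 'file_path'
    # metadata field (hypothetical field name).
    full_path = body.get('file_path', '')

    if full_path and os.path.isfile(full_path):
        # Open and read the file as a binary (`rb`)
        with open(full_path, 'rb') as f:
            file_data = f.read()
        if file_data:
            # Compress and encode the file data, then attach it to the item body
            body['CompressionType'] = 'ZLIB'
            body['CompressedBinaryData'] = base64.b64encode(zlib.compress(file_data)).decode()
            body['prepushflag'] = 'true'
        else:
            log.append('file_data is empty for document: %s' % full_path)
    else:
        log.append('file not found: %s' % full_path)

    body['prepushlog'] = ';'.join(log)
    return body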