Document Object Python API Reference

Creating an indexing pipeline extension involves writing Python code that uses the document object to manipulate item properties (see Creating an Indexing Pipeline Extension with the API and Coveo Cloud V2 Indexing Pipeline).

This topic provides reference information describing the object methods and their parameters.

Log

You use this method in your extension script to tag source items with a relevant indexing message that is sent to the Coveo Cloud V2 source logs. Log messages are useful when you want to edit, debug, or troubleshoot an extension script.

Use a try/except block to log an error as a string in the source logs.

Use the Log method rather than writing text to a field as a form of logging, which can seriously bloat the index.

For instance, you should avoid using the Get a Metadata Value method to output the metadata content to a field.

Syntax:

log(message, severity)

Parameters

Parameter Type Description
message Required: string

The message that you want to log when applying an extension script.

severity string

Optionally used to indicate the message severity type.

Default value is Normal.

The allowed severity values (case-insensitive) are:

  • Debug
  • Detail
  • Error
  • Fatal
  • Important
  • Normal
  • Notification
  • Warning
fulltitle = document.get_meta_data_value('titleselection', 'crawler', True)
try:
    # modifying fulltitle variable
    fulltitle = fulltitle[0]
    # logging a meaningful success message
    log('added metadata value to title: ' + fulltitle)
# catching all exceptions and logging them as a string for debugging purposes
except Exception as e:
    log(str(e), 'Error')

This script example uses the Log method in two different ways.

  1. First, the try block modifies the metadata and logs a success message only when the script runs without raising an error. In this case, the second argument is omitted, so the default severity, Normal, applies.
  2. When the try block fails, the except block catches the exception and logs the error message with the Error severity.

Applying an extension populates the documentLogEntries.meta.logs field, which contains all log messages and their severity strings. This field is limited to approximately 4K characters, after which the content is truncated. When the combined length of multiple log messages exceeds the limit, you can still view the messages that fit within the limit, but the message that crosses the limit is replaced with a truncated... mention and the following messages are ignored. For instance, when a single very long string exceeds the 4K limit, even if it is the only log message of your extension, the whole string is replaced with the truncated... mention. You can see the log messages generated by an extension script in the documentLogEntries.meta.logs subsection of the JSON response as well as in the Log Browser.

{
  "documentLogEntries": [
    {
      "id": "http://www.example.com/",
      "organizationId": "myorganization",
      "sourceId": "qqotfbbttohttrnva4ebwykbe4-myorganization",
      "resourceId": "myorganization-tb5qadfyqqv2mrdtn2gde5kcpi",
      "task": "EXTENSION",
      "operation": "ADD",
      "result": "COMPLETED",
      "datetime": "2017-08-17T13:01:36.852Z",
      "requestId": "976520a8-f569-45d9-b252-48e6aea544d5",
      "meta": {
        "duration": "0.0559999",
        "logs": "truncated..."
      }
    }
  ]
}
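
Because the logs field is capped at roughly 4K characters, it can help to keep each message short. The following is a minimal sketch, not part of the API: the shorten helper and the MAX_MESSAGE_LENGTH budget are hypothetical names chosen for illustration, and the fallback log function only exists so that the sketch also runs outside the extension runner.

```python
# Assumed per-message budget; this number is illustrative, not a Coveo setting.
MAX_MESSAGE_LENGTH = 500


def shorten(message, limit=MAX_MESSAGE_LENGTH):
    """Return message cut to at most `limit` characters, marking the cut."""
    if len(message) <= limit:
        return message
    return message[:limit - 3] + '...'


try:
    log  # provided by the extension runner
except NameError:
    # Stand-in so the sketch also runs outside the extension runner.
    def log(message, severity='Normal'):
        print('[{}] {}'.format(severity, message))

# Log a long value without spending the whole log budget on it.
log(shorten('value: ' + 'x' * 10000), 'Debug')
```

Shortening each message yourself keeps the later messages of the same script visible instead of having them dropped once the limit is reached.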

Get URI

You use this method to get the item URI.

Syntax:

document.uri

You can output an item URI in the Log Browser by adding these lines to your extension:

myVariable = document.uri
log(myVariable)

This method returns a unicode value (see Python Unicode).
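
As an illustration of working with the returned value, the sketch below extracts the host from a URI using the Python standard library. The uri_host helper is a hypothetical name, and the sample URI is the one used elsewhere in this topic.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def uri_host(uri):
    """Return the lowercase host part of an item URI, or '' when absent."""
    return urlparse(uri).hostname or ''


host = uri_host(u'http://www.Example.com/')
# In an extension, you could call: log(uri_host(document.uri))
```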

Get All Metadata

You use this method to get all item metadata.

Syntax:

document.get_meta_data()

Since unmapped metadata is not indexed, using this method makes all metadata available before the final indexing step. The following extension script makes it possible to consult a list of all custom metadata:

import json
# setting up a key-value dictionary named 'values'
values = dict()
# populating dictionary with available metadata_name and metadata_value
for m in document.get_meta_data():
    for metadata_name, metadata_value in m.values.iteritems():
        values[metadata_name] = metadata_value
# Adding the allmetadatavalues metadata
document.add_meta_data({"allmetadatavalues": json.dumps(values)})

You must map allmetadatavalues metadata to a field in order not to lose the populated values while indexing the item.

This method returns a list of MetaDataValue objects (see Document Object JSON Schema). MetaDataValue objects are unicode values (see Python Unicode).
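
To see which pipeline stage set which metadata, you can group the metadata names by origin. This is a sketch under assumptions: the origin and values attribute names follow the JSON schema in this topic, the namedtuple is only a local stand-in for the runner's MetaDataValue objects, and metadata_names_by_origin is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for the runner's MetaDataValue objects; the `origin` and `values`
# attribute names are taken from the JSON schema and are an assumption here.
MetaDataValue = namedtuple('MetaDataValue', ['origin', 'values'])


def metadata_names_by_origin(meta_data):
    """Map each origin to the sorted metadata names it set."""
    names = {}
    for entry in meta_data:
        names.setdefault(entry.origin, set()).update(entry.values.keys())
    return {origin: sorted(found) for origin, found in names.items()}


# Sample shaped like the document object JSON in this topic.
sample = [
    MetaDataValue('crawler', {'originaluri': ['http://www.example.com/']}),
    MetaDataValue('converter', {'detectedtitle': ['Example Domain'],
                                'language': ['English']}),
    MetaDataValue('mapping', {'title': ['Example Domain']}),
]
summary = metadata_names_by_origin(sample)
# In an extension: metadata_names_by_origin(document.get_meta_data())
```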

Get a Metadata Value

You use this method to get a metadata value for a given metadata name and origin.

Syntax:

document.get_meta_data_value(name, origin, reverse)

Parameters

Parameter Type Description
name Required: string The name of the metadata
origin string

The unique identifier of the Coveo Cloud V2 indexing pipeline step from which to retrieve a metadata value:

Name Description
crawler

The metadata value set during the Crawling stage

preconversion script name

The metadata value set during a specific preconversion script

Sample value: 'mypreconversionscript'

converter The metadata value set during the Processing stage
mapping

The metadata value set during the Mapping stage

postconversion script name

The metadata value set during a specific postconversion script

Sample value: 'mypostconversionscript'

If no value is supplied for the origin parameter, the most recent origin is considered, i.e., crawler in preconversion and mapping in postconversion.

reverse boolean

Whether to scan the metadata origin in reverse order or not. The default value is True, meaning that the value is fetched from the latest indexing pipeline stage with a non-empty value.

This method returns a list of unicode string values (see Python Unicode).

Unless the metadata contains multiple values, this list will contain a single element.

# Get original title from the crawling module in a log message
originalTitle = document.get_meta_data_value('title', 'crawler')  # Remember, this method returns a list
log(originalTitle[0], 'Normal')
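
Since the method returns a list, indexing it directly fails when the list is empty. The sketch below shows one defensive pattern. Whether the method returns an empty list or None for a missing metadata is not stated in this topic, so the hypothetical first_value helper tolerates both. FakeDocument is a local stand-in, not the real document object.

```python
def first_value(values, default=u''):
    """Return the first element of a metadata value list, or default."""
    return values[0] if values else default


class FakeDocument(object):
    """Stand-in for the runner's document object, for local runs only."""

    def get_meta_data_value(self, name, origin=None, reverse=True):
        return {'title': [u'Example Domain']}.get(name, [])


document = FakeDocument()  # the extension runner provides the real object
title = first_value(document.get_meta_data_value('title', 'crawler'))
missing = first_value(document.get_meta_data_value('author'), u'unknown')
```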

Add Metadata

You use this method to add an item metadata key and its associated value. You can also use this method to unset or override an item metadata.

Syntax:

document.add_meta_data({metadataKey: metadataValue})
document.add_meta_data({'allmetadatavalues': ['{}={}'.format(k, ';'.join(map(str, v))) for m in document.get_meta_data() for k, v in m.values.iteritems()]})

You must map allmetadatavalues metadata to a field in order not to lose the populated values while indexing the item.

# Unsetting the author metadata value
document.add_meta_data({'Author': ''})

For all sources except push sources, if you add a metadata before the mapping stage, you must map the metadata to a field for it to be indexed. For instance, if you add a metadata in a postconversion extension script, the metadata is indexed only when the index contains a field whose name matches the metadata key.

Get All Permissions

You use this method to get all item permissions.

Syntax:

document.get_permissions()
# Get item permissions in a log message
import json
myPermissions = json.dumps(document.get_permissions())
log(str(myPermissions))

This method returns a list of PermissionLevel objects.

{  
  "PermissionSets": [  
    {  
      "AllowAnonymous": false,
      "DeniedPermissions": [],
      "Name": "",
      "AllowedPermissions": []
    }
  ],
  "Name": ""
},
{
  "PermissionSets": [  
    {  
      "AllowAnonymous": false,
      "DeniedPermissions": [],
      "Name": "View All Data Members",
      "AllowedPermissions": [  
        {  
          "SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
          "IdentityType": "virtualgroup",
          "Identity": "ViewAll:Irrelevant:",
          "AdditionalInfo": {}
        },
        {  
          "SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
          "IdentityType": "virtualgroup",
          "Identity": "ObjectAccess:ViewAllRecordsProfiles:Solution",
          "AdditionalInfo": {}
        },
        {  
          "SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
          "IdentityType": "virtualgroup",
          "Identity": "ObjectAccess:ViewAllRecordsPermissionSets:Solution",
          "AdditionalInfo": {}
        }
      ]
    }
  ],
  "Name": "View All Data"
},
{
  "PermissionSets": [  
    {  
      "AllowAnonymous": false,
      "DeniedPermissions": [],
      "Name":"Read access members",
      "AllowedPermissions": [  
        {  
          "SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
          "IdentityType": "virtualgroup",
          "Identity": "ObjectAccess:ReadRecordsProfiles:Solution",
          "AdditionalInfo": {}
        },
        {  
          "SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
          "IdentityType": "virtualgroup",
          "Identity": "ObjectAccess:ReadRecordsPermissionSets:Solution",
          "AdditionalInfo": {}
        }
      ]
    }
  ],
  "Name": "Read Access & Sharing"
}
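
Once the permissions are serialized as shown above, you can walk the structure with plain Python. The sketch below assumes that the serialized output is a JSON array of objects using the key names shown in the sample (PermissionSets, AllowedPermissions, Identity). The allowed_identities helper is a hypothetical name.

```python
import json


def allowed_identities(permission_levels):
    """Yield (level name, set name, identity) for every allowed permission."""
    for level in permission_levels:
        for permission_set in level['PermissionSets']:
            for permission in permission_set['AllowedPermissions']:
                yield (level['Name'], permission_set['Name'],
                       permission['Identity'])


# Trimmed copy of the sample output above.
sample = json.loads('''[
  {"Name": "", "PermissionSets": [
    {"Name": "", "AllowAnonymous": false,
     "AllowedPermissions": [], "DeniedPermissions": []}]},
  {"Name": "Read Access & Sharing", "PermissionSets": [
    {"Name": "Read access members", "AllowAnonymous": false,
     "DeniedPermissions": [],
     "AllowedPermissions": [
       {"SecurityProvider": "SALESFORCE-00Df40000000SAbEAM",
        "IdentityType": "virtualgroup",
        "Identity": "ObjectAccess:ReadRecordsProfiles:Solution",
        "AdditionalInfo": {}}]}]}
]''')
rows = list(allowed_identities(sample))
```

Logging one short line per row is usually easier to read in the Log Browser than a single large JSON string.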

Clear All Permissions

You use this method to clear all item permissions.

Syntax:

document.clear_permissions()

Use the clear_permissions method very carefully because any user could gain access to potentially sensitive information from originally secured items.

Add Allowed Permission

You use this method to add an allowed security identity.

Syntax:

document.add_allowed(identity, identity_type, security_provider, {additional_info})

Parameters

Parameter Type Description
identity Required: string The allowed security identity name to add.
identity_type Required: string

Allowed values are:

  • user
    An individual user.
  • group
    A group, which can have users or other groups/virtual groups as members.
  • virtualgroup
    A virtual group, which is a group that does not exist in the indexed secured enterprise system.
  • unknown
    An entity that does not fit any of the aforementioned types.
security_provider Required: string

The security identity provider name.

Sample value: 'Email Security Provider'

additional_info dictionary of string A collection of key value pairs that can be used to uniquely identify the security identity.
# Allowing access to all users logging in with Coveo account
document.add_allowed('*@coveo.com', 'user', 'Email Security Provider', {})

Add Denied Permission

You use this method to add a denied security identity.

Syntax:

document.add_denied(identity, identity_type, security_provider, {additional_info})

Parameters

Parameter Type Description
identity Required: string The denied security identity name to add.
identity_type Required: string

Allowed values are:

  • user
    An individual user.
  • group
    A group, which can have users or other groups/virtual groups as members.
  • virtualgroup
    A virtual group, which is a group that does not exist in the indexed secured enterprise system.
  • unknown
    An entity that does not fit any of the aforementioned types.
security_provider Required: string

The name of the security identity provider.

Sample value: 'Email Security Provider'

additional_info dictionary of string A collection of key value pairs that can be used to uniquely identify the security identity.
# Denying access to all users logging in with hotmail account
document.add_denied('*@hotmail.com', 'user', 'Email Security Provider', {})

Set Permissions

You use this method to set the permissions of an item. To set permissions, you must define at least one permission level, at least one permission set, and one or more permissions.

Syntax:

document.set_permissions([PermissionLevel])

PermissionLevel

Parameter Type Description
level_name String The name of the permission level.
permission_sets Array of PermissionSet Array of permission sets

PermissionSet

Parameter Type Description
set_name Required: String The name of the permission set.
allow_anonymous Required: boolean Whether to allow anonymous access.
allowed_permissions Array of Permission Array of allowed permissions
denied_permissions Array of Permission Array of denied permissions

Permission

Parameter Type Description
identity Required: String

The name of the security identity.

Sample value: '*@coveo.com' to allow access to all users logging in with a Coveo email address

identity_type Required: String

Allowed values are:

  • user
    An individual user.
  • group
    A group, which can have users or other groups/virtual groups as members.
  • virtualgroup
    A virtual group, which is a group that does not exist in the indexed secured enterprise system.
  • unknown
    An entity that does not fit any of the aforementioned types.
security_provider Required: String

The name of the security identity provider.

Sample value: 'Email Security Provider'

additional_info dictionary of string A collection of key value pairs that can be used to uniquely identify the security identity.

The permission model complexity can range from allowing full anonymous access to requiring the resolution of permissions for several permission levels, each containing one or more permission sets.

import json
# defining security levels
# TopLevel allows ceo@coveo.com and denies Accountants
TopLevel = document.PermissionLevel('CEO', [document.PermissionSet('TopSet', False, 
    [document.Permission('ceo@coveo.com', 'user', 'Email Security Provider')],
    [document.Permission('Accountants', 'group', 'Email Security Provider')])])
# LowerLevel allows myGroup1 and denies myGroup2 and myGroup3
LowerLevel = document.PermissionLevel('Employees', [document.PermissionSet('LowerSet', False, 
    [document.Permission('myGroup1', 'group', 'Email Security Provider')],
    [document.Permission('myGroup2', 'group', 'Email Security Provider'),
    document.Permission('myGroup3', 'group', 'Email Security Provider')])])
# Set item permission levels
document.set_permissions([TopLevel, LowerLevel])
# Get item permissions in a log message
myPermissions = json.dumps(document.get_permissions())
log(str(myPermissions))

Get All Data Streams

You use this method to access the data streams of an item when you need to read or modify these streams.

Syntax:

document.get_data_streams()

The method returns a list of ReadOnlyDataStream objects. A ReadOnlyDataStream object is a BytesIO value, a stream of in-memory bytes (see Python Buffered Streams).

[
    <extension_runner.ReadOnlyDataStream object at 0x7f88fc0665d0>, 
    <extension_runner.ReadOnlyDataStream object at 0x7f88fc0662d0>, 
    <extension_runner.ReadOnlyDataStream object at 0x7f88fc066590>
]
  1. In the Edit an extension window, you must select at least one of the checkboxes associated with the item data in order for the get_data_streams() method to return something.

    You must specify that your extension requires access to the item binary data in order for the data to be downloaded and passed along to the extension runner.

    To optimize indexing performance, you should only access a data stream when absolutely necessary.
  2. Use the get_data_streams() method in your Python extension script.

     # logging the first data stream of the list in the Log Browser
     myDataStreams = document.get_data_streams()
     log(str(myDataStreams[0].read()))
    

    The preceding code has visible effects in the Log Browser.
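
Logging a whole stream can quickly exceed the log size limit, so you may prefer to log only a short preview of each stream. The following is a sketch under assumptions: the streams are treated as UTF-8 text, read is assumed to accept a size argument as io.BytesIO does, and preview_streams is a hypothetical helper name.

```python
import io


def preview_streams(streams, max_bytes=200):
    """Return a short decoded preview of each stream, in list order."""
    previews = []
    for stream in streams:
        raw = stream.read(max_bytes)  # read at most max_bytes bytes
        # Assumption: UTF-8 text; undecodable bytes are replaced, not raised.
        previews.append(raw.decode('utf-8', 'replace'))
    return previews


# io.BytesIO stands in for ReadOnlyDataStream objects in a local run.
sample_streams = [io.BytesIO(b'<html>Example Domain</html>'), io.BytesIO(b'')]
previews = preview_streams(sample_streams, 12)
# In an extension:
#     for preview in preview_streams(document.get_data_streams()):
#         log(preview)
```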

Get a Data Stream

You use this method to get a data stream for a given name and origin.

Syntax:

document.get_data_stream(name, origin, reverse)
# Log the item body text data stream
# You must select the Body Text checkbox because this indexing pipeline extension script needs to access it
myDataStream = document.get_data_stream('body_text').read()
log(str(myDataStream))

The method returns a single ReadOnlyDataStream object that is a BytesIO value, a stream of in-memory bytes (see Python Buffered Streams).

For Web and Sitemap type sources, it is recommended to use the web scraping feature rather than extensions to do common HTML content processing such as excluding sections and extracting metadata (see Web Scraping Configuration).

Parameters

Parameter Type Description
name Required: string

The available item data streams are:

  • documentdata
    The complete item binary content extracted by the Crawling stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).

    Example:

    The documentdata of a PDF file is the actual PDF file.

    The documentdata of a web page is the page HTML markup.

    You may want to retrieve an item documentdata in a Preconversion extension to modify the original item content.

    Example:
    You want to extract the text content from scanned items that are saved as image files. You use a preconversion extension to send each image documentdata to a third party optical character recognition (OCR) service. You save the returned text back in the documentdata so that the Processing stage can prepare the text content for the Indexing stage.

    Getting the documentdata can significantly degrade indexing performance because each item binary data has to be fetched, decompressed, and decrypted.
    There is generally no point in getting and modifying the documentdata in a postconversion extension because the Indexing stage does not process it.

    In the Coveo Cloud administration console Add/Edit an Extension panel, the documentdata is referred to as the Original file.
  • body_text
    The complete textual content of an item extracted by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).
    You can get the body_text of each item in a postconversion extension for rare cases where you want to access and possibly modify the item text content.
    There is no point in getting and modifying the body_text in a preconversion extension because the Processing stage would overwrite it.

    Note:

    For index size and performance optimization, the body_text is limited in size to 50 MB. This means that for rare items with larger body_text, the exceeding text will not be indexed, and therefore not searchable.

  • body_html
    The complete HTML representation of an item created by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline). The body_html appears in the Quick View of a search result item. You can get the body_html of each item in a postconversion extension for cases where you want to access and possibly modify the item text content.

    Example:

    Your source indexes a question and answer website. Each question and each answer is indexed as a separate item even if they can come from the same HTML page. Your indexed items do not have the <head> elements from the original HTML page and therefore are missing resources such as CSS. Consequently, the Quick View for these items does not look good.

    You get the body_html in an extension and inject the appropriate <head> elements.

    There is no point in getting and modifying the body_html in a preconversion extension because the Processing stage would overwrite it.

    Notes:

    When you can define your desired body_html content as a static HTML markup containing metadata placeholders, it is generally simpler to use a mapping on the body field (see Add/Edit Body Mapping).
    For index size and performance optimization, the body_html is limited in size to 10 MB. This means that the Quick View of items with larger body_html will be truncated.
  • $thumbnail$
    The thumbnail image created by the converter in the Processing stage of the indexing pipeline for specific file types (Microsoft Word, Excel, PowerPoint, and Visio, as well as many image file types such as JPG, BMP, GIF, TIF, PSD, PNG...).
    You can get the $thumbnail$ in a postconversion extension in the rare cases where you want to modify the thumbnail or extract information from the thumbnail image. Your thumbnail image can have any size, resolution or format (as long as a browser can display it), but you should stick to a normalized image size and resolution for most cases.

    If you want to create or overwrite a thumbnail, you do not need to previously get the $thumbnail$ datastream.

origin string

The unique identifier of the Coveo Cloud V2 indexing pipeline step from which to retrieve a stream:

Name Description
crawler The stream value set during the Crawling stage
preconversion script name The stream value set during a specific preconversion script
converter The stream value set during the Processing stage
mapping The stream value set during the Mapping stage
postconversion script name The stream value set during a specific postconversion script

If no value is supplied for the origin parameter, the most recent origin is considered, i.e., crawler in preconversion and mapping in postconversion. If two different postconversion scripts modify a stream and you do not specify the origin in the second script, the second script uses the output of the first postconversion script, since the most recent origin is considered.

reverse boolean Whether to scan the stream origin in reverse order or not. The default value is True, meaning that the value is fetched from the latest indexing pipeline stage with a non-empty value.
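
The way the origin and reverse parameters interact can be pictured with a small toy model. This is only a mental model written in plain Python, not the extension runner's implementation. In particular, the behavior modeled for reverse set to False (scanning from the earliest stage) is an assumption, and the resolve function and the sample entries are hypothetical.

```python
def resolve(entries, origin=None, reverse=True):
    """Pick a value from (origin, value) pairs listed in pipeline order.

    Toy model only: with an origin, return that origin's value; otherwise
    return the first non-empty value, scanning from the latest stage when
    reverse is True and from the earliest when it is False.
    """
    if origin is not None:
        for entry_origin, value in entries:
            if entry_origin == origin:
                return value
        return None
    ordered = reversed(entries) if reverse else entries
    for _, value in ordered:
        if value:
            return value
    return None


# Entries listed from the earliest stage to the latest one.
entries = [
    ('crawler', 'original text'),
    ('converter', 'converted text'),
    ('mypostconversionscript', ''),  # hypothetical script that set nothing
]
latest = resolve(entries)                    # latest non-empty value
earliest = resolve(entries, reverse=False)   # earliest non-empty value
from_crawler = resolve(entries, 'crawler')   # explicit origin
```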

DataStream Attribute Setter

You use this method to access and set a DataStream object for a given name and origin.

Syntax:

document.DataStream(name, origin, reverse)

The parameters are the same as those listed above for the get_data_stream method (see get_data_stream method Parameters).

# Override the item body text
text = document.DataStream('body_text')
text.write('This is a test')
document.add_data_stream(text)

The method returns a single modifiable DataStream object that is a BytesIO value, a stream of in-memory bytes (see Python Buffered Streams). When applicable, the extension runner is responsible for writing back the item binary data after the script execution.

Add Data Stream

You use this method to add or override an item data stream.

Syntax:

document.add_data_stream(stream)
# Import the requests library to perform API calls
import requests
extracted_text = [x.strip('\r\n\t') for x in document.get_data_stream('body_text', 'converter').readlines() if x.strip('\r\n\t')]
 
# Override item html with perdu.com
html = document.DataStream('Body_HTML')
html.write(requests.get('http://perdu.com').text)
# Override the text with part of the original item
text = document.DataStream('body_text')
text.write('This is a test.')
text.write(extracted_text[0])
 
# Override the thumbnail of the item with Coveo logo
thumbnail = document.DataStream('$thumbnail$')
thumbnail.write(requests.get('https://careers.coveo.com/assets/images/opengraph.png').content)
document.add_data_stream(html)
document.add_data_stream(text)
document.add_data_stream(thumbnail)
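
In the preceding example, extracted_text[0] would raise an IndexError when the body text contains no non-blank line. The sketch below shows one way to guard against that case. It assumes the stream holds UTF-8 text, strips all surrounding whitespace rather than only '\r\n\t', and uses io.BytesIO as a local stand-in for the stream; non_empty_lines and first_line are hypothetical helper names.

```python
import io


def non_empty_lines(stream, encoding='utf-8'):
    """Return the stripped, non-blank lines of a binary stream as text."""
    text = stream.read().decode(encoding, 'replace')
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_line(stream, default=u''):
    """Return the first non-blank line, or default for an empty stream."""
    lines = non_empty_lines(stream)
    return lines[0] if lines else default


# io.BytesIO stands in for document.get_data_stream('body_text', 'converter').
sample = io.BytesIO(b'\r\n\tExample Domain\r\nMore text\r\n')
lines = non_empty_lines(sample)
empty = first_line(io.BytesIO(b''), u'(no text)')
```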

Reject Item

You use this method to set the item state as rejected.

Syntax:

document.reject()
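
A typical use is to reject items that fail a simple quality check. The following sketch rejects items without a usable title. The rule itself, the reject_if_untitled helper, and the FakeDocument class are illustrative assumptions; only get_meta_data_value and reject come from this topic.

```python
class FakeDocument(object):
    """Stand-in for the runner's document object, for local runs only."""

    def __init__(self, meta):
        self.meta = meta
        self.rejected = False

    def get_meta_data_value(self, name, origin=None, reverse=True):
        return self.meta.get(name, [])

    def reject(self):
        self.rejected = True


def reject_if_untitled(document):
    """Reject the item when it has no non-blank title value."""
    titles = document.get_meta_data_value('title')
    if not any(value.strip() for value in titles):
        document.reject()
        return True
    return False


kept = FakeDocument({'title': [u'Example Domain']})
dropped = FakeDocument({'title': [u'  ']})
reject_if_untitled(kept)
reject_if_untitled(dropped)
# In an extension, you would call: reject_if_untitled(document)
```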

Document Object JSON Schema

The Document object can be represented with the following JSON schema.

{  
  "$schema": "http://json-schema.org/draft-04/schema#",
  "definitions": {  
    "MetaDataValue": {  
      "type": "object",
      "properties": {  
        "origin": {  
          "type": "string"
        },
        "values": {  
          "type": "object",
          "properties": {  
            "key": {  
              "type": "string"
            },
            "value": {  
              "type": "array",
              "items": {  
                "type": "string"
              }
            }
          }
        }
      }
    },
    "Permission": {  
      "type": "object",
      "properties": {  
        "identity": {  
          "type": "string"
        },
        "identity_type": {  
          "type": "string",
          "enum": ["user", "group", "virtualgroup", "unknown"]
        },
        "security_provider": {  
          "type": "string"
        },
        "additional_info": {  
          "type": "object",
          "properties": {  
            "key": {  
              "type": "string"
            },
            "value": {  
              "type": "string"
            }
          }
        }
      }
    },
    "PermissionSet": {  
      "type": "object",
      "properties": {  
        "name": {  
          "type": "string"
        },
        "allow_anonymous": {  
          "type": "boolean"
        },
        "allowed_permissions": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/Permission"
          }
        },
        "denied_permissions": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/Permission"
          }
        }
      }
    },
    "PermissionLevel": {  
      "type": "object",
      "properties": {  
        "name": {  
          "type": "string"
        },
        "permission_sets": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/PermissionSet"
          }
        }
      }
    },
    "DataStream": {  
      "type": "object",
      "properties": {  
        "name": {  
          "type": "string"
        },
        "origin": {  
          "type": "string"
        }
      }
    },
    "Document": {  
      "type": "object",
      "properties": {  
        "uri": {  
          "type": "string"
        },
        "meta_data": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/MetaDataValue"
          }
        },
        "permissions": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/PermissionLevel"
          }
        },
        "data_streams": {  
          "type": "array",
          "items": {  
            "$ref": "#/definitions/DataStream"
          }
        }
      }
    }
  }
}

To consult the document object of a single item just before indexing time, use the following script as the last executed postconversion script. It populates a documentobject metadata.

import json
# Get document object JSON into a metadata
documentObject = json.dumps(document)
document.add_meta_data({'documentobject': documentObject}) 

You must map documentobject metadata to a field in order to index the content and consult the document object values.

The preceding extension script returns the document object JSON:

{
  "DataStream": [  
    {  
      "Origin": "converter",
      "Name": "body_html"
    },
    {  
      "Origin": "mypostconversionextension",
      "Name": "body_html"
    },
    {  
      "Origin": "converter",
      "Name": "body_text"
    },
    {  
      "Origin": "mypostconversionextension",
      "Name": "body_text"
    },
    {  
      "Origin": "mypostconversionextension",
      "Name": "$thumbnail$"
    },
    {  
      "Origin": "crawler",
      "Name": "documentdata"
    }
  ],
  "Permissions": [  
    {  
      "PermissionSets": [  
        {  
          "AllowAnonymous": false,
          "DeniedPermissions": [],
          "Name": "",
          "AllowedPermissions": [  
            {  
              "SecurityProvider": "Email Security Provider",
              "IdentityType": "user",
              "Identity": "*@coveo.com",
              "AdditionalInfo": {}
            }
          ]
        }
      ],
      "Name": ""
    }
  ],
  "URI": "http://www.example.com/",
  "MetaData": [  
    {  
      "Origin": "crawler",
      "Values": {  
        "originaluri": [  
          "http://www.example.com/"
        ],
 
             [ ... ]
        "permanentid": [  
          "f1777111f5d0f1c81ffa04de75112889e6a0649e06d83370cdf2cbfb05f3"
        ],
        "content-type": [  
          "text/html; charset=utf-8"
        ]
      }
    },
    {  
      "Origin": "mypreconversionextension",
      "Values": {  
        "title": [  
          "Brand New Title"
        ]
      }
    },
    {  
      "Origin": "converter",
      "Values": {  
        "conversionstate": [  
          0
        ],
        "detectedtitle": [  
          "Example Domain"
        ],
        "language": [  
          "English"
        ],
 
            [ ... ]
        "originalhtmlcharset": [  
          65001
        ],
        "extractedsize": [  
          420
        ]
      }
    },
    {  
      "Origin": "mapping",
      "Values": {  
        "sourcetype": [  
          "Web"
        ],
        "language": [  
          "English"
        ],
        "title": [  
          "Example Domain"
        ],
 
             [ ... ]
        "date": [  
          1376092475
        ],
        "permanentid": [  
          "f1777111f5d0f1c81ffa04de75112889e6a0649e06d83370cdf2cbfb05f3"
        ],
        "size": [  
          1270
        ]
      }
    },
    {  
      "Origin": "mypostconversionextension",
      "Values": {  
        "author": [  
          "Coveo Documentation Team"
        ]
      }
    }
  ]
}
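
Once the documentobject field is indexed, you can inspect its content offline with a few lines of Python. The sketch below lists, for each data stream name, the origins that wrote it. It assumes the key names shown in the output above (DataStream, Name, Origin); origins_by_stream is a hypothetical helper name.

```python
import json


def origins_by_stream(document_object):
    """Map each data stream name to the origins that wrote it, in order."""
    origins = {}
    for stream in document_object.get('DataStream', []):
        origins.setdefault(stream['Name'], []).append(stream['Origin'])
    return origins


# Trimmed copy of the "DataStream" array from the output above.
sample = json.loads('''{
  "DataStream": [
    {"Origin": "converter", "Name": "body_html"},
    {"Origin": "mypostconversionextension", "Name": "body_html"},
    {"Origin": "converter", "Name": "body_text"},
    {"Origin": "crawler", "Name": "documentdata"}
  ]
}''')
result = origins_by_stream(sample)
```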