Modifying Item Bodies

When you need to alter item bodies through an indexing pipeline extension (IPE), you should typically use a pre-conversion script to modify the documentdata stream.

You may be tempted to use a post-conversion script to modify the body_html and body_text streams instead, but doing so can lead to inconsistencies in search results. Item summaries and excerpts are extracted during the processing stage of the indexing pipeline and can no longer be altered afterward, so they would still reflect the original body. Modifying the documentdata stream in a pre-conversion script is therefore the preferred approach.

Important

If you’re creating your pre-conversion IPE through the Coveo Administration Console, ensure that you select the Original file additional item data. If you’re using the Extension API directly, include the DOCUMENT_DATA string in the requiredDataStreams array property of your request payload.
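
As a rough illustration of the Extension API case, the payload could be assembled as shown below. This is only a sketch: the requiredDataStreams property and the DOCUMENT_DATA value come from the text above, while the other property names and all placeholder values are assumptions to check against the Extension API reference.

```python
import json

# Sketch of a request payload for creating a pre-conversion extension.
# Only "requiredDataStreams" and "DOCUMENT_DATA" are taken from the text above;
# the other keys and every value are assumed placeholders.
extension_payload = {
    "name": "Modify item bodies",                      # assumed property, placeholder value
    "description": "Alters the documentdata stream",   # assumed property, placeholder value
    "content": "# extension script goes here",         # assumed property, placeholder value
    "requiredDataStreams": ["DOCUMENT_DATA"],          # grants access to the original file
}

# Serialize the payload as the JSON body of the request.
request_body = json.dumps(extension_payload, indent=2)
print(request_body)
```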

Note

YouTube items are an exception to the above recommendations. Their bodies are mapped from the coveo_description and coveo_videoid metadata fields, which you can modify through a pre-conversion IPE.

Basic Recipe

The following script shows a basic recipe for modifying item bodies through a pre-conversion IPE.

# 1. Get a read-only stream
original_data = document.get_data_stream('documentdata')
# 2. Read/parse the read-only stream data to a workable format
modified_data = original_data.read().decode()
# 3. Make all necessary data alterations
modified_data = modified_data.replace('foo', 'bar')
# 4. Get a modifiable stream
modified_stream = document.DataStream('documentdata')
# 5. Overwrite the modifiable stream data with the previously altered data
modified_stream.write(modified_data)
# 6. Add the modified stream to the item
document.add_data_stream(modified_stream)

Note

When you modify the content type of documentdata, and not just its content, you must also specify the new content type if it’s one of the following:

  • TYPE_HTML: HTML document

  • TYPE_DOCX: Microsoft Word 2007 Document (Zipped XML)

  • TYPE_PPTX: Microsoft PowerPoint 2007 Document (Zipped XML)

  • TYPE_XLSX: Microsoft Excel 2007 Document (Zipped XML)

  • TYPE_PDF: PDF (Portable Document Format)

  • TYPE_RTF: Rich Text Format

  • TYPE_TXT: Text (ASCII)

To do so, add the following step to your script:

# 7. Specify new content type
document.add_meta_data({'detectedfileenum': ['<NEW_CONTENT_TYPE>']})

where <NEW_CONTENT_TYPE> is the new content type, such as TYPE_HTML.
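
To make the content type change concrete, here is a minimal sketch that turns a plain text body into HTML. The text_to_html helper is hypothetical (it is not part of the IPE API), and the commented lines only suggest how it might slot into the basic recipe, since the document object exists only inside the indexing pipeline.

```python
import html


def text_to_html(text: str, title: str = "") -> str:
    """Wrap plain text in a minimal HTML document, one <p> per non-empty line."""
    paragraphs = [
        f"<p>{html.escape(line.strip())}</p>"
        for line in text.splitlines()
        if line.strip()
    ]
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>"
        "<body>" + "".join(paragraphs) + "</body></html>"
    )


# Inside the extension, the helper could be combined with steps 1 to 7
# (left commented out because `document` is only defined in the pipeline):
#
# original_text = document.get_data_stream('documentdata').read().decode()
# modified_stream = document.DataStream('documentdata')
# modified_stream.write(text_to_html(original_text))
# document.add_data_stream(modified_stream)
# document.add_meta_data({'detectedfileenum': ['TYPE_HTML']})
```

Escaping the text with html.escape keeps characters such as < and & from being interpreted as markup once the body is treated as HTML.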

Example

The following script gives a more concrete example, in which HTML item bodies are modified through a pre-conversion IPE.

from bs4 import BeautifulSoup

read_only_stream = document.get_data_stream('documentdata')
modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')

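# Remove a node, if present (find() returns None when nothing matches)
node_to_remove = modified_data.find(id='my-node-to-remove')
if node_to_remove is not None:
    node_to_remove.decompose()

# Add a new node, if the parent node exists
parent_node = modified_data.find(id='my-parent-node')
if parent_node is not None:
    new_node = BeautifulSoup('<p>Hello world!</p>', 'html.parser')
    parent_node.append(new_node)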

modified_stream = document.DataStream('documentdata')
modified_stream.write(str(modified_data))
document.add_data_stream(modified_stream)

Modifying YouTube Item Bodies

To modify YouTube item bodies, modify the coveo_description and coveo_videoid metadata fields through a pre-conversion IPE, as you would other fields (see Add Metadata).

The following sample pre-conversion script removes occurrences of specific strings from YouTube item bodies.

# Each metadata value is handled as a list, so every entry is cleaned individually
old_description = document.get_meta_data_value("coveo_description")
new_description = [old.replace("Sentence to remove.", "") for old in old_description]
document.add_meta_data({ "coveo_description": new_description })

# Apply the same treatment to the video ID field
old_id = document.get_meta_data_value("coveo_videoid")
new_id = [old.replace("String to remove", "") for old in old_id]
document.add_meta_data({ "coveo_videoid": new_id })
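
When several strings must be removed, the list comprehension above can be factored into a small helper. The sketch below assumes the metadata value is a list of strings, as the script above does; remove_strings is a hypothetical name, not an IPE function.

```python
def remove_strings(values, unwanted):
    """Return a copy of `values` with every string in `unwanted` removed from each entry."""
    cleaned = []
    for value in values:
        for target in unwanted:
            value = value.replace(target, "")
        cleaned.append(value)
    return cleaned


# Possible use inside the extension (commented out because `document`
# is only defined in the indexing pipeline):
#
# old_description = document.get_meta_data_value("coveo_description")
# new_description = remove_strings(old_description, ["Sentence to remove. ", "Another one. "])
# document.add_meta_data({"coveo_description": new_description})
```

Keeping the helper free of any reference to document makes it possible to try it on sample strings before deploying the extension.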