Remove HTML Sections From Indexed Sitecore Items

Using the CleanHtml Coveo for Sitecore processor is a simple way to remove web page content you don’t want to index. However, it may require adding exclusion tags to many of the layouts your Sitecore items are linked to. An alternative is a Coveo Platform indexing pipeline extension (IPE) that leverages the full-featured BeautifulSoup library.

This article explains how to create an indexing pipeline extension and provides a Python code sample that removes the following from your web pages:

  • elements whose class attribute includes coveo-no-index, and

  • <footer> elements.

Creating the Extension

  1. In the Extensions section of the Coveo Administration Console, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. Select the Original file option. This example requires the extension to access the original file data of your items.

  4. In the Extension script field, paste the following code:

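    from bs4 import BeautifulSoup

    # Get the original file content of the item (requires the Original file option)
    read_only_stream = document.get_data_stream('documentdata')
    if read_only_stream is not None:
        # Decode the raw bytes and parse them as HTML
        modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')

        # Remove matching nodes, along with everything they contain
        for element in modified_data.select('.coveo-no-index, footer'):
            element.decompose()

        # Write the cleaned HTML to a new stream that replaces the original data
        modified_stream = document.DataStream('documentdata')
        modified_stream.write(str(modified_data))
        document.add_data_stream(modified_stream)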
  5. Click Add extension.

For the list of CSS selectors BeautifulSoup supports, see Quick Start - Soup Sieve.
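
Before pasting a modified selector list into the extension, it can help to try it on a sample page locally. The following is a minimal sketch, not part of the extension itself: the remove_sections function name and the sample HTML are hypothetical, and it assumes the bs4 package is installed on your machine. It applies the same parse, select, and decompose steps as the extension script, without the document object that is only available in the indexing pipeline.

```python
from bs4 import BeautifulSoup


def remove_sections(raw: bytes, selectors: str = '.coveo-no-index, footer') -> str:
    """Return the HTML with every node matching the CSS selectors removed.

    Hypothetical helper for local testing; it mirrors the extension logic.
    """
    soup = BeautifulSoup(raw.decode('utf-8'), 'html.parser')
    for element in soup.select(selectors):
        # decompose() removes the node and everything it contains
        element.decompose()
    return str(soup)


# Hypothetical sample page: one excluded block, one kept paragraph, one footer
sample = b"""<html><body>
<div class="promo coveo-no-index">Advertisement</div>
<p>Keep me</p>
<footer>Copyright notice</footer>
</body></html>"""

cleaned = remove_sections(sample)
print(cleaned)
```

Note that the class selector matches any element whose class list contains coveo-no-index, so an element with several classes, such as the div in the sample, is removed as well.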

Associating the Extension With Your Source

To link your extension to a source:

  1. Under Content in the Administration Console left menu, select Sources.

  2. Select the source you want to run your indexing pipeline extension on and select More > Manage extensions.

  3. In the Edit Source Extensions panel, with the Common tab selected, click Add > Extension.

  4. In the Edit Source Extensions panel, select your extension in the drop-down.

  5. Select the PRE-CONVERSION stage.

  6. Select the SKIP EXTENSION action on error.

  7. In the Conditions to apply field, type %[haslayout]==1. Since the haslayout field is Sitecore-specific, this condition ensures the extension is only applied to Sitecore items that have a layout.

  8. Click Apply extension.

  9. Back in the Edit Source Extensions panel, click Save.

  10. Rebuild the source from your Sitecore instance so that the extension runs and removes the targeted content.

For general guidance on adding indexing pipeline extensions, see Manage Indexing Pipeline Extensions.
