Remove HTML Sections From Indexed Sitecore Items

Using the CleanHtml Coveo for Sitecore processor is a simple way to remove web page content you don’t want to index. However, it may require adding exclusion tags to many of the layouts your Sitecore items are linked to. Another solution is to use a Coveo indexing pipeline extension (IPE) that leverages the full-featured BeautifulSoup library.

This article explains how to create an indexing pipeline extension and provides a Python code sample that removes the following from your web pages:

  • elements whose class attribute includes coveo-no-index, and

  • <footer> elements.

Creating the Extension

  1. In the Extensions (platform-ca | platform-eu | platform-au) section of the Coveo Administration Console, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. Give the extension access to the Original file data of your items. The script in this example reads and rewrites that data.

  4. In the Extension script field, paste the following code:

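    from bs4 import BeautifulSoup
    
    # Get the item's original file data; skip the item if the stream is missing
    read_only_stream = document.get_data_stream('documentdata')
    if read_only_stream is not None:
        # Parse the HTML with Python's built-in parser
        modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')
    
        # Remove matching nodes, along with everything they contain
        for element in modified_data.select('.coveo-no-index, footer'):
            element.decompose()
    
        # Write the cleaned HTML to a new stream and attach it to the item
        modified_stream = document.DataStream('documentdata')
        modified_stream.write(str(modified_data))
        document.add_data_stream(modified_stream)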
  5. Click Add extension.

Note

For the list of CSS selectors BeautifulSoup supports, see Quick Start - Soup Sieve.
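
Before changing the selector in the extension script, it can help to try it on a sample page locally. The following is a minimal sketch, assuming the beautifulsoup4 package is installed on your machine. The sample HTML and the strip_sections helper are hypothetical and only mirror the selector logic of the script above; no Coveo objects are involved.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only to exercise the selector
SAMPLE_HTML = """
<html><body>
  <nav class="coveo-no-index">Site navigation</nav>
  <main><p>Article body to keep.</p></main>
  <footer>Copyright notice</footer>
</body></html>
"""


def strip_sections(html, selector=".coveo-no-index, footer"):
    """Return html with every element matching selector removed."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.select(selector):
        # decompose() removes the element and everything inside it
        element.decompose()
    return str(soup)


cleaned = strip_sections(SAMPLE_HTML)
print(cleaned)
```

The printed markup should still contain the <main> element, but neither the <nav> element nor the <footer> element. To experiment, pass a different selector as the second argument.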

Associating the Extension With Your Source

To link your extension to a source:

  1. Under Content in the Administration Console left menu, select Sources.

  2. Click the source you want to run your indexing pipeline extension on, and then click More > Add extensions in the Action bar.

  3. In the Add extensions panel, with the Common tab selected, click Add > Extension.

  4. In the Apply an Extension on Source Items panel, select your extension in the dropdown menu.

  5. Select the PRE-CONVERSION stage.

  6. Select the SKIP EXTENSION action on error.

  7. In the Conditions to apply field, type %[haslayout] == "1". Since the haslayout field is Sitecore-specific, this condition ensures the extension is only applied to Sitecore items that have a layout.

  8. Click Apply extension.

  9. Back in the Add extensions panel, click Save.

  10. Rebuild the source from your Sitecore instance so that the extension runs and removes the targeted content.
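
To illustrate the intent of the condition in step 7, here is a small sketch in plain Python. It is not how Coveo evaluates conditions. The matches_condition helper, the sample items, and the "0" value are assumptions made for illustration only.

```python
def matches_condition(metadata):
    """Mimic the intent of %[haslayout] == "1" on a plain dict of metadata."""
    return metadata.get("haslayout") == "1"


# Hypothetical items; real metadata comes from your source
items = [
    {"title": "Home page", "haslayout": "1"},    # item with a layout
    {"title": "Media asset", "haslayout": "0"},  # item without a layout (assumed value)
    {"title": "Non-Sitecore item"},              # haslayout field absent
]

# Only items that satisfy the condition would be sent through the extension
processed = [item["title"] for item in items if matches_condition(item)]
print(processed)
```

In this sketch, only the item whose haslayout value is "1" is selected. Items that lack the field, or that carry another value, bypass the extension.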

Note

For general guidance on adding indexing pipeline extensions, see Manage Indexing Pipeline Extensions.