Filtering Content

Your Adobe Experience Manager HTML pages typically contain sections with irrelevant content that you want to exclude from the items indexed in your Coveo organization (e.g., the content of nav, header, and footer tags).

Similarly, configuring Coveo Adobe Experience Manager Connector sources with multiple root paths might not give you all the granularity you need. To index only relevant documents, you might want to exclude documents that actually match one of your root paths.

A powerful solution to filter content is to use a Coveo indexing pipeline extension (IPE).

This article provides general information on the indexing process, and on indexing pipeline extensions more specifically. It also walks you through creating indexing pipeline extensions, with Python code samples you can adapt to filter your Adobe Experience Manager content.

Indexing Pipeline Extensions in the Indexing Process

The Coveo Adobe Experience Manager Connector sends HTML pages to the Coveo Document Processing Manager (DPM). The DPM contains extension stages (i.e., preconversion and postconversion) where you can apply custom code to modify the way your documents are indexed.

Indexing process diagram

Indexing pipeline extension scripts are written in Python, and we recommend using the BeautifulSoup library to parse and modify the HTML of your pages.

Note

For general guidance on adding indexing pipeline extensions, see Manage Indexing Pipeline Extensions.

Example 1: Removing Sections From Your Pages

This example shows how to configure an indexing pipeline extension that strips the following from your HTML pages:

  • <footer> tags

  • Content in tags whose class attribute contains the navbar value.

To Create the Extension

  1. On the Extensions (platform-eu | platform-au) page of the Coveo Administration Console, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. For this example, you need the extension to access the Original file data of your items.

  4. In the Extension script field, paste the following code:

    from bs4 import BeautifulSoup

    # Read the original HTML of the item (requires access to the Original file data)
    read_only_stream = document.get_data_stream('documentdata')
    if read_only_stream is not None:
        modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')

        # Remove <footer> elements and elements whose class contains navbar
        for element in modified_data.select('.navbar, footer'):
            element.decompose()

        # Write the cleaned HTML to a new documentdata stream and add it to the item
        modified_stream = document.DataStream('documentdata')
        modified_stream.write(str(modified_data))
        document.add_data_stream(modified_stream)

    Note

    This script uses the document object get_data_stream, DataStream, and add_data_stream methods.

    For the list of CSS selectors BeautifulSoup supports, see Quick Start - Soup Sieve.

  5. Click Add extension.

To Apply the Extension to Your Source

  1. On the Sources (platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.

  2. In the Add Source Extensions panel, with the Common tab selected, click the Add drop-down menu, and then select Extension.

  3. In the Apply an Extension on Source Items panel, select your extension in the drop-down menu.

  4. Select the PRE-CONVERSION stage.

  5. Select the SKIP EXTENSION action on error.

  6. In the Conditions to apply field, enter %[primarytype] == 'Page'.

    Note

    This condition ensures the extension is only applied to your pages.

  7. Click Apply extension.

  8. Back in the Edit Source Extensions panel, click Save.

  9. Perform an indexing action so that the extension gets executed and removes the targeted content.
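
The .navbar selector in the script of this example follows CSS class selector semantics: an element matches when navbar is one of the whitespace-separated tokens of its class attribute, not when navbar merely appears somewhere inside a longer class name. The following minimal sketch illustrates that rule in plain Python. The has_class helper is hypothetical and written only for this illustration; it is not part of BeautifulSoup or of the Coveo document object.

```python
def has_class(class_attribute, name):
    """Return True when `name` is one of the whitespace-separated class tokens."""
    # A missing class attribute (None) is treated as an empty token list.
    return name in (class_attribute or '').split()


# Sample class attribute values (illustrative only)
for value in ('navbar', 'navbar navbar-dark', 'top-navbar', 'navbar-dark', None):
    print(repr(value), '->', has_class(value, 'navbar'))
```

If your pages use class names such as top-navbar for content you also want to strip, you would need to list them explicitly in the selector (e.g., '.navbar, .top-navbar, footer').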

Example 2: Excluding Documents

This example shows how to configure an indexing pipeline extension that prevents documents from being indexed even though they match one of your root paths. The example excludes the following:

  • Documents with content/we-retail/us/en/community or content/we-retail/us/en/user in their URI.

  • Documents based on the section-page template.

To Create the Extension

  1. On the Extensions (platform-eu | platform-au) page, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. For this example, you need the extension to access the Original file data of your items.

  4. In the Extension script field, paste the following code:

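    uri = document.uri
    template = document.get_meta_data_value("template")

    # Reject documents located under the excluded paths
    if 'content/we-retail/us/en/community' in uri or 'content/we-retail/us/en/user' in uri:
        document.reject()

    # Reject documents based on the section-page template
    # (the script checks the first template metadata value)
    if template and template[0] == 'section-page':
        document.reject()

    Note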

    This script uses the document object uri property, and its get_meta_data_value and reject methods.

  5. Click Add extension.

To Apply the Extension to Your Source

  1. On the Sources (platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.

  2. In the Add Source Extensions panel, with the Common tab selected, click the Add drop-down menu, and then select Extension.

  3. In the Apply an Extension on Source Items panel, select your extension in the drop-down menu.

  4. Select the PRE-CONVERSION stage.

  5. Select the SKIP EXTENSION action on error.

  6. Click Apply extension.

  7. Back in the Edit Source Extensions panel, click Save.

  8. Perform an indexing action so that the extension gets executed and rejects the targeted documents.
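
Before applying exclusion rules to a source, it can help to dry-run them locally against a few sample URIs. The sketch below is one way to do that, under stated assumptions: FakeDocument is a hypothetical stand-in that only mimics the uri, get_meta_data_value, and reject members used by the script in this example, and the example.com URIs and the product-page template name are made up for illustration. It does not reproduce the real document object.

```python
class FakeDocument:
    """Hypothetical stand-in for the extension's document object (local dry runs only)."""

    def __init__(self, uri, metadata=None):
        self.uri = uri
        self._metadata = metadata or {}
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumption: values come back as a list, which is how the script uses them.
        return self._metadata.get(name, [])

    def reject(self):
        self.rejected = True


EXCLUDED_PATHS = (
    'content/we-retail/us/en/community',
    'content/we-retail/us/en/user',
)
EXCLUDED_TEMPLATES = ('section-page',)


def apply_exclusion_rules(document):
    """Apply the same rules as the extension script, wrapped in a function for testing."""
    template = document.get_meta_data_value('template')
    if any(path in document.uri for path in EXCLUDED_PATHS):
        document.reject()
    elif template and template[0] in EXCLUDED_TEMPLATES:
        document.reject()
    return document


# Made-up sample items: one under an excluded path, one with the excluded
# template, and one that should be kept.
samples = [
    FakeDocument('https://example.com/content/we-retail/us/en/community/forum.html'),
    FakeDocument('https://example.com/content/we-retail/us/en/men.html',
                 {'template': ['section-page']}),
    FakeDocument('https://example.com/content/we-retail/us/en/products.html',
                 {'template': ['product-page']}),
]

for doc in samples:
    apply_exclusion_rules(doc)
    print(doc.uri, '->', 'rejected' if doc.rejected else 'kept')
```

Because the path rules rely on substring matching, a URI such as .../content/we-retail/us/en/users-guide.html would also be rejected by the content/we-retail/us/en/user rule. Appending a trailing slash to the excluded paths is one way to narrow the match if that is not what you want.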

What's next for me?