THIS IS ARCHIVED DOCUMENTATION

Filtering Content

Your Adobe Experience Manager HTML pages typically contain sections with irrelevant content (for example, the content of nav, header, and footer tags) that you want to exclude from the items indexed in your Coveo organization.

Similarly, certain documents on your website may not contain any valuable information. To index only relevant documents, you might want to exclude documents that match specific rules.

A powerful solution to filter content is to use a Coveo indexing pipeline extension (IPE).

This article provides general information on the indexing process, and on indexing pipeline extensions more specifically. It also walks you through creating indexing pipeline extensions, with Python code samples you can adapt to filter your Adobe Experience Manager content.

Indexing Pipeline Extensions in the Indexing Process

Coveo connectors send HTML pages to the Document Processing Manager (DPM). The DPM contains extension stages (that is, preconversion and postconversion) where you can apply custom code to modify the way your documents are indexed.

[Figure: Indexing process diagram]

Indexing pipeline extension scripts are written in Python, and we recommend using the BeautifulSoup library to parse and modify the HTML of your pages.

Note

For general guidance on adding indexing pipeline extensions, see Manage Indexing Pipeline Extensions.

Example 1: Removing Sections From Your Pages

This example shows how to configure an indexing pipeline extension that strips the following from your HTML pages:

  • <footer> elements

  • Elements whose class attribute contains the navbar value

To Create the Extension

  1. On the Extensions (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. For this example, you need the extension to access the Original file data of your items.

    [Figure: Extension Access Original File option]

  4. In the Extension script field, paste the following code:

    from bs4 import BeautifulSoup
    
    read_only_stream = document.get_data_stream('documentdata')
    if read_only_stream is not None:
        modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')
    
        # Remove matching nodes
        for element in modified_data.select('.navbar, footer'):
            element.decompose()
    
        modified_stream = document.DataStream('documentdata')
        modified_stream.write(str(modified_data))
        document.add_data_stream(modified_stream)

    Note

    This script uses the document object get_data_stream, DataStream, and add_data_stream methods.

    For the list of CSS selectors BeautifulSoup supports, see Quick Start - Soup Sieve.

  5. Click Add extension.

To Apply the Extension to Your Source

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.

  2. In the Add Source Extensions panel, with the Common tab selected, click the Add dropdown menu, and then select Extension.

  3. In the Apply an Extension on Source Items panel, select your extension in the dropdown menu.

  4. Select the PRE-CONVERSION stage.

  5. Select the SKIP EXTENSION action on error.

  6. In the Conditions to apply field, enter %[primarytype] == 'Page'.

    Note

    This condition ensures the extension is only applied to your pages.

  7. Click Apply extension.

  8. Back in the Edit Source Extensions panel, click Save.

  9. Perform an indexing action so that the extension gets executed and removes the targeted content.
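Before you perform an indexing action, it can help to check on your own machine that your CSS selector removes what you expect. The following is a minimal sketch that runs outside Coveo, assuming BeautifulSoup is installed locally. The sample HTML and the strip_sections helper are hypothetical and only illustrate the same select and decompose logic the extension script uses.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only to illustrate the selector.
SAMPLE_HTML = """
<html><body>
<div class="navbar main">Home | Products</div>
<main><p>Relevant content</p></main>
<footer>Copyright notice</footer>
</body></html>
"""


def strip_sections(html, selector='.navbar, footer'):
    """Return the HTML with every element matching the CSS selector removed."""
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup.select(selector):
        # decompose() removes the element and all of its children from the tree
        element.decompose()
    return str(soup)


cleaned = strip_sections(SAMPLE_HTML)
print(cleaned)
```

In this sketch, the div is removed even though its class attribute holds two values, because the .navbar selector matches any element whose class list includes navbar. To target other sections, adjust the selector argument.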

Example 2: Excluding Documents

This example shows how to configure an indexing pipeline extension that excludes specified documents from being indexed. The example excludes the following:

  • Documents with content/we-retail/us/en/community or content/we-retail/us/en/user in their URI.

  • Documents based on the section-page template.

To Create the Extension

  1. On the Extensions (platform-ca | platform-eu | platform-au) page, click Add extension.

  2. In the Add an Extension panel, give your extension a name and a description.

  3. For this example, you need the extension to access the Original file data of your items.

  4. In the Extension script field, paste the following code:

    uri = document.uri
    template = document.get_meta_data_value("template")

    # Reject documents located under the community or user paths
    if 'content/we-retail/us/en/community' in uri or 'content/we-retail/us/en/user' in uri:
        document.reject()

    # The script reads the first template value and rejects section pages
    if template and template[0] == 'section-page':
        document.reject()

    Note

    This script uses the document object uri property and the get_meta_data_value and reject methods.

  5. Click Add extension.

To Apply the Extension to Your Source

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.

  2. In the Add Source Extensions panel, with the Common tab selected, click the Add dropdown menu, and then select Extension.

  3. In the Apply an Extension on Source Items panel, select your extension in the dropdown menu.

  4. Select the PRE-CONVERSION stage.

  5. Select the SKIP EXTENSION action on error.

  6. Click Apply extension.

  7. Back in the Edit Source Extensions panel, click Save.

  8. Perform an indexing action so that the extension gets executed and rejects the targeted documents.
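To try your exclusion rules locally before you apply them to a source, you can run the same logic against a stand-in object. The following is a minimal sketch under stated assumptions: FakeDocument is a hypothetical class that only mimics the uri, get_meta_data_value, and reject members used by the extension script (it is not part of any Coveo library), and the example.com URIs and the product-page template are made up for illustration.

```python
class FakeDocument:
    """Hypothetical stand-in for the Coveo document object (local testing only)."""

    def __init__(self, uri, metadata):
        self.uri = uri
        self._metadata = metadata
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumed to return a list of values, since the extension script reads template[0]
        return self._metadata.get(name, [])

    def reject(self):
        self.rejected = True


EXCLUDED_PATHS = ('content/we-retail/us/en/community', 'content/we-retail/us/en/user')
EXCLUDED_TEMPLATES = ('section-page',)


def apply_exclusion_rules(document):
    """Reject the document when its URI or its template matches an exclusion rule."""
    template = document.get_meta_data_value('template')
    if any(path in document.uri for path in EXCLUDED_PATHS):
        document.reject()
    elif template and template[0] in EXCLUDED_TEMPLATES:
        document.reject()


samples = [
    FakeDocument('https://example.com/content/we-retail/us/en/community/forum.html', {}),
    FakeDocument('https://example.com/content/we-retail/us/en/men.html', {'template': ['section-page']}),
    FakeDocument('https://example.com/content/we-retail/us/en/products.html', {'template': ['product-page']}),
]
for doc in samples:
    apply_exclusion_rules(doc)

results = [doc.rejected for doc in samples]
print(results)  # prints [True, True, False]
```

Keeping the paths and templates in tuples makes it easier to add or remove rules without touching the conditions. Only the body of apply_exclusion_rules would go in the Extension script field, where the real document object is already available.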

What’s Next?

The next step is to build search interfaces that will tap into the content you have indexed.