Filtering Content
Your Adobe Experience Manager HTML pages typically contain sections with irrelevant content that you want to exclude from the items indexed in your Coveo organization (for example, the content of nav, header, and footer tags).
Similarly, certain documents in your website may not contain any valuable information. To index only relevant documents, you might want to exclude documents that match specific rules.
A powerful solution to filter content is to use a Coveo indexing pipeline extension (IPE).
This article provides general information on the indexing process, and on indexing pipeline extensions more specifically. The article guides you on how to create indexing pipeline extensions with Python code samples you may adapt to filter your Adobe Experience Manager content.
Indexing Pipeline Extensions in the Indexing Process
Coveo connectors send HTML pages to the Document Processing Manager (DPM). The DPM contains extension stages (that is, preconversion and postconversion) where you can apply custom code to modify the way your documents are indexed.
Indexing pipeline extension scripts are written in Python, and we recommend using the BeautifulSoup library to parse and modify your HTML.
Note
For general guidance on adding indexing pipeline extensions, see Manage Indexing Pipeline Extensions.
Example 1: Removing Sections From Your Pages
This example shows how to configure an indexing pipeline extension that strips the following from your HTML pages:
-
<footer> tags
-
Content in tags whose class attribute contains the navbar value.
To Create the Extension
-
On the Extensions (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console, click Add extension.
-
In the Add an Extension panel, give your extension a name and a description.
-
For this example, you need the extension to access the Original file data of your items.
-
In the Extension script field, paste the following code:
    from bs4 import BeautifulSoup

    read_only_stream = document.get_data_stream('documentdata')
    if read_only_stream is not None:
        modified_data = BeautifulSoup(read_only_stream.read().decode(), 'html.parser')

        # Remove matching nodes
        for element in modified_data.select('.navbar, footer'):
            element.decompose()

        modified_stream = document.DataStream('documentdata')
        modified_stream.write(str(modified_data))
        document.add_data_stream(modified_stream)
Note
This script uses the document object get_data_stream, DataStream, and add_data_stream methods.
For the list of CSS selectors BeautifulSoup supports, see Quick Start - Soup Sieve.
-
Click Add extension.
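Before pasting the script into the Administration Console, it can help to try the selector logic locally. The following is a minimal sketch, assuming the beautifulsoup4 package is installed on your machine. The strip_sections function and the sample HTML are hypothetical and are not part of the Coveo extension API; they only mirror the removal loop of the script above.

```python
from bs4 import BeautifulSoup


def strip_sections(html, selector='.navbar, footer'):
    """Return html with every node matching the CSS selector removed.

    Hypothetical local helper: it reproduces the loop of the extension
    script but does not use the Coveo document object.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup.select(selector):
        element.decompose()
    return str(soup)


# Made-up page used only to exercise the selector
sample_html = (
    '<html><body>'
    '<div class="navbar main">Menu</div>'
    '<p>Useful content</p>'
    '<footer>Legal notice</footer>'
    '</body></html>'
)

cleaned = strip_sections(sample_html)
print(cleaned)
```

If the printed markup still contains content you meant to remove, adjust the selector here first, and then copy it into the extension script.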
To Link Your Extension to Your Source
-
On the Sources (platform-ca | platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.
-
In the Add Source Extensions panel, with the Common tab selected, click the Add dropdown menu, and then select Extension.
-
In the Apply an Extension on Source Items panel, select your extension in the dropdown menu.
-
Select the PRE-CONVERSION stage.
-
Select the SKIP EXTENSION action on error.
-
In the Conditions to apply field, enter %[primarytype] == 'Page'.
Note
This condition ensures the extension is only applied to your pages.
-
Click Apply extension.
-
Back in the Edit Source Extensions panel, click Save.
-
Perform an indexing action so that the extension gets executed and removes the targeted content.
Example 2: Excluding Documents
This example shows how to configure an indexing pipeline extension that excludes specified documents from being indexed. The example excludes the following:
-
Documents with content/we-retail/us/en/community or content/we-retail/us/en/user in their URI.
-
Documents based on the section-page template.
To Create the Extension
-
On the Extensions (platform-ca | platform-eu | platform-au) page, click Add extension.
-
In the Add an Extension panel, give your extension a name and a description.
-
For this example, you need the extension to access the Original file data of your items.
-
In the Extension script field, paste the following code:
    uri = document.uri
    template = document.get_meta_data_value("template")

    # Reject documents located under the excluded paths
    if ('content/we-retail/us/en/community' in uri) or ('content/we-retail/us/en/user' in uri):
        document.reject()

    # Reject documents based on the section-page template
    if (template):
        t = template[0]
        if (t == 'section-page'):
            document.reject()
Note
This script uses the document object uri property and the get_meta_data_value and reject methods.
-
Click Add extension.
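To check the rejection rules before running an indexing action, the script logic can be exercised against a stand-in object. The sketch below is a local harness built on assumptions: FakeDocument is a hypothetical stub that only imitates the uri, get_meta_data_value, and reject members used in the script above, apply_rules wraps the same conditions in a function, and the example.com URIs are made up.

```python
class FakeDocument:
    """Hypothetical stand-in for the Coveo document object (local testing only)."""

    def __init__(self, uri, metadata=None):
        self.uri = uri
        self._metadata = metadata or {}
        self.rejected = False

    def get_meta_data_value(self, name):
        # Returns a list of values, which is what the script assumes
        # when it reads template[0]
        return self._metadata.get(name, [])

    def reject(self):
        self.rejected = True


EXCLUDED_PATHS = (
    'content/we-retail/us/en/community',
    'content/we-retail/us/en/user',
)


def apply_rules(document):
    """Apply the same exclusion rules as the extension script."""
    if any(path in document.uri for path in EXCLUDED_PATHS):
        document.reject()
        return
    template = document.get_meta_data_value('template')
    if template and template[0] == 'section-page':
        document.reject()


# Made-up sample items: one per rule, plus one that should be kept
community = FakeDocument('https://example.com/content/we-retail/us/en/community/forum.html')
section = FakeDocument('https://example.com/content/we-retail/us/en/men.html',
                       {'template': ['section-page']})
product = FakeDocument('https://example.com/content/we-retail/us/en/products.html',
                       {'template': ['product-page']})

for doc in (community, section, product):
    apply_rules(doc)
    print(doc.uri, 'rejected' if doc.rejected else 'kept')
```

Because the URI check is a plain substring match, a path such as content/we-retail/us/en/user-guide would also be rejected. Make the path strings more specific if that is not what you want.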
To Link Your Extension to Your Source
-
On the Sources (platform-ca | platform-eu | platform-au) page, click the source you want to run your indexing pipeline extension on, and then click More > Manage extensions in the Action bar.
-
In the Add Source Extensions panel, with the Common tab selected, click the Add dropdown menu, and then select Extension.
-
In the Apply an Extension on Source Items panel, select your extension in the dropdown menu.
-
Select the PRE-CONVERSION stage.
-
Select the SKIP EXTENSION action on error.
-
Click Apply extension.
-
Back in the Edit Source Extensions panel, click Save.
-
Perform an indexing action so that the extension gets executed and rejects the targeted documents.
What’s Next?
The next step is to build search interfaces that will tap into the content you have indexed.