Indexing pipeline customization tools overview
This article provides an overview of the Coveo tools and features that you can use to customize how each candidate item is processed through the Coveo indexing pipeline. You can sometimes choose between a few tools to achieve the same indexing customization goal, but some of them may impact performance, or may only be available for certain connectors.
As a developer, your first choice might be to use indexing pipeline extensions (IPEs) because they’re scripts and offer great flexibility. However, IPEs can also decrease indexing performance, and there’s often another tool or feature that can achieve the same goal with less overhead.
The tools are listed in order of preference, starting with those that are most appropriate because of their effectiveness, ease of use, or performance.
URL filters
Indexing goal
Control the indexing scope, choosing which repository pages or sections to include in your source.
How to use
-
For Web sources, configure from the Administration Console (see Add a Web source).
-
For other URL-based source types:
-
Edit the source JSON configuration (see Add source filters).
-
Create/update a source from the API (see Sources: Source Resource).
Advantages
-
Efficiently processed by the Crawling stage.
-
Wildcard or regex flexibility.
-
For Web sources, easy configuration from the Administration Console (see Add a Web source).
Disadvantages
-
Only applicable to URL-based source types (for example, Web and Sitemap).
-
For Sitemap sources, configuration from the source JSON can be challenging.
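To illustrate how inclusion and exclusion filters written as wildcard or regex patterns narrow the indexing scope, here is a minimal Python sketch. It is not Coveo's implementation: the `UrlFilter` class, the `in_scope` function, and the rule that an exclusion overrides an inclusion are assumptions made for illustration, and the example URLs are hypothetical.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class UrlFilter:
    """Hypothetical filter: a pattern, its kind, and whether it includes or excludes."""
    pattern: str
    kind: str = "wildcard"  # "wildcard" or "regex"
    include: bool = True

    def matches(self, url: str) -> bool:
        if self.kind == "regex":
            # The regex must match the whole URL.
            return re.fullmatch(self.pattern, url) is not None
        # Shell-style wildcard: "*" matches any run of characters.
        return fnmatchcase(url, self.pattern)


def in_scope(url: str, filters: list) -> bool:
    """Assumed rule: a URL needs at least one inclusion match and no exclusion match."""
    included = any(f.matches(url) for f in filters if f.include)
    excluded = any(f.matches(url) for f in filters if not f.include)
    return included and not excluded


filters = [
    UrlFilter("https://docs.example.com/*"),
    UrlFilter(r"https://docs\.example\.com/archive/.*", kind="regex", include=False),
]

for url in (
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/archive/2019",
    "https://blog.example.com/post",
):
    print(url, in_scope(url, filters))
```

Because items that fall out of scope are dropped at the Crawling stage, filtering this way avoids spending later pipeline stages on content that would never be searched.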
Mappings
Indexing goal
How to use
-
Configure from the Coveo Administration Console (see Manage source mappings).
-
Create/update mappings from the API (see Mappings: Mapping Resource).
Advantages
-
Efficiently processed by the Mapping stage.
-
Conditional mappings based on item type.
-
Can concatenate multiple metadata values and include custom text with the Literal option.
-
Can edit body content (see Add or edit a body mapping).
-
Can get metadata values from a specific stage with the origin suffix (see Mapping rules syntax).
Disadvantages
-
Can’t programmatically process metadata values.
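The concatenation and origin-suffix behavior of a mapping rule can be pictured with a small Python sketch. This is an illustration only, not how the Mapping stage is implemented. The `%[name]` and `%[name:origin]` placeholder forms are written from memory of the mapping rules syntax and should be checked against the Mapping rules syntax reference; the `resolve_rule` function, the metadata layout, and the "latest origin wins" default are assumptions.

```python
import re

# Assumed placeholder forms: %[name] or %[name:origin]
PLACEHOLDER = re.compile(r"%\[([^\]:]+)(?::([^\]]+))?\]")


def resolve_rule(rule: str, metadata: dict) -> str:
    """Replace each placeholder in `rule` with a metadata value.

    `metadata` maps a metadata name to a dict of {origin: value}, with
    origins listed in the order the (hypothetical) stages produced them.
    """
    def substitute(match):
        name, origin = match.group(1), match.group(2)
        values_by_origin = metadata.get(name, {})
        if origin is not None:
            # An origin suffix pins the value to one specific stage.
            return values_by_origin.get(origin, "")
        # No suffix: assume the most recently produced value is used.
        return next(reversed(values_by_origin.values()), "")

    return PLACEHOLDER.sub(substitute, rule)


metadata = {
    "author": {"crawler": "Alice", "converter": "A. Smith"},
    "title": {"converter": "Intro"},
}

# Literal text ("Article by", ":") mixed with two metadata placeholders.
resolved = resolve_rule("Article by %[author:crawler]: %[title]", metadata)
print(resolved)
```

The sketch also shows the limit stated above: a rule can select and combine values, but it can't transform them (for example, reformat a date), which is where a script becomes necessary.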
Web scraping configuration
Indexing goal
-
Exclude specific web page sections.
-
Extract specific content to create metadata.
-
Create sub-items.
How to use
-
For Web sources, configure from the Coveo Administration Console (see Add a Web source).
-
For a Sitemap source, edit the source JSON configuration.
-
For both source types, create/update source from the API (see Update a source from simple configuration).
Advantages
-
Efficiently processed by the Crawling stage.
-
A Coveo Labs Chrome extension is available to help create web scraping configurations (see web-scraper-helper).
-
Exclusion of repeating web page parts from index (for example, header, sidebar, footer) (see Elements to exclude).
-
Extraction of content from HTML elements with XPATH and CSS locators to enrich metadata (see Metadata to extract).
-
Splitting web page parts into multiple index items (see SubItems).
Disadvantages
-
Only available for Web and Sitemap sources.
-
Requires developer skills to create the JSON web scraping configuration and take full advantage of XPATH and CSS expressions.
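As a rough idea of what such a JSON configuration looks like, the following Python sketch builds one and serializes it. It covers the three goals listed above: excluding page sections, extracting metadata, and creating sub-items. The key names (`for`, `urls`, `exclude`, `metadata`, `subItems`, `type`, `path`) are written from memory of the web scraping schema and should be treated as assumptions to verify against the official reference; every selector and the metadata and sub-item names are hypothetical.

```python
import json

# Assumed schema: a list of rule sets, each scoped to URLs by regex.
web_scraping_config = [
    {
        "for": {"urls": [".*"]},
        # Page sections to leave out of the indexed body.
        "exclude": [
            {"type": "CSS", "path": "header"},
            {"type": "CSS", "path": "footer"},
        ],
        # Metadata name -> locator of the content to extract.
        "metadata": {
            "author": {"type": "XPATH", "path": "//span[@class='author']/text()"},
        },
        # Sub-item type -> locator of each element to index separately.
        "subItems": {
            "comment": {"type": "CSS", "path": "div.comment"},
        },
    }
]

config_json = json.dumps(web_scraping_config, indent=2)
print(config_json)
```

Generating the JSON from code is a design choice rather than a requirement; it makes it easier to validate the structure before pasting it into the source configuration.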
Indexing pipeline extension
Indexing goal
-
When you can’t use the above tools to:
-
Add/modify metadata.
-
Add/modify data streams.
-
Reject items (in pre-conversion scripts).
-
Exclude specific web page sections.
-
Use external resources and services (for example, use an image recognition API to inject metadata).
-
Add/modify item permissions (for example, for a Push source for which the crawler doesn’t associate permissions).
How to use
-
Create or modify the appropriate Python scripts (see Document Object Python API Reference and Indexing Pipeline Extension Script Samples).
-
Add a script to the organization as an extension (see Add or edit an indexing pipeline extension).
-
Apply an extension to a source (see Edit source extension).
Advantages
-
Access to third-party services and databases.
-
Flexibility of Python language and available libraries (see Python modules available to IPEs).
-
Extension code reuse with conditional execution and extension parameters.
-
Index item processing to:
-
Manage metadata.
-
Manage permissions.
-
Manage security providers.
-
Manage data streams.
-
Retrieve the URI.
-
Set log messages (see Logging messages from an indexing pipeline extension).
Disadvantages
-
Requires developer skills to create the Python scripts.
-
Each extension execution affects indexing performance.
-
Limit of 10 indexing pipeline extensions per organization.
-
Extension script execution limited to 5 seconds.
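To give a feel for what an extension script does, here is a minimal sketch that adds metadata and rejects items lacking a title. In a real extension, the `document` object and `log` function are supplied by the pipeline; here they are passed as arguments so the logic can run locally against a stand-in. The method names `get_meta_data_value`, `add_meta_data`, and `reject`, the `uri` attribute, and the `log` severity argument are written from memory of the Document Object Python API and should be verified there. `FakeDocument`, `enrich`, and the `titlelength` metadata name are hypothetical.

```python
def enrich(document, log):
    """Hypothetical extension body: reject untitled items, else add metadata."""
    titles = document.get_meta_data_value("title")
    if not titles:
        log("No title found for " + document.uri, "Warning")
        # Per the list above, rejecting is done in pre-conversion scripts.
        document.reject()
        return
    document.add_meta_data({"titlelength": str(len(titles[0]))})


class FakeDocument:
    """Local stand-in for the pipeline's document object (assumption)."""

    def __init__(self, uri, metadata):
        self.uri = uri
        self.metadata = dict(metadata)
        self.rejected = False

    def get_meta_data_value(self, name):
        # Return a list of values, empty when the metadata is absent.
        value = self.metadata.get(name)
        return [] if value is None else [value]

    def add_meta_data(self, values):
        self.metadata.update(values)

    def reject(self):
        self.rejected = True


messages = []


def log(message, severity="Normal"):
    # Collect messages instead of writing to the pipeline log.
    messages.append((severity, message))


doc_ok = FakeDocument("https://example.com/a", {"title": "Hello"})
doc_bad = FakeDocument("https://example.com/b", {})
enrich(doc_ok, log)
enrich(doc_bad, log)
print(doc_ok.metadata, doc_bad.rejected, messages)
```

Keeping the logic in a plain function like this makes it easy to test before uploading, which matters given the per-execution performance cost and the 5-second execution limit noted above.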