Indexing Pipeline Customization Tools Overview

This article provides an overview of the Coveo Cloud tools and features that you can use to customize how each candidate item is processed through the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline). More than one tool can sometimes achieve a given indexing customization goal, but some tools may impact performance or be available only for specific connectors.

As a developer, you may see indexing pipeline extension Python scripts as the obvious first choice, because code lets you do almost anything. While extensions are very flexible, they can slow down indexing, and another tool or feature can often achieve the same result.

The following table lists the customization tools in order of preference, starting with those that are the most appropriate because of their effectiveness, ease of use, or performance.

Tool / feature | Indexing goal | How to use | Advantages | Disadvantages
URL filters
  Indexing goal:
  • Control the indexing scope by choosing which repository pages or sections to include in your source.
  Advantages:
  • Efficiently processed by the Crawling stage.
  • Wildcard or regex flexibility.
  • For Web sources, easy configuration from the administration console (see Add or Edit a Web Source).
  Disadvantages:
  • Applicable only to URL-based source types (e.g., Web and Sitemap).
  • For Sitemap sources, configuration is less convenient because it is done in the source JSON.
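
To illustrate how wildcard and regex filters can narrow the indexing scope, here is a minimal Python sketch. It is not Coveo's implementation or configuration format: the `INCLUSIONS` and `EXCLUSIONS` lists, the `docs.example.com` URLs, and the rule that an exclusion overrides an inclusion are all assumptions made for the example.

```python
import fnmatch
import re

# Hypothetical filter lists for illustration only; this is not the
# format used by a Coveo source configuration.
INCLUSIONS = [("wildcard", "https://docs.example.com/*")]
EXCLUSIONS = [("regex", r"^https://docs\.example\.com/archive/.*$")]


def matches(kind: str, pattern: str, url: str) -> bool:
    """Return True when the URL matches a wildcard or regex pattern."""
    if kind == "wildcard":
        # fnmatchcase treats * as "any run of characters", including "/".
        return fnmatch.fnmatchcase(url, pattern)
    if kind == "regex":
        return re.search(pattern, url) is not None
    raise ValueError(f"unknown filter kind: {kind}")


def in_scope(url: str) -> bool:
    """Assumed rule: a URL is kept if it matches an inclusion and no exclusion."""
    included = any(matches(kind, pattern, url) for kind, pattern in INCLUSIONS)
    excluded = any(matches(kind, pattern, url) for kind, pattern in EXCLUSIONS)
    return included and not excluded
```

With these hypothetical filters, `in_scope("https://docs.example.com/guide/intro")` is true, while URLs under `/archive/` or on another host are left out of the source.
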
Mappings
  Indexing goal:
  • Extract original document metadata values to populate specific Coveo index fields.
  • Customize or create an index item body for object-based source types such as Salesforce or a database.
  Advantages:
  • Efficiently processed by the Mapping stage.
  • Conditional mappings based on item type.
  • Can concatenate one or more metadata values and include custom text with the Literal option.
  • Can edit body content (see Add or Edit a Body Mapping).
  • Can get a metadata value from a specific stage with the origin suffix (see Mapping Rules Syntax Reference).
  Disadvantages:
  • Cannot programmatically process metadata values.
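
The following Python sketch shows, conceptually, how a mapping rule can concatenate metadata values with literal text and pick a value from a specific stage. It assumes a `%[name]` and `%[name:origin]` placeholder syntax; confirm the actual syntax in the Mapping Rules Syntax Reference. The `METADATA` values, the origin names, and the fallback order in `ORIGIN_ORDER` are hypothetical.

```python
import re

# Hypothetical item metadata, keyed by (name, origin).
METADATA = {
    ("author", "crawler"): "jsmith",
    ("author", "converter"): "John Smith",
    ("product", "crawler"): "Widget",
}

# Assumed precedence when a placeholder has no origin suffix.
ORIGIN_ORDER = ["converter", "crawler"]

# Matches %[name] and %[name:origin].
PLACEHOLDER = re.compile(r"%\[([^\]:]+)(?::([^\]]+))?\]")


def resolve(rule: str) -> str:
    """Replace each placeholder in the rule with the matching metadata value."""

    def lookup(match: re.Match) -> str:
        name, origin = match.group(1), match.group(2)
        origins = [origin] if origin else ORIGIN_ORDER
        for candidate in origins:
            if (name, candidate) in METADATA:
                return METADATA[(name, candidate)]
        return ""  # Unknown metadata resolves to an empty string here.

    return PLACEHOLDER.sub(lookup, rule)
```

In this sketch, `resolve("By %[author] about %[product]")` mixes literal text with two metadata values, and `resolve("%[author:crawler]")` forces the value from one stage. Because a rule like this only substitutes values, any real transformation of the values is out of its reach, which is the limitation noted above.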

Web scraping configuration
  Indexing goal:
  • Exclude specific web page sections.
  • Extract specific content to create metadata.
  • Create sub-items.
  Advantages:
  • Efficiently processed by the Crawling stage.
  • Coveo Labs Chrome extension available to easily create web scraping configurations (see web-scraper-helper).
  • Exclusion of repeating web page parts from the index (e.g., header, sidebar, footer) (see Exclusion).
  • Extraction of content from HTML elements with XPath and CSS selectors to enrich metadata (see Selectors).
  • Splitting of a web page into more than one index item (see SubItems).
  Disadvantages:
  • Available only for Web and Sitemap sources.
  • Requires developer skills to create the JSON web scraping configuration and take full advantage of XPath and CSS expressions.
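
As a rough illustration of what a JSON web scraping configuration can look like, the Python sketch below builds one as plain data and serializes it. The key names (`for`, `urls`, `exclude`, `metadata`, `subItems`, `type`, `path`) are assumptions about the configuration format and should be checked against the Exclusion, Selectors, and SubItems references. The selectors and the `articleauthor` and `comment` names are hypothetical.

```python
import json

# Assumed shape of a web scraping configuration; verify the key names
# against the Coveo references before using it in a source.
web_scraping_config = [
    {
        # Apply this block to every crawled URL.
        "for": {"urls": [".*"]},
        # Drop repeating page parts so they are not indexed.
        "exclude": [
            {"type": "CSS", "path": "header"},
            {"type": "CSS", "path": "footer"},
        ],
        # Create a metadata value from the content of an HTML element.
        "metadata": {
            "articleauthor": {
                "type": "XPATH",
                "path": "//span[@class='author']/text()",
            },
        },
        # Split each matching element into its own index item.
        "subItems": {
            "comment": {"type": "CSS", "path": "div.comment"},
        },
    }
]

serialized = json.dumps(web_scraping_config, indent=2)
print(serialized)
```

Building the configuration as data and serializing it helps keep the JSON well-formed, which matters because the configuration is edited by hand.
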
Indexing Pipeline Extension
  Indexing goal:
  • When not possible with the above tools:
    • Add/modify metadata.
    • Add/modify data streams.
    • Reject items (in pre-conversion scripts).
    • Exclude specific web page sections.
  Advantages:
  • Use external resources and services (e.g., an image recognition API to inject metadata).
  • Add/modify item permissions (e.g., for a Push source for which the crawler does not associate permissions).
  Disadvantages:
  • Requires developer skills to create the Python scripts.
  • Extension execution for each index item affects indexing performance.
  • Limit of 10 indexing pipeline extensions per organization.
  • Extension script execution is limited to 5 seconds.
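
The kind of logic an extension script typically contains can be sketched as follows. The sketch assumes that the extension environment exposes a `document` object with `get_meta_data_value`, `add_meta_data`, and `reject` methods; check the extension API reference for the exact signatures. `FakeDocument` is a hypothetical local stand-in used only to make the example runnable outside the pipeline, and the `status`, `title`, and `titlelength` metadata names are invented for the example.

```python
class FakeDocument:
    """Hypothetical stand-in for the `document` object, for local testing only."""

    def __init__(self, uri, metadata):
        self.uri = uri
        self._metadata = dict(metadata)
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumed behavior: return a list of values, empty when absent.
        value = self._metadata.get(name)
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    def add_meta_data(self, values):
        self._metadata.update(values)

    def reject(self):
        self.rejected = True


def run_extension(document):
    """Reject draft items; otherwise add a metadata value derived from the title."""
    status = document.get_meta_data_value("status")
    if status and status[0].lower() == "draft":
        # Per the table above, rejection applies to pre-conversion scripts.
        document.reject()
        return
    title = document.get_meta_data_value("title")
    if title:
        document.add_meta_data({"titlelength": len(title[0])})
```

Because the script runs for every item and is limited to 5 seconds, keeping the work this small, and avoiding slow external calls where possible, helps limit the impact on indexing performance.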