Applying Indexing Techniques
Keeping a clean index should be a top priority throughout the entire lifecycle of your search solution. A clean index yields faster, more relevant search, and prevents you from having to maintain complex rules to filter out useless data at query time. Keeping a clean index is also important to ensure that a search solution doesn’t exceed the maximum number of items allowed by its underlying Coveo organization license (see Entitlements (Coveo Platform)).
The first steps are to:
Select appropriate source types.
Determine what data/metadata to index from each content repository.
Once this is done, there are several indexing techniques you can use to refine/enhance content prior to, and during the indexing process (see About the Indexing Process). Depending on the connector a given source is based on, you may have to apply one or more of those techniques.
In general, the earlier you can make content changes, the better. The following diagram illustrates the various indexing techniques as the optional steps of a funnel-like process.
This article explains each of those techniques.
Refine/Enhance Original Content Directly in Its Repository
Crawling messy/incomplete content may require substantial source configuration efforts, and may ultimately consume significant indexing time and computing resources.
If at all possible, you should therefore consider making changes in the original content repository before applying any subsequent indexing technique.
You plan to use a Sitemap source to index pages from an internally managed web site (see Sitemap Source).
While determining what data and metadata you require on each item, you realize that the HTML of the pages you want to index lacks some meta tags you need, and that some current meta tags don’t have human-readable values.
You contact the team or person responsible for the internal web site, and request that they make the necessary changes.
Set the Crawling Scope and Refine/Enhance Crawled Content
Several source connectors offer inclusion, exclusion, and/or filter configuration options to set their crawling scope, and possibly refine/enhance their crawled content (see Crawling).
If you can discard superfluous data and make necessary data alterations before the content even reaches the document processing manager (DPM), the indexing process becomes both lighter and easier to troubleshoot.
You can use the Web Scraping configuration of a Web or Sitemap source to specify that the crawler should ignore repetitive/irrelevant item data such as page headers and footers (see Web Scraping Configuration).
The Content to Include configuration of a Confluence Cloud source allows you to specify what spaces and what types of content to crawl.
It’s possible to configure a Google Drive for Work source such that it only crawls content owned by specific users (see Add or Edit a Google Drive for Work Source).
You can also create crawling filters directly in the JSON configuration of a source (see Add Source Filters).
Fully Leverage the Crawling Flexibility of Push API-Based Sources
Push sources, and sources relying on the Coveo On-Premises Crawling Module, offer greater crawling flexibility than other types of sources (at the cost of increased configuration complexity).
With a Push source, you have complete control over the crawling process (see Push Source). When using such a source, you should therefore ensure that the content you send to the indexing pipeline is entirely relevant, and doesn’t require any further enhancement/refinement.
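To make the "fully relevant before push" idea concrete, here’s a minimal Python sketch of assembling one such item before sending it. The endpoint URL, payload field names, and identifiers below are illustrative assumptions, not the exact Push API contract; consult the Push API reference for the authoritative details.

```python
# Sketch: preparing a fully denormalized, fully refined item for a Push
# source, so no further enhancement is needed once it reaches the DPM.
# Endpoint shape and field names are assumptions for illustration.

PUSH_ENDPOINT = (
    "https://api.cloud.coveo.com/push/v1/organizations/"
    "{org_id}/sources/{source_id}/documents"
)  # hypothetical placeholders

def build_push_document(uri, title, body_html, metadata):
    """Assemble the JSON payload for one item, with all refinement
    done *before* the item is sent to the indexing pipeline."""
    payload = {
        "documentId": uri,   # unique item URI
        "title": title,
        "data": body_html,   # item body
    }
    # Flat metadata keys can then be mapped to fields.
    payload.update(metadata)
    return payload

doc = build_push_document(
    uri="https://example.com/books/9781501192272",  # hypothetical URI
    title="The Talisman",
    body_html="<html><body><p>The Talisman</p></body></html>",
    metadata={
        "publisher": "Gallery Books",
        "authors": "Stephen King (b. 1947);Peter Straub (b. 1943)",
    },
)

# An actual update would be an authenticated HTTP request, e.g.:
# requests.put(PUSH_ENDPOINT.format(org_id=..., source_id=...),
#              params={"documentId": doc["documentId"]},
#              headers={"Authorization": "Bearer <API key>"},
#              json=doc)
```

Because the item arrives already clean and complete, no crawling filters or extensions are needed downstream.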
A source relying on the Coveo On-Premises Crawling Module allows you to use pre-push extensions to alter or reject items before they reach the DPM (see Using Pre-Push Extensions).
The above statements don’t imply that you should use Push API-based sources all the time.
As a matter of fact:
You should typically use a source relying on the Coveo On-Premises Crawling Module only to index content behind a firewall.
Using a Push source should constitute a last resort when no other specific or generic connector suits your needs.
The following diagram shows the funnel-like process of applying indexing techniques when using Push API-based sources.
As you can see, you should typically not have to use indexing pipeline extensions (IPEs) with those types of sources, since you can perform all necessary data enhancements/refinements prior to sending the items to the DPM.
Although they also rely on the Push API, you should consider Sitecore sources as “standard” sources for the purpose of applying indexing techniques.
Resolve Foreign Keys When Indexing Database Content
This section assumes that the reader has some technical knowledge of relational databases.
Prior to, or while indexing content from a database, you should resolve all foreign keys targeting the data and metadata you want to retrieve.
If you’re using a Push source (see Push Source), your own code is responsible for executing the required statements, and formatting the data before sending it to the indexing pipeline using the Push API (see Push API).
If you’re using a Database source (see Database Source), its XML configuration allows you to specify SQL statements to execute during the crawling stage of the indexing pipeline (see Complement Information Retrieval Using Subqueries).
You want to index items representing books from a normalized database. You select a single row from the book table and analyze the data.
You decide it would be relevant to index metadata about the publisher and authors of each book. You query the publisher table to resolve the previously retrieved publisher_id foreign key.
The publisher name is relevant, but you choose not to resolve the country_id foreign key for book items.
A book can have many authors, and an author can have written many books. In the normalized database, this many-to-many relationship is represented in the books_authors junction table. You query that table to list the authors of the previously retrieved book. You can now resolve the author_id foreign key of each row. You query the author table accordingly.
You decide to concatenate the relevant values of each row (the author’s name, birth year, and deceased year, if any) into a string, and then concatenate those strings together, separating them with the ; character.
The denormalized data/metadata for a single book item now looks like this:
“9781501192272” | “The Talisman” | … | “Gallery Books” | “Stephen King (b. 1947);Peter Straub (b. 1943)”
In this context, the @authors field would likely be a multi-value field using ; as a tokenizer (see Multi-Value Fields).
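The foreign-key resolution and concatenation steps above can be sketched with an in-memory SQLite database. The table and column names below are assumptions inferred from the example:

```python
# Minimal sketch of denormalizing book data before indexing.
# Schema and column names are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT,
                       publisher_id INTEGER REFERENCES publisher(id));
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, born INTEGER);
    CREATE TABLE books_authors (isbn TEXT, author_id INTEGER);

    INSERT INTO publisher VALUES (1, 'Gallery Books');
    INSERT INTO book VALUES ('9781501192272', 'The Talisman', 1);
    INSERT INTO author VALUES (1, 'Stephen King', 1947),
                              (2, 'Peter Straub', 1943);
    INSERT INTO books_authors VALUES ('9781501192272', 1),
                                     ('9781501192272', 2);
""")

# Resolve the publisher_id foreign key, walk the junction table, and
# concatenate one "Name (b. YYYY)" string per author, separated by ';'.
row = cur.execute("""
    SELECT b.isbn, b.title, p.name,
           GROUP_CONCAT(a.name || ' (b. ' || a.born || ')', ';')
    FROM book b
    JOIN publisher p ON p.id = b.publisher_id
    JOIN books_authors ba ON ba.isbn = b.isbn
    JOIN author a ON a.id = ba.author_id
    GROUP BY b.isbn
""").fetchone()

# row now holds the denormalized data for a single book item;
# the last column is the multi-value authors string.
print(row)
```

With a Push source, your own code would run such queries; with a Database source, equivalent statements go in the source’s XML configuration as subqueries.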
Define Custom Mapping Rules to Populate Fields
Most standard sources have a default mapping configuration that gets applied at the mapping stage of the indexing pipeline (see Mapping). You can also create your own custom mapping rules to populate standard or custom fields as needed (see Manage Source Mappings). Among other things, fields can be leveraged in result templates, facets, and query ranking expressions (QREs) (see About Fields).
Push sources have a peculiar mapping behavior (see About Push Source Item Metadata).
The mapping rule syntax allows you to reference metadata values at any origin, and concatenate one or more of those values together with arbitrary strings (see Mapping Rule Syntax Reference).
In a given source, you want to use three metadata to populate the @fullname field. You define the following @fullname mapping rule for that source: %[last_name], %[first_name] (%[personal_title]).
This populates the @fullname field with values such as Smith, Alice (Mrs.) and Jones, Bob (Mr.).
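To illustrate how such a rule resolves, here’s a hypothetical Python simulation of the %[metadata] placeholder substitution. This is an illustration only, not Coveo’s actual implementation:

```python
# Hypothetical simulation of how a mapping rule resolves %[key]
# placeholders against an item's metadata.
import re

def apply_mapping_rule(rule, metadata):
    """Replace each %[key] placeholder with the item's metadata value
    (empty string when the metadata is absent)."""
    return re.sub(r"%\[(\w+)\]",
                  lambda m: str(metadata.get(m.group(1), "")),
                  rule)

rule = "%[last_name], %[first_name] (%[personal_title])"
print(apply_mapping_rule(rule, {"last_name": "Smith",
                                "first_name": "Alice",
                                "personal_title": "Mrs."}))
# -> Smith, Alice (Mrs.)
```

The same substitution principle applies to body mapping rules, where the placeholders are embedded in an HTML template instead of a short string.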
When indexing content that doesn’t have an actual body, such as items retrieved from a database, you can also define mapping rules to populate the body field with an assembly of relevant metadata (see Add or Edit a Body Mapping).
In a Database source indexing book items, you want to assemble several metadata to generate HTML item bodies.
You define the following body mapping rule for that source:
<html>
  <body>
    <p><span>%[title]</span><span> - </span><span>%[author]</span></p>
    <img src='%[image]'>
    <p><b>%[summary]</b></p>
    <p><em>%[review]</em></p>
  </body>
</html>
Populate Standard and Source-Specific Fields
Provisioning an organization (i.e., creating its very first source) automatically generates a set of standard fields in the index (e.g., @filetype). Those generic fields are intended to be populated with item metadata from various sources.
Creating a source based on a given connector for the first time in an organization also typically generates a set of prefixed fields meant to be populated with item metadata from sources based on the same connector only.
You create a YouTube source for the first time in your organization. This generates a set of yt-prefixed fields in your index (e.g., @ytviewcount). Those fields are intended to be populated with item metadata from YouTube sources only.
Create and Populate Custom Fields
A given field may only contain values of a single type (integer, string, decimal, or date) determined at creation time. A field also has a set of customizable options (e.g., facet, sortable, displayable in results, etc.).
If necessary, you can modify existing fields, or create your own custom fields (see Add or Edit a Field).
Fields are index-wide containers. This implies that mapping configurations populating a given field should follow the same semantics across all sources (e.g., a field whose purpose is to uniquely identify an item should uniquely identify items across all sources populating it).
For example, you create a custom field called @mynumberofviews.
In one source, you create a mapping rule to populate this field with metadata representing the number of times an item was opened since its creation.
In another source, you create a mapping rule to populate the same field with metadata representing the number of times an item was opened since its last modification.
As a result, the @mynumberofviews field contains inconsistent data across those two sources, making it unreliable.
Further Refine/Enhance Content Using Indexing Pipeline Extensions
While several connectors offer a high degree of crawling flexibility, others only offer limited crawling options, or simply don’t support any means of setting their crawling scope, or refining/enhancing crawled content.
In such cases, you may want to consider using indexing pipeline extensions (IPEs) to alter or reject candidate items going through the DPM (see Applying Extensions). IPEs can complicate and slow the indexing process down, but they can also prevent undesired content from reaching your index. As such, using IPEs only when required is an advisable indexing technique.
While IPEs allow you to modify crawled items, you can’t use IPEs to generate new items.
Dropbox sources have no configuration options for scoping the crawling; you may therefore consider using IPEs to reject unwanted candidate items with those sources.
Khoros Community sources expose several options for setting their crawling scope. However, you could use IPEs with such sources to modify the data or metadata of certain candidate items.
Use Pre-Conversion IPEs
Pre-conversion IPEs are executed on candidate items before they reach the processing stage of the indexing pipeline (see Processing). You will typically use pre-conversion IPEs to:
Reject undesired candidate items that couldn’t be filtered out earlier.
Modify raw candidate item data, and ensure that those modifications are taken into account when the processing stage occurs.
In the configuration of a source, you decide to include pre-conversion IPEs to:
Reject outdated pages (e.g., pages that have not been modified in five years).
Reject confidential pages (e.g., pages whose URL query string contains a parameter identifying them as confidential).
Append a disclaimer section to specific pages (e.g., pages whose URL query string contains a parameter requiring the disclaimer).
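As a sketch, the rejection logic of such pre-conversion IPEs could look like the following. IPEs are Python scripts that operate on a document object provided by the platform; the stub class, its method names, and the confidential=true URL marker below are assumptions made so the example is self-contained.

```python
# Sketch of a pre-conversion IPE rejecting outdated and confidential
# pages. StubDocument only approximates the real `document` interface.
from datetime import datetime, timedelta

class StubDocument:
    def __init__(self, uri, last_modified):
        self.uri = uri
        self.last_modified = last_modified
        self.rejected = False

    def reject(self):
        self.rejected = True

def pre_conversion(document, now=None):
    now = now or datetime.now()
    # Reject pages untouched for roughly five years or more.
    if now - document.last_modified >= timedelta(days=5 * 365):
        document.reject()
    # Reject pages flagged as confidential in the URL query string
    # ("confidential=true" is a hypothetical marker).
    elif "confidential=true" in document.uri:
        document.reject()

old = StubDocument("https://example.com/a?id=1", datetime(2010, 1, 1))
secret = StubDocument("https://example.com/b?confidential=true",
                      datetime.now())
fresh = StubDocument("https://example.com/c?id=2", datetime.now())
for d in (old, secret, fresh):
    pre_conversion(d)
print([d.rejected for d in (old, secret, fresh)])  # [True, True, False]
```

Because these items are rejected before the processing stage, they never consume conversion resources.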
Use Post-Conversion IPEs
Post-conversion IPEs are executed on candidate items after the processing and mapping stages of the indexing pipeline (see Processing and Mapping). Therefore, those IPEs have access to the fully converted candidate item. You will typically use post-conversion IPEs to:
Modify candidate item metadata.
Adding a new metadata through an IPE is only useful if:
The source mapping configuration already contains a rule that deals with the added metadata, in which case, this rule applies at the indexing stage of the indexing pipeline (see Indexing), or
The index contains a field whose name is identical to the added metadata key (in which case the field is automatically populated with the metadata value at the indexing stage).
Interact with converted candidate item bodies.
List metadata values at different origins.
In the configuration of a source, you decide to include post-conversion IPEs to:
Normalize metadata values (e.g., convert author name values to a consistent format such as Dostoevsky, Fyodor M.).
Call a third-party service to get sentiment analysis on the content of each page (e.g., MeaningCloud).
Discover metadata retrieved by the source, along with their values at previous stages (e.g., after the crawling stage, after a specific pre-conversion IPE has been applied, etc.).
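A metadata normalization step like the one above could be sketched as follows. The stubbed document interface approximates the metadata get/add operations a real post-conversion IPE has access to; the exact method names are assumptions.

```python
# Sketch of a post-conversion IPE normalizing an author metadata value.
# StubDocument only approximates the real `document` interface.

class StubDocument:
    def __init__(self, metadata):
        self._meta = dict(metadata)

    def get_meta_data_value(self, name):
        return self._meta.get(name)

    def add_meta_data(self, values):
        self._meta.update(values)

def normalize_author(document):
    """Rewrite 'First [Middle] Last' author values as 'Last, First [Middle]'."""
    raw = document.get_meta_data_value("author")
    if raw and "," not in raw:
        parts = raw.split()
        normalized = f"{parts[-1]}, {' '.join(parts[:-1])}"
        document.add_meta_data({"author": normalized})

doc = StubDocument({"author": "Fyodor M. Dostoevsky"})
normalize_author(doc)
print(doc.get_meta_data_value("author"))  # Dostoevsky, Fyodor M.
```

Running such a rule after the mapping stage guarantees the normalized value is what facets and sorting operate on.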
The next article in this section offers guidelines on scheduling and triggering source updates, and monitoring the overall indexing process (see Keeping an Up-To-Date Index).