Apply indexing techniques

Keeping a clean index should be a top priority throughout the entire lifecycle of your search solution. A clean index yields faster, more relevant search, spares you from maintaining complex rules to filter out useless data at query time, and helps ensure that your search solution doesn’t exceed the maximum number of items allowed by its underlying Coveo organization license.

The first steps are to:

  • Locate content.

  • Select appropriate source types.

  • Determine what data or metadata to index from each content repository.

Once this is done, there are several indexing techniques that you can use to refine or enhance content both prior to and during the indexing process. Depending on the connector a given source is based on, you may have to apply one or more of those techniques.

As a rule, the earlier you can make content changes, the better. The following diagram illustrates the various indexing techniques as the optional steps of a funnel-like process.

This article explains each of these techniques.

[Diagram: Indexing techniques funnel]

Refine or enhance original content directly in its repository

Crawling messy or incomplete content may require substantial source configuration efforts, and may consume significant indexing time and computing resources.

Whenever possible, make changes in the original content repository before applying any other indexing techniques.

Example

You plan to use a Sitemap source to index pages from an internally managed website.

While determining what data and metadata you require on each item, you realize that the HTML of the pages you want to index lacks some meta tags you need, and that some current meta tags don’t have human-readable name or content values.

You contact the team or person responsible for the internal website, and request that they make the necessary changes.

Set the crawling scope and refine or enhance crawled content

Several source connectors offer inclusion, exclusion, or filter configuration options to set their crawling scope, and possibly refine or enhance their crawled content.

If you can discard superfluous data and make necessary data alterations before the content even reaches the document processing manager (DPM), the indexing process becomes both lighter and easier to troubleshoot.

Note

You can also create crawling filters directly in the JSON configuration of a source.

Fully leverage the crawling flexibility of Push API-based sources

Push sources, and sources relying on the Coveo Crawling Module, offer greater crawling flexibility than other types of sources (at the cost of increased configuration complexity).

  • With a Push source, you have complete control over the crawling process. When using such a source, you should therefore ensure that the content you send to the indexing pipeline is entirely relevant, and doesn’t require any further enhancement or refinement (see the sketch after this list).

  • A source that relies on the Coveo Crawling Module lets you use pre-push extensions to alter or reject items before they reach the DPM.
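
For instance, here’s a minimal Python sketch that pushes a single, already-refined item to a Push source. The organization ID, source ID, API key, and metadata keys are placeholders, and the exact payload options are described in the Push API reference:

import requests

# Placeholder identifiers and credentials; replace with your own values.
ORG_ID = "myorg"
SOURCE_ID = "myorg-mypushsourceid"
API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

document_id = "https://example.com/kb/article-42"

# Since a Push source gives you full control over crawling, the payload is expected
# to contain only relevant, already refined data and metadata.
payload = {
    "title": "Troubleshooting guide",
    "data": "<html><body><p>Fully refined HTML body</p></body></html>",
    "author": "Alice Smith",             # metadata to be mapped to index fields
    "documenttype": "KnowledgeArticle",
}

response = requests.put(
    f"https://api.cloud.coveo.com/push/v1/organizations/{ORG_ID}/sources/{SOURCE_ID}/documents",
    params={"documentId": document_id},
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
response.raise_for_status()  # The Push API replies with 202 Accepted when the item is queued.

With a source that relies on the Coveo Crawling Module, the same kind of filtering or enrichment logic would instead live in a pre-push extension, so that items are already clean by the time they’re pushed.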

Important

The above statements don’t imply that you should use Push API-based sources all the time. In fact:

  • You should only use a source that relies on the Coveo Crawling Module to index content behind a firewall.

  • Using a Push source should be a last resort, when no other specific or generic connector suits your needs.

The following diagram shows the funnel-like process of applying indexing techniques when using Push API-based sources.

[Diagram: Indexing techniques funnel for Push API-based sources]

As you can see, you should typically not have to use indexing pipeline extensions (IPEs) with those types of sources, since you can perform all necessary data enhancements or refinements prior to sending the items to the DPM.

Note

Although they also rely on the Push API, you should consider Sitecore sources as "standard" sources for the purpose of applying indexing techniques.

Denormalize data

Note

This section assumes that the reader has some technical knowledge about relational databases.

Before or while indexing content from a database, you should resolve all foreign keys targeting the data and metadata you want to retrieve.

  • If you’re using a Push source, your own code is responsible for executing the required statements and formatting the data before sending it to the indexing pipeline using the Push API.

  • If you’re using a Database source, its XML configuration lets you specify SQL statements to execute during the crawling stage of the indexing pipeline.

Example

You want to index items representing books from a normalized database. You select a single row from the book table and analyze the data.

book_id        title           summary  publisher_id
9781501192272  "The Talisman"  …        123

You decide it would be relevant to index metadata about the publisher and authors of each book. You query the publisher table to resolve the previously retrieved publisher_id foreign key.

publisher_id  name             country_id
123           "Gallery Books"  789

The publisher name is relevant, but you choose not to resolve the country_id foreign key for book items.

A book can have many authors, and an author can have written many books. In the normalized database, this many-to-many relationship is represented in the books_authors junction table. You query that table to list the authors of the previously retrieved book_id.

author_id  book_id
456        9781501192272
789        9781501192272

You can now resolve the author_id foreign key. You query the author table accordingly.

author_id  last_name  first_name  born  deceased
456        "King"     "Stephen"   1947  null
789        "Straub"   "Peter"     1943  null

You decide to combine the first_name, last_name, and born values of each row into a single string (the deceased value being null for both authors), and then concatenate those strings together, separating them with the ; character.

The denormalized data and metadata for a single book item now looks like this:

isbn             title           summary  publisher        authors
"9781501192272"  "The Talisman"  …        "Gallery Books"  "Stephen King (b.1947);Peter Straub (b.1943)"

You choose to map the summary to the item body field, and all other metadata to fields with corresponding names in your index (see Define custom mapping rules to populate fields).

Note

In this context, the @authors field would likely be a multi-value field which uses ; as a tokenizer.
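
The denormalized row above can typically be produced with a single query that resolves the foreign keys and aggregates the author rows. Here’s a minimal sketch using Python and SQLite; the table and column names match the example, but the exact statements depend on your database engine and schema (the deceased column is ignored here since both values are null):

import sqlite3

# Assumes a SQLite database whose schema matches the example above:
# book(book_id, title, summary, publisher_id), publisher(publisher_id, name, country_id),
# author(author_id, last_name, first_name, born, deceased), books_authors(author_id, book_id).
conn = sqlite3.connect("books.db")

query = """
SELECT
    b.book_id  AS isbn,
    b.title,
    b.summary,
    p.name     AS publisher,
    GROUP_CONCAT(a.first_name || ' ' || a.last_name || ' (b.' || a.born || ')', ';') AS authors
FROM book b
JOIN publisher p      ON p.publisher_id = b.publisher_id
JOIN books_authors ba ON ba.book_id     = b.book_id
JOIN author a         ON a.author_id    = ba.author_id
GROUP BY b.book_id, b.title, b.summary, p.name;
"""

for isbn, title, summary, publisher, authors in conn.execute(query):
    # Each row is now a denormalized book item, ready to be mapped to index fields, for example:
    # ('9781501192272', 'The Talisman', '…', 'Gallery Books', 'Stephen King (b.1947);Peter Straub (b.1943)')
    print(isbn, title, publisher, authors)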

Define custom mapping rules to populate fields

Most standard sources have a default mapping configuration that gets applied at the mapping stage of the indexing pipeline. You can also create your own custom mapping rules to populate standard or custom fields as needed. Among other things, fields can be leveraged in result templates, facets, and query ranking expressions (QREs).

Note

Push sources have a peculiar mapping behavior.

The mapping rule syntax lets you reference metadata values at any origin, and concatenate one or more of those values together with arbitrary strings.

Example

In a given source, you want to use three metadata to populate the @fullname field:

  • personal_title

  • first_name

  • last_name

You define the following @fullname mapping rule for that source: %[last_name], %[first_name] (%[personal_title]).

This populates the @fullname field with values such as: Smith, Alice (Mrs.) and Jones, Bob (Mr.).
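
Coveo resolves the %[...] references for you at the mapping stage, so there’s nothing to implement yourself, but the following Python sketch (not Coveo code) roughly emulates how such a rule expands against item metadata, just to make the behavior concrete:

import re

def apply_mapping_rule(rule, metadata):
    # Roughly emulate %[key] substitution (origin-specific references aren't handled here).
    return re.sub(r"%\[(\w+)\]", lambda match: str(metadata.get(match.group(1), "")), rule)

rule = "%[last_name], %[first_name] (%[personal_title])"

print(apply_mapping_rule(rule, {"personal_title": "Mrs.", "first_name": "Alice", "last_name": "Smith"}))
# Smith, Alice (Mrs.)
print(apply_mapping_rule(rule, {"personal_title": "Mr.", "first_name": "Bob", "last_name": "Jones"}))
# Jones, Bob (Mr.)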

Note

When indexing content that doesn’t have an actual body, such as items retrieved from a database, you can also define mapping rules to populate the body field with an assembly of relevant metadata.

Example

In a Database source indexing book items, you want to assemble several metadata to generate HTML item bodies.

You define the following body mapping rule for that source:

<html>
  <body>
    <p><span>%[title]</span><span> - </span><span>%[author]</span></p>
    <img src='%[image]'>
    <p><b>%[summary]</b></p>
    <p><em>%[review]</em></p>
  </body>
</html>

Populate standard and source-specific fields

Provisioning an organization (that is, creating its very first source) automatically generates a set of standard fields in the index (for example, @author, @date, @filetype, etc.). Those generic fields are intended to be populated with item metadata from various sources.

Creating a source based on a given connector for the first time in an organization also typically generates a set of prefixed fields meant to be populated with item metadata from sources based on the same connector only.

Example

You create a YouTube source for the first time in your organization. This generates a set of yt-prefixed fields in your index (for example, @ytcategory, @ytchanneltitle, @ytviewcount, etc.). These fields are only intended to be populated with item metadata from YouTube sources.

Create and populate custom fields

A given field may only contain values of a single type (integer, string, decimal, or date) determined at creation time. A field also has a set of customizable options (for example, facet, sortable, displayable in results, etc.).
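
Custom fields are usually created in the Administration Console, but they can also be created programmatically. The following Python sketch is one possible approach; the endpoint, payload keys, and credential values are assumptions to verify against the Field management API reference:

import requests

# Placeholder organization ID and API key.
ORG_ID = "myorg"
API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Assumed payload: a field has a single type chosen at creation time,
# plus customizable options such as facet, sortable, and displayable in results.
field = {
    "name": "mynumberofviews",   # referenced as @mynumberofviews in the index
    "type": "LONG",              # integer-type field
    "facet": False,
    "sort": True,
    "includeInResults": True,
}

response = requests.post(
    f"https://platform.cloud.coveo.com/rest/organizations/{ORG_ID}/indexes/fields",
    json=field,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
response.raise_for_status()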

Important

Fields are index-wide containers. This implies that mapping configurations populating a given field should follow the same semantics across all sources (for example, a field whose purpose is to uniquely identify an item should uniquely identify items across all sources populating it).

For example, you create a custom field called @mynumberofviews:

  • In one source, you create a mapping rule to populate this field with metadata representing the number of times an item was opened since its creation.

  • In another source, you create a mapping rule to populate the same field with metadata representing the number of times an item was opened since its last modification.

Therefore, the @mynumberofviews field contains inconsistent data across those two sources, making it unreliable.

Further refine or enhance content using indexing pipeline extensions

While several connectors offer a high degree of crawling flexibility, others only offer limited crawling options, or simply don’t support any means of setting their crawling scope, or refining or enhancing crawled content.

In such cases, you may want to consider using indexing pipeline extensions (IPEs) to alter or reject candidate items going through the DPM. IPEs can complicate and slow down the indexing process, but they can also prevent undesired content from reaching your index. As such, using IPEs only when required is an advisable indexing technique.

Note

While IPEs let you modify crawled items, you can’t use IPEs to generate new items.

Examples
  • Dropbox sources have no configuration options for setting their crawling scope, so you might consider using IPEs to reject unwanted candidate items with those sources.

  • Khoros Community sources expose several options for setting their crawling scope. However, you could use IPEs with such sources to modify the data or metadata of certain candidate items.

Use pre-conversion IPEs

Pre-conversion IPEs are executed on candidate items before they reach the processing stage of the indexing pipeline. You will typically use pre-conversion IPEs to:

  • Reject undesired candidate items that couldn’t be filtered out earlier.

  • Modify raw candidate item data, and ensure that those modifications are taken into account when the processing stage occurs.

Example

In the configuration of a source, you decide to include pre-conversion IPEs (sketched after this list) to:

  • Reject outdated pages (for example, pages that have not been modified in five years).

  • Reject confidential pages (for example, pages whose URL query string contains confidential=true).

  • Append a disclaimer section to specific pages (for example, pages whose URL query string contains pilotFeature=true).
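
As a rough illustration, the two rejection rules could be implemented in a pre-conversion IPE along the following lines. This is a sketch only: the document object is provided by the IPE runtime, and member names such as document.uri, document.get_meta_data_value, and document.reject, as well as the lastmodified metadata key, are assumptions to validate against the IPE documentation:

# Pre-conversion IPE sketch; 'document' is provided by the IPE runtime.
from datetime import datetime, timedelta
from urllib.parse import urlparse, parse_qs

query = parse_qs(urlparse(document.uri).query)

# Reject confidential pages (URL query string contains confidential=true).
if query.get("confidential", ["false"])[0].lower() == "true":
    document.reject()

# Reject outdated pages (not modified in five years); assumes a 'lastmodified'
# metadata value formatted as YYYY-MM-DD.
last_modified = document.get_meta_data_value("lastmodified")
if last_modified:
    modified_on = datetime.strptime(last_modified[0], "%Y-%m-%d")
    if datetime.utcnow() - modified_on > timedelta(days=5 * 365):
        document.reject()

The disclaimer example would instead involve modifying the item’s original data rather than rejecting it, which pre-conversion IPEs also allow.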

Use post-conversion IPEs

Post-conversion IPEs are executed on candidate items after the processing and mapping stages of the indexing pipeline. Therefore, these IPEs have access to the fully converted candidate item. You will typically use post-conversion IPEs to:

  • Modify candidate item metadata.

    Note

    Adding new metadata through an IPE is only useful if one of the following applies:

    • The source mapping configuration already contains a rule that deals with the added metadata. In this case, the rule applies at the indexing stage of the indexing pipeline.

    • The index contains a field whose name is identical to the added metadata key. In this case, the field is automatically populated with the metadata value at the indexing stage.

  • Interact with converted candidate item bodies.

  • List metadata values at different origins.

    Examples

    In the configuration of a source, you decide to include post-conversion IPEs (a sketch follows this list) to:

    • Normalize metadata values (for example, convert values such as dostoevsky|FyodorMikhailovich to Dostoevsky, Fyodor M.).

    • Call a third-party service to get sentiment analysis on the content of each page (for example, MeaningCloud).

    • Discover metadata retrieved by the source, along with their values at previous stages (for example, after the crawling stage, after a specific pre-conversion IPE has been applied, etc.).
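
As a rough illustration, the metadata normalization above could be handled by a post-conversion IPE along these lines. As in the previous sketch, the document helpers (get_meta_data_value, add_meta_data) are assumed to match the IPE scripting interface, and the author metadata key is hypothetical:

# Post-conversion IPE sketch; 'document' is provided by the IPE runtime.
import re

def normalize_author(raw):
    # Convert values such as 'dostoevsky|FyodorMikhailovich' to 'Dostoevsky, Fyodor M.'
    last, given_raw = raw.split("|", 1)
    given = re.findall(r"[A-Z][a-z]+", given_raw)  # 'FyodorMikhailovich' -> ['Fyodor', 'Mikhailovich']
    if not given:
        return last.capitalize()
    name = given[0] + "".join(" {}.".format(part[0]) for part in given[1:])
    return "{}, {}".format(last.capitalize(), name)

values = document.get_meta_data_value("author")
if values:
    normalized = [normalize_author(v) if "|" in v else v for v in values]
    document.add_meta_data({"author": normalized})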

What’s next?

The Keep an index up to date article offers guidelines on scheduling and triggering source updates, and monitoring the indexing process.