Apply indexing techniques
Keeping a clean index should be a top priority throughout the entire lifecycle of your search solution. A clean index yields faster, more relevant search, and prevents you from having to maintain complex rules to filter out useless data at query time. Keeping a clean index is also important to ensure that a search solution doesn’t exceed the maximum number of items allowed by its underlying Coveo organization license.
The first steps are to:
Once this is done, there are several indexing techniques that you can use to refine or enhance content both prior to and during the indexing process. Depending on the connector a given source is based on, you may have to apply one or more of those techniques.
As a rule, the earlier you can make content changes, the better. The following diagram illustrates the various indexing techniques as the optional steps of a funnel-like process.
This article explains each of these techniques.
Refine or enhance original content directly in its repository
Crawling messy or incomplete content may require substantial source configuration efforts, and may consume significant indexing time and computing resources.
If at all possible, you should consider making changes in the original content repository before applying any other indexing techniques.
You plan to use a Sitemap source to index pages from an internally managed web site.
While determining what data and metadata you require on each item, you realize that the HTML of the pages you want to index lacks some meta tags you need, and that some current meta tags don’t have human-readable name or content values.
You contact the team or person responsible for the internal web site, and request that they make the necessary changes.
Set the crawling scope and refine or enhance crawled content
Several source connectors offer inclusion, exclusion, or filter configuration options to set their crawling scope, and possibly refine or enhance their crawled content.
If you can discard superfluous data and make necessary data alterations before the content even reaches the document processing manager (DPM), the indexing process becomes both lighter and easier to troubleshoot.
- You can use the web scraping configuration of a Web or Sitemap source to specify that the crawler should ignore repetitive or irrelevant item data such as page headers and footers (see the configuration sketch following the note below).
- The Confluence Cloud source lets you specify which spaces and types of content to crawl.
- It’s possible to configure a Google Drive for Work source such that it only crawls content owned by specific users.
Note
You can also create crawling filters directly in the JSON configuration of a source.
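For instance, a web scraping configuration that excludes page headers and footers could look like the following sketch, shown here as a Python structure that serializes to the JSON you would paste into the source. The rule schema and CSS selectors are assumptions to adapt to your own pages and validate against the web scraping configuration reference.

import json

# Sketch of a web scraping configuration for a Web or Sitemap source.
# The rule schema and the CSS selectors are assumptions; adapt them to your pages.
web_scraping_config = [
    {
        # Apply this rule to every crawled page.
        "for": {"urls": [".*"]},
        # Ignore repetitive page elements such as headers and footers.
        "exclude": [
            {"type": "CSS", "path": "header"},
            {"type": "CSS", "path": "footer"},
        ],
    }
]

# Serialize to the JSON to paste into the source configuration.
print(json.dumps(web_scraping_config, indent=2))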
Fully leverage the crawling flexibility of Push API-based sources
Push sources, and sources relying on the Coveo Crawling Module, offer greater crawling flexibility than other types of sources (at the cost of increased configuration complexity).
- With a Push source, you have complete control over the crawling process. When using such a source, you should therefore ensure that the content you send to the indexing pipeline is entirely relevant, and doesn’t require any further enhancement or refinement.
- A source that relies on the Coveo Crawling Module lets you use pre-push extensions to alter or reject items before they reach the DPM.
The above statements don’t imply that you should use Push API-based sources all the time.
The following diagram shows the funnel-like process of applying indexing techniques when using Push API-based sources.
As you can see, you should typically not have to use indexing pipeline extensions (IPEs) with those types of sources, since you can perform all necessary data enhancements or refinements prior to sending the items to the DPM.
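For illustration, a minimal sketch of pushing a single, fully prepared item to a Push source could look like the following. The organization ID, source ID, API key, and item values are placeholders, and the payload keys other than data and fileExtension are ordinary item metadata.

import requests

# Hypothetical values: replace with your organization ID, source ID, and API key.
ORG_ID = "myorgid"
SOURCE_ID = "myorgid-mypushsourceid"
API_KEY = "xx-push-api-key"

# A fully prepared item: all scoping, cleanup, and enhancement happens in your
# own code before the item is sent to the indexing pipeline.
document_id = "https://example.com/books/9781501192272"
payload = {
    "title": "The Talisman",
    "fileExtension": ".html",
    "data": "<html><body><p>The Talisman - Stephen King; Peter Straub</p></body></html>",
    "authors": "Stephen King (b.1947);Peter Straub (b.1943)",
}

# Add or update a single document in the Push source.
response = requests.put(
    "https://api.cloud.coveo.com/push/v1/organizations/"
    + ORG_ID + "/sources/" + SOURCE_ID + "/documents",
    params={"documentId": document_id},
    headers={
        "Authorization": "Bearer " + API_KEY,
        "Content-Type": "application/json",
    },
    json=payload,
)
response.raise_for_status()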
Note
Although they also rely on the Push API, you should consider Sitecore sources as "standard" sources for the purpose of applying indexing techniques.
Denormalize data
Note
This section assumes that the reader has some technical knowledge about relational databases.
Before or while indexing content from a database, you should resolve all foreign keys targeting the data and metadata you want to retrieve.
- If you’re using a Push source, your own code is responsible for executing the required statements and formatting the data before sending it to the indexing pipeline using the Push API.
- If you’re using a Database source, its XML configuration lets you specify SQL statements to execute during the crawling stage of the indexing pipeline.
You want to index items representing books from a normalized database.
You select a single row from the book table and analyze the data.
book_id | title | summary | publisher_id
---|---|---|---
9781501192272 | "The Talisman" | … | 123
You decide it would be relevant to index metadata about the publisher and authors of each book.
You query the publisher table to resolve the previously retrieved publisher_id foreign key.
publisher_id | name | country_id
---|---|---
123 | "Gallery Books" | 789
The publisher name is relevant, but you choose not to resolve the country_id foreign key for book items.
A book can have many authors, and an author can have written many books.
In the normalized database, this many-to-many relationship is represented in the books_authors junction table. You query that table to list the authors of the previously retrieved book_id.
author_id | book_id
---|---
456 | 9781501192272
789 | 9781501192272
You can now resolve the author_id foreign key. You query the author table accordingly.
author_id | last_name | first_name | born | deceased
---|---|---|---|---
456 | "King" | "Stephen" | 1947 | null
789 | "Straub" | "Peter" | 1943 | null
You decide to concatenate the last_name, first_name, born, and deceased values of each row into a string, and then concatenate those strings together, separating them with the ; character.
The denormalized data and metadata for a single book item now looks like this:
isbn | title | summary | publisher | authors
---|---|---|---|---
"9781501192272" | "The Talisman" | … | "Gallery Books" | "Stephen King (b.1947);Peter Straub (b.1943)"
You choose to map the summary to the item body field, and all other metadata to fields with corresponding names in your index (see Define custom mapping rules to populate fields).
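If you were feeding this content to a Push source, the denormalization could be scripted along the following lines. This is a minimal sketch assuming an SQLite copy of the database and the table and column names used above; adapt the connection and SQL dialect to your own database.

import sqlite3

# Illustrative connection; in practice you'd query your own database.
conn = sqlite3.connect("books.db")
cur = conn.cursor()

book_id = "9781501192272"

# Resolve the publisher_id foreign key while retrieving the book row.
cur.execute(
    "SELECT b.title, b.summary, p.name "
    "FROM book AS b JOIN publisher AS p ON p.publisher_id = b.publisher_id "
    "WHERE b.book_id = ?",
    (book_id,),
)
title, summary, publisher = cur.fetchone()

# Resolve the many-to-many book/author relationship through the junction table.
cur.execute(
    "SELECT a.first_name, a.last_name, a.born "
    "FROM books_authors AS ba JOIN author AS a ON a.author_id = ba.author_id "
    "WHERE ba.book_id = ?",
    (book_id,),
)
# Concatenate each author's values, then join them with the ';' separator,
# yielding "Stephen King (b.1947);Peter Straub (b.1943)".
authors = ";".join(
    "{} {} (b.{})".format(first, last, born) for first, last, born in cur.fetchall()
)

denormalized_item = {
    "isbn": book_id,
    "title": title,
    "summary": summary,
    "publisher": publisher,
    "authors": authors,
}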
Define custom mapping rules to populate fields
Most standard sources have a default mapping configuration that gets applied at the mapping stage of the indexing pipeline. You can also create your own custom mapping rules to populate standard or custom fields as needed. Among other things, fields can be leveraged in result templates, facets, and query ranking expressions (QREs).
Note
Push sources have a peculiar mapping behavior.
The mapping rule syntax lets you reference metadata values at any origin, and concatenate one or more of those values together with arbitrary strings.
In a given source, you want to use three metadata to populate the @fullname field:
- personal_title
- first_name
- last_name
You define the following @fullname mapping rule for that source: %[last_name], %[first_name] (%[personal_title]).
This populates the @fullname field with values such as Smith, Alice (Mrs.) and Jones, Bob (Mr.).
Note
When indexing content that doesn’t have an actual body, such as items retrieved from a database, you can also define mapping rules to populate the item body.
In a Database source indexing book items, you want to assemble several metadata to generate HTML item bodies.
You define the following body mapping rule for that source:
<html>
<body>
<p><span>%[title]</span><span> - </span><span>%[author]</span></p>
<img src='%[image]'>
<p><b>%[summary]</b></p>
<p><em>%[review]</em></p>
</body>
</html>
Populate standard and source-specific fields
Provisioning an organization (that is, creating its very first source) automatically generates a set of standard fields in the index (for example, @author, @date, @filetype, etc.).
Those generic fields are intended to be populated with item metadata from various sources.
Creating a source based on a given connector for the first time in an organization also typically generates a set of prefixed fields meant to be populated with item metadata from sources based on the same connector only.
You create a YouTube source for the first time in your organization.
This generates a set of yt-prefixed fields in your index (for example, @ytcategory, @ytchanneltitle, @ytviewcount, etc.). These fields are only intended to be populated with item metadata from YouTube sources.
Create and populate custom fields
A given field may only contain values of a single type (integer, string, decimal, or date) determined at creation time. A field also has a set of customizable options (for example, facet, sortable, displayable in results, etc.).
If necessary, you can modify existing fields or create your own custom fields.
Fields are index-wide containers. This implies that mapping configurations populating a given field should follow the same semantics across all sources (for example, a field whose purpose is to uniquely identify an item should uniquely identify items across all sources populating it).
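For example, a custom field could be created programmatically through the Fields API, as in the following sketch. The organization ID and API key are placeholders, and the field model properties shown are minimal assumptions to verify against the Fields API reference.

import requests

# Hypothetical organization ID and API key allowed to edit fields.
ORG_ID = "myorgid"
API_KEY = "xx-api-key"

# Minimal field model (property names are assumptions); the type is fixed at
# creation time, while options such as facet can be adjusted as needed.
field = {
    "name": "authors",
    "type": "STRING",
    "facet": True,
}

response = requests.post(
    "https://platform.cloud.coveo.com/rest/organizations/" + ORG_ID + "/indexes/fields",
    headers={
        "Authorization": "Bearer " + API_KEY,
        "Content-Type": "application/json",
    },
    json=field,
)
response.raise_for_status()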
Further refine or enhance content using indexing pipeline extensions
While several connectors offer a high degree of crawling flexibility, others only offer limited crawling options, or simply don’t support any means of setting their crawling scope, or refining or enhancing crawled content.
In such cases, you may want to consider using indexing pipeline extensions (IPEs) to alter or reject candidate items going through the DPM. Using IPEs can complicate and slow down the indexing process, but they can also prevent undesired content from reaching your index. As such, using IPEs only when required is an advisable indexing technique.
Note
While IPEs let you modify crawled items, you can’t use IPEs to generate new items.
- Dropbox sources have no configuration options for scoping the crawling, so you might consider using IPEs to reject unwanted candidate items with those sources.
- Khoros Community sources expose several options for setting their crawling scope. However, you could use IPEs with such sources to modify the data or metadata of certain candidate items.
Use pre-conversion IPEs
Pre-conversion IPEs are executed on candidate items before they reach the processing stage of the indexing pipeline. You will typically use pre-conversion IPEs to:
- Reject undesired candidate items that couldn’t be filtered out earlier.
- Modify raw candidate item data, and ensure that those modifications are taken into account when the processing stage occurs.
In the configuration of a source, you decide to include pre-conversion IPEs to (a sketch follows this list):

- Reject outdated pages (for example, pages that have not been modified in five years).
- Reject confidential pages (for example, pages whose URL query string contains confidential=true).
- Append a disclaimer section to specific pages (for example, pages whose URL query string contains pilotFeature=true).
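A pre-conversion IPE script implementing the first two rejections could look like the following sketch. The document object is provided by the IPE environment, and the script assumes that the last modification date is exposed under a metadata key named date as an ISO 8601 string.

from datetime import datetime, timedelta
from urllib.parse import urlparse, parse_qs

# Reject confidential pages based on the URL query string.
query = parse_qs(urlparse(document.uri).query)
if query.get("confidential") == ["true"]:
    document.reject()

# Reject outdated pages, assuming the last modification date is exposed
# under a metadata key named "date" as an ISO 8601 string.
date_values = document.get_meta_data_value("date")
if date_values:
    last_modified = datetime.fromisoformat(str(date_values[0]))
    if datetime.now() - last_modified > timedelta(days=5 * 365):
        document.reject()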
Use post-conversion IPEs
Post-conversion IPEs are executed on candidate items after the processing and mapping stages of the indexing pipeline. Therefore, these IPEs have access to the fully converted candidate item. You will typically use post-conversion IPEs to:
- Modify candidate item metadata.

  Note
  Adding new metadata through an IPE is only useful if one of the following applies:

  - The source mapping configuration already contains a rule that deals with the added metadata. In this case, the rule applies at the indexing stage of the indexing pipeline.
  - The index contains a field whose name is identical to the added metadata key. In this case, the field is automatically populated with the metadata value at the indexing stage.

- Interact with converted candidate item bodies.
- List metadata values at different origins.
Examples
In the configuration of a source, you decide to include post-conversion IPEs to (a sketch of the first example follows this list):

- Normalize metadata values (for example, convert values such as dostoevsky|FyodorMikhailovich to Dostoevsky, Fyodor M.).
- Call a third-party service to get sentiment analysis on the content of each page (for example, MeaningCloud).
- Discover metadata retrieved by the source, along with their values at previous stages (for example, after the crawling stage, after a specific pre-conversion IPE has been applied, etc.).
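As a sketch of the first example, a post-conversion IPE normalizing author values could look like the following. The authorname metadata key is hypothetical, and the document object is provided by the IPE environment.

import re

# Normalize values such as "dostoevsky|FyodorMikhailovich" into
# "Dostoevsky, Fyodor M."; the "authorname" metadata key is hypothetical.
normalized = []
for raw in document.get_meta_data_value("authorname"):
    last, _, first = str(raw).partition("|")
    # Split the camel-cased given names, then abbreviate all but the first.
    words = re.findall(r"[A-Z][a-z]*", first) or [first]
    given = " ".join([words[0]] + [w[0] + "." for w in words[1:]])
    normalized.append(last.capitalize() + ", " + given)

if normalized:
    document.add_meta_data({"authorname": normalized})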
What’s next?
The Keep an index up to date article offers guidelines on scheduling and triggering source updates, and monitoring the indexing process.