Customize the Indexing Process
When you add a source, certain item types are selected for indexing by default. The available item types vary from source to source. In the source JSON configuration, you can change which item types or file extensions are indexed so that the index contains only the content you need.
This practice is useful when indexing large sources such as SharePoint, Dropbox, or Google Drive. Such sources may contain items that aren’t worth indexing because they wouldn’t contribute to a relevant search experience. Fine-tuning which items are indexed retrieves only the necessary content and makes source updates faster and more efficient.
Edit Index Items
Edit Item Types
Coveo crawlers can index the content of many items of various formats and sizes. However, by default, they don’t index certain formats or very large items. To include large or unsupported items in the search results, Coveo sources index these items by reference, which means that a source only contains their file information, such as URI, file name, and other metadata.
You index your company Dropbox account, which contains a Microsoft Publisher item (`.pub`) created by a user named John Smith.
Because Publisher file content isn’t indexed by default for this source, the item is indexed by reference.
The Dropbox source includes the following metadata:
- filename: Retirement_Announcement_Letter
- title: Early Retirement
- author: John Smith
- date of last modification: April 1, 2017
- URI: www.dropbox.com/yourcompany/jsmith/foldername/RetirementAnnouncementLetter
After the item is indexed, John Smith queries `retirement letter` to retrieve it.
Because his query contains keywords that match the above metadata, the item appears in the search results.
He can then review the item metadata in the search interface or use the URI to open it directly from his Dropbox folder.
However, if John Smith uses keywords in his query that match the content of the item rather than its metadata (for example, `dear colleagues`), the file doesn’t appear in the search results.
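The difference between the two queries can be illustrated with a toy model. The sketch below isn't how Coveo matches or ranks queries. It only assumes that a query matches when every one of its keywords appears in the indexed text, and the `tokens` and `matches` helpers and the one-line item body are hypothetical.

```python
import re

def tokens(text):
    """Split text into a set of lowercase alphanumeric keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(query, indexed_text):
    """Toy rule (assumption): every query keyword must appear in the indexed text."""
    return tokens(query) <= tokens(indexed_text)

# Metadata taken from the example above.
metadata = {
    "filename": "Retirement_Announcement_Letter",
    "title": "Early Retirement",
    "author": "John Smith",
}

# Hypothetical item body; the real content of the file isn't known.
body = "Dear colleagues, ..."

# Indexed by reference: only the metadata is searchable.
by_reference = " ".join(metadata.values())
# Indexed by content and metadata: the body is searchable too.
by_content = by_reference + " " + body

print(matches("retirement letter", by_reference))  # metadata keywords
print(matches("dear colleagues", by_reference))    # body keywords, metadata only
print(matches("dear colleagues", by_content))      # body keywords, body indexed
```

In this model, only the first and third queries match, which mirrors the behavior described above: body keywords find the item only if its content was indexed.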
Edit Item Extensions
By default, each connector is configured to index several item types, identified by their file extensions, that can be found in the source. In the source JSON configuration, you can see the list of supported file extensions and the associated settings that determine how the source processes them. You can then modify which items Coveo indexes when crawling the repository.
By default, an Amazon S3 source indexes many item types.
You index an Amazon S3 bucket that contains both `.html` and `.pdf` files, but you only want to index `.html` files.
In the `documentConfig` section of the source JSON configuration, you identify the `extensions` sub-section containing the `.pdf` extension type, change the `action` and `actionOnError` values from `Retrieve` to `Ignore`, and then rebuild your source to reject `.pdf` files.
{
  "extensions": [
    ".pdf"
  ],
  "extensionSetting": {
    "action": "Ignore",
    "actionOnError": "Ignore",
    "converter": "Detect",
    "useContentType": false,
    "indexContainer": true,
    "fileTypeValue": "",
    "generateThumbnail": true,
    "useExternalHTMLGenerator": false,
    "convertDirectlyToHtml": false
  }
}
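If you edit many extensions, the same change can be scripted against a local copy of the configuration. The following is a minimal sketch, assuming the extension entries are available as a list of objects shaped like the fragment above. The `set_extension_action` helper is hypothetical and isn't part of any Coveo tool or API, and how the list is nested inside `documentConfig` isn't shown here.

```python
import copy
import json

VALID_ACTIONS = {"Retrieve", "Reference", "Ignore"}

def set_extension_action(entries, extension, action, action_on_error):
    """Return a copy of `entries` in which every entry listing `extension`
    has its action and actionOnError values replaced."""
    if action not in VALID_ACTIONS or action_on_error not in VALID_ACTIONS:
        raise ValueError("action values must be Retrieve, Reference, or Ignore")
    updated = copy.deepcopy(entries)  # leave the caller's data untouched
    matched = False
    for entry in updated:
        if extension in entry.get("extensions", []):
            entry["extensionSetting"]["action"] = action
            entry["extensionSetting"]["actionOnError"] = action_on_error
            matched = True
    if not matched:
        raise KeyError(f"no entry lists extension {extension!r}")
    return updated

# Trimmed sample entries (assumption: other extensionSetting keys omitted).
entries = [
    {"extensions": [".html"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Reference"}},
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]

updated = set_extension_action(entries, ".pdf", "Ignore", "Ignore")
print(json.dumps(updated[1], indent=2))
```

Because the settings apply to a whole entry, an entry that lists several extensions changes for all of them. Split such an entry first if only one extension should be ignored.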
Handling File Formats in Source JSON
You can browse the JSON configuration of a source to review and change how Coveo handles file formats or extensions that it encounters when crawling a system. Under `extensionSetting` for the desired file format, two parameters determine how an item is handled:
Parameter | Description |
---|---|
`action` | Default action to take when encountering an item of the above file format. |
`actionOnError` | Action to take when an error occurs while crawling an item of the above file format. |
Each parameter accepts exactly one of the following three action values:
Value | Description |
---|---|
`Retrieve` | Index the item by content and metadata. |
`Reference` | Index the item metadata only. |
`Ignore` | Skip the item (that is, don’t index it). |
For file formats that are common within a given source, `action` and `actionOnError` are typically set to `Retrieve` and `Reference`, respectively. For uncommon formats, or those associated with larger file sizes, both parameters are often set to `Reference`.
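How the two parameters combine can be sketched as a small lookup. This is an illustrative model of the tables above, not Coveo code. The `indexing_outcome` function and the `crawl_error` flag are hypothetical names.

```python
# What each action value means, per the table above.
OUTCOMES = {
    "Retrieve": "content and metadata",
    "Reference": "metadata only",
    "Ignore": "not indexed",
}

def indexing_outcome(extension_setting, crawl_error=False):
    """Pick `actionOnError` when crawling the item failed, `action` otherwise,
    and describe what ends up in the index."""
    key = "actionOnError" if crawl_error else "action"
    return OUTCOMES[extension_setting[key]]

# Typical settings described above.
common_format = {"action": "Retrieve", "actionOnError": "Reference"}
uncommon_format = {"action": "Reference", "actionOnError": "Reference"}

print(indexing_outcome(common_format))                    # content and metadata
print(indexing_outcome(common_format, crawl_error=True))  # metadata only
print(indexing_outcome(uncommon_format))                  # metadata only
```

With the typical settings for a common format, an item that fails during crawling still appears in search results through its metadata instead of being dropped.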