Customize the Indexing Process

When you add a source, specific item types are selected for indexing by default. The available item types vary from source to source. In the source JSON configuration, you can change which item types or which file extensions are indexed, so that only the content you need reaches the index.

This practice is useful when indexing large sources such as SharePoint, Dropbox, or Google Drive. Such sources may contain items that aren’t worth indexing because they don’t contribute to a relevant search experience. Fine-tuning the items to index retrieves only the necessary content and makes source updates faster and more efficient.

Edit Index Items

Edit Item Types

Coveo crawlers can index the content of many items of various formats and sizes. However, by default, they don’t index the content of certain formats or of very large items. So that large or unsupported items can still appear in the search results, Coveo sources index these items by reference, which means that the source only contains their file information, such as URI, file name, and other metadata.

Example

You index your company Dropbox account, which contains a Microsoft Publisher item (.pub) created by a user named John Smith. Because Publisher file content isn’t indexed by default for this source, the item is indexed by reference. The Dropbox source includes the following metadata:

  • filename: Retirement_Announcement_Letter

  • title: Early Retirement

  • author: John Smith

  • date of last modification: April 1, 2017

  • URI: www.dropbox.com/yourcompany/jsmith/foldername/RetirementAnnouncementLetter

After the item is indexed, John Smith queries retirement letter to retrieve it. Because his query contains keywords that match the above metadata, the item appears in the search results. He can then review the item metadata in the search interface or use the URI to open it directly from his Dropbox folder. However, if his query uses keywords that match the content of the item rather than its metadata (for example, dear colleagues), the item doesn’t appear in the search results.
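The difference between the two indexing modes in this example can be illustrated with a toy model. The sketch below is not how the Coveo index works internally: the `searchable_text` and `matches` helpers, the plain substring matching, and the body text of the letter are all assumptions made up for illustration.

```python
# Toy model of by-reference vs. by-content indexing.
# Illustration only: not the Coveo index implementation.

def searchable_text(item, action):
    """Return the text a query can match, given the indexing action."""
    parts = list(item["metadata"].values())
    if action == "Retrieve":
        # Indexed by content and metadata: the body is searchable too.
        parts.append(item["content"])
    # With "Reference", only the metadata is searchable.
    return " ".join(parts).lower()


def matches(query, item, action):
    """True if every query keyword appears in the searchable text.

    Uses naive substring matching (an assumption of this sketch).
    """
    text = searchable_text(item, action)
    return all(word in text for word in query.lower().split())


publisher_item = {
    "metadata": {
        "filename": "Retirement_Announcement_Letter",
        "title": "Early Retirement",
        "author": "John Smith",
    },
    # Hypothetical body text, invented for this sketch.
    "content": "Dear colleagues, I am writing to announce ...",
}

print(matches("retirement letter", publisher_item, "Reference"))  # True
print(matches("dear colleagues", publisher_item, "Reference"))    # False
print(matches("dear colleagues", publisher_item, "Retrieve"))     # True
```

In this model, the same item becomes findable through its body text only when its format is indexed by content.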

Edit Item Extensions

By default, each connector is configured to index several item types that can be found in the source, identified by their file extensions. In the source JSON configuration, you can see the list of supported file extensions and the associated settings that determine how the source processes them. You can then modify which items Coveo indexes when crawling the repository.

Example

By default, an Amazon S3 source indexes many item types. You index an Amazon S3 bucket that contains both .html and .pdf files, but you only want to index .html files.

In the documentConfig section of the source JSON configuration, you locate the extensions subsection that contains the .pdf extension and change its action and actionOnError values from Retrieve to Ignore. You then rebuild your source so that .pdf files are no longer indexed.

{
    "extensions": [
        ".pdf"
    ],
    "extensionSetting": {
        "action": "Ignore",
        "actionOnError": "Ignore",
        "converter": "Detect",
        "useContentType": false,
        "indexContainer": true,
        "fileTypeValue": "",
        "generateThumbnail": true,
        "useExternalHTMLGenerator": false,
        "convertDirectlyToHtml": false
    }
}
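If you prefer to script this kind of change rather than edit the JSON by hand, the same update could be sketched as follows. This is a minimal sketch, assuming the extension entries are available as a list of objects shaped like the one above. The `set_extension_action` helper and the sample `entries` list are hypothetical, and the sketch only edits a local copy of the configuration; it doesn't call any Coveo API.

```python
import json


def set_extension_action(entries, extension, action, action_on_error):
    """Set action/actionOnError on every entry that lists the extension.

    `entries` is assumed to be a list of objects with "extensions" and
    "extensionSetting" keys, as in the example above.
    Returns the number of entries changed.
    """
    changed = 0
    for entry in entries:
        if extension in entry.get("extensions", []):
            entry["extensionSetting"]["action"] = action
            entry["extensionSetting"]["actionOnError"] = action_on_error
            changed += 1
    return changed


# Sample data invented for this sketch; a real configuration contains
# more entries and more settings per entry.
entries = json.loads("""
[
  {"extensions": [".html"],
   "extensionSetting": {"action": "Retrieve", "actionOnError": "Reference"}},
  {"extensions": [".pdf"],
   "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}}
]
""")

changed = set_extension_action(entries, ".pdf", "Ignore", "Ignore")
print(f"Entries changed: {changed}")
print(json.dumps(entries, indent=4))
```

After pasting the updated JSON back into the source configuration, the source still needs to be rebuilt for the change to take effect, as described above.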

Handling File Formats in Source JSON

You can browse the JSON configuration of a source to review and change how Coveo handles file formats or extensions that it encounters when crawling a system. Under extensionSetting for the desired file format, two parameters determine how an item is handled:

  • action: Default action to take when encountering an item of the above file format.

  • actionOnError: Action to take when an error occurs while crawling an item of the above file format.

Each of these parameters accepts exactly one of the following three values:

  • Retrieve: Index the item by content and metadata.

  • Reference: Index the item metadata only.

  • Ignore: Skip the item (that is, don’t index it).
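Before saving an edited configuration, it can help to check that both parameters hold a single valid value. The following is a minimal sketch of such a check. The `validate_extension_setting` helper is hypothetical, and treating the values as case-sensitive is an assumption of this sketch rather than documented behavior.

```python
# The three values described above. Case-sensitive comparison is an
# assumption of this sketch.
ALLOWED_ACTIONS = {"Retrieve", "Reference", "Ignore"}


def validate_extension_setting(setting):
    """Return a list of problems found in one extensionSetting object."""
    problems = []
    for name in ("action", "actionOnError"):
        value = setting.get(name)
        if not isinstance(value, str):
            # Each parameter takes exactly one value, not a list or null.
            problems.append(f"{name} must be a single string, got {value!r}")
        elif value not in ALLOWED_ACTIONS:
            problems.append(f"{name} has unsupported value {value!r}")
    return problems


good = {"action": "Retrieve", "actionOnError": "Reference"}
bad = {"action": "Skip", "actionOnError": ["Retrieve", "Ignore"]}

print(validate_extension_setting(good))
for problem in validate_extension_setting(bad):
    print(problem)
```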

For file formats that are common within a given source, action and actionOnError are typically set to Retrieve and Reference, respectively. For uncommon formats, or those associated with larger file sizes, both parameters are often set to Reference.
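To make the interplay between the two parameters concrete, here is a small sketch of how the applicable action could be selected for one item, using the typical settings just described. The `applied_action` helper and the `COMMON` and `UNCOMMON` names are hypothetical; the sketch models the described behavior and isn't taken from the crawler itself.

```python
# Typical settings described above, written as extensionSetting excerpts.
COMMON = {"action": "Retrieve", "actionOnError": "Reference"}
UNCOMMON = {"action": "Reference", "actionOnError": "Reference"}


def applied_action(setting, error_occurred):
    """Pick the action that applies to one item.

    `action` is the default; `actionOnError` applies when an error
    occurs while crawling the item.
    """
    if error_occurred:
        return setting["actionOnError"]
    return setting["action"]


# A common format is indexed by content, but falls back to metadata
# only when crawling the item fails.
print(applied_action(COMMON, error_occurred=False))   # Retrieve
print(applied_action(COMMON, error_occurred=True))    # Reference

# An uncommon or large format is indexed by reference either way.
print(applied_action(UNCOMMON, error_occurred=False))  # Reference
```

With the common-format settings, an item whose content can’t be processed is still findable through its metadata instead of being dropped from the index.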