Source JSON modification options

This article covers the aspects of a source that you can modify by editing the source JSON configuration.

Add source filters

By default, Coveo indexes all the items in your source URL. However, if you want to index only certain items or ignore unwanted items, you can add filters to your source configuration. Coveo source configurations support inclusion and exclusion filters.

To fine-tune the items to index or ignore, you must define source filters in your source JSON configuration. Alternatively, with a Web source, you can define rules directly in the source configuration panel. In any case, use the reference below as a guide.

Once you’ve saved your configuration changes, launch a source rescan to apply them. A rebuild may be suggested on the Sources page of the Administration Console, but it isn’t required.

JSON filter reference

Your source URL and filters appear at the top of your JSON configuration as follows:

"startingAddresses": [
  "http://www.example.com/sitemap.xml"
],
"addressPatterns": [
  {
    "expression": "*",
    "patternType": "Wildcard",
    "allowed": true
  },
  {
    "expression": "<YOUR_FILTERING_EXPRESSION>",
    "patternType": "<EXPRESSION_TYPE>",
    "allowed": <BOOLEAN>
  }
]

The first object in the addressPatterns array is the default, all-inclusive filter.

startingAddresses (Array, Required)

This array contains the source URL(s) that Coveo crawls to retrieve the content to index.

You must encode spaces and special characters in your source URL.
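As a minimal sketch of the encoding step, the following Python snippet percent-encodes a URL before you paste it into startingAddresses. The article doesn't list which characters count as special, so this assumes standard URL percent-encoding is what's expected. The function name encode_source_url and the sample URL are hypothetical.

```python
from urllib.parse import quote


def encode_source_url(url: str) -> str:
    """Percent-encode spaces and other unsafe characters in a URL.

    Assumption: standard percent-encoding is what the source expects.
    Characters that structure a URL (and '%', so that already-encoded
    input isn't encoded twice) are left untouched.
    """
    return quote(url, safe=":/?&=#%")


# Hypothetical URL containing spaces
print(encode_source_url("http://www.example.com/my docs/site map.xml"))
# prints http://www.example.com/my%20docs/site%20map.xml
```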

addressPatterns (Array, Required)

This array contains your source filters. Each filter is represented by an object grouping the three mandatory filter parameters. These parameters are: expression, patternType, and allowed.

Important

By default, the addressPatterns array of a newly created source only contains the all-inclusive filter.

"addressPatterns": [
  {
    "expression": "*",
    "patternType": "Wildcard",
    "allowed": true
  }
]

With this default configuration, Coveo doesn’t filter at all (that is, it crawls all document paths it finds using the startingAddresses).

Importantly, the addressPatterns filters are also applied to the startingAddresses themselves. If you submit an empty addressPatterns array (or if you remove the addressPatterns array altogether), the startingAddresses won’t match any allowed addressPatterns filter and Coveo will return a No Items Indexed error.

Ensure that at least one inclusion filter in addressPatterns matches each of your startingAddresses. Also ensure that no exclusion filter matches any of your startingAddresses.
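The check described above can be approximated locally before you save the configuration. The Python sketch below is only an illustration: the helper names are hypothetical, and the matching rules (a wildcard * matches any run of characters, and the whole URL must match the pattern) are assumptions rather than documented Coveo behavior.

```python
import re


def matches(pattern: dict, url: str) -> bool:
    """Return True if url matches one addressPatterns entry (assumed semantics)."""
    expression = pattern["expression"]
    if pattern["patternType"] == "Wildcard":
        # Assumption: '*' matches any run of characters; the rest is literal.
        expression = ".*".join(re.escape(part) for part in expression.split("*"))
    # Assumption: the pattern must match the whole URL.
    return re.fullmatch(expression, url) is not None


def unreachable_starting_addresses(config: dict) -> list:
    """List starting addresses that no inclusion filter allows,
    or that an exclusion filter rejects."""
    patterns = config.get("addressPatterns", [])
    problems = []
    for url in config.get("startingAddresses", []):
        included = any(p["allowed"] and matches(p, url) for p in patterns)
        excluded = any(not p["allowed"] and matches(p, url) for p in patterns)
        if not included or excluded:
            problems.append(url)
    return problems


config = {
    "startingAddresses": ["http://www.example.com/sitemap.xml"],
    "addressPatterns": [
        {"expression": "*", "patternType": "Wildcard", "allowed": True},
    ],
}
print(unreachable_starting_addresses(config))  # prints []
```

An empty list means every starting address passes the filters under the assumptions above. Any URL in the list would likely need a matching inclusion filter or the removal of a conflicting exclusion filter.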

expression (String, Required)

This parameter determines the wildcard or regular expression that defines your source filter. Items at URIs matching this pattern will be indexed or ignored by Coveo.

Examples
  • With a wildcard: "expression": "http://career.MyCompany.com/jobs/*"

  • With a regular expression: "expression": ".*\\.(zip|rar|tar|7z|png|jpg)"

You must encode spaces and special characters in your expression. In addition, you must escape every backslash character by adding another backslash in front of it. Slash characters don’t need to be escaped.
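If you would rather not double the backslashes by hand, any JSON serializer applies this escaping for you. The sketch below uses Python's standard json module on the regular expression from the examples above. The function name to_json_expression is hypothetical.

```python
import json


def to_json_expression(regex: str) -> str:
    """Return the regular expression as a JSON string literal.

    json.dumps doubles each backslash and leaves slashes untouched,
    which is the escaping the expression parameter needs.
    """
    return json.dumps(regex)


# Raw string: the regular expression exactly as you would write it outside JSON
print(to_json_expression(r".*\.(zip|rar|tar|7z|png|jpg)"))
# prints ".*\\.(zip|rar|tar|7z|png|jpg)"
```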

For example, if your desired regular expression is:

^https?://docs\.coveo\.com/en/7\d/$

The expression to provide in the source JSON is:

"expression": "^https?://docs\\.coveo\\.com/en/7\\d/$"

patternType (String Enum, Required)

This parameter determines the type of expression used. Allowed values are Wildcard and RegEx.

Example

You have an Amazon S3 source whose bucket contains PDFs, compressed files, and images. You want to index only PDFs, so you add the following filter to exclude the other file types:

{
    "expression": ".*\\.(zip|rar|tar|7z|png|jpg)",
    "patternType": "RegEx",
    "allowed": false
}

Note that in the expression value above, the second . character is preceded by two backslashes. A single backslash escapes the . for the regular expression, and that backslash is itself escaped for the JSON.

allowed (Boolean, Required)

This parameter determines whether the filter is an inclusion filter or an exclusion filter, that is, whether the items at URIs matching the pattern should be indexed or ignored.

Allowed values are true for an inclusion filter and false for an exclusion filter.

Example

By default, a Sitemap source indexes all web pages listed in a Sitemap. Many listed web pages contain JPG images, but you only want the text to be indexed. So, you add the following filter:

{
    "expression": "*.jpg",
    "patternType": "Wildcard",
    "allowed": false
}

Change indexed item types

By default, each connector is configured to index several item types (based on their file extension) that can typically be found in the specific system. In the source JSON configuration, you can see the list of supported file extensions and the associated settings determining how they’re processed by the source. In particular, you can easily change which item types are indexed or not.

Example

By default, an Amazon S3 source indexes many item types. You index an Amazon S3 bucket that contains .html and .pdf files, but you only want to index the HTML files.

In the documentConfig section of the source JSON configuration, you locate the extensions subsection that contains the .pdf extension and change its action and actionOnError values from Retrieve to Ignore. You then rebuild your source so that the PDF files are rejected.

{
    "extensions": [
        ".pdf"
    ],
    "extensionSetting": {
        "action": "Ignore",
        "actionOnError": "Ignore",
        "converter": "Detect",
        "useContentType": false,
        "indexContainer": true,
        "fileTypeValue": "",
        "generateThumbnail": true,
        "useExternalHTMLGenerator": false,
        "convertDirectlyToHtml": false
    }
}
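When several extensions need the same change, editing each subsection by hand is error-prone. The following Python sketch applies the change from the example above to a list of extension entries. The article doesn't show where these entries sit inside documentConfig, so the function takes the list directly. The function name ignore_extension is hypothetical.

```python
def ignore_extension(extension_entries: list, extension: str) -> int:
    """Set action and actionOnError to "Ignore" for every entry that
    lists the given extension. Returns the number of entries changed.

    Assumption: each entry has the "extensions" / "extensionSetting"
    shape shown in the example above.
    """
    changed = 0
    for entry in extension_entries:
        if extension in entry.get("extensions", []):
            entry["extensionSetting"]["action"] = "Ignore"
            entry["extensionSetting"]["actionOnError"] = "Ignore"
            changed += 1
    return changed


entries = [
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
    {"extensions": [".html"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]
print(ignore_extension(entries, ".pdf"))  # prints 1
```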

Add conditional indexing for a Salesforce source

The Salesforce source lets you index items only when they meet specific conditions, which can reduce the size of your index (see Introducing conditional indexing).

Change the request rate of a REST API or GraphQL API source

REST API and GraphQL API sources make requests to an API to index the desired content. By default, these requests are made immediately one after the other. If Coveo’s requests reach the throttling limit of your API, you can increase the delay between each request.

To do so, in the source JSON configuration, in the parameters object, add the following configuration. Then, replace <NUMBER_OF_MILLISECONDS> with the number of milliseconds Coveo should wait between each request. The default value is 0, that is, there is no delay between requests.

"RequestsIntervalInMs": {
  "value": "<NUMBER_OF_MILLISECONDS>"
}
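To pick a value, you can derive the delay from the rate limit of your API. The Python sketch below assumes a limit expressed in requests per minute and rounds the delay up so the limit is never exceeded. The function name and the example limit are hypothetical, and the value is written as a string to mirror the snippet above.

```python
import json
import math


def requests_interval_parameter(max_requests_per_minute: int) -> dict:
    """Build the RequestsIntervalInMs fragment for a given API rate limit.

    Assumption: requests are sent one after the other, so waiting
    60000 / limit milliseconds between them stays within the limit.
    """
    if max_requests_per_minute <= 0:
        raise ValueError("max_requests_per_minute must be positive")
    interval_ms = math.ceil(60_000 / max_requests_per_minute)
    return {"RequestsIntervalInMs": {"value": str(interval_ms)}}


# Hypothetical API limit of 120 requests per minute
print(json.dumps(requests_interval_parameter(120), indent=2))
```

With a limit of 120 requests per minute, the computed delay is 500 milliseconds.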

Enable Coveo Personalization-as-you-go for a source

When using a Salesforce, REST API, Database, or GraphQL API source to index commerce-specific content, such as products, variants, and availabilities, you have to undergo a catalog configuration process to benefit from all commerce-related capabilities.

Note that additional configuration is required. Contact your Customer Success Manager to discuss your options.

Coveo Machine Learning tools include Coveo Personalization-as-you-go (PAYG) capabilities for commerce use cases. This suite of advanced features learns from a user’s intent and reacts within a few clicks. PAYG models require building a product vector space that represents the products contained in your source. For Salesforce, REST API, Database, or GraphQL API sources, you must enable Coveo PAYG for the product vector space to be produced.

Warning

You must enable PAYG in your source before starting to index content in it.

To enable PAYG in your source

  1. Edit the source JSON configuration.

  2. Modify the parameters section by adding the following:

    "parameters": {
      "UseStreamApi": {
        "value": "true"
      }
    }

With Catalog or SAP sources, however, this modification isn’t required. You can benefit from Coveo PAYG functionalities as soon as PAYG is enabled in your organization.

Forbid item deletion during a rescan

You may have a source whose content rarely or never gets deleted, such as a static site or an application where older content is archived. In rare cases, due to an error with an API or with Coveo, a source rescan may delete this stable content.

For example, if your server returns no items during the rescan, Coveo considers that your content has been deleted and removes it from your index. As a result, your source content is no longer searchable through your search interface.

Although this issue rarely happens, Coveo offers a source parameter that forbids item deletion during a source rescan, as an extra layer of security. Enabling this parameter for a source with stable content ensures that your content remains available through your search interface despite the error.

If you know your source content rarely or never gets deleted and you want to forbid item deletion during rescans, edit the source JSON configuration and then, under parameters, add "SkipUncrawledDocumentsDeletionOnRescan": {"value": "true"}.

Once the parameter is enabled, the only way to have the source delete content from your index is to launch a source rebuild.

"parameters": {
  "SkipUncrawledDocumentsDeletionOnRescan": {
    "value": "true"
  }
}
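If you script your configuration changes, take care not to overwrite parameters that are already set. The Python sketch below adds or updates a single parameter and leaves the others untouched. It assumes parameters is a top-level object of the source JSON configuration, and the function name set_parameter is hypothetical.

```python
import copy
import json


def set_parameter(config: dict, name: str, value: str) -> dict:
    """Return a copy of config with parameters[name]["value"] set.

    Other parameters, and other attributes of the same parameter,
    are left as they are. The input config is not modified.
    """
    updated = copy.deepcopy(config)
    parameters = updated.setdefault("parameters", {})
    parameters.setdefault(name, {})["value"] = value
    return updated


# Hypothetical configuration that already contains another parameter
source_config = {"parameters": {"RequestsIntervalInMs": {"value": "500"}}}
updated = set_parameter(
    source_config, "SkipUncrawledDocumentsDeletionOnRescan", "true"
)
print(json.dumps(updated["parameters"], indent=2))
```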

Add a hidden source parameter

You can edit frequently used source parameters from the Administration Console. Other rarely used parameters aren’t exposed in the console user interface but can be added to the source JSON configuration upon instructions from Coveo Support.

Hidden parameters have two attributes:

  • sensitive, which is set to false by default for all parameters. Set it to true when the parameter value contains sensitive information. When sensitive is true, the value attribute no longer appears in the JSON configuration once the source is rebuilt.

  • value, which holds the value of the parameter.

Example

A Coveo Support agent tells you to add a hidden source parameter in the JSON configuration parameters section to fix a specific issue that you’re experiencing.

Assuming the recommended parameter is a Boolean named AHiddenSourceParameter that should be set to false, you would add:

"AHiddenSourceParameter": {
  "sensitive": false,
  "value": "false"
}

Note

When editing a parameter whose sensitive attribute is set to true, you must specify a value to overwrite the current one, which is hidden.