Source JSON Modification Examples

This topic presents examples of source settings that you can change, when you have the required privileges, by modifying the source JSON configuration from the Coveo Cloud administration console (see Edit the JSON Configuration).

Add a Hidden Source Parameter

You can edit frequently used source parameters from the Coveo Cloud administration console. Other rarely used parameters are not exposed in the console user interface but can be added to the source JSON configuration.

A Coveo Support agent tells you to add a hidden source parameter in the JSON configuration parameters section to fix a specific issue that you are experiencing.

Assuming the recommended parameter is a Boolean named AHiddenSourceParameter that should be set to false, you would add:

"AHiddenSourceParameter": {
   "value": "false"
 }
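If you prepare the change in a script before pasting it into the console, the edit can be sketched as follows. This is a minimal Python sketch, not a Coveo API: the add_hidden_parameter helper is hypothetical, and the top-level "parameters" key is an assumption about where the parameters section sits in the configuration.

```python
import json


def add_hidden_parameter(config: dict, name: str, value: str) -> dict:
    """Return a copy of config with a hidden parameter added.

    Assumption: the parameters section is a top-level "parameters" object.
    """
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    updated.setdefault("parameters", {})[name] = {"value": value}
    return updated


# Hypothetical, stripped-down source configuration.
source_config = {"parameters": {}}

# The value is passed as the string "false", as in the example above.
patched = add_hidden_parameter(source_config, "AHiddenSourceParameter", "false")
print(json.dumps(patched, indent=2))
```

The helper works on a copy, so the original configuration stays available for comparison before you save anything.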

Change Indexed Item Types

Out of the box, each source type is configured to index several item types (based on their file extension) that are typically found in the corresponding system. In the source JSON configuration, you can see the list of supported file extensions and the associated settings that determine how the source processes them. In particular, you can easily change which item types are indexed.

By default, an Amazon S3 source indexes many item types. You index an Amazon S3 bucket that contains .html and .pdf files, but you only want to index the HTML files.

In the documentConfig section of the source JSON configuration, you locate the extensions subsection that contains the .pdf extension, change the action and actionOnError values from Retrieve to Ignore, and then rebuild your source to reject the PDF files.

{
    "extensions": [
        ".pdf"
    ],
    "extensionSetting": {
        "action": "Ignore",
        "actionOnError": "Ignore",
        "converter": "Detect",
        "useContentType": false,
        "indexContainer": true,
        "fileTypeValue": "",
        "generateThumbnail": true,
        "useExternalHTMLGenerator": false,
        "convertDirectlyToHtml": false
    }
}
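The same change can be sketched in Python for cases where many extension entries must be reviewed. This is only an illustration: the ignore_extension helper is hypothetical, and it assumes you have already extracted the list of extension entries (each shaped like the example above) from the documentConfig section.

```python
import copy


def ignore_extension(entries: list, extension: str) -> list:
    """Return a copy of entries in which items with `extension` are ignored.

    Note: an entry that groups several extensions is changed as a whole,
    so every extension listed alongside `extension` is ignored too.
    """
    updated = copy.deepcopy(entries)
    for entry in updated:
        if extension in entry.get("extensions", []):
            entry["extensionSetting"]["action"] = "Ignore"
            entry["extensionSetting"]["actionOnError"] = "Ignore"
    return updated


# Hypothetical, shortened entries; real entries carry more settings.
entries = [
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
    {"extensions": [".html"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]

patched = ignore_extension(entries, ".pdf")
for entry in patched:
    print(entry["extensions"], entry["extensionSetting"]["action"])
```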

Add Source Filters

The source URL (startingAddresses) determines which content of a system is crawled, while source filters control what is indexed. In the source JSON configuration, you can add source filters to control more precisely which items under a source URL are indexed. Inclusion and exclusion rules use either wildcards or regular expressions to define filtering patterns.

After you save changes to the source JSON configuration, a source rescan takes the changes into account; a rebuild is not necessary (even if the Start required rebuild link appears next to the source in the administration console Sources page).

Do not remove the startingAddresses parameter value.

For a Web source, the configuration panel offers the Inclusion filters and Exclusions filters parameters to easily set source filters (see Add/Edit Web Source - Panel).

Source filters must respect the following JSON template:

"startingAddresses": [
     "http://www.example.com/sitemap.xml"
],
"addressPatterns": [
     {
       "expression": "*",
       "patternType": "Wildcard",
       "allowed": true
     },
     {
       "expression": "pattern",
       "patternType": "type",
       "allowed": Boolean
     }
],

In this template, the startingAddresses value must contain one or more valid URLs. Make sure to encode spaces and special characters (see HTML URL Encoding Reference). Under addressPatterns, you must also replace the following values with the desired ones:

  • pattern: URI filter to include or exclude source content using either wildcard characters (* or ?) or a regular expression (see Regular-Expressions.info).

    When entering a pattern, keep the following rules in mind:

    • The startingAddresses value(s) must be part of the inclusion filter scope. Otherwise, no items will be indexed.

      You remove the default inclusion filter (with the * pattern) in your Sitemap source JSON configuration. Therefore, you validate that the Sitemap URL is included in the startingAddresses parameter to ensure that items will be indexed.

            "startingAddresses": [
              "http://www.example.com/sitemap.xml"
            ],
      
    • You must always include the following original wildcard * filter; otherwise, you get a NO_DOCUMENT_INDEXED error.

        "addressPatterns": [
            {
                "expression": "*",
                "patternType": "Wildcard",
                "allowed": true
            }
        ],
      
    • The startingAddresses value(s) must not be excluded by one of your exclusion filters.

    • Two inclusion filters or exclusion filters cannot contain the same pattern. An inclusion and an exclusion filter also cannot contain the same pattern.

    • URIs must be valid. Make sure to encode spaces and special characters (see HTML URL Encoding Reference).

    • You must escape every backslash character by adding a backslash in front of it to get valid JSON.

      If your regular expression is:

      ^https?://docs\.coveo\.com/en/7\d+/$

      The expression in the JSON must be:

      "expression": "^https?://docs\\.coveo\\.com/en/7\\d+/$",

    • You do not need to escape slash characters.

    Using Wildcard:

    • file://corp.MyCompany.com/dfs/dept/HR/employees/retired/*

    • http://career.MyCompany.com/jobs/*

    • http://career.MyCompany.com/open%20positions/*

  • type: Wildcard or RegEx

  • Boolean: true (for an inclusion filter) or false (for an exclusion filter)
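The backslash-escaping rule above does not have to be applied by hand: any JSON serializer doubles the backslashes for you. The following minimal sketch uses Python's standard json module to build one addressPatterns entry from the regular expression shown above. The variable names are illustrative only.

```python
import json

# The regular expression as you would write it outside of JSON.
regex = r"^https?://docs\.coveo\.com/en/7\d+/$"

pattern_entry = {
    "expression": regex,
    "patternType": "RegEx",
    "allowed": True,
}

# json.dumps escapes each backslash and leaves slash characters untouched.
serialized = json.dumps(pattern_entry, indent=2)
print(serialized)

# Parsing the serialized text gives back the original regular expression.
round_tripped = json.loads(serialized)["expression"]
```

Printing the serialized entry shows the expression with doubled backslashes, ready to paste into the JSON configuration.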

By default, a Sitemap source indexes all listed web pages from a Sitemap. Many listed web pages contain JPG images, but you only want the text to be indexed.

So you add the following exclusion filter, and then rebuild your source to reject JPG images.

"addressPatterns": [
    {
     "expression": "*",
     "patternType": "Wildcard",
     "allowed": true
   },
   {
     "expression": "*.jpg",
     "patternType": "Wildcard",
     "allowed": false
   }
 ],
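To reason about how the two filters above combine, you can model them in a few lines of Python. This sketch rests on assumptions that are not confirmed here: it treats a URL as indexed when it matches at least one inclusion filter and no exclusion filter, and it matches case-sensitively. The wildcard_to_regex and is_indexed helpers are hypothetical, and the actual crawler may behave differently.

```python
import re


def wildcard_to_regex(expression: str) -> str:
    """Translate a * / ? wildcard expression into an anchored regular expression."""
    escaped = re.escape(expression)
    return "^" + escaped.replace(r"\*", ".*").replace(r"\?", ".") + "$"


def is_indexed(url: str, address_patterns: list) -> bool:
    """Apply the filters under the assumed rule: any matching exclusion wins."""
    included = False
    for pattern in address_patterns:
        if pattern["patternType"] == "Wildcard":
            regex = wildcard_to_regex(pattern["expression"])
        else:
            regex = pattern["expression"]
        if re.search(regex, url):
            if not pattern["allowed"]:
                return False  # excluded, whatever the inclusion filters say
            included = True
    return included


address_patterns = [
    {"expression": "*", "patternType": "Wildcard", "allowed": True},
    {"expression": "*.jpg", "patternType": "Wildcard", "allowed": False},
]

print(is_indexed("http://www.example.com/about.html", address_patterns))  # True
print(is_indexed("http://www.example.com/logo.jpg", address_patterns))    # False
```

A model like this can help you spot a startingAddresses value that one of your exclusion filters would reject before you save the configuration.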

Configure a Source-Level Permission

Some source types do not support the Secured permission type, which sets permissions at the item level. In that case, you are left with the Shared and Private permission types, which set permissions at the source level (see Source Permission Types and Available Coveo Cloud V2 Source Types). The Shared permission type typically sets the source item permission to *@*, meaning that anyone with access to the organization can search the content of the source. You can, however, modify the JSON configuration to restrict the source-level permissions to a narrower shared group.

You want to index an internal website in an organization that includes internal and external content. You want to ensure that only your employees can see the content of this website in search results.

In the permissionSets section of the source JSON configuration, you modify the allowedPermissions to allow only users authenticated with an email identity of your company.

"permissionSets": [
    {
        "allowedPermissions": [
            {
                "identityType": "Group",
                "securityProvider": "Email Security Provider",
                "identity": "*@mycompany.com"
            }
        ]
    }
]
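If you maintain several sources, the same restriction can be prepared with a small script. The sketch below is illustrative only: the restrict_to_email_domain helper is hypothetical, and the sample input assumes that a Shared source carries a single *@* group entry.

```python
import copy


def restrict_to_email_domain(config: dict, domain: str) -> dict:
    """Return a copy of config whose permission sets allow only *@domain."""
    updated = copy.deepcopy(config)
    for permission_set in updated.get("permissionSets", []):
        permission_set["allowedPermissions"] = [
            {
                "identityType": "Group",
                "securityProvider": "Email Security Provider",
                "identity": f"*@{domain}",
            }
        ]
    return updated


# Assumed shape of a Shared source before the change.
shared_config = {
    "permissionSets": [
        {
            "allowedPermissions": [
                {
                    "identityType": "Group",
                    "securityProvider": "Email Security Provider",
                    "identity": "*@*",
                }
            ]
        }
    ]
}

restricted = restrict_to_email_domain(shared_config, "mycompany.com")
print(restricted["permissionSets"][0]["allowedPermissions"][0]["identity"])
```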

Find the Identity that Has Access to a Private Source

As an administrator, you may want to know which user is the only one with access to a Private source (see Source Permission Types). You can find this information in the source JSON configuration.

The permission type of one of your sources is set to Private. You want to know the only identity that currently has access to the items of this source. In the permissionSets section of the source JSON configuration, under allowedPermissions, find the identity (typically an email).

"permissionSets": [
    {
        "allowedPermissions": [
            {
                "identityType": "User",
                "securityProvider": "Email Security Provider",
                "identity": "someone@mycompany.com"
            }
        ]
    }
]

Only a user authenticated with this email identity will be able to see search results from this source. Here, you can also change the identity.
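When the configuration is long, a short script can pull out the identities for you. This is a minimal sketch with a hypothetical allowed_identities helper; it assumes the configuration has been loaded into a Python dictionary shaped like the permissionSets example above.

```python
def allowed_identities(config: dict) -> list:
    """List every identity found under permissionSets / allowedPermissions."""
    return [
        permission["identity"]
        for permission_set in config.get("permissionSets", [])
        for permission in permission_set.get("allowedPermissions", [])
    ]


private_config = {
    "permissionSets": [
        {
            "allowedPermissions": [
                {
                    "identityType": "User",
                    "securityProvider": "Email Security Provider",
                    "identity": "someone@mycompany.com",
                }
            ]
        }
    ]
}

print(allowed_identities(private_config))  # ['someone@mycompany.com']
```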

Add Conditional Indexing for a Salesforce Source

The Salesforce source allows you to index items only when they meet specific conditions, which can reduce the size of your index (see Introducing Conditional Indexing).

For former Coveo Cloud V1 users, conditional indexing in Coveo Cloud V2 is equivalent to the Manage Exclusion option in Coveo Cloud V1 (see Customizing the Salesforce Source Configuration).