Source JSON Modification Examples

This article presents examples of source settings that members of the Administrators and Content Managers built-in groups can change by modifying the source JSON configuration from the Coveo Administration Console (see Edit the JSON Configuration).

Add a Hidden Source Parameter

You can edit frequently used source parameters from the Administration Console. Other rarely used parameters aren’t exposed in the console user interface but can be added to the source JSON configuration.

Hidden parameters have two attributes:

  • sensitive: set to false by default for all parameters. Set it to true when the parameter value contains sensitive information. When set to true, the value attribute won’t appear in the JSON configuration once the source is rebuilt.

  • value: the value of the parameter.

A Coveo Support agent tells you to add a hidden source parameter in the JSON configuration parameters section to fix a specific issue that you’re experiencing.

Assuming the recommended parameter is a Boolean named AHiddenSourceParameter that should be set to false, you would add:

"AHiddenSourceParameter": {
  "sensitive": false,
  "value": "false"
},

When editing a parameter with the sensitive attribute set to true, you need to specify a value to overwrite the current one, which is hidden.
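If you prepare the change in a script before pasting it into the console, the edit can be sketched as follows. This is a minimal illustration only: the add_hidden_parameter helper is hypothetical, and it assumes the parameters section is a top-level "parameters" object and that values are stored as strings, as in the example above.

```python
import json


def add_hidden_parameter(config, name, value, sensitive=False):
    """Return a copy of config with a hidden parameter added or overwritten.

    Assumption: hidden parameters live under a top-level "parameters" object.
    """
    updated = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    parameters = updated.setdefault("parameters", {})
    if isinstance(value, bool):
        text = "true" if value else "false"  # Booleans are written as JSON-style strings
    else:
        text = str(value)
    parameters[name] = {"sensitive": sensitive, "value": text}
    return updated


source_config = {"parameters": {}}
updated_config = add_hidden_parameter(source_config, "AHiddenSourceParameter", False)
print(json.dumps(updated_config["parameters"], indent=2))
```

The helper returns a modified copy, so the original configuration stays available for comparison before you save.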

Change Indexed Item Types

By default, each connector is configured to index several item types (based on their file extension) that can typically be found in the specific system. In the source JSON configuration, you can see the list of supported file extensions and the associated settings determining how they’re processed by the source. In particular, you can easily change which item types are indexed or not.

By default, an Amazon S3 source indexes many item types. You index an Amazon S3 bucket that contains .html and .pdf files, but you only want to index the HTML files.

In the documentConfig section of the source JSON configuration, you identify the extensions sub-sections containing the .pdf extension type, and change the action and actionOnError values from Retrieve to Ignore, and then rebuild your source to reject the PDF files.

{
    "extensions": [
        ".pdf"
    ],
    "extensionSetting": {
        "action": "Ignore",
        "actionOnError": "Ignore",
        "converter": "Detect",
        "useContentType": false,
        "indexContainer": true,
        "fileTypeValue": "",
        "generateThumbnail": true,
        "useExternalHTMLGenerator": false,
        "convertDirectlyToHtml": false
    }
}
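The same change can be sketched in Python for cases where several extensions need the same treatment. The ignore_extension helper is hypothetical, and it assumes you have already extracted from documentConfig the list of entries shaped like the one above; where that list sits inside documentConfig isn’t covered here.

```python
def ignore_extension(extension_entries, extension):
    """Set action and actionOnError to "Ignore" for every entry listing the extension.

    Returns the number of entries that were changed.
    """
    changed = 0
    for entry in extension_entries:
        if extension in entry.get("extensions", []):
            setting = entry.setdefault("extensionSetting", {})
            setting["action"] = "Ignore"
            setting["actionOnError"] = "Ignore"
            changed += 1
    return changed


# Sample entries, trimmed to the two attributes this sketch touches.
entries = [
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
    {"extensions": [".html", ".htm"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]
changed_count = ignore_extension(entries, ".pdf")
print(changed_count, entries[0]["extensionSetting"])
```

A return value of 0 means no entry listed the extension, which is worth checking before you rebuild the source.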

Add Source Filters

The source URL (startingAddress) determines the content of a system that’s crawled while source filters control what’s indexed. In the source JSON configuration, you can add source filters to more specifically control which items to index or not under a source URL. Inclusion and exclusion rules use either wildcard or regular expressions to define filtering patterns.

After you save changes to the source JSON configuration, a source rescan takes the changes into account. A rebuild isn’t necessary, even if the Start required rebuild link appears next to the source on the Administration Console Sources page.

Don’t remove the startingAddresses parameter value.

For a Web source, the configuration panel offers the Inclusion filters and Exclusions filters parameters to easily set source filters (see Add or Edit a Web Source).

Source filters must respect the following JSON template:

"startingAddresses": [
     "http://www.example.com/sitemap.xml"
],
"addressPatterns": [
     {
       "expression": "*",
       "patternType": "Wildcard",
       "allowed": true
     },
     {
       "expression": "pattern",
       "patternType": "type",
       "allowed": Boolean
     }
],

In this template, the startingAddresses value must contain one or more valid URLs. Make sure to encode spaces and special characters (see HTML URL Encoding Reference). Under addressPatterns, you must also replace the following values with the desired ones:

  • pattern: URI filter to include or exclude source content using either wildcard characters ( * or ?) or a regular expression (see Regular-Expressions.info).

    When entering a pattern, keep the following rules in mind:

    • The startingAddresses value(s) must be part of the inclusion filter scope. Otherwise, no items will be indexed.

      For example, you remove the default inclusion filter (with the * pattern) in your Sitemap source JSON configuration. Therefore, you validate the Sitemap URL is included in the startingAddresses parameter to ensure items will be indexed.

            "startingAddresses": [
              "http://www.example.com/sitemap.xml"
            ],
      
    • You must always include the following original wildcard * filter. Otherwise, you get a NO_DOCUMENT_INDEXED error.

        "addressPatterns": [
            {
              "expression": "*",
              "patternType": "Wildcard",
              "allowed": true
            }
        ],
      
    • The startingAddresses value(s) must not be excluded by one of your exclusion filters.

    • Two inclusion filters or exclusion filters can’t contain the same pattern. An inclusion and an exclusion filter also can’t contain the same pattern.

    • URIs must be valid. Make sure to encode spaces and special characters (see HTML URL Encoding Reference).

    • You must escape all backslash characters by adding a backslash in front of each one to get valid JSON.

      For example, if your regular expression is:

      ^https?://docs\.coveo\.com/en/7\d+/$

      The expression in the JSON must be:

      "expression": "^https?://docs\\.coveo\\.com/en/7\\d+/$",

    • You don’t need to escape slash characters.

    Using Wildcard:

    • file://corp.MyCompany.com/dfs/dept/HR/employees/retired/*

    • http://career.MyCompany.com/jobs/*

    • http://career.MyCompany.com/open%20positions/*

  • type: Wildcard or RegEx

    You have an AWS S3 source, where the bucket contains PDFs, along with some compressed files, and images. You want to index only the PDFs.

    So you add the following exclusion filter, and then rebuild your source.

      "addressPatterns": [
          {
              "expression": "*",
              "patternType": "Wildcard",
              "allowed": true
          },
          {
              "expression": ".*\\.(zip|rar|tar|7z|png|jpg)",
              "patternType": "RegEx",
              "allowed": false
          }
      ],
    
    • The RegEx string in the patternType value is in camel case.

    • The second . character in the expression value is doubly escaped (once for the regular expression, and once for the JSON).

  • Boolean: true (for an inclusion filter) or false (for an exclusion filter)

    By default, a Sitemap source indexes all listed web pages from a Sitemap. Many listed web pages contain JPG images, but you only want the text to be indexed.

    So you add the following exclusion filter, and then rebuild your source to reject JPG images.

      "addressPatterns": [
          {
              "expression": "*",
              "patternType": "Wildcard",
              "allowed": true
          },
          {
              "expression": "*.jpg",
              "patternType": "Wildcard",
              "allowed": false
          }
      ],
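The filter rules above can be checked before you save the configuration. The following Python sketch is an illustration under stated assumptions, not a description of how the crawler evaluates filters: it assumes an address is indexed when it matches at least one inclusion filter and no exclusion filter, that wildcard patterns must match the whole URI, and that regular expressions may match anywhere in the URI. Python’s re module stands in for the actual regular expression engine, and all function names are hypothetical. The last lines also show that json.dumps produces the doubled backslashes that the JSON configuration requires.

```python
import json
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (* and ? only) into an anchored regex string."""
    pieces = []
    for char in pattern:
        if char == "*":
            pieces.append(".*")
        elif char == "?":
            pieces.append(".")
        else:
            pieces.append(re.escape(char))
    return "^" + "".join(pieces) + "$"


def matches(address_pattern, uri):
    """Assumed matching semantics; the real crawler may differ."""
    expression = address_pattern["expression"]
    if address_pattern["patternType"] == "Wildcard":
        return re.match(wildcard_to_regex(expression), uri) is not None
    return re.search(expression, uri) is not None


def is_indexed(address_patterns, uri):
    """Assumption: one matching inclusion and no matching exclusion."""
    included = any(p["allowed"] and matches(p, uri) for p in address_patterns)
    excluded = any(not p["allowed"] and matches(p, uri) for p in address_patterns)
    return included and not excluded


def check_filters(starting_addresses, address_patterns):
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    has_default = any(
        p["expression"] == "*" and p["patternType"] == "Wildcard" and p["allowed"]
        for p in address_patterns
    )
    if not has_default:
        problems.append("missing default * inclusion filter")
    expressions = [p["expression"] for p in address_patterns]
    if len(expressions) != len(set(expressions)):
        problems.append("same pattern used in more than one filter")
    for address in starting_addresses:
        if not is_indexed(address_patterns, address):
            problems.append("starting address is filtered out: " + address)
    return problems


starting_addresses = ["http://www.example.com/sitemap.xml"]
address_patterns = [
    {"expression": "*", "patternType": "Wildcard", "allowed": True},
    # Raw string: one backslash, exactly as the regular expression is written.
    {"expression": r".*\.(zip|rar|tar|7z|png|jpg)", "patternType": "RegEx", "allowed": False},
]
problems = check_filters(starting_addresses, address_patterns)
print(problems)

# json.dumps escapes the backslash, giving the form expected in the JSON configuration.
serialized = json.dumps(address_patterns[1]["expression"])
print('"expression": ' + serialized)
```

Writing the regular expression as a raw string and letting the JSON serializer add the escaping avoids doubling backslashes by hand.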
    

Find the Creator of a Source

As an administrator, you may want to know which user has access to a source whose content is accessible to the source creator only. You can find this information in the source JSON configuration.

You have a source whose content is accessible to the source creator only. You want to know the only identity that currently has access to the items of this source. In the permissionSets section of the source JSON configuration, under allowedPermissions, find the identity (typically an email).

"permissionSets": [
    {
        "allowedPermissions": [
            {
                "identityType": "User",
                "securityProvider": "Email Security Provider",
                "identity": "someone@mycompany.com"
            }
        ]
    }
]

Only a user authenticated with this email identity will be able to see search results from this source. Here, you can also change the identity.
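When you review several sources, reading the identity by hand gets tedious. As a minimal sketch, assuming the JSON configuration has been parsed into a Python dictionary with permissionSets at the top level, the hypothetical allowed_identities helper below lists every identity found under allowedPermissions.

```python
def allowed_identities(source_config):
    """Collect (identityType, identity) pairs from every permission set."""
    identities = []
    for permission_set in source_config.get("permissionSets", []):
        for permission in permission_set.get("allowedPermissions", []):
            identities.append((permission.get("identityType"), permission.get("identity")))
    return identities


# Same structure as the example above.
source_config = {
    "permissionSets": [
        {
            "allowedPermissions": [
                {
                    "identityType": "User",
                    "securityProvider": "Email Security Provider",
                    "identity": "someone@mycompany.com",
                }
            ]
        }
    ]
}
creators = allowed_identities(source_config)
print(creators)
```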

Add Conditional Indexing for a Salesforce Source

The Salesforce source allows you to index items only when they meet specific conditions, which can reduce the size of your index (see Introducing Conditional Indexing).

For former Coveo Cloud V1 users, Coveo Cloud conditional indexing is equivalent to the Manage Exclusion option from Coveo Cloud V1 (see Customizing the Salesforce Source Configuration).

Change the Default Security Provider of a Generic REST API Source

Generic REST API sources use the Email security identity provider by default. This means that all security identities are identified with an email address. Alternatively, you can switch to any other existing security identity provider.

For example, to switch to a custom security identity provider populated through the Push API, in the source JSON configuration, replace the securityProviders object with the following configuration:

"securityProviders": {
  "SecurityProvider": {
    "name": "<SECURITY_PROVIDER_NAME>",
    "typeName": "Expanded"
  }
}

Where <SECURITY_PROVIDER_NAME> is the name of the target security provider, as displayed on the Security Identities Administration Console page.

A Generic REST API source can handle only one security identity provider.
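Because only one provider is supported, the whole securityProviders object is replaced rather than extended. A minimal sketch of that replacement follows, assuming securityProviders is a top-level key of the parsed configuration; the set_security_provider helper and the provider names used in the sample are hypothetical.

```python
import json


def set_security_provider(source_config, provider_name):
    """Return a copy of the configuration with a single Expanded security provider."""
    updated = json.loads(json.dumps(source_config))  # deep copy of JSON-compatible data
    updated["securityProviders"] = {
        "SecurityProvider": {
            "name": provider_name,
            "typeName": "Expanded",
        }
    }
    return updated


# Placeholder content; a real configuration holds the default provider here.
original = {"securityProviders": {"SecurityProvider": {"name": "Default Provider"}}}
switched = set_security_provider(original, "My Custom Provider")
print(json.dumps(switched, indent=2))
```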
