Source JSON Modification Options
This article covers the aspects of a source that you can modify by editing the source JSON configuration.
Add source filters
By default, Coveo indexes all the items in your source URL. However, if you want to index only certain items or ignore unwanted items, you can add filters to your source configuration. Coveo source configurations support inclusion and exclusion filters.
To fine-tune the items to index or ignore, you must define source filters in your source JSON configuration. Alternatively, with a Web source, you can define filters directly in the source configuration panel. In any case, use the reference below as a guide.
Once you’ve saved your configuration changes, launch a source rescan to apply them. A rebuild may be suggested on the Administration Console Sources (platform-ca | platform-eu | platform-au) page, but isn’t required.
JSON filter reference
Your source URL and filters appear at the top of your JSON configuration as follows:
"startingAddresses": [
"http://www.example.com/sitemap.xml"
],
"addressPatterns": [
{
"expression": "*",
"patternType": "Wildcard",
"allowed": true
},
{
"expression": "<YOUR_FILTERING_EXPRESSION>",
"patternType": "<EXPRESSION_TYPE>",
"allowed": <BOOLEAN>
}
]
In the addressPatterns array above, the first object is the default, all-inclusive filter.
startingAddresses
(Array, Required)
This array contains the source URL(s) that Coveo crawls to retrieve the content to index.
You must encode space and special characters in your source URL.
addressPatterns
(Array, Required)
This array contains your source filters.
Each filter is represented by an object grouping the three mandatory filter parameters.
These parameters are: expression, patternType, and allowed.
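By default, the addressPatterns array contains only the all-inclusive filter. With this default configuration, Coveo doesn’t filter at all (i.e., it crawls all document paths it finds using the startingAddresses). Importantly, ensure you have at least one inclusion filter in the addressPatterns array.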
expression
(String, Required)
This parameter determines the wildcard or regular expression that defines your source filter. Items at URIs matching this pattern will be indexed or ignored by Coveo.
- With a wildcard:
"expression": "http://career.MyCompany.com/jobs/*"
- With a regular expression:
"expression": ".*\\.(zip|rar|tar|7z|png|jpg)"
You must encode space and special characters in your expression. In addition, you must escape all backslash characters by adding a backslash in front of them. Slash characters do not need to be escaped.
For example, if your desired regular expression is:
^https?://docs\.coveo\.com/en/7\d+/$
The expression to provide in the source JSON is:
"expression": "^https?://docs\\.coveo\\.com/en/7\\d+/$",
patternType
(String Enum, Required)
This parameter determines the type of expression used.
Allowed values are Wildcard and RegEx.
You have an AWS S3 source, where the bucket contains PDFs, compressed files, and images. You want to index only PDFs, so you add the following filter:
{
"expression": ".*\\.(zip|rar|tar|7z|png|jpg)",
"patternType": "RegEx",
"allowed": false
}
Note that in the expression value above, the second . character is escaped twice: one backslash escapes the . for the regular expression, and another backslash escapes that backslash for the JSON.
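The double escaping described above can be produced mechanically rather than by hand. The following is a minimal sketch, assuming you have Python available: it takes the docs.coveo.com regular expression shown earlier and lets the standard json module add the JSON-level backslashes. The variable names are illustrative only.

```python
import json

# The regular expression as you would write it outside of JSON.
raw_pattern = r"^https?://docs\.coveo\.com/en/7\d+/$"

# json.dumps doubles every backslash, which gives the quoted string
# to paste as the "expression" value in the source JSON.
json_literal = json.dumps(raw_pattern)
print('"expression": ' + json_literal)

# Parsing the JSON string gives back the original regular expression.
assert json.loads(json_literal) == raw_pattern
```

Slash characters are left untouched by json.dumps, which is consistent with the note that they do not need to be escaped.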
allowed
(Boolean, Required)
This parameter determines whether the filter is an inclusion filter or an exclusion filter, i.e., whether the items at URIs matching the pattern should be indexed or ignored.
Allowed values are true for an inclusion filter and false for an exclusion filter.
By default, a Sitemap source indexes all web pages listed in a Sitemap. Many listed web pages contain JPG images, but you only want the text to be indexed. So, you add the following filter:
{
"expression": "*.jpg",
"patternType": "Wildcard",
"allowed": false
}
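Before saving a filter and rescanning, it can help to preview which URLs a set of filters would keep. The following Python sketch is a rough local approximation, not Coveo's actual matching logic. It rests on three assumptions that are not stated in this article: an item is indexed only if it matches at least one inclusion filter and no exclusion filter, * is the only wildcard character, and an expression must match the whole URL. The function names are hypothetical.

```python
import re

def wildcard_to_regex(expression: str) -> str:
    # Assumption: "*" is the only wildcard and matches any run of characters.
    return ".*".join(re.escape(part) for part in expression.split("*"))

def pattern_matches(pattern: dict, url: str) -> bool:
    if pattern["patternType"] == "Wildcard":
        regex = wildcard_to_regex(pattern["expression"])
    else:  # "RegEx"
        regex = pattern["expression"]
    # Assumption: the expression must match the entire URL.
    return re.fullmatch(regex, url) is not None

def would_be_indexed(address_patterns: list, url: str) -> bool:
    # Assumption: exclusion filters take precedence over inclusion filters.
    included = any(p["allowed"] and pattern_matches(p, url) for p in address_patterns)
    excluded = any(not p["allowed"] and pattern_matches(p, url) for p in address_patterns)
    return included and not excluded

address_patterns = [
    {"expression": "*", "patternType": "Wildcard", "allowed": True},
    {"expression": "*.jpg", "patternType": "Wildcard", "allowed": False},
]
print(would_be_indexed(address_patterns, "http://www.example.com/about.html"))  # True
print(would_be_indexed(address_patterns, "http://www.example.com/logo.jpg"))    # False
```

Python's re module may also differ in details from the regular expression engine Coveo uses, so treat the result as a sanity check only.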
Change indexed item types
By default, each connector is configured to index several item types (based on their file extension) that can typically be found in the specific system. In the source JSON configuration, you can see the list of supported file extensions and the associated settings determining how they’re processed by the source. In particular, you can easily change which item types are indexed or not.
By default, an Amazon S3 source indexes many item types.
You index an Amazon S3 bucket that contains .html and .pdf files, but you only want to index the HTML files.
In the documentConfig section of the source JSON configuration, you identify the extensions sub-sections containing the .pdf extension type, change the action and actionOnError values from Retrieve to Ignore, and then rebuild your source to reject the PDF files.
{
"extensions": [
".pdf"
],
"extensionSetting": {
"action": "Ignore",
"actionOnError": "Ignore",
"converter": "Detect",
"useContentType": false,
"indexContainer": true,
"fileTypeValue": "",
"generateThumbnail": true,
"useExternalHTMLGenerator": false,
"convertDirectlyToHtml": false
}
}
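If you need to apply the same change to several extensions, the edit can be scripted on a local copy of the JSON. The sketch below assumes you have already extracted the list of extension entries (objects shaped like the one above) from the documentConfig section; how that list is nested inside documentConfig is not shown in this article, so the entries variable and the ignore_extension helper are hypothetical.

```python
import copy

def ignore_extension(extension_entries: list, extension: str) -> list:
    """Return a copy of the entries where `extension` is set to be ignored."""
    updated = copy.deepcopy(extension_entries)
    for entry in updated:
        if extension in entry["extensions"]:
            # Same change as in the example above: Retrieve becomes Ignore.
            entry["extensionSetting"]["action"] = "Ignore"
            entry["extensionSetting"]["actionOnError"] = "Ignore"
    return updated

# Trimmed, hypothetical entries; real ones carry more settings.
entries = [
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
    {"extensions": [".html"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]
result = ignore_extension(entries, ".pdf")
print(result[0]["extensionSetting"])
```

Note that an entry listing several extensions shares one extensionSetting, so this change would apply to every extension in that entry.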
Add conditional indexing for a Salesforce source
The Salesforce source lets you index items only when they meet specific conditions, which can reduce the size of your index (see Introducing conditional indexing).
Change the default security provider of a Generic REST API source
Generic REST API sources use the Email security identity provider by default. This means that all security identities are identified with an email address. Alternatively, you can switch to any other existing security identity provider.
For example, to switch to a custom security identity provider populated through the Push API, in the source JSON configuration, replace the securityProviders object with the following configuration:
"securityProviders": {
"SecurityProvider": {
"name": "<SECURITY_PROVIDER_NAME>",
"typeName": "Expanded"
}
}
Where <SECURITY_PROVIDER_NAME> is the name of the target security provider, as displayed on the Security Identities (platform-ca | platform-eu | platform-au) Administration Console page.
Note
A Generic REST API source can handle only one security identity provider.
Enable Coveo personalization-as-you-go for a source
To benefit from the personalization-as-you-go features, your source must be Stream API enabled. This capability is built into the catalog source by default. It is also supported for the Generic REST API and Salesforce sources, where you can enable it by editing the source JSON configuration from the Administration Console Sources (platform-ca | platform-eu | platform-au) page.
Add the UseStreamApi parameter to the parameters section of the source JSON configuration.
|
Ensure you’ve added the |
"parameters": {
"UseStreamApi": {
"value": "true"
}
// add additional parameters
}
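When the parameters section already contains other entries, the new parameter must be added alongside them rather than replacing them. The following Python sketch illustrates that merge on a local copy of the configuration. The enable_stream_api helper and the SomeExistingParameter name are hypothetical placeholders, not part of the Coveo configuration.

```python
import json

def enable_stream_api(source_config: dict) -> dict:
    """Return a copy of the configuration with UseStreamApi set to "true"."""
    updated = dict(source_config)
    # Copy the existing parameters so they are preserved.
    parameters = dict(updated.get("parameters", {}))
    parameters["UseStreamApi"] = {"value": "true"}
    updated["parameters"] = parameters
    return updated

# Hypothetical existing configuration with one unrelated parameter.
config = {"parameters": {"SomeExistingParameter": {"value": "42"}}}
updated = enable_stream_api(config)
print(json.dumps(updated["parameters"], indent=2))
```

Because json.dumps emits strict JSON, the printed result separates the entries with commas and contains no comments.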
Add a hidden source parameter
You can edit frequently used source parameters from the Administration Console. Other rarely used parameters aren’t exposed in the console user interface but can be added to the source JSON configuration upon instructions from Coveo Support.
Hidden parameters have two attributes:
- sensitive, which is set to false by default for all parameters. Set it to true when the parameter value contains sensitive information. When set to true, the value attribute won’t appear in the JSON configuration once the source is rebuilt.
- value, which is the value of the parameter.
A Coveo Support agent tells you to add a hidden source parameter in the JSON configuration parameters section to fix a specific issue that you’re experiencing.
Assuming the recommended parameter is Boolean, should be set to false, and is named AHiddenSourceParameter, you would add:
"AHiddenSourceParameter": {
"sensitive": false,
"value": "false"
}
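In the example above, sensitive is a JSON Boolean, whereas the Boolean parameter value is written as the string "false". The Python sketch below builds an entry in that shape. The hidden_parameter helper is hypothetical, and converting non-Boolean values with str() is an assumption, so follow the exact value format that Coveo Support gives you.

```python
import json

def hidden_parameter(value, sensitive: bool = False) -> dict:
    # Booleans are written as lowercase strings, matching the example above.
    if isinstance(value, bool):
        text = "true" if value else "false"
    else:
        # Assumption: other value types are also stored as strings.
        text = str(value)
    return {"sensitive": sensitive, "value": text}

entry = {"AHiddenSourceParameter": hidden_parameter(False)}
print(json.dumps(entry, indent=2))
```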
|
Note
When editing a parameter with the |