Source JSON modification options
This article covers options for the source aspects that you can modify by editing the source JSON configuration.
Add source filters
By default, Coveo indexes all the items in your source URL. However, if you want to index only certain items or ignore unwanted items, you can add filters to your source configuration. Coveo source configurations support inclusion and exclusion filters.
To fine-tune the items to index or ignore, you must define source filters in your source JSON configuration. Alternatively, with a Web source, you can define rules directly in the source configuration panel. In any case, use the reference below as a guide.
Once you’ve saved your configuration changes, launch a source rescan to apply them. The Sources page of the Administration Console may suggest a rebuild, but a rebuild isn’t required.
JSON filter reference
Your source URL and filters appear at the top of your JSON configuration as follows:
"startingAddresses": [
  "http://www.example.com/sitemap.xml"
],
"addressPatterns": [
  {
    "expression": "*",
    "patternType": "Wildcard",
    "allowed": true
  },
  {
    "expression": "<YOUR_FILTERING_EXPRESSION>",
    "patternType": "<EXPRESSION_TYPE>",
    "allowed": <BOOLEAN>
  }
]
The first addressPatterns object shown above is the default one (that is, the all-inclusive filter).
startingAddresses
(Array, Required)
This array contains the source URL(s) that Coveo crawls to retrieve the content to index.
You must encode spaces and special characters in your source URL.
addressPatterns
(Array, Required)
This array contains your source filters.
Each filter is represented by an object grouping the three mandatory filter parameters.
These parameters are: expression, patternType, and allowed.
By default, the addressPatterns array contains only the all-inclusive filter. With this default configuration, Coveo doesn’t filter at all (that is, it crawls all document paths it finds).
expression
(String, Required)
This parameter determines the wildcard or regular expression that defines your source filter. Items at URIs matching this pattern will be indexed or ignored by Coveo.
- With a wildcard:
  "expression": "http://career.MyCompany.com/jobs/*"
- With a regular expression:
  "expression": ".*\\.(zip|rar|tar|7z|png|jpg)"
You must encode spaces and special characters in your expression. In addition, you must escape all backslash characters by adding a backslash in front of them. Slash characters do not need to be escaped.
For example, if your desired regular expression is:
^https?://docs\.coveo\.com/en/7\d/$
The expression to provide in the source JSON is:
"expression": "^https?://docs\\.coveo\\.com/en/7\\d/$"
patternType
(String Enum, Required)
This parameter determines the type of expression used.
Allowed values are Wildcard and RegEx.
You have an AWS S3 source, where the bucket contains PDFs, compressed files, and images. You want to index only PDFs, so you add the following filter:
{
  "expression": ".*\\.(zip|rar|tar|7z|png|jpg)",
  "patternType": "RegEx",
  "allowed": false
}
Note that in the expression value above, the second . character is preceded by two backslashes. The period is escaped once for the regular expression, and the resulting backslash is escaped once more for the JSON.
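The double escaping described above is what a JSON serializer applies automatically. As a minimal sketch, assuming you prepare the filter object in Python before pasting it into the source JSON (the variable names are illustrative and not part of any Coveo tooling), json.dumps doubles each backslash of the regular expression for you:

```python
import json

# The regular expression as you would write it outside of JSON.
raw_regex = r".*\.(zip|rar|tar|7z|png|jpg)"

# Build the filter object; json.dumps escapes each backslash for JSON.
address_pattern = {
    "expression": raw_regex,
    "patternType": "RegEx",
    "allowed": False,
}
serialized = json.dumps(address_pattern, indent=2)
print(serialized)

# Parsing the JSON back yields the original regular expression.
round_trip = json.loads(serialized)["expression"]
```

Serializing the object rather than typing the backslashes by hand avoids forgetting one level of escaping.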
allowed
(Boolean, Required)
This parameter determines whether the filter is an inclusion filter or an exclusion filter, that is, whether the items at URIs matching the pattern should be indexed or ignored.
Allowed values are true for an inclusion filter and false for an exclusion filter.
By default, a Sitemap source indexes all web pages listed in a Sitemap. Many listed web pages contain JPG images, but you only want the text to be indexed. So, you add the following filter:
{
  "expression": "*.jpg",
  "patternType": "Wildcard",
  "allowed": false
}
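Before saving a filter, it can help to check locally which URIs its expression would match. The following is only a rough sketch, not Coveo's matching engine: the matches helper is hypothetical, it assumes that a wildcard * stands for any run of characters (Python's fnmatch also treats ? and [ ] specially, which may differ from Coveo), and it assumes that a regular expression may match anywhere in the URI.

```python
import fnmatch
import re


def matches(address_pattern: dict, uri: str) -> bool:
    """Approximate locally whether a URI matches a filter expression."""
    expression = address_pattern["expression"]
    if address_pattern["patternType"] == "Wildcard":
        # Assumption: "*" stands for any run of characters.
        return fnmatch.fnmatchcase(uri, expression)
    # Assumption: the regular expression may match anywhere in the URI.
    return re.search(expression, uri) is not None


# Filters as Python dicts, that is, after JSON parsing (single backslashes).
jpg_filter = {"expression": "*.jpg", "patternType": "Wildcard", "allowed": False}
archive_filter = {
    "expression": r".*\.(zip|rar|tar|7z|png|jpg)",
    "patternType": "RegEx",
    "allowed": False,
}

for uri in ["http://www.example.com/about.html", "http://www.example.com/logo.jpg"]:
    print(uri, matches(jpg_filter, uri), matches(archive_filter, uri))
```

The sketch only reports whether each expression matches. How Coveo combines several inclusion and exclusion filters is not covered here.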
Change indexed item types
By default, each connector is configured to index several item types (based on their file extension) that can typically be found in the specific system. In the source JSON configuration, you can see the list of supported file extensions and the associated settings determining how they’re processed by the source. In particular, you can easily change which item types are indexed or not.
By default, an Amazon S3 source indexes many item types.
You index an Amazon S3 bucket that contains .html and .pdf files, but you only want to index the HTML files.
In the documentConfig section of the source JSON configuration, you identify the extensions sub-sections containing the .pdf extension type, change the action and actionOnError values from Retrieve to Ignore, and then rebuild your source to reject the PDF files.
{
  "extensions": [
    ".pdf"
  ],
  "extensionSetting": {
    "action": "Ignore",
    "actionOnError": "Ignore",
    "converter": "Detect",
    "useContentType": false,
    "indexContainer": true,
    "fileTypeValue": "",
    "generateThumbnail": true,
    "useExternalHTMLGenerator": false,
    "convertDirectlyToHtml": false
  }
}
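If many extension entries need the same change, the edit above can be scripted before pasting the result back into the source JSON. This is a minimal sketch under assumptions: ignore_extension is a hypothetical helper, and it expects a list of entries shaped like the object shown above (an extensions list plus an extensionSetting object). How that list is nested inside documentConfig is not shown here, so extracting it is left to you.

```python
import copy


def ignore_extension(extension_entries: list, extension: str) -> list:
    """Return a copy of the entries where the given extension is ignored."""
    updated = copy.deepcopy(extension_entries)
    for entry in updated:
        # Note: an entry listing several extensions is changed as a whole.
        if extension in entry.get("extensions", []):
            entry["extensionSetting"]["action"] = "Ignore"
            entry["extensionSetting"]["actionOnError"] = "Ignore"
    return updated


# Two illustrative entries, reduced to the fields this sketch touches.
entries = [
    {"extensions": [".pdf"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
    {"extensions": [".html"],
     "extensionSetting": {"action": "Retrieve", "actionOnError": "Retrieve"}},
]
updated_entries = ignore_extension(entries, ".pdf")
print(updated_entries[0]["extensionSetting"])
```

The helper works on a copy, so the original entries stay available for comparison before you save the change.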
Add conditional indexing for a Salesforce source
The Salesforce source lets you index items only when they meet specific conditions, which can reduce the size of your index (see Introducing conditional indexing).
Enable Coveo Personalization-as-you-go for a source
When using a REST API, Database, Sitemap, Web, or GraphQL API source to index commerce-specific content, such as products, variants, and availabilities, you have to undergo a catalog configuration process to benefit from all commerce-related capabilities.
Coveo Machine Learning tools include Coveo Personalization-as-you-go (PAYG) capabilities for commerce use cases. This suite of advanced features learns from a user’s intent and reacts within a few clicks. PAYG models require the building of a product vector space to represent the products contained in your source. For REST API, Database, Sitemap, Web, or GraphQL API sources, Coveo PAYG needs to be enabled in order to produce the product vector space. Contact your Coveo representative to discuss your options.
Note
With Catalog or SAP sources, however, this modification isn’t required. You’ll be able to benefit from Coveo PAYG functionalities as soon as PAYG is enabled in your organization.
Forbid item deletion during a rescan
You may have a source whose content rarely or never gets deleted, such as a static site or an application where older content is archived. In rare cases, due to an error with an API or with Coveo, a source rescan may delete this stable content.
For example, if your server returns no item during the rescan, Coveo will consider your content has been deleted and will remove it from your index. As a result, your source content won’t be searchable through your search interface.
Although this issue rarely happens, Coveo offers a source parameter that forbids item deletion during a source rescan, as an extra layer of security. Enabling this parameter for a source with stable content ensures that your content remains available through your search interface despite the error.
If you know your source content rarely or never gets deleted and you want to forbid item deletion during rescans, edit the source JSON configuration and then, under parameters, add "SkipUncrawledDocumentsDeletionOnRescan": {"value": "true"}.
Once the parameter is enabled, the only way to have the source delete content from your index is to launch a source rebuild.
"parameters": {
  "SkipUncrawledDocumentsDeletionOnRescan": {
    "value": "true"
  }
}
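The parameters entry above can also be added programmatically to a configuration you have already loaded as JSON. The sketch below is illustrative only: set_source_parameter is a hypothetical helper, and the config dictionary stands in for your full source JSON configuration. Values are kept as strings, as in the examples in this article.

```python
import json


def set_source_parameter(config: dict, name: str, value: str) -> dict:
    """Add or update a parameter under the "parameters" section."""
    parameters = config.setdefault("parameters", {})
    # Keep any attribute already present on the parameter.
    parameters.setdefault(name, {})["value"] = value
    return config


# Stand-in for a source JSON configuration loaded with json.load().
config = {"parameters": {}}
set_source_parameter(config, "SkipUncrawledDocumentsDeletionOnRescan", "true")
print(json.dumps(config, indent=2))
```

The same helper would apply to any other parameter that takes a value attribute.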
Alternatively, you can configure your source to block the deletion process if it’s about to delete more than a certain percentage of the source items.
Forbid item deletion based on a percentage condition
You may have a source that’s crucial to your business and whose content is mostly stable. That is, if items get deleted at the end of a rescan operation, it’s usually a fraction of the source content.
To protect this source’s content from accidental deletions, you can use the AllowedDeletionPercentage parameter to block the deletion process if it’s about to delete more than the specified percentage of the source items.
For example, let’s say you set this parameter to 10% as follows, and then make a change in the source configuration panel.
"parameters": {
  "AllowedDeletionPercentage": {
    "value": "10"
  }
}
At the next rescan, if Coveo flags 50% of the source items for deletion due to this change, the deletion process will be blocked and your source will display an error with code DELETION_BLOCKED_BY_ALLOWED_PERCENTAGE.
As a result, your search interface will keep displaying the source content that would have otherwise been deleted.
You may also want to take advantage of the AllowedDeletionPercentage parameter if the source’s content comes from an API or server that you know is unreliable.
For example, if a scheduled rescan takes place while your API momentarily returns no items (and no errors either), this parameter will prevent the deletion of all the items in your index, and your search interface will keep displaying your content.
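The arithmetic behind these examples can be sketched as follows. This is an illustration of the rule as described in this section, not Coveo's actual implementation: deletion_blocked is a hypothetical function, and treating the threshold as strictly "more than" follows the wording above.

```python
def deletion_blocked(total_items: int, flagged_items: int,
                     allowed_percentage: float) -> bool:
    """Return True if the flagged share exceeds the allowed percentage."""
    if total_items == 0:
        # Nothing is indexed, so there is nothing to protect.
        return False
    flagged_percentage = 100 * flagged_items / total_items
    return flagged_percentage > allowed_percentage


# 50% of the items flagged with a 10% limit: the deletion is blocked.
print(deletion_blocked(1000, 500, 10))
# The API returned no items, so every item is flagged: also blocked.
print(deletion_blocked(1000, 1000, 10))
# 5% of the items flagged with a 10% limit: the deletion proceeds.
print(deletion_blocked(1000, 50, 10))
```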
Note
The File system source doesn’t support the AllowedDeletionPercentage parameter.
Alternatively, you can configure your source to skip the deletion process following a rescan altogether.
Add a hidden source parameter
You can edit frequently used source parameters from the Administration Console. Other rarely used parameters aren’t exposed in the console user interface but can be added to the source JSON configuration upon instructions from Coveo Support.
Hidden parameters have two attributes:
- sensitive, which is set to false by default for all parameters. Set it to true when the parameter value contains sensitive information. When set to true, the value attribute won’t appear in the JSON configuration once the source is rebuilt.
- value, which is the value of the parameter.
A Coveo Support agent tells you to add a hidden source parameter in the JSON configuration parameters section to fix a specific issue that you’re experiencing.
Assuming the recommended parameter is Boolean, should be set to false, and is named AHiddenSourceParameter, you would add:
"AHiddenSourceParameter": {
  "sensitive": false,
  "value": "false"
}
Note
When editing a parameter with the |