Web source JSON modification

Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be configured in the source JSON configuration.

This article explains how to configure Web source parameters, whether they’re already listed in the JSON or not.

Configuring listed and unlisted parameters

If the parameter you want to change is already listed in the parameters section of the source JSON configuration, just modify its value in the JSON configuration.

If the parameter isn’t listed in the parameters section, copy the entire parameter example object from the Reference section below and paste it into the parameters section of the source JSON configuration. Then, modify the value in the JSON configuration, if necessary.

If a parameter has a value attribute that contains sensitive information, set the sensitive attribute to true. Otherwise, the value will appear in clear text in the JSON configuration.

Document the changes you make to the source JSON configuration in the Change notes area below the JSON configuration. This ensures that you can easily revert to a previous configuration if needed.
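The two cases above (parameter already listed, parameter not listed) can be sketched in Python on a local copy of the JSON. This is a minimal illustration, not a Coveo tool: the set_parameter helper and the trimmed-down configuration are hypothetical, and the only structure assumed is a parameters section whose entries have sensitive and value attributes, as in the examples in this article.

```python
import json

def set_parameter(config, name, value, sensitive=None):
    """Add or update one entry in the "parameters" section of a source JSON configuration."""
    parameters = config.setdefault("parameters", {})
    entry = parameters.setdefault(name, {})
    # Values are written as strings in this article's examples, so booleans become "true"/"false"
    if isinstance(value, bool):
        value = "true" if value else "false"
    entry["value"] = str(value)
    if sensitive is None:
        entry.setdefault("sensitive", False)  # keep an existing flag, default to false otherwise
    else:
        entry["sensitive"] = sensitive
    return config

# Hypothetical, trimmed-down configuration with one listed parameter
config = json.loads('{"parameters": {"MaxCrawlDepth": {"sensitive": false, "value": "100"}}}')

set_parameter(config, "MaxCrawlDepth", 10)          # already listed: only the value changes
set_parameter(config, "FollowCanonicalLink", True)  # not listed: the whole object is added

print(json.dumps(config["parameters"], indent=2))
```

The printed parameters section can then be compared with the one in the source JSON configuration before you paste any change.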

Reference

This section provides information on the Web source parameters that you can only modify through the JSON configuration.

If a JSON configuration parameter isn’t documented in this article, configure it through the user interface instead.

`AdditionalHeaders` (Object | Null)

Specifies a map of HTTP headers in JSON format to be sent with every HTTP request. The value must be a valid JSON map in the format: {\"header1\": \"value1\", \"header2\": \"value2\", … }. The value is null by default.

Example

"AdditionalHeaders": {
  "sensitive": true,
  "value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"
}

When the Execute JavaScript on pages option is enabled, the Web connector doesn’t support the AdditionalHeaders parameter.
Don’t use the AdditionalHeaders parameter to send the Authorization header. This will generate an error as soon as you start an indexing action.
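Because the value attribute is a string that itself contains a JSON map, the inner quotation marks must be escaped. The following Python sketch shows one way to produce that string without escaping it by hand. The additional_headers_parameter helper is hypothetical, and its Authorization check only mirrors the warning above; it isn’t how the connector validates the parameter.

```python
import json

def additional_headers_parameter(headers, sensitive=True):
    """Build an AdditionalHeaders object whose value is a JSON map serialized as a string."""
    # The article warns against sending the Authorization header through this parameter
    if any(name.lower() == "authorization" for name in headers):
        raise ValueError("Don't send the Authorization header through AdditionalHeaders")
    return {"AdditionalHeaders": {"sensitive": sensitive, "value": json.dumps(headers)}}

parameter = additional_headers_parameter(
    {"X-CSRF-Token": "6b609652ef8e448eb6de270139d489c3"}
)

# Serializing the outer object escapes the quotation marks of the inner map
print(json.dumps(parameter, indent=2))
```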

`EnableJavaScriptRenderingOptimizations` (Boolean)

Whether to enable JavaScript rendering optimizations. When set to true, the crawler doesn’t download images and external files. The default value is true.

Example where you would set EnableJavaScriptRenderingOptimizations to false

On a page, you have a dynamically generated table. The data in the table comes from a JSON file, downloaded from a server, using JavaScript. To index the table data, you would need to set EnableJavaScriptRenderingOptimizations to false.

Example

"EnableJavaScriptRenderingOptimizations": {
  "sensitive": false,
  "value": "false"
}

`ExpandBeforeFiltering` (Boolean)

Whether the crawler expands the pages before applying the inclusion and exclusion filters. Enabling this option can slow down crawling, since pages that are later filtered out are still downloaded. The default value is false.

Example

You want to index the pages that are hyperlinked on https://mysite.com/, but you don’t want to index https://mysite.com/ itself. You therefore proceed as follows:

You set a Starting URL value to https://mysite.com/ in the source configuration user interface.
You add an exclusion rule for https://mysite.com/ in the source configuration user interface.
You set ExpandBeforeFiltering to true.

"ExpandBeforeFiltering": {
  "sensitive": false,
  "value": "true"
}
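The effect of the scenario above can be illustrated with a toy crawl in Python. This is a simplified model under stated assumptions, not the Coveo crawler: the SITE link map, the EXCLUDED set standing in for the exclusion rule, and the crawl function are all hypothetical.

```python
# Toy site: each URL maps to the links found on that page (hypothetical data)
SITE = {
    "https://mysite.com/": ["https://mysite.com/a", "https://mysite.com/b"],
    "https://mysite.com/a": [],
    "https://mysite.com/b": [],
}
EXCLUDED = {"https://mysite.com/"}  # stands in for the exclusion rule

def crawl(start, expand_before_filtering):
    """Return the set of URLs that end up indexed in this simplified model."""
    indexed, seen, queue = set(), {start}, [start]
    while queue:
        url = queue.pop(0)
        excluded = url in EXCLUDED
        if excluded and not expand_before_filtering:
            continue  # filtered first: the page is never downloaded, so its links are lost
        links = SITE.get(url, [])  # "download" the page and extract its links
        if not excluded:
            indexed.add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed

print(sorted(crawl("https://mysite.com/", expand_before_filtering=False)))
print(sorted(crawl("https://mysite.com/", expand_before_filtering=True)))
```

In this model, filtering first leaves nothing to index because the excluded starting page is never expanded, whereas expanding first indexes the two linked pages but not the starting page.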

`FollowAutoRedirects` (Boolean)

Whether the crawler should follow automatic redirections. The default value is true.

Example

"FollowAutoRedirects": {
  "sensitive": false,
  "value": "false"
}

`FollowCanonicalLink` (Boolean)

Whether the crawler should redirect to the canonical link (if any) in the <head> section of the page. Using canonical links helps reduce duplicates, because the canonical link identifies the "preferred" version of a page. The default value is false.

Example

"FollowCanonicalLink": {
  "sensitive": false,
  "value": "true"
}

`IgnoreUrlFragment` (Boolean)

Whether to ignore the fragment part of the URLs of found pages. The fragment part of a URL is everything after the hash (#) sign. The default value is true. Setting IgnoreUrlFragment to false may be necessary, for example, with single-page web apps.

Example

"IgnoreUrlFragment": {
  "sensitive": false,
  "value": "false"
}
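To see why fragments can matter for single-page apps, the following Python sketch uses the standard urllib.parse.urldefrag function to drop the fragment from a few URLs. The crawl_key helper and the app.example.com URLs are hypothetical, and the sketch only models which URLs would be treated as the same page.

```python
from urllib.parse import urldefrag

def crawl_key(url, ignore_url_fragment=True):
    """Return the URL used as the page identity in this simplified model."""
    return urldefrag(url).url if ignore_url_fragment else url

# Hypothetical routes of a single-page app, distinguished only by their fragment
found = [
    "https://app.example.com/#/home",
    "https://app.example.com/#/settings",
    "https://app.example.com/",
]

ignoring = {crawl_key(u) for u in found}
keeping = {crawl_key(u, ignore_url_fragment=False) for u in found}

print(len(ignoring))  # all three URLs collapse to a single page
print(len(keeping))   # each fragment is treated as a distinct page
```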

`IndexExternalPages` (Boolean)

Whether to index linked web pages that aren’t part of the domain of the specified Starting URL. The default value is false.

The IndexExternalPages parameter has no bearing on whether Starting URL subdomain pages are indexed.

Enabling this setting is unsafe.

With basic authentication, when the crawler is challenged for access to an external page, it then sends your basic authentication credentials to the external domain server.

With form authentication, when the crawler is denied access to an external page and redirected to a page that contains a login form, it submits your form authentication credentials to the external domain server.

Example

"IndexExternalPages": {
  "sensitive": false,
  "value": "true"
}

IndexExternalPages applicability examples

Example 1: A local page contains a link to an external page

One of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://en.wikipedia.org/. If IndexExternalPages is set to false, the linked Wikipedia page won’t be crawled or included in your searchable content.

Example 2: A local page that redirects to an external page

The IndexExternalPages setting isn’t taken into account in this scenario.

For example, one of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://www.mycompany.com/mypage. When clicked, this link redirects the reader to https://en.wikipedia.org/. In this scenario, the Wikipedia page will be crawled and indexed if it’s included in your crawling scope.

`IndexExternalPagesLinks` (Boolean)

Whether linked pages in external web pages are indexed. The default value is false.

Note

IndexExternalPages must be set to true for IndexExternalPagesLinks to be taken into account.

Following links found in external pages can lead the crawler anywhere on the Internet, endlessly discovering linked pages on other sites.

If you do enable this option, you should add one or more rules, preferably inclusion rules, to restrict the discovery to identifiable sites.

Example

"IndexExternalPagesLinks": {
  "sensitive": false,
  "value": "true"
}

`IndexJsonLdMetadata` (Boolean)

Whether to index metadata from JSON-LD <script> tags. The default value is false.

When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.

Example

Given the following JSON-LD script tag in a web page:

<script id="jsonld" type="application/ld+json">
   {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "url": "http://www.bbc.com/news/world-us-canada-39324587",
      "publisher": {
          "@type": "Organization",
          "name": "BBC News",
          "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
      },
      "headline": "Canada Strikes Gold in Olympic Hockey Final"
   }
</script>

To index the publisher name value (that is, BBC News in this page) in an index field, set your configuration as follows:

"IndexJsonLdMetadata": {
  "sensitive": false,
  "value": "true"
}

Use the following mapping rule for your field: %[jsonld.publisher.name]

Note

Contextual JSON-LD key-value pairs whose keys begin with @ aren’t converted into metadata.
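The flattening described in this section can be approximated in Python as follows. This is a sketch of the jsonld.parent.child naming only, under the assumption that nested objects are joined with dots and keys beginning with @ are skipped. The flatten_jsonld function is hypothetical, and the article doesn’t say how arrays are handled, so array values are kept as is here.

```python
import json

def flatten_jsonld(node, prefix="jsonld"):
    """Flatten nested JSON-LD objects into dotted metadata names, skipping @ keys."""
    metadata = {}
    for key, value in node.items():
        if key.startswith("@"):
            continue  # contextual keys such as @context and @type aren't converted
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            metadata.update(flatten_jsonld(value, name))  # recurse into child objects
        else:
            metadata[name] = value
    return metadata

# Content of the JSON-LD script tag from the example above
script_body = """
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "url": "http://www.bbc.com/news/world-us-canada-39324587",
  "publisher": {
    "@type": "Organization",
    "name": "BBC News",
    "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
  },
  "headline": "Canada Strikes Gold in Olympic Hockey Final"
}
"""

metadata = flatten_jsonld(json.loads(script_body))
for name in sorted(metadata):
    print(name)
```

The printed names include jsonld.publisher.name, which is the metadata name referenced by the mapping rule.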

`IndexSubdomains` (Boolean)

Whether to index subdomains of the base Starting URL you entered when creating the Web source. The default value is false.

Setting IndexSubdomains to true in effect adds all subdomains to the list of starting URLs without having to add them individually in the user interface. It doesn’t change your inclusion and exclusion rules though. If you set IndexSubdomains to true, make sure you adjust your inclusion and exclusion rules.

When IndexSubdomains is set to true, the Web source strips the www subdomain when determining subdomains. For example, if your base Starting URL is https://www.abc.com, the Web source considers https://docs.abc.com as being a subdomain.

Example

"IndexSubdomains": {
  "sensitive": false,
  "value": "true"
}
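One possible reading of the www stripping rule can be sketched in Python. The is_subdomain function is hypothetical and only illustrates the host comparison described above; it isn’t the Web source’s actual matching logic.

```python
from urllib.parse import urlsplit

def is_subdomain(candidate_url, starting_url):
    """Check whether candidate_url is on a subdomain of the base Starting URL host."""
    base = urlsplit(starting_url).hostname or ""
    if base.startswith("www."):
        base = base[len("www."):]  # strip the www subdomain before comparing
    host = urlsplit(candidate_url).hostname or ""
    return host.endswith("." + base)

print(is_subdomain("https://docs.abc.com/guide", "https://www.abc.com"))  # True
print(is_subdomain("https://abc.com.example.net/", "https://www.abc.com"))  # False
```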

`MaxAutoRedirects` (Integer)

The maximum number of redirects that a request may follow. No maximum is applied when the value is 0. The default value is 7.

Example

"MaxAutoRedirects": {
  "sensitive": false,
  "value": "10"
}

`MaxCrawlDelayInSeconds` (Integer)

The maximum number of seconds that the crawler respects for the robots.txt Crawl-delay: X directive. MaxCrawlDelayInSeconds is only used when the robots.txt directive override setting in the source configuration user interface isn’t enabled. If MaxCrawlDelayInSeconds is set to 0, the crawler won’t respect the Crawl-delay directive of the robots.txt file. The default value is 5.

Example

"MaxCrawlDelayInSeconds": {
  "sensitive": false,
  "value": "3"
}

`MaxCrawlDepth` (Integer)

Maximum number of link levels under your Starting URLs to include in the source. When the value is 0, only the starting URLs are crawled and all their links are ignored. The default value is 100.

Example

"MaxCrawlDepth": {
  "sensitive": false,
  "value": "10"
}

`MaxPageSizeInBytes` (Integer)

The maximum size of a web resource in bytes. If the resource size exceeds this value, it’s not downloaded or processed. No maximum is applied when the value is 0. The default value is 536870912 (512 MB).

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example

"MaxPageSizeInBytes": {
  "sensitive": false,
  "value": "400000000"
}

`MaxPagesToCrawl` (Integer)

The maximum number of pages to index. No maximum is applied when the value is 0. The default value is 2500000.

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example

"MaxPagesToCrawl": {
  "sensitive": false,
  "value": "10000"
}

`MaxPagesToCrawlPerDomain` (Integer)

The maximum number of pages to index per domain. No maximum is applied when the value is 0. The default value is 0.

Example

"MaxPagesToCrawlPerDomain": {
  "sensitive": false,
  "value": "8000"
}

`MaxRequestsRetryCount` (Integer)

The maximum number of request retries for a URL, if an error occurs. No retry is attempted when the value is 0. The default value is 3.

Example

"MaxRequestsRetryCount": {
  "sensitive": false,
  "value": "5"
}

`MinCrawlDelayPerDomainInMilliseconds` (Integer)

The number of milliseconds between consecutive HTTP requests sent to retrieve content on a given domain. The default value is 1000, which means that a maximum of 3,600 items are indexed per hour on any given domain.

Example

"MinCrawlDelayPerDomainInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}
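The hourly ceiling quoted above follows from dividing the number of milliseconds in an hour by the delay. The following Python sketch reproduces that arithmetic; the max_requests_per_hour helper is hypothetical, and it ignores download time, so real throughput would be lower.

```python
def max_requests_per_hour(min_crawl_delay_ms):
    """Upper bound on requests per hour to one domain for a given delay between requests."""
    if min_crawl_delay_ms <= 0:
        raise ValueError("the delay must be a positive number of milliseconds")
    milliseconds_per_hour = 60 * 60 * 1000
    return milliseconds_per_hour // min_crawl_delay_ms

print(max_requests_per_hour(1000))  # 3600, the default
print(max_requests_per_hour(1500))  # 2400, the value used in the example above
```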

`RequestsRetryDelayInMilliseconds` (Integer)

The minimum delay in milliseconds between a failed HTTP request and the ensuing retry. The default value is 1000.

Example

"RequestsRetryDelayInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}

`RequestsTimeoutInSeconds` (Integer)

Timeout value in seconds for web requests. Web requests don’t time out when the value is set to 0. The default value is 60.

Consider increasing the value if a target website sometimes responds slowly. This prevents avoidable timeout errors.

Example

"RequestsTimeoutInSeconds": {
  "sensitive": false,
  "value": "60"
}

`RespectUrlCasing` (Boolean)

Whether page URLs that you include are case sensitive. The default value is true.

Example

"RespectUrlCasing": {
  "sensitive": false,
  "value": "true"
}

When RespectUrlCasing is set to true

The crawler uses page URLs as is.

If the same web page appears twice on the Web server (that is, with different URL casings), you can expect duplicate items in your source, one copy for each URL casing variant found by the crawler.

When web page URLs are case sensitive (that is, when RespectUrlCasing is set to true), inclusion and exclusion rules are also case sensitive.

When RespectUrlCasing is set to false

The crawler first converts the current page URL to lowercase letters. It then uses the lowercased page URL to perform an HTTP request to the Web server.

If the Web server is case-sensitive and resource URLs on the server contain uppercase letters, the Web server won’t serve the request. The way the Web server deals with this situation (for example, HTTP error) varies from site to site.

When web page URLs aren’t case sensitive (that is, when RespectUrlCasing is set to false), inclusion and exclusion rules are also not case sensitive.
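The two behaviors can be compared with a short Python sketch. It only models the lowercasing and the case sensitivity of rules described above. The request_url and rule_matches helpers, the mysite.com URLs, and the use of a regular expression as the rule are assumptions made for illustration.

```python
import re

def request_url(url, respect_url_casing=True):
    """Return the URL the crawler would request in this simplified model."""
    return url if respect_url_casing else url.lower()

def rule_matches(pattern, url, respect_url_casing=True):
    """Apply a regex rule, case-insensitively when URL casing isn't respected."""
    flags = 0 if respect_url_casing else re.IGNORECASE
    return re.search(pattern, url, flags) is not None

# Hypothetical casing variants of the same page
found = ["https://mysite.com/Docs/Page", "https://mysite.com/docs/page"]

with_casing = {request_url(u, True) for u in found}
without_casing = {request_url(u, False) for u in found}

print(len(with_casing))     # one item per casing variant
print(len(without_casing))  # variants collapse to the lowercased URL
print(rule_matches(r"/docs/", found[0], respect_url_casing=True))
print(rule_matches(r"/docs/", found[0], respect_url_casing=False))
```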

`RobotsDotTextUserAgentString` (String | Null)

The User Agent used by the Coveo crawler to interact with the target Web content. The User Agent is compared against the allow/disallow directives in the website robots.txt file, provided the robots.txt directive override setting in the user interface isn’t enabled. The default value is CoveoBot.

Example

"RobotsDotTextUserAgentString": {
  "sensitive": false,
  "value": "AcmeBot"
}

`SendCookies` (Boolean)

Whether the crawler can exchange HTTP cookies with the crawled site. The default value is true.

Example

"SendCookies": {
  "sensitive": false,
  "value": "false"
}

`SkipOnStartingAddressError` (Boolean)

Whether the crawler should skip a Starting URL that it can’t retrieve. This setting may be useful on a source that includes multiple starting URLs. When SkipOnStartingAddressError is set to true, a rebuild doesn’t fail when unable to retrieve a given starting URL. Instead, it moves on to the next starting URL specified in the source configuration user interface. The default value is true.

When SkipOnStartingAddressError is set to true, if a rebuild or rescan is unable to access a starting URL, existing source content associated with that starting URL is deleted.

Example

"SkipOnStartingAddressError": {
  "sensitive": false,
  "value": "false"
}

`UseHiddenBasicAuthentication` (Boolean)

Whether the crawler should send a basic authentication header in all requests.

The default value is false.

Note

To use UseHiddenBasicAuthentication, you need Execute JavaScript on pages to be disabled.

Enabling this setting is unsafe, as your basic authentication credentials will be sent with every page your source requests, regardless of the domain.

Example

"UseHiddenBasicAuthentication": {
  "sensitive": false,
  "value": "true"
}

`UseProxy` (Boolean)

UseProxy is specifically for Crawling Module Web sources. Don’t use this parameter in a cloud Web source.
This parameter doesn’t support scenarios where Execute JavaScript on pages is enabled.
This parameter can’t be used in combination with Form authentication.

Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.

Example

"UseProxy": {
  "sensitive": false,
  "value": "false"
}

`UserAgentString` (String)

Identifier used by the Coveo crawler when requesting pages from the web server. The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).

Example

"UserAgentString": {
  "sensitive": false,
  "value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}