Web source JSON modification

Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration. This article presents Web source parameters that you can change by modifying the source JSON configuration.

If the parameter you want to change already appears in the parameters section of the source JSON configuration, modify only its value and, if necessary, its sensitive attribute. If the parameter doesn't appear in the parameters section yet, add the entire parameter object to the parameters section, and then set its sensitive and value attributes. Each parameter listed in this article includes an example of how it appears or should appear in the source JSON configuration.
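As a minimal sketch of the add-or-update logic described above, the following Python helper edits a configuration that has already been loaded as a dictionary. The set_parameter function name and the sample config content are illustrative assumptions, not part of any Coveo tooling. The only structure taken from this article is a parameters section whose entries have sensitive and value attributes.

```python
import json

def set_parameter(config, name, value, sensitive=False):
    """Add or update one parameter object in the "parameters" section."""
    params = config.setdefault("parameters", {})
    entry = params.setdefault(name, {})
    # The examples in this article store values as strings ("true", "10").
    if isinstance(value, bool):
        value = "true" if value else "false"
    entry["value"] = str(value)
    entry["sensitive"] = sensitive
    return config

# Hypothetical, heavily trimmed source configuration.
config = {"parameters": {"MaxCrawlDepth": {"sensitive": False, "value": "100"}}}

set_parameter(config, "MaxCrawlDepth", 10)            # parameter already present
set_parameter(config, "ExpandBeforeFiltering", True)  # parameter added
print(json.dumps(config["parameters"], indent=2))
```

The helper only touches the named parameter, which matches the advice to leave the rest of the JSON configuration unchanged.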

Important

Set the sensitive attribute of a parameter to true if its value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

For instructions on accessing the JSON file and best practices to follow, see Edit a source JSON configuration.

AdditionalHeaders (object | null)

Specifies a map of HTTP headers in JSON format to be sent with every HTTP request. The value must be a valid JSON map in the format: {\"header1\": \"value1\", \"header2\": \"value2\", … }. The value is null by default.

Example
"AdditionalHeaders": {
  "sensitive": true,
  "value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"
}
Important
  • When the Execute JavaScript on pages option is enabled, the Web connector doesn’t support the AdditionalHeaders parameter.

  • Don’t use the AdditionalHeaders parameter to send the Authorization header. This will generate an error as soon as you start an indexing action.
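Because the AdditionalHeaders value is a JSON map stored inside a JSON string, the inner quotes must be escaped. One way to avoid escaping by hand, sketched here in Python, is to serialize the header map first and then serialize the parameter object. The build_additional_headers helper is a hypothetical name, and its Authorization check simply mirrors the restriction stated above.

```python
import json

def build_additional_headers(headers, sensitive=True):
    """Return the AdditionalHeaders parameter object for a header map."""
    # Mirrors the restriction above: Authorization must not be sent this way.
    if any(name.lower() == "authorization" for name in headers):
        raise ValueError("Don't send the Authorization header through AdditionalHeaders")
    # Inner serialization: the header map becomes a JSON string.
    return {"sensitive": sensitive, "value": json.dumps(headers)}

headers = {"X-CSRF-Token": "6b609652ef8e448eb6de270139d489c3"}
parameter = {"AdditionalHeaders": build_additional_headers(headers)}

# Outer serialization: escapes the quotes of the inner JSON string.
fragment = json.dumps(parameter, indent=2)
print(fragment)
```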

ExpandBeforeFiltering (Boolean)

Whether the crawler expands the pages before applying the inclusion and exclusion filters. Enabling this option can slow down crawling because filtered pages are still downloaded. The default value is false.

Example

You want to index the pages that are hyperlinked on https://mysite.com/, but you don’t want to index https://mysite.com/ itself. You therefore proceed as follows:

  1. You set a Starting URL value to https://mysite.com/ in the source configuration user interface.

  2. You add an exclusion rule for https://mysite.com/ in the source configuration user interface.

  3. You set ExpandBeforeFiltering to true.

"ExpandBeforeFiltering": {
  "sensitive": false,
  "value": "true"
}
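The effect of this setting on the example above can be pictured with a toy crawl loop. This is a simplified sketch under assumptions, not the crawler's actual algorithm: the crawl function, the in-memory links map that stands in for downloaded pages, and the is_excluded callback are all hypothetical.

```python
from collections import deque

def crawl(start, links, is_excluded, expand_before_filtering=False):
    """Return the URLs that would be indexed, in discovery order."""
    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        excluded = is_excluded(url)
        if excluded and not expand_before_filtering:
            continue  # filtered before download, so its links are never seen
        # "Download" the page and discover the pages it links to.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
        if not excluded:
            indexed.append(url)
    return indexed

links = {"https://mysite.com/": ["https://mysite.com/a", "https://mysite.com/b"]}
is_excluded = lambda url: url == "https://mysite.com/"

without_expand = crawl("https://mysite.com/", links, is_excluded, False)
with_expand = crawl("https://mysite.com/", links, is_excluded, True)
print(without_expand, with_expand)
```

In this sketch, the excluded Starting URL is only expanded when the flag is true, which is why the linked pages are reached in that case only.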

FollowAutoRedirects (Boolean)

Whether the crawler should follow automatic redirections. The default value is true.

Example
"FollowAutoRedirects": {
  "sensitive": false,
  "value": "false"
}

FollowCanonicalLink (Boolean)

Whether the crawler should redirect to the canonical link (if any) in the <head> section of the page. Using canonical links helps reduce duplicates, the canonical link being the "preferred" version of a page. The default value is false.

Example
"FollowCanonicalLink": {
  "sensitive": false,
  "value": "true"
}

IgnoreUrlFragment (Boolean)

Whether to ignore the fragment part of discovered page URLs. The fragment part of a URL is everything after the hash (#) sign. The default value is true. Setting IgnoreUrlFragment to false may be necessary, for example, with single-page web apps.

Example
"IgnoreUrlFragment": {
  "sensitive": false,
  "value": "false"
}
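To illustrate what ignoring the fragment means, the following Python sketch uses the standard library's urldefrag function. The normalize helper and the sample URLs are assumptions for illustration, and the crawler's real URL handling may differ.

```python
from urllib.parse import urldefrag

def normalize(url, ignore_url_fragment=True):
    """Drop everything after the hash sign when fragments are ignored."""
    return urldefrag(url).url if ignore_url_fragment else url

found = [
    "https://mysite.com/app#/settings",
    "https://mysite.com/app#/profile",
]

# With fragments ignored, both links point to the same page.
ignored = {normalize(u, True) for u in found}
# With fragments kept, each link is a distinct page, as in a single-page app.
kept = {normalize(u, False) for u in found}
print(ignored, kept)
```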

IndexExternalPages (Boolean)

Whether to index linked web pages that aren’t part of the domain of the specified Starting URL. The default value is false.

Note

The IndexExternalPages parameter has no bearing on whether Starting URL subdomain pages are indexed.

Example
"IndexExternalPages": {
  "sensitive": false,
  "value": "true"
}
Example

One of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://en.wikipedia.org/. If IndexExternalPages is set to false, the linked Wikipedia page is neither crawled nor included in your searchable content.

However, a local URL that redirects the visitor to an external page will override this setting.

Example

One of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://www.mycompany.com/mypage. When clicked, this link redirects the reader to https://en.wikipedia.org/. The Wikipedia page will be crawled and included in your searchable content.

IndexExternalPagesLinks (Boolean)

Whether pages linked from external web pages are indexed. The default value is false.

Note

IndexExternalPages must be set to true for IndexExternalPagesLinks to be taken into account.

Important

Indexing pages linked from external pages can lead the crawler anywhere on the Internet, where it can keep discovering linked pages on other sites indefinitely.

If you do enable this option, you should add one or more rules, preferably inclusion rules, to restrict the discovery to identifiable sites.

Example
"IndexExternalPagesLinks": {
  "sensitive": false,
  "value": "true"
}

IndexJsonLdMetadata (Boolean)

Whether to index metadata from JSON-LD <script> tags. The default value is false.

When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.

Example

Given the following JSON-LD script tag in a web page:

<script id="jsonld" type="application/ld+json">
   {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "url": "http://www.bbc.com/news/world-us-canada-39324587",
      "publisher": {
          "@type": "Organization",
          "name": "BBC News",
          "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
      },
      "headline": "Canada Strikes Gold in Olympic Hockey Final"
   }
</script>

To index the publisher name value (that is, BBC News in this page) in an index field:

  1. Set your configuration as follows:

    "IndexJsonLdMetadata": {
      "sensitive": false,
      "value": "true"
    }
  2. Use the following mapping rule for your field: %[jsonld.publisher.name]

Note

Contextual JSON-LD key-value pairs whose keys begin with @ aren’t converted into metadata.
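The flattening into jsonld.parent.child names can be sketched in Python as follows. This is an approximation based only on the description above: the flatten_jsonld function is hypothetical, and how the real converter treats arrays or other edge cases isn't covered here.

```python
def flatten_jsonld(obj, prefix="jsonld"):
    """Flatten nested JSON-LD objects into dotted metadata names."""
    flat = {}
    for key, value in obj.items():
        if key.startswith("@"):
            continue  # contextual keys such as @context and @type are skipped
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_jsonld(value, name))
        else:
            flat[name] = value
    return flat

article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "url": "http://www.bbc.com/news/world-us-canada-39324587",
    "publisher": {"@type": "Organization", "name": "BBC News"},
    "headline": "Canada Strikes Gold in Olympic Hockey Final",
}

metadata = flatten_jsonld(article)
print(metadata["jsonld.publisher.name"])
```

The jsonld.publisher.name key produced here corresponds to the %[jsonld.publisher.name] mapping rule in the example.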

IndexSubdomains (Boolean)

Whether to index subdomains of the base Starting URL you entered when creating the Web source. The default value is false.

Setting IndexSubdomains to true in effect adds all subdomains to the list of starting URLs without having to add them individually in the user interface. It doesn’t change your inclusion and exclusion rules though. If you set IndexSubdomains to true, make sure you adjust your inclusion and exclusion rules.

When IndexSubdomains is set to true, the Web source strips the www subdomain when determining subdomains. For example, if your base Starting URL is https://www.abc.com, the Web source considers https://docs.abc.com as being a subdomain.

Example
"IndexSubdomains": {
  "sensitive": false,
  "value": "true"
}
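The www stripping described above can be pictured with a small host comparison. The following Python sketch is an assumption about the general idea, not the Web source's actual logic, and the is_subdomain name is hypothetical.

```python
from urllib.parse import urlparse

def is_subdomain(url, starting_url):
    """Check whether url is on a subdomain of the Starting URL's base domain."""
    base = urlparse(starting_url).hostname or ""
    if base.startswith("www."):
        base = base[len("www."):]  # strip the www subdomain first
    host = urlparse(url).hostname or ""
    return host.endswith("." + base)

docs = is_subdomain("https://docs.abc.com/page", "https://www.abc.com")
other = is_subdomain("https://docs.xyz.com/page", "https://www.abc.com")
lookalike = is_subdomain("https://notabc.com/", "https://www.abc.com")
print(docs, other, lookalike)
```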

MaxAutoRedirects (integer)

The maximum number of redirects that the requests may follow. No maximum is applied when the value is 0. The default value is 7.

Example
"MaxAutoRedirects": {
  "sensitive": false,
  "value": "10"
}

MaxCrawlDelayInSeconds (integer)

The maximum crawl delay, in seconds, that the crawler respects from a Crawl-delay: X directive in the robots.txt file. MaxCrawlDelayInSeconds is only used when the robots.txt directive override source crawling setting isn't enabled in the user interface. If MaxCrawlDelayInSeconds is set to 0, the crawler doesn't respect the Crawl-delay directive. The default value is 5.

Example
"MaxCrawlDelayInSeconds": {
  "sensitive": false,
  "value": "3"
}

MaxCrawlDepth (integer)

Maximum number of link levels under your Starting URLs to include in the source. When the value is 0, only the starting URLs are crawled and all their links are ignored. The default value is 100.

Example
"MaxCrawlDepth": {
  "sensitive": false,
  "value": "10"
}

MaxPageSizeInBytes (integer)

The maximum size of a web resource in bytes. If the resource size exceeds this value, it isn’t downloaded or processed. No maximum is applied when the value is 0. The default value is 536870912 (512 MB).

Important

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example
"MaxPageSizeInBytes": {
  "sensitive": false,
  "value": "400000000"
}

MaxPagesToCrawl (integer)

The maximum number of pages to index. No maximum is applied when the value is 0. The default value is 2500000.

Important

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example
"MaxPagesToCrawl": {
  "sensitive": false,
  "value": "10000"
}

MaxPagesToCrawlPerDomain (integer)

The maximum number of pages to index per domain. No maximum is applied when the value is 0. The default value is 0.

Example
"MaxPagesToCrawlPerDomain": {
  "sensitive": false,
  "value": "8000"
}

MaxRequestsRetryCount (integer)

The maximum number of request retries for a URL, if an error occurs. No retry is attempted when the value is 0. The default value is 3.

Example
"MaxRequestsRetryCount": {
  "sensitive": false,
  "value": "5"
}

MinCrawlDelayPerDomainInMilliseconds (integer)

The number of milliseconds between consecutive HTTP requests sent to retrieve content on a given domain. The default value is 1000, which means that a maximum of 3,600 items are indexed per hour on any given domain.

Example
"MinCrawlDelayPerDomainInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}
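The 3,600 figure follows directly from the delay: one request every 1000 milliseconds amounts to 3,600,000 / 1000 requests per hour on a domain. The same arithmetic as a short Python sketch, where the function name is an assumption:

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 milliseconds

def max_requests_per_hour(min_crawl_delay_ms):
    """Upper bound on requests sent to one domain in an hour."""
    if min_crawl_delay_ms <= 0:
        raise ValueError("the delay must be a positive number of milliseconds")
    return MS_PER_HOUR // min_crawl_delay_ms

default_rate = max_requests_per_hour(1000)  # default value
example_rate = max_requests_per_hour(1500)  # value used in the example above
print(default_rate, example_rate)
```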

RequestsRetryDelayInMilliseconds (integer)

The minimum delay in milliseconds between a failed HTTP request and the ensuing retry. The default value is 1000.

Example
"RequestsRetryDelayInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}
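MaxRequestsRetryCount and RequestsRetryDelayInMilliseconds can be pictured together as a simple retry loop. This is only a sketch of the general technique, not the crawler's actual implementation: fetch_with_retries, the fetch callback, and the use of OSError as the failure signal are assumptions made for illustration.

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, retry_delay_ms=1000,
                       sleep=time.sleep):
    """Call fetch(url), retrying up to max_retries times after a failure."""
    attempt = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            if attempt >= max_retries:
                raise  # no retries left (always the case when max_retries is 0)
            attempt += 1
            sleep(retry_delay_ms / 1000)  # wait before the ensuing retry

# Hypothetical server that fails twice before answering.
calls = {"count": 0}
def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://mysite.com/", sleep=delays.append)
print(result, delays)
```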

RequestsTimeoutInSeconds (integer)

Timeout value in seconds for web requests. Web requests don’t time out when the value is set to 0. The default value is 60.

Consider increasing the value if a target website sometimes responds slowly. This helps prevent avoidable timeout errors.

Example
"RequestsTimeoutInSeconds": {
  "sensitive": false,
  "value": "60"
}

RespectUrlCasing (Boolean)

Whether page URLs that you include are case sensitive. The default value is true.

Example
"RespectUrlCasing": {
  "sensitive": false,
  "value": "true"
}
Important
When RespectUrlCasing is Set to True

The crawler uses page URLs as is.

If the same web page appears twice on the Web server (that is, with different URL casings), you can expect duplicate items in your source, one copy for each URL casing variant found by the crawler.

When web page URLs are case sensitive (that is, when RespectUrlCasing is set to true), inclusion and exclusion rules are also case sensitive.

Important
When RespectUrlCasing is Set to False

The crawler first converts the current page URL to lowercase letters. It then uses the lowercased page URL to perform an HTTP request to the Web server.

If the Web server is case-sensitive and resource URLs on the server contain uppercase letters, the Web server won’t serve the request. The way the Web server deals with this situation (for example, HTTP error) varies from site to site.

When web page URLs aren’t case sensitive (that is, when RespectUrlCasing is set to false), inclusion and exclusion rules are also not case sensitive.
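The two behaviors can be contrasted with a short Python sketch. The request_url and rule_matches helpers, the sample URLs, and the simple "contains" rule are hypothetical stand-ins; real inclusion and exclusion rules support more than substring matching.

```python
def request_url(url, respect_url_casing=True):
    """URL the crawler would request: as is, or lowercased."""
    return url if respect_url_casing else url.lower()

def rule_matches(rule, url, respect_url_casing=True):
    """Toy 'contains' rule that follows the same casing behavior."""
    if respect_url_casing:
        return rule in url
    return rule.lower() in url.lower()

found = ["https://mysite.com/Docs/Guide", "https://mysite.com/docs/guide"]

# Case sensitive: each casing variant is a separate item (possible duplicates).
case_sensitive = {request_url(u, True) for u in found}
# Not case sensitive: both variants collapse into one lowercased URL.
case_insensitive = {request_url(u, False) for u in found}

strict = rule_matches("/docs/", found[0], True)
relaxed = rule_matches("/docs/", found[0], False)
print(len(case_sensitive), len(case_insensitive), strict, relaxed)
```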

RobotsDotTextUserAgentString (string | null)

The User Agent used by the Coveo crawler to interact with the target Web content. The User Agent is compared against website robots.txt file allow/disallow directives, provided the robots.txt directive override user interface setting isn’t enabled. The default value is CoveoBot.

Example
"RobotsDotTextUserAgentString": {
  "sensitive": false,
  "value": "AcmeBot"
}

SendCookies (Boolean)

Whether the crawler can exchange HTTP cookies with the crawled site. The default value is true.

Example
"SendCookies": {
  "sensitive": false,
  "value": "false"
}

SkipOnStartingAddressError (Boolean)

Whether the crawler should skip a Starting URL that it can't retrieve. This setting may be useful on a source that includes multiple starting URLs. When SkipOnStartingAddressError is set to true, a rebuild doesn’t fail when it's unable to retrieve a given starting URL. Instead, it moves on to the next starting URL specified in the source configuration user interface. The default value is true.

Important

When SkipOnStartingAddressError is set to true, if a rebuild or rescan is unable to access a starting URL, existing source content associated with that starting URL is deleted.

Example
"SkipOnStartingAddressError": {
  "sensitive": false,
  "value": "false"
}

UseHiddenBasicAuthentication (Boolean)

Whether the crawler should send a basic authentication header when the crawled website responds with an HTTP 404 Not Found status code. This is unsafe in most cases and thus UseHiddenBasicAuthentication is false by default.

Example
"UseHiddenBasicAuthentication": {
  "sensitive": false,
  "value": "true"
}

UseProxy (Boolean)

Important

Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.

Example
"UseProxy": {
  "sensitive": false,
  "value": "false"
}

UserAgentString (string)

Identifier used by the Coveo crawler when requesting pages from the web server. The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).

Example
"UserAgentString": {
  "sensitive": false,
  "value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}