Web Source JSON Modification

Many source configuration parameters can be set through the Coveo Administration Console user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration.

This article presents Web source hidden parameters that you can change by modifying the source JSON configuration from the Coveo Administration Console. All parameters must be added to the parameters section of the source JSON configuration.

Important

Set the sensitive attribute of a parameter to true if its value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

For instructions on accessing the JSON file and best practices to follow, see Edit a source JSON configuration.

AdditionalHeaders (Object | null)

Specifies a map of HTTP headers, in JSON format, to send with every HTTP request. The value must be a valid JSON map serialized as a string, in the format {\"header1\": \"value1\", \"header2\": \"value2\", … }. The default value is null.

Example
"AdditionalHeaders": {
  "sensitive": true,
  "value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"
}
Important
  • When the Render-Javascript option is enabled, the Web connector doesn’t support the AdditionalHeaders parameter.

  • Don’t use the AdditionalHeaders parameter to send the Authorization header.
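Because the AdditionalHeaders value is a JSON map serialized as a string, the quotes inside it must be escaped, as in the example above. Conceptually, the crawler parses that string back into a map and merges it into the headers of every request it sends. The Python sketch below illustrates the round trip; it's not the connector's actual code, and the CoveoBot User-Agent line is only there to stand in for the crawler's own default headers:

```python
import json

# The configured value: a JSON map serialized as a string (hence the
# escaped quotes in the source JSON configuration).
value = "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"

# The crawler conceptually parses the string back into a map...
extra_headers = json.loads(value)

# ...and merges it into the headers of every HTTP request it sends.
request_headers = {"User-Agent": "CoveoBot", **extra_headers}
print(request_headers["X-CSRF-Token"])  # 6b609652ef8e448eb6de270139d489c3
```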

ExpandBeforeFiltering (Boolean)

Whether the crawler expands the pages before applying the inclusion and exclusion filters. This can slow down the crawling process, since filtered pages are still downloaded. The default value is false.

Example

You want to index the pages that are hyperlinked on https://mysite.com/, but you don’t want to index https://mysite.com/ itself. You therefore proceed as follows:

  1. You set your Site URL value to https://mysite.com/ in the source configuration user interface.

  2. You add an exclusion filter for https://mysite.com/ in the source configuration user interface Content to Include settings section.

  3. You set ExpandBeforeFiltering to true.

"ExpandBeforeFiltering": {
  "sensitive": false,
  "value": "true"
}
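The steps above can be sketched as follows. This is an illustrative Python snippet with a hypothetical link graph, not the connector's actual logic: with ExpandBeforeFiltering set to true, the excluded root page is still downloaded so its links can be discovered, but only non-excluded pages end up in the index.

```python
# Hypothetical link graph: the root page links to two subpages.
links = {"https://mysite.com/": ["https://mysite.com/a", "https://mysite.com/b"]}
excluded = {"https://mysite.com/"}

root = "https://mysite.com/"
# With ExpandBeforeFiltering=true, the root is expanded (downloaded and
# its links extracted) before the exclusion filter is applied, so the
# subpages are still discovered even though the root itself is excluded.
discovered = [root] + links[root]
to_index = [url for url in discovered if url not in excluded]
print(to_index)  # ['https://mysite.com/a', 'https://mysite.com/b']
```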

FollowAutoRedirects (Boolean)

Whether the crawler should follow automatic redirections. The default value is true.

Example
"FollowAutoRedirects": {
  "sensitive": false,
  "value": "false"
}

FollowCanonicalLink (Boolean)

Whether the crawler should redirect to the canonical link (if any) specified in the <head> section of the page. Using canonical links helps reduce duplicates, the canonical link being the "preferred" version of a page. The default value is false.

Example
"FollowCanonicalLink": {
  "sensitive": false,
  "value": "true"
}

IgnoreUrlFragment (Boolean)

Whether to ignore the fragment part of discovered page URLs. The fragment part of a URL is everything after the hash (#) sign. The default value is true. Setting IgnoreUrlFragment to false may be necessary, for example, with single-page web apps.

Example
"IgnoreUrlFragment": {
  "sensitive": false,
  "value": "false"
}
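For reference, the fragment can be isolated with Python's standard library. The URL below is a made-up example of a single-page app route carried in the fragment:

```python
from urllib.parse import urldefrag

# Hypothetical single-page app URL whose route lives in the fragment.
url = "https://mysite.com/app#/products/42"
base, fragment = urldefrag(url)
print(base)      # https://mysite.com/app
print(fragment)  # /products/42
# With IgnoreUrlFragment=true, the crawler treats this URL as
# https://mysite.com/app; with false, the fragment is kept, so distinct
# routes can be crawled as distinct pages.
```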

IndexExternalPages (Boolean)

Whether to index linked web pages that aren’t part of the domain of the specified Site URL. The default value is false.

Note

The IndexExternalPages parameter has no bearing on whether Site URL subdomain pages are indexed.

Example
"IndexExternalPages": {
  "sensitive": false,
  "value": "true"
}
Example

Your Site URL is https://www.mycompany.com and one of its pages contains a link to https://en.wikipedia.org/. If IndexExternalPages is set to false, the linked Wikipedia page isn’t crawled and included in your searchable content.

However, a local URL that redirects the visitor to an external page will override this setting.

Example

Your Site URL is https://www.mycompany.com and one of its pages contains a link to https://www.mycompany.com/mypage. When clicked, this link redirects the reader to https://en.wikipedia.org/. The Wikipedia page will be crawled and included in your searchable content.

IndexExternalPagesLinks (Boolean)

Whether to index pages linked from external web pages. The default value is false.

Note

IndexExternalPages must be set to true for IndexExternalPagesLinks to be taken into account.

Important

Indexing pages linked from external pages can lead the crawler anywhere on the Internet, endlessly discovering linked pages on other websites.

If you do enable this option, you should add one or more filters, preferably Inclusion filters, to restrict the discovery to identifiable sites.

Example
"IndexExternalPagesLinks": {
  "sensitive": false,
  "value": "true"
}

IndexJsonLdMetadata (Boolean)

Whether to index metadata from JSON-LD <script> tags. The default value is false.

When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.

Example

Given the following JSON-LD script tag in a web page:

<script id="jsonld" type="application/ld+json">
   {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "url": "http://www.bbc.com/news/world-us-canada-39324587",
      "publisher": {
          "@type": "Organization",
          "name": "BBC News",
          "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
      },
      "headline": "Canada Strikes Gold in Olympic Hockey Final"
   }
</script>

To index the publisher name value (i.e., BBC News in this page) in an index field:

  1. Set your configuration as follows:

    "IndexJsonLdMetadata": {
      "sensitive": false,
      "value": "true"
    }
  2. Use the following mapping rule for your field: %[jsonld.publisher.name]

Note

Contextual JSON-LD key-value pairs whose keys begin with @ aren’t converted into metadata.
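The flattening described above can be approximated as follows. This is an illustrative sketch, not Coveo's actual implementation: nested objects become dotted jsonld.parent.child keys, and @-prefixed contextual keys are skipped.

```python
def flatten_jsonld(obj, prefix="jsonld"):
    """Flatten a JSON-LD object into dotted metadata keys, skipping
    contextual keys that begin with "@" (illustrative sketch only)."""
    meta = {}
    for key, value in obj.items():
        if key.startswith("@"):
            continue  # contextual keys aren't converted into metadata
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            meta.update(flatten_jsonld(value, name))
        else:
            meta[name] = value
    return meta

article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "publisher": {"@type": "Organization", "name": "BBC News"},
    "headline": "Canada Strikes Gold in Olympic Hockey Final",
}
print(flatten_jsonld(article)["jsonld.publisher.name"])  # BBC News
```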

MaxAutoRedirects (Integer)

The maximum number of redirects that the requests may follow. No maximum is applied when the value is 0. The default value is 7.

Example
"MaxAutoRedirects": {
  "sensitive": false,
  "value": "10"
}

MaxCrawlDelayInSeconds (Integer)

The maximum number of seconds to honor from the Crawl-delay: X directive of the robots.txt file. The Respect robots.txt directives source crawling setting in the user interface must be enabled for MaxCrawlDelayInSeconds to be used. If MaxCrawlDelayInSeconds is set to 0, the crawler won't respect the Crawl-delay directive of the robots.txt file. The default value is 5.

Example
"MaxCrawlDelayInSeconds": {
  "sensitive": false,
  "value": "3"
}

MaxCrawlDepth (Integer)

Maximum number of link levels under the Site URL root page to include in the source. When the value is 0, only the Site URL root page is crawled and all of its links are ignored. The default value is 100.

Example
"MaxCrawlDepth": {
  "sensitive": false,
  "value": "10"
}

MaxPageSizeInBytes (Integer)

The maximum size of a web resource in bytes. If the resource size exceeds this value, it isn’t downloaded or processed. No maximum is applied when the value is 0. The default value is 536870912 (512 MB).

Important

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example
"MaxPageSizeInBytes": {
  "sensitive": false,
  "value": "400000000"
}

MaxPagesToCrawl (Integer)

The maximum number of pages to index. No maximum is applied when the value is 0. The default value is 2500000.

Important

This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.

Example
"MaxPagesToCrawl": {
  "sensitive": false,
  "value": "10000"
}

MaxPagesToCrawlPerDomain (Integer)

The maximum number of pages to index per domain. No maximum is applied when the value is 0. The default value is 0.

Example
"MaxPagesToCrawlPerDomain": {
  "sensitive": false,
  "value": "8000"
}

MaxRequestsRetryCount (Integer)

The maximum number of request retries for a URL, if an error occurs. No retry is attempted when the value is 0. The default value is 3.

Example
"MaxRequestsRetryCount": {
  "sensitive": false,
  "value": "5"
}

MinCrawlDelayPerDomainInMilliseconds (Integer)

The number of milliseconds between consecutive HTTP requests sent to retrieve content on a given domain. The default value is 1000, which means that a maximum of 3,600 items are indexed per hour on any given domain.

Example
"MinCrawlDelayPerDomainInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}
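The 3,600-items-per-hour figure follows directly from the delay: one hour contains 3,600,000 milliseconds, so the theoretical per-domain ceiling is 3,600,000 divided by the delay. A quick sketch (which ignores download time itself):

```python
def max_items_per_hour(min_delay_ms: int) -> float:
    """Theoretical per-domain crawl ceiling given the minimum delay
    between consecutive HTTP requests (ignores download time)."""
    return 3_600_000 / min_delay_ms

print(max_items_per_hour(1000))  # 3600.0  (the default)
print(max_items_per_hour(1500))  # 2400.0
```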

RequestsRetryDelayInMilliseconds (Integer)

The minimum delay in milliseconds between a failed HTTP request and the ensuing retry. The default value is 1000.

Example
"RequestsRetryDelayInMilliseconds": {
  "sensitive": false,
  "value": "1500"
}
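MaxRequestsRetryCount and RequestsRetryDelayInMilliseconds work together. The sketch below is illustrative only, with a stand-in fetch function rather than the connector's actual code: one initial request plus up to the configured number of retries, each retry preceded by the configured delay.

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, retry_delay_ms=1000):
    """Illustrative retry loop: one initial request plus up to
    max_retries retries, waiting retry_delay_ms before each retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error
            time.sleep(retry_delay_ms / 1000)
```

With the default values, a persistently failing URL is therefore requested four times in total: the initial attempt plus three retries, each one second apart.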

RequestsTimeoutInSeconds (Integer)

Timeout value in seconds for web requests. Web requests don't time out when the value is set to 0. The default value is 60.

Consider increasing the value if a target website sometimes responds slowly, as this prevents avoidable request errors.

Example
"RequestsTimeoutInSeconds": {
  "sensitive": false,
  "value": "60"
}

RespectUrlCasing (Boolean)

Whether page URLs that you include are case sensitive. The default value is true.

Example
"RespectUrlCasing": {
  "sensitive": false,
  "value": "true"
}
Important
When RespectUrlCasing is Set to True

The crawler uses page URLs as is.

If the same web page appears twice on the Web server (i.e., with different URL casings), you can expect duplicate items in your source, one copy for each URL casing variant found by the crawler.

When web page URLs are case sensitive (i.e., when RespectUrlCasing is set to true), inclusion and exclusion filters are also case sensitive.

Important
When RespectUrlCasing is Set to False

The crawler first converts the current page URL to lowercase letters. It then uses the lowercased page URL to perform an HTTP request to the Web server.

If the Web server is case-sensitive and resource URLs on the server contain uppercase letters, the Web server won’t serve the request. The way the Web server deals with this situation (e.g., HTTP error) varies from site to site.

When web page URLs aren’t case sensitive (i.e., when RespectUrlCasing is set to false), inclusion and exclusion filters are also not case sensitive.
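In other words, with RespectUrlCasing set to false, every discovered URL is normalized to lowercase before the request is sent and before filters are matched. A conceptual sketch, not the connector's actual code:

```python
def normalize_url(url: str, respect_url_casing: bool) -> str:
    """When RespectUrlCasing is false, the crawler lowercases the URL
    before requesting it, so casing variants collapse into one item."""
    return url if respect_url_casing else url.lower()

print(normalize_url("https://mysite.com/Products/Item", False))
# https://mysite.com/products/item
print(normalize_url("https://mysite.com/Products/Item", True))
# https://mysite.com/Products/Item
```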

RobotsDotTextUserAgentString (String | null)

The User Agent used by the Coveo crawler to interact with the target Web content. The User Agent is compared against website robots.txt file allow/disallow directives, provided the Respect robots.txt directives source crawling setting in the user interface is enabled. The default value is CoveoBot.

Example
"RobotsDotTextUserAgentString": {
  "sensitive": false,
  "value": "AcmeBot"
}

SendCookies (Boolean)

Whether the crawler can exchange HTTP cookies with the crawled website. The default value is true.

Example
"SendCookies": {
  "sensitive": false,
  "value": "false"
}

SkipOnStartingAddressError (Boolean)

Whether the crawler should skip a starting address when it's erroneous. This setting may be useful on a source that includes multiple Site URL (i.e., starting address) values. When SkipOnStartingAddressError is set to true, a rebuild doesn't fail when it can't retrieve a given Site URL; instead, it moves on to the next Site URL specified in the source configuration user interface. The default value is true.

Important

When SkipOnStartingAddressError is set to true, if a rebuild or rescan is unable to access a starting address, existing source content associated with that starting address is deleted.

Example
"SkipOnStartingAddressError": {
  "sensitive": false,
  "value": "false"
}

UseHiddenBasicAuthentication (Boolean)

Whether the crawler should send a basic authentication header when the crawled website responds with an HTTP 404 Not Found status code. Because this is unsafe in most cases, UseHiddenBasicAuthentication is false by default.

Example
"UseHiddenBasicAuthentication": {
  "sensitive": false,
  "value": "true"
}

UseProxy (Boolean)

Important

This parameter is specifically for Crawling Module Web sources. Don’t use this parameter in a cloud Web source.

Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.

Example
"UseProxy": {
  "sensitive": false,
  "value": "false"
}