Sitemap source JSON modification

Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration. This article presents Sitemap source parameters that you can change by modifying the source JSON configuration.

If the parameter you want to change is already present in the parameters section of the source JSON configuration, modify only its value and sensitive attributes, if necessary. If the parameter isn’t present in the parameters section, add the entire parameter object to the parameters section, and then set its sensitive and value attributes. Each parameter listed in this article includes an example of how it appears or should appear in the source JSON configuration.
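
As an illustration of the rule above, the following Python sketch applies it to a configuration loaded as a dictionary. It assumes the parameters section is a top-level parameters object keyed by parameter name. The set_parameter helper and the trimmed-down source_config are hypothetical and not part of any Coveo tool.

```python
import json


def set_parameter(config, name, value, sensitive=False):
    """Update a parameter in place, or add the whole object if it is missing."""
    parameters = config.setdefault("parameters", {})
    if name in parameters:
        # Parameter already present: only touch its value and sensitive attributes.
        parameters[name]["value"] = value
        parameters[name]["sensitive"] = sensitive
    else:
        # Parameter absent: add the entire parameter object.
        parameters[name] = {"sensitive": sensitive, "value": value}
    return config


# Hypothetical, trimmed-down source configuration.
source_config = {"parameters": {"Timeout": {"sensitive": False, "value": "100"}}}

set_parameter(source_config, "Timeout", "120")      # existing parameter: value changes
set_parameter(source_config, "UseCookies", "true")  # missing parameter: object is added
print(json.dumps(source_config, indent=2))
```

Note that values are kept as strings, as in the examples throughout this article.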

Important

Set the sensitive attribute of a parameter to true if its value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

For instructions on accessing the JSON file and best practices to follow, see Edit a source JSON configuration.

AdditionalHeaders (String)

Semicolon-separated list of additional HTTP headers added to the connector requests in the following format: key1\\=value1\\;key2\\=value2. The parameter is empty by default.

Important
  • You can’t use manual cookies if you crawl content dynamically rendered by JavaScript (if the JavaScript-rendered checkbox is selected in the Sitemap source configuration).

  • Don’t use the AdditionalHeaders parameter to send the Authorization header. This will generate an error as soon as you start an indexing action.

Example
"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=<CSRF_TOKEN_VALUE>"
}

where you replace <CSRF_TOKEN_VALUE> with the actual token.
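
When several headers are needed, the value can be assembled programmatically. The Python sketch below assumes that the doubled backslashes shown in the format are JSON escaping for a single backslash. The build_additional_headers helper and the header names and values are hypothetical.

```python
import json


def build_additional_headers(headers):
    """Return the raw AdditionalHeaders string for a dict of header names and values."""
    if any(name.lower() == "authorization" for name in headers):
        # The article warns that sending Authorization this way causes an error.
        raise ValueError("Don't send the Authorization header through AdditionalHeaders")
    # "\\=" in Python source is a single backslash followed by "=".
    return "\\;".join(f"{name}\\={value}" for name, value in headers.items())


raw_value = build_additional_headers({"X-CSRF-Token": "abc123", "X-Env": "staging"})
parameter = {"AdditionalHeaders": {"sensitive": True, "value": raw_value}}

# json.dumps doubles each backslash, which gives the escaped form used in the JSON configuration.
print(json.dumps(parameter, indent=2))
```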

AllowAutoRedirect (Boolean)

Whether a crawler request should automatically follow redirection responses from the web resource.

When set to true, the crawler only performs a single HTTP request (that is, for the current page). It automatically follows server HTTP redirect responses to reach the final redirection page.

When set to false, the crawler performs an HTTP request for the current page, and then another for each server HTTP redirect response until it reaches the final page.

The default value is true.

Example
"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
}

DateFormat (String)

(For XML sitemaps only) The custom date format used in the Sitemap file. Specify it when the last modification dates aren’t in a standard format (such as YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console. The format must use the MSDN format specifiers (see Custom date and time format strings).

Example
"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
}
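
To see what kind of date string the format in this example describes, the Python sketch below parses a hypothetical last modification date with a rough strptime equivalent of the pattern. This is only an analogy: the connector uses the .NET-style specifiers, not Python's, and the sample date is made up.

```python
from datetime import datetime

# Rough Python equivalent of the custom format "yyyy;MM;ddTHH:mm:sszzz".
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"

# Hypothetical last modification date written with semicolons in the date part.
lastmod = "2023;04;15T13:45:30-04:00"

parsed = datetime.strptime(lastmod, PYTHON_EQUIVALENT)
print(parsed.isoformat())  # 2023-04-15T13:45:30-04:00
```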

EnableJavaScriptRenderingOptimizations (Boolean)

Whether to enable JavaScript rendering optimizations. When set to true, the crawler doesn’t download images and external files. The default value is true.

Example where you would set EnableJavaScriptRenderingOptimizations to false

On a page, you have a dynamically generated table. The data in the table comes from a JSON file, downloaded from a server, using JavaScript. To index the table data, you would need to set EnableJavaScriptRenderingOptimizations to false.

Example
"EnableJavaScriptRenderingOptimizations": {
  "sensitive": false,
  "value": "false"
}

ForceBasicAuthorizationHeader (Boolean)

Whether to enforce basic header authentication. The default value is false. Set it to true when, for example, your server doesn’t challenge the caller for authentication, or when you get an HTTP 404 error (which often occurs on non-IIS servers) in the Administration Console that looks like the following:

Exception during item expansion: https://myorgwebsite.com/basicauth/user/password. -> The remote server returned an error: (404) Not Found.

Example
"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
}

HtmlXPathSelectorExpression (String)

The XPath expression used to select one or more nodes of an HTML format sitemap file containing the URLs to crawl. The parameter is empty by default, which results in the connector indexing all web pages listed in the sitemap file.

Example

You only want to index a specific part of the CBC sitemap web page (only the web pages linked inside the cbc-sitemap div container), so you add the following:

"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "/div[@id='cbc-sitemap']"
}

Note

The ParseSitemapInStrictMode JSON parameter must also be set to false since an HTML format sitemap file doesn’t follow the Sitemap protocol.

IndexHtmlMetadata (Boolean)

Whether metadata tags found in HTML files should be indexed by the Sitemap crawler. When enabled, the content attribute of <meta> tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv. The default value is false since the parameter has an impact on indexing performance.

By default, the Coveo converter extracts metadata from <meta> HTML tags with a name attribute more efficiently. Therefore, consider enabling this option only when you want to extract from <meta> tags with a property, itemprop, or http-equiv attribute.

Example
"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
}

See Index HTML page metadata for more information.

NumberOfRetries (Integer)

The number of retries allowed when a failed web request is recoverable. Only the following HTTP errors will be retried: 408, 500, 503, and 504. The default value is 3 retries.

Example
"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
}
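
The retry rule can be sketched as follows in Python. This is a simplified model of the behavior described above, not the connector's implementation: the fetch_with_retries helper is hypothetical, the server is simulated, and any delay between attempts is left out because the article doesn't describe one.

```python
RETRYABLE_STATUS_CODES = {408, 500, 503, 504}


def fetch_with_retries(send_request, number_of_retries=3):
    """Call send_request until it returns a non-retryable status or retries run out.

    Returns the final status code and the total number of attempts made.
    """
    attempts = 0
    while True:
        status = send_request()
        attempts += 1
        retries_used = attempts - 1
        if status not in RETRYABLE_STATUS_CODES or retries_used >= number_of_retries:
            return status, attempts


# Simulated server: two recoverable failures, then success.
responses = iter([503, 504, 200])
result = fetch_with_retries(lambda: next(responses), number_of_retries=5)
print(result)  # (200, 3)
```

With the default of 3 retries, a request that keeps failing with a recoverable error is attempted four times in total in this model: the initial attempt plus three retries.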

ParsableContentTypes (String)

A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is "application/xhtml+xml", "application/xml", "text/html".

Example
"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
}

ParsableContentTypesSuffixes (String)

A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is +xml.

Example
"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
}
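
The two parameters ParsableContentTypes and ParsableContentTypesSuffixes can be pictured as working together, as in the Python sketch below. The matching rules (exact match on the media type, then an ends-with check on the suffix) are an assumption made for illustration, and is_parsable is a hypothetical helper.

```python
import json

# Parameter values as stored in the JSON configuration: JSON lists held in strings.
parsable_content_types = json.loads('["application/xhtml+xml", "application/xml", "text/html"]')
parsable_suffixes = json.loads('["+xml", "+json"]')


def is_parsable(content_type_header):
    """Return True if the media type is listed or ends with a listed suffix."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type in parsable_content_types:
        return True
    return any(media_type.endswith(suffix) for suffix in parsable_suffixes)


print(is_parsable("text/html; charset=utf-8"))  # True
print(is_parsable("application/ld+json"))       # True
print(is_parsable("image/png"))                 # False
```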

ParseSitemapAlternateLinks (Boolean)

Whether the Sitemap <url> element child <xhtml:link> alternate language links should be crawled. The default value is false.

Note

When ParseSitemapAlternateLinks is set to true, if an <xhtml:link> element has its hreflang attribute set to x-default, the corresponding href URL will be crawled unless this URL has already been crawled as the <loc> element text value.

Example
"ParseSitemapAlternateLinks": {
  "sensitive": false,
  "value": "true"
}
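
The effect of ParseSitemapAlternateLinks on the list of crawled URLs can be sketched in Python as follows. The sample sitemap and the urls_to_crawl helper are hypothetical, and the sketch only models the rule stated in the note above: an alternate URL is added unless it has already been collected.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://myorgwebsite.com/en/page</loc>
    <xhtml:link rel="alternate" hreflang="fr" href="https://myorgwebsite.com/fr/page"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://myorgwebsite.com/en/page"/>
  </url>
</urlset>"""


def urls_to_crawl(sitemap_xml, parse_alternate_links):
    """List URLs in document order, skipping any URL already collected."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url_element in root.findall("sm:url", NAMESPACES):
        candidates = [url_element.findtext("sm:loc", namespaces=NAMESPACES)]
        if parse_alternate_links:
            for link in url_element.findall("xhtml:link", NAMESPACES):
                if link.get("rel") == "alternate":
                    candidates.append(link.get("href"))
        for candidate in candidates:
            if candidate and candidate not in urls:
                urls.append(candidate)
    return urls


print(urls_to_crawl(SITEMAP_XML, parse_alternate_links=False))
print(urls_to_crawl(SITEMAP_XML, parse_alternate_links=True))
```

In this sample, the x-default link points to the same URL as the <loc> element, so enabling the parameter only adds the French page.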

ParseSitemapInStrictMode (Boolean)

Whether each Sitemap file should be parsed in strict mode. The default value is false.

When ParseSitemapInStrictMode is set to false, a URL must only be well-formatted (that is, an absolute HTTP or HTTPS URL) to be considered valid. Non-valid URLs are skipped.

When ParseSitemapInStrictMode is set to true, the Sitemap source also performs the following validations on sitemap files and sitemap index files:

  • The uncompressed file must be no larger than 10 MB (even if the file is compressed with GZIP). If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.

  • The file can’t contain more than 50,000 URLs. If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.

  • A referenced URL must be relative to the sitemap that references it and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URL starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/. If a URL doesn’t meet this condition, it’s skipped.

  • A referenced URL in the sitemap file must be less than 2,048 characters long. If a URL doesn’t meet this condition, it’s skipped.

Example
"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
}
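
The strict mode validations listed above can be sketched in Python as follows. This is an illustrative model only: check_file and is_url_allowed are hypothetical helpers, 10 MB is assumed to mean 10 × 1,048,576 bytes, and the location rule is reduced to a simple prefix comparison.

```python
MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048


def check_file(uncompressed_size, url_count):
    """File-level checks: a failure means the whole file isn't processed."""
    if uncompressed_size > MAX_UNCOMPRESSED_BYTES:
        return "file larger than 10 MB"
    if url_count > MAX_URLS:
        return "more than 50,000 URLs"
    return None


def is_url_allowed(sitemap_url, url):
    """URL-level checks: a failure means only this URL is skipped."""
    if len(url) >= MAX_URL_LENGTH:
        return False
    # Location of the sitemap file, for example ".../tech/" for ".../tech/sitemap.xml".
    base = sitemap_url.rsplit("/", 1)[0] + "/"
    return url.startswith(base)


sitemap = "http://myorgwebsite.com/tech/sitemap.xml"
print(check_file(uncompressed_size=2_000_000, url_count=1_200))         # None
print(is_url_allowed(sitemap, "http://myorgwebsite.com/tech/page"))     # True
print(is_url_allowed(sitemap, "http://myorgwebsite.com/catalog/item"))  # False
```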

ReadTimeout (Integer)

The timeout duration in seconds when the connector reads web page content from a stream (that is, downloading a Sitemap/web page content). The default value is 300 seconds. When not receiving input from a stream, the crawler will wait for this duration before moving on.

Example
"ReadTimeout": {
  "sensitive": false,
  "value": "500"
}

SkipOnSitemapError (Boolean)

Whether the crawler should skip a sitemap instead of stopping when encountering an exception. The default value is false.

Skipping may occur on an exception in one of the source Sitemap URLs, or in a sitemap referenced in one of the Sitemap URLs. When the crawler skips a sitemap, no error message is displayed in the Coveo Administration Console and Coveo doesn’t delete existing indexed items located under that sitemap directory.

Example
"SkipOnSitemapError": {
  "sensitive": false,
  "value": "true"
}
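
The difference between the two settings of SkipOnSitemapError can be pictured with the Python sketch below. The crawl_sitemaps and fake_parse functions are hypothetical stand-ins, not the crawler's code, and the URLs are made up.

```python
def crawl_sitemaps(sitemap_urls, parse_sitemap, skip_on_sitemap_error=False):
    """Parse each sitemap; on failure, skip it or stop depending on the flag."""
    collected, skipped = [], []
    for sitemap_url in sitemap_urls:
        try:
            collected.extend(parse_sitemap(sitemap_url))
        except Exception:
            if not skip_on_sitemap_error:
                raise  # Default behavior: stop on the first failing sitemap.
            skipped.append(sitemap_url)  # Skip it and continue with the next one.
    return collected, skipped


def fake_parse(sitemap_url):
    # Stand-in parser: pretend that one sitemap is malformed.
    if "broken" in sitemap_url:
        raise ValueError("malformed sitemap")
    return [sitemap_url.replace("sitemap.xml", "page.html")]


sitemaps = [
    "https://myorgwebsite.com/a/sitemap.xml",
    "https://myorgwebsite.com/broken/sitemap.xml",
    "https://myorgwebsite.com/b/sitemap.xml",
]
pages, skipped_sitemaps = crawl_sitemaps(sitemaps, fake_parse, skip_on_sitemap_error=True)
print(pages)
print(skipped_sitemaps)
```

With skip_on_sitemap_error left at False, the same call raises the parser's exception when it reaches the second sitemap, and the third one is never processed.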

Timeout (Integer)

The number of seconds to wait before a request times out (that is, the time a server can take to respond to a request). The default value is 100 seconds.

Example
"Timeout": {
  "sensitive": false,
  "value": "100"
}

UseCookies (Boolean)

Whether cookies must be enabled to crawl. The default value is false. Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.

Example
"UseCookies": {
  "sensitive": false,
  "value": "true"
}

UseProxy (Boolean)

Important

Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.

Example
"UseProxy": {
  "sensitive": false,
  "value": "false"
}

UserAgent (String)

The value of the user agent header sent in the HTTP request. The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).

Example
"UserAgent": {
  "sensitive": false,
  "value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}