Sitemap source JSON modification

Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be configured in the source JSON configuration.

This article explains how to configure Sitemap source parameters, whether they’re already listed in the JSON or not.

Configuring listed and unlisted parameters

If the parameter you want to change is already listed in the parameters section of the source JSON configuration, just modify its value in the JSON configuration.

If the parameter isn’t listed in the parameters section, copy the entire parameter example object from the Reference section below and paste it into the parameters section of the source JSON configuration. Then, modify the value in the JSON configuration, if necessary.

Important

If a parameter has a value attribute that contains sensitive information, set the sensitive attribute to true. Otherwise, the value will appear in clear text in the JSON configuration.

Tip

Document the changes you make to the source JSON configuration in the Change notes area below the JSON configuration. This ensures that you can easily revert to a previous configuration if needed.
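
The procedure above can also be sketched programmatically, for example when you prepare the JSON outside the Administration Console. This is a minimal sketch that assumes the source configuration has been loaded into a Python dictionary with a top-level parameters key; the set_parameter helper is hypothetical and isn't part of any Coveo tooling.

```python
import copy


def set_parameter(source_config, name, value, sensitive=None):
    """Return a copy of source_config with one parameter added or updated."""
    updated = copy.deepcopy(source_config)
    # Assumption: parameters live under a top-level "parameters" key.
    parameters = updated.setdefault("parameters", {})

    # Parameter values in this article are JSON strings, booleans included.
    if isinstance(value, bool):
        text = "true" if value else "false"
    else:
        text = str(value)

    # Listed parameter: keep the object, change its value.
    # Unlisted parameter: create the whole object first.
    entry = parameters.setdefault(name, {"sensitive": False, "value": ""})
    entry["value"] = text
    if sensitive is not None:
        entry["sensitive"] = sensitive
    return updated


config = {"parameters": {"Timeout": {"sensitive": False, "value": "100"}}}
config = set_parameter(config, "Timeout", 120)      # already listed
config = set_parameter(config, "UseCookies", True)  # not listed yet
```

The helper returns a modified copy rather than editing the dictionary in place, so the previous configuration stays available if you need to revert.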

Reference

This section provides information on the Sitemap source parameters that you can only modify through the JSON configuration.

If a JSON configuration parameter isn’t documented in this article, configure it through the user interface instead.

AdditionalHeaders (String)

Semicolon-separated list of additional HTTP headers to add to the connector requests, in the following format: key1\\=value1\\;key2\\=value2. The parameter is empty by default.

Important

  • You can’t use manual cookies if you crawl content dynamically rendered by JavaScript (if the JavaScript-rendered checkbox is selected in the Sitemap source configuration).

  • Don’t use the AdditionalHeaders parameter to send the Authorization header. This will generate an error as soon as you start an indexing action.

Example
"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=<CSRF_TOKEN_VALUE>"
}

where you replace <CSRF_TOKEN_VALUE> with the actual token.
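
When several headers are needed, the value can be assembled programmatically. The sketch below assumes the doubled backslashes in the format above are JSON escaping of a single backslash, so the decoded value uses \= and \; as separators. The build_additional_headers helper and the X-Env header are hypothetical.

```python
import json


def build_additional_headers(headers, sensitive=True):
    """Build the AdditionalHeaders parameter object from a dict of headers."""
    for name in headers:
        # The Authorization header must not be sent through this parameter.
        if name.lower() == "authorization":
            raise ValueError("Don't send Authorization through AdditionalHeaders")
    # Decoded separators: backslash-equals and backslash-semicolon.
    value = "\\;".join(f"{name}\\={content}" for name, content in headers.items())
    return {"AdditionalHeaders": {"sensitive": sensitive, "value": value}}


entry = build_additional_headers({"X-CSRF-Token": "abc", "X-Env": "staging"})
# json.dumps doubles each backslash, which gives the form used in the JSON.
print(json.dumps(entry, indent=2))
```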

AllowAutoRedirect (Boolean)

Whether a crawler request should automatically follow redirection responses from the web resource.

When set to true, the crawler only performs a single HTTP request (that is, for the current page). It automatically follows server HTTP redirect responses to reach the final redirection page.

When set to false, the crawler performs an HTTP request for the current page, and then another for each server HTTP redirect response until it reaches the final page.

The default value is true.

Example
"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
}
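
The difference between the two settings can be pictured with a small simulation. This sketch doesn't send any HTTP request and doesn't reflect the connector's implementation; the REDIRECTS table and the crawl function are hypothetical and only count the requests the crawler itself would issue.

```python
# Hypothetical redirect chain: old -> interim -> new.
REDIRECTS = {
    "http://example.com/old": "http://example.com/interim",
    "http://example.com/interim": "http://example.com/new",
}


def crawl(url, allow_auto_redirect=True):
    """Return the final URL and the number of crawler-issued requests."""
    crawler_requests = 1  # the request for the current page
    current = url
    while current in REDIRECTS:
        current = REDIRECTS[current]
        if not allow_auto_redirect:
            # The crawler issues one more request per redirect response.
            crawler_requests += 1
    return current, crawler_requests


print(crawl("http://example.com/old", allow_auto_redirect=True))
print(crawl("http://example.com/old", allow_auto_redirect=False))
```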

DateFormat (String)

(For XML sitemaps only) The custom date format used in the Sitemap file. Specify this parameter when the last modification dates aren’t in a standard format such as YYYY-MM-DDThh:mm:ss.sTZD, and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console. The format must use the MSDN format specifiers (see Custom date and time format strings).

Example
"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
}
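
Before saving a custom format, it can help to check it against a sample date taken from your sitemap. The sketch below is only an approximation: it translates a small, hand-picked subset of the format specifiers to Python strptime codes, which is a stand-in for the parser the connector actually uses. The mapping, the helper names, and the sample dates are assumptions.

```python
import re
from datetime import datetime

# Partial, hand-written mapping from custom specifiers to strptime codes.
NET_TO_STRPTIME = {
    "yyyy": "%Y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S", "zzz": "%z",
}
# Longest specifiers first so "yyyy" and "zzz" are matched whole.
_SPECIFIER = re.compile("|".join(sorted(NET_TO_STRPTIME, key=len, reverse=True)))


def to_strptime(custom_format):
    """Translate the supported specifiers; other characters stay literal."""
    return _SPECIFIER.sub(lambda m: NET_TO_STRPTIME[m.group(0)], custom_format)


def matches_date_format(lastmod, custom_format):
    """Return True when the sample date parses with the translated format."""
    try:
        datetime.strptime(lastmod, to_strptime(custom_format))
    except ValueError:
        return False
    return True


print(matches_date_format("2024;03;15T08:30:00-04:00", "yyyy;MM;ddTHH:mm:sszzz"))
```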

EnableJavaScriptRenderingOptimizations (Boolean)

Whether to enable JavaScript rendering optimizations. When set to true, the crawler doesn’t download images and external files. The default value is true.

Example where you would set EnableJavaScriptRenderingOptimizations to false

On a page, you have a dynamically generated table. The data in the table comes from a JSON file, downloaded from a server, using JavaScript. To index the table data, you would need to set EnableJavaScriptRenderingOptimizations to false.

Example
"EnableJavaScriptRenderingOptimizations": {
  "sensitive": false,
  "value": "false"
}

ForceBasicAuthorizationHeader (Boolean)

Whether to enforce basic header authentication. The default value is false.

Note

To use ForceBasicAuthorizationHeader, the Execute JavaScript on pages option must be disabled.

Warning

Enabling this setting is unsafe, as your basic authentication credentials will be sent with every page your source requests, regardless of the domain.

Example
"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
}

HtmlXPathSelectorExpression (String)

The XPath expression used to select one or more nodes of an HTML format sitemap file containing the URLs to crawl. The parameter is empty by default, which results in the connector indexing all web pages listed in the sitemap file.

Example

You only want to index the web pages linked inside the cbc-sitemap div container of the CBC sitemap web page, so you add the following:

"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "//div[@id='cbc-sitemap']"
}
}

Note

The ParseSitemapInStrictMode JSON parameter must also be set to false since an HTML format sitemap file doesn’t follow the Sitemap protocol.
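
To get a feel for which links such an expression selects, you can try it on a small sample page. The sketch below uses Python's xml.etree.ElementTree as a stand-in for the connector's HTML parser, so the sample must be well-formed markup. ElementTree supports only a subset of XPath and requires a path relative to the root element, which is why the expression is written as .//div[@id='cbc-sitemap']. The sample page and the select_urls function are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="https://example.com/login">Login</a></div>
  <div id="cbc-sitemap">
    <a href="https://example.com/news">News</a>
    <a href="https://example.com/sports">Sports</a>
  </div>
</body></html>
"""


def select_urls(page, expression):
    """Collect the href of every link found under the selected nodes."""
    root = ET.fromstring(page.strip())
    urls = []
    for node in root.findall(expression):
        for link in node.iter("a"):
            href = link.get("href")
            if href:
                urls.append(href)
    return urls


print(select_urls(SAMPLE_PAGE, ".//div[@id='cbc-sitemap']"))
```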

IndexHtmlMetadata (Boolean)

Whether metadata tags found in HTML files should be indexed by the Sitemap crawler. When enabled, the content attribute of <meta> tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv. The default value is false since the parameter has an impact on indexing performance.

By default, the Coveo converter extracts metadata from <meta> HTML tags with a name attribute more efficiently. Therefore, consider enabling this option only when you want to extract from <meta> tags with a property, itemprop, or http-equiv attribute.

Example
"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
}

See Index HTML page metadata for more information.

NumberOfRetries (Integer)

The number of retries allowed when a failed web request is recoverable. Only the following HTTP error codes trigger a retry: 408, 500, 503, and 504. The default value is 3 retries.

Example
"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
}
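
The retry rule can be sketched as follows. This is an illustration of the behavior described above, not the connector's code: fetch_with_retries is a hypothetical function, and fetch stands for any callable that returns an HTTP status code.

```python
RETRYABLE_STATUS_CODES = {408, 500, 503, 504}


def fetch_with_retries(fetch, number_of_retries=3):
    """Call fetch until it succeeds, fails for good, or retries run out.

    Returns the last status code and the total number of attempts.
    """
    attempts = 0
    while True:
        status = fetch()
        attempts += 1
        # Stop on any non-recoverable status, or once the initial attempt
        # plus the allowed number of retries have all been used.
        if status not in RETRYABLE_STATUS_CODES or attempts > number_of_retries:
            return status, attempts


# Simulated server: two temporary failures, then a success.
responses = iter([503, 503, 200])
print(fetch_with_retries(lambda: next(responses), number_of_retries=5))
```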

ParsableContentTypes (String)

A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is "application/xhtml+xml", "application/xml", "text/html".

Example
"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
}

ParsableContentTypesSuffixes (String)

A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is +xml.

Example
"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
}
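
Both parameters take a JSON list serialized as a string, which is why the quotation marks are escaped in the examples. The sketch below shows one way to produce such a value and a plausible way the two lists could combine; the matching logic (exact match on the type, or match on the suffix, ignoring parameters such as charset) is an assumption, and is_parsable is a hypothetical function.

```python
import json

DEFAULT_TYPES = ["application/xhtml+xml", "application/xml", "text/html"]
DEFAULT_SUFFIXES = ["+xml"]


def parameter_value(items):
    """Serialize a list to the string stored in the value attribute."""
    return json.dumps(items)


def is_parsable(content_type, types, suffixes):
    """Assumed rule: exact content type match, or a matching suffix."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in types or any(media_type.endswith(s) for s in suffixes)


# Serializing the whole object escapes the inner quotation marks.
print(json.dumps({"sensitive": False, "value": parameter_value(["+xml", "+json"])}))
print(is_parsable("application/ld+json", DEFAULT_TYPES, ["+xml", "+json"]))
```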

ParseSitemapAlternateLinks (Boolean)

Whether the <xhtml:link> alternate language links that are children of a Sitemap <url> element should be crawled. The default value is false.

Note

When ParseSitemapAlternateLinks is set to true, if an <xhtml:link> element has its hreflang attribute set to x-default, the corresponding href URL will be crawled unless this URL has already been crawled as the <loc> element text value.

Example
"ParseSitemapAlternateLinks": {
  "sensitive": false,
  "value": "true"
}
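
The effect of this parameter on the list of crawled URLs can be sketched with a small sitemap. The sample file and the urls_to_crawl function below are hypothetical; the sketch only illustrates the rule that an alternate link is crawled unless its URL was already collected.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/"/>
  </url>
</urlset>
"""


def urls_to_crawl(sitemap_xml, parse_alternate_links=False):
    """List the URLs to crawl, skipping any URL that was already collected."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NAMESPACES):
        loc = url.findtext("sm:loc", default="", namespaces=NAMESPACES)
        candidates = [loc.strip()]
        if parse_alternate_links:
            for link in url.findall("xhtml:link", NAMESPACES):
                candidates.append(link.get("href", ""))
        for candidate in candidates:
            if candidate and candidate not in urls:
                urls.append(candidate)
    return urls


print(urls_to_crawl(SAMPLE_SITEMAP, parse_alternate_links=True))
```

In this sample, the x-default link points to the same URL as the <loc> element, so it isn't added a second time.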

ParseSitemapInStrictMode (Boolean)

Whether each Sitemap file should be parsed in strict mode. The default value is false.

When ParseSitemapInStrictMode is set to false, a URL only needs to be well-formatted (that is, an absolute HTTP or HTTPS URL) to be considered valid. Invalid URLs are skipped.

When ParseSitemapInStrictMode is set to true, the Sitemap source also performs the following validations on sitemap files and sitemap index files:

  • The uncompressed file must be no larger than 10 MB (even if the file is compressed with GZIP). If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.

  • The file can’t contain more than 50,000 URLs. If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.

  • A referenced URL must be relative to the sitemap that references it and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URL starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/. If a URL doesn’t meet this condition, it’s skipped.

  • A referenced URL in the sitemap file must be less than 2,048 characters long. If a URL doesn’t meet this condition, it’s skipped.

Example
"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
}
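
The validations listed above can be approximated with a few checks, which may help you predict whether a sitemap would pass in strict mode. This is a sketch under stated assumptions, not the connector's implementation: the function names are hypothetical, 10 MB is taken to mean 10 × 1,048,576 bytes, and the location rule is reduced to a prefix comparison with the sitemap directory.

```python
from urllib.parse import urlsplit

MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
MAX_URLS_PER_FILE = 50_000
MAX_URL_LENGTH = 2048


def is_well_formatted(url):
    """Non-strict rule: the URL is an absolute HTTP or HTTPS URL."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def file_passes_strict_mode(uncompressed_size, url_count):
    """File-level rules: a file that fails isn't processed at all."""
    return (uncompressed_size <= MAX_UNCOMPRESSED_BYTES
            and url_count <= MAX_URLS_PER_FILE)


def url_passes_strict_mode(url, sitemap_url):
    """URL-level rules: a URL that fails is skipped."""
    directory = sitemap_url.rsplit("/", 1)[0] + "/"
    return (is_well_formatted(url)
            and len(url) < MAX_URL_LENGTH
            and url.startswith(directory))


sitemap = "http://myorgwebsite.com/tech/sitemap.xml"
print(url_passes_strict_mode("http://myorgwebsite.com/tech/page.html", sitemap))
print(url_passes_strict_mode("http://myorgwebsite.com/catalog/item.html", sitemap))
```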

ReadTimeout (Integer)

The timeout duration, in seconds, when the connector reads content from a stream (that is, when downloading sitemap or web page content). When no input is received from the stream, the crawler waits for this duration before moving on. The default value is 300 seconds.

Example
"ReadTimeout": {
  "sensitive": false,
  "value": "500"
}

SkipOnSitemapError (Boolean)

Whether the crawler should skip a sitemap instead of stopping when encountering an exception. The default value is false.

Skipping may occur on an exception in one of the source Sitemap URLs, or in a sitemap referenced in one of the Sitemap URLs. When the crawler skips a sitemap, no error message is displayed in the Coveo Administration Console and Coveo doesn’t delete existing indexed items located under that sitemap directory.

Example
"SkipOnSitemapError": {
  "sensitive": false,
  "value": "true"
}
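
The two behaviors can be contrasted with a short sketch. It illustrates the description above rather than the crawler's actual code; process_sitemaps and load are hypothetical names, and load stands for any callable that raises an exception when a sitemap can't be processed.

```python
def process_sitemaps(sitemap_urls, load, skip_on_sitemap_error=False):
    """Process each sitemap; skip or stop when one raises an exception."""
    processed, skipped = [], []
    for url in sitemap_urls:
        try:
            load(url)
        except Exception:
            if not skip_on_sitemap_error:
                raise  # default behavior: stop at the first failing sitemap
            skipped.append(url)  # skipped silently; crawling continues
            continue
        processed.append(url)
    return processed, skipped


def load(url):
    # Simulated failure for one sitemap.
    if url.endswith("broken.xml"):
        raise ValueError("invalid sitemap")


urls = ["https://example.com/a.xml", "https://example.com/broken.xml",
        "https://example.com/c.xml"]
print(process_sitemaps(urls, load, skip_on_sitemap_error=True))
```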

Timeout (Integer)

The number of seconds to wait before a request times out (that is, the time a server can take to respond to a request). The default value is 100 seconds.

Example
"Timeout": {
  "sensitive": false,
  "value": "100"
}

UseCookies (Boolean)

Whether cookies must be enabled to crawl. The default value is false. Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.

Example
"UseCookies": {
  "sensitive": false,
  "value": "true"
}

UseProxy (Boolean)

Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.

Example
"UseProxy": {
  "sensitive": false,
  "value": "false"
}

UserAgent (String)

The value of the user agent header sent in the HTTP request. The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).

Example
"UserAgent": {
  "sensitive": false,
  "value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}