Sitemap source JSON modification
Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be configured in the source JSON configuration.
This article explains how to configure Sitemap source parameters, whether they’re already listed in the JSON or not.
Configuring listed and unlisted parameters
If the parameter you want to change is already listed in the parameters section of the source JSON configuration, just modify its value in the JSON configuration.
If the parameter isn’t listed in the parameters section, copy the entire parameter example object from the Reference section below and paste it into the parameters section of the source JSON configuration.
Then, modify the value in the JSON configuration, if necessary.
If a parameter has a |
Document the changes you make to the source JSON configuration in the Change notes area below the JSON configuration. This ensures that you can easily revert to a previous configuration if needed.
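The two cases described in this section (changing a listed parameter and adding an unlisted one) can be sketched in Python. This is only an illustration, assuming the source JSON configuration has been loaded into a dictionary with a top-level parameters object. The set_parameter helper and the sample configuration are hypothetical and aren’t part of any Coveo API.

```python
import json


def set_parameter(config, name, value, sensitive=False):
    """Set a parameter value, adding the whole parameter object if it isn't listed yet."""
    parameters = config.setdefault("parameters", {})
    if name in parameters:
        # Listed parameter: only its value changes.
        parameters[name]["value"] = value
    else:
        # Unlisted parameter: add the full object, as in the Reference examples.
        parameters[name] = {"sensitive": sensitive, "value": value}
    return config


# Hypothetical excerpt of a source JSON configuration.
config = {"parameters": {"Timeout": {"sensitive": False, "value": "100"}}}
set_parameter(config, "Timeout", "120")        # already listed
set_parameter(config, "NumberOfRetries", "5")  # not listed yet
print(json.dumps(config, indent=2))
```

Note that parameter values are written as strings, even for Boolean and Integer parameters, as in the examples in this article.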
Reference
This section provides information on the Sitemap source parameters that you can only modify through the JSON configuration.
If a JSON configuration parameter isn’t documented in this article, configure it through the user interface instead.
AdditionalHeaders
(String)
Semicolon-separated list of additional HTTP headers added to the connector requests in the following format: key1\\=value1\\;key2\\=value2.
The parameter is empty by default.
"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=<CSRF_TOKEN_VALUE>"
}
where you replace <CSRF_TOKEN_VALUE> with the actual token.
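As a rough illustration of the format, the following Python sketch builds the value from a dictionary of headers. It assumes that the doubled backslashes shown above are JSON string escapes, so that the decoded value uses \= and \; as separators. The function name and the header values are hypothetical.

```python
import json


def build_additional_headers(headers):
    """Join headers as key\\=value pairs separated by \\; (decoded form)."""
    return "\\;".join(f"{key}\\={value}" for key, value in headers.items())


value = build_additional_headers({"X-CSRF-Token": "abc123", "X-Env": "staging"})
print(value)              # decoded string, with single backslashes
print(json.dumps(value))  # the same string as written in the JSON configuration
```

Serializing with json.dumps doubles each backslash, which yields the form used in the value field.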
AllowAutoRedirect
(Boolean)
Whether a crawler request should automatically follow redirection responses from the web resource.
When set to true, the crawler only performs a single HTTP request (that is, for the current page).
It automatically follows server HTTP redirect responses to reach the final redirection page.
When set to false, the crawler performs an HTTP request for the current page, and then another for each server HTTP redirect response until it reaches the final page.
The default value is true.
"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
}
DateFormat
(String)
(For XML sitemaps only) The custom date format of the Sitemap file. Specify it when the last modification dates aren’t in a standard format (for example, YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console.
The format must use the MSDN format specifiers (see Custom date and time format strings).
"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
}
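To get a feel for the kind of date such a format matches, here is a small Python sketch. Python doesn’t use the MSDN format specifiers, so the strptime pattern below is only an approximate translation of yyyy;MM;ddTHH:mm:sszzz, and the sample date is made up.

```python
from datetime import datetime

# Approximate strptime counterpart of the format "yyyy;MM;ddTHH:mm:sszzz".
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"

# A hypothetical last modification date that uses semicolons in the date part.
last_modified = datetime.strptime("2023;04;15T13:45:30-04:00", PYTHON_EQUIVALENT)
print(last_modified.isoformat())
```

A date written this way isn’t in the standard format, which is the situation the DateFormat parameter is meant to handle.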
EnableJavaScriptRenderingOptimizations
(Boolean)
Whether to enable JavaScript rendering optimizations.
When set to true, the crawler doesn’t download images and external files.
The default value is true.
Example: setting EnableJavaScriptRenderingOptimizations to false
On a page, you have a dynamically generated table.
The data in the table comes from a JSON file, downloaded from a server, using JavaScript.
To index the table data, you would need to set EnableJavaScriptRenderingOptimizations to false.
"EnableJavaScriptRenderingOptimizations": {
  "sensitive": false,
  "value": "false"
}
ForceBasicAuthorizationHeader
(Boolean)
Whether to enforce basic header authentication.
The default value is false.
Note
To use |
Enabling this setting is unsafe, as your basic authentication credentials will be sent with every page your source requests, regardless of the domain.
"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
}
HtmlXPathSelectorExpression
(String)
The XPath expression used to select one or more nodes of an HTML format sitemap file containing the URLs to crawl. The parameter is empty by default, which results in the connector indexing all web pages listed in the sitemap file.
You only want to index a specific part of the CBC sitemap web page (only the web pages linked inside the cbc-sitemap div container), so you add the following:
"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "/div[@id='cbc-sitemap']"
}
Note
The |
IndexHtmlMetadata
(Boolean)
Whether metadata tags found in HTML files should be indexed by the Sitemap crawler.
When enabled, the content attribute of <meta> tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv.
The default value is false since the parameter has an impact on indexing performance.
By default, the Coveo converter extracts metadata from <meta> HTML tags with a name attribute more efficiently.
Therefore, consider enabling this option only when you want to extract from <meta> tags with a property, itemprop, or http-equiv attribute.
"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
}
See Index HTML page metadata for more information.
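The following Python sketch illustrates which <meta> tags the description above refers to. It’s a simplified model built on the standard html.parser module, not the crawler’s implementation. The class name, the sample HTML, and the choice to use the key attribute value as the metadata name are all assumptions.

```python
from html.parser import HTMLParser

KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags keyed by one of KEY_ATTRIBUTES."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return  # for example, <meta charset="utf-8"> has no content attribute
        for key_attribute in KEY_ATTRIBUTES:
            if attributes.get(key_attribute):
                self.metadata[attributes[key_attribute]] = content
                break


collector = MetaCollector()
collector.feed(
    '<head><meta name="description" content="Sample page">'
    '<meta property="og:title" content="Sample title">'
    '<meta charset="utf-8"></head>'
)
print(collector.metadata)
```

In this sample, only the og:title tag would require IndexHtmlMetadata, since the tag with a name attribute is already handled by default.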
NumberOfRetries
(Integer)
The number of retries allowed when a failed web request is recoverable.
Only requests that fail with one of the following HTTP errors are retried: 408, 500, 503, and 504.
The default value is 3 retries.
"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
}
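The retry rule can be modeled with a short Python sketch. It only illustrates the counting logic under the assumption that each retry is a new attempt of the same request. Any delay between attempts isn’t modeled, and the function name and simulated responses are hypothetical.

```python
RETRYABLE_STATUS_CODES = {408, 500, 503, 504}


def fetch_with_retries(send_request, number_of_retries=3):
    """Call send_request() and retry while it returns a recoverable status code."""
    status = send_request()
    retries = 0
    while status in RETRYABLE_STATUS_CODES and retries < number_of_retries:
        retries += 1
        status = send_request()
    return status, retries


# Simulated server: two recoverable errors, then a success.
responses = iter([503, 504, 200])
print(fetch_with_retries(lambda: next(responses), number_of_retries=5))
```

A status code outside the list, such as 404, is returned immediately without any retry.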
ParsableContentTypes
(String)
A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks.
The default value is "application/xhtml+xml", "application/xml", "text/html".
"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
}
ParsableContentTypesSuffixes
(String)
A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks.
The default value is +xml.
"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
}
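The sketch below shows one way ParsableContentTypes and ParsableContentTypesSuffixes could work together. It assumes that a response is parsed when its media type is in the list or ends with one of the suffixes. The connector’s exact matching rules aren’t described in this article, and the function name is hypothetical.

```python
import json


def is_parsable(content_type, parsable_types, parsable_suffixes):
    """Return True if the media type is listed or ends with a listed suffix."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in parsable_types or any(
        media_type.endswith(suffix) for suffix in parsable_suffixes
    )


types = ["application/xhtml+xml", "application/xml", "text/html"]
suffixes = ["+xml", "+json"]

# The value field holds the list serialized as a JSON string.
print(json.dumps(suffixes))

print(is_parsable("text/html; charset=utf-8", types, suffixes))
print(is_parsable("application/atom+xml", types, suffixes))
print(is_parsable("image/png", types, suffixes))
```

Because the value field is itself a JSON string, the quotes of the serialized list must be escaped when you paste it into the configuration, as in the examples above.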
ParseSitemapAlternateLinks
(Boolean)
Whether the alternate language links in the <xhtml:link> child elements of Sitemap <url> elements should be crawled.
The default value is false.
Note
When |
"ParseSitemapAlternateLinks": {
  "sensitive": false,
  "value": "true"
}
ParseSitemapInStrictMode
(Boolean)
Whether each Sitemap file should be parsed in strict mode.
The default value is false.
When ParseSitemapInStrictMode is set to false, a URL must only be well-formatted (that is, an absolute HTTP or HTTPS URL) to be considered valid.
Invalid URLs are skipped.
When ParseSitemapInStrictMode is set to true, the Sitemap source also performs the following validations on sitemap files and sitemap index files:
- The uncompressed file must be no larger than 10 MB (even if the file is compressed with GZIP). If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.
- The file can’t contain more than 50,000 URLs. If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.
- A referenced URL must be relative to the sitemap that references it and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URL starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/. If a URL doesn’t meet this condition, it’s skipped.
- A referenced URL in the sitemap file must be less than 2,048 characters long. If a URL doesn’t meet this condition, it’s skipped.
"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
}
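The validations above can be summarized in a Python sketch. It’s a simplified model rather than the connector’s code: it assumes that 10 MB means 10 × 1024 × 1024 bytes and that the allowed URL prefix is the directory of the sitemap file. The function names are hypothetical.

```python
from urllib.parse import urlsplit

MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # assumption: 10 MB counted as 10 MiB
MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048


def file_is_processable(uncompressed_size, url_count):
    """File-level checks: a failing file isn't processed at all."""
    return uncompressed_size <= MAX_UNCOMPRESSED_BYTES and url_count <= MAX_URLS


def url_is_valid(url, sitemap_url, strict):
    """URL-level checks: a failing URL is skipped."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False  # not an absolute HTTP or HTTPS URL
    if not strict:
        return True
    # The sitemap location determines which URLs it may list.
    sitemap_directory = sitemap_url.rsplit("/", 1)[0] + "/"
    return url.startswith(sitemap_directory) and len(url) < MAX_URL_LENGTH


sitemap = "http://myorgwebsite.com/tech/sitemap.xml"
print(url_is_valid("http://myorgwebsite.com/tech/page.html", sitemap, strict=True))
print(url_is_valid("http://myorgwebsite.com/catalog/item.html", sitemap, strict=True))
print(url_is_valid("http://myorgwebsite.com/catalog/item.html", sitemap, strict=False))
```

The second and third calls show the difference between the two modes: the same URL is skipped in strict mode but accepted otherwise, since it’s a well-formatted absolute URL.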
ReadTimeout
(Integer)
The timeout duration in seconds when the connector reads web page content from a stream (that is, when downloading Sitemap or web page content).
The default value is 300 seconds.
When not receiving input from a stream, the crawler will wait for this duration before moving on.
"ReadTimeout": {
  "sensitive": false,
  "value": "500"
}
SkipOnSitemapError
(Boolean)
Whether the crawler should skip a sitemap instead of stopping when encountering an exception.
The default value is false.
Skipping may occur on an exception in one of the source Sitemap URLs, or in a sitemap referenced in one of the Sitemap URLs. When the crawler skips a sitemap, no error message is displayed in the Coveo Administration Console and Coveo doesn’t delete existing indexed items located under that sitemap directory.
"SkipOnSitemapError": {
  "sensitive": false,
  "value": "true"
}
Timeout
(Integer)
The number of seconds to wait before a request times out (that is, the time a server can take to respond to a request).
The default value is 100 seconds.
"Timeout": {
  "sensitive": false,
  "value": "100"
}
UseCookies
(Boolean)
Whether cookies must be enabled to crawl.
The default value is false.
Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.
"UseCookies": {
  "sensitive": false,
  "value": "true"
}
UseProxy
(Boolean)
Whether the Crawling Module should use a proxy to access the content to be crawled.
The default value is false.
"UseProxy": {
  "sensitive": false,
  "value": "false"
}
UserAgent
(String)
The value of the user agent header sent in the HTTP request.
The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).
"UserAgent": {
  "sensitive": false,
  "value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}