Sitemap source JSON modification
Many source configuration parameters can be set through the Coveo Administration Console user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration.
This article presents Sitemap source hidden parameters that you can change by modifying the source JSON configuration from the Coveo Administration Console. All parameters must be added to the parameters section of the source JSON configuration.
For instructions on accessing the JSON file and best practices to follow, see Edit a source JSON configuration.
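Every example in this article has the same shape: a parameter name mapped to an object with a sensitive flag and a value written as a string. As a rough illustration only, the following Python sketch builds such entries. The make_parameter helper is hypothetical and not part of any Coveo tooling, and the rule that every value is serialized as a string is an assumption inferred from the examples in this article.

```python
import json

def make_parameter(value, sensitive=False):
    """Build one entry for the "parameters" section (hypothetical helper)."""
    if isinstance(value, bool):
        # Booleans appear as lowercase strings in the examples ("true"/"false").
        text = "true" if value else "false"
    else:
        # Integers and strings are stored as text.
        text = str(value)
    return {"sensitive": sensitive, "value": text}

parameters = {
    "NumberOfRetries": make_parameter(5),
    "UseCookies": make_parameter(True),
}
print(json.dumps(parameters, indent=2))
```

The boolean check comes first because Python treats bool as a subtype of int, and str(True) would otherwise produce "True" rather than "true".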
AdditionalHeaders
(String)
Semicolon-separated list of additional HTTP headers added to the connector requests, in the following format: key1\\=value1\\;key2\\=value2.
The parameter is empty by default.
"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=<CSRF_TOKEN_VALUE>"
}
where you replace <CSRF_TOKEN_VALUE> with the actual token.
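When several headers are needed, the escaping is easy to get wrong by hand. The following Python sketch joins header pairs into the documented format. It assumes that the doubled backslashes shown above are JSON escaping of a single backslash, which is an interpretation of the example rather than a documented rule. The build_additional_headers helper and the X-Env header are hypothetical.

```python
import json

def build_additional_headers(headers):
    """Join header pairs using backslash-escaped '=' and ';' separators."""
    return "\\;".join(f"{name}\\={value}" for name, value in headers.items())

# Raw string value, with a single backslash before each '=' and ';'.
raw = build_additional_headers({
    "X-CSRF-Token": "<CSRF_TOKEN_VALUE>",
    "X-Env": "staging",  # hypothetical second header
})

# json.dumps doubles each backslash, matching the form shown in the JSON configuration.
entry = {"AdditionalHeaders": {"sensitive": True, "value": raw}}
print(json.dumps(entry, indent=2))
```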
AllowAutoRedirect
(Boolean)
Whether a crawler request should automatically follow redirection responses from the web resource.
When set to true, the crawler only performs a single HTTP request (i.e., for the current page).
It automatically follows server HTTP redirect responses to reach the final redirection page.
When set to false, the crawler performs an HTTP request for the current page, and then another for each server HTTP redirect response until it reaches the final page.
The default value is true.
"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
}
DateFormat
(String)
(For XML sitemaps only) When the last modification dates aren’t in a standard format (e.g., YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console, specify the custom date format used in the Sitemap file.
The format must use the MSDN format specifiers (see Custom date and time format strings).
"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
}
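To check which date strings a custom pattern describes, it can help to parse a sample value. The following Python sketch parses a made-up last modification date that follows the pattern above. The strptime pattern is an assumed Python counterpart of the .NET specifiers and only approximates how the connector itself parses dates.

```python
from datetime import datetime

# Assumed Python counterpart of the .NET pattern "yyyy;MM;ddTHH:mm:sszzz".
PYTHON_PATTERN = "%Y;%m;%dT%H:%M:%S%z"

# Made-up sample value, not taken from a real sitemap.
lastmod = "2024;03;15T08:30:00-05:00"

parsed = datetime.strptime(lastmod, PYTHON_PATTERN)
print(parsed.isoformat())
```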
IndexHtmlMetadata
(Boolean)
Whether metadata tags found in HTML files should be indexed by the Sitemap crawler.
When enabled, the content attribute of meta tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv.
The default value is false since the parameter has an impact on indexing performance.
By default, the Coveo converter extracts meta HTML elements with a name attribute more efficiently.
Therefore, consider enabling this option only when you want to extract meta HTML elements with a property, itemprop, or http-equiv attribute.
"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
}
See Index XML sitemap metadata for more information.
IndexJsonLdMetadata
(Boolean)
Whether to index metadata from JSON-LD <script> tags.
The default value is false.
When enabled, JSON-LD objects on the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.
Given the following JSON-LD script tag in a web page:
<script id="jsonld" type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "url": "http://www.bbc.com/news/world-us-canada-39324587",
  "publisher": {
    "@type": "Organization",
    "name": "BBC News",
    "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
  },
  "headline": "Canada Strikes Gold in Olympic Hockey Final"
}
</script>
To index the publisher name value (i.e., BBC News in this page) in an index field:
- Set your configuration as follows:
  "IndexJsonLdMetadata": { "sensitive": false, "value": "true" }
- Use the following mapping rule for your field:
  %[jsonld.publisher.name]
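To see where a metadata name such as jsonld.publisher.name comes from, the following Python sketch flattens a shortened copy of the script tag content into dotted names. The flatten_jsonld function is a hypothetical illustration of the naming scheme, not the connector's implementation. It keeps every key and doesn't handle arrays, so it doesn't reflect any special handling the connector applies.

```python
import json

def flatten_jsonld(node, prefix="jsonld"):
    """Flatten nested JSON-LD objects into dotted metadata names."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Nested objects contribute their own keys under the parent name.
            flat.update(flatten_jsonld(value, name))
        else:
            flat[name] = value
    return flat

script_body = """
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "publisher": {"@type": "Organization", "name": "BBC News"},
  "headline": "Canada Strikes Gold in Olympic Hockey Final"
}
"""

metadata = flatten_jsonld(json.loads(script_body))
print(metadata["jsonld.publisher.name"])
```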
|
Note
Contextual JSON-LD key-value pairs whose keys begin with |
ForceBasicAuthorizationHeader
(Boolean)
Whether to enforce basic header authentication.
The default value is false.
Set it to true when, for example, your server doesn’t challenge the caller for authentication, or when you get an HTTP 404 error (which often occurs on non-IIS servers) in the Administration Console that looks like the following:
Exception during item expansion: https://myorgwebsite.com/basicauth/user/password.
-> The remote server returned an error: (404) Not Found.
"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
}
HtmlXPathSelectorExpression
(String)
The XPath expression used to select one or more nodes of an HTML format sitemap file containing the URLs to crawl. The parameter is empty by default, which results in the connector indexing all web pages listed in the sitemap file.
For example, to index only a specific part of the CBC sitemap web page (only the web pages linked inside the cbc-sitemap div container), add the following:
"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "/div[@id='cbc-sitemap']"
}
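To preview which links a selector of this kind would keep, you can try it against a small page. The following Python sketch uses the standard library's ElementTree on a made-up, well-formed page. ElementTree supports only a subset of XPath, so the expression is adapted to a relative form. The page content and URLs are invented for illustration, and the connector's own XPath engine may behave differently.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page with links inside and outside the target container.
page = """
<html><body>
  <div id="header"><a href="https://example.com/about">About</a></div>
  <div id="cbc-sitemap">
    <a href="https://example.com/news">News</a>
    <a href="https://example.com/sports">Sports</a>
  </div>
</body></html>
"""

root = ET.fromstring(page)
# Relative form of the selector, since ElementTree rejects absolute paths on elements.
container = root.find(".//div[@id='cbc-sitemap']")
urls = [a.get("href") for a in container.iter("a")]
print(urls)
```

Only the two links inside the cbc-sitemap container are collected. The link in the header div is left out.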
NumberOfRetries
(Integer)
The number of retries allowed when a failed web request is recoverable.
Only the following HTTP errors will be retried: 408, 500, 503, and 504.
The default value is 3 retries.
"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
}
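The retry rule can be pictured as a loop that repeats a request only for the recoverable status codes listed above. The following Python sketch simulates that loop. It assumes that retries are counted in addition to the initial attempt, and it doesn't model any delay between attempts. Both points are assumptions, and fetch_with_retries is a hypothetical name.

```python
RETRIABLE_STATUSES = {408, 500, 503, 504}

def fetch_with_retries(send_request, number_of_retries=3):
    """Call send_request until it returns a non-retriable status or retries run out."""
    attempts = 0
    while True:
        status = send_request()
        attempts += 1
        retries_used = attempts - 1  # the first attempt isn't a retry
        if status not in RETRIABLE_STATUSES or retries_used >= number_of_retries:
            return status, attempts

# Simulated server: two recoverable failures, then success.
responses = iter([503, 504, 200])
result = fetch_with_retries(lambda: next(responses), number_of_retries=5)
print(result)
```

With this simulated server, the request succeeds on the third attempt, so the sketch returns (200, 3). A 404 response would be returned immediately, since it isn't in the retriable set.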
ParsableContentTypes
(String)
A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks.
The default value is "application/xhtml+xml", "application/xml", "text/html".
"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
}
ParsableContentTypesSuffixes
(String)
A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks.
The default value is +xml.
"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
}
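ParsableContentTypes and ParsableContentTypesSuffixes together decide whether a response body is parsed. The following Python sketch shows one plausible way to combine them. The matching rules (case-insensitive comparison, ignoring the charset portion, suffix check with endswith) are assumptions made for illustration, and is_parsable is a hypothetical function.

```python
import json

def is_parsable(content_type, types_value, suffixes_value):
    """Return True when a Content-Type matches a listed type or suffix."""
    # Both parameter values are JSON arrays stored as strings.
    types = json.loads(types_value)
    suffixes = json.loads(suffixes_value)
    # Drop optional parts such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in types or any(media_type.endswith(s) for s in suffixes)

TYPES = '["application/xhtml+xml", "application/xml", "text/html"]'
SUFFIXES = '["+xml", "+json"]'

print(is_parsable("text/html; charset=utf-8", TYPES, SUFFIXES))
print(is_parsable("application/ld+json", TYPES, SUFFIXES))
print(is_parsable("image/png", TYPES, SUFFIXES))
```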
ParseSitemapAlternateLinks
(Boolean)
Whether the alternate language links in the <xhtml:link> child elements of the Sitemap <url> element should be crawled.
The default value is false.
"ParseSitemapAlternateLinks": {
  "sensitive": false,
  "value": "true"
}
ParseSitemapInStrictMode
(Boolean)
Whether each Sitemap file should be parsed in strict mode.
The default value is false.
When ParseSitemapInStrictMode is set to false, a URL must only be well-formatted (i.e., an absolute HTTP or HTTPS URL) to be considered valid.
Non-valid URLs are skipped.
When ParseSitemapInStrictMode is set to true, the Sitemap source also performs the following validations on sitemap files and sitemap index files:
- The uncompressed file must be no larger than 10 MB (even if the file is compressed with GZIP). If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.
- The file can’t contain more than 50,000 URLs. If this condition isn’t met, a descriptive error is displayed and the sitemap or sitemap index file isn’t processed.
- A referenced URL must be relative to the sitemap that references it and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URL starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/. If a URL doesn’t meet this condition, it’s skipped.
- A referenced URL in the sitemap file must be less than 2,048 characters long. If a URL doesn’t meet this condition, it’s skipped.
"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
}
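The strict mode checks on URLs can be sketched as a small filter. The following Python example covers the URL count limit, the location rule, and the length limit described above. It leaves out the 10 MB file size check, and the function names are hypothetical. Treat it as an approximation of the rules, not as the connector's validation code.

```python
from urllib.parse import urlparse

MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048

def allowed_prefix(sitemap_url):
    """Directory of the sitemap file, e.g. http://host/tech/ for http://host/tech/sitemap.xml."""
    return sitemap_url.rsplit("/", 1)[0] + "/"

def is_valid_in_strict_mode(url, sitemap_url):
    """Per-URL checks: absolute HTTP(S), short enough, under the sitemap directory."""
    if urlparse(url).scheme not in ("http", "https"):
        return False
    if len(url) >= MAX_URL_LENGTH:  # must be less than 2,048 characters
        return False
    return url.startswith(allowed_prefix(sitemap_url))

def filter_urls(urls, sitemap_url):
    """Reject a file with too many URLs, then skip individual invalid URLs."""
    if len(urls) > MAX_URLS:
        raise ValueError("sitemap lists more than 50,000 URLs")
    return [u for u in urls if is_valid_in_strict_mode(u, sitemap_url)]

sitemap = "http://myorgwebsite.com/tech/sitemap.xml"
kept = filter_urls(
    [
        "http://myorgwebsite.com/tech/ai.html",
        "http://myorgwebsite.com/catalog/item.html",
    ],
    sitemap,
)
print(kept)
```

In this run, the /catalog/ URL is skipped because it falls outside the directory of the sitemap file, and only the /tech/ URL is kept.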
ReadTimeout
(Integer)
The timeout duration in seconds when the connector reads web page content from a stream (i.e., downloading a Sitemap/web page content).
The default value is 300 seconds.
When not receiving input from a stream, the crawler will wait for this duration before moving on.
"ReadTimeout": {
  "sensitive": false,
  "value": "500"
}
SkipOnSitemapError
(Boolean)
Whether the crawler should skip a sitemap instead of stopping when encountering an exception.
The default value is false.
Skipping may occur on an exception in one of the source URLs, or in a sitemap referenced in a source URL. When the crawler skips a sitemap, no error message is displayed in the Coveo Administration Console and Coveo doesn’t delete existing indexed items located under that sitemap directory.
"SkipOnSitemapError": {
  "sensitive": false,
  "value": "true"
}
UseCookies
(Boolean)
Whether cookies must be enabled to crawl.
The default value is false.
Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.
"UseCookies": {
  "sensitive": false,
  "value": "true"
}
UseProxy
(Boolean)
This parameter is specifically for Crawling Module Sitemap sources. Don’t use this parameter in a cloud Sitemap source.
Whether the Crawling Module should use a proxy to access the content to be crawled.
The default value is false.
"UseProxy": {
  "sensitive": false,
  "value": "false"
}