Sitemap Source JSON Modification

Many source configuration parameters can be set through the Coveo Administration Console user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration.

This article presents Sitemap source hidden parameters that you can change by modifying the source JSON configuration from the Coveo Administration Console. All parameters must be added to the parameters section of the source JSON configuration.

Set the sensitive attribute of a parameter to true if its value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

For instructions on accessing the JSON file and best practices to follow, see Edit a Source JSON Configuration.

AdditionalHeaders (String)

A semicolon-separated list of additional HTTP headers to add to the connector requests, in the following format: key1\\=value1\\;key2\\=value2. The parameter is empty by default.

You can’t use manual cookies if you crawl content dynamically rendered by JavaScript (if the JavaScript-rendered check box is selected in the Sitemap source configuration).

The following special values are allowed:

  • %[Username]

  • %[Password]

They're replaced by the username and password provided in the source configuration, respectively.

EXAMPLE
"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=<CSRF_TOKEN_VALUE>"
},

where you replace <CSRF_TOKEN_VALUE> with the actual token.
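
As a rough illustration of the format described above, the following Python sketch builds an AdditionalHeaders value from a dictionary and prints it as it would look in the source JSON. The helper name build_additional_headers and the sample header names and values are hypothetical. The sketch assumes that, once the JSON string is decoded, each separator is a single backslash followed by = or ;.

```python
import json


def build_additional_headers(headers: dict) -> str:
    """Join headers as key\\=value pairs separated by \\; (hypothetical helper)."""
    # In the decoded string, each separator is one backslash followed by
    # '=' or ';'. JSON serialization then doubles every backslash.
    return "\\;".join(f"{key}\\={value}" for key, value in headers.items())


value = build_additional_headers({"X-CSRF-Token": "abc123", "X-Env": "qa"})
parameter = {"AdditionalHeaders": {"sensitive": True, "value": value}}

# The serialized form shows the doubled backslashes used in the source JSON.
print(json.dumps(parameter, indent=2))
```

Generating the value this way avoids hand-typing the escape sequences when several headers are needed.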

AllowAutoRedirect (Boolean)

Whether the request should automatically follow redirection responses from the web resource. The default value is true.

EXAMPLE

You don’t want the crawler to follow redirections on indexed pages, so you add the following:

"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
},

DateFormat (String)

(For XML sitemaps only) The custom date format used in the Sitemap file. Specify it when the last modification dates aren’t in a standard format (e.g., YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console. The format must use the MSDN format specifiers (see Custom date and time format strings).

EXAMPLE
"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
},
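
To get a feel for the kind of date the example format above matches, here is a small Python sketch. Python's strptime directives aren't the format specifiers the parameter expects. The mapping used below (yyyy to %Y, MM to %m, dd to %d, HH to %H, mm to %M, ss to %S, zzz to %z) is an approximate equivalent for illustration only, and the sample date is made up.

```python
from datetime import datetime

# Approximate Python equivalent of the custom format "yyyy;MM;ddTHH:mm:sszzz".
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"

# A hypothetical <lastmod> value written in that non-standard format.
sample_lastmod = "2023;04;15T10:30:00+02:00"

parsed = datetime.strptime(sample_lastmod, PYTHON_EQUIVALENT)
print(parsed.isoformat())
```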

IndexHtmlMetadata (Boolean)

Whether metadata tags found in HTML files should be indexed. When enabled, the content attribute of meta tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv. The default value is false since the parameter has an impact on indexing performance.

By default, the Coveo Platform converter more efficiently extracts meta HTML elements with a name attribute. Therefore, consider enabling this option only when you want to extract meta HTML elements with a property, itemprop, or http-equiv attribute.

EXAMPLE
"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
},

See Index XML Sitemap Metadata for more information.

IndexJsonLdMetadata (Boolean)

Whether to index metadata from JSON-LD <script> tags. The default value is false.

When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.

EXAMPLE

Given the following JSON-LD script tag in a web page:

<script id="jsonld" type="application/ld+json">
   {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "url": "http://www.bbc.com/news/world-us-canada-39324587",
      "publisher": {
          "@type": "Organization",
          "name": "BBC News",
          "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
      },
      "headline": "Canada Strikes Gold in Olympic Hockey Final"
   }
</script>

To index the publisher name value (i.e., BBC News in this page) in an index field:

  1. Set your configuration as follows:

    "IndexJsonLdMetadata": {
      "sensitive": false,
      "value": "true"
    },
  2. Use the following mapping rule for your field: %[jsonld.publisher.name]

Contextual JSON-LD key-value pairs whose keys begin with @ aren’t converted into metadata.
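
The flattening described above can be sketched in Python as follows. This is only an illustration of the jsonld.parent.child naming and of the rule that skips keys beginning with @. The function name flatten_json_ld is hypothetical, and how values such as arrays are handled isn't covered here, so the sketch leaves non-object values untouched.

```python
import json

JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "url": "http://www.bbc.com/news/world-us-canada-39324587",
  "publisher": {
    "@type": "Organization",
    "name": "BBC News",
    "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
  },
  "headline": "Canada Strikes Gold in Olympic Hockey Final"
}
"""


def flatten_json_ld(node: dict, prefix: str = "jsonld") -> dict:
    """Flatten nested objects into dotted keys, skipping keys that start with '@'."""
    flat = {}
    for key, value in node.items():
        if key.startswith("@"):
            continue  # contextual keys aren't converted into metadata
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_json_ld(value, name))
        else:
            flat[name] = value
    return flat


metadata = flatten_json_ld(json.loads(JSON_LD))
for name, value in metadata.items():
    print(f"{name} = {value}")
```

In this sketch, the jsonld.publisher.name key holds the value targeted by the %[jsonld.publisher.name] mapping rule.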

ForceBasicAuthorizationHeader (Boolean)

Whether to enforce basic header authentication. The default value is false. Set it to true when, for example, your server doesn’t challenge the caller for authentication, or when you get an HTTP 404 error in the Administration Console (this often occurs on non-IIS servers) that looks like the following:

Exception during item expansion: https://myorgwebsite.com/basicauth/user/password. -> The remote server returned an error: (404) Not Found.

EXAMPLE
"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
},
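
For context, the Python sketch below shows what an HTTP Basic Authorization header looks like when it's built up front rather than in response to a server challenge. It assumes that enforcing basic header authentication means sending this header with the request. The function name and the credentials are hypothetical.

```python
import base64


def basic_authorization_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Sent with the first request, without waiting for a 401 challenge.
headers = {"Authorization": basic_authorization_header("user", "password")}
print(headers)
```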

HtmlXPathSelectorExpression (String)

The XPath expression used to select one or more nodes of a web page containing the URLs to crawl. By default, the connector indexes all web pages listed in HTML Sitemaps.

EXAMPLE

You only want to index a specific part of the CBC Sitemap web page (only the web pages linked inside the cbc-sitemap div container), so you add the following:

"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "//div[@id='cbc-sitemap']"
},

The ParseSitemapInStrictMode hidden parameter should also be set to false since an HTML web page doesn’t follow the Sitemap protocol.
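
The following Python sketch illustrates the idea of restricting link discovery to one container. It uses the standard library's ElementTree, which only supports a subset of XPath (hence the leading dot in the expression), on a made-up, well-formed page. It isn't how the connector evaluates the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML sitemap page with links inside and outside the container.
PAGE = """
<html>
  <body>
    <div id="header"><a href="https://example.com/login">Log in</a></div>
    <div id="cbc-sitemap">
      <a href="https://example.com/news">News</a>
      <a href="https://example.com/sports">Sports</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())
# Select the container node, then collect only the links found inside it.
container = root.find(".//div[@id='cbc-sitemap']")
urls = [anchor.get("href") for anchor in container.iter("a")]
print(urls)
```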

NumberOfRetries (Integer)

The number of retries allowed when a failed web request is recoverable. Only the following HTTP errors will be retried: 408, 500, 503, and 504. The default value is 3 retries.

EXAMPLE
"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
},
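
The retry rule above can be sketched in Python as follows. The fetch argument is a hypothetical callable that returns an HTTP status code. Any delay between attempts isn't described here, so the sketch retries immediately.

```python
RETRYABLE_STATUS_CODES = {408, 500, 503, 504}


def fetch_with_retries(fetch, url: str, number_of_retries: int = 3) -> int:
    """Call fetch(url) and retry only on recoverable status codes.

    fetch is a hypothetical callable that returns an HTTP status code.
    """
    status = fetch(url)
    retries = 0
    while status in RETRYABLE_STATUS_CODES and retries < number_of_retries:
        retries += 1
        status = fetch(url)
    return status


# Fake server: fails twice with 503, then succeeds.
responses = iter([503, 503, 200])
print(fetch_with_retries(lambda url: next(responses), "https://example.com/sitemap.xml"))
```

With "value": "5", a request that keeps failing with a retryable error would be attempted six times in total in this sketch: the initial request plus five retries.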

ParsableContentTypes (String)

A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is "application/xhtml+xml", "application/xml", "text/html".

EXAMPLE
"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
},

ParsableContentTypesSuffixes (String)

A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is +xml.

EXAMPLE
"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
},
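
Both ParsableContentTypes and ParsableContentTypesSuffixes take a JSON array stored inside a JSON string. The Python sketch below decodes the two example values and shows one plausible way they could be combined to decide whether a response is parsed. The matching logic (exact media type match, or media type ending with a listed suffix) is an assumption made for illustration, and the function name is_parsable is hypothetical.

```python
import json

# The parameter values are JSON arrays stored inside JSON strings.
content_types = json.loads('["application/xhtml+xml", "application/xml", "text/html"]')
suffixes = json.loads('["+xml", "+json"]')


def is_parsable(content_type_header: str) -> bool:
    """Return True when the media type is listed or ends with a listed suffix."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in content_types or media_type.endswith(tuple(suffixes))


for header in ("text/html; charset=utf-8", "application/ld+json", "image/png"):
    print(header, "->", is_parsable(header))
```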

ParseSitemapAlternateLinks (Boolean)

Whether the Sitemap <url> element child <xhtml:link> alternate language links should be crawled. The default value is false.

When ParseSitemapAlternateLinks is set to true, if an <xhtml:link> element has its hreflang attribute set to x-default, the corresponding href URL will be crawled unless this URL has already been crawled as the <loc> element text value.

EXAMPLE
"ParseSitemapAlternateLinks": {
  "sensitive": false,
  "value": "true"
},
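
As an illustration of the behavior described above, the Python sketch below lists the URLs that would be picked up from a small, made-up Sitemap with and without alternate links. It treats every <xhtml:link> href as a candidate and skips any URL already listed, which covers the x-default case. The function name urls_to_crawl is hypothetical, and the connector's actual logic may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical Sitemap entry with two alternate language links.
SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/"/>
  </url>
</urlset>
"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_to_crawl(sitemap_xml: str, parse_alternate_links: bool) -> list:
    """List <loc> URLs, plus alternate hrefs when enabled, without duplicates."""
    urls = []
    for url in ET.fromstring(sitemap_xml.strip()).findall("sm:url", NS):
        candidates = [url.findtext("sm:loc", namespaces=NS)]
        if parse_alternate_links:
            candidates += [link.get("href") for link in url.findall("xhtml:link", NS)]
        for candidate in candidates:
            # The x-default href equals <loc> here, so it isn't added twice.
            if candidate and candidate not in urls:
                urls.append(candidate)
    return urls


print(urls_to_crawl(SITEMAP, parse_alternate_links=False))
print(urls_to_crawl(SITEMAP, parse_alternate_links=True))
```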

ParseSitemapInStrictMode (Boolean)

Whether each Sitemap file should be parsed in strict mode. In strict mode, the parsing throws an exception when the Sitemap file doesn’t follow the protocol specification (see Sitemap protocol). The default value is false. Set it to true when you want to index your Sitemap files with standard protocol validations.

EXAMPLE
"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
},

ReadTimeout (Integer)

The timeout duration in seconds when the connector reads web page content from a stream (i.e., downloading a Sitemap/web page content). The default value is 300 seconds. When not receiving input from a stream, the crawler will wait for this duration before moving on.

EXAMPLE
"ReadTimeout": {
  "sensitive": false,
  "value": "500"
},

SkipOnSitemapError (Boolean)

Whether the crawler should skip a sitemap instead of stopping when encountering an exception. The default value is false.

Skipping may occur on an exception in one of the source URLs, or in a sitemap referenced in a source URL. When the crawler skips a sitemap, no error message is displayed in the Coveo Platform and Coveo doesn’t delete existing indexed items located under that sitemap directory.

EXAMPLE
"SkipOnSitemapError": {
  "sensitive": false,
  "value": "true"
},

UrlReplacementPattern (String)

The regular expression (regex) that matches the part of the URL to replace with the value of the UrlReplacementValue hidden parameter. The parameter is empty by default.

The UrlReplacementValue hidden parameter must also be set if you use this parameter.

EXAMPLE
"UrlReplacementPattern": {
  "sensitive": false,
  "value": "https:\/\/help-internal(\\.qa)?\\.corp\\.mycompany\\.com"
},

UrlReplacementValue (String)

The value that replaces the part of the URL matched by the pattern specified with the UrlReplacementPattern hidden parameter. The parameter is empty by default. If the UrlReplacementPattern parameter isn’t specified, the authority part of the URL is replaced by the specified value.

EXAMPLE
"UrlReplacementValue": {
  "sensitive": false,
  "value": "https://mycompany-services.com/help-article"
},
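
To see how the UrlReplacementPattern and UrlReplacementValue examples work together, the Python sketch below decodes both values as they appear in the JSON examples and applies them to two made-up URLs. Python's re module stands in for whatever regex engine the connector uses, which is fine for this simple pattern but is an assumption. The function name rewrite_url is hypothetical.

```python
import json
import re

# Decode the two values exactly as they appear in the JSON examples.
pattern = json.loads(r'"https:\/\/help-internal(\\.qa)?\\.corp\\.mycompany\\.com"')
replacement = json.loads('"https://mycompany-services.com/help-article"')


def rewrite_url(url: str) -> str:
    """Replace the part of the URL matched by the pattern."""
    # A function replacement keeps the value literal (no backreference handling).
    return re.sub(pattern, lambda _match: replacement, url)


for url in (
    "https://help-internal.corp.mycompany.com/install",
    "https://help-internal.qa.corp.mycompany.com/install",
):
    print(url, "->", rewrite_url(url))
```

Note that the doubled backslashes in the JSON pattern decode to single backslashes, so the regex that's actually applied escapes each dot once.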

UseCookies (Boolean)

Whether cookies must be enabled to crawl. The default value is false. Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.

EXAMPLE
"UseCookies": {
  "sensitive": false,
  "value": "true"
},