Sitemap Source JSON Modification

This article presents Sitemap source hidden parameters that you can change by modifying the source JSON configuration from the Coveo Cloud Administration Console (see Edit a Source JSON Configuration and Source JSON Modification Examples).

The following list describes the advanced hidden parameters available with Sitemap sources. The parameter type (integer, string, etc.) appears between parentheses following the parameter name.

All parameters must be added to the parameters section of the source JSON configuration.
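
As a rough illustration of where these entries go, the following Python sketch adds hidden parameters to a source configuration loaded as a dictionary. The set_hidden_parameter helper and the minimal source_config structure are hypothetical. Only the parameters section and the sensitive and value attributes come from this article, and a real source JSON configuration contains many other keys.

```python
import json


def set_hidden_parameter(config, name, value, sensitive=False):
    """Add or overwrite one entry in the "parameters" section (hypothetical helper)."""
    # The examples in this article store every value as a string,
    # including Booleans ("true"/"false") and integers ("5").
    if isinstance(value, bool):
        value = "true" if value else "false"
    config.setdefault("parameters", {})[name] = {
        "sensitive": sensitive,
        "value": str(value),
    }
    return config


# Assumed, heavily simplified source configuration.
source_config = {"parameters": {}}
set_hidden_parameter(source_config, "AllowAutoRedirect", False)
set_hidden_parameter(source_config, "NumberOfRetries", 5)

print(json.dumps(source_config, indent=2))
```

Serializing Booleans and integers as strings mirrors the JSON examples shown for each parameter in this article.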

AdditionalHeaders (String)

Semicolon-separated list of additional HTTP headers to add to the connector requests, in the following format: key1\\=value1\\;key2\\=value2. The parameter is empty by default.

  • Set the sensitive attribute to true if the value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

  • You can’t use manual cookies if you crawl content dynamically rendered by JavaScript (if the JavaScript-rendered check box is selected in the Sitemap source configuration).

Special values allowed:

  • %[Username]

  • %[Password]

These values are replaced by the username and the password provided in the source configuration, respectively.

"AdditionalHeaders": {
  "sensitive": true,
  "value": "X-CSRF-Token\\=csrftokenvalue"
},

where you replace csrftokenvalue with the actual token.
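
When several headers are needed, the value can be assembled programmatically. The sketch below assumes that the doubled backslashes in the format above are JSON escaping for a single backslash, so the raw value uses \= between a key and its value and \; between headers. The additional_headers_entry helper and the X-Env header are hypothetical.

```python
import json


def additional_headers_entry(headers, sensitive=True):
    """Build the AdditionalHeaders entry from a dict of headers (hypothetical helper)."""
    # Raw (unescaped) form: key\=value pairs separated by \;
    raw = "\\;".join(f"{name}\\={value}" for name, value in headers.items())
    return {"AdditionalHeaders": {"sensitive": sensitive, "value": raw}}


entry = additional_headers_entry(
    {"X-CSRF-Token": "csrftokenvalue", "X-Env": "qa"}
)

# json.dumps escapes each backslash, producing the \\= and \\; seen in the JSON.
snippet = json.dumps(entry, indent=2)
print(snippet)
```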

AllowAutoRedirect (Boolean)

Whether the request should automatically follow redirection responses from the web resource. The default value is true.

You don’t want the crawler to follow redirections on indexed pages, so you add the following:

"AllowAutoRedirect": {
  "sensitive": false,
  "value": "false"
},

DateFormat (String)

(For XML sitemaps only) The custom format of the last modification dates in the Sitemap file. Specify this parameter when the dates aren’t in a standard format (e.g., YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger the SITEMAP_INVALID_FORMAT_ERROR error in the Administration Console. The format must use the MSDN format specifiers (see Custom date and time format strings).

"DateFormat": {
  "sensitive": false,
  "value": "yyyy;MM;ddTHH:mm:sszzz"
},
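
To see what kind of date string such a format describes, the Python sketch below parses a hypothetical last modification value with a strptime pattern that approximates the yyyy;MM;ddTHH:mm:sszzz format. Python's strptime only stands in for the connector's date parsing here, and the sample date is invented for illustration.

```python
from datetime import datetime, timedelta

# Custom format from the example above: yyyy;MM;ddTHH:mm:sszzz
# Rough strptime equivalent (an approximation for illustration only).
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"

# Hypothetical last modification value written in that non-standard format.
lastmod = "2023;04;15T13:45:30-04:00"

parsed = datetime.strptime(lastmod, PYTHON_EQUIVALENT)
print(parsed.isoformat())
```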

IndexHtmlMetadata (Boolean)

Whether metadata tags found in HTML files should be indexed. When set to true, the content attribute of meta tags is indexed for tags keyed with one of the following attributes: name, property, itemprop, or http-equiv. The default value is false since the parameter has an impact on indexing performance.

By default, the Coveo Cloud Platform converter more efficiently extracts meta HTML elements with a name attribute. Therefore, consider enabling this option only when you want to extract meta HTML elements with a property, itemprop, or http-equiv attribute.

"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
},

See Indexing XML Sitemap Metadata for more information.
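
The following Python sketch illustrates which meta tags the description above refers to. It collects the content attribute of meta tags keyed with name, property, itemprop, or http-equiv from a hypothetical page. It isn't the connector's implementation, and the MetaCollector class and sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser

# Attributes that key a meta tag, as listed in the parameter description.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaCollector(HTMLParser):
    """Collect the content attribute of keyed meta tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        for key_attribute in KEY_ATTRIBUTES:
            key = attributes.get(key_attribute)
            if key and "content" in attributes:
                self.metadata[key] = attributes["content"]
                break


# Hypothetical page head.
sample = """
<html><head>
  <meta name="description" content="Product help">
  <meta property="og:title" content="Help Center">
  <meta itemprop="author" content="Docs Team">
  <meta http-equiv="content-language" content="en">
  <meta charset="utf-8">
</head></html>
"""

collector = MetaCollector()
collector.feed(sample)
print(collector.metadata)
```

The charset tag is skipped because it carries none of the four key attributes.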

ForceBasicAuthorizationHeader (Boolean)

Whether to enforce basic header authentication. The default value is false. Set it to true when, for example, your server doesn’t challenge the caller for authentication, or when you get an HTTP 404 error in the Administration Console (this often occurs on non-IIS servers) that looks like the following:

Exception during item expansion: https://myorgwebsite.com/basicauth/user/password. -> The remote server returned an error: (404) Not Found.

"ForceBasicAuthorizationHeader": {
  "sensitive": false,
  "value": "true"
},
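
For context, the Python sketch below shows how a standard HTTP Basic Authorization header value is built from a username and a password. This article doesn’t describe how the connector builds the header, so treat the sketch as background on the Basic scheme. The credentials are placeholders.

```python
import base64


def basic_authorization_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials, echoing the user/password in the sample URL above.
print(basic_authorization_header("user", "password"))
```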

HtmlXPathSelectorExpression (String)

The XPath expression used to select one or more nodes of a web page containing the URLs to crawl. By default, the connector indexes all web pages listed in HTML Sitemaps.

You only want to index a specific part of the CBC Sitemap web page (only the web pages linked inside the cbc-sitemap div container), so you add the following:

"HtmlXPathSelectorExpression": {
  "sensitive": false,
  "value": "//div[@id='cbc-sitemap']"
},

The ParseSitemapInStrictMode hidden parameter should also be set to false since an HTML web page doesn’t follow the Sitemap protocol.
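
The effect of such a selector can be sketched in Python with the standard library. The page excerpt below is hypothetical, and ElementTree supports only a subset of XPath, so the expression is written as a relative path. The connector’s own XPath engine may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt of an HTML Sitemap page.
page = """
<html><body>
  <div id="header"><a href="https://example.com/login">Log in</a></div>
  <div id="cbc-sitemap">
    <a href="https://example.com/news">News</a>
    <a href="https://example.com/sports">Sports</a>
  </div>
</body></html>
"""

root = ET.fromstring(page)

# ElementTree needs a relative path, hence the leading "." before
# the descendant search for the div container.
selected_nodes = root.findall(".//div[@id='cbc-sitemap']")

# Only links inside the selected container are kept.
urls = [a.get("href") for node in selected_nodes for a in node.iter("a")]
print(urls)
```

The login link in the header div is ignored because it sits outside the selected node.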

ManualCookies (String)

A collection of manual cookies to inject with each HTTP web request in the following format:

MyCookieName=MyCookieValue;Domain=coveo.com;Expires=Wdy, DD Mon YYYY HH:MM:SS GMT;Path=/;Secure;HttpOnly

where you replace the placeholder values with your own information.

When you need to define more than one cookie, separate each cookie definition with the ;; separator. The parameter is empty by default. Use this parameter when your website doesn’t use one of the four supported authentication schemes and thus needs a specific cookie for crawling (see Supported Authentication Schemes).

  • Set the sensitive attribute to true if the value contains sensitive information. Otherwise, the value will appear in clear text in the JSON configuration.

  • You can’t use manual cookies if you crawl content dynamically rendered by JavaScript (if the JavaScript-rendered check box is selected in the Sitemap source configuration).

"ManualCookies": {
  "sensitive": false,
  "value": "MyFirstCookie=MyFirstValue;Domain=www.coveo.com;;MySecondCookie=MySecondValue;Domain=www.example.com"
},

  • The only mandatory attributes are the cookie name, its value, and the domain to which the cookie belongs. All attributes must be separated by a semicolon (;).

  • The supported optional attributes are:

    • Expires: the expiration date in RFC 1123 format (Wdy, DD Mon YYYY HH:MM:SS GMT).

    • Path: the subfolder path to which the cookie belongs (relative to the root domain).

    • Secure: limits cookie communication to encrypted transmission.

    • HttpOnly: directs browsers not to expose cookies through channels other than HTTP (and HTTPS) requests.

    The Secure and HttpOnly attributes don’t have associated values. The presence of their attribute names indicates that their behaviors are enabled.
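
The rules above can be sketched as a small Python routine that assembles the parameter value. The format_cookie and format_manual_cookies helpers are hypothetical, and the attribute order simply follows the format shown earlier. This article doesn’t state whether the order matters.

```python
def format_cookie(name, value, domain, expires=None, path=None,
                  secure=False, http_only=False):
    """Build one cookie definition (hypothetical helper)."""
    # Name, value, and domain are the only mandatory attributes.
    parts = [f"{name}={value}", f"Domain={domain}"]
    if expires:
        parts.append(f"Expires={expires}")
    if path:
        parts.append(f"Path={path}")
    # Secure and HttpOnly are flags: their presence alone enables them.
    if secure:
        parts.append("Secure")
    if http_only:
        parts.append("HttpOnly")
    return ";".join(parts)


def format_manual_cookies(cookie_definitions):
    """Join several cookie definitions with the ;; separator."""
    return ";;".join(cookie_definitions)


value = format_manual_cookies([
    format_cookie("MyFirstCookie", "MyFirstValue", "www.coveo.com"),
    format_cookie("MySecondCookie", "MySecondValue", "www.example.com",
                  path="/", secure=True),
])
print(value)
```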

NumberOfRetries (Integer)

The number of retries allowed when a failed web request is recoverable. Only the following HTTP errors will be retried: 408, 500, 503, and 504. The default value is 3 retries.

"NumberOfRetries": {
  "sensitive": false,
  "value": "5"
},
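
The retry behavior described above can be pictured with the following Python sketch. It isn’t the connector’s implementation. The fetch_with_retries function is hypothetical, and it assumes that the retry count excludes the initial attempt.

```python
# HTTP status codes that this article lists as retried.
RETRYABLE_STATUS_CODES = {408, 500, 503, 504}


def fetch_with_retries(send_request, number_of_retries=3):
    """Call send_request until it succeeds, fails permanently, or retries run out.

    Returns the last status code and the total number of attempts.
    """
    attempts = 0
    while True:
        status = send_request()
        attempts += 1
        retries_used = attempts - 1
        if status not in RETRYABLE_STATUS_CODES or retries_used >= number_of_retries:
            return status, attempts


# Simulated server: two recoverable failures, then success.
responses = iter([503, 504, 200])
status, attempts = fetch_with_retries(lambda: next(responses), number_of_retries=5)
print(status, attempts)
```

In this sketch, a non-recoverable error such as 404 is returned after a single attempt.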

ParsableContentTypes (String)

A list of content types in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is "application/xhtml+xml", "application/xml", "text/html".

"ParsableContentTypes": {
  "sensitive": false,
  "value": "[\"application/xhtml+xml\", \"application/xml\", \"text/html\"]"
},

ParsableContentTypesSuffixes (String)

A list of content type suffixes in JSON format, for which the content will be parsed to find data such as hyperlinks. The default value is +xml.

"ParsableContentTypesSuffixes": {
  "sensitive": false,
  "value": "[\"+xml\", \"+json\"]"
},
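
Both parameters hold a JSON array encoded as a string. The Python sketch below decodes the two example values and applies one plausible matching rule: an exact media type match, or a media type ending with a listed suffix. The exact rule the connector applies isn’t documented here, so the is_parsable function is an assumption.

```python
import json

# String values as they would read once decoded from the source JSON.
parsable_types_value = '["application/xhtml+xml", "application/xml", "text/html"]'
parsable_suffixes_value = '["+xml", "+json"]'

PARSABLE_TYPES = set(json.loads(parsable_types_value))
PARSABLE_SUFFIXES = tuple(json.loads(parsable_suffixes_value))


def is_parsable(content_type_header):
    """Assumed rule: exact media type match, or a listed suffix."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PARSABLE_TYPES or media_type.endswith(PARSABLE_SUFFIXES)


for header in ("text/html; charset=utf-8", "application/rss+xml", "image/png"):
    print(header, is_parsable(header))
```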

ParseSitemapInStrictMode (Boolean)

Whether each Sitemap file should be parsed in strict mode. In strict mode, the parsing throws an exception when the Sitemap file doesn’t follow the protocol specification (see Sitemap protocol). The default value is false. Set it to true when you want to index your Sitemap files with standard protocol validations.

"ParseSitemapInStrictMode": {
  "sensitive": false,
  "value": "true"
},

ReadTimeout (Integer)

The timeout duration, in seconds, that applies when the connector reads content from a stream (i.e., when downloading a Sitemap file or web page content). When no input is received from the stream, the crawler waits for this duration before moving on. The default value is 300 seconds.

"ReadTimeout": {
  "sensitive": false,
  "value": "500"
},

UrlReplacementPattern (String)

The regular expression (regex) that matches the part of the URL to replace with the value of the UrlReplacementValue hidden parameter. The parameter is empty by default.

The UrlReplacementValue hidden parameter must also be set if you use this parameter.

"UrlReplacementPattern": {
  "sensitive": false,
  "value": "https:\/\/help-internal(\\.qa)?\\.corp\\.mycompany\\.com"
},

UrlReplacementValue (String)

The value that replaces the part of the URL matched by the pattern specified with the UrlReplacementPattern hidden parameter. The parameter is empty by default. If the UrlReplacementPattern parameter isn’t specified, the authority part of the URL is replaced by the specified value.

"UrlReplacementValue": {
  "sensitive": false,
  "value": "https://mycompany-services.com/help-article"
},
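
To check how the two example values work together, the Python sketch below decodes them from their JSON form and applies the substitution to hypothetical URLs. Python’s re module stands in for the connector’s regex engine, which this article doesn’t specify, and the rewrite_url function is an assumption made for the example.

```python
import json
import re

# Values as written in the JSON examples above; json.loads undoes the JSON escaping.
pattern = json.loads(r'"https:\/\/help-internal(\\.qa)?\\.corp\\.mycompany\\.com"')
replacement = json.loads('"https://mycompany-services.com/help-article"')


def rewrite_url(url):
    """Replace the part of the URL matched by the pattern (illustrative only)."""
    return re.sub(pattern, replacement, url, count=1)


# Hypothetical internal URLs, with and without the optional .qa segment.
for url in (
    "https://help-internal.qa.corp.mycompany.com/docs/setup.html",
    "https://help-internal.corp.mycompany.com/faq",
):
    print(rewrite_url(url))
```

Decoding the JSON string first makes the effective regex visible: each \\. in the JSON becomes \. in the pattern, which matches a literal period.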

UseCookies (Boolean)

Whether cookies must be enabled to crawl. The default value is false. Set the value to true when you want a cookie container to be initialized and reused for each crawling web request.

"UseCookies": {
  "sensitive": false,
  "value": "true"
},