Add or Edit a Sitemap Source

Members of the Administrators and Content Managers built-in groups can use a Sitemap source to make the content of web pages listed in a sitemap file or a sitemap index file searchable.

A sitemap file is a component that can be added to a website, and it’s required when using a Sitemap source. The file lists the website’s URLs along with their respective metadata, which includes the last modification date of each page. This date enables the Sitemap source to perform a refresh, whereas a Web source must perform a rescan. For this reason, although a Sitemap source requires the extra step of adding a sitemap file, it offers better performance than a Web source.
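To make the idea of a sitemap file concrete, here is a minimal Python sketch that generates an XML sitemap in which each URL carries a last modification date. The page URLs and dates are hypothetical, and the function name build_sitemap is only illustrative. The timestamp is written in W3C DateTime format, in UTC with an explicit offset.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (str) for an iterable of (url, last_modified) pairs.

    Each last_modified value is expected to be a timezone-aware datetime.
    """
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # W3C DateTime, converted to UTC and written with an explicit offset.
        stamp = last_modified.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%S+00:00"
        )
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages and modification dates.
pages = [
    ("http://myorgwebsite.com/", datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)),
    ("http://myorgwebsite.com/products", datetime(2024, 2, 1, 14, 0, tzinfo=timezone.utc)),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

A page whose lastmod value hasn't changed since the previous pass doesn't need to be retrieved again, which is what makes a refresh cheaper than a rescan.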

For secured websites (sitemaps that aren’t publicly accessible), the source supports several authentication modes.

If you aren’t the owner of the website, ensure that you have the right to crawl its public content. Crawling websites that you neither own nor have the right to crawl could create accessibility issues.

Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, such software can identify the Coveo crawler as an attack and block it from any further crawling.

Source Key Characteristics

Features, supported options, and additional information:

  • Sitemap version: XML, Text, RSS 2.0, Atom 1.0, and HTML

    Sitemap files and sitemap index files must respect the Sitemap protocol (you can, however, disable validations with a parameter).

    Supports sitemap files containing custom metadata (see Index XML Sitemap Metadata).

  • Searchable content type: Web pages (URL)

  • Content update operations:

    • Refresh

      • The sitemap file must define the optional Last Modification Date attribute for each URL entry (e.g., lastmod for XML sitemaps, updated for Atom sitemaps, pubDate for RSS sitemaps). Text sitemaps don’t contain such attributes.

        The Last Modification Date attribute must specify the modification time in W3C DateTime format, i.e., YYYY-MM-DDThh:mm:ss (see Date and Time Formats). Moreover, unless you specify a time zone, you must express the modification time in Coordinated Universal Time (UTC).

      • A rescan or rebuild operation is required to take deleted and new sitemap entries into account.

    • Rescan: Takes place every day by default.

    • Rebuild

  • Content security options:

    • Determined by source permissions

    • Source creator

    • Everyone

The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

For example, in the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.
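The meta tag rule above can be illustrated with a short Python sketch that collects the content value of every meta tag keyed with name, property, itemprop, or http-equiv. This only mimics the selection rule described here. It is not Coveo's implementation, and the class name MetaCollector and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser

# Attributes that make a <meta> tag eligible, as listed above.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaCollector(HTMLParser):
    """Collect {key: content} pairs from eligible <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        for key in KEY_ATTRIBUTES:
            if attributes.get(key):
                self.found[attributes[key]] = content
                break


# Hypothetical page head; the charset tag has no content attribute and is skipped.
html = """
<head>
  <meta property="og:title" content="The Article Title"/>
  <meta name="description" content="A short summary"/>
  <meta charset="utf-8"/>
</head>
"""
collector = MetaCollector()
collector.feed(html)
print(collector.found)
```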

Requirements

Supported Sitemap File Formats

The source can include web pages from the following sitemap file formats:

  • XML (sitemap and index)

  • Text

  • Syndication Feeds (Atom 1.0 and RSS 2.0)

  • HTML

Supported Authentication Schemes

The source can authenticate with the following authentication schemes:

  • Basic

  • Digest

  • NTLM

  • Negotiate/Kerberos

  • Form based

You can enter the authentication parameters in the “Authentication” Section.

Add or Edit a Sitemap Source

When adding a source, in the Add a source of content panel, click the Cloud or the Crawling Module tab, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.

To edit a source, on the Sources page, click the desired source, and then, in the Action bar, click Edit.

The completion steps are especially important when creating or editing a source of this type.

“Configuration” Tab

In the Add/Edit a Sitemap Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General Information

Source Name

Enter a name for your source.

A source name can’t be modified once it’s saved, so be sure to use a short and descriptive name made of letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

URLs

Enter the URL(s) to a sitemap index file or sitemap files in either the http:// or https:// form.

When you want to retrieve the content of web pages listed in an XML sitemap, enter the direct sitemap URL instead of the website address. Otherwise, the source can interpret the web page as a sitemap file in HTML and crawl the discovered links. For example, enter http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.

  • Public website sitemap: http://myorgwebsite.com/sitemap.xml

  • Public website sitemap compressed with GZIP: http://myorgwebsite.com/sitemap.xml.gz

  • Web page containing links such as a sitemap: http://myorgwebsite.com/sitemap
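As an illustration of the first two URL forms, the following Python sketch reads the entries of an XML sitemap from raw bytes, whether or not the payload is compressed with GZIP. It is a simplified stand-in for what a crawler does after downloading the file, not a description of Coveo's code. The function name read_sitemap_entries and the sample sitemap are assumptions.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any GZIP stream
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_entries(payload):
    """Return (loc, lastmod) pairs from raw XML sitemap bytes, gzipped or not."""
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    root = ET.fromstring(payload)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod))
    return entries


# Hypothetical sitemap content, used both as-is and compressed.
raw = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://myorgwebsite.com/</loc><lastmod>2024-01-15</lastmod></url>
</urlset>"""

print(read_sitemap_entries(raw))
print(read_sitemap_entries(gzip.compress(raw)))
```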

Avoid including more than one sitemap in a given source. Instead, create one source for each sitemap.

  • By default, sitemap files and sitemap index files that don’t respect the following validations based on the sitemap protocol are ignored, and their content isn’t included:

    • A sitemap file must be no larger than 10 MB once uncompressed, even if the file is delivered compressed with GZIP.

    • A sitemap file can’t contain more than 50,000 URLs.

    • All referenced URLs must be less than 2,048 characters.

    • All referenced URLs must be relative to the sitemap that references them and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/.

  • When you don’t want your sitemap files and sitemaps index file to be validated, add the ParseSitemapInStrictMode hidden parameter and set it to false in the parameters section of the source JSON configuration. In this case, the above validations aren’t performed. Consequently, all web pages are included if their reference URL is valid and absolute.

  • The Sitemap source can retrieve all links contained in a web page by treating that page as a sitemap file in HTML. The Sitemap source crawler includes the pages that these links point to, but doesn’t go on to expand the links discovered in those pages.

  • You can also select to include only a specific part of a web page by adding the HtmlXPathSelectorExpression hidden parameter in the parameters section of the source JSON configuration.
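To check a sitemap against the default validations before adding it to a source, a small script can help. The Python sketch below approximates three of the rules listed above (URL count, URL length, and URL location relative to the sitemap). It leaves out the file size check, and it is an approximation written for this article rather than the validation logic the source actually runs. The function name sitemap_problems is hypothetical.

```python
from urllib.parse import urlsplit

MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048


def sitemap_problems(sitemap_url, urls):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(urls) > MAX_URLS:
        problems.append(f"{len(urls)} URLs exceeds the {MAX_URLS} limit")
    sitemap = urlsplit(sitemap_url)
    # Directory of the sitemap, e.g. "/tech/" for "/tech/sitemap.xml".
    base_path = sitemap.path.rsplit("/", 1)[0] + "/"
    for url in urls:
        if len(url) >= MAX_URL_LENGTH:
            problems.append(f"URL too long: {url[:60]}...")
            continue
        target = urlsplit(url)
        if target.netloc != sitemap.netloc:
            problems.append(f"different domain: {url}")
        elif not target.path.startswith(base_path):
            problems.append(f"outside {base_path}: {url}")
    return problems


problems = sitemap_problems(
    "http://myorgwebsite.com/tech/sitemap.xml",
    [
        "http://myorgwebsite.com/tech/article-1",
        "http://myorgwebsite.com/catalog/item-1",
    ],
)
print(problems)
```

With the example from the list above, the /tech/ URL passes and the /catalog/ URL is reported as being outside the sitemap's location.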

User Agent

Enter the user agent string you want the Sitemap source to send with HTTP requests to identify itself when downloading pages.

The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
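If you want to see how a website responds to this user agent, for example to check whether a security mechanism blocks it, you can send a test request with the same header. The Python sketch below only builds such a request with the standard library and doesn't send it. The function name build_request is an assumption for the example.

```python
from urllib.request import Request

# Default user agent string quoted above.
COVEO_DEFAULT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)"
)


def build_request(url, user_agent=COVEO_DEFAULT_UA):
    """Build (but don't send) an HTTP request carrying the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://myorgwebsite.com/sitemap.xml")
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would send it. Only do so against a website that you have the right to crawl.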

Paired Crawling Module

If your source is a Crawling Module source and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

Optical Character Recognition (OCR)

If you want Coveo Cloud to extract text from image files or PDF files containing images, check the appropriate box. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.

Index

When adding a source, if you have more than one logical (non-Elasticsearch) index in your organization, select the index in which the retrieved content will be stored (see Leverage Many Coveo Indexes). If your organization only has one index, this drop-down menu isn’t visible and you have no decision to make.

  • To add a source storing content in an index other than the default one, you need the View access level on the Logical Index domain (see Manage Privileges and Logical Indexes Domain).

  • Once the source is added, you can’t switch to a different index.

“Authentication” Section

When the Sitemap you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to gain access to the secured content. You can refer to the Source Credentials Leading Practices for additional information.

The Sitemap source supports the following authentication types. Click the desired authentication method for details on the parameters to configure.

  • Basic authentication

    Select this option when the desired website uses the normal NTLM identity. See Basic access authentication for details on how this option works.

  • Manual form authentication

    Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.

    Select this option when the desired website presents users with a form to fill to log in. You must specify the form input names and values.

  • Automatic form authentication

    Select this option when the desired website presents users with a form to fill to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

Basic Authentication

When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.

  • For Basic authentication, to prevent exposing your credentials, provide username and password information only when the website uses a communication protocol secured with TLS or SSL (HTTPS). However, if you do enter basic authentication credentials, they will be provided regardless of whether the link requiring these credentials uses HTTP or HTTPS. It’s your responsibility to ensure that your Sitemap links requiring basic authentication credentials use HTTPS for increased security.

  • If your Sitemap contains a link to a page of a different domain or subdomain that also requires basic authentication, the Coveo Cloud Sitemap connector will also provide the credentials you entered.
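The reason HTTPS matters for Basic authentication is that the credentials are only Base64-encoded in the Authorization header, not encrypted. The following Python sketch, using hypothetical credentials and function names, shows that anyone who can read the header can recover the username and password.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def recover_credentials(header_value):
    """Decode a Basic Authorization header value back into (username, password)."""
    _scheme, token = header_value.split(" ", 1)
    decoded = base64.b64decode(token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password


# Hypothetical credentials, for illustration only.
header = basic_auth_header("crawler", "s3cret")
print(header)
print(recover_credentials(header))
```

Over plain HTTP, this header travels in clear text with every request, which is why sitemap links that require Basic authentication should use HTTPS.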

Manual Form Authentication

Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.

When selecting Manual form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. (Optional) When there’s more than one form on the login page, enter the Form name.

  3. Click the Action method drop-down menu, and then select the HTTP verb used to submit the authentication request, either POST or GET.

  4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

  5. (Optional) When the authentication request is sent to a URL other than the specified form URL, enter the Action URL. Otherwise, leave the box empty.

  6. To fill in the Username input name and Password input name boxes, inspect the form HTML code, locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter its name attribute value.

    Based on the HTML code below:

    <input name="login" type="email" />

    <input name="pwd" type="password" />

    login is the username input name and pwd is the password input name.

  7. In the Username input value and the Password input value inputs, enter the username and password parameter values respectively.

  8. When your form has other parameters than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter.

    Under Other inputs, input values are displayed in clear text. Therefore, make sure to enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).


  9. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed only to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
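Finding the input names required in step 6, as well as any other inputs mentioned in step 8, can be done by reading the login page source. The Python sketch below lists the named inputs of a form. The form markup is a made-up example that extends the two inputs shown in step 6 with a hypothetical hidden csrf_token input, and the helper names are illustrative.

```python
from html.parser import HTMLParser


class InputNameFinder(HTMLParser):
    """Record (name, type) for every named <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                # An input without a type attribute defaults to "text" in HTML.
                self.inputs.append(
                    (attributes["name"], attributes.get("type", "text"))
                )


def find_input_names(form_html):
    finder = InputNameFinder()
    finder.feed(form_html)
    return finder.inputs


# Hypothetical login form.
login_form = """
<form action="/login" method="post">
  <input name="login" type="email" />
  <input name="pwd" type="password" />
  <input name="csrf_token" type="hidden" value="abc123" />
</form>
"""
for name, input_type in find_input_names(login_form):
    print(f"{name}: {input_type}")
```

Here, login and pwd would go in the Username input name and Password input name boxes, while csrf_token is the kind of parameter that the Autodetect hidden inputs check box is meant to handle.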

Automatic Form Authentication

When selecting Automatic form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. In the Username and Password boxes, enter the credentials to log in. See Source Credentials Leading Practices.

  3. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed only to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

    In addition, if you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.

  4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

  5. (Optional) If your automatic form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.
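The six confirmation methods available for form authentication all answer the same question: did the login fail? The Python sketch below expresses each method as a condition on the response that follows the login attempt. It is a reading of the descriptions in this article, not Coveo's implementation. In particular, matching regexes anywhere in the URL with re.search and comparing the redirection URL for exact equality are assumptions, as are the function and parameter names.

```python
import re


def login_failed(method, value, final_url="", cookies=(), body=""):
    """Return True when the chosen confirmation method indicates a failed login.

    final_url: URL reached after the login request (after redirects)
    cookies:   names of the cookies set by the website
    body:      text of the page returned after the login request
    """
    if method == "Redirection to":
        return final_url == value
    if method == "Missing cookie":
        return value not in cookies
    if method == "Missing URL":
        return re.search(value, final_url) is None
    if method == "Missing text":
        return value not in body
    if method == "Text presence":
        return value in body
    if method == "URL presence":
        return re.search(value, final_url) is not None
    raise ValueError(f"unknown confirmation method: {method}")


# The session cookie is present, so the login is considered successful.
print(login_failed("Missing cookie", "ASP.NET_SessionId", cookies={"ASP.NET_SessionId"}))
# The error message is displayed, so the login is considered failed.
print(login_failed("Text presence", "Your username or password is invalid.",
                   body="Your username or password is invalid."))
```

The "Missing" methods look for evidence of success and report a failure when it is absent, while Redirection to, Text presence, and URL presence look for evidence of failure directly.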

“Content to Include” Section

When the website pages indexed by your Sitemap source are rendered dynamically with JavaScript, expand the Content to Include section, and then select the JavaScript-rendered check box. By default, the Sitemap source doesn’t execute the JavaScript code in crawled website pages.

Selecting the JavaScript-rendered check box may significantly increase the time needed to crawl pages.

  • When the JavaScript takes longer than normal to execute or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with the longest rendering time are indexed with all their rendered content. Enter the time, in milliseconds, allowed for dynamic content to be retrieved before the content is indexed. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.

  • To configure page filters, you must edit the source JSON configuration and configure the addressPatterns hidden parameter.

“Crawling Settings” Section

When a target website is sometimes slow to respond:

  1. In the Request timeout box, use the + and - buttons to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there’s no timeout. By increasing the timeout value, you increase the delay tolerance and avoid timeout errors.

  2. Under Request interval delay, enter the number of milliseconds there should be between each request sent to retrieve your Sitemap content. The default value is 0, which means there’s no speed limitation. To decrease the crawling speed, enter a higher number. The maximum possible delay is 5000 milliseconds.
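The effect of these two settings can be pictured with a simple crawl loop: the timeout bounds how long each request may take, and the interval delay spaces out consecutive requests. The Python sketch below is a simplified model under that reading, not the crawler's actual code. The crawl function, its parameters, and the stub fetcher are assumptions made so that the example runs without network access.

```python
import time


def crawl(urls, fetch, request_timeout_s=100, request_interval_ms=0):
    """Fetch each URL in order, pausing between consecutive requests.

    fetch is any callable taking (url, timeout) and returning the content.
    """
    # A timeout of 0 means "no timeout", expressed here as None.
    timeout = request_timeout_s if request_timeout_s > 0 else None
    results = {}
    for index, url in enumerate(urls):
        if index > 0 and request_interval_ms > 0:
            time.sleep(request_interval_ms / 1000)
        results[url] = fetch(url, timeout)
    return results


def stub_fetch(url, timeout):
    """Stand-in for a real HTTP call; returns a fake payload."""
    return f"content of {url} (timeout={timeout})"


pages = crawl(
    ["http://myorgwebsite.com/a", "http://myorgwebsite.com/b"],
    stub_fetch,
    request_timeout_s=100,
    request_interval_ms=50,
)
print(pages)
```

To try this against a real website, stub_fetch could be replaced with a function that calls urllib.request.urlopen(url, timeout=timeout).read().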

“Web Scraping” Section

When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature.

For example, you have a Sitemap source for which you want to exclude HTML item header and footer sections and extract two metadata values, so you enter the following in the Web scraping configuration input:

[
  {
      "for": {
        "urls": [".*"]
      },
      "exclude": [
        { "path": "#ohHeader" },
        { "path": "#MainSection > div.col-md-3" },
        { "path": "#answerLink" }
      ],
      "metadata": {
        "topicTitle": { "path": "div.topic  h1::text" },
        "topicLastUpdate": { "path": "#LastUpdate::text" }
      }
  }
]

In the source JSON configuration, the same configuration appears as a single escaped string in a value field, next to "sensitive": false, with line breaks written as \n and double quotes written as \".
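A web scraping configuration is easier to check as structured data than as text. The Python sketch below rebuilds the rules from this example and serializes them in two ways: as readable JSON, and as one escaped string inside a value field, which is the form such a configuration takes when embedded in a source JSON configuration. Where this wrapper sits in the full source JSON isn't covered here, and the variable names are illustrative.

```python
import json

# The rules from the example: apply to every URL, exclude three page
# sections, and extract two metadata values.
scraping_rules = [
    {
        "for": {"urls": [".*"]},
        "exclude": [
            {"path": "#ohHeader"},
            {"path": "#MainSection > div.col-md-3"},
            {"path": "#answerLink"},
        ],
        "metadata": {
            "topicTitle": {"path": "div.topic  h1::text"},
            "topicLastUpdate": {"path": "#LastUpdate::text"},
        },
    }
]

# Readable form, suitable for the Web scraping configuration input.
readable = json.dumps(scraping_rules, indent=2)

# Escaped form: the whole configuration becomes one JSON string value.
parameter = {"sensitive": False, "value": readable}
escaped = json.dumps(parameter)

# Decoding the string value gives back the same rules.
decoded_rules = json.loads(json.loads(escaped)["value"])
print(escaped)
```

Building the configuration this way also catches syntax mistakes early, since json.dumps and json.loads fail on malformed data instead of silently producing an invalid configuration.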

“Content Security” Tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content Security.

“Access” Tab

In the Access tab, determine whether each group and API key can view or edit the source configuration (see Resource Access):

  1. In the Access Level column, select View or Edit for each available group.

  2. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to do other changes soon, click Add Source/Save.

      To add the source content or to make your changes effective, on the Sources page, you must click Launch build or Start required rebuild in the source Status column.

      OR

    • When you’re done editing the source and want to make changes effective, click Add and Build Source/Save and Rebuild Source.

      Back on the Sources page, you can review the progress of your source addition or modification.

    Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Optionally, consider editing or adding mappings.

    You can only manage mapping rules once the source has been built (see Refresh, Rescan, or Rebuild Sources).
