Add or Edit a Sitemap Source

Important

Coveo is discontinuing use of the PhantomJS web driver in its Web and Sitemap sources in January 2023.

Learn more about what you need to do.

Members of the Administrators and Content Managers built-in groups can use a Sitemap source to make the web pages listed in a sitemap file or a sitemap index file searchable.

A sitemap file is added to a website and is required when using a Sitemap source. The file lists the website’s URLs along with their respective metadata, including the last modified date (LMD). This metadata enables the Sitemap source to perform a refresh, whereas a Web source must perform a rescan. For this reason, although a Sitemap source requires the extra step of adding a sitemap file, it offers better performance than a Web source.
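As a rough illustration of the kind of file described above, the following Python sketch builds a minimal XML sitemap with a lastmod value for each URL, written in W3C DateTime format and expressed in UTC. The build_sitemap helper and the sample date are assumptions made for this example; only the urlset, url, loc, and lastmod element names and the namespace come from the Sitemap protocol.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return an XML sitemap string for (url, last_modified) pairs.

    Hypothetical helper: last_modified must be a timezone-aware datetime.
    """
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in entries:
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        # Convert to UTC and write the value with an explicit offset.
        utc_value = last_modified.astimezone(timezone.utc)
        lastmod = ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod")
        lastmod.text = utc_value.strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    ("http://myorgwebsite.com/", datetime(2023, 1, 15, 8, 30, 0, tzinfo=timezone.utc)),
])
print(sitemap_xml)
```

Keeping lastmod accurate for every entry is what lets a refresh pick up only the pages that changed.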

Source Key Characteristics

| Features | Supported | Additional information |
| --- | --- | --- |
| Searchable content types | Web pages (URL) | |
| Sitemap file format | XML, Text, RSS 2.0, Atom 1.0, HTML, GZ | Sitemap files and sitemap index files must respect the Sitemap protocol. Strict validations can be enforced by enabling the ParseSitemapInStrictMode option. For a .gz sitemap file, the web server response Content-Type header must be application/gzip. |
| Content refinement | Yes | Configure inclusion and exclusion filters to index only specific pages. |
| Sitemap file custom metadata indexing | Yes | Index metadata from third-party sitemap extensions or Coveo-specific metadata included in an XML sitemap file. |
| Web page meta tag metadata indexing | Yes | With the IndexHtmlMetadata setting enabled, the Sitemap crawler indexes the content attribute of meta tags when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv. |
| Web scraping | Yes | Exclude irrelevant sections in pages and extract metadata. |
| JavaScript content rendering | Yes | The Sitemap source crawler can run JavaScript in a web page to dynamically render content before indexing the page. |
| Content update operations: refresh | Yes | The sitemap file must define the optional Last Modification Date attribute for each URL entry (e.g., lastmod for XML sitemaps, updated for Atom sitemaps, pubDate for RSS sitemaps). Text sitemaps don’t contain such attributes. The Last Modification Date attribute must specify the modification time in W3C DateTime format, i.e., YYYY-MM-DDThh:mm:ss (see Date and Time Formats). Moreover, unless you specify a time zone, you must express the modification time in Coordinated Universal Time (UTC). A rescan or rebuild operation is required to take deleted and new sitemap entries into account. |
| Content update operations: rescan | Yes | Takes place every day by default. |
| Content update operations: rebuild | Yes | |
| Authentication methods: Basic authentication | Yes | Supported HTTP authentication schemes: Basic, Digest, NTLM, Negotiate/Kerberos, Form based. |
| Authentication methods: Form authentication | Yes | |
| Content security options: Same users and groups as in your content system | No | |
| Content security options: Specific users and groups | Yes | |
| Content security options: Everyone | Yes | |

Leading Practices

  • If you aren’t the owner of the website, ensure that you have the right to crawl its public content. Crawling websites that you neither own nor have the right to crawl could create reachability issues.

    Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, security software such as Akamai or Cloudflare may identify the Coveo crawler as an attack and block it from any further crawling.

  • The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About Crawling Speed for information on what can impact crawling speed, as well as possible solutions.

  • Break down large sitemap files into multiple sitemap files.

Troubleshooting Indexing Speed Issues

The table below lists common indexing speed issues you may face when indexing with the Sitemap source, along with actions to consider to fix them. Always review the Activity Browser page for the full context around an abnormal indexing activity. You can also download the source update logs for a chronological account of what happened during the process.

| Symptoms | Possible causes | Actions to consider |
| --- | --- | --- |
| The indexing rate is low or it suddenly drops during indexing. | By default, the Request interval delay value is 0 milliseconds and the Sitemap crawler doesn’t take into account website robots.txt Crawl-delay directives. The Sitemap crawler may be getting throttled by the web server. | Increase the Request interval delay value (e.g., 1000 milliseconds). |

Add or Edit a Sitemap Source

When adding a source, in the Add a source of content panel, select the Cloud or Crawling Module tab, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.

To edit a source, on the Sources page, click the desired source, and then click Edit in the Action bar.

The completion steps are especially important when creating or editing a source of this type.

"Configuration" Tab

In the Add/Edit a Sitemap Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General Information

Source Name

Enter a name for your source.

Tip
Leading practice

A source name can’t be modified once it’s saved. Use a short, descriptive name made of letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

URLs

Enter the URL(s) to a sitemap index file or sitemap files in either the http:// or https:// form. Enter the direct sitemap URL, and not the sitemap website address. Otherwise, the source can interpret the URL(s) as HTML format sitemap file(s) and crawl the links they contain. For example, enter the following URL: http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.

Keep in mind that when you add multiple starting addresses, a failure on one of the first starting addresses aborts the entire indexing operation. If you encounter such an issue, you can enable SkipOnSitemapError or split the affected sitemaps into their own sources for troubleshooting.

Examples
  • Public website sitemap: http://myorgwebsite.com/sitemap.xml

  • Public website sitemap compressed with GZIP: http://myorgwebsite.com/sitemap.xml.gz

  • Web page containing links such as a sitemap: http://myorgwebsite.com/sitemap
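The URL guidance above can be turned into a quick pre-check before you save the source. The following Python sketch is a heuristic only, and the check_sitemap_url helper is a hypothetical name used for illustration: it flags URLs that don’t use http:// or https:// and URLs that point at the website root rather than at a sitemap file.

```python
from urllib.parse import urlparse


def check_sitemap_url(url):
    """Return a list of warnings for a candidate sitemap URL (heuristic only)."""
    warnings = []
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        warnings.append("URL must start with http:// or https://")
    if parts.path in ("", "/"):
        # A bare website address may be treated as an HTML format sitemap.
        warnings.append("URL points at the website root, not at a sitemap file")
    return warnings


for candidate in ("http://myorgwebsite.com/sitemap.xml", "http://myorgwebsite.com/"):
    print(candidate, check_sitemap_url(candidate))
```

An empty list doesn’t prove that the URL serves a valid sitemap. It only rules out the two mistakes described above.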

Notes
  • The Sitemap source crawler only crawls pages listed in a sitemap file. It doesn’t crawl links in the listed web pages themselves.

  • The ParseSitemapInStrictMode JSON parameter dictates the extent of validation the Sitemap source applies on sitemap and sitemap index files, and on their referenced URLs.

  • To exclude certain web pages listed in a sitemap file, first configure and save your source with a broad URL. Then, see Refine the Content to Index.

  • With an HTML format sitemap file, you can choose to crawl only a specific part of the sitemap file using the HtmlXPathSelectorExpression JSON parameter.

User Agent

Enter the user agent string you want the Sitemap source to send with HTTP requests to identify itself when downloading pages.

The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
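To see how a web server or security layer reacts to this user agent, you can send a test request that carries the same string. The sketch below only builds the request object with the Python standard library. The build_request helper is an assumption made for this example, and actually sending the request is left to you.

```python
import urllib.request

# Default user agent string quoted above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build (without sending) an HTTP request that carries the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://myorgwebsite.com/sitemap.xml")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would then show whether the server answers normally or rejects the crawler identity.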

Paired Crawling Module

If your source is a Crawling Module source, and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

Optical Character Recognition (OCR)

If you want Coveo to extract text from image files or PDF files containing images, check the appropriate box. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.

Note

Contact Coveo Sales to add this feature to your organization license.

"Authentication" Section

If necessary, expand the Authentication section to configure the source credentials allowing your Coveo organization to gain access to the secured content you want to index. You can refer to the Source Credentials Leading Practices for additional information.

Important

Multi-factor authentication (MFA) and CAPTCHA aren’t supported.

The Sitemap source supports the following authentication types. Click the desired authentication method for details on the parameters to configure.

  • Basic authentication

    Select this option when the desired website uses the normal NTLM identity. See Understanding HTTP Authentication for details on how this option works.

  • Manual form authentication

    Note

    Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.

    Select this option when the desired website presents users with a form to fill to log in. You must specify the form input names and values.

  • Automatic form authentication

    Select this option when the desired website presents users with a form to fill to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

Basic Authentication

When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.

Notes
  • For Basic authentication, to prevent exposing your credentials, provide username and password information only when the website uses a communication protocol secured with TLS or SSL (HTTPS). However, if you do enter basic authentication credentials, they will be provided regardless of whether the link requiring these credentials uses HTTP or HTTPS. It’s your responsibility to ensure that your Sitemap links requiring basic authentication credentials use HTTPS for increased security.

  • If your Sitemap contains a link to a page of a different domain or subdomain that also requires basic authentication, the Sitemap connector will also provide the credentials you entered.
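Because the credentials are sent to every link in the sitemap, it can help to audit the sitemap before enabling Basic authentication. The following Python sketch, which assumes an XML sitemap and uses a hypothetical audit_sitemap_links helper, lists the links that don’t use HTTPS and the links that point to a host other than the expected one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the Sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap_links(sitemap_xml, expected_host):
    """Return (insecure, foreign) lists of URLs found in an XML sitemap."""
    root = ET.fromstring(sitemap_xml)
    insecure, foreign = [], []
    for loc in root.findall(".//sm:loc", NS):
        url = (loc.text or "").strip()
        parts = urlparse(url)
        if parts.scheme != "https":
            insecure.append(url)  # credentials would travel unencrypted
        if parts.hostname != expected_host:
            foreign.append(url)  # credentials would go to another host
    return insecure, foreign


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://myorgwebsite.com/a</loc></url>
  <url><loc>http://myorgwebsite.com/b</loc></url>
  <url><loc>https://docs.myorgwebsite.com/c</loc></url>
</urlset>"""

insecure, foreign = audit_sitemap_links(sample, "myorgwebsite.com")
print("Not HTTPS:", insecure)
print("Other host:", foreign)
```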

Manual Form Authentication
Note

Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.

When selecting Manual form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. (Optional) When there’s more than one form on the login page, enter the Form name.

  3. Click the Action method drop-down menu, and then select the HTTP verb used to submit the authentication request, which is either POST or GET.

  4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

  5. (Optional) When the authentication request is sent to another URL than the specified form URL, enter the Action URL. Otherwise, leave empty.

  6. For the Username input name and Password input name boxes, inspect the form HTML code, locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter its name attribute value.

    Example

    Based on the HTML code below:

    <input name="login" type="email" />

    <input name="pwd" type="password" />

    login is the username input name and pwd is the password input name.

  7. In the Username input value and the Password input value inputs, enter the username and password parameter values respectively.

  8. When your form has other parameters than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter.

    Important

    Under Other inputs, input values are displayed in clear text. You must therefore enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).

  9. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      Example

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      Example

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.

      Examples
      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      Examples
      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
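Step 6 of the procedure above asks for the name attribute of the username and password inputs. If the login page has many inputs, a short script can list them for you. The following Python sketch relies on the standard html.parser module. The InputNameCollector class and list_form_inputs helper are illustrative names, and the sample markup reuses the example from step 6.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect the (name, type) pair of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            # An input without a type attribute defaults to a text field.
            self.inputs.append((attributes.get("name"), attributes.get("type", "text")))


def list_form_inputs(html):
    collector = InputNameCollector()
    collector.feed(html)
    return collector.inputs


form_html = '<form><input name="login" type="email" /><input name="pwd" type="password" /></form>'
inputs = list_form_inputs(form_html)
print(inputs)
```

Here, login would go in the Username input name box and pwd in the Password input name box.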

Automatic Form Authentication

When selecting Automatic form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. In the Username and Password boxes, enter the credentials to log in. See Source Credentials Leading Practices.

  3. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      Example

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      Example

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.

      Examples
      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      Examples
      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

  4. If you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.

  5. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

  6. (Optional) If your automatic form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.
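The six confirmation methods all answer the same question: did the login fail? The following Python sketch restates each rule as a condition, which may help when choosing a method and its value. It’s a reading of the descriptions above, not Coveo’s implementation. The login_failed function, its parameters, and the use of a partial regex match are assumptions.

```python
import re


def login_failed(method, value, final_url="", cookies=(), page_text=""):
    """Return True when the chosen confirmation method indicates a failed login."""
    if method == "Redirection to":
        return final_url == value            # landed on the failure page
    if method == "Missing cookie":
        return value not in cookies          # success cookie was not set
    if method == "Missing URL":
        return re.search(value, final_url) is None
    if method == "Missing text":
        return value not in page_text        # text shown to authenticated users
    if method == "Text presence":
        return value in page_text            # error message is displayed
    if method == "URL presence":
        return re.search(value, final_url) is not None
    raise ValueError(f"Unknown confirmation method: {method}")


print(login_failed("Missing cookie", "ASP.NET_SessionId", cookies={"ASP.NET_SessionId"}))
print(login_failed("Text presence", "Your username or password is invalid.",
                   page_text="Your username or password is invalid."))
```

The first call reports a successful login because the cookie is present. The second reports a failure because the error message appears in the page.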

"Content to Include" Section

When the website pages indexed by your Sitemap source are rendered dynamically with JavaScript, expand the Content to Include section, and then select the JavaScript-rendered check box. By default, the Sitemap source doesn’t execute the JavaScript code in crawled website pages.

Important

Selecting the JavaScript-rendered check box may significantly increase the time needed to crawl pages.

Notes
  • When the JavaScript takes longer than normal to execute or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with the longest rendering time are indexed with all their rendered content. Enter the time, in milliseconds, allowed for dynamic content to be retrieved before the content is indexed. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.

  • To avoid indexing certain pages or to index only a few of them, first configure and save your source with a broad URL. Then, see Refine the Content to Index.

"Crawling Settings" Section

When a target website sometimes responds slowly:

  1. In the Request timeout box, use the + and - buttons to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there’s no timeout. By increasing the timeout value, you increase the delay tolerance and avoid timeout errors.

  2. Under Request interval delay, enter the number of milliseconds to wait between consecutive requests sent to retrieve your Sitemap content. The default value is 0, which means there’s no speed limitation. To decrease the crawling speed, enter a higher number. The maximum possible delay is 5000 milliseconds.
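To get a feel for what a request interval delay implies, the following Python sketch pauses between consecutive requests and computes the minimum time the pauses alone add to a crawl. It’s a simplified model, not the crawler’s actual scheduling. The fetch_all and minimum_crawl_seconds helpers are hypothetical, and the fetch function is a stand-in for a real HTTP call.

```python
import time


def fetch_all(urls, fetch, interval_ms=0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing interval_ms between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0 and interval_ms > 0:
            sleep(interval_ms / 1000)
        results.append(fetch(url))
    return results


def minimum_crawl_seconds(page_count, interval_ms):
    """Lower bound on crawl duration caused by the delay alone."""
    return max(page_count - 1, 0) * interval_ms / 1000


# Record the pauses instead of sleeping so that the example runs instantly.
pauses = []
pages = fetch_all(["a", "b", "c"], fetch=str.upper, interval_ms=1000, sleep=pauses.append)
print(pages, pauses)
print(minimum_crawl_seconds(10_000, 1000))
```

With a 1000-millisecond delay, a 10,000-page sitemap needs at least 9,999 seconds (close to three hours) of waiting time, which is worth weighing against the risk of being throttled.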

"Web Scraping" Section

When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature.

Example

You have a Sitemap source for which you want to exclude HTML item header and footer sections, so you enter the following in the Web scraping configuration input.

{
  "sensitive": false,
  "value": "[\n  {\n      \"for\": {\n        \"urls\": [\".*\"]\n      },\n      \"exclude\": [\n        { \"path\": \"#ohHeader\" },\n        { \"path\": \"#MainSection > div.col-md-3\" },\n        { \"path\": \"#answerLink\" }\n      ],\n      \"metadata\": {\n        \"topicTitle\": { \"path\": \"div.topic  h1::text\" },\n        \"topicLastUpdate\": { \"path\": \"#LastUpdate::text\" }\n      }\n  }\n]"
}
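In the example above, the value field holds the web scraping rules as a JSON document escaped into a single string, which is hard to write by hand. One possible approach, sketched below in Python, is to describe the rules as regular data structures and let json.dumps produce the escaped string. The variable names are illustrative, and the rules are the ones from the example.

```python
import json

# Web scraping rules from the example above, written as plain Python data.
scraping_rules = [
    {
        "for": {"urls": [".*"]},
        "exclude": [
            {"path": "#ohHeader"},
            {"path": "#MainSection > div.col-md-3"},
            {"path": "#answerLink"},
        ],
        "metadata": {
            "topicTitle": {"path": "div.topic  h1::text"},
            "topicLastUpdate": {"path": "#LastUpdate::text"},
        },
    }
]

# The rules become a JSON string stored in the "value" field.
parameter = {"sensitive": False, "value": json.dumps(scraping_rules, indent=2)}

# Serializing the wrapper escapes the inner quotes and line breaks.
print(json.dumps(parameter, indent=2))
```

Decoding the value field with json.loads returns the original rules, which is a convenient way to confirm that the escaping is correct before pasting the configuration.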

"Content Security" Tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content Security.

"Access" Tab

In the Access tab, set whether each group and API key can view or edit the source configuration (see Resource Access):

  1. If available, in the left pane, click Groups or API Keys to select the appropriate list.

  2. In the Access Level column for groups or API keys with access to source content, select View or Edit.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, for example because you plan to make other changes soon, click Add Source/Save.

      Note

      On the Sources page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

    • When you’re done editing the source and want to make changes effective, click Add and Build Source/Save and Rebuild Source.

      Back on the Sources page, you can review the progress of your source addition or modification.

      Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Optionally, consider editing or adding mappings once your source is done building or rebuilding.

Refine the Content to Index

You may want to avoid indexing certain pages, or to index only a few of them. To do so:

  1. If not already done, create and save your source with a broad URL.

  2. In your source JSON configuration, enter an address filter to refine the targeted content.

    Important

    Your URL must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns. Otherwise, Coveo will return a No Items Indexed error.

  3. Build or rebuild your source.
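The inclusion and exclusion rule stated in step 2 can be checked against a list of URLs before you rebuild. The following Python sketch models each filter as a regular expression, which is an assumption made for illustration. It doesn’t reproduce the actual addressPatterns JSON syntax, and is_indexed is a hypothetical helper.

```python
import re


def is_indexed(url, inclusions, exclusions):
    """Return True when the URL matches an inclusion and no exclusion."""
    included = any(re.search(pattern, url) for pattern in inclusions)
    excluded = any(re.search(pattern, url) for pattern in exclusions)
    return included and not excluded


# Hypothetical filters: keep the documentation pages, skip archived ones.
inclusions = [r"^https?://myorgwebsite\.com/docs/"]
exclusions = [r"/archive/"]

urls = [
    "http://myorgwebsite.com/docs/install",
    "http://myorgwebsite.com/docs/archive/v1",
    "http://myorgwebsite.com/blog/news",
]
kept = [url for url in urls if is_indexed(url, inclusions, exclusions)]
print(kept)
```

If no URL in the sitemap passes such a check, the source has nothing to index, which corresponds to the No Items Indexed error mentioned above.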

What’s Next?