Add or Edit a Sitemap Source

When you have the required privileges, you can use a Sitemap source to make the content of the web pages listed in a Sitemap (a Sitemap file or a Sitemaps index file) searchable.

You can keep a Sitemap source up to date with a refresh schedule, which frequently picks up new or changed pages without rescanning all Sitemap entries (see Refresh Notes and Edit a Source Schedule).

For secured websites (Sitemaps that are not publicly accessible), the source supports several authentication modes (see Supported Authentication Schemes).

Source Features Summary

Features Supported Additional information
Sitemap version XML, Text, RSS 2.0, Atom 1.0, and HTML

Sitemap files and Sitemap index files must respect the Sitemap protocol (you can, however, turn off the validations with a parameter).

Supports Sitemap files containing custom metadata (see Adding and Indexing Custom Metadata in an XML Sitemap).

Searchable content type

Web pages (URL)

Content update Refresh
  • The Sitemap file must define the optional Last Modification Date attribute for each URL entry (e.g., lastmod for XML Sitemaps, updated for Atom Sitemaps, pubDate for RSS Sitemaps). If not, you need to schedule a source rescan operation to retrieve new and changed items. Text Sitemaps do not contain such attributes.

    The Last Modification Date attribute must specify the modification time in W3C DateTime format, i.e., YYYY-MM-DDThh:mm:ss (see Date and Time Formats). Moreover, unless you specify a timezone, you must express the modification time in Coordinated Universal Time (UTC).

  • A rescan or rebuild operation is required to retrieve deleted sitemap entries.

Rescan  
Rebuild  
Permission types Secured

Private  
Shared  
  • When you want to include the meta tags contained in the head of listed web pages as metadata for each source item, you must add the indexHtmlMetadata hidden parameter to the source JSON configuration and set it to true (see Edit a Source JSON Configuration and Add a Hidden Source Parameter). Since this parameter has an impact on the indexing performance, it is set to false by default.

  • The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

    In the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.
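
The rule above can be illustrated with a minimal Python sketch. It is not the source's actual implementation; the class and function names are hypothetical, and it only assumes that the first matching key attribute (name, property, itemprop, or http-equiv) labels the content value.

```python
from html.parser import HTMLParser

# Attributes that, per the note above, key a meta tag for indexing.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaTagCollector(HTMLParser):
    """Collects key/content pairs from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return  # Nothing to index without a content attribute.
        for key_attribute in KEY_ATTRIBUTES:
            key = attributes.get(key_attribute)
            if key:
                self.metadata[key] = content
                break


def collect_meta_tags(html):
    """Return a dict mapping each meta tag key to its content value."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.metadata


page = '<html><head><meta property="og:title" content="The Article Title"/></head></html>'
print(collect_meta_tags(page))  # {'og:title': 'The Article Title'}
```

A tag such as `<meta charset="utf-8">` has none of the four key attributes and no content attribute, so this sketch skips it.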

Requirements

Supported Sitemap File Formats

The source can include web pages from the following Sitemap file formats:

  • XML (Sitemap and index)

  • Text

  • Syndication Feeds (Atom 1.0 and RSS 2.0)

  • HTML
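
For the XML format, the following Python sketch shows how the entries of a small Sitemap and their lastmod values could be read, and why a refresh depends on that attribute. The sample Sitemap, the function names, and the comparison against a last refresh time are assumptions made for illustration; they do not describe how the source is implemented.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical Sitemap with one UTC timestamp and one timestamp with an offset.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://myorgwebsite.com/tech/page1.html</loc>
    <lastmod>2020-01-15T08:30:00</lastmod>
  </url>
  <url>
    <loc>http://myorgwebsite.com/tech/page2.html</loc>
    <lastmod>2020-02-01T10:00:00+02:00</lastmod>
  </url>
</urlset>"""


def parse_lastmod(text):
    """Parse a W3C DateTime value; a value without a timezone is read as UTC."""
    value = text.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def entries_modified_since(sitemap_xml, last_refresh):
    """Return the URLs whose lastmod is later than last_refresh."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None:
            continue  # Without lastmod, only a rescan can detect a change.
        if parse_lastmod(lastmod) > last_refresh:
            changed.append(loc)
    return changed


last_refresh = datetime(2020, 1, 31, tzinfo=timezone.utc)
print(entries_modified_since(SAMPLE_SITEMAP, last_refresh))
```

With the sample data, only page2.html is reported, because its modification time (08:00 UTC on February 1) is later than the assumed last refresh.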

Supported Authentication Schemes

The source can authenticate with the following authentication schemes:

  • Basic

  • Digest

  • NTLM

  • Negotiate/Kerberos

  • Form based

You can enter the authentication parameters in the Authentication Section.

Add or Edit a Sitemap Source

  1. If not already done, ensure the Sitemap you want to include meets the source requirements (see Requirements).

  2. If not already in the Add/Edit a Sitemap source panel, go to the panel:

    • To add a source, in the main menu, under Content, select Sources > Add source button > Sitemap.

      OR

    • To edit a source, in the main menu, under Content, select Sources > source row > Edit in the Action bar.

  3. In the Configuration tab, enter appropriate values for available parameters:

    • Source name

      When you add a source, you must enter a unique source name under 255 characters (not already in use for another source in this organization).

      Once a source is created, you cannot change its name. Instead, you need to create a new source with a similar configuration and the desired name, and then delete the original source.

      Take the time to plan and pick good source names.

      This name appears in the list of sources in this administration console, but may also be used by developers in search page JavaScript code, for example, to specify the scope of a search interface.

      Use a short and descriptive name, using letters, numbers, - and _ characters, and avoid spaces and other special characters.

    • URLs

      You must enter the URL(s) to a Sitemap index file or Sitemap files (one entry per line) in either the http:// or https:// form.

      When you want to retrieve the content of web pages listed in an XML Sitemap, enter the direct Sitemap URL instead of the website address. Otherwise, the source could interpret the web page as a Sitemap file in HTML and crawl the discovered links.

      You enter the following URL: http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.

      • http://myorgwebsite.com/sitemap.xml (Public website Sitemap)

      • http://myorgwebsite.com/sitemap.xml.gz (Public website Sitemap compressed with GZIP)

      • http://myorgwebsite.com/sitemap (Web page containing links such as a site map)

      Avoid including more than one Sitemap in a given source. Instead, create one source for each Sitemap.

      • By default, Sitemap files and Sitemap index files that do not pass the following validations, which are based on the Sitemap protocol, are ignored when the content is indexed (see Sitemap protocol):

        • A Sitemap file must be no larger than 10 MB once uncompressed (the limit applies even if the file is compressed with GZIP).

        • A Sitemap file cannot contain more than 50,000 URLs.

        • All referenced URLs must be less than 2,048 characters.

        • All referenced URLs must fall under the location (directory) of the Sitemap that references them and be in the same domain. The location of a Sitemap file determines the set of URLs that can be included in that Sitemap.

          A Sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but cannot include URLs starting with http://myorgwebsite.com/catalog/.
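
The four validations listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the source's validation code: the function names are hypothetical, 10 MB is taken as 10,485,760 bytes, and the location rule is approximated by a simple prefix check on the Sitemap directory.

```python
MAX_SITEMAP_BYTES = 10 * 1024 * 1024  # 10 MB, measured on the uncompressed file
MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048  # URLs must be shorter than this


def sitemap_scope(sitemap_url):
    """Return the URL prefix (the Sitemap directory) that entries must start with."""
    return sitemap_url.rsplit("/", 1)[0] + "/"


def validate_sitemap(sitemap_url, uncompressed_size, urls):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if uncompressed_size > MAX_SITEMAP_BYTES:
        problems.append("file larger than 10 MB")
    if len(urls) > MAX_URLS:
        problems.append("more than 50,000 URLs")
    scope = sitemap_scope(sitemap_url)
    for url in urls:
        if len(url) >= MAX_URL_LENGTH:
            problems.append("URL too long: " + url[:50])
        elif not url.startswith(scope):
            problems.append("URL outside Sitemap location: " + url)
    return problems


problems = validate_sitemap(
    "http://myorgwebsite.com/tech/sitemap.xml",
    uncompressed_size=1024,
    urls=[
        "http://myorgwebsite.com/tech/page1.html",
        "http://myorgwebsite.com/catalog/item1.html",
    ],
)
print(problems)
```

With this input, only the catalog URL is reported, since it does not start with http://myorgwebsite.com/tech/.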

      • When you do not want your Sitemap files and Sitemaps index file to be validated, add the ParseSitemapInStrictMode hidden parameter and set it to false in the source JSON configuration (see Edit a Source JSON Configuration and Add a Hidden Source Parameter). In this case, the above validations are not performed. Consequently, all web pages are included if their reference URL is valid and absolute.

      • The Sitemap source can retrieve all links contained in a web page. The crawler does not follow the discovered links any further; it only treats the web page as a Sitemap file in HTML and includes the pages it links to.

        You can also select only a specific part of a web page to be included by adding the HtmlXPathSelectorExpression hidden parameter in the source JSON configuration. The parameter value must be an XPath expression that selects one or more nodes of a web page containing the URLs to crawl (see Edit a Source JSON Configuration and Add a Hidden Source Parameter). By default, the source retrieves all listed web pages from an HTML Sitemap.

        You only want to index a specific portion of the CBC Sitemap web page (the web pages linked inside the cbc-sitemap div container), so you add the parameter with the following value: //div[@id='cbc-sitemap'].

        • Any XPath expression that selects nodes can be used to define the portion of the web page to include (see XPath syntax).

        • You should also set the ParseSitemapInStrictMode hidden parameter to false in the source JSON configuration since an HTML web page does not follow the Sitemap protocol (see Sitemap Protocol and Add a Hidden Source Parameter).
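
To show what such an XPath expression selects, here is a minimal Python sketch using the standard library. The sample page and the function name are hypothetical, and the page must be well-formed markup for this parser; the source itself may process real-world HTML differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with a navigation block and a site map block.
SAMPLE_PAGE = """<html>
  <body>
    <div id="navigation"><a href="http://myorgwebsite.com/login">Log in</a></div>
    <div id="cbc-sitemap">
      <ul>
        <li><a href="http://myorgwebsite.com/news">News</a></li>
        <li><a href="http://myorgwebsite.com/sports">Sports</a></li>
      </ul>
    </div>
  </body>
</html>"""


def links_in_selection(page, selector):
    """Return the href of every link found under the nodes matched by selector."""
    root = ET.fromstring(page)
    links = []
    for node in root.findall(selector):
        for anchor in node.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(href)
    return links


# ElementTree only accepts relative paths, so //div[...] is written .//div[...] here.
selected = links_in_selection(SAMPLE_PAGE, ".//div[@id='cbc-sitemap']")
print(selected)  # ['http://myorgwebsite.com/news', 'http://myorgwebsite.com/sports']
```

The link in the navigation div is left out because it sits outside the selected node.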

    • User agent

      The user agent string sent with HTTP requests by the Coveo Sitemap source to identify itself when downloading pages. When left empty, the default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
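
As an illustration of what this parameter controls, the Python sketch below builds an HTTP request that carries the default user agent string when none is provided. The function name is hypothetical and no request is actually sent.

```python
from urllib.request import Request

# Default value quoted above, used when the User agent box is left empty.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)"
)


def build_request(url, user_agent=""):
    """Build a request whose User-Agent header falls back to the default."""
    return Request(url, headers={"User-Agent": user_agent or DEFAULT_USER_AGENT})


request = build_request("http://myorgwebsite.com/sitemap.xml")
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

A website administrator can look for this string in the web server logs to recognize requests made by the source.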

    • Content security

      Select who can see any Sitemap item from this source in search results of a search interface that includes this source in its scope (see Content Security):

      • Shared: everyone

      • Private: only you, when you are authenticated to the search interface with the identity with which you created the source.

  4. In the Authentication section, when the Sitemap you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo Cloud organization to gain access to the secured content (see Supported Authentication Schemes and Source Credentials Leading Practices).

    The Sitemap source supports the following authentication types:

    • Basic authentication

      Select this option when the website uses the normal NTLM identity (see Basic access authentication).

      For Basic authentication, to prevent exposing your credentials, provide username and password information only when the website uses a communication protocol secured with TLS or SSL (HTTPS).
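
The reason for this precaution can be shown with a short Python sketch of the standard HTTP Basic scheme: the credentials are only Base64-encoded, not encrypted, so anyone who can read the request can decode them. The function name and the sample credentials are hypothetical.

```python
import base64


def basic_authorization_header(username, password):
    """Build the value of the Authorization header used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_authorization_header("jsmith", "secret")

# Decoding requires no key: over plain HTTP, the credentials are effectively readable.
scheme, token = header.split(" ", 1)
decoded = base64.b64decode(token).decode("utf-8")
print(scheme)   # Basic
print(decoded)  # jsmith:secret
```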

    • Manual form authentication

      Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.

    • Automatic form authentication

      Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by auto-detecting the form inputs.

    Depending on the selected authentication method, configure the appropriate parameters:

    • When selecting Basic authentication:

      1. In the Username box, enter the source credentials username as you would when you log in to the website you are making searchable.

      2. In the Password box, enter the source credentials password as you would when you log in to the website you are making searchable.

    • When selecting Manual form authentication:

      1. In the Form URL box, enter the website login page URL.

      2. (Optional) When there is more than one form on the login page, enter the Form name.

      3. Click the Action method drop-down menu, and then select the HTTP verb used to submit the authentication request, either POST or GET.

      4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

      5. (Optional) When the authentication request is sent to a URL other than the specified form URL, enter the Action URL. Otherwise, leave this box empty.

      6. In the Username input name and Password input name boxes, enter the name attribute value of the corresponding form inputs. To find these values, inspect the form HTML code and locate the <input name='abc' type='text' /> element for each parameter.

        Based on the HTML code below:

        <input name="login" type="email" />

        <input name="pwd" type="password" />

        login is the username input name and pwd is the password input name.
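
When a login form is long, the input names can also be listed with a few lines of Python. The sketch below is only an aid for reading the form HTML; the class and function names are hypothetical, and the guess (first password input, first text or email input) is an assumption that will not suit every form.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collects (name, type) pairs for every named <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            self.inputs.append((attributes["name"], attributes.get("type", "text")))


def guess_credential_inputs(form_html):
    """Return (username input name, password input name), or None when not found."""
    collector = FormInputCollector()
    collector.feed(form_html)
    password = next((n for n, t in collector.inputs if t == "password"), None)
    username = next((n for n, t in collector.inputs if t in ("text", "email")), None)
    return username, password


form_html = '<input name="login" type="email" /><input name="pwd" type="password" />'
print(guess_credential_inputs(form_html))  # ('login', 'pwd')
```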

      7. In the Username input value and Password input value boxes, enter the username and password values, respectively.

      8. When your form has other parameters than username and password, it is recommended to select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter.

        Under Other inputs, input values are displayed in clear text. You must therefore enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).

      9. Under Confirmation method, select the method used to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text.

        Depending on the selected confirmation method, configure the appropriate parameters:

        • When selecting Redirection to, enter the URL where users are redirected to when the login fails.

          https://mycompany.com/login/failed.html

        • When selecting Missing cookie, in the Value input, enter the name of the cookie that is set when an authentication is successful.

          ASP.NET_SessionId

        • When selecting Missing text, in the Value input, enter a string that is shown only to authenticated users.

          • Hello, jsmith@mycompany.com!
          • Log out
        • When selecting Text, in the Value input, enter a string that appears on the page when a login fails.

          • An error has occurred.
          • Your username or password is invalid.
    • When selecting Automatic form authentication:

      1. In the Form URL box, enter the website login page URL.

      2. In the Username and Password boxes, enter the credentials to use to log in.

      3. Under Confirmation method, select the method to use to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text presence.

        Depending on the selected confirmation method, enter the appropriate value:

        • When selecting Redirection to, enter the URL where users are redirected when the login fails.

          https://mycompany.com/login/failed.html

        • When selecting Missing cookie, in the Value input, enter the name of the cookie that is set when an authentication is successful.

          ASP.NET_SessionId

        • When selecting Missing text, in the Value input, enter a string that is shown only to authenticated users.

          • Hello, jsmith@mycompany.com!

          • Log out

        • When selecting Text, in the Value input, enter a string that appears on the page when a login fails.

          • An error has occurred.
          • Your username or password is invalid.
      4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

      5. (Optional) If your automatic form authentication configuration does not work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for assistance.
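
The four confirmation methods described above all answer the same question: did the login fail? The Python sketch below expresses that logic under stated assumptions. The function and its parameters (final_url, cookie_names, page_text) are hypothetical names for the state observed after the form is submitted; they are not part of the source configuration.

```python
def login_failed(method, value, final_url="", cookie_names=(), page_text=""):
    """Return True when the selected confirmation method indicates a failed login."""
    if method == "Redirection to":
        # Failure: the browser ended up on the configured failure URL.
        return final_url == value
    if method == "Missing cookie":
        # Failure: the cookie set on successful authentication is absent.
        return value not in cookie_names
    if method == "Missing text":
        # Failure: the text shown to authenticated users is absent.
        return value not in page_text
    if method in ("Text", "Text presence"):
        # Failure: the error text appears on the page.
        return value in page_text
    raise ValueError(f"Unknown confirmation method: {method}")


redirected = login_failed(
    "Redirection to",
    "https://mycompany.com/login/failed.html",
    final_url="https://mycompany.com/login/failed.html",
)
cookie_missing = login_failed(
    "Missing cookie", "ASP.NET_SessionId", cookie_names={"ASP.NET_SessionId"}
)
print(redirected)      # True: the failure page was reached
print(cookie_missing)  # False: the session cookie is present
```

Note that Missing cookie and Missing text detect failure through the absence of a success marker, while Redirection to and Text detect it through the presence of a failure marker.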

  5. (Optional) When some of the website content you want to include is dynamically rendered by JavaScript, expand the Content to Include section, and then select the JavaScript-rendered check box. By default, the Sitemap source does not execute the JavaScript code in crawled website pages.

    Selecting the JavaScript-rendered option may significantly increase the time needed to crawl pages.

    The web pages indexed by your Sitemap source are rendered dynamically by JavaScript code. You want to index the dynamic content.

    By default, the crawler does not wait once the page is loaded. You may need to set the JavaScriptLoadingDelayInMilliseconds parameter in the Sitemap source JSON if the JavaScript takes longer to execute or makes asynchronous calls to create dynamic content (see Edit the Source Configuration in JSON Format). Consider changing the default value (0) to a reasonable time allowing dynamic content rendering.

     "EnableJavaScript": {
       "value": "true"
     },
     "JavaScriptLoadingDelayInMilliseconds": {
       "value": "1000"
     }
    
  6. (Optional) When a target website occasionally responds slowly, expand the Crawling Settings section, and then, in the Request timeout box, enter or use the up and down arrows to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there is no timeout.

    Increasing the timeout value helps prevent avoidable timeout errors.

  7. When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature (see Web Scraping Configuration).

    You have a Sitemap source for which you want to exclude HTML item header and footer sections and extract the topic title and last update date as metadata, so you enter the following in the Web scraping configuration input.

       [
         {
           "for": {
             "urls": [".*"]
           },
           "exclude": [
             { "path": "#ohHeader" },
             { "path": "#MainSection > div.col-md-3" },
             { "path": "#answerLink" }
           ],
           "metadata": {
             "topicTitle": { "path": "div.topic  h1::text" },
             "topicLastUpdate": { "path": "#LastUpdate::text" }
           }
         }
       ]
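
In the source JSON configuration, a web scraping configuration like this one appears as an escaped string in a "value" field, next to a "sensitive" flag. The Python sketch below shows the round trip between the readable JSON array and that escaped string. The variable names are hypothetical, and the name of the enclosing parameter is deliberately left out because it is not given here.

```python
import json

# The readable web scraping configuration, as a Python structure.
scraping_configuration = [
    {
        "for": {"urls": [".*"]},
        "exclude": [
            {"path": "#ohHeader"},
            {"path": "#MainSection > div.col-md-3"},
            {"path": "#answerLink"},
        ],
        "metadata": {
            "topicTitle": {"path": "div.topic  h1::text"},
            "topicLastUpdate": {"path": "#LastUpdate::text"},
        },
    }
]

# Serializing the array to a string yields the escaped form stored in "value".
parameter = {
    "sensitive": False,
    "value": json.dumps(scraping_configuration, indent=2),
}
print(json.dumps(parameter, indent=2))

# Decoding the string gives back the original configuration.
decoded = json.loads(parameter["value"])
print(decoded == scraping_configuration)  # True
```

Parsing the string before saving is also a quick way to catch a syntax error, such as a missing brace or quote, in the configuration.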
    
  8. In the Access tab, determine whether each group and API key can view or edit the source configuration (see Understanding Resource Access):
    1. In the Access Level column, select View or Edit for each available group.
    2. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

    If you remove the Edit access level from all the groups of which you are a member, you will not be able to edit the source again after saving. Only administrators and members of other groups that have Edit access on this resource will be able to do so. To keep your ability to edit this resource, you must grant the Edit access level to at least one of your groups.

  9. Optionally, consider editing or adding mappings (see Manage Source Mappings).

    You can only manage mapping rules once you build the source (see Add or Edit a Source).

  10. Complete your source addition or edit:

    • Click Add Source/Save when you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon.

      In the Sources page, you must click Start initial build or Start required rebuild in the source Status column to add the source content or make your changes effective, respectively.

      OR

    • Click Add and Build Source/Save and Rebuild Source when you are done editing the source and want to make changes effective.

      Back in the Sources page, you can review the progress of your source addition or modification (see Manage Sources).

    Once the source is built or rebuilt, you can review its content in the content browser (see Content Browser - Page).

What’s Next?

Review the refresh schedule (see Edit a Source Schedule).