Add/Edit Sitemap Source - Panel

When you have the required privileges, you can use a Sitemap source to make the content of the web pages listed in a Sitemap (a Sitemap file or a Sitemap index file) searchable.

You can keep a Sitemap source up to date with a Refresh schedule that frequently picks up new or changed pages without rescanning all Sitemap entries (see Edit a Source Schedule: [SourceName] - Panel).

About the source Refresh:

  • The Sitemap file must define the optional Last Modification Date attribute (e.g., <lastmod> for XML Sitemaps, <updated> for Atom Sitemaps, <pubDate> for RSS Sitemaps) for each URL. Otherwise, you need to schedule a source Rescan to catch new and changed items. Text Sitemaps do not contain such attributes.

  • Deleted sitemap entries require a source Rescan to be taken into account.

For secured websites (Sitemaps that are not publicly accessible), the source supports several authentication modes (see Supported Authentication Schemes).

Source Features Summary

Features Supported Additional information
Sitemap version XML, Text, RSS 2.0, Atom 1.0, and HTML

Sitemap files and Sitemap index files must respect the Sitemap protocol (these validations can, however, be turned off with a hidden parameter).

Sitemap files containing custom metadata are supported (see Adding and Indexing Custom Metadata in an XML Sitemap).

Searchable content type

Web pages (URL)

Content update Refresh
  • A Rescan or Rebuild is needed to take deleted web pages and Text Sitemap changes into account.

  • Refresh requires the Sitemap to define the optional Last Modification Date attribute (e.g., <lastmod> for XML Sitemaps, <updated> for Atom Sitemaps, <pubDate> for RSS Sitemaps) for each URL.

    The Last Modification Date attribute must specify the modification time in the W3C DateTime format: YYYY-MM-DDThh:mm:ss.

Rescan  
Rebuild  
Permission types Secured

Private  
Shared  
  • When you want to include the meta tags contained in the head of listed web pages as metadata for each source item, add the indexHtmlMetadata hidden parameter and set it to true in the source JSON configuration (see Edit a Source JSON Configuration: [SourceName] - Panel and Add a Hidden Source Parameter). Because the parameter affects indexing performance, it is set to false by default.

  • The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

    For example, in the tag <meta property="og:title" content="The Article Title"/>, the text The Article Title is indexed.
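    As a minimal sketch, the hidden parameter described above might appear in the source JSON configuration as follows. The entry shape (a parameter name mapped to an object holding a string value) is an assumption based on the EnableJavaScript excerpt shown later on this page, and the outer braces only stand in for the enclosing object, whose name is not shown here.

    ```json
    {
      "indexHtmlMetadata": {
        "value": "true"
      }
    }
    ```

    Removing the entry, or setting its value to false, would return the source to the default behavior described above.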

Requirements

Supported Sitemap File Formats

The source can include web pages from the following Sitemap file formats:

  • XML (Sitemap and index)

  • Text

  • Syndication Feeds (Atom 1.0 and RSS 2.0)

  • HTML

Supported Authentication Schemes

The source can authenticate with the following authentication schemes:

  • Basic

  • Digest

  • NTLM

  • Negotiate/Kerberos

  • Form based

You can enter the authentication parameters in the Authentication Section.

Add or Edit a Sitemap Source

  1. If not already done, ensure the Sitemap you want to include meets the source requirements (see Requirements).

  2. If not already in the Add/Edit a Sitemap source panel, go to the panel:

    • To add a source, in the main menu, under Content, select Sources > Add source button > Sitemap.

      OR

    • To edit a source, in the main menu, under Content, select Sources > source row > Edit in the Action bar.

  3. In the Configuration tab, enter appropriate values for available parameters:

    • Source name

      When you add a source, you must enter a source name that is shorter than 255 characters and unique (not already used by another source in this organization).

      Once a source is created, you cannot change its name. Instead, you need to create a new source with a similar configuration and the desired name, and then delete the original source.

      Take the time to plan and pick good source names.

      This name appears in the list of sources in this administration console, but may also be used by developers in search page JavaScript code, for example, to specify the scope of a search interface.

      Use a short and descriptive name, using letters, numbers, - and _ characters, and avoid spaces and other special characters.

    • URLs

      You must enter the URL(s) of a Sitemap index file or of Sitemap files (one entry per line) in either the http:// or https:// form.

      When you want to retrieve the content of the web pages listed in an XML Sitemap, enter the direct Sitemap URL instead of the website address. Otherwise, the source could interpret the web page as a Sitemap file in HTML and crawl the discovered links.

      For example, enter http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/. Examples of URLs you can enter:

      • http://myorgwebsite.com/sitemap.xml (Public website Sitemap)

      • http://myorgwebsite.com/sitemap.xml.gz (Public website Sitemap compressed with GZIP)

      • http://myorgwebsite.com/sitemap (Web page containing links such as a site map)

      Avoid including more than one Sitemap in a given source. Instead, create one source for each Sitemap.

      • By default, Sitemap files and Sitemap index files that fail any of the following validations, which are based on the Sitemap protocol, are ignored when the source content is included (see Sitemap protocol):

        • A Sitemap file must be no larger than 10 MB once uncompressed (the limit applies to the uncompressed size even when the file is compressed with GZIP).

        • A Sitemap file cannot contain more than 50,000 URLs.

        • All referenced URLs must be shorter than 2,048 characters.

        • All referenced URLs must be located under the path of the Sitemap that references them and be in the same domain. The location of a Sitemap file determines the set of URLs that can be included in that Sitemap.

          For example, a Sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URL starting with http://myorgwebsite.com/tech/ but cannot include URLs starting with http://myorgwebsite.com/catalog/.

      • When you do not want your Sitemap files and Sitemap index files to be validated, add the ParseSitemapInStrictMode hidden parameter and set it to false in the source JSON configuration (see Edit a Source JSON Configuration: [SourceName] - Panel and Add a Hidden Source Parameter). In this case, the above validations are not performed, so all web pages are included as long as their referenced URL is valid and absolute.

      • The Sitemap source can also retrieve all links contained in a web page. In that case, the crawler only treats the web page as a Sitemap file in HTML; it does not go on to expand the links it discovers in the linked pages.

        You can also select only a specific part of a web page to be included by adding the HtmlXPathSelectorExpression hidden parameter in the source JSON configuration. The parameter value must be an XPath expression that selects one or more nodes of a web page containing the URLs to crawl (see Edit a Source JSON Configuration: [SourceName] - Panel and Add a Hidden Source Parameter). By default, the source retrieves all listed web pages from an HTML Sitemap.

        For example, to index only a specific portion of the CBC Sitemap web page (only the web pages linked inside the cbc-sitemap div container), you add the parameter with the following value: //div[@id='cbc-sitemap'].

        • Any XPath expression that selects nodes can be used to set the portion of the web page to include (see XPath syntax).

        • You should also set the ParseSitemapInStrictMode hidden parameter to false in the source JSON configuration since an HTML web page does not follow the Sitemap protocol (see Sitemap Protocol and Add a Hidden Source Parameter).
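        Putting the two notes above together, a hedged sketch of the hidden parameters for the CBC example could look like this in the source JSON configuration. The entry shape is an assumption based on the EnableJavaScript excerpt shown on this page, and the outer braces only stand in for the enclosing object, whose name is not shown here.

        ```json
        {
          "HtmlXPathSelectorExpression": {
            "value": "//div[@id='cbc-sitemap']"
          },
          "ParseSitemapInStrictMode": {
            "value": "false"
          }
        }
        ```

        With strict mode turned off, the protocol validations listed earlier are skipped, which is why the two parameters are typically set together for an HTML web page.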

    • User agent

      The user agent string sent with HTTP requests by the Coveo Sitemap source to identify itself when downloading pages. When left empty, the default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).

    • Security

      Select who can see any Sitemap item from this source in search results of a search interface that includes this source in its scope (see Source Permission Types):

      • Shared: Everyone.

      • Private: Only you, when you are authenticated to the search interface with the identity with which you created the source.

  4. (Optional) When some of the website content you want to include is dynamically rendered by JavaScript, expand the Content to Include section, and then select the JavaScript-rendered check box. By default, the Sitemap source does not execute the JavaScript code in crawled website pages.

    Selecting the JavaScript-rendered option may significantly increase the time needed to crawl pages.

    For example, the web pages indexed by your Sitemap source are rendered dynamically by JavaScript code, and you want to index the dynamic content.

    By default, the crawler does not wait once the page is loaded. You may need to set the JavaScriptLoadingDelayInMilliseconds parameter in the Sitemap source JSON if the JavaScript takes longer to execute or makes asynchronous calls to create dynamic content (see Edit the Source Configuration in JSON Format). Consider changing the default value (0) to a reasonable delay that allows the dynamic content to render, such as the 1000 milliseconds used in the following excerpt.

     "EnableJavaScript": {
       "value": "true"
     },
     "JavaScriptLoadingDelayInMilliseconds": {
       "value": "1000"
     }
    
  5. (Optional) When a target website occasionally responds slowly, expand the Crawling Settings section, and then, in the Request timeout box, enter or use the up and down arrows to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there is no timeout.

    Increasing the timeout value prevents avoidable timeout errors when the website is slow to respond.

  6. When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature (see Web Scraping Configuration).

    For example, you have a Sitemap source for which you want to exclude the header and footer sections of HTML items, so you enter the following in the Web scraping configuration input.

       "sensitive": false,
         "value": "[\n  {\n      \"for\": {\n        \"urls\": [\".*\"]\n      },\n      \"exclude\": [\n        { \"path\": \"#ohHeader\" },\n        { \"path\": \"#MainSection > div.col-md-3\" },\n        { \"path\": \"#answerLink\" }\n      ],\n      \"metadata\": {\n        \"topicTitle\": { \"path\": \"div.topic  h1::text\" },\n        \"topicLastUpdate\": { \"path\": \"#LastUpdate::text\" }\n      }\n  }\n]"
       }
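    For readability, the escaped string in the value field above decodes to the following JSON array. It is only an unescaped view of the same configuration, not an additional setting: one rule that applies to all URLs (.*), excludes three page sections, and extracts two metadata fields. The path values appear to be CSS-style selectors; that reading is an assumption (see Web Scraping Configuration for the authoritative syntax).

    ```json
    [
      {
        "for": {
          "urls": [".*"]
        },
        "exclude": [
          { "path": "#ohHeader" },
          { "path": "#MainSection > div.col-md-3" },
          { "path": "#answerLink" }
        ],
        "metadata": {
          "topicTitle": { "path": "div.topic  h1::text" },
          "topicLastUpdate": { "path": "#LastUpdate::text" }
        }
      }
    ]
    ```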
    
  7. In the Configuration Access tab, determine whether each group and API key can view or edit the source configuration. In the Access Level column, select View or Edit for each available group. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

    You can select an access level only for the groups you are allowed to edit (see Privileges and Groups Privilege). Groups that you are not allowed to edit are grayed out.

    If you remove the Edit access level from all the groups of which you are a member, you will not be able to edit the source again after saving. Only administrators and members of other groups that have Edit access on this resource will be able to do so. To keep your ability to edit this resource, you must grant the Edit access level to at least one of your groups.

  8. Optionally, consider editing or adding mappings (see Edit the Mappings of a Source: [SourceName]).

    You can only manage mapping rules once you build the source (see Add/Edit a Source - Panel).

  9. Complete the addition or modification of your source:

    • Click Add Source/Save when you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon.

      In the Sources page, click Start initial build or Start initial rebuild in the source Status column to add the source content or to make your changes effective, respectively.

    • Click Add and Build Source/Save and Rebuild Source when you are done editing the source and want to make changes effective.

      Back in the Coveo Cloud administration console Sources page, you can review the progress of your source addition or modification (see Sources - Page).

What’s Next?

Review the default refresh schedule in which a source rescan starts every day (see Edit a Source Schedule: [SourceName] - Panel).