Adding a Sitemap Source

You can add the content of your websites to your Coveo organization. Coveo indexes the web pages listed in a Sitemap (a Sitemap file or a Sitemap index file) to make them searchable, either by you only when you make the source private, or by all members of the Coveo organization when you share the source. For secured websites (where the Sitemap isn’t publicly accessible), the source supports some authentication modes.

The Sitemap source panel doesn’t yet let you set a Retrieve new content schedule, but you can set a Rescan all content for changes schedule to regularly update the whole source content (see Modify a Source Schedule).

Source Features Summary

Features Supported Additional information
Sitemap version XML, Text, RSS 2.0, Atom 1.0, and HTML

Sitemap files and Sitemap index files must respect the Sitemap protocol (these validations can, however, be turned off with a parameter).

Supports Sitemap files containing custom metadata (see [Indexing XML Sitemap Metadata](/en/2656/)).

Searchable content type

Web pages (URL)

Content update Incremental refresh
  • A full refresh or rebuild is needed to detect deleted web pages and changes to Sitemaps in Text format.

  • To be supported, each URL in the Sitemap must define the optional Last Modification Date attribute (e.g., <lastmod> for XML Sitemaps, <updated> for Atom Sitemaps, <pubDate> for RSS Sitemaps).

    The Last Modification Date attribute must specify the modification time in the W3C DateTime format: YYYY-MM-DDThh:mm:ss.

Full refresh
Rebuild
Permission types Secured

Private
Shared
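
The incremental refresh requirement above can be checked before you add the source. The following is a minimal sketch, not Coveo code: it assumes an XML Sitemap that uses the standard sitemaps.org namespace, the function name `urls_without_valid_lastmod` is hypothetical, and only the YYYY-MM-DDThh:mm:ss form stated above is accepted (any time zone suffix is ignored).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace used by XML Sitemaps that follow the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_without_valid_lastmod(sitemap_xml: str) -> list[str]:
    """Return the <loc> of every <url> whose <lastmod> is missing or malformed."""
    problems = []
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        lastmod = (url.findtext("sm:lastmod", default="", namespaces=NS) or "").strip()
        try:
            # Only the first 19 characters (YYYY-MM-DDThh:mm:ss) are checked;
            # this sketch does not validate a time zone suffix.
            datetime.strptime(lastmod[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            problems.append(loc)
    return problems
```

An empty result suggests that every listed page carries a usable modification time; any returned URL would likely need a full refresh or rebuild to be updated.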

Add a Sitemap Source

To edit a Sitemap source, see Edit the Source Configuration to Re-Index its Content or Re-Authorize the Access, and then follow the steps below, starting from step 5.

  1. Ensure your Sitemap or Sitemaps index file is in one of the following supported formats:

    • XML, Text, RSS 2.0, Atom 1.0, and HTML
  2. If not already done, log in to your Coveo organization.

  3. In the navigation bar on the left, under Search Content, select Sources, and then click Add Source.

  4. On the Add Source page, click Sitemap.

    When you create a source, you become the owner of the source.

  5. In the Add/Edit a Sitemap Source dialog box:

    1. In the Source Name box, enter a descriptive name of your choice for the source.

    2. In the URLs box, enter the URL of each Sitemap file or Sitemap index file that you want to make searchable, including the protocol (http:// or https://) and the trailing slash (/), then press the Enter key or click Add.

      • By default, Sitemap files and Sitemap index files that don’t respect the following validations based on the Sitemap protocol are ignored during the indexing process (see Sitemap protocol):

        • A Sitemap file must be no larger than 10 MB uncompressed, even when the file is compressed with GZIP.

        • A Sitemap file can’t contain more than 50,000 URLs.

        • All referenced URLs must be less than 2,048 characters.

        • All referenced URLs must be relative to the Sitemap that references them and in the same domain. The location of a Sitemap file determines the set of URLs that can be included in that Sitemap.

          A Sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/.

      • When you want to retrieve the content of web pages listed in an XML Sitemap, enter the direct URL of the Sitemap file rather than the website address. Otherwise, the source could interpret the web page as a Sitemap file in HTML and crawl the discovered links.

        For example, enter http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.

      • The Sitemap source can retrieve all links contained in a web page by treating that page as a Sitemap file in HTML. The crawler retrieves the linked pages but doesn’t follow the links it discovers on them.

        For example, to include the content of a web page containing links, such as a site map, enter the following URL: http://myorgwebsite.com/sitemap.

      • http://myorgwebsite.com/sitemap.xml (Public website Sitemap)

      • http://myorgwebsite.com/sitemap.xml.gz (Public website Sitemap compressed with GZIP)

      • To add a URL, click Add.

      • To remove a URL, click Delete.

        Create one source per website. If you choose to include more than one address, ensure that all parameters are applicable to all addresses specified in the URLs box.
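
To preview which URLs a Sitemap file lists before adding it to the URLs box, the sketch below reads the `<loc>` entries from XML Sitemap content, whether plain or compressed with GZIP. It is an illustration under assumptions, not the source's actual parser: the function name `list_sitemap_urls` is hypothetical, and only the XML format with the standard sitemaps.org namespace is handled.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by XML Sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def list_sitemap_urls(data: bytes) -> list[str]:
    """Return every <loc> value found in a plain or GZIP-compressed XML Sitemap."""
    # GZIP streams start with the magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    # <urlset> lists pages and <sitemapindex> lists other Sitemap files;
    # both keep their URLs in <loc> elements.
    return [
        loc.text.strip()
        for loc in root.iter("{%s}loc" % SITEMAP_NS)
        if loc.text
    ]
```

The bytes passed in would come from downloading a URL such as http://myorgwebsite.com/sitemap.xml or http://myorgwebsite.com/sitemap.xml.gz with the HTTP client of your choice.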

    3. In the User Agent box, enter the name used by your Coveo organization to identify itself to the website when downloading pages. Leave empty to use the default value (CoveoEnterpriseSearch).

    4. In the Additional Headers box, enter a semicolon-separated list of additional HTTP headers added to the web requests sent by the source in the following format:

      key1=value1;key2=value2
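
As an illustration of this format, the sketch below splits such a string into header name/value pairs. It assumes that entries are separated by semicolons and that each entry is a `key=value` pair; the function name `parse_additional_headers` and the header names in the usage note are hypothetical.

```python
def parse_additional_headers(raw: str) -> dict[str, str]:
    """Split a semicolon-separated list of key=value entries into a dict."""
    headers = {}
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon or empty entries
        key, sep, value = entry.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got: {entry!r}")
        headers[key.strip()] = value.strip()
    return headers
```

For instance, `parse_additional_headers("X-Env=staging;X-Team=docs")` yields two headers. This simple split can't represent a header value that itself contains a semicolon.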

    5. Enable the Parse Sitemap in strict mode toggle button when the specified Sitemap files or Sitemap index files respect the following validations based on the Sitemap protocol specifications (see Sitemap protocol):

      • A Sitemap file must be no larger than 10 MB uncompressed, even when the file is compressed with GZIP.

      • A Sitemap file can’t contain more than 50,000 URLs.

      • All referenced URLs must be less than 2,048 characters.

      • All referenced URLs must be relative to the Sitemap that references them and in the same domain. The location of a Sitemap file determines the set of URLs that can be included in that Sitemap.

        A Sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/.

      By default, all web pages are indexed if their reference URL is valid and absolute.
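
The validations listed above can be approximated in a few lines. This is a sketch only, not the checks that Coveo actually runs: `check_sitemap` and its arguments are hypothetical, 10 MB is assumed to mean 10 × 1024 × 1024 bytes, and the rule about URLs being relative to the Sitemap is interpreted as "starts with the URL of the Sitemap's directory".

```python
MAX_BYTES = 10 * 1024 * 1024  # assumed meaning of "10 MB"
MAX_URLS = 50_000
MAX_URL_LENGTH = 2_048


def check_sitemap(sitemap_url: str, uncompressed_size: int, page_urls: list[str]) -> list[str]:
    """Return a list of violations (empty when all checks pass)."""
    errors = []
    if uncompressed_size > MAX_BYTES:
        errors.append("file larger than 10 MB uncompressed")
    if len(page_urls) > MAX_URLS:
        errors.append("more than 50,000 URLs")
    # Directory of the Sitemap, e.g. http://myorgwebsite.com/tech/
    base = sitemap_url.rsplit("/", 1)[0] + "/"
    for url in page_urls:
        if len(url) >= MAX_URL_LENGTH:
            errors.append(f"URL too long: {url[:60]}...")
        if not url.startswith(base):
            errors.append(f"URL outside {base}: {url}")
    return errors
```

With the example above, a Sitemap at http://myorgwebsite.com/tech/sitemap.xml that lists a page under http://myorgwebsite.com/catalog/ would produce one "URL outside" entry.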

    6. In the Security drop-down menu, select whether you want the website content to be Shared or Private (see Source Permission Types).

    7. Optionally, under URL Replacement, modify page URLs.

      The following options are useful when page URLs referenced in the sitemap aren’t the same as those that are accessible to the source crawler. A part of the page URLs (such as the host name, port, or path) can be different and automatically replaced.

      For example, you can access your website with different URLs depending on which side of a firewall you’re on. Your website Sitemap file contains the default internal host name URLs:

      https://internal.corp.mysite.com/agivenpage.html

      However, the Coveo Cloud source crawler can only access the site content through its public DNS site host name:

      https://mynicesiteurl.com/agivenpage.html

      By replacing the URLs in the source configuration, you allow the source to see the pages.

      1. In the Replacement Pattern box, enter a regular expression (.NET flavor) that matches the part of the URL to replace.

        With the above example, you can enter the internal host name:

        internal.corp.mysite.com

      2. In the Replacement Value box, enter the expression that replaces the matched value.

        Again with the above example, enter the public DNS host name:

        mynicesiteurl.com

      • This replacement doesn’t change the clickable URI; it only modifies where the source crawler fetches the content.

      • If you leave the Replacement Pattern box empty, the matched part is assumed to be the URL authority (host name + port), in which case you must specify the target URL authority in the Replacement Value box.
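
The effect of the two URL Replacement boxes can be previewed with a short script. This sketch uses Python's `re` module, whose syntax differs in places from the .NET flavor that the source expects (for example, group references in the replacement are written `\1` in Python and `$1` in .NET), so treat it as an approximation. `fetch_url_for` is a hypothetical helper, and its handling of an empty pattern follows the authority rule described above.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def fetch_url_for(page_url: str, pattern: str, replacement: str) -> str:
    """Return the URL the crawler would fetch for a page URL listed in the Sitemap."""
    if not pattern:
        # Empty pattern: replace the whole authority (host name + port).
        parts = urlsplit(page_url)
        return urlunsplit(parts._replace(netloc=replacement))
    return re.sub(pattern, replacement, page_url)


print(fetch_url_for(
    "https://internal.corp.mysite.com/agivenpage.html",
    r"internal\.corp\.mysite\.com",  # dots escaped so they match literally
    "mynicesiteurl.com",
))
```

Escaping the dots in the pattern is optional for this host name but avoids accidental matches, since an unescaped dot matches any character.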

    8. If your website is secured with the basic authentication type, click Basic Authentication and then enter the Username and Password of an account that can access all the content that you want to index.

      • Your Coveo organization can authenticate with the following authentication schemes:

        • Basic

        • Digest

        • NTLM

        • Negotiate/Kerberos

      • If you configured the source to be Shared within your Coveo organization, users can see in search results all the items that the indexing account can access.
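
Before saving the credentials, it can help to confirm that the account actually reaches the Sitemap. With the Basic scheme, a client sends the user name and password, joined by a colon and Base64-encoded, in an `Authorization` header. The sketch below builds that header; the function name `basic_auth_header` is hypothetical, and it illustrates the scheme itself rather than how the source stores or sends credentials.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The resulting dictionary can be passed as request headers to an HTTP client when testing access to the Sitemap URL. Base64 is an encoding, not encryption, so such a test should only be run over https://.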

    9. Optionally, under OAuth Authentication, configure OAuth authentication.

      The following options are useful only when your website is secured with an OAuth 2.0 authentication flow.

      Use the appropriate value from your identity provider.

      1. In the Provider Type box, enter the name of your OAuth provider type.

        The supported provider names are:

        • AdobeIMS

        • Salesforce

        • Google

      2. In the Identity Provider URL box, enter the URL of your OAuth identity provider.

      3. In the Client Id box, enter your OAuth client id string.

      4. In the Client Secret box, enter your OAuth client secret string.

      5. In the Authorization Code box, enter your OAuth authorization code.

      6. In the Client Refresh Token box, enter your OAuth client refresh token string.

    10. Click Start Indexing (or Refresh Index when editing the source).

  6. Back on the Sources page, you can review the progress of your Sitemap source addition (see Review the State of Sources Available to You).
