Troubleshooting Sitemap source issues

This article provides help troubleshooting common issues when indexing content with the Sitemap source.

Identify the issue you’re facing using the case context and symptoms provided. Then, apply the recommended resolution steps and rebuild your source.

Important

Troubleshooting symptoms are provided as a guide. Actual symptoms may vary. For example, Coveo may or may not return an error mentioned among the issue symptoms.

Review the Activity Browser (platform-ca | platform-eu | platform-au) page for a fuller picture of an abnormal indexing activity. You can also download the source update logs for a chronological account of what happened during the indexing process.

Issues are divided into categories, each covered in its own section below.

Missing pages

User agent blocklisting

Context and symptoms:

  • No items referenced in the sitemap file are indexed.

  • The Activity Browser (platform-ca | platform-eu | platform-au) page displays a SITEMAP_FORBIDDEN_ERROR error code. The error message states Forbidden access. The connector doesn’t have required permissions or it is blocked by a security feature.

Likely cause and resolution

Cause: Your web server may be blocking requests from the Coveo crawler user agent (for example, through mod_rewrite rules in an .htaccess file on an Apache server, or through the URL Rewrite module on an IIS server).

Resolution:

  1. Chrome and other web browsers let you override the default user agent string to emulate requests made by a specific user agent. Use this feature to test whether your web server blocks the Coveo crawler user agent (that is, Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)).

  2. If the user agent is blocked, change either your web server configuration or your Sitemap source configuration:

    1. (Recommended) Remove the Coveo crawler user agent from the blocklist on your web server.

    2. Access the Edit a source JSON configuration panel and set the UserAgent parameter to a value that your web server doesn’t block.

  3. Rebuild your source.
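If you prefer a script to the browser test in step 1, the following is a minimal sketch of the same comparison. It assumes the Python standard library only; the function names and the idea of treating 401 or 403 as "blocked" are illustrative choices, not part of the Coveo product.

```python
import urllib.error
import urllib.request

# User agent string quoted in step 1 above.
COVEO_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
)


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10.0):
    """Return the HTTP status code the server sends for this user agent."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; keep the status code.
        return error.code


def looks_blocked(coveo_status, browser_status):
    """Heuristic (assumption): Coveo's agent is refused while a browser-like one succeeds."""
    return coveo_status in (401, 403) and 200 <= browser_status < 300
```

To use it, call fetch_status twice on your sitemap URL, once with COVEO_USER_AGENT and once with your browser's own user agent string, and pass both results to looks_blocked. A True result suggests a user agent rule on the server, although other security features can produce the same pattern.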

Sitemap or sitemap index file can’t be reached

Context and symptoms:

Likely cause and resolution

Cause: The following are potential causes:

  1. The Coveo crawler can’t access the sitemap or sitemap index file due to server throttling.

  2. A sitemap or sitemap index file URL is incorrect.

  3. Your network may be blocking inbound requests from the Coveo Platform.

Resolution: Depending on the cause, apply the corresponding resolution below:

  1. Access your source configuration and increase the Time the crawler waits between requests to your server value (for example, to 1000 milliseconds). Also check your refresh and rescan schedules, since overlapping schedules can overload your server.

  2. Access your source configuration and validate your Sitemap URLs. For example, make sure your Sitemap URLs use the correct protocol (HTTP or HTTPS). When you’re done and you’ve saved your changes, rebuild your source.

  3. Allow inbound requests from the Coveo Platform or consider installing the Coveo Crawling Module on your infrastructure to push documents to Coveo instead.

Sitemap URL exclusion

Context and symptoms:

Likely cause and resolution

Cause: The Sitemap source exclusion and inclusion rules may be filtering out that sitemap URL. Consequently, pages listed in that sitemap file aren’t indexed.

Resolution: Access your source configuration and review your exclusion and inclusion rules. Ensure your sitemap file URL:

  • Doesn’t match any exclusion rule

  • Matches at least one inclusion rule.

NO_DOCUMENT_INDEXED errors

Context and symptoms:

Likely cause and resolution

Cause: NO_DOCUMENT_INDEXED is a generic error code used when an indexing operation fails. It tells you that the source didn’t index any items, but little about why. The indexing operation may have failed for a variety of reasons, some of which have been covered previously.

For example, it’s possible no items were indexed because:

  1. The sitemap file was actually empty at indexing time.

  2. The sitemap is invalid.
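Both causes can be checked locally before rebuilding. The sketch below is one possible check, assuming the sitemap uses the standard sitemaps.org namespace; the function name and the wording of the messages are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_text):
    """Return a list of problems found in a sitemap or sitemap index document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed XML: {error}"]

    # <urlset> holds <url> entries; <sitemapindex> holds <sitemap> entries.
    entries = root.findall("sm:url", NS) + root.findall("sm:sitemap", NS)
    if not entries:
        return ["no <url> or <sitemap> entries found"]

    problems = []
    for position, entry in enumerate(entries, start=1):
        loc = entry.find("sm:loc", NS)
        if loc is None or not (loc.text or "").strip():
            problems.append(f"entry {position} has no <loc> value")
    return problems
```

An empty list means the file is well formed, contains entries, and every entry has a location. A sitemap that omits the namespace is reported as having no entries, which is itself worth investigating.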

Resolution: More information is needed to diagnose the issue. Review the source activity logs as follows:

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click Activity in the Action bar.

  2. In the Activity panel that opens, click the desired activity (having a Failed result), and then click Download Logs in the Action bar.

  3. Open the downloaded log file in your preferred text editor.

  4. Starting from the top of the logs, and working your way down, look for the first WARN or FATAL error message. For example, the WARN message in the logs below indicates that Coveo wasn’t able to parse the sitemap file because it’s invalid (that is, the <loc> element is missing).

    [Image: Invalid sitemap file]

  5. Once you’ve identified the issue, take the necessary steps to resolve it, and then rebuild your source.
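Step 4 can be automated for long log files. The following sketch assumes that each log line carries its severity as a standalone WARN or FATAL token; the actual log layout isn't described in this article, so adjust the pattern to what you see in your file.

```python
import re

# Assumption: the severity appears in each log line as a standalone token.
SEVERITY = re.compile(r"\b(WARN|FATAL)\b")


def first_problem(lines):
    """Return (line_number, text) for the first WARN or FATAL line, or None."""
    for number, line in enumerate(lines, start=1):
        if SEVERITY.search(line):
            return number, line.rstrip("\n")
    return None
```

Because the function accepts any iterable of lines, you can pass it an open file object for the downloaded log and read the result without loading the whole file into memory.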

Missing or invalid basic authentication configuration

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • The Activity Browser (platform-ca | platform-eu | platform-au) downloadable logs may show the following message: Authentication failed. The provided credentials may be invalid or expired.

  • When trying to access that page in a browser, you’re prompted for credentials in a pop-up window.

Likely cause and resolution

Cause: Accessing the page content requires basic authentication.

Resolution:

  • Request authentication credentials from the web server administrator. Then, access your source configuration and set up basic authentication on the source.

  • If you’re using a password manager (for example, LastPass), it may replace the previously recorded username and password with different ones as you edit the source. We recommend checking your password manager options and ensuring that it respects the autocomplete="off" attribute.

Missing or invalid form authentication configuration

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • The Activity Browser (platform-ca | platform-eu | platform-au) downloadable logs may show the following exception: The form authentication request for "<URL>" was submitted to the form but it failed to authenticate.

  • When trying to access that page in a browser, a login page is displayed instead.

Likely cause and resolution

Cause: Accessing the page content requires form authentication.

Resolution:

  • Request authentication credentials from the web server administrator. Then, access your source configuration and set up form authentication on the source.

  • If you’re using a password manager (for example, LastPass), it may replace the previously recorded username and password with different ones as you edit the source. We recommend checking your password manager options and ensuring that it respects the autocomplete="off" attribute.

Authentication validation method issue

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • Accessing the page content requires form authentication.

  • Form authentication is configured on the source.

  • When trying to access that page in a browser, the form authentication login page (the Login page address) is displayed. Entering the credentials and submitting the form brings up the page to be indexed.

Likely cause and resolution

Cause: The authentication validation method may not be configured properly.

Resolution: Access your source configuration. Ensure the Validation method you selected and the associated Value are adequate.

Redirection to login page issue

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • Accessing the page content requires form authentication and your source configuration Validation method is Redirection to URL.

  • When trying to access that page manually in a browser, the page at the form authentication Login page address isn’t displayed.

Likely cause and resolution

Cause: The Redirection to URL validation method doesn’t work in your use case. Consequently, the Sitemap source crawler doesn’t know it must authenticate before accessing the page to index.

Resolution:

  • Access your source configuration and choose another validation method. Select a method based on the way the web server responds when you manually try to access the page to index (when unauthenticated).

  • If no reliable validation method can be found, try enabling the form authentication Force authentication option.

Content freshness issue

Symptom: Pages recently added to the site are still not appearing in the Content Browser (platform-ca | platform-eu | platform-au).

Likely cause and resolution

Cause and Resolution: See Indexed content is not up to date.

Extra or unwanted pages

Missing filtering

Symptom: All URLs listed in the sitemap file are indexed.

Likely cause and resolution

Cause: By default, the Sitemap source contains no exclusions and the inclusions are set to Include all non-excluded pages. In other words, you’re not filtering out any URLs listed in the sitemap file.

Resolution:

  • Remove any unwanted URLs from your sitemap file, OR

  • Access your source configuration and define inclusion and exclusion rules to filter out unwanted URLs.

    To be indexed, a page:

    • must not match any exclusion rule, AND

    • must match at least one inclusion rule (for example, because the Include all non-excluded pages option is selected).

⚠️ Make sure you don’t exclude your Sitemap URLs.
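The two conditions above can be expressed as a small predicate. This is a simplified model that assumes every rule is a regular expression; the include_all flag is a hypothetical stand-in for the Include all non-excluded pages option, and the real source supports rule types not modeled here.

```python
import re


def is_indexed(url, exclusions, inclusions, include_all=False):
    """Apply the rule stated above: no exclusion matches AND an inclusion matches."""
    if any(re.search(pattern, url) for pattern in exclusions):
        return False
    # include_all stands in for the "Include all non-excluded pages" option.
    return include_all or any(re.search(pattern, url) for pattern in inclusions)
```

Running your sitemap URL through the same predicate is a quick way to confirm that a broad exclusion pattern, such as one matching every .xml address, doesn't filter out the sitemap file itself.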

Duplicate items due to redirects

Context and symptoms:

Likely cause and resolution

Cause: Your sitemap file contains several URLs that redirect to the same page. For each URL listed in your sitemap file, the Sitemap source crawler creates an index item with the Clickable URI reflecting the URL that the crawler tried reaching (that is, the URL listed in the sitemap file). As a result, the index contains one item per listed URL, all with the same content.

Resolution: There are two ways to address this issue:

  • (Recommended) Remove all URLs that are redirected from your sitemap file, only keeping the final destination URLs.

  • Add exclusion rules to your source configuration to filter out the URLs that are redirected.

    1. Access your source configuration.

    2. In the Exclusions and inclusions section, add an is exclusion rule for each URL that’s redirected.

      [Image: Excluding URLs one by one]

After performing the changes, the next scheduled rescan will update your source automatically. Alternatively, you can rebuild your source to apply the changes immediately.
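To find which sitemap entries are redirected, you can compare each listed URL with the URL that finally serves the response. The sketch below assumes the Python standard library; final_url and redirected_urls are illustrative names, and the resolve parameter exists only so that the lookup can be replaced in tests.

```python
import urllib.request
from collections import defaultdict


def final_url(url, timeout=10.0):
    """Follow redirects and return the URL that finally served the response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def redirected_urls(urls, resolve=final_url):
    """Map each final destination to the listed URLs that redirect to it."""
    by_destination = defaultdict(list)
    for url in urls:
        destination = resolve(url)
        if destination != url:
            by_destination[destination].append(url)
    return dict(by_destination)
```

Each key in the result is a destination to keep in the sitemap file, and each value lists the entries to remove or exclude. The comparison is a plain string match, so review the output for harmless differences such as a trailing slash.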

Content freshness issue

Symptom: Pages recently deleted from the site are still appearing in the Content Browser (platform-ca | platform-eu | platform-au).

Likely cause and resolution

Cause and Resolution: See Indexed content is not up to date.

Unexpected or missing content inside pages

Indexing by reference

Context and symptoms:

[Image: Indexing by reference]

Likely cause and resolution

Cause: You may be indexing by reference. When indexing by reference, the body of the web page (used for the Quick view) isn’t retrieved and no excerpt (used for the item description) is generated.

Resolution:

Access the Edit a source JSON configuration panel. If HTML documents are currently indexed by Reference, change that value to Retrieve.

[Image: Indexing by retrieve]

Broken images in the Quick view

Context and symptoms:

When accessing the Quick view of an item, images are broken.

[Image: Broken image in the Quick view]

Likely cause and resolution

Cause: The connector retrieves web page HTML as is and doesn’t retrieve the images referenced in the HTML. The Content Browser Quick view displays this HTML without any alteration. This means it doesn’t replace relative paths, such as <img src="/sites/…​/myimage.jpg">, with the corresponding absolute paths, such as <img src="https://…​/myimage.jpg">. As a result, when web pages contain images that are referenced using relative paths, the images can’t be displayed in the Content Browser Quick view.

Images that require authentication to be viewed also appear broken when browsing the web page item Quick view in the Content Browser.

Resolution: None. This is a known limitation of the Content Browser Quick view.

The Quick view is intended to provide a preview of the item content, not a full rendering of the web page. To view the full web page, users can open the original document by clicking the item clickable URI link in the search results.
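Although the limitation can't be worked around in the Quick view, you can confirm that it explains the broken images on a given page. The following sketch lists the images a page references by relative path, together with the absolute URL a browser would resolve on the original site. It assumes the Python standard library, and the class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageSources(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def relative_images(html, page_url):
    """Return (src, absolute_url) pairs for images referenced by relative path."""
    parser = ImageSources()
    parser.feed(html)
    return [
        (src, urljoin(page_url, src))
        for src in parser.sources
        # A src with no scheme that isn't protocol-relative is a relative path.
        if not urlparse(src).scheme and not src.startswith("//")
    ]
```

If the list is not empty, those images are the ones expected to appear broken in the Quick view.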

YouTube player not available in the Quickview component

Context and symptoms:

In the Quickview component of a Coveo JavaScript Search Framework search result, the YouTube player isn’t available. You notice the following symptoms:

  • The YouTube video iframe shows the following error message:

    Try watching this video on www.youtube.com, or enable JavaScript if it is disabled in your browser.

  • In the browser console, the following message appears:

    Blocked script execution in '<INSERT SEARCH PAGE URL>' because the document's frame is sandboxed and the 'allow-scripts' permission is not set.

Likely cause and resolution

Cause: For security reasons, the YouTube player isn’t available in the Quickview component. The only way to view a YouTube video in the YouTube player within a Coveo JavaScript Search Framework result template is to follow the steps below.

Resolution:

  1. Index YouTube videos with the YouTube source.

  2. In your Coveo JavaScript Search Framework search interface, use the CoveoYouTubeThumbnail component to show a relevant image of the result video content. Clicking the thumbnail starts the video.

The following is a sample implementation:

<div class="CoveoResultList" data-layout="list" data-wait-animation="fade" data-auto-select-fields-to-include="true">
  <script id="YouTubeVideo" class="result-template" type="text/html" data-layout="list" data-field-filetype="YouTubeVideo">
    <div class="coveo-result-frame">
      <div class="coveo-result-row">
        <div class="coveo-result-cell" style="width:220px; padding-top:7px">
          <span class="CoveoYouTubeThumbnail"></span>
        </div>
        <div class="coveo-result-cell">
          <div class="coveo-result-row">
            <div class="coveo-result-cell" style="font-size:16px" role="heading" aria-level="2">
              <a class="CoveoResultLink"></a>
            </div>
            <div class="coveo-result-cell" style="text-align:right; width:120px;font-size:12px">
              <span class="CoveoFieldValue" data-field="@date" data-helper="dateTime"></span>
            </div>
          </div>
          <div class="coveo-result-row" style="margin-top:10px;">
            <div class="coveo-result-cell">
              <span class="CoveoExcerpt"></span>
            </div>
          </div>
          <div class="coveo-result-row" style="margin-top:10px;">
            <div class="coveo-result-cell">
              <span class="CoveoFieldValue" data-field="@author" data-text-caption="Author" style="margin-right:30px;"></span>
              <span class="CoveoFieldValue" data-field="@ytvideoduration" data-helper="timeSpan" data-helper-options-is-milliseconds="false" data-text-caption="Length" style="margin-right:30px;"></span>
              <span class="CoveoFieldValue" data-field="@ytviewcount" data-helper="number" data-helper-options-format="n0" data-text-caption="Views" style="margin-right:30px;"></span>
              <span class="CoveoFieldValue" data-field="@language" data-text-caption="Language" style="margin-right:30px;"></span>
            </div>
          </div>
          <div class="coveo-result-row">
            <div class="coveo-result-cell">
              <div class="CoveoMissingTerms"></div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </script>
  <script id="YouTubePlaylist" class="result-template" type="text/html" data-layout="list" data-field-filetype="YouTubePlaylist">
    <div class="coveo-result-frame">
      <div class="coveo-result-cell" style="vertical-align:top;text-align:center;width:32px;">
        <span class="CoveoIcon" data-small="true" data-with-label="false"></span>
      </div>
      <div class="coveo-result-cell" style="vertical-align: top;padding-left: 16px;">
        <div class="coveo-result-row" style="margin-top:0;">
          <div class="coveo-result-cell coveo-no-wrap" style="vertical-align:top;font-size:16px;" role="heading" aria-level="2">
            <a class="CoveoResultLink"></a>
          </div>
          <div class="coveo-result-cell" style="width:120px;text-align:right;font-size:12px">
            <div class="coveo-result-row">
              <span class="CoveoFieldValue" data-field="@date" data-helper="date"></span>
            </div>
          </div>
        </div>
        <div class="coveo-result-row" style="margin-top:10px;">
          <div class="coveo-result-cell">
            <span class="CoveoFieldValue" data-field="@filetype" data-text-caption="Type" style="margin-right:30px;"></span>
            <span class="CoveoFieldValue" data-field="@author" data-text-caption="Author" style="margin-right:30px;"></span>
            <span class="CoveoFieldValue" data-field="@ytitemcount" data-text-caption="NumberOfVideos" style="margin-right:30px;"></span>
          </div>
        </div>
        <div class="coveo-result-row" style="margin-top:12px;">
          <div class="coveo-result-cell" style="padding-top:5px; padding-bottom:5px; font-size:13px;">
            <span class="CoveoResultFolding" data-result-template-id="YouTubeVideo"></span>
          </div>
        </div>
        <div class="coveo-result-row">
          <div class="coveo-result-cell">
            <div class="CoveoMissingTerms"></div>
          </div>
        </div>
      </div>
    </div>
  </script>
</div>

Copy protection on PDF

Context and symptoms:

When viewing a PDF item in the Content Browser (platform-ca | platform-eu | platform-au), you notice the following:

  • There’s no description.

  • The Quick view shows the following:

    [Image: Copy protected PDF]

Likely cause and resolution

Cause: The PDF is password-protected.

[Image: Document security on document in file system]

Therefore, the source can’t retrieve the document binary content it needs to generate the description and the Quick view.

Resolution:

  1. If acceptable, remove the password protection on the PDF in the file system.

  2. Rebuild your source.

Web scraping issue

Context and symptoms:

Likely cause and resolution

Cause: A web scraping configuration may be removing the missing sections.

Resolution: Access your source configuration and review your web scraping configurations. Check which pages each configuration targets (Pages to target) and whether its Elements to exclude selectors can be made more restrictive.

Missing dynamic content

Context and symptoms:

  • When accessing the Quick view of an item, sections of the actual web page are missing.

  • Your web page contains dynamically rendered content (for example, responses to JavaScript API calls).

Likely cause and resolution

Cause: The source may be crawling your page before all its dynamic content is rendered.

Resolution: Access your source configuration. In the Advanced settings subtab, make sure Execute JavaScript on pages is enabled. If needed, increase the Add time for the crawler to wait before considering a page as fully rendered value.

HTML pages indexed as txt items

Context and symptoms:

When accessing the Content Browser (platform-ca | platform-eu | platform-au), pages are appearing under the txt file type, instead of html.

Likely cause and resolution

Cause: The web page, at the moment it’s crawled, isn’t valid HTML. If the page includes dynamic content, it might not be fully rendered when the crawler processes it.

Resolution:

  1. If the page includes dynamic content, make sure it’s fully rendered when the crawler processes it.

    1. Access your source configuration.

    2. In the Advanced settings subtab, make sure Execute JavaScript on pages is enabled.

    3. Set or increase the Time the crawler waits before considering a page as fully rendered value (for example, 300 milliseconds).

    4. Save and rebuild your source.

  2. Fix the HTML of web pages still indexed as txt.

    1. Use an HTML markup validator to identify the most significant issues with the page.

    2. Fix these markup issues.

    3. Rebuild your source.

Login page content instead of proper page content

Context and symptoms:

  • When accessing the Quick view of an item, login page content appears instead of the content of the page specified by the URI. This symptom likely affects many items.

  • When manually trying to access the page to index in a browser, you’re redirected to that login page.

Likely cause and resolution

Cause: The page to index is protected and form authentication isn’t properly set up.

Resolution:

  1. Request the login page authentication credentials from the web server administrator.

  2. Access your source configuration and set up form authentication with the Login page address and the provided username and password.

  3. Set the Validation method to Redirection to URL and the Value to the Login page address value.

  4. Rebuild your source.

  5. Validate that the item now contains the proper content.

Indexing pipeline extension

Context and symptoms:

Likely cause and resolution

Cause: An indexing pipeline extension (IPE) may be removing the missing sections.

Resolution: Review the logs for the items affected by the extensions. Make the necessary adjustments to the extension script or conditions.

Unexpected item field values

Inexistent field

Context and symptoms:

Likely cause and resolution

Cause: The field doesn’t exist. You need to create the field and the field mapping.

Resolution:

  1. On the Fields (platform-ca | platform-eu | platform-au) page, at the upper right, click Add field.

  2. Follow instructions in the Add or edit a field article to configure your field.

  3. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View metadata.

  4. Choose the metadata you want to use to populate the field.

  5. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > Manage mappings in the Action bar.

  6. Follow instructions in the Add or edit a mapping rule section to configure your mapping.

Field mapping issue

Context and symptoms:

Likely cause and resolution

Cause: There may be a field mapping issue.

Resolution:

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View metadata.

  2. Make sure the metadata that should be used to populate your field appears. If the metadata is being used to populate a field, it will be shown as Indexed. If you see two entries under the same metadata name, take note of the indexed and not indexed metadata Origin values for the final step in this procedure.

  3. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > Manage mappings in the Action bar.

  4. Make sure the mapping rule for the field references the right metadata name.

  5. Add or edit the Origin value in the field mapping rule (for example, %[description:crawler]).

Metadata extraction issue

Context and symptoms:

Likely cause and resolution

Cause: There may be a metadata extraction issue specifically for that item.

Resolution: Search for reasons why the metadata extraction process wouldn’t be working on your specific item. For example, if you’re using a web scraping configuration, access your source configuration and validate the following:

Title field value selection

Symptom: The item title field value isn’t ideal.

Likely cause and resolution

Cause: Coveo has a title field selection process to ensure all indexed items have titles. This process may not return ideal titles in your use case.

Resolution: Coveo automatically extracts several pieces of metadata that you can use as item titles. To control the value selection process, see Item title selection mapping rule options, and then edit the title field mappings on your source.

Metadata origin selection

Context and symptoms:

  • The indexed item has a value for the given field, but that value isn’t the expected one.

  • On the Sources (platform-ca | platform-eu | platform-au) page, when you click the source and then click More > View metadata, you see two entries under the same metadata name.

Example:

[Image: metadata value conflict]

Likely cause and resolution

Cause: There’s a metadata origin selection issue.

For example, you’ve configured a web scraping configuration to extract a description metadata. The Sitemap source may also be automatically extracting description metadata from the page <meta> tags.

When values for the same metadata name are extracted in the crawling stage and in the processing (or converter) stage of the Coveo indexing pipeline, the latter value is used by default to populate the mapped field.

Example:

[Image: item field value]

Resolution:

  • Use a unique metadata name and create a dedicated field for the custom metadata you’re extracting, OR

  • Specify the origin value in the field mapping rule (for example, %[description:crawler]) to populate the field with the custom metadata you’re extracting.
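The selection behavior described above can be modeled in a few lines. This is a simplified sketch, not Coveo's implementation: the origin label converter and the data layout are assumptions, and only the %[name] and %[name:origin] forms shown in this article are parsed.

```python
import re

# Matches the mapping syntax shown above: %[name] or %[name:origin].
MAPPING = re.compile(r"^%\[(?P<name>[^:\]]+)(?::(?P<origin>[^\]]+))?\]$")

# Assumed precedence when no origin is given: the later pipeline stage wins.
DEFAULT_ORDER = ("converter", "crawler")


def resolve(mapping_rule, values):
    """Pick the value a mapping rule would select from {(name, origin): value}."""
    match = MAPPING.match(mapping_rule)
    if match is None:
        raise ValueError(f"unsupported mapping rule: {mapping_rule!r}")
    name, origin = match["name"], match["origin"]
    # An explicit origin restricts the lookup; otherwise fall back to the default order.
    origins = (origin,) if origin else DEFAULT_ORDER
    for candidate in origins:
        if (name, candidate) in values:
            return values[(name, candidate)]
    return None
```

With a scraped description at the crawler origin and a <meta> tag description at the converter origin, the rule %[description] returns the converter value in this model, while %[description:crawler] returns the scraped one, which mirrors the second resolution option.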

Overwritten crawler metadata

Context and symptoms:

  • The indexed item has a value for the given field.

  • On the Sources (platform-ca | platform-eu | platform-au) page, when you click the source and then click More > View metadata, you see an entry under the metadata name you chose under the Crawler origin.

  • You specified the origin value in the field mapping rule (for example, %[description:crawler]), but the field value isn’t the expected one.

Likely cause and resolution

Cause: There’s a metadata conflict.

You can have two configurations extracting values for the same metadata name at the crawling stage (for example, one in a web scraping configuration, and another in the sitemap XML file). When this happens, one value overwrites the other and you only see one Crawler origin entry for that metadata name on the View Metadata subpage.

Resolution: Change the metadata name in your configuration to make it unique and adjust your field mapping rule accordingly.

Indexing is slow

Throttling

Context and symptoms:

Likely cause and resolution

Cause: The Sitemap crawler may be getting throttled by the web server. The crawler doesn’t take website robots.txt Crawl-delay directives into account, so the Time the crawler waits between requests to your server value may be too low for your server.

Resolution: Access your source configuration and increase the Time the crawler waits between requests to your server value (for example, 1000 milliseconds).
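Increasing the delay trades throttling risk for a longer crawl. As a rough guide, the sketch below estimates the minimum duration of a crawl under the assumption that requests are sent one at a time and that response time is ignored; the function name is illustrative.

```python
from datetime import timedelta


def minimum_crawl_time(url_count, delay_ms):
    """Lower bound on crawl duration when requests are sent sequentially."""
    return timedelta(milliseconds=url_count * delay_ms)
```

For example, minimum_crawl_time(10_000, 1000) evaluates to 2 hours, 46 minutes, and 40 seconds, so a sitemap with 10,000 entries can't finish faster than that with a 1000 millisecond delay.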

Also ensure your source

Indexed content is not up to date

Refresh limitations

Context and symptoms:

Likely cause and resolution

Cause: A source refresh doesn’t consider deleted and new sitemap file entries. A rebuild or rescan is required to reflect these changes in your index.

Resolution: Make sure the Sitemap source rescan schedule is enabled. The default daily recurrence should suffice.
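To confirm that the stale content comes from added or deleted sitemap entries, which a refresh ignores, you can compare two snapshots of the sitemap file. This sketch assumes both snapshots use the standard sitemaps.org namespace; the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def listed_urls(xml_text):
    """Return the set of <loc> values listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text}


def sitemap_changes(old_xml, new_xml):
    """Return (added, removed) URLs between two sitemap snapshots."""
    old, new = listed_urls(old_xml), listed_urls(new_xml)
    return sorted(new - old), sorted(old - new)
```

Any URL in either list is a change that only a rescan or rebuild picks up.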

Number of items limit reached

Context and symptoms:

Likely cause and resolution

Cause: Indexing is blocked because you’ve reached the 200% license item usage threshold.

Resolution:

  • If possible, delete unused sources to bring the item count below the 200% threshold. Then, see the July 20, 2023 Coveo Platform update for suggestions on how to reduce your item count even more.

  • To reassess your needs and discuss your options, contact your Coveo Customer Success Manager.

Last modification date refresh support conditions

Context and symptoms:

Likely cause and resolution

Cause: Your sitemap file doesn’t meet all Last Modification Date refresh support conditions specified in the source key characteristics table. The schedule is running but changes aren’t being indexed.

Resolution: As a workaround, consider changing the Sitemap source rescan schedule from a daily to an hourly recurrence. Disable the refresh schedule to save resources.

SkipOnSitemapError setting

Context and symptoms:

Likely cause and resolution

Cause: The source may be configured with SkipOnSitemapError set to true and may be encountering exceptions on one sitemap file during content update activities.

Resolution:

  • To help expose and isolate such issues, consider breaking up the Sitemap source into multiple sources, with SkipOnSitemapError set to false on each of them, OR

  • Make sure all referenced source sitemap files are accessible. Make adjustments to the source Sitemap URLs, if necessary.