Troubleshooting Sitemap source issues

This article provides help troubleshooting common issues when indexing content with the Sitemap source.

Identify the issue you’re facing using the case context and symptoms provided. Then, apply the recommended resolution steps and rebuild your source.

Important

Troubleshooting symptoms are provided as a guide. Actual symptoms may vary. For example, Coveo may or may not return an error mentioned among the issue symptoms.

Review the Activity Browser (platform-ca | platform-eu | platform-au) page for a fuller picture of an abnormal indexing activity. You can also download the source update logs for a chronological account of what happened during the indexing process.

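Issues are divided into the following categories, each covered in its own section below:

  • Missing pages

  • Extra or unwanted pages

  • Unexpected or missing content inside pages

  • Unexpected item field values

  • Indexing is slow

  • Indexed content isn’t up to date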

Missing pages

Server throttling

Context and symptoms:

Cause: By default, the Request interval delay value is 0 milliseconds and the Sitemap crawler doesn’t take into account website robots.txt Crawl-delay directives. The Sitemap crawler may be getting throttled by the web server.

Resolution: Open your source. Increase the Request interval delay value (e.g., 1000 milliseconds).
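
Since the crawler doesn’t read Crawl-delay on its own, one way to pick a starting value for the Request interval delay is to look up what the website’s robots.txt asks for. The following is a minimal sketch using Python’s standard urllib.robotparser module. The function name, the "*" user agent, and the sample robots.txt content are assumptions made for illustration; they aren’t part of the Coveo product.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_ms(robots_txt, user_agent="*"):
    """Return the Crawl-delay that applies to user_agent, in milliseconds.

    Returns None when robots.txt declares no Crawl-delay for that agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay_seconds = parser.crawl_delay(user_agent)
    if delay_seconds is None:
        return None
    return int(delay_seconds * 1000)


# Hypothetical robots.txt content, inlined so the sketch runs offline.
sample_robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

print(crawl_delay_ms(sample_robots_txt))  # 2000
```

A result of 2000 would suggest a Request interval delay of at least 2000 milliseconds for that website.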

Blocklisting

Context and symptoms:

Cause: Your network may be blocking inbound requests from Coveo.

Resolution: Allow inbound requests from Coveo. If it’s not possible, install the Coveo On-Premises Crawling Module on your infrastructure to push documents to Coveo instead.

URL exclusion

Context and symptoms:

Cause: The Sitemap source JSON configuration addressPatterns filters array may be filtering out that URL. Consequently, pages listed in that sitemap file aren’t indexed.

Resolution: Access the Edit a source JSON configuration panel and review your addressPatterns. Ensure your sitemap file URL:

  • Doesn’t match any exclusion filter (i.e., an address pattern with "allowed": false), AND

  • Matches at least one inclusion filter (i.e., an address pattern with "allowed": true)
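
The two conditions above can be sketched as a small check. This is only an illustration of the rule, not Coveo’s actual matching logic: the "expression" key, the use of wildcard matching, the function name, and the example.com URLs are assumptions. Only addressPatterns and "allowed" come from the text above.

```python
from fnmatch import fnmatchcase


def is_url_indexable(url, address_patterns):
    """Apply the rule: no exclusion filter matches AND at least one inclusion filter matches."""
    excluded = any(
        fnmatchcase(url, pattern["expression"])
        for pattern in address_patterns
        if not pattern["allowed"]
    )
    included = any(
        fnmatchcase(url, pattern["expression"])
        for pattern in address_patterns
        if pattern["allowed"]
    )
    return included and not excluded


# Hypothetical filters: an all-inclusive filter plus one exclusion filter.
patterns = [
    {"expression": "*", "allowed": True},
    {"expression": "https://example.com/archive/*", "allowed": False},
]

print(is_url_indexable("https://example.com/sitemap.xml", patterns))          # True
print(is_url_indexable("https://example.com/archive/sitemap.xml", patterns))  # False
```

In this sketch, the second sitemap file URL is rejected because an exclusion filter matches it, even though the all-inclusive filter also matches.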

Missing or invalid basic authentication configuration

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • The Activity Browser (platform-ca | platform-eu | platform-au) downloadable logs may show the following message: Authentication failed. The provided credentials may be invalid or expired.

  • When trying to access that page in a browser, you’re prompted for credentials in a pop-up window.

Cause: Accessing the page content requires basic authentication.

Resolution: Request authentication credentials from the web server administrator. Then, open your source and configure basic authentication on the source.

Missing or invalid form authentication configuration

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • The Activity Browser (platform-ca | platform-eu | platform-au) downloadable logs may show the following exception: The form authentication request for "<URL>" was submitted to the form but it failed to authenticate.

  • When trying to access that page in a browser, a login page is displayed instead.

Cause: Accessing the page content requires form authentication.

Resolution: Request authentication credentials from the web server administrator. Then, open your source and configure automatic form authentication on the source.

Authentication confirmation method issue

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • Accessing the page content requires form authentication.

  • Automatic form authentication is configured on the source.

  • When trying to access that page in a browser, the automatic form authentication Form URL page is displayed. Typing in the credentials and submitting the login page brings up the page to be indexed.

Cause: The authentication confirmation method may not be configured properly.

Resolution: Open your source. Ensure the selected Confirmation method and its associated Value are correct.

Redirection to login page issue

Context and symptoms:

  • A page listed in the sitemap file isn’t indexed.

  • Accessing the page content requires form authentication and your source configuration Confirmation method is Redirection to.

  • When trying to access that page manually in a browser, the automatic form authentication Form URL page isn’t displayed.

Cause: The Redirection to confirmation method doesn’t work in your use case. Consequently, the Sitemap source crawler doesn’t know it must authenticate before accessing the page to index.

Resolution:

  • Open your source and choose another confirmation method. Select a method based on the way the web server responds when you manually try to access the page to index (when unauthenticated).

  • If no reliable confirmation method can be found, try enabling the form authentication Force login option.
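
To see how the web server responds to an unauthenticated request, it can help to classify the raw response before choosing a confirmation method. The sketch below is a hypothetical helper, not a Coveo tool: it takes a status code, headers, and body that you captured yourself (for example, with your browser developer tools) and reports which authentication signal, if any, it detects. The function name and the returned labels are assumptions.

```python
import re

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def classify_unauthenticated_response(status, headers, body=""):
    """Describe how a server reacted to a request sent without credentials."""
    headers = {name.lower(): value for name, value in headers.items()}
    challenge = headers.get("www-authenticate", "").lower()
    if status == 401 and challenge.startswith("basic"):
        return "basic authentication challenge"
    if status in REDIRECT_STATUSES and "location" in headers:
        return "redirect to " + headers["location"]
    # A password input in a 200 response suggests a login form served in place.
    if status == 200 and re.search(r"<input[^>]+type=[\"']?password", body, re.IGNORECASE):
        return "login form served in place"
    return "no authentication signal detected"


# Hypothetical captured response: the server redirects to a login page.
print(classify_unauthenticated_response(302, {"Location": "https://example.com/login"}))
```

If the output doesn’t show a redirect, the Redirection to method is unlikely to work for that page.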

Content freshness issue

Symptom: Pages recently added to the website are still not appearing in the Content Browser (platform-ca | platform-eu | platform-au).

Cause and Resolution: See Indexed content isn’t up to date.

Extra or unwanted pages

Missing filtering

Symptom: All URLs listed in the sitemap file are indexed.

Cause: By default, the Sitemap JSON configuration addressPatterns filters array only includes the all-inclusive filter. In other words, you’re not filtering out any URLs listed in the sitemap file.

Resolution:

⚠️ You must have at least one allowed addressPatterns filter that matches each of your startingAddresses. For example, you can keep the all-inclusive filter. Also make sure you don’t have any exclusion filters that match your startingAddresses.

Content freshness issue

Symptom: Pages recently deleted from the website are still appearing in the Content Browser (platform-ca | platform-eu | platform-au).

Cause and Resolution: See Indexed content isn’t up to date.

Unexpected or missing content inside pages

Indexing by reference

Context and symptoms:

Cause: You may be indexing by reference. When indexing by reference, the body of the web page (used for the quick view) isn’t retrieved and no excerpt (used for the item description) is generated.

Resolution:

  1. Access the Edit a source JSON configuration panel.

  2. If HTML documents are currently indexed by Reference, change that value to Retrieve.

Web scraping issue

Context and symptoms:

Cause: A web scraping configuration may be removing the missing sections.

Resolution: Open your source and review your web scraping configurations. Focus on the pages the web scraping configurations are targeting and whether you can configure more restrictive exclusion selectors (i.e., the path values) to avoid removing sections from your web page.

Missing dynamic content

Context and symptoms:

  • When accessing the quick view of an item, sections of the actual web page are missing.

  • Your web page contains dynamically rendered content (e.g., responses to JavaScript API calls).

Cause: The source may be crawling your page before all its dynamic content is rendered.

Resolution: Open your source. In the Content to include section, make sure Render JavaScript is enabled. Increase the Loading delay value, if need be.
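
Before changing the source, you may want to confirm that the missing section really is rendered dynamically. One hedged way to do so is to check whether the expected text is present in the raw HTML returned by the server, ignoring script and style content. The sketch below uses only Python’s standard html.parser module. The class name, function name, and sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def appears_in_static_html(raw_html, expected_text):
    """Return True if expected_text is in the page text before any JavaScript runs."""
    collector = VisibleTextCollector()
    collector.feed(raw_html)
    collector.close()
    visible_text = " ".join(" ".join(collector.chunks).split())
    return expected_text in visible_text


# Hypothetical page whose content is injected by a script at load time.
raw_page = '<html><body><div id="app"></div><script>render("Pricing table")</script></body></html>'
print(appears_in_static_html(raw_page, "Pricing table"))  # False
```

A False result for text that you do see in a browser suggests that the section depends on JavaScript rendering.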

Login page content instead of proper page content

Context and symptoms:

  • When accessing the quick view of an item, you notice that the content of a login page appears instead of the content of the page specified by the URI. This symptom will likely repeat itself over many items.

  • When trying to access the page to index manually in a browser, you’re redirected to that login page.

Cause: The page to index is protected and automatic form authentication isn’t properly set up.

Resolution:

  1. Request the login page authentication credentials from the web server administrator.

  2. Open your source and configure automatic form authentication with the login page URL (i.e., the Form URL value) and the provided username and password.

  3. Set the Confirmation method to Redirection to and the Value to the login page URL.

  4. Rebuild your source.

  5. Validate that the item now contains the proper content.

Indexing pipeline extension

Context and symptoms:

Cause: An indexing pipeline extension (IPE) may be removing the missing sections.

Resolution: Review the logs for the items affected by the extensions. Make necessary adjustments to the extension script or conditions.

Unexpected item field values

Inexistent field

Context and symptoms:

Cause: The field doesn’t exist. You need to create the field and the field mapping.

Resolution:

  1. On the Fields (platform-ca | platform-eu | platform-au) page, at the top right, click Add field.

  2. Follow instructions in the Add or edit a field article to configure your field.

  3. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View metadata.

  4. Choose the metadata you want to use to populate the field.

  5. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > Manage mappings in the Action bar.

  6. Follow instructions in the Add or edit a mapping rule section to configure your mapping.

Field mapping issue

Context and symptoms:

Cause: There may be a field mapping issue.

Resolution:

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View metadata.

  2. Make sure the metadata that should be used to populate your field appears. If the metadata is being used to populate a field, it will be shown as Mapped. If you see two entries under the same metadata name, take note of the mapped and unmapped Origin values for the final step in this procedure.

  3. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > Manage mappings in the Action bar.

  4. Make sure the mapping rule for the field references the right metadata name.

  5. Add or edit the Origin value in the field mapping rule (e.g., %[description:crawler]).

Metadata extraction issue

Context and symptoms:

Cause: There may be a metadata extraction issue specifically for that item.

Resolution: Search for reasons why the metadata extraction process wouldn’t be working on your specific item. For example, if you’re using a web scraping configuration, open your source. Then, validate the following:

Title field value selection

Symptom: The item title field value isn’t ideal.

Cause: Coveo has a title field selection process to ensure all indexed items have titles. This process may not return ideal titles in your use case.

Resolution: Coveo automatically extracts several pieces of metadata that can be used as item titles. See Item title selection mapping rule options to control the value selection process.

Metadata origin selection

Context and symptoms:

  • The indexed item has a value for the given field, but that value isn’t the expected one.

  • On the Sources (platform-ca | platform-eu | platform-au) page, when you click the source and then click More > View metadata, you see two entries under the same metadata name.

Cause: There’s a metadata origin selection issue.

For example, you have configured a web scraping configuration to extract a description metadata. The Sitemap source may also be automatically extracting description metadata from the page <meta> tags.

When values for the same metadata name are extracted in the crawling stage and in the processing (or converter) stage of the Coveo indexing pipeline, the latter value is used by default to populate the mapped field.

Resolution:

  • Use a unique metadata name and create a dedicated field for the custom metadata you’re extracting, OR

  • Specify the origin value in the field mapping rule (e.g., %[description:crawler]) to populate the field with the custom metadata you’re extracting.
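
The default behavior and the second resolution option can be illustrated with a toy model. This is not Coveo’s implementation. It only mimics the rule described above: when no origin is specified, the value from the later stage is used, and a rule such as %[description:crawler] selects a specific origin. The "converter" origin label, the stage order list, the function name, and the sample values are assumptions.

```python
import re

# Assumed stage order for this sketch: crawling first, processing (converter) later.
STAGE_ORDER = ["crawler", "converter"]


def resolve_mapping(rule, metadata):
    """Resolve a rule such as %[description] or %[description:crawler].

    metadata maps a metadata name to a dict of {origin: value}.
    """
    match = re.fullmatch(r"%\[([^\]:]+)(?::([^\]]+))?\]", rule)
    if not match:
        raise ValueError(f"unsupported rule: {rule}")
    name, origin = match.groups()
    values = metadata.get(name, {})
    if origin:
        return values.get(origin)
    # No origin given: the value from the latest stage is used.
    for stage in reversed(STAGE_ORDER):
        if stage in values:
            return values[stage]
    return None


# Hypothetical values extracted for the same metadata name at two stages.
item_metadata = {
    "description": {
        "crawler": "Summary from web scraping",
        "converter": "Summary from meta tag",
    }
}

print(resolve_mapping("%[description]", item_metadata))          # Summary from meta tag
print(resolve_mapping("%[description:crawler]", item_metadata))  # Summary from web scraping
```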

Overwritten crawler metadata

Context and symptoms:

  • The indexed item has a value for the given field.

  • On the Sources (platform-ca | platform-eu | platform-au) page, when you click the source and then click More > View metadata, you see an entry under the metadata name you chose under the Crawler origin.

  • You specified the origin value in the field mapping rule (e.g., %[description:crawler]), but the field value isn’t the expected one.

Cause: There’s a metadata conflict.

You can have two configurations extracting values for the same metadata name at the crawling stage (e.g., one in a web scraping configuration, and another in the sitemap XML file). When this happens, one value overwrites the other and you only see one Crawler origin entry for that metadata name in the View Metadata subpage.

Resolution: Change the metadata name in your configuration to make it unique and adjust your field mapping rule accordingly.

Indexing is slow

Throttling

Context and symptoms:

Cause: By default, the Request interval delay value is 0 milliseconds and the Sitemap crawler doesn’t take into account website robots.txt Crawl-delay directives. The Sitemap crawler may be getting throttled by the web server.

Resolution: Open your source and increase the Request interval delay value (e.g., 1000 milliseconds).
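
A larger delay reduces throttling but also adds waiting time to every update. As a rough way to weigh that trade-off, the sketch below estimates the total time spent waiting between requests. It assumes the crawler sends requests one at a time, which may not reflect how the Sitemap crawler actually schedules them. The function name and the sample numbers are illustrative.

```python
def added_wait_seconds(url_count, request_interval_ms):
    """Total time spent waiting between requests, assuming sequential requests."""
    waits = max(url_count - 1, 0)
    return waits * request_interval_ms / 1000


# Hypothetical sitemap with 10,000 URLs and a 1000 millisecond delay.
seconds = added_wait_seconds(10_000, 1000)
print(f"{seconds / 3600:.1f} hours")  # 2.8 hours
```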

Indexed content isn’t up to date

Refresh limitations

Context and symptoms:

Cause: A source refresh doesn’t consider deleted and new sitemap file entries. A rebuild or rescan is required to reflect these changes in your index.

Resolution: Make sure the Sitemap source rescan schedule is enabled. The default daily recurrence should suffice.
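
To see which changes a refresh would miss, you can compare two snapshots of the same sitemap file and list the entries that were added or removed. The following sketch uses Python’s standard xml.etree.ElementTree module and the standard sitemap namespace. The function names and the sample sitemap content are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values listed in a sitemap file."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def sitemap_changes(old_xml, new_xml):
    """List entries added to or removed from the sitemap between two snapshots."""
    old_urls, new_urls = sitemap_urls(old_xml), sitemap_urls(new_xml)
    return {"added": sorted(new_urls - old_urls), "removed": sorted(old_urls - new_urls)}


# Hypothetical snapshots of the same sitemap file taken at different times.
old_snapshot = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/home</loc></url>
  <url><loc>https://example.com/old</loc></url>
</urlset>"""
new_snapshot = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/home</loc></url>
  <url><loc>https://example.com/new</loc></url>
</urlset>"""

print(sitemap_changes(old_snapshot, new_snapshot))
```

Any URL reported as added or removed is only reflected in the index after a rescan or rebuild.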

Last modification date refresh support conditions

Context and symptoms:

Cause: Your sitemap file doesn’t meet all Last Modification Date refresh support conditions specified in the source key characteristics table. The schedule is running but changes aren’t being indexed.

Resolution: As a workaround, consider changing the Sitemap source rescan schedule from a daily to an hourly recurrence. Because refreshes don’t pick up changes in this situation, you can also disable the refresh schedule to save resources.

SkipOnSitemapError setting

Context and symptoms:

Cause: The source may be configured with SkipOnSitemapError set to true and may be encountering exceptions on one of its sitemap files during content update activities.

Resolution:

  • To help expose and isolate such issues, consider breaking up the Sitemap source into multiple sources, with SkipOnSitemapError set to false on each of them, OR

  • Make sure all referenced source sitemap files are accessible. Make adjustments to the source URLs, if necessary.
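
A quick accessibility check of the sitemap files can be sketched as follows. This is a hypothetical helper, not a Coveo tool. It reports every sitemap URL that doesn’t answer with HTTP 200, using Python’s standard urllib.request module by default. The function names are assumptions, and the usage example passes a stubbed status lookup with made-up URLs so that it runs without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the server can't be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except OSError:
        # Covers URLError, timeouts, and other connection failures.
        return None


def unreachable_sitemaps(urls, fetch_status=http_status):
    """Map each sitemap URL that didn't return HTTP 200 to the status obtained."""
    statuses = {url: fetch_status(url) for url in urls}
    return {url: status for url, status in statuses.items() if status != 200}


# Stubbed status lookup with hypothetical URLs, so the example runs offline.
fake_statuses = {
    "https://example.com/sitemap-a.xml": 200,
    "https://example.com/sitemap-b.xml": 404,
}
print(unreachable_sitemaps(fake_statuses, fetch_status=fake_statuses.get))
```

To test real sitemap files, call unreachable_sitemaps with your own list of URLs and omit the fetch_status argument. Note that a protected sitemap file may return a status other than 200 even though the source can reach it with the configured authentication.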