Troubleshooting Web source issues
This article provides help troubleshooting common issues when indexing content with the Web source.
Identify the issue you’re facing using the case context and symptoms provided. Then, apply the recommended resolution steps and rebuild your source.
Troubleshooting symptoms are provided as a guide. Actual symptoms may vary. For example, Coveo may or may not return an error mentioned among the issue symptoms. Review the Activity Browser (platform-ca | platform-eu | platform-au) page for a fuller picture of abnormal indexing activity. You can also download the source update logs for a chronological account of what happened during the indexing process.
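Issues are divided into the following categories: Missing pages, Extra or unwanted pages, Unexpected or missing content inside pages, Unexpected item field values, Indexing is slow, and Indexed content isn’t up to date.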
Missing pages
Blocklisting
Context and symptoms:
Cause: Your network may be blocking inbound requests from Coveo.
Resolution: Allow inbound requests from Coveo or consider installing the Coveo On-Premises Crawling Module on your infrastructure to push documents to Coveo instead.
RespectUrlCasing setting issue
Context and symptoms:
Cause: With
Resolution: Access the Edit a source JSON configuration panel and set
Crawling rules issue
Symptom: The Content Browser (platform-ca | platform-eu | platform-au) doesn’t show all the web pages you wanted to index.
Cause: Your current Crawling rules exclusions and inclusions are filtering out the pages you wanted to index.
Resolution: Open your source and review your exclusions and inclusions. To be indexed, a page:
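To reason about which rule filters out a given page, a minimal Python sketch such as the following can help. It assumes that a page must match at least one inclusion rule (or that no inclusion rule is defined) and must not match any exclusion rule. This is an illustrative model only, not Coveo's implementation, and the patterns and URLs are hypothetical.

```python
import re

def is_indexable(url, inclusions, exclusions):
    """Return True if `url` passes the assumed inclusion/exclusion semantics."""
    # Assumption: with no inclusion rule defined, every URL counts as included.
    included = not inclusions or any(re.search(p, url) for p in inclusions)
    # Assumption: a single matching exclusion rule is enough to filter a page out.
    excluded = any(re.search(p, url) for p in exclusions)
    return included and not excluded

# Hypothetical rules, expressed as regular expressions.
inclusions = [r"^https://example\.com/docs/"]
exclusions = [r"/docs/archive/"]

for url in [
    "https://example.com/docs/intro",
    "https://example.com/docs/archive/old",
    "https://example.com/blog/post",
]:
    print(url, is_indexable(url, inclusions, exclusions))
```

Running your missing URLs through a check like this, with your own rules, can show whether an inclusion is too narrow or an exclusion is too broad.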
Starting URL exclusion
Context and symptoms:
Cause: Your current Crawling rules exclusions and inclusions are filtering out that starting URL. Consequently, the crawler can’t index the pages that are reachable via that starting URL.
Resolution: Open your source. Make adjustments to your exclusions and inclusions to ensure the starting URL and all pages accessible through it aren’t filtered out. To be indexed, a page:
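The effect of filtering out a starting URL can be illustrated with a small breadth-first traversal over an in-memory link graph. This is a sketch under the assumption that a filtered-out page isn't expanded for links; the `crawl` function, the link graph, and the URLs are all hypothetical and don't reflect how the Coveo crawler is implemented.

```python
from collections import deque

def crawl(start_urls, links, allowed):
    """Breadth-first traversal that only visits URLs accepted by `allowed`."""
    seen = set()
    queue = deque(u for u in start_urls if allowed(u))
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Only follow links to pages that pass the filter.
        for target in links.get(url, []):
            if target not in seen and allowed(target):
                queue.append(target)
    return seen

# Hypothetical site structure: page -> pages it links to.
links = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
}

everything = crawl(["https://example.com/"], links, lambda u: True)
# Excluding the starting URL itself leaves the crawler nothing to follow.
nothing = crawl(["https://example.com/"], links, lambda u: u != "https://example.com/")
print(len(everything), len(nothing))
```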
301 Moved Permanently redirect
Context and symptoms:
Cause: By default, the Web source only indexes pages that are internal to the website. The Web source is considering the page it’s redirected to (e.g.,
Resolution: If you’re only getting started with a new Web source, you might simply want to delete the source, start fresh with a new one, and include the
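As a rough way to spot redirects that lead outside the site, the following Python sketch compares the host of a starting URL with the host of a redirect target. It assumes that "internal" means "same host name", which is a simplification made for this example; the `is_internal` helper and the URLs are hypothetical.

```python
from urllib.parse import urlsplit

def is_internal(start_url, target_url):
    """True when both URLs share the same host name (compared case-insensitively)."""
    # `hostname` is returned in lowercase by urlsplit.
    return urlsplit(start_url).hostname == urlsplit(target_url).hostname

# A redirect from the bare domain to the `www` host changes the host name.
print(is_internal("https://example.com/", "https://www.example.com/"))
print(is_internal("https://example.com/", "https://example.com/products"))
```

Under this assumption, a starting URL that redirects to a different host would lead the crawler to pages it treats as external.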
Orphan pages
Context and symptoms:
Cause: The missing pages may be orphan pages.
Resolution:
Missing or invalid basic authentication configuration
Context and symptoms:
Cause: Accessing the page content requires basic authentication.
Resolution:
Missing or invalid form authentication configuration
Context and symptoms:
Cause: Accessing the page requires form authentication.
Resolution:
Authentication status validation issue
Context and symptoms:
Cause: The authentication Validation method might not be configured properly.
Resolution: Open your source. Make sure the Validation method you have selected and the associated value are adequate.
Redirection to login page issue
Context and symptoms:
Cause: The
Resolution:
Content freshness issue
Symptom: Pages recently added to the website are still not appearing in the Content Browser (platform-ca | platform-eu | platform-au).
Cause and Resolution: See Indexed content isn’t up to date.
Extra or unwanted pages
Query parameters
Context and symptoms:
Cause: You’re currently not specifying that the query string parameter should be ignored.
Resolution: Open your source. In the Advanced settings tab, add the parameter to the Query parameters to ignore list.
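To see why an unignored query parameter produces extra items, consider this minimal Python sketch. It removes a set of ignored parameters from each URL, so that variants of the same page collapse to a single address. The `strip_params` helper, the `sessionid` parameter, and the URLs are hypothetical and only illustrate the idea.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_params(url, ignored):
    """Return `url` without the query parameters whose names are in `ignored`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

urls = [
    "https://example.com/page?id=7&sessionid=abc",
    "https://example.com/page?id=7&sessionid=xyz",
]
# Both variants collapse to the same address once `sessionid` is ignored.
unique = {strip_params(u, {"sessionid"}) for u in urls}
print(unique)
```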
Multiple URL variants
Context and symptoms:
Cause: The Web source crawler discovers multiple variants of the same page, each with different URL casings.
Resolution: If the web server is case insensitive, access the Edit a source JSON configuration panel and set
⚠️ Don’t set
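Before changing any setting, it can help to confirm that the duplicates really differ only by casing. The following Python sketch groups a list of URLs (for example, exported from the Content Browser) by their lowercase form and reports groups with more than one variant. The `case_variants` helper and the URLs are hypothetical.

```python
from collections import defaultdict

def case_variants(urls):
    """Group URLs that are identical except for letter casing."""
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    # Keep only the groups that contain more than one casing variant.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

urls = [
    "https://example.com/About",
    "https://example.com/about",
    "https://example.com/contact",
]
print(case_variants(urls))
```

If the variants return the same content, the web server is likely case insensitive for those paths; if they return different content, treating them as one page would drop real pages.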
Missing filtering
Symptom: The Content Browser (platform-ca | platform-eu | platform-au) shows web pages you don’t want to index.
Cause: Your current Crawling rules exclusions and inclusions don’t filter out the unwanted pages.
Resolution: Open your source and configure exclusions and inclusions to filter out the unwanted pages. To be indexed, a page:
Content freshness issue
Symptom: Pages recently deleted from the website are still appearing in the Content Browser (platform-ca | platform-eu | platform-au).
Cause and Resolution: See Indexed content isn’t up to date.
Unexpected or missing content inside pages
Indexing by reference
Context and symptoms:
Cause: You may be indexing by reference. When indexing by reference, the body of the web page (used for the Quick view) isn’t retrieved and no excerpt (used for the item description) is generated.
Resolution: Access the Edit a source JSON configuration panel. If HTML documents are currently indexed by
Web scraping issue
Context and symptoms:
Cause: A web scraping configuration may be removing the missing sections.
Resolution: Open your source and review your web scraping configurations:
Missing dynamic content
Context and symptoms:
Cause: The source may be crawling your page before all its dynamic content is rendered.
Resolution: Open your source. In the Advanced settings tab, make sure Execute JavaScript on pages is enabled. Increase the Time the crawler waits before considering a page as rendered value, if necessary.
Login page content instead of proper page content
Context and symptoms:
Cause: The page to index is protected and form authentication isn’t properly set up.
Resolution:
Indexing pipeline extension
Context and symptoms:
Cause: An indexing pipeline extension (IPE) may be removing the missing sections.
Resolution: Review the logs for the items affected by the extensions. Make necessary adjustments to the extension script or conditions.
Unexpected item field values
Inexistent field
Context and symptoms:
Cause: The field doesn’t exist. You need to create the field and the field mapping.
Resolution:
Field mapping issue
Context and symptoms:
Cause: There may be a field mapping issue.
Resolution:
Metadata extraction issue
Context and symptoms:
Cause: There may be a metadata extraction issue specifically for that item.
Resolution: Search for reasons why the metadata extraction process wouldn’t be working on your specific item. For example, if you’re using a web scraping configuration, open your source, and then validate the following:
Title field value selection
Symptom: The item
Cause: Coveo has a
Resolution: Coveo automatically extracts several pieces of metadata that you can use as item titles. See Item title selection mapping rule options to control the value selection process. Edit the
Metadata origin selection
Context and symptoms:
Cause: There’s a metadata origin selection issue. For example, you’ve configured a web scraping configuration to extract a
When values for the same metadata name are extracted in the crawling stage and in the processing (or converter) stage of the Coveo indexing pipeline, the latter value is used by default to populate the mapped field.
Resolution:
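The default precedence described above (the value from the later stage wins) can be pictured with a simple dictionary merge. This is only a conceptual sketch: the `resolve_metadata` helper, the metadata names, and the values are hypothetical, and the real indexing pipeline isn't implemented this way.

```python
def resolve_metadata(crawler_values, converter_values):
    """Merge metadata from two stages; on a name collision the later stage wins."""
    merged = dict(crawler_values)
    # Values extracted at the processing (converter) stage override crawler values.
    merged.update(converter_values)
    return merged

crawler_values = {"author": "Value from web scraping", "lang": "en"}
converter_values = {"author": "Value from the converter"}

print(resolve_metadata(crawler_values, converter_values))
```

In this model, the web scraping value for `author` never reaches the mapped field unless the origin is selected explicitly or the metadata name is made unique.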
Overwritten crawler metadata
Context and symptoms:
Cause: There’s a metadata conflict. You can have two configurations extracting values for the same metadata name at the crawling stage. When this happens, one value overwrites the other and you only see one
Resolution: Change the metadata name in your configuration to make it unique and adjust your field mapping rule accordingly.
Indexing is slow
Source scope
Symptom: Indexing the source pages is taking a long time.
Cause: The Web source may be crawling and indexing a very high number of pages, and maybe even unwanted pages. This may be due to a number of reasons (e.g., high number of starting URLs, too broad crawling rule exclusions and inclusions).
Resolution:
Crawl delay
Symptom: Indexing the source pages is taking a long time.
Cause: The Time the crawler waits between requests to your server may be unnecessarily high.
Resolution: Provide proof of website ownership. Then, open your source and, in the Advanced settings tab, reduce the Time the crawler waits between requests to your server value.
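To get a feel for how much the delay between requests contributes to total crawl time, here is a back-of-the-envelope Python sketch. It assumes strictly sequential requests and ignores download and rendering time, so it only gives a rough lower bound; the `min_crawl_seconds` helper and the numbers are hypothetical.

```python
def min_crawl_seconds(page_count, delay_seconds):
    """Rough lower bound on crawl time: one wait between consecutive requests."""
    # Assumption: requests are strictly sequential; fetch time is ignored.
    return max(page_count - 1, 0) * delay_seconds

pages = 36_001
for delay in (1.0, 0.5):
    hours = min_crawl_seconds(pages, delay) / 3600
    print(f"{delay} s between requests: at least {hours:.1f} h")
```

Under these assumptions, halving the delay halves the time spent waiting, which is why the setting matters most on sources with many pages.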
ExpandBeforeFiltering setting
Symptom: Indexing the source pages is taking a long time.
Cause: The source may be configured with
Resolution: Consider editing the source JSON configuration and setting
Indexed content isn’t up to date
Source rescan schedule
Symptom: Recent changes to website pages aren’t reflected in the Content Browser (platform-ca | platform-eu | platform-au).
Cause:
Resolution: Make sure the rescan schedule is enabled and that its recurrence settings are adequate.
Number of items limit reached
Context and symptoms:
Cause: Indexing is blocked because you have reached the 200% license item usage threshold.
Resolution: