Troubleshooting Web source issues
Troubleshooting Web source issues
This article provides troubleshooting best practices and lists common issues when indexing content with the Web source.
Important: Troubleshooting fundamentals
Though the information provided in the Common issues section will often help you identify and resolve a problem, keep the following in mind:
-
A given set of symptoms can be caused by different underlying issues.
-
When you expand a content update activity in the Activity panel or Activity Browser (platform-ca | platform-eu | platform-au) page, the error code and messages displayed may only be general indicators of the problem.
-
Coveo only halts an indexing operation and displays an error when specific conditions are met.
Consequently, finding the root cause of an issue may require more granular information, which only update logs can deliver.
To download an update log
-
On the Sources (platform-ca | platform-eu | platform-au) page, click the desired resource, and then click Activity in the Action bar.
-
In the Activity panel that opens, click the desired activity, and then click Download logs in the Action bar. The downloaded file is named after the unique operation ID representing the selected activity.
To locate issue root causes in logs
-
Open the log file in a text file viewer.
-
Look for
WARN,ERROR, andFATALmessages.
Use a log file viewer that supports highlighting by log level to make these messages more noticeable.
-
If necessary, review
NOTICEandINFOmessages. They sometimes reveal a configuration that you overlooked and that may be causing the issue.
Common issues
Issues are divided into categories. Click a category description below to reach the related section.
Missing items
User agent blocklisting
|
|
Context and symptoms
Likely cause and resolutionCause Your web server or CDN/Caching provider may be blocking requests from the Coveo crawler based on its user agent. This can happen, for example, if you’re using the Resolution
|
IP blocklisting
|
|
Context and symptoms
Likely cause and resolutionCause Your infrastructure may be restricting, filtering, or otherwise preventing inbound requests from Coveo Platform IP addresses. The source automatically detects the following CDN and caching providers, which can interfere with requests from the Coveo crawler: Akamai, Amazon CloudFront, Cloudflare, Fastly, Incapsula (Imperva), and Varnish. Resolution Configure your infrastructure to allow and properly handle inbound requests from the Coveo Platform. If the download logs are showing a CDN/Caching detection message, you may need to allow requests from Coveo Platform IP addresses at the CDN/Caching provider level. Alternatively, consider installing the Coveo Crawling Module on your infrastructure to push items to the Coveo Platform instead. |
Incompatible Chrome driver for the Crawling Module
|
|
Context and symptoms
Likely cause and resolutionCause The Crawling Module package ships with Web source components, including one that uses a Chrome driver to render web pages. Prior to Crawling Module version 2.19, this component didn’t specify the location of the embedded Chrome driver. As a result, your Crawling Module may be trying to download the driver, failing to do so (because your infrastructure doesn’t allow it), and then using an old version cached on your server. Resolution Update your Crawling Module to version 2.19 or later. |
RespectUrlCasing setting issue
|
|
Context and symptoms
Likely cause and resolutionCause With Resolution On the Sources (platform-ca | platform-eu | platform-au) page, click the source, and then click Edit configuration with JSON in the More menu.
Then, set |
Crawling rules issue
|
|
Context and symptoms
The Content Browser (platform-ca | platform-eu | platform-au) doesn’t show all the items you wanted to index. Likely cause and resolutionCause Your current Crawling rules exclusions and inclusions are filtering out the items you wanted to index. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, review your source’s exclusions and inclusions. To be indexed, a page:
|
Starting URL exclusion
|
|
Context and symptoms
Likely cause and resolutionCause Your current Crawling rules exclusions and inclusions are filtering out that starting URL. Consequently, the crawler can’t index the items that are reachable via that starting URL. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, adjust your source’s exclusions and inclusions to ensure the starting URL and all items accessible through it aren’t filtered out. To be indexed, a page:
|
301 Moved Permanently redirect
|
|
Context and symptoms
Likely cause and resolutionCause By default, the Web source only indexes items that are internal to the site.
The Web source is considering the page it’s redirected to (for example, Resolution If you’re only getting started with a new Web source, you might simply want to delete the source, start fresh with a new one, and include the
|
Orphan pages
|
|
Context and symptoms
Likely cause and resolutionCause The missing items may be orphan pages. Resolution
|
Missing or invalid basic authentication configuration
|
|
Context and symptoms
Likely cause and resolutionCause Accessing the page content requires basic authentication. Resolution
|
Missing or invalid form authentication configuration
|
|
Context and symptoms
Likely cause and resolutionCause Accessing the page requires form authentication. Resolution
|
Authentication status validation issue
|
|
Context and symptoms
Likely cause and resolutionCause The authentication Validation method might not be configured properly. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, make sure your source’s Validation method and the associated value are adequate. |
Redirection to login page issue
|
|
Context and symptoms
Likely cause and resolutionCause The Resolution
|
Content freshness issue
|
|
Context and symptoms
Items recently added to the site are still not appearing in the Content Browser (platform-ca | platform-eu | platform-au). Likely cause and resolutionCause and Resolution |
Extra or unwanted items
Query parameters
|
|
Context and symptoms
Example:
Likely cause and resolutionCause You’re currently not specifying that the query string parameter should be ignored. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, open your source. On the Advanced settings tab, add the parameter to the Query parameters to ignore list. Example:
|
Multiple URL variants
|
|
Context and symptoms
Likely cause and resolutionCause The Web source crawler discovers multiple variants of the same page, each with different URL casings. Resolution If the web server is case insensitive, make the Web source crawler ignore URL casing.
On the Sources (platform-ca | platform-eu | platform-au) page, click the source, and then click Edit configuration with JSON in the More menu.
Then, set
|
Missing filtering
|
|
Context and symptoms
The Content Browser (platform-ca | platform-eu | platform-au) shows items you don’t want to index. Likely cause and resolutionCause Your current Crawling rules exclusions and inclusions don’t filter out the unwanted items. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, open your source and configure exclusions and inclusions to filter out the unwanted items. To be indexed, a page:
|
Content freshness issue
|
|
Context and symptoms
Items recently deleted from the website are still appearing in the Content Browser (platform-ca | platform-eu | platform-au). Likely cause and resolutionCause and Resolution |
Unexpected or missing content inside items
Indexing by reference
|
|
Context and symptoms
Likely cause and resolutionCause You may be indexing by reference. When indexing by reference, the body of the web page (used for the Quick view) isn’t retrieved and no excerpt (used for the item description) is generated. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, click the source, and then click Edit configuration with JSON in the More menu.
If HTML documents are currently indexed by
|
Broken images in the Quick view
|
|
Context and symptoms
When accessing the Quick view of an item, images are broken.
Likely cause and resolutionCause The connector retrieves web page HTML as is and doesn’t retrieve the images referenced in the HTML.
The Content Browser Quick view displays this HTML without any alteration.
This means it doesn’t replace relative paths, such as Images that require authentication to be viewed also appear broken when browsing the web page item Quick view in the Content Browser. Resolution None. This is a known limitation of the Content Browser Quick view. The Quick view is intended to provide a preview of the item content, not a full rendering of the web page.
To view the full web page, users can open the original document by clicking the item |
YouTube player not available in the Quickview component
|
|
Context and symptoms
In the Quickview component of a Coveo JavaScript Search Framework search result, the YouTube player isn’t available. You notice the following symptoms:
Likely cause and resolutionCause For security reasons, the only way to view a YouTube video in the YouTube player within a Coveo JavaScript Search Framework result template is by:
Resolution
The following is a sample implementation:
|
Copy protection on PDF
|
|
Context and symptoms
When viewing a PDF item in the Content Browser (platform-ca | platform-eu | platform-au), you notice the following:
Likely cause and resolutionCause The PDF is password-protected.
Therefore, the source can’t retrieve the document binary content it needs to generate the description and the Quick view. Resolution
|
Web scraping issue
|
|
Context and symptoms
Likely cause and resolutionCause A web scraping configuration may be removing the missing sections. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, review your source’s web scraping configurations:
|
Web scraping exclusions not applied
|
|
Context and symptoms
Likely cause and resolutionCause The web server or CDN is returning pages with an unexpected Resolution
|
Missing dynamic content
|
|
Context and symptoms
Likely cause and resolutionCause The source may be crawling your page before all its dynamic content is rendered. Resolution On the Sources (platform-ca | platform-eu | platform-au) page, open your source. In the Advanced settings subtab, make sure Execute JavaScript on pages is enabled. Increase the Time the crawler waits before considering a page as fully rendered value, if necessary. |
HTML pages indexed as txt items
|
|
Context and symptoms
When accessing the Content Browser (platform-ca | platform-eu | platform-au), pages are appearing under the Likely cause and resolutionCause The web page, at the moment it’s crawled, isn’t valid HTML. If the page includes dynamic content, it might not be fully rendered when the crawler processes it. Resolution
|
Login page content instead of proper page content
|
|
Context and symptoms
Likely cause and resolutionCause The page to index is protected and form authentication isn’t properly set up. Resolution
|
Indexing pipeline extension
|
|
Context and symptoms
Likely cause and resolutionCause An indexing pipeline extension (IPE) may be removing the missing sections. Resolution Review the logs for the items affected by the extensions. Make necessary adjustments to the extension script or conditions. |
Unexpected item field values
Inexistent field
|
|
Context and symptoms
Likely cause and resolutionCause The field doesn’t exist. You need to create the field and the field mapping. Resolution
|
Field mapping issue
|
|
Context and symptoms
Likely cause and resolutionCause There may be a field mapping issue. Resolution
|
Metadata extraction issue
|
|
Context and symptoms
Likely cause and resolutionCause There may be a metadata extraction issue specifically for that item. Resolution Search for reasons why the metadata extraction process wouldn’t be working on your specific item. For example, if you’re using a web scraping configuration, go to the Sources (platform-ca | platform-eu | platform-au) page, and then validate the following in your source configuration:
|
Title field value selection
|
|
Context and symptoms
The item Likely cause and resolutionCause Coveo has a Resolution Coveo automatically extracts several pieces of metadata that you can use as item titles.
See Item title selection mapping rule options to control the value selection process.
Edit the |
Metadata origin selection
|
|
Context and symptoms
Example:
Likely cause and resolutionCause There’s a metadata origin selection issue. For example, you’ve configured a web scraping configuration to extract a When values for the same metadata name are extracted in the crawling stage and in the processing (or converter) stage of the Coveo indexing pipeline, the latter value is used by default to populate the mapped field. Example:
Resolution
|
Overwritten crawler metadata
|
|
Context and symptoms
Likely cause and resolutionCause There’s a metadata conflict. You can have two configurations extracting values for the same metadata name at the crawling stage.
When this happens, one value overwrites the other and you only see one Resolution Change the metadata name in your configuration to make it unique and adjust your field mapping rule accordingly. |
Indexing is slow
Source scope
|
|
Context and symptoms
Indexing the source items is taking a long time. Likely cause and resolutionCause The Web source may be crawling and indexing a very high number of items, and maybe even unwanted items. This may be due to a number of reasons (for example, high number of starting URLs, too broad crawling rule exclusions and inclusions). Resolution
|
Crawl delay
|
|
Context and symptoms
Indexing the source items is taking a long time. Likely cause and resolutionCause The Time the crawler waits between requests to your server may be unnecessarily high. Resolution Provide proof of website ownership. Then, on the Sources (platform-ca | platform-eu | platform-au) page, open your source. On the Advanced settings tab, reduce the Time the crawler waits between requests to your server value. |
ExpandBeforeFiltering setting
|
|
Context and symptoms
Indexing the source items is taking a long time. Likely cause and resolutionCause The source may be configured with Resolution Consider editing the source JSON configuration and setting |
Indexed content is not up to date
Source rescan schedule
|
|
Context and symptoms
Recent changes to site items aren’t reflected in the Content Browser (platform-ca | platform-eu | platform-au). Likely cause and resolutionCause
Resolution Make sure the rescan schedule is enabled and that its recurrence settings are adequate. |
Caching
|
|
Context and symptoms
Likely cause and resolutionCause Your infrastructure may be serving a cached and outdated version of the web page. These issues can occur if you use a caching provider such as Varnish, which is known to interfere with requests from the Coveo crawler. Resolution Ensure your infrastructure serves up-to-date web pages to the Coveo crawler. For example:
|
Number of items limit reached
|
|
Context and symptoms
Likely cause and resolutionCause Indexing is blocked because you’ve reached the 200% license item usage threshold. Resolution
|