Add or edit a Sitemap source
Coveo is discontinuing use of the PhantomJS web driver in its Web and Sitemap sources in January 2023. Learn more on what you need to do.
Members with the required privileges can use a Sitemap source to make the content of listed web pages from a sitemap file or a Sitemaps index file searchable.
A sitemap file is added to a website and is required when using a Sitemap source. The file lists the website’s URLs along with their respective metadata, which includes the last modified date (LMD). This enables the Sitemap source to perform a refresh, whereas a Web source must perform a rescan. For this reason, although a Sitemap source requires the extra step of adding a sitemap file, it offers better performance than a Web source.
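To illustrate why the last modified date makes a refresh possible, here is a minimal Python sketch that reads a sitemap and keeps only the URLs modified after a given date. It isn't how the Coveo crawler is implemented; the sample sitemap, the `urls_modified_since` function, and the choice to treat entries without a `lastmod` value as changed are all assumptions made for illustration. Only the `urlset`, `url`, `loc`, and `lastmod` elements and the namespace come from the Sitemap protocol.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for this example.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://myorgwebsite.com/products</loc>
    <lastmod>2023-03-01T00:00:00+00:00</lastmod>
  </url>
  <url>
    <loc>http://myorgwebsite.com/about</loc>
    <lastmod>2022-11-15T00:00:00+00:00</lastmod>
  </url>
</urlset>
"""


def urls_modified_since(sitemap_xml: str, last_refresh: datetime) -> list:
    """Return the URLs whose lastmod value is later than last_refresh."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        # Assumption: an entry without lastmod is treated as changed.
        if lastmod is None or datetime.fromisoformat(lastmod.strip()) > last_refresh:
            changed.append(loc.strip())
    return changed


last_refresh = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(urls_modified_since(SAMPLE_SITEMAP, last_refresh))
```

With a date-based check like this one, only the pages that changed since the previous run need to be retrieved again.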
Source key characteristics
Features | Supported | Additional information | |
---|---|---|---|
Searchable content types |
Web pages (URL) |
||
Sitemap file format |
|
Sitemap files and sitemap index files must respect the Sitemap protocol. Strict validations can be enforced by enabling the ParseSitemapInStrictMode option. |
|
Configure inclusion and exclusion filters to index only specific pages. |
|||
Index metadata from third-party sitemap extensions or Coveo-specific metadata included in an XML sitemap file. |
|||
By default, the Coveo converter extracts metadata from HTML |
|||
Exclude irrelevant sections in pages and extract metadata. |
|||
The Sitemap source crawler can run JavaScript in a web page to dynamically render content before indexing the page. |
|||
The sitemap file must define the optional |
|||
Basic authentication |
Supported HTTP authentication schemes:
|
||
Form authentication |
|||
Limitations
- Multi-factor authentication (MFA) and CAPTCHA aren’t supported.
- Indexing page permissions isn’t supported.
- Content in pop-up windows and page elements requiring interaction aren’t indexed.
- The Coveo indexing pipeline can handle web pages up to 512 MB only. Larger pages are indexed by reference (i.e., their content is ignored by the Coveo crawler, and only their metadata and path are searchable). Therefore, no search result Quick View is available for these larger items.
- When the Render JavaScript option is enabled:
  - The Sitemap source doesn’t support sending AdditionalHeaders.
  - Basic authentication isn’t supported.
Leading practices
- If you aren’t the owner of the website, ensure that you have the right to crawl its public content. Crawling websites that you neither own nor have the right to crawl could create reachability issues.
  Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, this type of software (e.g., Akamai, Cloudflare) can detect the Coveo crawler as an attack and block it from any further crawling.
- Always review the Activity Browser (platform-ca | platform-eu | platform-au) page for the full context around an abnormal indexing activity. See the Troubleshooting article for help resolving indexing issues.
- The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About Crawling Speed for information on what can impact crawling speed, as well as possible solutions.
- Break down large sitemap files into multiple sitemap files.
Add or edit a Sitemap source
To add a source
- On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.
- In the Add a source of content panel, click the Cloud or Crawling Module tile, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.
To edit a source, on the Sources (platform-ca | platform-eu | platform-au) page, click the desired source, and then click Edit in the Action bar.
The completion steps are especially important when creating or editing a source of this type.
"Configuration" tab
In the Add/Edit a Sitemap Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.
General information
Source name
Enter a name for your source.
|
Leading practice
A source name can’t be modified once it’s saved, therefore be sure to use a short and descriptive name, using letters, numbers, hyphens ( |
URLs
Enter the URL(s) to a sitemap index file or sitemap files in either the http:// or https:// form.
Enter the direct sitemap URL, and not the website address. Otherwise, the source can interpret the URL(s) as HTML format sitemap file(s) and crawl the links they contain.
For example, enter the following URL: http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.
Keep in mind that when adding multiple starting addresses, a failure on one of the first starting addresses aborts the entire indexing operation. If you encounter such an issue, you can enable SkipOnSitemapError or split the affected sitemaps into their own sources for troubleshooting.
- Public website sitemap: http://myorgwebsite.com/sitemap.xml
- Public website sitemap compressed with GZIP: http://myorgwebsite.com/sitemap.xml.gz
- Web page containing links such as a sitemap: http://myorgwebsite.com/sitemap
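Since both plain and GZIP-compressed sitemaps are accepted, the following Python sketch shows one way a client could read either form from the raw bytes it downloaded. This is an illustration only, not the Coveo crawler's logic; the `read_sitemap_bytes` name is hypothetical, and the sketch assumes the sitemap is encoded in UTF-8.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_sitemap_bytes(raw: bytes) -> str:
    """Return sitemap text, decompressing first when the payload is gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption: the sitemap is UTF-8 encoded.
    return raw.decode("utf-8")


xml_text = "<urlset></urlset>"
print(read_sitemap_bytes(gzip.compress(xml_text.encode("utf-8"))) == xml_text)
```

Checking the leading bytes rather than the .gz extension keeps the sketch independent of how the URL is named.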
User agent
Enter the user agent string you want the Sitemap source to send with HTTP requests to identify itself when downloading pages.
The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
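To check how your website responds to a given user agent string before configuring the source, you could send a test request yourself. The Python sketch below only builds such a request with the standard library; the `build_request` helper is hypothetical, and the default string is the one quoted above.

```python
import urllib.request

# Default user agent string documented for the Sitemap source.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)"
)


def build_request(url: str, user_agent: str = DEFAULT_USER_AGENT) -> urllib.request.Request:
    """Build an HTTP request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://myorgwebsite.com/sitemap.xml")
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would then send it; that call is left out here so the sketch doesn't depend on network access.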
Paired Crawling Module
If your source is a Crawling Module source, and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.
Optical character recognition (OCR)
If you want Coveo to extract text from image files or PDF files containing images, check the appropriate box. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable optical character recognition for details on this feature.
Note
Contact Coveo Sales to add this feature to your organization license.
"Authentication" section
If necessary, expand the Authentication section to configure the source credentials allowing your Coveo organization to gain access to the secured content you want to index. See Source credentials leading practices for additional information.
Multi-factor authentication (MFA) and CAPTCHA aren’t supported.
The Sitemap source supports the following authentication types. See the section for the desired authentication method for details on the parameters to configure.
- Basic authentication
  Select this option when the desired website uses the normal NTLM identity. See Understanding HTTP Authentication for details on how this option works.
- Manual form authentication
  Note
  Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.
  Select this option when the desired website presents users with a login form to fill out. You must specify the form input names and values.
- Automatic form authentication
  Select this option when the desired website presents users with a login form to fill out. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.
Basic authentication
When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source credentials leading practices.
Manual form authentication
Note
Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Automatic form authentication instead.
When selecting Manual form authentication:
1. In the Form URL box, enter the website login page URL.
2. (Optional) When there’s more than one form on the login page, enter the Form name.
3. Click the Action method dropdown menu, and then select the HTTP verb used to submit the authentication request, which is either POST or GET.
4. Click the Content type dropdown menu, and then select the content-type header for HTTP POST requests used by the form authentication process.
5. (Optional) When the authentication request is sent to a URL other than the specified form URL, enter the Action URL. Otherwise, leave empty.
6. For the Username input name and the Password input name inputs, inspect the form HTML code for both parameters, locate the corresponding <input name='abc' type='text' /> element, and then enter the name attribute value.
   Example
   Based on the HTML code below:
   <input name="login" type="email" />
   <input name="pwd" type="password" />
   login is the username input name and pwd is the password input name.
7. In the Username input value and the Password input value inputs, enter the username and password parameter values respectively.
8. When your form has other parameters than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter.
   Under Other inputs, input values are displayed in clear text. You must therefore enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).
9. Under Confirmation method, select the method that will determine if the authentication request failed.
   Depending on the selected confirmation method, enter the appropriate value:
   - When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
     Example: https://mycompany.com/login/failed.html
   - When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
     Example: ASP.NET_SessionId
   - When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
   - When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.
     Examples:
     - Hello, jsmith@mycompany.com!
     - Log out
   - When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.
     Examples:
     - An error has occurred.
     - Your username or password is invalid.
   - When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
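When inspecting a login form to find the input names, a short script can list them for you. The Python sketch below uses the standard library HTML parser to separate visible inputs from hidden ones; it's a helper for reading a form by hand, not a description of how the Autodetect hidden inputs option works. The sample form, the `csrf_token` field, and the function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical login form, extending the example above with a hidden field.
LOGIN_FORM = """
<form name="loginForm" action="/login" method="post">
  <input name="login" type="email" />
  <input name="pwd" type="password" />
  <input name="csrf_token" type="hidden" value="abc123" />
</form>
"""


class FormInputCollector(HTMLParser):
    """Collect the name, type, and value of every named <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            input_type = (attributes.get("type") or "text").lower()
            self.inputs.append((name, input_type, attributes.get("value") or ""))


def summarize_inputs(html: str) -> dict:
    """Split form inputs into visible input names and hidden name/value pairs."""
    collector = FormInputCollector()
    collector.feed(html)
    summary = {"visible": [], "hidden": {}}
    for name, input_type, value in collector.inputs:
        if input_type == "hidden":
            summary["hidden"][name] = value
        else:
            summary["visible"].append(name)
    return summary


print(summarize_inputs(LOGIN_FORM))
```

In this sample, login and pwd are the values for the Username input name and Password input name boxes, while csrf_token is the kind of parameter that would otherwise have to be entered under Other inputs.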
Automatic form authentication
When selecting Automatic form authentication:
1. In the Form URL box, enter the website login page URL.
2. In the Username and Password boxes, enter the credentials to log in. See Source credentials leading practices.
3. Under Confirmation method, select the method that will determine if the authentication request failed.
   Depending on the selected confirmation method, enter the appropriate value:
   - When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
     Example: https://mycompany.com/login/failed.html
   - When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
     Example: ASP.NET_SessionId
   - When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
   - When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.
     Examples:
     - Hello, jsmith@mycompany.com!
     - Log out
   - When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.
     Examples:
     - An error has occurred.
     - Your username or password is invalid.
   - When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
4. If you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.
5. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.
6. (Optional) If your automatic form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a custom login sequence configuration. Contact the Coveo Support team for help.
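Each confirmation method amounts to a test for a failed login. The following Python sketch restates the six methods as such tests, which can help when deciding what to enter in the Value input. It's a reading of the descriptions above, not Coveo's implementation; the `LoginResponse` class, the `login_failed` function, and the use of exact URL comparison for Redirection to are assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LoginResponse:
    """Hypothetical summary of what the crawler sees after submitting the form."""
    final_url: str
    body: str
    cookies: set = field(default_factory=set)


def login_failed(method: str, value: str, response: LoginResponse) -> bool:
    """Return True when the selected confirmation method signals a failed login."""
    if method == "Redirection to":
        # Assumption: the failure URL is compared exactly.
        return response.final_url == value
    if method == "Missing cookie":
        return value not in response.cookies
    if method == "Missing URL":
        return re.search(value, response.final_url) is None
    if method == "Missing text":
        return value not in response.body
    if method == "Text presence":
        return value in response.body
    if method == "URL presence":
        return re.search(value, response.final_url) is not None
    raise ValueError(f"Unknown confirmation method: {method}")


success = LoginResponse(
    final_url="https://mycompany.com/home",
    body="Hello, jsmith@mycompany.com! Log out",
    cookies={"ASP.NET_SessionId"},
)
print(login_failed("Missing cookie", "ASP.NET_SessionId", success))
print(login_failed("Text presence", "Your username or password is invalid.", success))
```

The methods named Missing look for something that only appears after a successful login, whereas Redirection to and the methods named presence look for something that only appears after a failed one.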
"Content to Include" section
When the website pages indexed by your Sitemap source are rendered dynamically with JavaScript, expand the Content to Include section, and then select the Render JavaScript check box. By default, the Sitemap source doesn’t execute the JavaScript code in crawled website pages.
Selecting the Render JavaScript check box may significantly increase the time needed to crawl pages.
"Crawling Settings" section
When a target website sometimes responds slowly:
- In the Request timeout box, use the + and - buttons to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there’s no timeout. By increasing the timeout value, you increase the delay tolerance and avoid timeout errors.
- Under Request interval delay, enter the number of milliseconds to wait between requests sent to retrieve your Sitemap content. The default value is 0, which means there’s no speed limitation. To decrease the crawling speed, enter a higher number. The maximum possible delay is 5000 milliseconds.
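The interplay of the two settings can be pictured with a small Python sketch: a timeout applied to each request and a pause inserted between requests. The `crawl` function and its parameters are hypothetical names that mirror the settings above, and the `fetch` callable stands in for whatever actually downloads a page; none of this reflects the crawler's internals.

```python
import time


def crawl(urls, fetch, request_timeout_s=100, request_interval_ms=0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests.

    fetch is any callable taking (url, timeout); a timeout of None means
    the request may wait indefinitely.
    """
    if not 0 <= request_interval_ms <= 5000:
        raise ValueError("The request interval delay must be between 0 and 5000 ms.")
    # A timeout value of 0 means no timeout.
    timeout = None if request_timeout_s == 0 else request_timeout_s
    results = []
    for index, url in enumerate(urls):
        if index > 0 and request_interval_ms:
            sleep(request_interval_ms / 1000)
        results.append(fetch(url, timeout))
    return results


# Usage with a stand-in fetch function that performs no network access.
pages = crawl(
    ["http://myorgwebsite.com/a", "http://myorgwebsite.com/b"],
    fetch=lambda url, timeout: f"fetched {url} (timeout={timeout})",
    request_interval_ms=250,
)
print(pages)
```

With N pages and a delay of d milliseconds, the pauses alone add (N - 1) × d milliseconds to the crawl, which is why a higher delay lowers the crawling speed.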
"Web Scraping" section
When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature.
You have a Sitemap source for which you want to exclude HTML item header and footer sections, so you enter the following in the Web scraping configuration input.
{
"sensitive": false,
"value": "[\n {\n \"for\": {\n \"urls\": [\".*\"]\n },\n \"exclude\": [\n { \"path\": \"#ohHeader\" },\n { \"path\": \"#MainSection > div.col-md-3\" },\n { \"path\": \"#answerLink\" }\n ],\n \"metadata\": {\n \"topicTitle\": { \"path\": \"div.topic h1::text\" },\n \"topicLastUpdate\": { \"path\": \"#LastUpdate::text\" }\n }\n }\n]"
}
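Because the value in the example above is a JSON array serialized into a string, escaping it by hand is error-prone. The Python sketch below builds the same shape from a regular data structure and lets the JSON library handle the escaping. The `to_parameter` helper is hypothetical; the rule content is copied from the example.

```python
import json

# Web scraping rules from the example above, written as plain Python data.
scraping_rules = [
    {
        "for": {"urls": [".*"]},
        "exclude": [
            {"path": "#ohHeader"},
            {"path": "#MainSection > div.col-md-3"},
            {"path": "#answerLink"},
        ],
        "metadata": {
            "topicTitle": {"path": "div.topic h1::text"},
            "topicLastUpdate": {"path": "#LastUpdate::text"},
        },
    }
]


def to_parameter(rules: list, sensitive: bool = False) -> dict:
    """Wrap the rules in the sensitive/value shape, serializing them to a string."""
    return {"sensitive": sensitive, "value": json.dumps(rules, indent=2)}


# json.dumps escapes the quotes and newlines of the inner string.
print(json.dumps(to_parameter(scraping_rules), indent=2))
```

Parsing the value back with `json.loads` is a quick way to confirm that an existing configuration string is still valid JSON after manual edits.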
"Content security" tab
Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.
"Access" tab
In the Access tab, set whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.
For example, when creating a new source, you could decide that members of Group A can edit its configuration while Group B can only view it.
See Custom access level for more information.
Completion
- Finish adding or editing your source:
  - When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click Add source/Save.
    Note
    On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.
  - When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.
    Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.
    Once the source is built or rebuilt, you can review its content in the Content Browser.
- Optionally, consider editing or adding mappings once your source is done building or rebuilding.
Refine the content to index
You may want to avoid indexing certain pages, or to index only a few of them. To do so:
- If not already done, create and save your source with a broad URL.
- In your source JSON configuration, enter an address filter to refine the targeted content.
  Your URL must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns. Otherwise, Coveo will return a No Items Indexed error.
- Build or rebuild your source.
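The inclusion and exclusion rule stated above can be expressed as a short check, which is handy for testing a set of patterns before rebuilding. The Python sketch below is an illustration under assumptions: the "expression" and "allowed" keys and the wildcard syntax are hypothetical stand-ins, so refer to your source JSON configuration for the actual structure of addressPatterns.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: allowed=True is an inclusion, allowed=False an exclusion.
patterns = [
    {"expression": "http://myorgwebsite.com/*", "allowed": True},
    {"expression": "http://myorgwebsite.com/blog/*", "allowed": False},
]


def is_url_indexed(url: str, address_patterns: list) -> bool:
    """Return True when the URL matches an inclusion and no exclusion."""
    included = any(
        p["allowed"] and fnmatchcase(url, p["expression"]) for p in address_patterns
    )
    excluded = any(
        not p["allowed"] and fnmatchcase(url, p["expression"]) for p in address_patterns
    )
    return included and not excluded


print(is_url_indexed("http://myorgwebsite.com/sitemap.xml", patterns))
print(is_url_indexed("http://myorgwebsite.com/blog/post-1", patterns))
```

Note that `fnmatchcase` also treats `?` and `[` as wildcard characters, so URLs containing query strings would need escaping or a different matcher in a real check.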
Required privileges
You can assign privileges to allow access to specific tools in the Coveo Administration Console. The following table indicates the privileges required to view or edit elements of the Sources (platform-ca | platform-eu | platform-au) page and associated panels. See Manage privileges and Privilege reference for more information.
Note
The Edit all privilege isn’t required to create sources. When granting privileges for the Sources domain, you can grant a group or API key the View all or Custom access level, instead of Edit all, and then select the Can Create check box to allow users to create sources. See Can Create ability dependence for more information.
Actions | Service | Domain | Required access level |
---|---|---|---|
View sources, view source update schedules, and subscribe to source notifications | Content | Fields | View |
 | Content | Sources | View |
 | Organization | Organization | View |
Edit sources, edit source update schedules, and view the View Metadata page | Content | Fields | Edit |
 | Content | Sources | Edit |
 | Content | Source metadata | View |
 | Organization | Organization | View |
What’s next?
- If you experienced issues while building the source, consider using Sitemap connector hidden parameters.
- If you’re using the Crawling Module to retrieve your content, consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.