Add or Edit a Sitemap Source
Members of the Administrators and Content Managers built-in groups can use a Sitemap source to make the content of listed web pages from a sitemap file or a Sitemaps index file searchable.
A sitemap file is a component that can be added to a website and is required when using a Sitemap source. The file lists the website’s URLs along with their metadata, which includes the last modified date (LMD). This enables the Sitemap source to perform a refresh rather than the rescan that a Web source requires. For this reason, although a Sitemap source requires the extra step of adding a sitemap file, it offers better performance than a Web source.
For secured websites (sitemaps that aren’t publicly accessible), the source supports several authentication modes.
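To illustrate why the last modified date makes a refresh possible, here is a minimal sketch that reads an XML sitemap and keeps only the URLs modified since a previous run. It is not Coveo's implementation; the function names (`parse_sitemap`, `urls_to_refresh`) are hypothetical, and treating a missing last modified date as "always refresh" is an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the Sitemap protocol for XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a <lastmod> value; return None when it is absent or empty."""
    text = (text or "").strip()
    if not text:
        return None
    # Older Python versions don't accept the "Z" suffix in fromisoformat.
    value = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: dates without a time zone are treated as UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) pairs from an XML sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = parse_lastmod(
            url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        )
        if loc:
            entries.append((loc, lastmod))
    return entries


def urls_to_refresh(xml_text, last_refresh):
    """Keep URLs changed after last_refresh (or with no lastmod at all)."""
    return [
        url
        for url, lastmod in parse_sitemap(xml_text)
        if lastmod is None or lastmod > last_refresh
    ]
```

Only the pages returned by `urls_to_refresh` would need to be downloaded again, which is the saving a refresh offers over a full rescan.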
Source Key Characteristics
Features | | Supported | Additional information
---|---|---|---
Sitemap version | | XML, Text, RSS 2.0, Atom 1.0, and HTML | Sitemap files and sitemap index files must respect the Sitemap protocol (you can, however, disable validations with a parameter). Supports sitemap files containing custom metadata (see Index XML Sitemap Metadata).
Searchable content type | | Web pages (URL) |
Content update operations | Refresh | |
 | Rescan | | Takes place every day by default.
 | Rebuild | |
Content security options | Determined by source permissions | |
 | Source creator | |
 | Everyone | |
The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.
For example, in the tag <meta property="og:title" content="The Article Title"/>, the text The Article Title is indexed.
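The rule above can be sketched with Python's standard `html.parser` module. This is only an illustration of the selection logic, not the connector's code; the class and function names are hypothetical, and the order in which the key attributes are checked is an assumption.

```python
from html.parser import HTMLParser

# Attributes that can key a meta tag, as listed above.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaContentParser(HTMLParser):
    """Collect the content attribute of keyed <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        for key_attribute in KEY_ATTRIBUTES:
            key = attributes.get(key_attribute)
            if key:
                self.metadata[key] = content
                break


def extract_meta(html):
    """Return a {key: content} mapping for the keyed meta tags in html."""
    parser = MetaContentParser()
    parser.feed(html)
    return parser.metadata
```

With the tag from the example, `extract_meta` maps `og:title` to `The Article Title`, while a meta tag that has a `content` attribute but no key attribute is skipped.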
Requirements
Supported Sitemap File Formats
The source can include web pages from the following sitemap file formats:
- XML (sitemap and index)
- Text
- Syndication feeds (Atom 1.0 and RSS 2.0)
- HTML
Supported Authentication Schemes
The source can authenticate with the following authentication schemes:
- Basic
- Digest
- NTLM
- Negotiate/Kerberos
- Form based
You can enter the authentication parameters in the “Authentication” Section.
Add or Edit a Sitemap Source
When adding a source, select the Sitemap option.
To edit a source, on the Sources page, click the desired source, and then, in the Action bar, click Edit.
The completion steps are especially important when creating or editing a source of this type.
“Configuration” Tab
In the Add/Edit a Sitemap Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.
General Information
Source Name
Enter a name for your source.
Use a short and descriptive name, using letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.
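If you generate source names in a script, a quick check such as the following can catch names that stray from this recommendation. The function name is hypothetical, and the pattern only encodes the advice above; it is not a rule enforced by the product.

```python
import re

# Letters, digits, hyphens, and underscores only (no spaces).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_recommended_source_name(name):
    """Return True when name follows the naming recommendation."""
    return bool(NAME_PATTERN.fullmatch(name))
```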
URLs
Enter the URL(s) of a sitemap index file or sitemap files in either the http:// or https:// form.
When you want to retrieve the content of web pages listed in an XML sitemap, enter the direct sitemap URL instead of the website address. Otherwise, the source can interpret the web page as a sitemap file in HTML and crawl the discovered links. For example, enter http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.
- Public website sitemap: http://myorgwebsite.com/sitemap.xml
- Public website sitemap compressed with GZIP: http://myorgwebsite.com/sitemap.xml.gz
- Web page containing links such as a sitemap: http://myorgwebsite.com/sitemap
Avoid including more than one sitemap in a given source. Instead, create one source for each sitemap.
- By default, sitemap files and sitemap index files that don’t respect the following validations based on the Sitemap protocol are ignored while the content is included:
  - A sitemap file must be no larger than 10 MB when uncompressed (even if the file is compressed with GZIP).
  - A sitemap file can’t contain more than 50,000 URLs.
  - All referenced URLs must be less than 2,048 characters.
  - All referenced URLs must be relative to the sitemap that references them and in the same domain. The location of a sitemap file determines the set of URLs that can be included in that sitemap. For example, a sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but can’t include URLs starting with http://myorgwebsite.com/catalog/.
- When you don’t want your sitemap files and sitemap index files to be validated, add the ParseSitemapInStrictMode hidden parameter and set it to false in the parameters section of the source JSON configuration. In this case, the above validations aren’t performed. Consequently, all web pages are included if their reference URL is valid and absolute.
- The Sitemap source can retrieve all links contained in a web page. The Sitemap source crawler doesn’t follow the links it discovers on the linked pages; it only treats the initial web page as a sitemap file in HTML.
- You can also choose to include only a specific part of a web page by adding the HtmlXPathSelectorExpression hidden parameter in the parameters section of the source JSON configuration.
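The default validations listed above can be approximated with a short check that you run against your own sitemap before creating the source. This is a sketch under stated assumptions: the function name is hypothetical, 10 MB is interpreted as 10 × 1024 × 1024 bytes, and the location rule is reduced to a simple prefix comparison on the sitemap's directory.

```python
MAX_SITEMAP_BYTES = 10 * 1024 * 1024  # assumption: 10 MB counted in binary units
MAX_URLS = 50_000
MAX_URL_LENGTH = 2048


def validate_sitemap(sitemap_url, uncompressed_size, urls):
    """Return a list of violations; an empty list means the sitemap passes."""
    problems = []
    if uncompressed_size > MAX_SITEMAP_BYTES:
        problems.append("sitemap is larger than 10 MB uncompressed")
    if len(urls) > MAX_URLS:
        problems.append("sitemap contains more than 50,000 URLs")

    # The directory holding the sitemap determines which URLs it may list.
    base = sitemap_url.rsplit("/", 1)[0] + "/"
    for url in urls:
        if len(url) >= MAX_URL_LENGTH:
            problems.append(f"URL is 2,048 characters or longer: {url[:60]}...")
        if not url.startswith(base):
            problems.append(f"URL is outside {base}: {url}")
    return problems
```

For a sitemap at http://myorgwebsite.com/tech/sitemap.xml, a URL under http://myorgwebsite.com/tech/ passes, while one under http://myorgwebsite.com/catalog/ is reported.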
User Agent
Enter the user agent string you want the Sitemap source to send with HTTP requests to identify itself when downloading pages.
The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
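To see what a web server receives, the sketch below builds (without sending) an HTTP request that carries this user agent string, using Python's standard `urllib`. The `build_request` helper is hypothetical and only demonstrates the header; it says nothing about how the connector itself issues requests.

```python
from urllib.request import Request

DEFAULT_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
    "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a request object whose User-Agent header identifies the crawler."""
    return Request(url, headers={"User-Agent": user_agent})
```

A site administrator can search the web server access logs for this string to recognize requests coming from the source.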
Paired Crawling Module
If your source is a Crawling Module source and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.
Optical Character Recognition (OCR)
Check this box if you want Coveo Cloud to extract text from image files or PDF files containing images. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.
Index
When adding a source, if you have more than one logical (non-Elasticsearch) index in your organization, select the index in which the retrieved content will be stored (see Leverage Many Coveo Indexes). If your organization only has one index, this drop-down menu isn’t visible and you have no decision to make.
- To add a source storing content in an index different than default, you need the View access level on the Logical Index domain (see Manage Privileges and Logical Indexes Domain).
- Once the source is added, you can’t switch to a different index.
“Authentication” Section
When the Sitemap you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to gain access to the secured content. You can refer to the Source Credentials Leading Practices for additional information.
The Sitemap source supports the following authentication types. Click the desired authentication method for details on the parameters to configure.
- Basic authentication: Select this option when the desired website uses the normal NTLM identity. See Basic access authentication for details on how this option works.
- Manual form authentication: Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.
- Automatic form authentication: Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.
Basic Authentication
When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.
- For Basic authentication, to prevent exposing your credentials, provide username and password information only when the website uses a communication protocol secured with TLS or SSL (HTTPS). However, if you do enter basic authentication credentials, they will be provided regardless of whether the link requiring these credentials uses HTTP or HTTPS. It’s your responsibility to ensure that your Sitemap links requiring basic authentication credentials use HTTPS for increased security.
- If your Sitemap contains a link to a page of a different domain or subdomain that also requires basic authentication, the Coveo Cloud Sitemap connector will also provide the credentials you entered.
Manual Form Authentication
When selecting Manual form authentication:
1. In the Form URL box, enter the website login page URL.
2. (Optional) When there’s more than one form on the login page, enter the Form name.
3. Click the Action method drop-down menu, and then select the HTTP verb used to submit the authentication request, which is either POST or GET.
4. Click the Content type drop-down menu, and then select the content-type header for HTTP POST requests used by the form authentication process.
5. (Optional) When the authentication request is sent to a URL other than the specified form URL, enter the Action URL. Otherwise, leave the box empty.
6. For the Username input name and Password input name boxes, inspect the form HTML code, locate the corresponding <input name='abc' type='text' /> elements, and then enter each name attribute value.
   Based on the HTML code below:
   <input name="login" type="email" />
   <input name="pwd" type="password" />
   login is the username input name and pwd is the password input name.
7. In the Username input value and Password input value boxes, enter the username and password respectively.
8. When your form has parameters other than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter.
   Under Other inputs, input values are displayed in clear text. Therefore, make sure to enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).
9. Under Confirmation method, select the method that will determine if the authentication request failed.
   Depending on the selected confirmation method, enter the appropriate value:
   - When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
     Example: https://mycompany.com/login/failed.html
   - When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
     Example: ASP.NET_SessionId
   - When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
   - When selecting Missing text, in the Value input, enter a string that’s shown to authenticated users.
     Examples:
     - Hello, jsmith@mycompany.com!
     - Log out
   - When selecting Text presence, in the Value input, enter a string that’s shown when a login fails.
     Examples:
     - An error has occurred.
     - Your username or password is invalid.
   - When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
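When a login form is long, locating the username and password input names by eye can be tedious. The following sketch lists the named inputs of a page and guesses which ones hold the credentials. It is a helper for your own inspection, not part of the product; the names are hypothetical, and the guess assumes the password field has type password and the username field has type text or email.

```python
from html.parser import HTMLParser


class FormInputParser(HTMLParser):
    """Record the (name, type) of every named <input> in document order."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # An input without a type attribute defaults to "text" in HTML.
            input_type = (attributes.get("type") or "text").lower()
            self.inputs.append((name, input_type))


def guess_credential_inputs(html):
    """Return (username_input_name, password_input_name), None when not found."""
    parser = FormInputParser()
    parser.feed(html)
    password = next((n for n, t in parser.inputs if t == "password"), None)
    username = next((n for n, t in parser.inputs if t in ("text", "email")), None)
    return username, password
```

Applied to the two inputs shown in the steps above, the helper returns login as the username input name and pwd as the password input name. Always confirm the result against the actual form, since a page can contain several forms.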
Automatic Form Authentication
When selecting Automatic form authentication:
1. In the Form URL box, enter the website login page URL.
2. In the Username and Password boxes, enter the credentials to log in. See Source Credentials Leading Practices.
3. Under Confirmation method, select the method that will determine if the authentication request failed.
   Depending on the selected confirmation method, enter the appropriate value:
   - When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
     Example: https://mycompany.com/login/failed.html
   - When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
     Example: ASP.NET_SessionId
   - When selecting Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
   - When selecting Missing text, in the Value input, enter a string that’s shown to authenticated users.
     Examples:
     - Hello, jsmith@mycompany.com!
     - Log out
   - When selecting Text presence, in the Value input, enter a string that’s shown when a login fails.
     Examples:
     - An error has occurred.
     - Your username or password is invalid.
   - When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
   In addition, if you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.
4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.
5. (Optional) If your automatic form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.
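The six confirmation methods, which apply to both manual and automatic form authentication, each reduce to a single test on the response that follows the login request. The sketch below spells out that logic so the difference between the "missing" and "presence" variants is easier to see. It is an interpretation of the descriptions above, not Coveo's code; the function signature is hypothetical, and using a partial regex match (`re.search`) is an assumption.

```python
import re


def authentication_failed(method, value, final_url, cookies, body):
    """Return True when the configured confirmation method signals a failed login.

    final_url: URL reached after the login request (following redirects)
    cookies:   collection of cookie names set by the website
    body:      text of the page returned after the login request
    """
    if method == "Redirection to":
        return final_url == value
    if method == "Missing cookie":
        return value not in cookies
    if method == "Missing URL":
        return re.search(value, final_url) is None
    if method == "Missing text":
        return value not in body
    if method == "Text presence":
        return value in body
    if method == "URL presence":
        return re.search(value, final_url) is not None
    raise ValueError(f"Unknown confirmation method: {method}")
```

For instance, with Missing cookie and the value ASP.NET_SessionId, the login counts as failed whenever that cookie is absent from the response.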
“Content to Include” Section
When the website pages indexed by your Sitemap source are rendered dynamically with JavaScript, expand the Content to Include section, and then select the JavaScript-rendered check box. By default, the Sitemap source doesn’t execute the JavaScript code in crawled website pages.
Selecting the JavaScript-rendered check box may significantly increase the time needed to crawl pages.
- When the JavaScript takes longer to execute than normal or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with the longest rendering time are indexed with all the rendered content. Enter the time in milliseconds allowed for dynamic content to be retrieved before indexing the content. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.
- To configure page filters, you must edit the source JSON configuration and configure the addressPatterns hidden parameter.
“Crawling Settings” Section
When a target website sometimes responds slowly:
- In the Request timeout box, use the + and - buttons to select the web request timeout value in seconds. The default is 100 seconds. When the value is 0, there’s no timeout. By increasing the timeout value, you increase the delay tolerance and avoid timeout errors.
- Under Request interval delay, enter the number of milliseconds to wait between requests sent to retrieve your Sitemap content. The default value is 0, which means there’s no speed limitation. To decrease the crawling speed, enter a higher number. The maximum possible delay is 5000 milliseconds.
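The interplay of the two settings can be pictured with a small fetch loop: the timeout bounds each individual request, while the interval delay spaces consecutive requests. This is a simplified sketch, not the crawler's implementation; the `fetch_all` function and its `fetch` and `sleep` parameters are hypothetical and exist mainly so the loop can be tried without network access.

```python
import time
from urllib.request import urlopen


def fetch_all(urls, request_timeout=100, interval_delay_ms=0, fetch=None, sleep=time.sleep):
    """Download each URL in turn, honoring a timeout and an interval delay."""
    if not 0 <= interval_delay_ms <= 5000:
        raise ValueError("interval delay must be between 0 and 5000 milliseconds")

    # A value of 0 means no timeout, which urlopen expresses as None.
    timeout = request_timeout if request_timeout > 0 else None

    if fetch is None:
        def fetch(url, timeout):
            with urlopen(url, timeout=timeout) as response:
                return response.read()

    pages = {}
    for index, url in enumerate(urls):
        # Wait between requests, but not before the first one.
        if index > 0 and interval_delay_ms:
            sleep(interval_delay_ms / 1000)
        pages[url] = fetch(url, timeout)
    return pages
```

With 1,000 pages and a 5000 millisecond delay, the waiting alone adds roughly 83 minutes to a crawl, so raise the delay only as far as the target website requires.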
“Web Scraping” Section
When you want to exclude page sections (such as headers and footers) or extract information from the pages to create metadata, expand the Web Scraping section to use this powerful feature.
You have a Sitemap source for which you want to exclude HTML item header and footer sections, so you enter the following in the Web scraping configuration input.
"sensitive": false,
"value": "[\n {\n \"for\": {\n \"urls\": [\".*\"]\n },\n \"exclude\": [\n { \"path\": \"#ohHeader\" },\n { \"path\": \"#MainSection > div.col-md-3\" },\n { \"path\": \"#answerLink\" }\n ],\n \"metadata\": {\n \"topicTitle\": { \"path\": \"div.topic h1::text\" },\n \"topicLastUpdate\": { \"path\": \"#LastUpdate::text\" }\n }\n }\n]"
}
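In the source JSON configuration, the web scraping configuration appears as an escaped string stored under a value key, as in the fragment above. The sketch below shows how such a string relates to the readable JSON array you enter in the Web scraping configuration input. The helper names are hypothetical, and the entry shape (a sensitive flag plus a value string) is inferred from the fragment rather than from a documented schema.

```python
import json

# The same configuration as above, written as plain Python data.
scraping_config = [
    {
        "for": {"urls": [".*"]},
        "exclude": [
            {"path": "#ohHeader"},
            {"path": "#MainSection > div.col-md-3"},
            {"path": "#answerLink"},
        ],
        "metadata": {
            "topicTitle": {"path": "div.topic h1::text"},
            "topicLastUpdate": {"path": "#LastUpdate::text"},
        },
    }
]


def to_parameter_entry(config):
    """Serialize the configuration into a string-valued parameter entry."""
    return {"sensitive": False, "value": json.dumps(config, indent=1)}


def from_parameter_entry(entry):
    """Decode the escaped value string back into readable data."""
    return json.loads(entry["value"])
```

Decoding the value with `from_parameter_entry` is a convenient way to review which sections are excluded and which metadata is extracted before editing the configuration.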
“Content Security” Tab
Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content Security.
“Access” Tab
In the Access tab, determine whether each group and API key can view or edit the source configuration (see Resource Access):
- In the Access Level column, select View or Edit for each available group.
- On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.
Completion
- Finish adding or editing your source:
  - When you want to save your source configuration changes without starting a build/rebuild, such as when you plan to make other changes soon, click Add Source/Save.
    To add the source content or to make your changes effective, on the Sources page, you must click Start initial build or Start required rebuild in the source Status column.
  OR
  - When you’re done editing the source and want to make changes effective, click Add and Build Source/Save and Rebuild Source.
    Back on the Sources page, you can review the progress of your source addition or modification.
    Once the source is built or rebuilt, you can review its content in the Content Browser.
- Optionally, consider editing or adding mappings.
  You can only manage mapping rules once you build the source (see Refresh, Rescan, or Rebuild Sources).
What’s Next?
- If you experienced issues while building the source, consider using Sitemap connector hidden parameters.
- If you’re using the Crawling Module to retrieve your content, consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.