Add or Edit a Web Source
|
Note
Coveo is discontinuing use of the PhantomJS web driver in its Web and Sitemap sources in January 2023. Learn more about what you need to do. |
Members with the required privileges can use a Web source to make the content of a website searchable.
The Web source type behaves similarly to the bots of web search engines such as Google. The source only needs a starting URL; it then automatically discovers all the pages of the site by following the site navigation and the hyperlinks appearing in the pages. Consequently, only reachable pages are indexed, and in no guaranteed order. By default, the source doesn’t include pages that aren’t under the URL root.
Similarly to a Sitemap source, a Web source is used to index an HTML site, or data that can be exported to HTML. However, a Sitemap source supports the refresh operation, which offers faster and more efficient indexing, and is therefore generally preferred over a Web source.
|
Leading practice
The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions. |
Source Key Characteristics
Features | Supported | Additional information |
---|---|---|
Searchable content types | Web pages (complete) | |
Content security options | | |
Leading Practices
-
If possible, create one source per website that you want to make searchable, as this is the most stable and intuitive configuration. However, if you want to index many websites (i.e., above 50), or if you have reached your Sources limit, consider creating sources that retrieve content from more than one website.
To optimize time and resource consumption, try balancing the size of your sources: a source may contain several websites with a few dozen pages each, or one or two larger websites. You can also leverage the Delay Between Requests parameter to increase crawling speed for the sites you own. Contact the Coveo Support team for help if needed.
-
Schedule rescan operations following the rate at which your source content changes.
-
When a connector exists for the technology powering the website, create a source based on that connector instead, as it will typically index content, metadata, and permissions more effectively (see Connector Directory).
Example: You want to make the content of an Atlassian Confluence-powered site searchable. Create a Confluence source, not a Web source.
Features
Supported Authentication
The supported authentication methods are basic, manual form authentication (available only for modifications to existing legacy sources), and form authentication.
Available Metadata
The default metadata for each item includes:

Metadata | Description |
---|---|
All request and response headers | Each header is indexed as a separate metadata with a dedicated prefix |
All headers | All headers together, as a JSON dictionary |
Note
For meta names with colons (:), you must specify the origin explicitly in the mapping since the colon is the delimiter for the origin of the metadata (see Mapping Rule Syntax Reference). For example, |
Web Scraping
Using the web scraping feature, you can exclude sections of a page, extract metadata from the page, and even create separate index items from specific sections of a single web page (see Web Scraping Configuration).
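As a sketch only, a minimal web scraping configuration could exclude boilerplate sections of a page and extract a metadata value. The exact schema is defined in Web Scraping Configuration; the selectors, field names, and overall shape below are illustrative assumptions, not a verified configuration:

```json
[
  {
    "for": { "urls": [".*"] },
    "exclude": [
      { "type": "CSS", "path": "header" },
      { "type": "CSS", "path": "footer" }
    ],
    "metadata": {
      "author": { "type": "CSS", "path": ".article-author::text" }
    }
  }
]
```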
JavaScript Support
The crawler can run the underlying JavaScript in website pages to dynamically render content to index.
Robots.txt Crawl-Delay and Page Restrictions Support
By default, the crawler respects the instructions in the robots.txt file associated with the website.
|
Note
The source doesn’t support other parameters such as the |
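To illustrate how Crawl-Delay and page restriction directives in a robots.txt file are interpreted, here’s a minimal sketch using Python’s standard library robot parser. The sample robots.txt content is hypothetical, and Coveo’s own crawler doesn’t use this module; it only demonstrates the semantics described above.

```python
import urllib.robotparser

# Hypothetical robots.txt content similar to what a crawler would fetch
# from a website before indexing it.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Page restrictions: /private/ is off limits, everything else is allowed.
allowed = parser.can_fetch("*", "https://www.example.com/public/index.html")
blocked = parser.can_fetch("*", "https://www.example.com/private/report.html")

# Crawl-Delay: wait at least 2 seconds between requests to this site.
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)  # True False 2
```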
Limitations
-
Multi-factor authentication (MFA) and CAPTCHA aren’t supported.
-
Refresh isn’t available. Therefore, a daily rescan is defined. You can enable this daily rescan on a per-source basis.
-
Indexing page permissions, if any, isn’t supported.
-
JavaScript menus and pop-up pages aren’t supported.
-
Only pages reachable through website page links are indexed.
-
Although the MaxPageSizeInBytes parameter is set to 0 (unlimited size) by default in the source JSON configuration, the Coveo indexing pipeline can handle web pages up to 512 MB only (see Edit a Source JSON Configuration). Larger pages are indexed by reference, i.e., their content is ignored by the Coveo crawler, and only their metadata and path are searchable. As a result, no Quick View is available for these larger items (see Search Result Quick View).
-
Crawling performance depends heavily on the responding web server.
-
Pausing and resuming source updates isn’t yet supported. Therefore, Web source operations can’t be paused on error.
-
When the Render-Javascript option is enabled, the Web connector doesn’t support sending AdditionalHeaders.
-
When the Render-Javascript option is enabled, Basic Authentication isn’t supported.
|
Note
The Sitemap source may be a better solution when the website features a sitemap file. |
Add or Edit a Web Source
To add a source
-
On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.
-
In the Add a source of content panel, click the Cloud or Crawling Module tile, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.
To edit a source, on the Sources (platform-ca | platform-eu | platform-au) page, click the desired source, and then click Edit in the Action bar.
|
Leading practice
It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually. See About non-production organizations for more information and best practices regarding sandbox organizations. |
"Configuration" Tab
In the Add/Edit a Web Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.
General Information
Source Name
Enter a name for your source.
|
Leading practice
A source name can’t be modified once it’s saved. Therefore, be sure to use a short and descriptive name, using letters, numbers, and hyphens (-). |
Site URL
The URL of a starting website page, typically a home page, from which the crawler starts discovering the website following links found in pages.
You can enter more than one starting website page, for example, to allow the crawler to see links leading to all the website pages that you want to index.
Avoid crawling more than one site in a given source. Instead, create one source for each website. This way, you can optimize the source parameters for each website.
|
Note
Ensure that you have the right to crawl the public content when you aren’t the owner of the website. Crawling websites that you neither own nor have the right to crawl could create accessibility issues. Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, such software may detect our crawler as an attack and block it from any further crawling. |
|
Leading practice
If you want to index only one or a few specific pages of a site, such as for a test, enter the pages to index in the Site URL box, and then edit the source JSON configuration to set the MaxCrawlDepth parameter value to |
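The MaxCrawlDepth edit mentioned above could look roughly like the following fragment of a source JSON configuration. The exact shape of the parameter and the value to use are described in Edit a Source JSON Configuration; the value shown here is a placeholder assumption:

```json
{
  "parameters": {
    "MaxCrawlDepth": {
      "value": "0"
    }
  }
}
```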
Paired Crawling Module
If your source is a Crawling Module source and you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to take effect.
"Content to Include" Section
Consider changing the default value of any of the parameters in this section to fine-tune how web pages included in this source are crawled.
When specifying inclusion or exclusion filters, ensure that the page specified in the Site URL box isn’t filtered out. Otherwise, no items are indexed, because the starting page is excluded and the crawling process never starts. If the Site URL redirects to another URL, ensure that neither one is excluded by your filter settings.
Example: The www.mycompany.com website you crawl contains versions in several languages, and you want to have one source per language. For the US English source, your parameter values could be as shown in the following table.

Parameter | Value |
---|---|
Site URL | |
Inclusion filters | |
You can index pages that are only referenced in excluded pages by setting the ExpandBeforeFiltering parameter to true in the parameters section of the source JSON configuration.
This way, even if your Site URL is excluded by your filters, pages referenced in the Site URL page are retrieved before the filtering is applied.
|
Note
Setting the |
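For reference, the ExpandBeforeFiltering setting described above could appear as follows in the parameters section of the source JSON configuration. This is a sketch only; the exact shape is described in Edit a Source JSON Configuration:

```json
{
  "parameters": {
    "ExpandBeforeFiltering": {
      "value": "true"
    }
  }
}
```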
Inclusion Filters
Your source indexes only the pages that match a URL expression specified in this section.
-
Enter a URL expression to apply as the inclusion filter.
-
Select whether the URL expression uses a Wildcard or a Regex (regular expression) pattern.
|
Leading practice
You can test your regexes with tools such as Regex101 to ensure that they match the desired URLs. You can customize regexes to meet your use case, focusing on aspects such as case sensitivity and trailing slashes.
For example, you want to index HTML pages on your company staging and dev websites without taking case sensitivity or the trailing slash (/) into account, so you use the following regex:
The regex matches the following URLs:
but doesn’t match the following ones:
|
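The scenario above can be sketched with a regular expression tested in Python. The regex, host names, and URLs below are illustrative assumptions, since the original example values weren’t preserved on this page:

```python
import re

# Illustrative inclusion-filter regex: match HTML pages on hypothetical
# "company-dev" and "company-staging" hosts, ignoring case and an
# optional trailing slash.
pattern = re.compile(
    r"^https?://company-(dev|staging)\.example\.com/.*\.html/?$",
    re.IGNORECASE,
)

matching = [
    "http://company-dev.example.com/docs/index.html",
    "https://Company-Staging.example.com/a/b/page.HTML/",
]
non_matching = [
    "https://company-prod.example.com/docs/index.html",  # host not included
    "https://company-dev.example.com/assets/logo.png",   # not an HTML page
]

assert all(pattern.match(url) for url in matching)
assert not any(pattern.match(url) for url in non_matching)
```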
The www.mycompany.com website you crawl contains versions in several languages, and you want to have one source per language. For the US English source, if the source URL is www.mycompany.com/en-us/welcome.html, the inclusion filter would be www.mycompany.com/en-us/*.
Exclusion Filters
Your source ignores content from pages that match a URL expression specified in this section.
-
Enter a URL expression to apply as the exclusion filter.
Notes
-
Exclusion filters also apply to shortened and redirected URLs.
-
By default, if pages are only accessible via excluded pages, those pages are also excluded.
-
Exclusion filters for SharePoint Online sources aren’t case sensitive when using a Regex (regular expression) pattern. For example, (company-(dev|staging)).*html.?$ matches http://ComPanY-dev/important/document.html without any additional symbols to account for case sensitivity. Exclusion filters are case sensitive when using Wildcard expressions.
-
Select whether the URL expression uses a Wildcard or a Regex (regular expression) pattern.
-
There’s no point in indexing the search page of your website, so you exclude its URL:
www.mycompany.com/en-us/search.html
-
You don’t want to index ZIP files that are linked from website pages:
www.mycompany.com/en-us/*.zip
Query Parameters to Ignore
Enter query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.
By default, the source considers the whole URL to determine whether it’s a distinct item. The URLs of the website you index can contain one or more query parameters after the host name and path. Some of these parameters change the content of the page, and therefore legitimately make the URL distinct. Others, however, don’t affect the content; unless you enter them here to be ignored, the source may include such a page more than once, creating search result duplicates.
Example: The URL of a website page for which you get search result duplicates looks as follows:
http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123
The values of param1 and param2 can change for the /v1/getitdone.html page without affecting its content, while the lang value changes the language in which the page appears.
You want to ignore the param1 and param2 query parameters to eliminate search result duplicates, but not lang.
You enter one parameter name per line:
param1
param2
|
Note
Wildcards aren’t supported in query parameter names.
For instance, in the example above, should you want to cover both the |
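The de-duplication behavior described above can be sketched as a URL normalization step: URLs that differ only by ignored query parameters map to the same item key. A minimal Python illustration follows; the function name is ours, not part of any Coveo API.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameter names to ignore, taken from the example above.
IGNORED = {"param1", "param2"}

def normalize(url):
    """Drop ignored query parameters so duplicate URLs collapse together."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

a = normalize("http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123")
b = normalize("http://www.mysite.com/v1/getitdone.html?lang=en&param1=xyz")
c = normalize("http://www.mysite.com/v1/getitdone.html?lang=fr&param1=abc")

print(a == b)  # True: only ignored parameters differ, so the pages collapse
print(a == c)  # False: lang changes the content, so the pages stay distinct
```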
Additional Content
Check the Include Subdomains box to index the site subdomains.
"Authentication" Section
When the website you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to access the secured content. See Source Credentials Leading Practices.
|
Multi-factor authentication (MFA) and CAPTCHA aren’t supported. |
The Web source type supports the following authentication types. Click the desired authentication method for details on the parameters to configure.
-
(Only when indexing HTTPS URLs) Select this option when the desired website uses the normal NTLM identity (see Understanding HTTP Authentication).
-
Note
Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead.
Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.
-
Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.
Basic Authentication
When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.
|
Note
When the Coveo crawler follows links requiring basic authentication while indexing your website, it only provides the basic authentication credentials you entered if the link belongs to the same scheme, domain, or subdomain as the starting Site URL. Conversely, if the link doesn’t belong to one of these, the Coveo crawler doesn’t try to authenticate. If you want the Coveo crawler to authenticate and index a site from a different scheme, domain, and/or subdomain, you must include its address under Site URL. For example, your starting address is
Since you want your basic authentication credentials to be provided when the Coveo crawler follows a link starting with |
Manual Form Authentication
|
Note
Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead. |
When selecting Manual form authentication:
-
In the Form URL box, enter the website login page URL.
-
(Optional) When there’s more than one form on the login page, enter the Form name.
-
Click the Action method dropdown menu, and then select the HTTP method to use to submit the authentication request. Available options are POST or GET.
-
Click the Content type dropdown menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.
-
(Optional) When the authentication request should be sent to a URL other than the specified Form URL, enter this other URL under Action URL. Otherwise, leave the box empty.
-
Inspect the form login page HTML code to locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter the input name attribute values under Username input name and Password input name.
Example: Based on the following HTML code:
<input name="login" type="email" />
<input name="pwd" type="password" />
login is the username input name and pwd is the password input name.
-
Under Username input value and Password input value, enter respectively the username and password parameter values.
-
When your form uses other parameters than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter, as with the username and password.
Under Other inputs, input values are displayed in clear text. Therefore, make sure to enter your sensitive information, i.e., your username and password, above the Other inputs section, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).
Example -
Under Confirmation method, select the method that will determine if the authentication request failed.
Depending on the selected confirmation method, enter the appropriate value:
-
When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
Example: https://mycompany.com/login/failed.html
-
When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
Example: ASP.NET_SessionId
-
If you select Missing URL, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
-
When selecting Missing text, in the Value input, enter a string that’s shown only to authenticated users.
Examples
-
Hello, jsmith@mycompany.com!
-
Log out
-
-
When selecting Text presence, in the Value input, enter a string that’s shown when a login fails.
Examples
-
An error has occurred.
-
Your username or password is invalid.
-
-
When selecting URL presence, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
-
Form Authentication
When selecting Form authentication:
-
In the Form URL box, enter the website login page URL.
-
In the Username and Password boxes, enter the credentials to use to log in. See Source Credentials Leading Practices.
-
Under Confirmation method, select the method that will determine if the authentication request failed.
Depending on the selected confirmation method, enter the appropriate value:
-
When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.
Example: https://mycompany.com/login/failed.html
-
When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.
Example: ASP.NET_SessionId
-
If you select Missing URL, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.
-
When selecting Missing text, in the Value input, enter a string that’s shown only to authenticated users.
Examples
-
Hello, jsmith@mycompany.com!
-
Log out
-
-
When selecting Text presence, in the Value input, enter a string that’s shown when a login fails.
Examples
-
An error has occurred.
-
Your username or password is invalid.
-
-
When selecting URL presence, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.
-
-
If you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.
-
If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.
-
(Optional) If your form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.
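The input-name inspection and autodetection described above can be sketched in a few lines of Python. This illustrates the general idea only: a real web automation framework also renders JavaScript, which this static parser doesn’t do, and the login page HTML below is hypothetical.

```python
from html.parser import HTMLParser

# Collect every <input> element of a login page, mimicking the idea of
# autodetecting form inputs.
class InputCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))

# Hypothetical login page, based on the HTML example in the text above.
login_page = """
<form action="/do-login" method="post">
  <input name="login" type="email" />
  <input name="pwd" type="password" />
  <input name="csrf_token" type="hidden" value="abc123" />
</form>
"""

collector = InputCollector()
collector.feed(login_page)

username_input = next(i["name"] for i in collector.inputs if i.get("type") == "email")
password_input = next(i["name"] for i in collector.inputs if i.get("type") == "password")
hidden_inputs = [i["name"] for i in collector.inputs if i.get("type") == "hidden"]

print(username_input, password_input, hidden_inputs)  # login pwd ['csrf_token']
```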
"Crawling Settings" Section
Specify how you want the Coveo crawler to behave when going through the desired websites.
Delay Between Requests
The number of milliseconds between each request sent to retrieve content from a specified domain. The default value is 1000 milliseconds, i.e., 1 request per second, which is the highest rate at which Coveo crawls a public website.
If you want to increase the crawling speed for a site you own, for example if you need to retrieve the content of a large website, enter a lower number. For this crawling speed to apply, however, Coveo must verify that you’re the owner of the site.
-
If your source is of the Cloud type, create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo organization ID (see Organization ID and Other Information). Then, upload this file at the root of the website you want to index. Changing the default number of milliseconds between each request has no effect if you don’t also provide the expected text file proving your ownership.
If your source retrieves the content of an internal website via the Coveo On-Premises Crawling Module, the specified crawling rate applies automatically, as Coveo detects that the crawled site has a private IP address (see Coveo On-Premises Crawling Module, Content Retrieval Methods, and Private IPv4 Addresses). You therefore don’t have anything to do to prove ownership.
|
Note
If your site has |
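Conceptually, the Delay Between Requests setting works like a per-domain throttle: at most one request per configured interval. A minimal sketch in Python follows; the class and method names are illustrative, not part of the Coveo configuration.

```python
import time

# Per-domain throttle mirroring the Delay Between Requests behavior:
# at most one request per delay_ms milliseconds for a given domain.
class Throttle:
    def __init__(self, delay_ms=1000):
        self.delay = delay_ms / 1000.0
        self.last = {}

    def wait(self, domain):
        last = self.last.get(domain)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last[domain] = time.monotonic()

throttle = Throttle(delay_ms=100)  # a faster rate, e.g., for a site you own
start = time.monotonic()
for _ in range(3):
    throttle.wait("www.mycompany.com")
elapsed = time.monotonic() - start

# The first call goes through immediately; the next two are delayed.
print(elapsed >= 0.2)  # True
```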
Respect Robots.txt Directives
Clear this check box only when you want the crawler to ignore the restrictions specified in the website robots.txt file (see The "Respect Robots.txt Directives" setting).
Respect Noindex Directives
Clear this check box if you want the Coveo crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header (see The "Respect Noindex Directives" and "Respect Nofollow Directives" settings).
Respect Nofollow Directives
Clear this check box if you want the Coveo crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header (see The "Respect Noindex Directives" and "Respect Nofollow Directives" settings).
Respect Nofollow Anchors
Clear this check box if you want the Coveo crawler to follow links that have a rel="nofollow" attribute (see The "Respect Nofollow Anchors" setting).
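The directives covered by the settings above can come from either an X-Robots-Tag response header or a robots meta tag. Here’s a simplified Python sketch of detecting them; the parsing is intentionally naive (real-world meta tags vary more than this regex allows), and it’s an illustration of the concept, not of how Coveo’s crawler is implemented.

```python
import re

def robots_directives(headers, html):
    """Collect robots directives from headers and a robots meta tag."""
    directives = set()
    # X-Robots-Tag header, e.g., "noindex, nofollow".
    for part in headers.get("X-Robots-Tag", "").split(","):
        if part.strip():
            directives.add(part.strip().lower())
    # <meta name="robots" content="..."> tags in the page HTML.
    for m in re.finditer(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    ):
        directives.update(d.strip().lower() for d in m.group(1).split(","))
    return directives

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
from_meta = robots_directives({}, page)
from_header = robots_directives({"X-Robots-Tag": "noindex"}, "")

print(sorted(from_meta))    # ['nofollow', 'noindex']
print(sorted(from_header))  # ['noindex']
```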
Render JavaScript
Check this box only when some website content you want to include is dynamically rendered by JavaScript. By default, the Web source doesn’t execute the JavaScript code in crawled website pages.
|
Selecting the Render JavaScript check box may significantly increase the time needed to crawl pages. |
|
Note
When the JavaScript takes longer than usual to execute or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with the longest rendering time are indexed with all their rendered content.
Enter the time in milliseconds allowed for dynamic content to be retrieved before indexing the content.
When the value is |
Make Text Found in Image Files Searchable (OCR)
Enable this option if you want Coveo to extract text from image files. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable optical character recognition for details on this feature.
|
Note
Contact Coveo Sales to add this feature to your organization license. |
Make Text Found in PDF Files With Images Searchable (OCR)
Enable this option if you want Coveo to extract text from PDF files containing images. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable optical character recognition for details on this feature.
|
Note
Contact Coveo Sales to add this feature to your organization license. |
User Agent
The user agent string that you want Coveo to send with HTTP requests to identify itself when downloading pages.
The default value is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).
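For illustration, this is how a client could send that user agent string with an HTTP request using Python’s standard library. The sketch only constructs the request object; no network call is made.

```python
import urllib.request

# The default Coveo user agent string quoted above.
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) "
      "(compatible; Coveobot/2.0;+https://www.coveo.com/bot.html)")

# Attach the User-Agent header to a request (urllib capitalizes header
# names as "User-agent" internally).
req = urllib.request.Request("https://www.example.com/", headers={"User-Agent": ua})

print(req.get_header("User-agent") == ua)  # True
```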
"Web Scraping" Section
In the JSON configuration box, enter a custom JSON configuration to precisely include page sections or extract metadata from the website pages (see Web Scraping Configuration).
"Content Security" Tab
Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.
"Access" Tab
In the Access tab, set whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.
For example, when creating a new source, you could decide that members of Group A can edit its configuration while Group B can only view it.
See Custom access level for more information.
Completion
-
Finish adding or editing your source:
-
When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to do other changes soon, click Add source/Save.
Note: On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.
-
When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.
Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.
Once the source is built or rebuilt, you can review its content in the Content Browser.
-
-
Optionally, consider editing or adding mappings once your source is done building or rebuilding.
Required privileges
You can assign privileges to allow access to specific tools in the Coveo Administration Console. The following table indicates the privileges required to view or edit elements of the Sources (platform-ca | platform-eu | platform-au) page and associated panels. See Manage privileges and Privilege reference for more information.
|
Note
The Edit all privilege isn’t required to create sources. When granting privileges for the Sources domain, you can grant a group or API key the View all or Custom access level, instead of Edit all, and then select the Can Create check box to allow users to create sources. See Can Create ability dependence for more information. |
Actions | Service | Domain | Required access level |
---|---|---|---|
View sources, view source update schedules, and subscribe to source notifications | Content | Fields, Sources | View |
| Organization | Organization | View |
Edit sources, edit source update schedules, and view the View Metadata page | Content | Fields, Sources | Edit |
| Content | Source metadata | View |
| Organization | Organization | View |
What’s Next?
-
When you extract additional metadata from website pages:
-
Optionally, give the Coveo crawler noindex or nofollow directives.
-
If you’re using the Crawling Module to retrieve your content, consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.