Add or Edit a Web Source

Members of the Administrators and Content Managers built-in groups can use a Web source to make the content of a website searchable.

The Web source type behaves much like the bots of web search engines such as Google. The source only needs a starting URL; it then automatically discovers the pages of the site by following the site navigation and the hyperlinks appearing in pages. Consequently, only reachable pages are indexed, in no particular order. By default, the source doesn’t include pages that aren’t under the URL root.

Similarly to a Sitemap source, a Web source is used to index an HTML site, or data that can be exported to HTML. It performs a rescan to retrieve item changes such as additions, modifications, or deletions (see Edit a Source Schedule). However, a Sitemap source supports refresh, which offers faster and more efficient indexing, and is therefore generally preferred over a Web source.

Source Features Summary

Features                   Supported                    Additional information
Web page version           N/A
Searchable content type    Web pages (complete)
Content update             Rescan and rebuild           Refresh isn’t supported (see Limitations)
Content security options   Source creator or Everyone   Determined by source permissions

Leading Practices

  • If possible, create one source per website that you want to make searchable, as this is the most stable and intuitive configuration. However, if you want to index many websites (i.e., above 50) or if you have reached your Sources limit, consider creating sources that retrieve content from more than one website (see Content Licensing Limits).

    To optimize time and resource consumption, try balancing the size of your sources: a source may contain several websites with a few dozen pages each, or one or two larger websites. You can also leverage the Crawling limit rate parameter to increase crawling speed for the sites you own. Contact the Coveo Support team for help if needed.

  • Because refresh isn’t available for a Web source, ensure that the rescan schedule is set at a frequency that’s a good compromise between search result freshness and acceptable performance and resource impact (see Edit a Source Schedule).

  • When a connector exists for the technology powering the website, create a source based on that connector instead, as it will typically index content, metadata, and permissions more effectively (see Connectivity Directory).

    For example, if you want to make the content of an Atlassian Confluence-powered site searchable, create a Confluence source, not a Web source.

Features

Supported Authentication

The supported authentication includes basic, manual form, and automatic form authentication.

Available Metadata

The default metadata for each item includes:

Metadata	Description

description, keywords, author, Content-type, ...
    All meta tags included in the head of the page. The content attribute of a meta tag is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv. For example, in the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.

Metadata prefixed with RequestHeader or ResponseHeader
    All request and response headers, each in a separate metadata.

coveo_AllRequestHeaders and coveo_AllResponseHeaders
    All headers as a JSON dictionary.

For meta names containing colons (:), you must specify the metadata origin explicitly in the mapping, since the colon is the delimiter for the origin of the metadata (see Mapping Rule Syntax Reference).

For example, for the og:title meta name, specify og:title:crawler.

Web Scraping

Using the web scraping feature, you can exclude sections of a page, extract metadata from the page, and even create separate index items from specific sections of a single web page (see Web Scraping Configuration).
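For instance, a minimal configuration sketch that excludes a page footer and extracts an author metadata could look like the following. The CSS selectors and the author metadata name are hypothetical; see Web Scraping Configuration for the exact syntax.

    [
      {
        "for": {
          "urls": [".*"]
        },
        "exclude": [
          {
            "type": "CSS",
            "path": "footer"
          }
        ],
        "metadata": {
          "author": {
            "type": "CSS",
            "path": ".article-author::text"
          }
        }
      }
    ]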

JavaScript Support

The crawler can run the underlying JavaScript in website pages to dynamically render content to index.

Robots.txt Crawl-Delay and Page Restrictions Support

By default, the instructions of the robots.txt file associated with the website are respected, including the Crawl-delay parameter and page restrictions (Disallow rules).

The source doesn’t support other robots.txt parameters, such as visit-time and request-rate.
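For example, given a website whose robots.txt file contains the following (hypothetical) directives, the crawler skips pages under /internal/ and waits at least 10 seconds between requests:

    User-agent: *
    Disallow: /internal/
    Crawl-delay: 10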

Limitations

  • Refresh isn’t available. Therefore, by default, a daily rescan is performed to retrieve changes.

  • Indexing page permissions, if any, isn’t supported.

  • JavaScript menus and pop-up pages aren’t supported.

  • Only pages reachable through website page links are indexed.

  • Although the MaxPageSizeInBytes parameter in the source JSON configuration is set to 0 (unlimited size) by default, the Coveo Cloud indexing pipeline can only handle web pages up to 512 MB (see Edit a Source JSON Configuration). Larger pages are indexed by reference, i.e., their content is ignored by the Coveo Cloud crawler, and only their metadata and path are searchable (see Indexing by Reference). As a result, no Quick View is available for these larger items (see Search Result Quick View). A sketch of this parameter appears after this list.

  • Crawling performance depends heavily on the responding web server.

  • Pause and resume source operations aren’t yet supported (see Resume a Paused Source Update). Therefore, Web source operations can’t be paused on error.
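A minimal sketch of the MaxPageSizeInBytes parameter at its default value, assuming the same hidden-parameter shape as the ExpandBeforeFiltering example shown later in this article (see Add a Hidden Source Parameter):

      "MaxPageSizeInBytes": {
        "sensitive": false,
        "value": "0"
      }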

The Sitemap source may be a better solution when the website features a sitemap file.

Add or Edit a Web Source

When choosing the type of source you want to add, select the Web option with the appropriate content retrieval method, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content.

To edit a source, on the Sources page, click the desired source, and then, in the Action bar, click Edit.

“Configuration” Tab

In the Add/Edit a Web Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General Information

Source Name

Enter a name for your source.

Use a short and descriptive name, using letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

Site URL

The URL of a starting website page, typically a home page, from which the crawler starts discovering the website by following links found in pages.

You can enter more than one starting website page, for example, to allow the crawler to see links leading to all the website pages that you want to index.

Avoid crawling more than one site with a given source. Instead, create one source for each website. This way, you can optimize source parameters for each website.

If you want to index only one or a few specific pages of a site, such as for a test, enter the pages to index in the Site URL box, and then, in the Content to Include section, set the Maximum depth parameter value to 0. This instructs the crawler to index only the specified pages, and none of the linked pages.
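For example (hypothetical pages), to index only the two following pages and nothing else, enter them in the Site URL box and set Maximum depth to 0:

    https://www.example.com/en-us/release-notes.html
    https://www.example.com/en-us/pricing.html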

User Agent

The user agent string that you want Coveo Cloud to send with HTTP requests to identify itself when downloading pages.

The default is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).

Paired Crawling Module

If your source is a Crawling Module source and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

Optical Character Recognition (OCR)

Check this box if you want Coveo Cloud to extract text from image files or PDF files containing images. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.

Index

When adding a source, if you have more than one logical (non-Elasticsearch) index in your organization, select the index in which the retrieved content will be stored (see Leverage Many Coveo Indexes). If your organization only has one index, this drop-down menu isn’t visible and you have no decision to make.

  • To add a source storing content in an index other than the default, you need the View access level on the Logical Index domain (see Manage Privileges and Logical Indexes Domain).

  • Once the source is added, you can’t switch to a different index.

“Authentication” Section

When the website you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to access the secured content (see Source Credentials Leading Practices).

The Web source type supports the following authentication types. Click the desired authentication method for details on the parameters to configure.

  • Basic authentication

    (Only when indexing HTTPS URLs) Select this option when the desired website uses basic access authentication (see Basic access authentication).

  • Manual form authentication

    Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.

  • Automatic form authentication

    Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

Basic Authentication

When selecting Basic authentication:

  1. In the Username box, enter the source credentials username as you would when you log in to the website you’re making searchable.

  2. In the Password box, enter the corresponding password.

When the Coveo Cloud crawler follows links requiring basic authentication while indexing your website, it only provides the basic authentication credentials you entered if the link belongs to the same scheme, domain, or subdomain as the starting Site URL. Conversely, if the link doesn’t belong to one of these, the Coveo Cloud crawler doesn’t try to authenticate. If you want the Coveo Cloud crawler to authenticate and index a site from a different scheme, domain, and/or subdomain, you must include its address under Site URL.

Your starting address is https://www.example.com. The Coveo Cloud crawler doesn’t provide the basic authentication credentials you provided if the link requiring them belongs to:

  • A different scheme, i.e., it uses HTTP instead of HTTPS

  • A different domain, such as https://www.mysite.com

  • A different subdomain, such as https://www.intranet.example.com

Since you want your basic authentication credentials to be provided when the Coveo Cloud crawler follows a link starting with https://www.intranet.example.com, you enter this URL under Site URL.
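Your Site URL box would then contain both starting addresses:

    https://www.example.com
    https://www.intranet.example.com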

Manual Form Authentication

When selecting Manual form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. (Optional) When there’s more than one form on the login page, enter the Form name.

  3. Click the Action method drop-down menu, and then select the HTTP method to use to submit the authentication request. Available options are POST or GET.

  4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

  5. (Optional) When the authentication request should be sent to a URL other than the specified Form URL, enter that URL under Action URL. Otherwise, leave this box empty.

  6. Inspect the form login page HTML code to locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter the input name attribute values under Username input name and Password input name.

    Based on the following HTML code:

    <input name="login" type="email" />

    <input name="pwd" type="password" />

    login is the username input name and pwd is the password input name.

  7. Under Username input value and Password input value, enter the username and password parameter values, respectively (a sketch of the resulting request appears after this list).

  8. When your form uses parameters other than the username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter, as with the username and password.

    Under Other inputs, input values are displayed in clear text. You must therefore enter your sensitive information, i.e., the username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).


  9. Under Confirmation method, select the method to use to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where you redirect users when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing text, in the Value input, enter a string that’s displayed only to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text, in the Value input, enter a string that’s displayed when a login fails.

      • An error has occurred.

      • Your username or password is invalid.
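To tie the steps above together, here’s a rough sketch of the authentication request the crawler would submit for the HTML form shown in step 6, assuming the POST action method, the application/x-www-form-urlencoded content type, and hypothetical host, login path, and credential values:

    POST /login HTTP/1.1
    Host: www.mycompany.com
    Content-Type: application/x-www-form-urlencoded

    login=jsmith%40mycompany.com&pwd=MyPassword123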

Automatic Form Authentication

When selecting Automatic form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. In the Username and Password boxes, enter the credentials to use to log in.

  3. Under Confirmation method, select the method to use to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where you redirect users when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing text, in the Value input, enter a string that’s displayed only to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text, in the Value input, enter a string to display on the page when a login fails.

      • An error has occurred.

      • Your username or password is invalid.

  4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

  5. (Optional) If your automatic form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.

“Content to Include” Section

Expand the Content to Include section and consider changing the default value of any of the following parameters to fine-tune how web pages included in this source are crawled.

Maximum Depth

Enter the maximum number of link levels followed below the Site URL root page to include in the source.

Value   Crawling depth
0       Home page content is included in the source, but not the content of its linked pages.
1       Content of the home page and its linked pages is included in the source, but not the content of subpage links.
…
100     (Default) Content up to the 100th link level is included in the source.

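For example, given the following (hypothetical) link structure, a Maximum depth of 1 would include the first two pages but not the third:

    https://www.example.com/                   (depth 0 – the Site URL page)
    └─ https://www.example.com/products.html   (depth 1 – linked from the Site URL page)
       └─ https://www.example.com/widget.html  (depth 2 – linked from a depth-1 page)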
Pages to Include

Select how the Web source follows links to web pages from external domains and includes them:

  • Exclude external pages (default)

    The typical desired behavior where linked web pages that aren’t part of the domain of the specified Site URL aren’t included.

    Your Site URL is http://www.mycompany.com and one of its pages contains a link to an https://en.wikipedia.org/ page. The linked Wikipedia page isn’t crawled, and therefore isn’t included in your searchable content.

  • Include external pages, but not their subpages

    Linked web pages that aren’t part of the domain of the specified Site URL are indexed, but not any pages linked within those external pages.

    Subdomains are considered external pages, meaning that some items may not be indexed if they’re located in subdomains.

  • Include external pages and their subpages

    Pages linked in included external pages are also included.

    Including pages linked in external pages can lead anywhere on the Internet, as the crawler may keep discovering linked pages on other websites indefinitely.

    If you do select this option, you should add one or more filters, preferably Inclusion filters, to restrict the discovery to identifiable sites.

Inclusion Filters

Enter a filter to apply, and then indicate whether the filter is a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are included.

  • You can test your regexes to ensure that they match the desired URLs with tools such as Regex101.

  • You can customize regexes to meet your use case, focusing on aspects such as:

    • Case insensitivity

    • Capturing groups

    • Trailing slash inclusion

    • File extension

    For example, you want to index HTML pages on your company staging and dev websites without taking the case sensitivity or the trailing slash (/) into account, so you use the following regex:

    (?i)^.*(company-(dev|staging)).*html.?$

    The regex matches the following URLs:

    • http://company-dev/important/document.html/

    • http://ComPanY-DeV/important/document.html/ (because of (?i), the case insensitive flag)

    • http://company-dev/important/document.html (with or without trailing / because of .?)

    • http://company-staging/important/document.html/ (because of dev|staging)

    but doesn’t match the following ones:

    • http://besttech-dev/important/document.html/ (besttech isn’t included in the regex)

    • http://company-dev/important/document.pdf/ (only html files are included)

    • http://company-prod/important/document.html/ (prod isn’t included in the regex)

When you specify an inclusion filter, the page specified in the Site URL box must be within the inclusion filter scope; otherwise, no items are indexed because the starting page is excluded and the crawling process stops. If the Site URL redirects to another URL, both URLs must be within the inclusion filter scope.

The www.mycompany.com website you crawl contains versions in several languages and you want to have one source per language. For the US English source, your parameter values could be as shown in the following table.

Parameter Value
Site URL www.mycompany.com/en-us/welcome.html
Inclusion filters www.mycompany.com/en-us/*
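If you prefer a Regex pattern, an equivalent inclusion filter might look like the following (assuming the site is served over HTTP or HTTPS and the filter is matched against full URLs):

    (?i)^https?://www\.mycompany\.com/en-us/.*$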
Exclusion Filters

Enter a filter to apply, and then select whether the filter is a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are ignored.

  • Exclusion filters also apply to shortened and redirected URLs.

  • Ensure that the Site URL you specified isn’t excluded by one of your exclusion filters.

  • By default, if pages are only accessible via excluded pages, those pages will also be excluded.

    You can still include pages that are only referenced in excluded pages by setting the ExpandBeforeFiltering hidden parameter to true in the parameters section of the source JSON configuration (see Add a Hidden Source Parameter). However, setting the parameter to true can significantly reduce the crawling speed since the crawler fetches many pages that can be rejected in the end.

      "ExpandBeforeFiltering": {
      "sensitivity": false,
      "value": "true"
      }
    
  • There’s no point in indexing the search page of your website, so you exclude its URL:

    www.mycompany.com/en-us/search.html

  • You don’t want to index ZIP files that are linked from website pages:

    www.mycompany.com/en-us/*.zip

Query Parameters to Ignore

Enter query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.

By default, the source considers the whole URL when determining whether a page is a distinct item. The URLs of the website you index can contain one or more query parameters after the host name and path. Some parameters change the content of the page, and therefore legitimately make the URL distinct. Others, however, don’t affect the content; unless you enter them here to be ignored, the source may include the same page more than once, creating search result duplicates.

The URL of a website page for which you get search result duplicates looks as follows:

http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123

The values of param1 and param2 can change for the /v1/getitdone.html page without affecting its content, while the lang value changes the language in which the page appears. You therefore want to ignore the param1 and param2 query parameters, but not lang, to eliminate search result duplicates. Enter one parameter name per line:

param1

param2
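With param1 and param2 ignored (the values below are illustrative), the following URLs are then treated as the same item:

    http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123
    http://www.mysite.com/v1/getitdone.html?lang=en&param1=xyz&param2=456

Meanwhile, http://www.mysite.com/v1/getitdone.html?lang=fr remains a distinct item, since lang isn’t ignored.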

Additional Content

Select the JavaScript-rendered check box only when some website content you want to include is dynamically rendered by JavaScript. By default, the Web source doesn’t execute the JavaScript code in crawled website pages.

Selecting the JavaScript-rendered check box may significantly increase the time needed to crawl pages.

When the JavaScript takes longer than normal to execute or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with the longest rendering time are indexed with all their rendered content. Enter the time in milliseconds allowed for dynamic content to be retrieved before indexing the content. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.

“Crawling Settings” Section

Specify how you want the Coveo Cloud crawler to behave when going through the desired websites.

Crawling Limit Rate

The number of milliseconds between each request sent to retrieve content from a given domain. The default value is 1 request per 1000 milliseconds, which is the highest rate at which Coveo Cloud can crawl a public website.

If you want to increase the crawling speed for a site you own, for example when you need to retrieve the content of a large website, enter a lower number (e.g., a value of 100 allows up to 10 requests per second). For this crawling speed to apply, however, Coveo must verify that you’re the owner of the site.

  • If your source is of the Cloud type, create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo organization ID (see Organization ID and Other Information). Then, upload this file to the root of the website you want to index (e.g., for a hypothetical organization ID mycompanyxyz, the file should be reachable at https://www.example.com/coveo-ownership-mycompanyxyz.txt). Changing the default number of milliseconds between each request has no effect if you don’t also provide the expected text file proving your ownership.

  • If your source retrieves the content of an internal website via the Coveo On-Premises Crawling Module, the specified crawling rate applies automatically, as Coveo detects that the crawled site has a private IP address (see Coveo On-Premises Crawling Module, Content Retrieval Methods, and Private IPv4 Addresses). You therefore don’t have anything to do to prove ownership.

If your site’s robots.txt directives include a Crawl-delay parameter with a different value, the slowest crawling speed applies. For example, if you enter 500 milliseconds but the robots.txt file specifies Crawl-delay: 10 (i.e., 10 seconds), requests are sent at most every 10 seconds. See also the Respect robots.txt directives option.

Respect URL Casing

Clear this check box when web page URLs that you include aren’t case sensitive, meaning one unique page can be accessed with different URL casings.

When web page URLs aren’t case sensitive (Respect URL casing check box cleared), inclusion and exclusion filters are also not case sensitive.

The file system of the website you index is case insensitive, and the page URLs typically use camel casing. However, some links to these pages (followed by the source crawler) use all-lowercase URLs instead. When the Respect URL casing check box is selected (default), you get page duplicates in your source, one copy for each URL casing variant found. Clear the Respect URL casing check box to eliminate these duplicates.
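For example (hypothetical URLs), when the Respect URL casing check box is selected, the following links yield two distinct items even though they serve the same page:

    http://www.mycompany.com/Products/GetItDone.html
    http://www.mycompany.com/products/getitdone.html

Clearing the check box makes the crawler treat them as a single item.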

Respect Robots.txt Directives

Clear this check box only when you want the crawler to bypass restrictions specified in the website robots.txt file (see Robots exclusion standard).

Respect Noindex Directives

Clear this check box if you want the Coveo Cloud crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

Respect Nofollow Directives

Clear this check box if you want the Coveo Cloud crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

Respect Nofollow Anchors

Clear this check box if you want the Coveo Cloud crawler to follow links that have a rel="nofollow" attribute (see Noindex and Nofollow Directives).
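For reference, here’s what these directives can look like in practice (illustrative snippets). The meta tag and the HTTP response header carry noindex/nofollow directives, while the anchor carries a nofollow attribute:

    <meta name="robots" content="noindex, nofollow" />
    X-Robots-Tag: noindex, nofollow
    <a href="https://www.example.com/page.html" rel="nofollow">Example link</a>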

Request Retry Count

Enter the maximum number of retries for a URL if an error is encountered. The default is 3. When the value is 0, there are no retries.

Request Retry Delay

Enter the minimum delay in milliseconds to wait between a failed HTTP request and the next retry. The default is 1000 milliseconds.

Request Timeout

Enter the web request timeout value in seconds. The default is 60 seconds. When the value is 0, there’s no timeout.

Consider increasing the value if a target website sometimes responds slowly. This prevents avoidable request errors.

“Web Scraping” Section

In the JSON configuration box, enter a custom JSON configuration to precisely include page sections or extract metadata from the website pages (see Web Scraping Configuration).

“Content Security” Tab

In the Content Security tab, select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content Security.

“Access” Tab

In the Access tab, determine whether each group and API key can view or edit the source configuration (see Resource Access):

  1. In the Access Level column, select View or Edit for each available group.

  2. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you plan to make other changes soon, click Add Source/Save.

      On the Sources page, you must click Start initial build or Start required rebuild in the source Status column to add the source content or make your changes effective, respectively.

      OR

    • When you’re done editing the source and want to make changes effective, click Add and Build Source/Save and Rebuild Source.

      Back on the Sources page, you can review the progress of your source addition or modification.

    Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Optionally, consider editing or adding mappings.

    You can only manage mapping rules once you build the source (see Refresh, Rescan, or Rebuild Sources).
