Add or Edit a Web Source

Members of the Administrators and Content Managers built-in groups can use a Web source to make the content of a website searchable.

The Web source type behaves similarly to the bots of web search engines such as Google. The source only needs a starting URL; it then automatically discovers all the pages of the site by following the site navigation and the hyperlinks appearing in the pages. Consequently, only reachable pages are indexed, in a random order. By default, the source doesn’t include pages that aren’t under the URL root.

Similarly to a Sitemap source, a Web source is used for indexing an HTML site, or data that can be exported to HTML. However, a Sitemap source supports the Refresh operation, which offers faster and more efficient indexing, and is generally preferred over using a Web source.

Source Key Characteristics

Features                    Supported                           Additional information
Web page version            N/A
Searchable content type     Web pages (complete)
Content update operations   Refresh
                            Rescan                              Takes place every day by default.
                            Rebuild
Content security options    Determined by source permissions
                            Source creator
                            Everyone

Leading Practices

  • If possible, create one source per website that you want to make searchable, as this is the most stable and intuitive configuration. However, if you want to index many websites (i.e., more than 50) or if you have reached your Sources limit, consider creating sources that retrieve content from more than one website (see Content Licensing Limits).

    To optimize time and resource consumption, try balancing the size of your sources: a source may contain several websites with a few dozen pages each, or one or two larger websites. You can also leverage the Delay Between Requests parameter to increase crawling speed for the sites you own. Contact the Coveo Support team for help if needed.

  • Because refresh isn’t available for a Web source, ensure that the rescan schedule is set at a frequency that’s a good compromise between more recent search results and acceptable performance and resource impact (see Edit a Source Schedule).

  • When a connector exists for the technology powering the website, create a source based on that connector instead, as it typically indexes content, metadata, and permissions more effectively (see Connector Directory).

    You want to make an Atlassian Confluence-powered site content searchable. Create a Confluence source, not a Web source.

Features

Supported Authentication

The supported authentication methods are basic, manual form authentication (only for modifications on existing legacy sources), and form authentication.

Available Metadata

The default metadata for each item includes:

Metadata                                            Description
description, keywords, author, Content-type, ...    All meta tags included in the head of the page
Prefixed with RequestHeader or ResponseHeader       All request and response headers in separate metadata
coveo_AllRequestHeaders, coveo_AllResponseHeaders   All headers as a JSON dictionary

The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

In the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.

For meta names with colons (:), you must specify the origin explicitly in the mapping since the colon is the delimiter for the origin of the metadata (see Mapping Rule Syntax Reference).

For example, og:title:crawler.
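The attribute-based selection rule above can be sketched with Python’s standard html.parser. This is an illustrative helper, not Coveo’s actual extraction logic:

```python
from html.parser import HTMLParser

# Attributes that key an indexable meta tag, per the table above.
KEYING_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")

class MetaExtractor(HTMLParser):
    """Collects the content attribute of meta tags keyed by a supported attribute."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        for key in KEYING_ATTRIBUTES:
            if key in attrs and "content" in attrs:
                self.metadata[attrs[key]] = attrs["content"]
                break

def extract_meta(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.metadata

page = (
    '<head><meta property="og:title" content="The Article Title"/>'
    '<meta name="description" content="A summary."/>'
    '<meta charset="utf-8"/></head>'  # no keying attribute: not collected
)
print(extract_meta(page))
# {'og:title': 'The Article Title', 'description': 'A summary.'}
```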

Web Scraping

Using the web scraping feature, you can exclude sections of a page, extract metadata from the page, and even create separate index items from specific sections of a single web page (see Web Scraping Configuration).

JavaScript Support

The crawler can run the underlying JavaScript in website pages to dynamically render content to index.

Robots.txt Crawl-Delay and Page Restrictions Support

By default, the instructions of the robots.txt file associated with the website are respected.

The source doesn’t support other parameters such as visit-time and request-rate.
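The supported directives can be checked locally with Python’s standard urllib.robotparser, here run against a made-up robots.txt rather than a real site:

```python
from urllib import robotparser

# A made-up robots.txt for illustration.
rules = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Disallow rules block matching paths; Crawl-delay is exposed per user agent.
print(rp.can_fetch("Coveobot", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("Coveobot", "https://www.example.com/public/page.html"))   # True
print(rp.crawl_delay("Coveobot"))  # 2
```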

Limitations

  • Refresh isn’t available; therefore, by default, a daily rescan is performed to retrieve changes.

  • Indexing page permissions, if any, isn’t supported.

  • JavaScript menus and pop-up pages aren’t supported.

  • Only pages reachable through website page links are indexed.

  • Although the MaxPageSizeInBytes parameter is set to 0 (unlimited size) by default in the source JSON configuration, the Coveo Cloud indexing pipeline can only handle web pages up to 512 MB (see Edit a Source JSON Configuration). Larger pages are indexed by reference, i.e., their content is ignored by the Coveo Cloud crawler, and only their metadata and path are searchable (see Indexing by Reference). As a result, no Quick View is available for these larger items (see Search Result Quick View).

  • Crawling performance depends heavily on the responding web server.

  • Pause and resume source operations aren’t yet supported (see Resume a Paused Source Update). Therefore, Web source operations can’t be paused on error.

  • When the Render-Javascript option is enabled, the Web connector doesn’t support sending AdditionalHeaders.

  • When the Render-Javascript option is enabled, Basic Authentication isn’t supported.

The Sitemap source may be a better solution when the website features a sitemap file.

Add or Edit a Web Source

When adding a source, in the Add a source of content panel, click the Cloud or the Crawling Module tab, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.

To edit a source, on the Sources page, click the desired source, and then, in the Action bar, click Edit.

“Configuration” Tab

In the Add/Edit a Web Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General Information

Source Name

Enter a name for your source.

A source name can’t be modified once it’s saved, therefore be sure to use a short and descriptive name, using letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

Site URL

The URL of a starting website page, typically a home page, from which the crawler starts discovering the website following links found in pages.

You can enter more than one starting website page, for example, to allow the crawler to see links leading to all the website pages that you want to index.

Avoid crawling more than one site in a given source. Instead, create one source for each website. This way, you can optimize source parameters for each website.

Ensure that you have the right to crawl the public content if you aren’t the owner of the website. Crawling websites that you don’t own or don’t have the right to crawl could create accessibility issues.

Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you are unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, such mechanisms may identify the Coveo crawler as an attack and block it from any further crawling.

If you want to index only one or a few specific pages of a site, such as for a test, enter the pages to index in the Site URL box, and then edit the source JSON configuration to set the MaxCrawlDepth parameter value to 0, instructing the crawler to only index the specified pages, and none of their linked pages.
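The effect of MaxCrawlDepth can be sketched as a depth-limited traversal. The link graph below is hypothetical, and this is an illustration of the behavior, not the actual crawler:

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "https://site/home": ["https://site/a", "https://site/b"],
    "https://site/a": ["https://site/a/deep"],
    "https://site/b": [],
    "https://site/a/deep": [],
}

def crawl(start_urls, max_depth):
    """Depth-limited discovery: max_depth=0 indexes only the starting pages."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

print(sorted(crawl(["https://site/home"], max_depth=0)))  # only the starting page
print(sorted(crawl(["https://site/home"], max_depth=1)))  # home plus directly linked pages
```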

Paired Crawling Module

If your source is a Crawling Module source and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

Index

When adding a source, if you have more than one logical (non-Elasticsearch) index in your organization, select the index in which the retrieved content will be stored (see Leverage Many Coveo Indexes). If your organization only has one index, this drop-down menu isn’t visible and you have no decision to make.

  • To add a source storing content in an index different than default, you need the View access level on the Logical Index domain (see Manage Privileges and Logical Indexes Domain).

  • Once the source is added, you can’t switch to a different index.

“Content to Include” Section

Consider changing the default value of any of the following parameters to fine-tune how web pages included in this source are crawled.

Inclusion Filters

Enter a filter to apply, and then indicate whether the filter uses a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are included.

  • You can test your regexes to ensure that they match the desired URLs with tools such as Regex101.

  • You can customize regexes to meet your use case focusing on aspects such as:

    • Case insensitivity

    • Capturing groups

    • Trailing slash inclusion

    • File extension

    For example, you want to index HTML pages on your company staging and dev websites without taking the case sensitivity or the trailing slash (/) into account, so you use the following regex:

    (?i)^.*(company-(dev|staging)).*html.?$

    The regex matches the following URLs:

    • http://company-dev/important/document.html/

    • http://ComPanY-DeV/important/document.html/ (because of (?i), the case insensitive flag)

    • http://company-dev/important/document.html (with or without trailing / because of .?)

    • http://company-staging/important/document.html/ (because of dev|staging)

    but doesn’t match the following ones:

    • http://besttech-dev/important/document.html/ (besttech isn’t included in the regex)

    • http://company-dev/important/document.pdf/ (only html files are included)

    • http://company-prod/important/document.html/ (prod isn’t included in the regex)
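The example regex above can be verified with Python’s re module before saving it in the source configuration:

```python
import re

# The inclusion-filter regex from the example above.
pattern = re.compile(r"(?i)^.*(company-(dev|staging)).*html.?$")

matching = [
    "http://company-dev/important/document.html/",
    "http://ComPanY-DeV/important/document.html/",
    "http://company-dev/important/document.html",
    "http://company-staging/important/document.html/",
]
non_matching = [
    "http://besttech-dev/important/document.html/",
    "http://company-dev/important/document.pdf/",
    "http://company-prod/important/document.html/",
]

for url in matching:
    assert pattern.match(url), url
for url in non_matching:
    assert not pattern.match(url), url
print("all checks passed")
```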

When you specify an inclusion filter, the page specified in Site URL box must be part of the inclusion filter scope, otherwise no items are indexed because the starting page is excluded and the crawling process stops. In case the Site URL redirects to another URL, both of them must be part of the inclusion filter scope.

The www.mycompany.com website you crawl contains versions in several languages and you want to have one source per language. For the US English source, your parameter values could be as shown in the following table.

Parameter Value
Site URL www.mycompany.com/en-us/welcome.html
Inclusion filters www.mycompany.com/en-us/*
Exclusion Filters

Enter a filter to apply, and then indicate whether the filter uses a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are ignored.

  • Exclusion filters also apply to shortened and redirected URLs.

  • Ensure that the Site URL you specified isn’t excluded by one of your exclusion filters.

  • By default, if pages are only accessible via excluded pages, those pages will also be excluded.

    You can still include pages that are only referenced in excluded pages by setting the ExpandBeforeFiltering hidden parameter to true in the parameters section of the source JSON configuration (see Add a Hidden Source Parameter). However, setting the parameter to true can significantly reduce the crawling speed since the crawler fetches many pages that can be rejected in the end.

    For example:

      "ExpandBeforeFiltering": {
      "sensitivity": false,
      "value": "true"
      }
    
  • There’s no point in indexing the search page of your website, so you exclude its URL:

    www.mycompany.com/en-us/search.html

  • You don’t want to index ZIP files that are linked from website pages:

    www.mycompany.com/en-us/*.zip

Query Parameters to Ignore

Enter query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.

By default, the source considers the whole URL to determine whether a page is a distinct item. The URLs of the website you index can contain one or more query parameters after the host name and the path. Some of these parameters change the content of the page, and therefore legitimately identify a distinct URL. Others don’t affect the content; unless you enter them here to be ignored, the source may include the same page more than once, creating search result duplicates.

The URL of a website page for which you get search result duplicates looks as follows:

http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123

The values of param1 and param2 can change for the /v1/getitdone.html page without affecting its content while the lang value changes the language in which the page appears. You want to ignore the param1 and param2 query parameters to eliminate search result duplicates, not lang. You enter one parameter name per line:

param1

param2

Wildcards aren’t supported in query parameter names. For instance, in the example above, if you enter param* to cover both the param1 and param2 query string parameters, the entry has no effect, and you still get search result duplicates of the /v1/getitdone.html page, each with a different combination of param1, param2, and lang values.
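The deduplication logic can be sketched with Python’s urllib.parse: drop the ignored parameters, keep the rest, and compare the resulting canonical URLs. The helper name is hypothetical; it only illustrates the rule described above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters to ignore, as in the example above.
IGNORED = {"param1", "param2"}

def canonical(url, ignored=IGNORED):
    """Rebuild the URL without the ignored query parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = "http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123"
b = "http://www.mysite.com/v1/getitdone.html?lang=en&param1=xyz&param2=456"
c = "http://www.mysite.com/v1/getitdone.html?lang=fr&param1=abc&param2=123"

print(canonical(a) == canonical(b))  # True: same item once param1/param2 are ignored
print(canonical(a) == canonical(c))  # False: lang changes the page content
```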

“Crawling Settings” Section

Check this box to index the site subdomains.

“Authentication” Section

When the website you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to access the secured content (see Source Credentials Leading Practices).

The Web source type supports the following authentication types. Click the desired authentication method for details on the parameters to configure.

  • Basic authentication

    (Only when indexing HTTPS URLs) Select this option when the desired website uses the standard HTTP basic authentication scheme (see Basic access authentication).

  • Manual form authentication

    Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead.

    Select this option when the desired website presents users with a form to fill to log in. You must specify the form input names and values.

  • Form authentication

    Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

Basic Authentication

When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.

When the Coveo Cloud crawler follows links requiring basic authentication while indexing your website, it only provides the basic authentication credentials you entered if the link has the same scheme, domain, and subdomain as the starting Site URL. Conversely, if the link differs on any of these, the Coveo Cloud crawler doesn’t try to authenticate. If you want the Coveo Cloud crawler to authenticate and index a site with a different scheme, domain, or subdomain, you must include its address under Site URL.

For example, your starting address is https://www.example.com. The Coveo Cloud crawler doesn’t provide the basic authentication credentials you provided if the link requiring them belongs to:

  • A different scheme, i.e., it uses HTTP instead of HTTPS

  • A different domain, such as https://www.mysite.com

  • A different subdomain, such as https://www.intranet.example.com

Since you want your basic authentication credentials to be provided when the Coveo Cloud crawler follows a link starting with https://www.intranet.example.com, you enter this URL under Site URL.
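The scoping rule above can be sketched as a comparison of scheme and host. This is an illustration of the rule, not the crawler’s actual implementation:

```python
from urllib.parse import urlsplit

def credentials_apply(site_url, link):
    """Credentials are only sent when the link shares the starting URL's scheme and host."""
    site, target = urlsplit(site_url), urlsplit(link)
    return site.scheme == target.scheme and site.hostname == target.hostname

start = "https://www.example.com"
print(credentials_apply(start, "https://www.example.com/docs"))      # True
print(credentials_apply(start, "http://www.example.com/docs"))       # False (different scheme)
print(credentials_apply(start, "https://www.mysite.com"))            # False (different domain)
print(credentials_apply(start, "https://www.intranet.example.com"))  # False (different subdomain)
```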

Manual Form Authentication

Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead.

When selecting Manual form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. (Optional) When there’s more than one form on the login page, enter the Form name.

  3. Click the Action method drop-down menu, and then select the HTTP method to use to submit the authentication request. Available options are POST or GET.

  4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

  5. (Optional) When the authentication request should be sent to another URL than the specified Form URL, enter this other URL under Action URL. Otherwise, leave empty.

  6. Inspect the form login page HTML code to locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter the input name attribute values under Username input name and Password input name.

    Based on the following HTML code:

    <input name="login" type="email" />

    <input name="pwd" type="password" />

    login is the username input name and pwd is the password input name.

  7. Under Username input value and Password input value, enter respectively the username and password parameter values.

  8. When your form uses other parameters than username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter, as with the username and password.

    Under Other inputs, input values are displayed in clear text. Therefore, make sure you enter your sensitive information, i.e., username and password, above the Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).

  9. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string to show to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string to show when a login fails.

      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

Form Authentication

When selecting Form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. In the Username and Password boxes, enter the credentials to use to log in. See Source Credentials Leading Practices.

  3. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      ASP.NET_SessionId

    • When selecting Missing URL, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string to show to authenticated users.

      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string to show when a login fails.

      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo considers that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

    In addition, if you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.

  4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

  5. (Optional) If your form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.

“Crawling Settings” Section

Specify how you want the Coveo Cloud crawler to behave when going through the desired websites.

Delay Between Requests

The number of milliseconds between each request sent to retrieve content from a specified domain. The default value is 1 request per 1000 milliseconds, which is the highest rate at which Coveo Cloud can crawl a public website.

If you want to increase the crawling speed for a site you own, for example if you need to retrieve the content of a large website, enter a lower number. For this crawling speed to apply, however, Coveo must verify that you’re the owner of the site.

  • If your source is of the Cloud type, create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo organization ID (see Organization ID and Other Information). Then, upload this file at the root of the website you want to index. Changing the default number of milliseconds between each request has no effect if you don’t also provide the expected text file proving your ownership.

  • If your source retrieves the content of an internal website via the Coveo On-Premises Crawling Module, the specified crawling rate applies automatically, as Coveo detects that the crawled site has a private IP address (see Coveo On-Premises Crawling Module, Content Retrieval Methods, and Private IPv4 Addresses). You therefore don’t have anything to do to prove ownership.

If your site’s robots.txt directives include a Crawl-delay parameter with a different value, the slowest crawling speed applies. See also the Respect robots.txt directives option.

Respect Robots.txt Directives

Clear this check box only when you want the crawler to bypass restrictions specified in the website robots.txt file (see Robots exclusion standard).

Respect Noindex Directives

Clear this check box if you want the Coveo Cloud crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

Respect Nofollow Directives

Clear this check box if you want the Coveo Cloud crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

Respect Nofollow Anchors

Clear this check box if you want the Coveo Cloud crawler to follow links that have a rel="nofollow" attribute (see Noindex and Nofollow Directives).

Render JavaScript

Check this box only when some website content you want to include is dynamically rendered by JavaScript. By default, the Web source doesn’t execute the JavaScript code in crawled website pages.

Selecting the Render JavaScript check box may significantly increase the time needed to crawl pages.

When the JavaScript takes longer to execute than normal or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that pages with the longest rendering time are indexed with all their rendered content. Enter the time in milliseconds allowed for dynamic content to be retrieved before indexing the content. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.

Make Text Found in Image Files Searchable (OCR)

Check this box if you want Coveo Cloud to extract text from image files.

OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.

Make Text Found in PDF Files With Images Searchable (OCR)

Check this box if you want Coveo Cloud to extract text from PDF files containing images.

OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable Optical Character Recognition for details on this feature.

User Agent

The user agent string that you want Coveo Cloud to send with HTTP requests to identify itself when downloading pages.

The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).

“Web Scraping” Section

In the JSON configuration box, enter a custom JSON configuration to precisely include page sections or extract metadata from the website pages (see Web Scraping Configuration).

“Content Security” Tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content Security.

“Access” Tab

In the Access tab, determine whether each group and API key can view or edit the source configuration (see Resource Access):

  1. In the Access Level column, select View or Edit for each available group.

  2. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to do other changes soon, click Add Source/Save.

      To add the source content or to make your changes effective, on the Sources page, you must click Launch build or Start required rebuild in the source Status column.

      OR

    • When you’re done editing the source and want to make changes effective, click Add and Build Source/Save and Rebuild Source.

      Back on the Sources page, you can review the progress of your source addition or modification.

    Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Optionally, consider editing or adding mappings.

    You can only manage mapping rules once you build the source (see Refresh, Rescan, or Rebuild Sources).

What’s Next?

Recommended Articles