Add or Edit a Web Source

Important

Coveo is discontinuing use of the PhantomJS web driver in its Web and Sitemap sources in January 2023.

Learn more about what you need to do.

Members with the required privileges can use a Web source to make the content of a website searchable.

The Web source type behaves similarly to the bots of web search engines such as Google. The source only needs a starting URL; it then automatically discovers all the pages of the site by following the site navigation and the hyperlinks appearing in the pages. Consequently, only reachable pages are indexed, in no particular order. By default, the source doesn’t include pages that aren’t under the URL root.

Similarly to a Sitemap source, a Web source is used for indexing an HTML site, or data that can be exported to HTML. However, a Sitemap source supports the Refresh operation, which offers faster and more efficient indexing, and is generally preferred over using a Web source.

Tip
Leading practice

The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.

Source Key Characteristics

Leading Practices

  • If possible, create one source per website that you want to make searchable, as this is the most stable and intuitive configuration. However, if you want to index many websites (i.e., more than 50) or if you have reached your Sources limit, consider creating sources that retrieve content from more than one website.

    To optimize time and resource consumption, try balancing the size of your sources: a source may contain several websites with a few dozen pages each, or one or two larger websites. You can also leverage the Delay Between Requests parameter to increase crawling speed for the sites you own. Contact the Coveo Support team for help if needed.

  • Schedule rescan operations following the rate at which your source content changes.

  • When a connector exists for the technology powering the website, create a source based on that connector instead, as it typically indexes content, metadata, and permissions more effectively (see Connector Directory).

Example

You want to make an Atlassian Confluence-powered site content searchable. Create a Confluence source, not a Web source.

Features

Supported Authentication

The supported authentication methods are basic, manual form authentication (only for modifications on existing legacy sources), and form authentication.

Available Metadata

The default metadata for each item includes:

  • description, keywords, author, Content-type, … : all meta tags included in the head of the page.

    Note

    The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

    For example, in the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.

  • Metadata prefixed with RequestHeader and ResponseHeader: all request and response headers in separate metadata.

  • coveo_AllRequestHeaders and coveo_AllResponseHeaders: all headers as a JSON dictionary.

Note

For meta names with colons (:), you must specify the origin explicitly in the mapping since the colon is the delimiter for the origin of the metadata (see Mapping Rule Syntax Reference).

For example, og:title:crawler.
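
Assuming the standard %[...] mapping rule syntax, a mapping rule referencing this metadata could then look like %[og:title:crawler].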

Web Scraping

Using the web scraping feature, you can exclude sections of a page, extract metadata from the page, and even create separate index items from specific sections of a single web page (see Web Scraping Configuration).

JavaScript Support

The crawler can run the underlying JavaScript in website pages to dynamically render content to index.

Robots.txt Crawl-Delay and Page Restrictions Support

By default, the instructions of the robots.txt file associated with the website are respected.

Note

The source doesn’t support other robots.txt parameters such as visit-time and request-rate.
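
Example

A hypothetical robots.txt file at the root of the crawled website could look as follows. The crawler honors the Disallow and Crawl-delay directives, but ignores unsupported directives such as visit-time and request-rate:

User-agent: *
Disallow: /internal/
Crawl-delay: 2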

Limitations

  • Multi-factor authentication (MFA) and CAPTCHA aren’t supported.

  • Refresh isn’t available. A daily rescan is therefore defined, which you can enable on a per-source basis.

  • Indexing page permissions, if any, isn’t supported.

  • JavaScript menus and pop-up pages aren’t supported.

  • Only pages reachable through website page links are indexed.

  • Although the MaxPageSizeInBytes parameter is set to 0 (unlimited size) by default in the source JSON configuration, the Coveo indexing pipeline can only handle web pages up to 512 MB (see Edit a Source JSON Configuration). Larger pages are indexed by reference, i.e., their content is ignored by the Coveo crawler, and only their metadata and path are searchable. As a result, no Quick View is available for these larger items (see Search Result Quick View).

  • Crawling performance depends heavily on the responding web server.

  • Pausing and resuming source updates isn’t yet supported. Therefore, Web source operations can’t be paused on error.

  • When the Render-Javascript option is enabled, the Web connector doesn’t support sending AdditionalHeaders.

  • When the Render-Javascript option is enabled, Basic Authentication isn’t supported.

Note

The Sitemap source may be a better solution when the website features a sitemap file.

Add or Edit a Web Source

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.

  2. In the Add a source of content panel, click the Cloud (cloud-blue) or Crawling Module (crawling-bot-blue) tile, depending on whether you need to use the Coveo On-Premises Crawling Module to retrieve your content. See Content Retrieval Methods for details.

To edit a source, on the Sources (platform-ca | platform-eu | platform-au) page, click the desired source, and then click Edit in the Action bar.

Tip
Leading practice

It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually.

See About non-production organizations for more information and best practices regarding sandbox organizations.

"Configuration" Tab

In the Add/Edit a Web Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General Information

Source Name

Enter a name for your source.

Tip
Leading practice

A source name can’t be modified once it’s saved, so be sure to use a short and descriptive name with letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

Site URL

The URL of a starting website page, typically a home page, from which the crawler starts discovering the website by following the links found in its pages.

You can enter more than one starting website page, for example, to allow the crawler to see links leading to all the website pages that you want to index.

Avoid crawling more than one site with a given source. Instead, create one source for each website. This way, you can optimize source parameters for each website.

Note

If you aren’t the owner of the website, ensure that you have the right to crawl its public content. Crawling websites that you neither own nor have the right to crawl could create accessibility issues.

Furthermore, certain websites may use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, we recommend investigating and learning about them beforehand. For example, such software may detect the Coveo crawler as an attack and block it from any further crawling.

Tip
Leading practice

If you want to index only one or a few specific pages of a site, such as for a test, enter the pages to index in the Site URL box, and then edit the source JSON configuration to set the MaxCrawlDepth parameter value to 0, instructing the crawler to only index the specified pages, and none of their linked pages.
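
For illustration only, the relevant portion of the source JSON configuration could then resemble the following sketch; the exact structure of your configuration may differ, so verify it against Edit a Source JSON Configuration:

"parameters": {
  "MaxCrawlDepth": {
    "sensitive": false,
    "value": "0"
  }
}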

Paired Crawling Module

If your source is a Crawling Module source, and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

"Content to Include" Section

Consider changing the default value of any of the parameters in this section to fine-tune how web pages included in this source are crawled.

When specifying inclusion or exclusion filters, ensure that the page specified in the Site URL box isn’t filtered out. Otherwise, no items are indexed because the starting page is excluded and the crawling process never starts. If the Site URL redirects to another URL, ensure that neither one is excluded by your filter settings.

Example

The www.mycompany.com website you crawl contains versions in several languages and you want to have one source per language. For the US English source, your parameter values could be as shown in the following table.

Parameter Value

Site URL

www.mycompany.com/en-us/welcome.html

Inclusion filters

www.mycompany.com/en-us/*

You can index pages that are only referenced in excluded pages by setting the ExpandBeforeFiltering parameter to true in the parameters section of the source JSON configuration. This way, even if your Site URL is excluded by your filters, pages referenced in the Site URL page are retrieved before the filtering is applied.

Note

Setting the ExpandBeforeFiltering parameter to true can significantly reduce the crawling speed since the crawler retrieves many pages that can be rejected in the end.
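
As a sketch only, assuming the same parameters structure shown earlier for MaxCrawlDepth, the entry could look like this in the source JSON configuration:

"parameters": {
  "ExpandBeforeFiltering": {
    "sensitive": false,
    "value": "true"
  }
}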

Inclusion Filters

Your source indexes only the pages that match a URL expression specified in this section.

Important
  • The Site URL that you specified for your source must be part of the inclusion filter scope, otherwise the corresponding content won’t be indexed.

  • Case sensitivity or insensitivity of inclusion filters depends on your RespectUrlCasing value.

  1. Enter a URL expression to apply as the inclusion filter.

  2. Select whether the URL expression uses a Wildcard or a Regex (regular expression) pattern.

Tip
Leading practice

You can test your regexes to ensure that they match the desired URLs with tools such as Regex101.

You can customize regexes to meet your use case, focusing on aspects such as:

  • Case insensitivity

  • Capturing groups

  • Trailing slash inclusion

  • File extension

For example, you want to index HTML pages on your company staging and dev websites without taking the case sensitivity or the trailing slash (/) into account, so you use the following regex:

(?i)^.*(company-(dev|staging)).*html.?$

The regex matches the following URLs:

  • http://company-dev/important/document.html/

  • http://ComPanY-DeV/important/document.html/ (because of (?i), the case insensitive flag)

  • http://company-dev/important/document.html (with or without trailing / because of .?)

  • http://company-staging/important/document.html/ (because of dev|staging)

but doesn’t match the following ones:

  • http://besttech-dev/important/document.html/ (besttech isn’t included in the regex)

  • http://company-dev/important/document.pdf/ (only html files are included)

  • http://company-prod/important/document.html/ (prod isn’t included in the regex)

Example

The www.mycompany.com website you crawl contains versions in several languages and you want to have one source per language. For the US English source, if the source URL is www.mycompany.com/en-us/welcome.html, the inclusion filter would be www.mycompany.com/en-us/*.

Exclusion Filters

Your source ignores content from pages that match a URL expression specified in this section.

Important
  • The Site URL that you specified for your source must not be part of the exclusion filter scope, otherwise the corresponding content won’t be indexed.

  • Case sensitivity or insensitivity of exclusion filters depends on your RespectUrlCasing value.

  1. Enter a URL expression to apply as the exclusion filter.

    Notes
    • Exclusion filters also apply to shortened and redirected URLs.

    • By default, if pages are only accessible via excluded pages, those pages will also be excluded.

    • Exclusion filters for SharePoint Online sources are not case sensitive when using a Regex (regular expression). For example, (company-(dev|staging)).*html.?$ will match http://ComPanY-dev/important/document.html without adding any additional symbols to account for case sensitivity. Exclusion filters are case sensitive when using Wildcard expressions.

  2. Select whether the URL expression uses a Wildcard or a Regex (regular expression) pattern.

Examples
  • There’s no point in indexing the search page of your website, so you exclude its URL:

    www.mycompany.com/en-us/search.html

  • You don’t want to index ZIP files that are linked from website pages:

    www.mycompany.com/en-us/*.zip

Query Parameters to Ignore

Enter query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.

By default, the source considers the whole URL to determine whether it’s a distinct item. The URLs of the website you index can contain one or more query parameters after the host name and the path. Some of these parameters change the content of the page and therefore legitimately make the URL distinct. Others don’t affect the content; if you don’t enter them here to be ignored, the source may index the same page more than once, creating search result duplicates.

Example

The URL of a website page for which you get search result duplicates looks as follows:

http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123

The values of param1 and param2 can change for the /v1/getitdone.html page without affecting its content, while the lang value changes the language in which the page appears. You therefore want to ignore the param1 and param2 query parameters to eliminate search result duplicates, but not lang. Enter one parameter name per line:

param1

param2

Note

Wildcards aren’t supported in query parameter names. For instance, in the example above, if you entered param* to cover both the param1 and param2 query string parameters, the entry would have no effect, and you would still get search result duplicates of the /v1/getitdone.html page, each having a different combination of param1, param2, and lang values.

Additional Content

Check the Include Subdomains box to index the site subdomains.

"Authentication" Section

When the website you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo organization to access the secured content. See Source Credentials Leading Practices.

Important

Multi-factor authentication (MFA) and CAPTCHA aren’t supported.

The Web source type supports the following authentication types. Click the desired authentication method for details on the parameters to configure.

  • Basic authentication

    (Only when indexing HTTPS URLs) Select this option when the desired website uses the normal NTLM identity (see Understanding HTTP Authentication).

  • Manual form authentication

    Note

    Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead.

    Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.

  • Form authentication

    Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

Basic Authentication

When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source Credentials Leading Practices.

Note

When the Coveo crawler follows links requiring basic authentication while indexing your website, it only provides the basic authentication credentials you entered if the link has the same scheme, domain, and subdomain as the starting Site URL. Otherwise, the Coveo crawler doesn’t try to authenticate. If you want the Coveo crawler to authenticate and index a site with a different scheme, domain, and/or subdomain, you must include its address under Site URL.

For example, your starting address is https://www.example.com. The Coveo crawler doesn’t supply the basic authentication credentials you entered if the link requiring them belongs to:

  • A different scheme, i.e., it uses HTTP instead of HTTPS

  • A different domain, such as https://www.mysite.com

  • A different subdomain, such as https://www.intranet.example.com

Since you want your basic authentication credentials to be provided when the Coveo crawler follows a link starting with https://www.intranet.example.com, you enter this URL under Site URL.

Manual Form Authentication
Note

Manual form authentication is now only available for modifications to legacy sources. To create new sources, use Form authentication instead.

When selecting Manual form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. (Optional) When there’s more than one form on the login page, enter the Form name.

  3. Click the Action method dropdown menu, and then select the HTTP method to use to submit the authentication request. Available options are POST or GET.

  4. Click the Content type dropdown menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

  5. (Optional) When the authentication request should be sent to a URL other than the specified Form URL, enter this other URL under Action URL. Otherwise, leave the box empty.

  6. Inspect the form login page HTML code to locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter the input name attribute values under Username input name and Password input name.

    Example

    Based on the following HTML code:

    <input name="login" type="email" />

    <input name="pwd" type="password" />

    login is the username input name and pwd is the password input name.

  7. Under Username input value and Password input value, enter respectively the username and password parameter values.

  8. When your form uses parameters other than the username and password, we recommend that you select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter, as with the username and password.

    Important

    Under Other inputs, input values are displayed in clear text. You must therefore enter your sensitive information, i.e., the username and password, above Other inputs, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).

  9. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      Example

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      Example

      ASP.NET_SessionId

    • If you select Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.

      Examples
      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      Examples
      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

Form Authentication

When selecting Form authentication:

  1. In the Form URL box, enter the website login page URL.

  2. In the Username and Password boxes, enter the credentials to use to log in. See Source Credentials Leading Practices.

  3. Under Confirmation method, select the method that will determine if the authentication request failed.

    Depending on the selected confirmation method, enter the appropriate value:

    • When selecting Redirection to, enter the URL where the Coveo crawler is redirected when the login fails.

      Example

      https://mycompany.com/login/failed.html

    • When selecting Missing cookie, in the Value input, enter the name of the cookie that’s set when an authentication is successful.

      Example

      ASP.NET_SessionId

    • If you select Missing URL, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler isn’t redirected to a URL matching the specified pattern.

    • When selecting Missing text, in the Value input, enter a string that’s displayed to authenticated users.

      Examples
      • Hello, jsmith@mycompany.com!

      • Log out

    • When selecting Text presence, in the Value input, enter a string that’s displayed when a login fails.

      Examples
      • An error has occurred.

      • Your username or password is invalid.

    • When selecting URL presence, in the Value input, enter a regex. Coveo will consider that an authentication request has failed if its crawler is redirected to a URL matching the specified pattern.

  4. If you want Coveo’s first request to be for authentication, regardless of whether authentication is actually required, check the Force login box.

  5. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

  6. (Optional) If your form authentication configuration doesn’t work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for help.

"Crawling Settings" Section

Specify how you want the Coveo crawler to behave when going through the desired websites.

Delay Between Requests

The number of milliseconds between each request sent to retrieve content from a specified domain. The default value is 1 request per 1000 milliseconds, which is the highest rate at which Coveo can crawl a public website.

If you want to increase the crawling speed for a site you own, for example if you need to retrieve the content of a large website, enter a lower number. For this crawling speed to apply, however, Coveo must verify that you’re the owner of the site.

  • If your source is of the Cloud type, create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo organization ID (see Organization ID and Other Information). Then, upload this file at the root of the website you want to index. Changing the default number of milliseconds between each request has no effect if you don’t also provide the expected text file proving your ownership.

  • If your source retrieves the content of an internal website via the Coveo On-Premises Crawling Module, the specified crawling rate applies automatically, as Coveo detects that the crawled site has a private IP address (see Coveo On-Premises Crawling Module, Content Retrieval Methods, and Private IPv4 Addresses). You therefore don’t have anything to do to prove ownership.

Note

If your site has robots.txt directives that include a Crawl-delay parameter with a different value, the slowest crawling speed applies. See also Robots.txt Crawl-Delay and Page Restrictions Support.

Respect Robots.txt Directives

Clear this check box only when you want the crawler to ignore restrictions specified in the website robots.txt file (see The "Respect Robots.txt Directives" setting).

Respect Noindex Directives

Clear this check box if you want the Coveo crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header (see The "Respect Noindex Directives" and "Respect Nofollow Directives" settings).

Respect Nofollow Directives

Clear this check box if you want the Coveo crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header (see The "Respect Noindex Directives" and "Respect Nofollow Directives" settings).

Respect Nofollow Anchors

Clear this check box if you want the Coveo crawler to follow links that have a rel="nofollow" attribute (see The "Respect Nofollow Anchors" setting).

Render JavaScript

Check this box only when some website content you want to include is dynamically rendered by JavaScript. By default, the Web source doesn’t execute the JavaScript code in crawled website pages.

Important

Selecting the Render JavaScript check box may significantly increase the time needed to crawl pages.

Note

When the JavaScript takes longer to execute than normal or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure that the pages with longest rendering time are indexed with all the rendered content. Enter the time in milliseconds allowed for dynamic content to be retrieved before indexing the content. When the value is 0 (default), the crawler doesn’t wait after the page is loaded.

Make Text Found in Image Files Searchable (OCR)

Enable this option if you want Coveo to extract text from image files. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable optical character recognition for details on this feature.

Note

Contact Coveo Sales to add this feature to your organization license.

Make Text Found in PDF Files With Images Searchable (OCR)

Enable this option if you want Coveo to extract text from PDF files containing images. OCR-extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick View. See Enable optical character recognition for details on this feature.

Note

Contact Coveo Sales to add this feature to your organization license.

User Agent

The user agent string that you want Coveo to send with HTTP requests to identify itself when downloading pages.

The default value used is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).

"Web Scraping" Section

In the JSON configuration box, enter a custom JSON configuration to precisely include page sections or extract metadata from the website pages (see Web Scraping Configuration).
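
For illustration, a minimal web scraping configuration could resemble the following sketch, which excludes a page header and extracts an author metadata value; the CSS selectors are hypothetical, and you should validate the exact schema and options against Web Scraping Configuration:

[
  {
    "for": {
      "urls": [".*"]
    },
    "exclude": [
      {
        "type": "CSS",
        "path": "header"
      }
    ],
    "metadata": {
      "author": {
        "type": "CSS",
        "path": ".author::text"
      }
    }
  }
]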

"Content Security" Tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.

"Access" Tab

In the Access tab, set whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration while Group B can only view it.

See Custom access level for more information.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you plan to make other changes soon, click Add source/Save.

      Note

      On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

    • When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.

      Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.

      Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Optionally, consider editing or adding mappings once your source is done building or rebuilding.

Required privileges

You can assign privileges to allow access to specific tools in the Coveo Administration Console. The following list indicates the privileges required to view or edit elements of the Sources (platform-ca | platform-eu | platform-au) page and associated panels. See Manage privileges and Privilege reference for more information.

Note

The Edit all privilege isn’t required to create sources. When granting privileges for the Sources domain, you can grant a group or API key the View all or Custom access level, instead of Edit all, and then select the Can Create check box to allow users to create sources. See Can Create ability dependence for more information.

  • View sources, view source update schedules, and subscribe to source notifications:

    • Content service, Fields and Sources domains: View access level

    • Organization service, Organization domain: View access level

  • Edit sources, edit source update schedules, and view the View Metadata page:

    • Content service, Fields and Sources domains: Edit access level

    • Content service, Source metadata domain: View access level

    • Organization service, Organization domain: View access level

What’s Next?