Add a Web source

Members with the required privileges can use a Web source to make the content of a website searchable.

The Web source crawler behaves similarly to bots of web search engines such as Google. The source only needs a starting URL and then automatically discovers all the pages of the site following the site navigation and all <a href> hyperlinks in the page HTML, including those without visible link text. Only pages that are discovered are indexed, and in the order they’re discovered.
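The discovery behavior described above can be pictured as a breadth-first walk over `<a href>` links. The following is a minimal sketch, not Coveo’s implementation: the `discover` function, the `fetch` callable, and the in-memory `SITE` dictionary are hypothetical names used only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, with or without link text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover(start_url, fetch):
    """Return page URLs in the order they are discovered from start_url.

    fetch is a hypothetical callable returning the HTML of a URL,
    or None when the page can't be retrieved.
    """
    order, seen, queue = [], {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Toy in-memory site standing in for real HTTP responses (assumption).
SITE = {
    "https://example.com/": '<a href="/a">A</a><a href="/b"></a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.com/b": "<p>No links here</p>",
    "https://example.com/c": '<a href="/">Home</a>',
    "https://example.com/orphan": "<p>Never linked</p>",
}

pages = discover("https://example.com/", SITE.get)
```

In this toy site, the `/orphan` page is never returned because no page links to it, which is why adding extra starting URLs can matter for isolated sections.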

Source key characteristics

The following table presents the main characteristics of a Web source.

| Features | | Additional information |
|---|---|---|
| Indexable content | Webpages (complete) | |
| Content update operations | refresh | |
| | rescan | Takes place every day by default. The source uses changes to metadata such as the `ETag` and `Last-Modified` server response headers, as well as content size, to determine if an item has changed since the last update. |
| | rebuild | |
| Content security options | Same users and groups as in your content system | |
| | Specific users and groups | |
| | Everyone | |
| Authentication methods | Basic authentication | |
| | Form authentication | |
| Crawling rules | | A variety of basic and advanced rules may be used to ignore the webpages you don’t want to index. |
| Metadata indexing for search | Automapping of metadata to a field with a matching name | Disabled by default. To enable, access the JSON configuration of your source, and set `performFieldMappingUsingAllOrigins` to `true`. |
| | Automatically indexed metadata | Sample of autopopulated fields (no user-defined metadata required): `clickableuri`, `date`, `fileextension`, `filetype`, `language` (autodetected from document content), `title`. The `author` field will also be autopopulated if the content item contains an `author` metadata value. After a content update operation, inspect your item field values in the Content Browser. |
| | Collected indexable metadata | The source automatically collects the `content` attribute of `<meta>` tags when the tag is keyed with one of the following attributes: `name`, `property`, `itemprop`, or `http-equiv`. For example, if the HTML of a page contains `<meta name="author" content="jsmith"/>`, the Web source extracts `jsmith` as the `author` metadata. After a rebuild, review the View and map metadata subpage for the list of indexed metadata and to index additional metadata from those available. |
| | Custom metadata collection | Available using the following source features: Web scraping; `IndexJsonLdMetadata` JSON configuration parameter. |
| JavaScript content rendering | | The crawler can run JavaScript on a webpage to dynamically render content before indexing the page. |
| Shadow DOM content retrieval | | If you choose to render JavaScript content, you can also specify whether the crawler should traverse and index attached Shadow DOM content. |
| Web scraping | | Exclude irrelevant sections in pages, extract custom metadata, and generate sub-items. |
| Optical Character Recognition (OCR) | | Available at an extra charge. Contact Coveo Sales to add this feature to your Coveo organization license. |
| Robots.txt crawl-delay and page restrictions | | Some lesser-known `robots.txt` directives such as `visit-time` and `request-rate` aren’t supported. |
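To preview which `<meta>` values a page exposes under the collection rule described above (a `content` attribute on a tag keyed by `name`, `property`, `itemprop`, or `http-equiv`), a small standard-library parser is enough. This is an illustrative sketch only; `MetaCollector` and `collect_meta` are hypothetical names, and the way Coveo handles repeated or conflicting keys isn’t specified here.

```python
from html.parser import HTMLParser

# Attributes that can key a <meta> tag, as listed in the characteristics above.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaCollector(HTMLParser):
    """Collects content values of <meta> tags, grouped by their key."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return  # Tags such as <meta charset="utf-8"> carry no content.
        for key_attribute in KEY_ATTRIBUTES:
            key = attributes.get(key_attribute)
            if key:
                # Assumption: keep every value when a key appears more than once.
                self.metadata.setdefault(key, []).append(content)
                break


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.metadata


page = (
    '<html><head><meta name="author" content="jsmith"/>'
    '<meta property="og:title" content="Home">'
    '<meta charset="utf-8"></head></html>'
)
metadata = collect_meta(page)
```

Running this kind of check on a few representative pages before a rebuild can help you anticipate which metadata names will appear on the View and map metadata subpage.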

Limitations

Only pages reachable through website page hyperlinks are indexed. For example, the Web source crawler doesn’t follow options in a <select> tag.
Refresh isn’t available. A daily rescan is defined, but not enabled by default. You can enable this daily rescan on a per-source basis.
Multi-factor authentication (MFA) and CAPTCHA aren’t supported.
The Web source crawler can handle up to 200 cookies for the same domain, and a total of 3000 cookies. If the crawled sites add cookies beyond these limits, the crawler will drop older cookies, which can cause issues (for example, if a dropped cookie is required for authentication).
Indexing page permissions isn’t supported.
Although the MaxPageSizeInBytes is set to 0 (unlimited size) by default in the source JSON configuration, the Coveo indexing pipeline can handle web pages up to 512 MB only. Larger pages are indexed by reference (that is, their content is ignored by the Coveo crawler, and only their metadata and path are searchable). Therefore, no search result Quick view is available for these larger items.
JavaScript usage and limitations:
- The Execute JavaScript on pages and Add time for the crawler to wait before considering a page as fully rendered settings only pertain to webpage content retrieval for indexing. When authenticating, the Web crawler applies the Loading delay or the custom login sequence wait delay values.
- Content in pop-up windows and elements that require interaction aren’t indexed.
- When the Execute JavaScript on pages option is enabled, the source doesn’t support the UseProxy parameter.
The UseProxy parameter can’t be used in combination with Form authentication.
When indexing content with the Crawling Module, make sure you don’t change the space character encoding in your items’ URIs, as Coveo uses these URIs to distinguish items.

For example, an item whose URI would change from example.com/my first item to example.com/my%20first%20item wouldn’t be recognized as the same by Coveo. As a result, it would be indexed twice, and the older version wouldn’t be deleted.

Item URIs are displayed in the Content Browser. We recommend you check where these URIs come from before making changes that affect space character encoding. Depending on your source type, the URI may be an item’s URL, or it may be built out of pieces of metadata by your source mapping rules. For example, your item URIs may consist of the main site URL plus the item filename, due to a mapping rule such as example.com/%[filename]. In such a case, changing space encoding in the item filename could impact the URI.
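The space-encoding pitfall above comes down to plain string comparison. The short sketch below uses Python’s standard `urllib.parse` functions to show that the two forms of the example URI differ as strings even though they decode to the same value. It only illustrates the principle and makes no claim about how Coveo compares URIs internally.

```python
from urllib.parse import quote, unquote

raw_uri = "example.com/my first item"
# Percent-encode the spaces while leaving "/" and ":" untouched.
encoded_uri = quote(raw_uri, safe="/:")

# Compared as strings, the two forms look like two different items.
same_item = raw_uri == encoded_uri

# Decoding both sides first shows they designate the same resource.
same_after_decoding = unquote(raw_uri) == unquote(encoded_uri)
```

A comparison like this can be used to audit a list of URIs before and after a configuration change that might affect encoding.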

Leading practices

Favor using a Sitemap source when the site features a sitemap file.
When a connector exists for the technology powering the website, create a source based on that connector instead, as it will typically index content, metadata, and permissions more effectively.

Example

You want to make an Atlassian Confluence-powered site content searchable. Create a Confluence source, not a Web source.
It’s best to create or edit your source in your sandbox organization first. Once you have confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually. See About non-production organizations for more information and best practices regarding sandbox organizations.
Always try authenticating without a custom login sequence first. You should only start working on a custom login sequence when you’re sure your form authentication details (that is, login address, user credentials, validation method) are accurate and that the standard form authentication process doesn’t work.
If you aren’t the owner of the website, make sure you have the right to crawl its public content. Crawling sites that you neither own nor have the right to crawl could create reachability issues.

Furthermore, certain sites use security mechanisms that can impact Coveo’s ability to retrieve the content. If you’re unfamiliar with these mechanisms, investigate and learn about them beforehand. For example, security software such as Akamai or Cloudflare can detect the Coveo crawler as an attack and block it from any further crawling.
The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.
Leverage the Time the crawler waits between requests to your server parameter to increase the crawling speed for the sites you own. Contact the Coveo Support team for help if needed.
Schedule rescan operations following the rate at which your source content changes.
To index only one or a few specific pages of a site such as for a test, enter the pages to index as Starting URLs. Then, set the Number of page levels to crawl from a starting URL parameter value to 0, instructing the crawler to only index the specified pages, and none of their linked pages.
Though it’s possible to index multiple domains by configuring the source outside the main user interface, doing so is a bad practice. Always create one source per domain. This helps:
- Prevent the crawler from using your source authentication credentials on an external site.
- Reduce the number and complexity of crawling and scraping rules.
- Optimize source configurations for each site.
- Avoid having a rebuild/rescan issue on one site cause the deletion of indexed items associated with the other sites.
Don’t enable ExpandBeforeFiltering unless it’s necessary. Setting the ExpandBeforeFiltering parameter to true can significantly reduce the crawling speed since the crawler retrieves many pages that can be rejected in the end.
Group your source and the other implementation resources together in a project. See Manage projects.

Add a Web source

On the Sources page, click Add source.
In the Add a source of content panel, click the Cloud or Crawling Module tile, depending on your content retrieval context. With the latter, you must install the Crawling Module to make your source operational.
In the Add a new Web source / Add a new Crawling Module Web source panel, fill in the following fields.

Name: Use a short and descriptive name, using only letters, numbers, hyphens (-), and underscores (_). The source name can’t be modified once it’s saved.

Starting URL: The URL of a website page from which the crawler starts discovering and following links found in pages, including:
- The protocol (for example, http, https)
- The subdomain, if applicable (for example, the www subdomain)
  Examples of valid starting URLs
  https://www.coveo.com
  
  https://docs.coveo.com/en
  With the cloud Web source, as soon as you’ve typed the website domain, the source looks for sitemap files in standard website locations. If sitemap files are found, they’re displayed and you’re prompted to switch to the Coveo Sitemap source. Switching to the Sitemap source is recommended.
Continue configurations as a Sitemap source or Web source.
- If available, click Switch to a Sitemap source and continue configuring your Sitemap source with the autodetected sitemap URLs.
  
  OR
- Continue configuring your Web source.
  1. Fill in the following fields:
    
    Crawling Module: If you’re creating a Crawling Module Web source, select the installed Crawling Module instance.
    
    Project: The project(s) you want to associate your source with.
    
    Note
    
    After source creation, you can update your Coveo project selection under the Identification subtab.
  2. Click Next.
  3. Select who has permission to access the content through the search interface and click Add source.
    
    Note
    
    This information is editable later in the Content security tab.
  4. Continue configuring your source.

"Configuration" tab

The Configuration tab lets you manage the crawling rules, web scraping configurations, advanced settings, and authentication methods of your source. These configuration groups are presented in subtabs.

"Crawling rules" subtab

The Crawling rules subtab lets you define the specific pages to index.

Starting URLs

The Starting URL you entered when creating the Web source is automatically added to the Starting URLs list. Add other starting URLs in the same domain to ensure that orphan pages and isolated sections of your website are crawled and indexed.

Exclusions and inclusions

Add exclusion and inclusion rules to crawl only specific items based on their URL.

The following diagram illustrates how the Web crawler applies the exclusion and inclusion rules. This flow applies to all pages, including the starting URLs. You must therefore pay attention to not filter out your starting URLs.

Note

The diagram shows the crawling process with the default source parameter settings. Certain parameters (for example, ExpandBeforeFiltering) can fundamentally change this behavior.

About the "Include all non-excluded pages" option

[Image: Crawling flow with the all-inclusive inclusion rule]

The Include all non-excluded pages option automatically adds an "include all" inclusion rule in the background. This ensures that all starting URLs meet the Does URL match at least one inclusion rule? condition and that all non-excluded pages get crawled.

You can use any of the six types of rules:

- is: a URL that includes the protocol. For example, https://myfood.com/.
- contains: a string found in the URL. For example, recipes.
- begins with: a string found at the beginning of the URL and which includes the protocol. For example, https://myfood.
- ends with: a string found at the end of the URL. For example, .pdf.
- matches wildcard rule: a wildcard expression that matches the whole URL. For example, https://myfood.com/recipes*.
- matches regex rule: a regex rule that matches the whole URL. For example, ^.*(company-(dev|staging)).*html.?$.

When using regex rules, make sure they match the desired URLs with a testing tool such as Regex101.
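Before saving rules, it can help to test them against sample URLs offline. The sketch below models the six rule types with Python’s standard library. The function names are hypothetical, `fnmatchcase` is a stand-in for the wildcard syntax (it also treats `?` and `[...]` as special characters), and the evaluation order (any exclusion rejects the URL, then at least one inclusion must match) is an assumption about the default flow rather than a statement of Coveo’s implementation.

```python
import re
from fnmatch import fnmatchcase


def rule_matches(rule_type, value, url):
    """Return True when url satisfies a single rule of the given type."""
    if rule_type == "is":
        return url == value
    if rule_type == "contains":
        return value in url
    if rule_type == "begins with":
        return url.startswith(value)
    if rule_type == "ends with":
        return url.endswith(value)
    if rule_type == "wildcard":
        # Stand-in for the wildcard syntax; the pattern must cover the whole URL.
        return fnmatchcase(url, value)
    if rule_type == "regex":
        # fullmatch requires the expression to match the whole URL.
        return re.fullmatch(value, url) is not None
    raise ValueError(f"Unknown rule type: {rule_type}")


def should_crawl(url, exclusions, inclusions):
    """Assumed flow: exclusions win, then one inclusion must match."""
    if any(rule_matches(kind, value, url) for kind, value in exclusions):
        return False
    return any(rule_matches(kind, value, url) for kind, value in inclusions)


EXCLUSIONS = [("ends with", ".pdf")]
INCLUSIONS = [("wildcard", "https://myfood.com/recipes*")]

recipe_ok = should_crawl("https://myfood.com/recipes/soup", EXCLUSIONS, INCLUSIONS)
pdf_ok = should_crawl("https://myfood.com/recipes/soup.pdf", EXCLUSIONS, INCLUSIONS)
news_ok = should_crawl("https://myfood.com/news/today", EXCLUSIONS, INCLUSIONS)
```

Include your starting URLs in the sample set: under the assumed flow, a starting URL that matches an exclusion rule or no inclusion rule would be filtered out.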

"Web scraping" subtab

The Web scraping subtab lists and lets you manage web scraping configurations for your source.

When the crawler is about to index a page, it checks whether it must apply web scraping configurations that have been defined. The crawler considers the Pages to target rules of each of your web scraping configurations, starting with the configuration at the top of your list. Depending on whether the source is in single-match or multi-match mode, the crawler applies either the first matching configuration or all matching configurations.

Indexing irrelevant page sections and not extracting custom metadata reduces the quality of search results. With this in mind, all new Web sources are created with a default web scraping configuration that excludes typical repetitive elements found in web pages that shouldn’t be indexed.

[Image: Default web scraping configuration]

Existing Web sources without a web scraping configuration prompt you to add the default configuration when you access the Web scraping subtab.

[Image: Apply the default web scraping configuration]

When no web scraping configuration is defined:

All pages included by the crawling rules are indexed in their entirety (that is, no sections are excluded).
No custom metadata is collected.
No sub-items are created.

The Web source features two web scraping configuration management modes: UI-assisted mode and Edit with JSON mode.

UI-assisted mode

You can add, edit, and delete one web scraping configuration at a time with a user interface that makes many technical aspects transparent. UI-assisted mode is easier to use and more mistake-proof than Edit with JSON mode.

Use this mode except for sub-item related configurations (which are only supported in Edit with JSON mode).

Note

The Web scraping tab displays a message when the aggregated web scraping configuration contains a sub-item related configuration.

[Image: Message shown in the Web scraping tab when sub-items are configured]

When you add or edit a web scraping configuration using UI-assisted mode, the Add/Edit a web scraping configuration panel is displayed. See Configurations in UI-assisted mode for more details.

Edit with JSON mode

The Edit with JSON button gives access to the aggregated web scraping JSON configuration of the source. Adding, editing, and deleting configurations directly in the JSON requires more technical skills than using UI-assisted mode.

Use this mode to perform sub-item related configurations and when you want to test your aggregated web scraping configuration with the Coveo Labs Web Scraper Helper.

Note

The Web scraping tab displays a message when the aggregated web scraping configuration contains a sub-item related configuration.

When you add or edit a web scraping configuration in Edit with JSON mode, the Edit a web scraping JSON configuration panel is displayed. See Configurations in Edit with JSON mode for more details.

Single-match vs multi-match

The Web source can apply web scraping configurations in two ways: single-match or multi-match.

In single-match mode, the crawler applies only the first matching web scraping configuration. In multi-match mode, the crawler applies all matching web scraping configurations.
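The difference between the two modes can be sketched in a few lines. In this illustration, each configuration is reduced to a name and a regular expression standing in for its Pages to target rules. Both the data shape and the `select_configurations` function are hypothetical and don’t reflect the actual web scraping JSON schema.

```python
import re


def select_configurations(url, configurations, multi_match):
    """Return the names of the configurations to apply to url.

    configurations is an ordered list of (name, pattern) pairs,
    with the first pair being the configuration at the top of the list.
    """
    matching = [
        name for name, pattern in configurations if re.search(pattern, url)
    ]
    # Single-match keeps only the first match; multi-match keeps them all.
    return matching if multi_match else matching[:1]


CONFIGURATIONS = [
    ("all pages", r"^https://myfood\.com/"),
    ("news", r"/news/"),
    ("recipes", r"/recipes/"),
]
url = "https://myfood.com/recipes/soup"

single = select_configurations(url, CONFIGURATIONS, multi_match=False)
multi = select_configurations(url, CONFIGURATIONS, multi_match=True)
```

In single-match mode, the broad "all pages" configuration shadows the more specific "recipes" one, so the order of the list matters much more than in multi-match mode.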

The animation below demonstrates the application of three web scraping configurations on a culinary website featuring news articles and recipe pages, in single-match mode (left) and multi-match mode (right).

[Image: Animation showing the single-match and multi-match behaviors]

Web sources created before mid-December 2023 were created in single-match mode. All new Web sources are created in multi-match mode.

Coveo converted existing single-match sources containing zero or one web scraping configuration to multi-match mode. We recommend you convert any remaining single-match Web source to multi-match mode. If a Web source is currently in single-match mode, the Web scraping subtab displays a banner prompting you to convert to multi-match mode.

To convert a source to multi-match mode

In the Web scraping subtab, click Switch to multi-match mode.
Confirm you want to convert the source to multi-match mode. A green You’re currently in multi-match mode banner will then appear.
Click Save.

Once your source is fully converted, the Web scraping subtab no longer shows the green banner and the subtab description reflects the multi-match mode behavior.

[Image: Web scraping configuration options and description for targeting and processing pages]

"Advanced settings" subtab

The Advanced settings subtab lets you customize the Coveo crawler behavior. All advanced settings have default values which are adequate in most use cases.

Execute JavaScript on pages

Only enable this option when website content you want to consider for indexing is dynamically rendered by JavaScript. Enabling this option may significantly increase the time needed to crawl pages.

If you enable Execute JavaScript on pages, you’ll have the following options:

Add time for the crawler to wait before considering a page as fully rendered: The default value of this setting is 0, which means that the crawler doesn’t wait after the page is loaded to retrieve its content. If the JavaScript takes longer to execute than normal or makes asynchronous calls, consider increasing this value to ensure that pages with longer rendering times are indexed with their dynamically rendered content.
Enable Shadow DOM content retrieval: When you enable this option, the crawler builds a flattened DOM tree by combining the light DOM and the Shadow DOM. It then processes the resulting structure as it would any other web page.

Note

The crawler adds a custom attribute to the shadow root elements in the flattened DOM, allowing these elements to be targeted using a special web scraping CSS selector.

Building the composed DOM can significantly slow down indexing. Enable this option only if the Shadow DOM contains valuable content you need to index.

Query parameters to ignore

Add query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.

By default, the source considers the whole URL to determine whether the page is a distinct item. The URLs of the website you index can contain one or more query string parameters after the host name and the path. Some query string parameters may change the content of the page significantly, and therefore legitimately contribute to a distinct page. Other query string parameters may not affect the content of the page, or very little. In the latter case, you want to ignore the query string parameter to avoid creating search result duplicates.

Example

The URL of a website page for which you get search result duplicates looks as follows:

http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123

The values of param1 and param2 can change without affecting the page content while the lang value changes the language in which the page appears. You want to ignore the param1 and param2 query string parameters to eliminate search result duplicates, but not lang. In this example, you would therefore add the param1 and param2 parameters.

Note

Wildcards and regex aren’t supported in query string parameter names. For instance, in the example above, you can’t cover both the param1 and param2 query string parameters using param*.
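The effect of ignoring query string parameters can be illustrated by computing a comparison key for each URL. The following sketch uses Python’s `urllib.parse`. The `item_key` function is a hypothetical helper, and the exact normalization Coveo applies may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def item_key(url, ignored_parameters):
    """Return url without the ignored query string parameters.

    Parameter names are compared literally, so "param*" only matches
    a parameter that is literally named "param*".
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored_parameters
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


IGNORED = {"param1", "param2"}
first = item_key(
    "http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123",
    IGNORED,
)
second = item_key(
    "http://www.mysite.com/v1/getitdone.html?lang=en&param1=xyz&param2=456",
    IGNORED,
)
french = item_key(
    "http://www.mysite.com/v1/getitdone.html?lang=fr&param1=abc&param2=123",
    IGNORED,
)
```

Here `first` and `second` collapse to the same key and would count as one item, while `french` stays distinct because `lang` isn’t ignored.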

Directives overrides

Check the robots.txt box if you want the Coveo crawler to ignore directives specified in the website’s robots.txt file.
Check the noindex box if you want the Coveo crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header.
Check the nofollow links box if you want the Coveo crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header.
Check the nofollow anchors box if you want the Coveo crawler to follow links that have a rel="nofollow" attribute.
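Before overriding `robots.txt` directives, you may want to see what the file actually restricts. The sketch below uses Python’s standard `urllib.robotparser` module on an inline sample file. The file content and the `ExampleCrawler` user agent name are made up for the illustration, and this parser is not the one the Coveo crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (assumption, for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pages outside /private/ are allowed, pages under it are not.
allowed_public = parser.can_fetch(
    "ExampleCrawler", "https://example.com/recipes/soup"
)
allowed_private = parser.can_fetch(
    "ExampleCrawler", "https://example.com/private/report"
)

# Delay, in seconds, requested by the Crawl-delay directive.
delay_seconds = parser.crawl_delay("ExampleCrawler")
```

Pages reported as disallowed here are the ones that would only be reached with the robots.txt override checked.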

Number of page levels to crawl from a starting URL

Indicate the number of page link levels (or clicks) the crawler can travel from any starting URL. A starting URL is level 0. All pages accessible from a starting URL are considered level 1.
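The level numbering can be made concrete with a depth-limited breadth-first traversal. In this sketch, `crawl_levels` and the `LINKS` dictionary (a map from a page to the pages it links to) are hypothetical and only illustrate how levels are counted from the starting URLs.

```python
from collections import deque


def crawl_levels(start_urls, links, max_level):
    """Map each reachable URL to its level, stopping at max_level."""
    levels = {url: 0 for url in start_urls}  # Starting URLs are level 0.
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if levels[url] == max_level:
            continue  # Links on pages at the last level aren't followed.
        for target in links.get(url, []):
            if target not in levels:
                levels[target] = levels[url] + 1
                queue.append(target)
    return levels


LINKS = {
    "/home": ["/recipes", "/news"],
    "/recipes": ["/recipes/soup"],
    "/recipes/soup": ["/recipes/soup/print"],
}

only_start = crawl_levels(["/home"], LINKS, max_level=0)
two_levels = crawl_levels(["/home"], LINKS, max_level=2)
```

With a value of 0, only the starting URLs are kept, which matches the practice of listing specific pages as starting URLs to index just those pages.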

Time the crawler waits between requests to your server

Indicate the number of milliseconds between consecutive HTTP requests to the website server. The default value is 1000 milliseconds, which represents a crawling rate of one page per second.

One page per second is the highest rate at which Coveo can crawl a public website for a cloud Web source without proof of ownership of the website. You can enter a number below 1000. However, the Coveo crawler will only apply a crawling delay below 1000 milliseconds if it can verify that you’re the owner of the site.

If you’re retrieving content of an internal website using the Crawling Module Web source, the crawling delay you specify applies automatically. You don’t need to prove site ownership as Coveo detects that the crawled site has a private IP address.
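The delay rules above can be summarized as a small decision function. This is a sketch of the described behavior under stated assumptions, not Coveo’s code: `effective_delay_ms` and `pages_per_second` are hypothetical helpers, and the private-address check relies on Python’s `ipaddress` module rather than on whatever detection Coveo performs.

```python
import ipaddress

# Lowest delay applied to a site whose ownership isn't verified.
MINIMUM_UNVERIFIED_DELAY_MS = 1000


def effective_delay_ms(requested_ms, ownership_verified, server_ip):
    """Return the delay the crawler would apply between requests."""
    if ownership_verified or ipaddress.ip_address(server_ip).is_private:
        return requested_ms
    return max(requested_ms, MINIMUM_UNVERIFIED_DELAY_MS)


def pages_per_second(delay_ms):
    """Convert a delay between requests into a crawling rate."""
    return 1000 / delay_ms


public_unverified = effective_delay_ms(250, False, "8.8.8.8")
public_verified = effective_delay_ms(250, True, "8.8.8.8")
internal_site = effective_delay_ms(250, False, "10.0.0.5")
default_rate = pages_per_second(1000)
```

A 250 millisecond delay corresponds to four pages per second, but in this model it only takes effect for verified or internal sites.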

"Authentication" subtab

The Authentication settings, used by the source crawler, emulate the behavior of a user authenticating to access restricted website content. If authentication is required, select the authentication type your website uses, whether Basic authentication or Form authentication. Then, provide the corresponding login details.

Whether you use Basic or Form authentication, limit your source crawling scope to one domain that you own. This reduces the risk of exposing your authentication credentials.

Note

Manual form authentication is now only available on legacy sources. We recommend you migrate existing Manual form authentication sources to Form authentication.

Basic authentication

When selecting Basic authentication, enter the credentials of an account on the website you’re making searchable. See Source credentials leading practices.

When Execute JavaScript on pages is enabled on the source, basic authentication significantly impacts indexing performance.

When the Coveo crawler follows links requiring basic authentication while indexing your website, it only uses the basic authentication credentials you entered if the link URL matches the scheme and domain of the Starting URLs. If this condition isn’t met, the Coveo crawler doesn’t try to authenticate.

For example, if your starting URL is https://www.example.com, the Coveo crawler doesn’t even try to authenticate if the link URL uses:

A different scheme (that is, it uses HTTP instead of HTTPS).
A different domain (for example, https://www.mysite.com), unless UseHiddenBasicAuthentication is enabled.
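The scheme and domain condition can be expressed as a simple URL comparison. In the sketch below, `may_send_basic_credentials` is a hypothetical helper, comparing host names is an assumption about what "domain" means here, and the `UseHiddenBasicAuthentication` exception isn’t modeled.

```python
from urllib.parse import urlsplit


def may_send_basic_credentials(link_url, starting_urls):
    """Return True when link_url shares the scheme and host of a starting URL."""
    link = urlsplit(link_url)
    for starting_url in starting_urls:
        start = urlsplit(starting_url)
        if link.scheme == start.scheme and link.hostname == start.hostname:
            return True
    return False


STARTING_URLS = ["https://www.example.com"]

same_site = may_send_basic_credentials(
    "https://www.example.com/private/page", STARTING_URLS
)
other_scheme = may_send_basic_credentials(
    "http://www.example.com/private/page", STARTING_URLS
)
other_domain = may_send_basic_credentials(
    "https://www.mysite.com/page", STARTING_URLS
)
```

Only the first link qualifies. The other two would be requested without any authentication attempt.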

Form authentication

You can choose between two form authentication workflows:

Force authentication disabled (recommended)

With Force authentication disabled, the workflow typically goes as follows:

Coveo’s crawler requests a protected page.
The web server redirects the crawler to the Login page address.
Using the configured Validation method, the crawler determines it’s not authenticated. This automatically triggers the next step.
The crawler performs a standard login sequence using the provided Login details, or the Custom login sequence if one is configured.
After successful authentication, the web server responds by redirecting back to the requested protected page and returning cookies.
The crawler follows the server redirect to get the protected page and indexes that page.
The crawler requests the other pages using the cookies.

This is the default and recommended workflow as it emulates human behavior the best and ensures crawler re-authentication, when needed.

Force authentication enabled

With Force authentication enabled, the workflow typically goes as follows:

The crawler performs a standard login sequence using the provided Login details, or the Custom login sequence if one is configured.
After successful authentication, the web server responds with cookies that the crawler will use to request other pages.
The crawler requests the first URL from the web server using the cookies and indexes that page.
The crawler requests other pages using the cookies.

If the crawler loses authentication at some point (for example, if a cookie expires), it has no way of knowing it must re-authenticate unless you have a proper authentication status validation method. As a result, you may notice at some point that your source has indexed some, but not all, protected pages.

Only use Force authentication when no reliable authentication status validation method can be configured.

Note

The crawler can interact with Shadow DOM elements in your login pages. If this is required, make sure the form authentication loading delay allows the Shadow DOM time to load before the crawler begins to interact with the page.

Username and password

Enter the credentials required to access the secured content. See Source credentials leading practices.

Login page address

Enter the URL of the website login page where the username and password are to be used.

Loading delay

Enter the maximum time the crawler should allow for JavaScript to execute and go through the login sequence before timing out.

Validation method

The crawler uses the validation method after requesting a page from the web server to know if it’s authenticated or not. When the validation method reveals that the crawler isn’t authenticated, the crawler immediately tries to re-authenticate.

To configure the validation method

In the dropdown menu, select your preferred authentication status validation method.
In the Value(s) field, specify the corresponding URL, regex or text.
- For Cookie not found (recommended):
  
  Enter the name of the cookie returned by the server after successful authentication. If this cookie isn’t found, the crawler will immediately authenticate (or re-authenticate).
  
  Example
  
  ASP.NET_SessionId
- For Redirection to URL (recommended):
  
  Enter the URL where users trying to access protected content on the website are redirected to when they’re not authenticated. If the crawler is redirected to this URL, it will immediately authenticate (or re-authenticate).
  
  Example
  
  https://mycompany.com/login/failed.html
- For Text not found in page ^[1]:
  
  Enter the text that appears on the page after successful authentication. If this text isn’t found on the page, the crawler will immediately authenticate (or re-authenticate).
  
  Example
  
  When a user successfully logs in, the page shows a "Hello, <USERNAME>!" greeting text. If the login username you specified was jsmith@mycompany.com, the text to enter would be:
  
  Hello, jsmith@mycompany.com!
  
  Example
  
  Log out
- For Text found in page ^[1]:
  
  Enter the text that appears on the page when a user isn’t authenticated. If this text is found on the page, the crawler will immediately authenticate (or re-authenticate).
  
  Examples
  
  An error has occurred.
  
  Your username or password is invalid.
- For URL matches regex ^[1]:
  
  Enter a regex rule that matches the URL where users trying to access protected content are redirected to when they’re not authenticated. If the crawler is redirected to a URL that matches this regex, it will immediately authenticate (or re-authenticate).
  
  Example
  
  .+Account\/Login.*
- For URL doesn’t match regex ^[1]:
  
  Enter a regex rule that matches the URL where users trying to access protected content are redirected to after successful authentication. If the crawler isn’t redirected to a URL that matches this regex, it will immediately authenticate (or re-authenticate).

Force authentication

Select this option if you want Coveo’s first request to be for authentication, regardless of whether it is actually required.

You should only force authentication if you have no reliable authentication status validation method.

Custom login sequence

If the web page requires specific actions during the login process, you might have to configure a custom login sequence.

The standard source login sequence can handle various third-party login pages (for example, OneLogin, Google, Salesforce, Microsoft), and will try to automatically detect and log in on first-party login forms. Ensure that the standard source login sequence fails before configuring a custom login sequence.

"Crawling Module" subtab

If your source is a Crawling Module source, and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

"Identification" subtab

The Identification subtab contains general information about the source.

Name

The source name. It can’t be modified once it’s saved.

Project

Use the Project selector to associate your source with one or more Coveo projects.

"Items" tab

In the Items tab, you can specify how the source handles items based on their file type or content type.

File types

File types let you define how the source handles items based on their file extension or content type. For each file type, you can specify whether to index the item content and metadata, only the item metadata, or neither.

You should fine-tune the file type configurations with the objective of indexing only the content that’s relevant to your users.

Example

Your repository contains .pdf files, but you don’t want them to appear in search results. You click Extensions and then, for the .pdf extension, you change the Default action and Action on error values to Ignore item.
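As a rough mental model of the example above, the following Python sketch resolves the action for an item from its file extension. The action labels follow the wording used in this article, but the dictionary layout, the function name, and the fallback for unlisted extensions are assumptions made for illustration. They don't reflect the source's actual configuration format.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Action labels based on the wording in this article.
INDEX_CONTENT_AND_METADATA = "Index content and metadata"
INDEX_METADATA = "Index metadata"
IGNORE_ITEM = "Ignore item"

# Hypothetical in-memory representation of file type settings.
FILE_TYPE_ACTIONS = {
    ".pdf": {"default_action": IGNORE_ITEM, "action_on_error": IGNORE_ITEM},
    ".html": {
        "default_action": INDEX_CONTENT_AND_METADATA,
        "action_on_error": INDEX_METADATA,
    },
}


def action_for(url, processing_failed=False):
    """Pick the action for an item based on the extension in its URL path."""
    extension = PurePosixPath(urlparse(url).path).suffix.lower()
    rule = FILE_TYPE_ACTIONS.get(extension)
    if rule is None:
        # Assumed fallback for extensions without an explicit rule.
        return INDEX_CONTENT_AND_METADATA
    return rule["action_on_error"] if processing_failed else rule["default_action"]


pdf_action = action_for("https://mycompany.com/docs/guide.pdf")
```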

For more details about this feature, see Customize the indexing process.

With file type handling, using the Index metadata default action on HTML items lets you index basic metadata for those items. On the other hand, web scraping is used to index custom metadata from the page content, which must first be retrieved. These metadata indexing mechanisms are complementary, and you can use them together within a source.

If there are some items for which you only need to index basic metadata, make sure you don’t have a web scraping rule that matches those items. This will prevent unnecessary processing and potential issues with retrieving protected page content.

Content and images

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option. The extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick view.

Note

When OCR is enabled, ensure the source’s relevant file type configurations index the item content. Indexing the item’s metadata only or ignoring the item will prevent OCR from being applied.

See Enable optical character recognition for details on this feature.

"Content security" tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on the content security options, see Content security.

"Access" tab

In the Access tab, specify whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration, while Group B can only view it.

For more information, see Custom access level.

Completion

Finish adding or editing your source:

When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to do other changes soon, click Add source/Save.

When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.

Note

On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.

Once the source is built or rebuilt, you can review its content in the Content Browser.

Once your source is done building or rebuilding, review the metadata Coveo is retrieving from your content.

Note

Not clear on the purpose of metadata? Watch this video.

On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View and map metadata in the Action bar.
If a metadata that you want to use in a facet or result template isn’t currently indexed, map it to a field.
1. Click the metadata and then, at the top right, click Add to Index.
2. In the Apply a mapping on all item types of a source panel, select the field you want to map the metadata to, or add a new field if none of the existing fields are appropriate. For advanced mapping configurations, like applying a mapping to a specific item type, see Manage mappings.
3. Click Apply mapping.

Depending on the source type you use, you may be able to extract additional metadata from your content. You can then map that metadata to a field, just like you did for the default metadata.

More on custom metadata extraction and indexing

Some source types let you define rules to extract metadata beyond the default metadata Coveo discovers during the initial source build.

For example:


Source type	Custom metadata extraction methods
Push API	Define metadata key-value pairs in the `addOrUpdate` section of the `PUT` request payload used to upload push operations to an Amazon S3 file container.
REST API and GraphQL API	In the JSON configuration (REST API | GraphQL API) of the source, define metadata names (REST API | GraphQL API) and specify where to locate the metadata values in the JSON API response Coveo receives.
Database	Add `<CustomField>` elements in the XML configuration. Each element defines a metadata name and the database field to use to populate the metadata with.
Web	Configure web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract metadata from JSON-LD `<script>` tags.
Sitemap	Extract metadata included in the XML sitemap file. Configure web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract JSON-LD `<script>` tag metadata. Extract `<meta>` tag content using the `IndexHtmlMetadata` JSON parameter.
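To give a sense of what JSON-LD `<script>` tag metadata looks like, here is a minimal Python sketch that pulls JSON-LD objects out of a page using only the standard library. It isn't how the Web source extracts metadata. The sample page, the class name, and the function name are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD script tag."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Release notes", "author": "J. Smith"}
</script>
</head><body><p>Hello</p></body></html>
"""

metadata = extract_json_ld(SAMPLE_PAGE)
```

Each key in the parsed object (here `headline` and `author`) is a candidate metadata that could then be mapped to a field.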


Some source types automatically map metadata to default or user-created fields, making the mapping process unnecessary. Some source types automatically create mappings and fields for you when you configure metadata extraction.

See your source type documentation for more details.

When you’re done reviewing and mapping metadata, return to the Sources (platform-ca | platform-eu | platform-au) page.
To reindex your source with your new mappings, click Launch rebuild in the source Status column.
Once the source is rebuilt, you can review its content in the Content Browser.

Schedule source updates.

Troubleshooting

After a rebuild, you may notice that your source isn’t indexing as expected. For example, there may be missing or extra items, or the values of some fields may not meet your requirements.

To help you troubleshoot, refer to the list of common issues and solutions when using the Web source.

Required privileges

You can assign privileges to allow access to specific tools in the Coveo Administration Console. The following table indicates the privileges required to view or edit elements of the Sources (platform-ca | platform-eu | platform-au) page and associated panels. See Manage privileges and Privilege reference for more information.

Note

The Edit all privilege isn’t required to create sources. When granting privileges for the Sources domain, you can grant a group or API key the View all or Custom access level, instead of Edit all, and then select the Can Create checkbox to allow users to create sources. See Can Create ability dependence for more information.

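Actions	Service	Domain	Required access level
View sources, view source update schedules, and subscribe to source notifications	Content	Fields	View
View sources, view source update schedules, and subscribe to source notifications	Content	Sources	View
View sources, view source update schedules, and subscribe to source notifications	Organization	Organization	View
Edit sources, edit source update schedules, and edit source mappings	Organization	Organization	View
Edit sources, edit source update schedules, and edit source mappings	Content	Fields	Edit
Edit sources, edit source update schedules, and edit source mappings	Content	Sources	Edit
View and map metadata	Content	Source metadata	View
View and map metadata	Content	Fields	View
View and map metadata	Organization	Organization	View
View and map metadata	Content	Sources	Edit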


Proof of website ownership

Coveo applies a value below 1000 milliseconds for the Time the crawler waits between requests to your server setting only if you prove ownership of the website you want to index.

To prove ownership of the website you want to index

1. Create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo organization ID.
2. Upload this file at the root of the website you want to index.

Note

If your site has robots.txt directives that include a crawl-delay parameter with a different value, the slowest crawling speed applies. See also the robots.txt option.
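The ownership file and the delay rules can be sketched in Python as follows. This is a sketch under stated assumptions, not Coveo's logic: the function names and the `web_root` folder are hypothetical, the `crawl-delay` value is assumed to be expressed in seconds, and a requested delay below 1000 milliseconds is assumed to be raised to 1000 milliseconds when ownership isn't proven (the article only states that lower values aren't applied).

```python
from pathlib import Path

# Assumed floor when website ownership isn't proven.
MIN_DELAY_WITHOUT_PROOF_MS = 1000


def ownership_filename(org_id):
    """Build the ownership file name for a Coveo organization ID."""
    return f"coveo-ownership-{org_id}.txt"


def create_ownership_file(web_root, org_id):
    """Create the empty ownership file in a local copy of the website root."""
    path = Path(web_root) / ownership_filename(org_id)
    path.write_text("", encoding="utf-8")
    return path


def effective_delay_ms(configured_ms, ownership_proven, robots_crawl_delay_s=None):
    """Estimate the delay applied between requests; the slowest speed wins."""
    delay = configured_ms
    if not ownership_proven:
        delay = max(delay, MIN_DELAY_WITHOUT_PROOF_MS)
    if robots_crawl_delay_s is not None:
        delay = max(delay, int(robots_crawl_delay_s * 1000))
    return delay
```

For example, with a configured delay of 500 milliseconds, proven ownership, and a robots.txt `crawl-delay` of 2, this sketch yields a 2000 millisecond delay, because the slower of the two speeds applies.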

Migrate from manual form authentication

If you’re using manual form authentication, you’ll see a "Manual form authentication deprecation" warning when viewing the Authentication subtab. You’ll want to migrate to form authentication. To do so, we recommend you create a duplicate of your source and configure form authentication on the duplicate. When the duplicate is configured and fully tested, you can copy its configuration to the original source.

If you’re using a sandbox organization and a snapshot-based phased rollout, the alternative is to copy your original source and related resource configurations to your sandbox using the resource snapshots feature. Once your sandbox source authentication configuration is updated and fully tested, you can use a snapshot to apply your changes to your production organization source.

Though the following procedure uses the source duplicate method, steps 3 to 8 inclusive are common to both methods.

To migrate from manual form authentication to form authentication

1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > Duplicate in the Action bar.
2. Name your duplicate.
3. Click your duplicate source, and then click Edit in the Action bar.
4. Select the Authentication subtab.
5. Select the Form authentication radio button.

   The following fields will be populated automatically using your existing manual form authentication settings: Username, Password, Login page address, Validation method and Value(s), Force authentication.
6. Rebuild your duplicate source.
7. Make sure that your duplicate source contains properly indexed content. Things you should check for:
   - Your duplicate source contains the same number of items as the original source.
   - For pages that are authentication protected in your website, make sure the Quick view of the corresponding items in your duplicate source shows the content of the actual website page. If form authentication fails, the item Quick view may display the content of your form authentication login page instead of the actual website page.
8. If form authentication is failing, consider making the following adjustments to your duplicate source form authentication configuration:
   - Changing the Validation method and associated Value(s) to a more reliable combination.
   - Increasing the Loading delay.
   - Setting up a custom login sequence.

   Contact Coveo Support if you need help.
9. When you’re sure the authentication configuration on your duplicate source works, apply the changes to the original source.
   1. On the Sources (platform-ca | platform-eu | platform-au) page, click your duplicate source, and then click More > Edit configuration with JSON in the Action bar.
   2. Copy the FormAuthenticationConfiguration JSON object. The object looks like the following:
      "FormAuthenticationConfiguration": { "sensitive": false, "value": "{\"authenticationFailed\":{\"method\":\"RedirectedToUrl\",\"values\":[\"https://something.com/Account/Login\"]},\"inputs\":[], \"formUrl\":\"https://something.com/Account/Login\",\"enableJavaScript\":true,\"forceLogin\":false,\"javaScriptLoadingDelayInMilliseconds\":2000,\"customLoginSequence\":{}}" }
   3. On the Sources (platform-ca | platform-eu | platform-au) page, click your original source, and then click More > Edit configuration with JSON in the Action bar.
   4. Replace the FormAuthenticationConfiguration object with the one from your duplicate source.
   5. Click Save.
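Note that the `value` of the FormAuthenticationConfiguration object is itself a JSON document serialized as a string, which is why its quotes are escaped. If you want to inspect or compare the settings before copying them, a short Python sketch such as the following may help. The helper names are hypothetical, and the two dictionaries only stand in for whichever parent object holds the `FormAuthenticationConfiguration` key in your source JSON. The sample values are taken from the object shown above.

```python
import json

# Inner settings, as in the sample object shown in the procedure.
inner_settings = {
    "authenticationFailed": {
        "method": "RedirectedToUrl",
        "values": ["https://something.com/Account/Login"],
    },
    "inputs": [],
    "formUrl": "https://something.com/Account/Login",
    "enableJavaScript": True,
    "forceLogin": False,
    "javaScriptLoadingDelayInMilliseconds": 2000,
    "customLoginSequence": {},
}

# Stand-in for the parent object copied from the duplicate source.
duplicate_parent = {
    "FormAuthenticationConfiguration": {
        "sensitive": False,
        "value": json.dumps(inner_settings),  # serialized, hence the escaped quotes
    }
}


def decode_form_auth(parent):
    """Parse the string-encoded settings into a dictionary for inspection."""
    return json.loads(parent["FormAuthenticationConfiguration"]["value"])


def copy_form_auth(duplicate, original):
    """Return a copy of `original` that uses the duplicate's form authentication."""
    updated = dict(original)
    updated["FormAuthenticationConfiguration"] = duplicate["FormAuthenticationConfiguration"]
    return updated


decoded = decode_form_auth(duplicate_parent)
original_parent = {"FormAuthenticationConfiguration": {"sensitive": False, "value": "{}"}}
updated_parent = copy_form_auth(duplicate_parent, original_parent)
```

Copying the whole object, rather than editing the escaped string by hand, avoids introducing quoting errors in the `value` string.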

What’s next?

If you’re using the Crawling Module to retrieve your content, consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.

1. Less reliable than the recommended validation methods. Can result in false positives, making form authentication issues harder to troubleshoot.
