---
title: Add a Web source
slug: malf0160
canonical_url: https://docs.coveo.com/en/malf0160/
collection: index-content
source_format: adoc
---

# Add a Web source

Members with the [required privileges](#required-privileges) can use a Web [source](https://docs.coveo.com/en/246/) to make the content of a website searchable.

The Web source [crawler](https://docs.coveo.com/en/2121/) behaves similarly to the bots of web search engines such as Google.
The source only needs a starting URL. It then automatically discovers the pages of the site by following the site navigation and all `<a>` hyperlinks in the page HTML, including those without visible link text.
Only discovered pages are [indexed](https://docs.coveo.com/en/204/), in the order in which they're discovered.

## Source key characteristics

The following table presents the main characteristics of a Web source.

[%header,cols="3,3,2,4"]
|===
2+|Features
|Supported
|Additional information

2+|Indexable content
^|Webpages (complete)
|

2+|Compressed HTTP responses
^|[check]
|The source automatically handles compressed web server HTTP responses with the following `Content-Encoding` header values: `gzip`, `deflate`, and `br`.

.3+.^|[Content update operations](https://docs.coveo.com/en/2039/)
|[refresh](https://docs.coveo.com/en/2710/)
^|[x]
|

|[rescan](https://docs.coveo.com/en/2711/)
^|[check]
a|[Takes place every day by default](https://docs.coveo.com/en/1933/)

The source uses changes to metadata such as the `ETag` and `Last-Modified` server response headers, as well as content size, to determine if an item has changed since the last update.

|[rebuild](https://docs.coveo.com/en/2712/)
^|[check]
|

.3+.^|[Content security options](https://docs.coveo.com/en/1779/)
|[Same users and groups as in your content system](https://docs.coveo.com/en/1779#same-users-and-groups-as-in-your-content-system)
^|[x]
|

|[Specific users and groups](https://docs.coveo.com/en/1779#specific-users-and-groups)
^|[check]
|

|[Everyone](https://docs.coveo.com/en/1779#everyone)
^|[check]
|

.2+.^|[Authentication methods](#authentication-subtab)
|Basic authentication
^|[check]
|

|Form authentication
^|[check]
|

2+|[Crawling rules](#crawling-rules-subtab)
^|[check]
|A variety of basic and advanced rules may be used to ignore the webpages you don't want to index.

.4+|[Metadata indexing for search](#index-metadata)
|Automatic mapping of [metadata](https://docs.coveo.com/en/218/) to [fields](https://docs.coveo.com/en/200/) that have the same name
2+a|Disabled by default. To enable, [access the JSON configuration of your source](https://docs.coveo.com/en/1685#access-the-edit-configuration-with-json-panel), and set [`performFieldMappingUsingAllOrigins`](https://docs.coveo.com/en/1640#about-the-performfieldmappingusingallorigins-setting) to `true`.

|Automatically indexed [metadata](https://docs.coveo.com/en/218/)
2+a|Examples of [auto-populated default fields](https://docs.coveo.com/en/1833#field-origin) (no user-defined metadata required):

* `clickableuri`
* `date`
* `filetype`
* `language` (autodetected from document content)
* `title`

The [`author`](https://docs.coveo.com/en/1833#field-origin) field will also be auto-populated if the content item contains an `author` metadata value.

After a content update, [inspect your item field values](https://docs.coveo.com/en/2053#inspect-search-results) in the **Content Browser**.

|Extracted but not indexed metadata
2+a|The source automatically extracts the `content` attribute of `<meta>` tags when the tag includes one of the following attributes: `name`, `property`, `itemprop`, or `http-equiv`.
For example, if the HTML of a page contains the following: `<meta name="author" content="jsmith">`, the Web source extracts _jsmith_ as the `author` metadata.

After a rebuild, review the [**View and map metadata**](https://docs.coveo.com/en/m9ti0339#view-and-map-metadata-subpage) subpage for the list of indexed metadata, and [index additional metadata](https://docs.coveo.com/en/m9ti0339#index-metadata).

|Custom metadata extraction
2+a|Available using the following source features:

* [Web scraping](#web-scraping-subtab)
* [IndexJsonLdMetadata](https://docs.coveo.com/en/mc1f0219#indexjsonldmetadata-boolean) JSON configuration parameter

2+|[JavaScript content rendering](#execute-javascript-on-pages)
^|[check]
|The crawler can run JavaScript on a webpage to dynamically render content before indexing the page.

2+|[Shadow DOM content retrieval](#execute-javascript-on-pages)
^|[check]
|If you choose to render JavaScript content, you can also specify whether the crawler should traverse and index attached Shadow DOM content.

2+|[Web scraping](#web-scraping-subtab)
^|[check]
|Exclude irrelevant sections in pages, extract custom [metadata](https://docs.coveo.com/en/218/), and generate sub-items.

2+|[Optical Character Recognition (OCR)](#content-and-images)
^|[check]
|Available at an extra charge. Contact [Coveo Sales](https://www.coveo.com/en/contact) to add this feature to your [Coveo organization](https://docs.coveo.com/en/185/) license.

2+|[Robots.txt crawl-delay and page restrictions](#directives-overrides)
^|[check]
|Some lesser-known `robots.txt` directives such as `visit-time` and `request-rate` aren't supported.
|===

## Limitations

* Only pages reachable through website page _hyperlinks_ are indexed. For example, the Web source crawler doesn't follow options in a `` element whose `id` or `name` attribute value is either `user`, `email`, `login`, `id`, or `name` (case-insensitive).

**Password field**: A visible `<input>` element whose `type` attribute value is `password` (case-insensitive).

**Submit element**: A visible form submit element. The source looks for the following element types, in this order:

* An `<input>` element whose `type` attribute value is `submit` or `button`.
* A ` ```

To emulate a user clicking _Got it_, you configure your source as follows:

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **Edit source with JSON** in the Action bar.

. In the **Edit configuration with JSON** panel, use the search tool ([search2]) to locate the `FormAuthenticationConfiguration` parameter.
Its `value` object currently looks as follows:

```txt
"value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",.....\"customLoginSequence\":{}}"
```

. Remove the `}"` at the end of the `value` object so that you get:

```txt
"value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",.....\"customLoginSequence\":{}
```

. Append the following JSON snippet immediately after `\"customLoginSequence\":{}` to configure the post-login action:

```txt
,\"postLoginSequence\":{\"name\":\"Handle maintenance page\",\"url\":\"*\",\"urlContainsValue\":\"maintenanceandavailable.jsp\",\"steps\":[{\"name\":\"Dismiss maintenance modal\",\"waitDelayInMilliseconds\":500,\"actions\":[{\"type\":\"click\",\"elementIdentifier\":{\"identifier\":\"maintenance-confirm-button\",\"type\":\"default\",\"findType\":\"classname\"}}]}]}}"
```

**Example: Final FormAuthenticationConfiguration parameter value**

```json
"FormAuthenticationConfiguration": {
  "sensitive": false,
  "value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",\"values\":[\"sid\"]},\"inputs\":[],\"formUrl\":\"https://somedomain.my.salesforce.com/apex/RedirectPage?siteType=lwr&basePath=/sitesDemo\",\"enableJavaScript\":true,\"forceLogin\":false,\"javaScriptLoadingDelayInMilliseconds\":1000,\"customLoginSequence\":{},\"postLoginSequence\":{\"name\":\"Handle maintenance page\",\"url\":\"*\",\"urlContainsValue\":\"maintenanceandavailable.jsp\",\"steps\":[{\"name\":\"Dismiss maintenance modal\",\"waitDelayInMilliseconds\":500,\"actions\":[{\"type\":\"click\",\"elementIdentifier\":{\"identifier\":\"maintenance-confirm-button\",\"type\":\"default\",\"findType\":\"classname\"}}]}]}}"
},
```

This post-login action configuration can be translated as the following instruction to the source crawler: "When a page whose URL contains `maintenanceandavailable.jsp` is encountered, wait 500 milliseconds, and then click the element with the class name `maintenance-confirm-button`."

For more details on the post-login action parameters, see [Configure an action](https://docs.coveo.com/en/3289#configure-an-action).

. Click **Save**.
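
Hand-editing an escaped JSON string is error-prone, so the following is a minimal Python sketch of an alternative way to produce the same `value` string. It decodes the current `value`, adds the `postLoginSequence` object, and re-encodes the result. The function name `add_post_login_sequence` and the variable names are illustrative assumptions, not part of any Coveo tooling. The starting value is the one from the example above, without the post-login action.

```python
import json


def add_post_login_sequence(value: str, sequence: dict) -> str:
    """Return the JSON text stored in `value` with `postLoginSequence` set.

    Hypothetical helper for illustration only; not a Coveo API.
    """
    config = json.loads(value)  # decode the JSON held inside the string
    config["postLoginSequence"] = sequence
    # Compact separators match the whitespace-free style of the original value.
    return json.dumps(config, separators=(",", ":"))


# Post-login action taken from the example above.
post_login_sequence = {
    "name": "Handle maintenance page",
    "url": "*",
    "urlContainsValue": "maintenanceandavailable.jsp",
    "steps": [
        {
            "name": "Dismiss maintenance modal",
            "waitDelayInMilliseconds": 500,
            "actions": [
                {
                    "type": "click",
                    "elementIdentifier": {
                        "identifier": "maintenance-confirm-button",
                        "type": "default",
                        "findType": "classname",
                    },
                }
            ],
        }
    ],
}

# Unescaped form of the current `value` (assumed starting point).
current_value = (
    '{"authenticationFailed":{"method":"CookieNotSet","values":["sid"]},'
    '"inputs":[],'
    '"formUrl":"https://somedomain.my.salesforce.com/apex/RedirectPage'
    '?siteType=lwr&basePath=/sitesDemo",'
    '"enableJavaScript":true,"forceLogin":false,'
    '"javaScriptLoadingDelayInMilliseconds":1000,'
    '"customLoginSequence":{}}'
)

new_value = add_post_login_sequence(current_value, post_login_sequence)

# Encoding the string once more adds the outer quotes and the \" escapes
# expected in the source JSON configuration.
print('"value": ' + json.dumps(new_value))
```

The printed line is meant to replace the existing `"value"` entry of the `FormAuthenticationConfiguration` parameter. Compare it with the final parameter value shown above before saving.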

#### "Crawling Module" subtab

If your source is a [Crawling Module source](https://docs.coveo.com/en/1612/), and if you have [more than one Crawling Module linked to this organization](https://docs.coveo.com/en/3271#deploying-multiple-crawling-module-instances), select the one with which you want to pair your source.
If you change the Crawling Module instance paired with your source, a successful [rebuild](https://docs.coveo.com/en/3390#refresh-rescan-or-rebuild-sources) is required for your change to apply.

#### "Identification" subtab

The **Identification** subtab contains general information about the source.

##### Name

The source name. It can't be modified once it's saved.

##### Project

Use the **Project** selector to associate your source with one or more Coveo [projects](https://docs.coveo.com/en/n7ef0517/).

### "Items" tab

On the **Items** tab, you can specify how the source handles items based on their file type or content type.

#### File types

File types let you define how the source handles [items](https://docs.coveo.com/en/210/) based on their file extension or content type.
For each file type, you can specify whether to index the item content and [metadata](https://docs.coveo.com/en/218/), only the item metadata, or neither.

You should fine-tune the file type configurations with the objective of indexing only the content that's relevant to your users.

**Example**

Your repository contains `.pdf` files, but you don't want them to appear in search results.
You click **Extensions** and then, for the `.pdf` extension, you change the **Default action** and **Action on error** values to `Ignore item`.

For more details about this feature, see [File type handling](https://docs.coveo.com/en/l3qg9275/).

> **Tip**
>
> With [file type handling](https://docs.coveo.com/en/l3qg9275/), using the `Index metadata` default action on HTML items lets you index basic metadata for those items.
> On the other hand, web scraping is used to index custom metadata from the page content, which must first be retrieved.
> These metadata indexing mechanisms are complementary, and you can use them together within a source.
>
> If there are some items for which you only need to index basic metadata, make sure you don't have a web scraping rule that matches those items.
> This will prevent unnecessary processing and potential issues with retrieving protected page content.

#### Content and images

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option.

The extracted text is processed as item data, meaning that it's fully searchable and will appear in the item [Quick view](https://docs.coveo.com/en/2760#search-result-quick-view).

> **Note**
>
> When OCR is enabled, ensure the source's relevant [file type configurations](https://docs.coveo.com/en/l3qg9275/) index the item content.
> Indexing the item's metadata only or ignoring the item will prevent OCR from being applied.

See [Enable optical character recognition](https://docs.coveo.com/en/2937/) for details on this feature.

### "Content security" tab

Select who will be able to access the source items through a Coveo-powered [search interface](https://docs.coveo.com/en/2741/).
For details on the content security options, see [Content security](https://docs.coveo.com/en/1779/).

### "Access" tab

On the **Access** tab, specify whether each group (and API key, if applicable) in your [Coveo organization](https://docs.coveo.com/en/185/) can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration, while Group B can only view it.

For more information, see [Custom access level](https://docs.coveo.com/en/3151#custom-access-level).

### Build the source

. Finish adding or editing your source:
** When you're done editing the source and want to make your changes effective, click **Add and build source**/**Save and rebuild source**.
** When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click **Add source**/**Save**.
On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click **Launch build** or **Start required rebuild** when you're ready to make your changes effective and index your content.

> **Leading practice**
>
> By default, a Jira Software source indexes the entire Jira Software instance content.
> To index only certain projects, click **Save**, and then specify the desired address patterns in your [source JSON configuration](https://docs.coveo.com/en/1685/) before launching the initial build.
> See [Add source filters](https://docs.coveo.com/en/2006#add-source-filters) for further information.

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, follow the progress of your source addition or modification.

. Once the source is built or rebuilt, [review its content in the Content Browser](https://docs.coveo.com/en/2053/).

. Optionally, consider [editing or adding mappings](https://docs.coveo.com/en/1640/).

> **Note**
>
> If you selected **Specific URLs** or **User profiles** in the [**Content**](https://docs.coveo.com/en/1739#content) section, some additional items will appear in the Content Browser.
> To retrieve user profiles, Coveo must crawl your SharePoint Online instance, including your host site collection and the documents it contains.
> Items encountered during this process are also retrieved and therefore appear in the Content Browser.

### Index metadata

To use [metadata](https://docs.coveo.com/en/218/) values in [search interface](https://docs.coveo.com/en/2741/) [facets](https://docs.coveo.com/en/198/) or result templates, the metadata must be [mapped](https://docs.coveo.com/en/217/) to [fields](https://docs.coveo.com/en/200/).
Coveo automatically [maps](https://docs.coveo.com/en/217/) only a subset of the metadata it extracts.
You must map any additional metadata to fields manually.

> **Note**
>
> Not clear on the purpose of indexing metadata?
> Watch [this video](https://www.youtube.com/watch?v=BmmmVJ3AWi0).

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **View and map metadata** in the Action bar.

. Review the default [metadata](https://docs.coveo.com/en/218/) that your source is extracting from your content.

. Map any currently _not indexed_ metadata that you want to use in facets or result templates to fields.

.. Click the metadata and then, at the top right, click **Add to Index**.

.. In the **Apply a mapping on all item types of a source** panel, select the field you want to map the metadata to, or [add a new field](https://docs.coveo.com/en/1833#add-a-field) if none of the existing fields are appropriate.

> **Note**
>
> For advanced mapping configurations, like applying a mapping to a specific item type, see [Manage mappings](https://docs.coveo.com/en/1640#manage-mappings).

.. Click **Apply mapping**.

. Return to the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page.

. To reindex your source with your new mappings, click your source, and then click **More** > **Rebuild** in the Action bar.

. Once the source is rebuilt, review your item field values.
They should now include the values of the metadata you selected to index.

.. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **Open in Content Browser** in the Action bar.

.. Select the card of the item for which you want to inspect properties, and then click **Properties** in the Action bar.

.. In the panel that appears, select the **Fields** tab.

. If needed, extract and map additional metadata.

**More on custom metadata extraction**

To extract custom metadata, you can use the following methods:

* Configure [web scraping](https://docs.coveo.com/en/2767/) configurations that contain [metadata extraction rules](https://docs.coveo.com/en/mc1f3573#web-scraping-configuration-editing-modes) using CSS or XPath selectors.
* Extract metadata from [JSON-LD `