--- title: Add a Sitemap source slug: '1967' canonical_url: https://docs.coveo.com/en/1967/ collection: index-content source_format: adoc --- # Add a Sitemap source :figure-caption!: Members with the [required privileges](#required-privileges) can use a Sitemap [source](https://docs.coveo.com/en/246/) to make the content of webpages listed in a sitemap file or sitemap index file searchable. A sitemap file can be added to a website and is required when using a Sitemap source. The file contains a list of the website's URLs along with their respective [metadata](https://docs.coveo.com/en/218/) which include the LMD (last-modified-date). This enables the Sitemap source to perform [refresh](https://docs.coveo.com/en/2710/) updates, which the [Web source](https://docs.coveo.com/en/malf0160/) doesn't support. For this reason, although a Sitemap source requires the extra step of adding a sitemap file, it offers [better performance than the Web source](https://docs.coveo.com/en/2680#sitemap-or-web-source). ## Source key characteristics The following table presents the main characteristics of a Sitemap source. [%header,cols="2,2,2,4"] |=== 2+|Features ^|Supported |Additional information 2+|Indexable content ^|Webpages (URL) | 2+|Sitemap file format a|* XML * Text * RSS 2.0 * Atom 1.0 * HTML * GZ a|Sitemap files and sitemap index files must respect the [Sitemap protocol](https://www.sitemaps.org/protocol.html). Strict validations can be enforced by enabling the [ParseSitemapInStrictMode](https://docs.coveo.com/en/3158#parsesitemapinstrictmode-boolean) option. HTML pages that use JavaScript to generate a sitemap or redirect to a sitemap or sitemap index are supported. For a .gz sitemap file, the web server response [`Content-Type` header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type) must be `application/gzip`. 
2+|Compressed HTTP responses ^|[check] |The source automatically handles compressed web server HTTP responses with the following `Content-Encoding` header values: `gzip`, `deflate`, and `br`. .3+|[Content update operations](https://docs.coveo.com/en/2039/) |[refresh](https://docs.coveo.com/en/2710/) ^|[check] a|To be refreshed, an item in the sitemap file must have a last-modification date[.footnote]^[[1](#limitations)]^ (for example, the `lastmod` element in an XML sitemap, or the `updated` element in an Atom sitemap) whose value is more recent than the last refresh operation. A rescan or rebuild operation is required to take account of deleted sitemap entries. |[rescan](https://docs.coveo.com/en/2711/) ^|[check] a|[Takes place every day by default](https://docs.coveo.com/en/1933/). To be rescanned, an item in the sitemap file must have a last-modification date[.footnote]^[[1](#limitations)]^ (for example, the `lastmod` element in an XML sitemap, or the `updated` element in an Atom sitemap) whose value is more recent than the last time the item was indexed. |[rebuild](https://docs.coveo.com/en/2712/) ^|[check] | .3+|[Content security options](#content-security-tab) |[Same users and groups as in your content system](https://docs.coveo.com/en/1779#same-users-and-groups-as-in-your-content-system) ^|[x] | |[Specific users and groups](https://docs.coveo.com/en/1779#specific-users-and-groups) ^|[check] | |[Everyone](https://docs.coveo.com/en/1779#everyone) ^|[check] | .2+.^|[Authentication methods](#authentication-subtab) |Basic authentication ^|[check] .2+a|Supported HTTP authentication schemes: * Basic * Digest * NTLM * Negotiate/Kerberos * Form based |Form authentication ^|[check] 2+|[Crawling rules](#crawling-rules-subtab) ^|[check] |A variety of basic and advanced rules may be used to ignore the webpages you don't want to [index](https://docs.coveo.com/en/204/). 
.4+|[Metadata indexing for search](#index-metadata)
|Automatic mapping of [metadata](https://docs.coveo.com/en/218/) to [fields](https://docs.coveo.com/en/200/) that have the same name
2+a|Disabled by default. To enable, [access the JSON configuration of your source](https://docs.coveo.com/en/1685#access-the-edit-configuration-with-json-panel), and set [`performFieldMappingUsingAllOrigins`](https://docs.coveo.com/en/1640#about-the-performfieldmappingusingallorigins-setting) to `true`.
|Automatically indexed [metadata](https://docs.coveo.com/en/218/)
2+a|Examples of [auto-populated default fields](https://docs.coveo.com/en/1833#field-origin) (no user-defined metadata required):
* `clickableuri`
* `date`
* `filetype`
* `language` (autodetected from document content)
* `title`

The [`author`](https://docs.coveo.com/en/1833#field-origin) field will also be auto-populated if the content item contains an `author` metadata value. After a content update, [inspect your item field values](https://docs.coveo.com/en/2053#inspect-search-results) in the **Content Browser**.
|Extracted but not indexed metadata
2+a|The source automatically extracts the `content` attribute from `<meta>` tags that include a `name` attribute. For example, if the HTML of a page contains the following: `<meta name="author" content="jsmith">`, the Sitemap source extracts _jsmith_ as the `author` metadata. After a rebuild, review the [**View and map metadata**](https://docs.coveo.com/en/m9ti0339#view-and-map-metadata-subpage) subpage for the list of indexed metadata, and [index additional metadata](https://docs.coveo.com/en/m9ti0339#index-metadata).
|Custom metadata extraction 2+a|Available using the following source features: • XML sitemap file [custom metadata extraction](https://docs.coveo.com/en/2656/) • Webpage [scraping](#web-scraping-subtab) • Webpage [JSON-LD metadata extraction](#extract-json-ld-metadata) • Webpage metadata extraction using the [IndexHtmlMetadata](https://docs.coveo.com/en/o2ta0401/) JSON configuration parameter 2+|[JavaScript content rendering](#execute-javascript-on-pages) ^|[check] |The Sitemap source crawler can execute JavaScript in a webpage to dynamically render content before indexing the page. 2+|[Shadow DOM content retrieval](#execute-javascript-on-pages) ^|[check] |If you choose to render JavaScript content, you can also specify whether the crawler should traverse and index attached Shadow DOM content. 2+|[Web scraping](#web-scraping-subtab) ^|[check] |Exclude irrelevant sections in pages and extract [metadata](https://docs.coveo.com/en/218/). 2+|[Optical Character Recognition (OCR)](#content-and-images) ^|[check] |Available at an extra charge. Contact [Coveo Sales](https://www.coveo.com/en/contact) to add this feature to your [Coveo organization](https://docs.coveo.com/en/185/) [license](https://docs.coveo.com/en/2864/). |=== ## Limitations * The last-modification attribute must specify the modification time in [W3C DateTime format](https://www.w3.org/TR/NOTE-datetime), that is, `YYYY-MM-DDThh:mm:ss`. Moreover, unless you specify a time zone, you must express the modification time in Coordinated Universal Time (UTC). * Multi-factor authentication (MFA) and CAPTCHA aren't supported. * The Sitemap source crawler can handle up to 200 cookies for the same domain, and a total of 3000 cookies. If the crawled sites add cookies beyond these limits, the crawler will drop older cookies, which can cause issues (for example, if a dropped cookie is required for authentication). * Indexing page permissions isn't supported. 
* The Sitemap source doesn't support `robots.txt` file directives or `<meta name="robots">` tags.
* The [Coveo indexing pipeline](https://docs.coveo.com/en/184/) can handle web pages up to 512 MB only. Larger pages are [indexed by reference](https://docs.coveo.com/en/l3qg9275#file-type-configurations) (that is, their content is ignored by the Coveo [crawler](https://docs.coveo.com/en/2121/), and only their metadata and path are searchable). Therefore, no search result [Quick view](https://docs.coveo.com/en/2760#search-result-quick-view) is available for these larger [items](https://docs.coveo.com/en/210/).
* JavaScript usage and limitations:
** The [**Execute JavaScript on pages**](#execute-javascript-on-pages) and **Add time for the crawler to wait before considering a page as fully rendered** settings only pertain to webpage content retrieval for indexing. When authenticating, the Sitemap crawler applies the [Loading delay](#loading-delay) or the [custom login sequence](#custom-login-sequence) [wait delay](https://docs.coveo.com/en/3289#wait-delay) values.
** JavaScript-rendered sitemaps are supported provided that the `Content-Type` of the targeted sitemap file is `application/xhtml+xml` or `text/html`.
** Content in pop-up windows and elements that require interaction aren't indexed.
** When the [**Execute JavaScript on pages**](#execute-javascript-on-pages) option is enabled, the source doesn't support the [`UseProxy`](https://docs.coveo.com/en/3158#useproxy-boolean) parameter.
* The [`UseProxy`](https://docs.coveo.com/en/3158#useproxy-boolean) parameter can't be used in combination with [**Form authentication**](https://docs.coveo.com/en/1967#form-authentication).

## Leading practices

* Make sure you have the right to [crawl](https://docs.coveo.com/en/2121/) public content if you don't own the website. Crawling sites that you neither own nor have the right to crawl could create reachability issues.
Some sites use infrastructure components such as CDN/Caching providers (for example, Akamai, Cloudflare, and Varnish) that can affect Coveo's ability to retrieve content. If you're unfamiliar with these mechanisms, learn about them before you configure your source. For example, a CDN/Caching provider can detect the Coveo crawler and block it from further crawling. * Always try authenticating without a [custom login sequence](https://docs.coveo.com/en/3289/) first. You should only start working on a custom login sequence when you're sure your form authentication details (that is, login address, user credentials, confirmation method) are accurate and that the standard form authentication process doesn't work. * It's best to create or edit your source in your sandbox organization first. Once you have confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either [with a snapshot](https://docs.coveo.com/en/3239/) or manually. See [About non-production organizations](https://docs.coveo.com/en/2959/) for more information and best practices regarding sandbox organizations. * Though it's possible to index multiple domains by configuring the source outside the main user interface, doing so is a bad practice. Always create one source per domain. This helps: ** Prevent the crawler from using your source authentication credentials on an external site. ** Reduce the number and complexity of crawling and scraping rules. ** Optimize source configurations for each site. ** Avoid having a rebuild/rescan issue on one site cause the deletion of indexed items associated with the other sites. * The number of [items](https://docs.coveo.com/en/210/) that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See [About crawling speed](https://docs.coveo.com/en/2078/) for information on what can impact crawling speed, as well as possible solutions. 
* Break down large sitemap files into multiple sitemap files.
* Group your source and the other implementation [resources](https://docs.coveo.com/en/2820/) together in a [project](https://docs.coveo.com/en/n7ed6189/). See [Manage projects](https://docs.coveo.com/en/n7ef0517/).

## Add a Sitemap source

To add a source

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click **Add source**.
. In the **Add a source of content** panel, click the **Cloud** or [**Crawling Module**](https://docs.coveo.com/en/3260/) tile, depending on your [content retrieval context](https://docs.coveo.com/en/1612/). With the latter, you must [install the Crawling Module](https://docs.coveo.com/en/3263/) to make your source operational.

![Image of Sitemap cloud and Crawling Module tiles | Coveo](https://docs.coveo.com/en/assets/images/index-content/sitemap-cloud-and-crawling-module-tiles.png)

. [[addstep3]]In the **Add a new Sitemap source** / **Add a new Crawling Module Sitemap source** panel, fill in the following fields.
--
**Name**: Use a short and descriptive name, using only letters, numbers, hyphens (-), and underscores (_). The source name can't be modified once it's saved.

**Sitemap URLs**: Enter the direct URL of your sitemap file, not the website address. Otherwise, the source can interpret the address as an HTML sitemap page and crawl the links it contains.
--
--
**Examples of sitemap URLs** * Public website sitemap: `+http://myorgwebsite.com/sitemap.xml+` * Public website sitemap compressed with GZIP: `+http://myorgwebsite.com/sitemap.xml.gz+` > **Notes** > > * If the sitemap URL is an HTML page that uses JavaScript to generate a sitemap or redirect to a sitemap or sitemap index, enable [JavaScript execution on pages](#execute-javascript-on-pages) in the source's advanced settings. > > * The [`ParseSitemapInStrictMode`](https://docs.coveo.com/en/3158#parsesitemapinstrictmode-boolean) JSON parameter dictates the extent of validation the Sitemap source applies on sitemap and sitemap index files, and on their referenced URLs. > > * The Sitemap source only crawls pages listed in a sitemap file. > It doesn't crawl links in the listed web pages themselves. -- **Crawling Module**: If you're creating a Crawling Module Sitemap source, select the installed Crawling Module instance. **Project**: Specify the [projects](https://docs.coveo.com/en/n7ef0517/) you want to associate your source with. > **Note** > > After source creation, you can update your Coveo project selection under the [**Identification**](#identification-subtab) subtab. . Click **Next**. . Select who has [permission to access the content](https://docs.coveo.com/en/1779/) through the search interface and click **Add source**. > **Note** > > This information is editable later in the [**Content security**](#content-security-tab) tab. . Configure your [source](https://docs.coveo.com/en/246/). > **Note** > > You can save your source settings at any time by clicking **Save**. ### "Configuration" tab The **Configuration** tab lets you manage the crawling rules, web scraping configurations, advanced settings, and authentication methods of your source. These configuration groups are presented in subtabs. #### "Crawling rules" subtab The **Crawling rules** subtab lets you define the specific pages to [index](https://docs.coveo.com/en/204/). 
##### Sitemap URLs Enter the direct URL of your sitemap file, not the website address. Otherwise, the source can interpret the address as an HTML sitemap page and crawl the links it contains. **Examples of sitemap URLs** * Public website sitemap: `+http://myorgwebsite.com/sitemap.xml+` * Public website sitemap compressed with GZIP: `+http://myorgwebsite.com/sitemap.xml.gz+` > **Notes** > > * If the sitemap URL is an HTML page that uses JavaScript to generate a sitemap or redirect to a sitemap or sitemap index, enable [JavaScript execution on pages](#execute-javascript-on-pages) in the source's advanced settings. > > * The [`ParseSitemapInStrictMode`](https://docs.coveo.com/en/3158#parsesitemapinstrictmode-boolean) JSON parameter dictates the extent of validation the Sitemap source applies on sitemap and sitemap index files, and on their referenced URLs. > > * The Sitemap source only crawls pages listed in a sitemap file. > It doesn't crawl links in the listed web pages themselves. ##### Exclusions and inclusions Add exclusion and inclusion rules to crawl only specific items based on their URL. ![Exclusions and inclusions user interface screenshot | Coveo](https://docs.coveo.com/en/assets/images/index-content/exclusions-and-inclusions.png) The following diagram illustrates how the Sitemap [crawler](https://docs.coveo.com/en/2121/) applies the exclusion and inclusion rules. This flow applies to all pages, including the sitemap URLs. You must therefore pay attention to not filter out your sitemap URLs. ![Crawling workflow diagram | Coveo](https://docs.coveo.com/en/assets/images/index-content/crawl-rules-flow.png) > **About the "Include all non-excluded pages" option** > > [.float-group] > -- > ![Crawling flow with the all-inclusive inclusion rule | Coveo](:https://docs.coveo.com/en/assets/images/index-content/crawl-rules-flow-all-include.png) > > The **Include all non-excluded pages** option automatically adds an "include all" inclusion rule in the background. 
> This ensures that all sitemap URLs meet the `Does URL match at least one inclusion rule?` condition and that all non-excluded pages get crawled.
>
> --

You can use any of the six types of rules:

* **is** and a URL that includes the protocol. For example, `+https://myfood.com/+`.
* **contains** and a string found in the URL. For example, `recipes`.
* **begins with** and a string found at the beginning of the URL and which includes the protocol. For example, `+https://myfood+`.
* **ends with** and a string found at the end of the URL. For example, `.pdf`.
* **matches wildcard rule** and a wildcard expression that matches the whole URL. For example, `+https://myfood.com/recipes*+`.
* **matches regex rule** and a regex rule that matches the whole URL. For example, `^.*(company-(dev|staging)).*html.?$`.

> **Tip**
>
> When using regex rules, make sure they match the desired URLs with a testing tool such as [Regex101](https://regex101.com/).

#### "Web scraping" subtab

The **Web scraping** subtab lists and lets you manage [web scraping](https://docs.coveo.com/en/2767/) configurations for your source.

When the crawler is about to index a page, it checks whether it must apply web scraping configurations that have been defined. The crawler considers the [**Pages to target**](https://docs.coveo.com/en/mahe0350#configuration-info) rules of each of your web scraping configurations, starting with the configuration at the top of your list. The crawler will either apply [the first matching configuration or all matching configurations](#single-match-vs-multi-match).

Indexing irrelevant page sections and not extracting custom metadata reduces the quality of search results. With this in mind, all new Sitemap sources are created with a default web scraping configuration that excludes typical repetitive elements found in web pages that shouldn't be indexed.
![Default web scraping configuration | Coveo](https://docs.coveo.com/en/assets/images/index-content/default-web-scraping.png) Existing Sitemap sources without a web scraping configuration prompt you to add the default configuration when you access the **Web scraping** subtab. ![Apply the default web scraping configuration | Coveo](https://docs.coveo.com/en/assets/images/index-content/apply-default-web-scraping.png) > **Important** > > When no web scraping configuration is defined: > > * All [crawling rules included pages](#crawling-rules-subtab) are indexed in their entirety (that is, no sections are excluded). > > * No custom metadata is extracted. The Sitemap source features two web scraping configuration management modes: [UI-assisted mode](#ui-assisted-mode) and [Edit with JSON mode](#edit-with-json-mode). ##### UI-assisted mode You can add (+), edit ([edit]), and delete ([delete]) _one_ web scraping configuration at a time with a user interface that makes many technical aspects transparent. UI-assisted mode is easier to use and more mistake-proof than Edit with JSON mode. This is now the recommended mode for all web scraping configurations. When you add or edit a web scraping configuration using UI-assisted mode, the **Add/Edit a web scraping configuration** panel is displayed. See [Configurations in UI-assisted mode](https://docs.coveo.com/en/mahe0350#configurations-in-ui-assisted-mode) for more details. ##### Edit with JSON mode The **Edit with JSON** button gives access to the _aggregated_ web scraping JSON configuration of the source. Adding, editing, and deleting configurations directly in the JSON requires more technical skills than using UI-assisted mode. When you add or edit a web scraping configuration in Edit with JSON mode, the **Edit a web scraping JSON configuration** panel is displayed. See [Configurations in Edit with JSON mode](https://docs.coveo.com/en/mahe0350#configurations-in-edit-with-json-mode) for more details. 
##### Single-match vs multi-match The Sitemap source can apply web scraping configurations in two ways: single-match or multi-match. In single-match mode, the crawler applies only the first matching web scraping configuration. In multi-match mode, the crawler applies all matching web scraping configurations. The animation below demonstrates the application of three web scraping configurations on a culinary website featuring news articles and recipe pages, in single-match mode (left) and multi-match mode (right). ![Animation showing the single-match and multi-match behaviors | Coveo](https://docs.coveo.com/en/assets/images/index-content/single-vs-multi-match-animation.gif) Sitemap sources created before mid-December 2023 were created in single-match mode. All new Sitemap sources are created in multi-match mode. Coveo converted existing single-match sources containing zero or one web scraping configuration to multi-match mode. We recommend you convert any remaining single-match Sitemap source to multi-match mode. If a Sitemap source is currently in single-match mode, the **Web scraping** subtab displays a banner prompting you to convert to multi-match mode. ![Multi-match conversion banner | Coveo](https://docs.coveo.com/en/assets/images/index-content/single-match-switch-message.png) To convert a source to multi-match mode . In the **Web scraping** subtab, click **Switch to multi-match mode**. . Confirm you want to convert the source to multi-match mode. A green **You're currently in multi-match mode** banner will then appear. . Click **Save**. Once your source is fully converted, the **Web scraping** subtab no longer shows the green banner and the subtab description reflects the multi-match mode behavior. 
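
The difference between the single-match and multi-match behaviors described in this section can be sketched in a few lines of Python. This is only an illustrative model, not the crawler's implementation: the `ScrapingConfig` class, the `url_contains` rule (a simplified stand-in for the **Pages to target** rules), the configuration names, and the `myfood.com` URL are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapingConfig:
    """Hypothetical, simplified web scraping configuration."""
    name: str
    url_contains: str  # stand-in for the "Pages to target" rules


def matching_configs(url: str, configs: list[ScrapingConfig], multi_match: bool) -> list[str]:
    """Return the names of the configurations applied to a URL, in list order."""
    matches = [c.name for c in configs if c.url_contains in url]
    # Single-match mode stops at the first matching configuration.
    return matches if multi_match else matches[:1]


# Configurations are evaluated from the top of the list down.
CONFIGS = [
    ScrapingConfig("Exclude header and footer", "myfood.com"),
    ScrapingConfig("Extract recipe metadata", "/recipes/"),
    ScrapingConfig("Extract news metadata", "/news/"),
]

url = "https://myfood.com/recipes/lasagna"
print(matching_configs(url, CONFIGS, multi_match=False))
# ['Exclude header and footer']
print(matching_configs(url, CONFIGS, multi_match=True))
# ['Exclude header and footer', 'Extract recipe metadata']
```

In this model, single-match mode would require you to repeat the header and footer exclusions in the recipe configuration and place that configuration first, whereas multi-match mode lets a general configuration and a page-specific one both apply.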
![Web scraping configuration options and description for targeting and processing pages | Coveo](https://docs.coveo.com/en/assets/images/index-content/web-scraping-configuration-description.png) #### "Advanced settings" subtab The **Advanced settings** subtab lets you customize the Coveo crawler behavior. All advanced settings have default values, which are adequate in most use cases. ##### Execute JavaScript on pages Only enable this option when any of the following are true, as it can significantly increase the time needed to crawl pages: * The website content you want to consider for indexing is dynamically rendered by JavaScript. * A [sitemap URL](#sitemap-urls) you specified is an HTML page that uses JavaScript to generate or redirect to a sitemap or sitemap index file. If you enable **Execute JavaScript on pages**, you'll have the following options: * **Add time for the crawler to wait before considering a page as fully rendered**: The default value of this setting is `0`, which means that the crawler doesn't wait after the page is loaded to retrieve its content. If the JavaScript takes longer to execute than normal or makes asynchronous calls, consider increasing this value to ensure that pages with longer rendering times are indexed with their dynamically rendered content. * **Enable Shadow DOM content retrieval**: When you enable this option, the crawler builds a flattened DOM tree by combining the light DOM and the Shadow DOM. It then processes the resulting structure as it would any other web page. > **Note** > > The crawler adds a custom attribute to the shadow root elements in the flattened DOM, allowing these elements to be targeted using a [special web scraping CSS selector](https://docs.coveo.com/en/mahe0350#css-selectors). > **Important** > > Building the composed DOM can significantly slow down indexing. > Enable this option only if the Shadow DOM contains valuable content you need to index. 
##### User Agent string The [user agent](https://en.wikipedia.org/wiki/User_agent) string that the Sitemap source crawler uses to identify itself when requesting pages from your web server. The default value is `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)`. ##### Extract JSON-LD metadata If you have [JSON-LD](https://json-ld.org/) metadata in your HTML pages that you want to index, enable the **Extract JSON-LD metadata** option. When enabled, JSON-LD objects in the webpage are extracted, flattened, and represented in `jsonld.parent.child` metadata format in your [Coveo organization](https://docs.coveo.com/en/185/). **Example** Given the following JSON-LD script tag in a webpage: ```javascript ``` On an [indexing](https://docs.coveo.com/en/204/) action, the Sitemap [connector](https://docs.coveo.com/en/2734/) would extract `BBC News` as the value for the `jsonld.publisher.name` [metadata](https://docs.coveo.com/en/218/). To [index this metadata](https://docs.coveo.com/en/m9ti0339#index-metadata), you would therefore need to use `%[jsonld.publisher.name]` as the [mapping rule](https://docs.coveo.com/en/1839/) for your field. ##### Time the crawler waits between requests to your server Indicate the number of milliseconds between consecutive HTTP requests to the website server. The default value is 1000 milliseconds, which represents a crawling rate of one page per second. #### "Authentication" subtab The **Authentication** settings, used by the source crawler, emulate the behavior of a user authenticating to access restricted website content. If authentication is required, select the authentication type your website uses, whether [**Basic authentication**](#basic-authentication) or [**Form authentication**](#form-authentication). Then, provide the corresponding login details. 
> **Warning** > > Whether you use Basic or Form authentication, limit your source [crawling scope](#crawling-rules-subtab) to one domain that you own. > This reduces the risk of exposing your authentication credentials. > **Note** > > Manual form authentication is now only available on legacy sources. > We recommend you [migrate existing Manual form authentication sources](#migrate-from-manual-form-authentication) to Form authentication. ##### Basic authentication When selecting **Basic authentication**, enter the credentials of an account on the website you're making searchable. See [Source credentials leading practices](https://docs.coveo.com/en/1920/). > **Important** > > When **Execute JavaScript on pages** is enabled on the source, basic authentication significantly impacts indexing performance. If your sitemap contains a link to a page of a different domain or subdomain that also requires basic authentication, the Sitemap source will provide the credentials you entered when challenged. > **Important** > > To prevent exposing your credentials, provide username and password information only when the site uses a communication protocol secured with TLS or SSL (HTTPS). > You are responsible for ensuring that your Sitemap links requiring basic authentication credentials use HTTPS for increased security. > The basic authentication credentials you enter will be provided, regardless of whether the link requiring these credentials uses HTTP or HTTPS. ##### Form authentication You can choose between two form authentication workflows: **Force authentication disabled (recommended)**
With [Force authentication](#force-authentication) disabled, the workflow typically goes as follows:

. Coveo's crawler requests a protected page.
. The web server redirects the crawler to the [**Login page address**](#login-page-address).
. Using the configured [**Validation method**](#validation-method), the crawler determines it's not authenticated. This automatically triggers the next step.
. The crawler performs a standard login sequence using the provided **Login details**, or the [**Custom login sequence**](#custom-login-sequence) if one is configured.
. After successful authentication, the web server responds by redirecting back to the requested protected page and returning cookies.
. The crawler follows the server redirect to get the protected page and indexes that page.
. The crawler requests the other pages using the cookies.

This is the default and recommended workflow, as it best emulates human behavior and ensures that the crawler re-authenticates when needed.
**Force authentication enabled**
With [Force authentication](#force-authentication) enabled, the workflow typically goes as follows:

. The crawler performs a standard login sequence using the provided **Login details**, or the [**Custom login sequence**](#custom-login-sequence) if one is configured.
. After successful authentication, the web server responds with cookies that the crawler will use to request other pages.
. The crawler requests the first URL from the web server using the cookies and indexes that page.
. The crawler requests other pages using the cookies.

If the crawler loses authentication at some point (for example, if a cookie expires), it has no way of knowing it must re-authenticate unless you have a proper authentication status [validation method](#validation-method). As a result, you may notice at some point that your source has indexed some, but _not all_, protected pages.

Only use [**Force authentication**](#force-authentication) when no reliable authentication status [validation method](#validation-method) can be configured.
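
To illustrate why a reliable validation method matters, the following Python sketch simulates both workflows against a toy server whose session expires after two requests. Everything here is hypothetical and greatly simplified: `ToyServer`, the `SESSION` cookie name, the session lifetime, and the `example.com` URLs are assumptions for the example and don't reflect how the Coveo crawler is implemented.

```python
from dataclasses import dataclass


@dataclass
class ToyServer:
    """Hypothetical server: each login grants a session valid for a few requests."""
    session_lifetime: int = 2
    remaining: int = 0

    def login(self) -> dict:
        self.remaining = self.session_lifetime
        return {"SESSION": "token"}

    def get(self, url: str, cookies: dict) -> tuple[str, dict]:
        # A valid session returns the protected page and keeps the cookie.
        if cookies.get("SESSION") and self.remaining > 0:
            self.remaining -= 1
            return f"content of {url}", cookies
        # Otherwise the server answers with the login page and no session cookie.
        return "login page", {}


def crawl(urls: list[str], server: ToyServer, force_authentication: bool,
          cookie_validation: bool) -> list[str]:
    """Return the URLs whose protected content was actually retrieved."""
    cookies = server.login() if force_authentication else {}
    retrieved = []
    for url in urls:
        body, cookies = server.get(url, cookies)
        # "Cookie not found" style check: a missing cookie triggers (re-)authentication.
        if cookie_validation and "SESSION" not in cookies:
            cookies = server.login()
            body, cookies = server.get(url, cookies)
        if body != "login page":
            retrieved.append(url)
    return retrieved


URLS = [f"https://example.com/page{i}" for i in range(1, 6)]

with_validation = crawl(URLS, ToyServer(), force_authentication=False, cookie_validation=True)
forced_only = crawl(URLS, ToyServer(), force_authentication=True, cookie_validation=False)

print(len(with_validation))  # 5
print(len(forced_only))      # 2
```

In this simulation, the run that relies on a validation check re-authenticates each time the session expires and retrieves all five pages, while the forced-authentication run without validation retrieves only the pages requested before the session expired.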
> **Note** > > The crawler can interact with Shadow DOM elements in your login pages. > If this is required, make sure the form authentication [loading delay](#loading-delay) allows the Shadow DOM time to load before the crawler begins to interact with the page. ###### Username and password Enter the credentials required to access the secured content. See [Source credentials leading practices](https://docs.coveo.com/en/1920/). ###### Login page address Enter the URL of the website login page where the username and password are to be used. ###### Loading delay Enter the maximum time the crawler should allow for JavaScript to execute and go through the login sequence before timing out. ###### Validation method The crawler uses the validation method after requesting a page from the web server to know if it's authenticated or not. When the validation method reveals that the crawler isn't authenticated, the crawler immediately tries to re-authenticate. To configure the validation method . In the dropdown menu, select your preferred authentication status validation method. . In the **Value(s)** field, specify the corresponding URL, regex or text. ** For **Cookie not found** (recommended): Enter the name of the cookie returned by the server after _successful_ authentication. If this cookie isn't found, the crawler will immediately authenticate (or re-authenticate). **Example** `ASP.NET_SessionId` ** For **Redirection to URL** (recommended): Enter the URL where users trying to access protected content on the website are redirected to when they're _not_ authenticated. If the crawler is redirected to this URL, it will immediately authenticate (or re-authenticate). **Example** `+https://mycompany.com/login/failed.html+` ** For **Text not found in page** footnote:not-recommended[Less reliable than the recommended validation methods. 
Can result in false positives, making form authentication issues harder to troubleshoot.]: Enter the text that appears on the page after _successful_ authentication. If this text isn't found on the page, the crawler will immediately authenticate (or re-authenticate). **Example** When a user successfully logs in, the page shows a "Hello, !" greeting text. If the login [username you specified](#username-and-password) was `+jsmith@mycompany.com+`, the text to enter would be: `Hello, \jsmith@mycompany.com!` **Example** `Log out` ** For **Text found in page** footnote:not-recommended[]: Enter the text that appears on the page when a user _isn't_ authenticated. If this text is found on the page, the crawler will immediately authenticate (or re-authenticate). **Examples** * `An error has occurred.` * `Your username or password is invalid.` ** For **URL matches regex** footnote:not-recommended[]: Enter a regex rule that matches the URL where users trying to access protected content are redirected to when they're _not_ authenticated. If the crawler is redirected to a URL that matches this regex, it will immediately authenticate (or re-authenticate). **Example** `.+Account\/Login.*` ** For **URL doesn't match regex** footnote:not-recommended[]: Enter a regex rule that matches the URL where users trying to access protected content are redirected to after _successful_ authentication. If the crawler isn't redirected to a URL that matches this regex, it will immediately authenticate (or re-authenticate). ###### Force authentication Select this option if you want Coveo's first request to be for authentication, regardless of whether it is actually required. > **Important** > > You should only force authentication if you have no reliable authentication status [validation method](#validation-method). ###### Custom login sequence The default login sequence for Web and Sitemap sources supports various third-party login pages, such as OneLogin, Google, Salesforce, and Microsoft. 
The default login sequence also tries to detect and log in to first-party login forms.
The login process uses the first `<form>` element that meets all requirements.

**Form requirements for the default login sequence**
The form must contain the following elements:

**User identity field**: A visible `<input>` element whose `id` or `name` attribute value is one of `user`, `email`, `login`, `id`, or `name` (case-insensitive).

**Password field**: A visible `<input>` element whose `type` attribute value is `password` (case-insensitive).

**Submit element**: A visible form submit element.
The source looks for the following element types, in this order:

* An `<input>` element whose `type` attribute value is `submit` or `button`.
* A `
If the web page doesn't meet the requirements for the default login sequence, or if your form requires specific actions during the login process, you must configure a [custom login sequence](https://docs.coveo.com/en/3289/).
[Web](https://docs.coveo.com/en/malf0160/) and [Sitemap](https://docs.coveo.com/en/1967/) sources provide an interface that lets administrators build custom form [authentication](https://docs.coveo.com/en/2120/) workflows.

> **Important**
>
> Ensure that the default source login sequence fails before you configure a custom login sequence.

###### Post-login sequence

The Sitemap source also supports post-login sequences to handle actions that need to be performed after logging in to a website.
Post-login sequences are configured using the `postLoginSequence` section of the `FormAuthenticationConfiguration` parameter in the source JSON configuration.

**Example**
After logging into a Salesforce site that you want to crawl using a Sitemap source, you encounter the following Salesforce platform infrastructure popup:

![Salesforce scheduled maintenance popup](https://docs.coveo.com/en/assets/images/index-content/scheduled-maintenance-popup.png)

You inspect the button using your browser developer tools and see the following HTML markup:

```html
```

To emulate a user clicking _Got it_, you configure your source as follows:

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **Edit source with JSON** in the Action bar.

. In the **Edit configuration with JSON** panel, use the search tool to locate the `FormAuthenticationConfiguration` parameter.
Its `value` object currently looks as follows:

```txt
"value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",.....\"customLoginSequence\":{}}"
```

. Remove the `}"` at the end of the `value` object so that you get:

```txt
"value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",.....\"customLoginSequence\":{}
```

. Append the following JSON snippet immediately after `\"customLoginSequence\":{}` to configure the post-login action:

```txt
,\"postLoginSequence\":{\"name\":\"Handle maintenance page\",\"url\":\"*\",\"urlContainsValue\":\"maintenanceandavailable.jsp\",\"steps\":[{\"name\":\"Dismiss maintenance modal\",\"waitDelayInMilliseconds\":500,\"actions\":[{\"type\":\"click\",\"elementIdentifier\":{\"identifier\":\"maintenance-confirm-button\",\"type\":\"default\",\"findType\":\"classname\"}}]}]}}"
```

**Example: Final FormAuthenticationConfiguration parameter value**

[%collapsible]
```json
"FormAuthenticationConfiguration": {
  "sensitive": false,
  "value": "{\"authenticationFailed\":{\"method\":\"CookieNotSet\",\"values\":[\"sid\"]},\"inputs\":[],\"formUrl\":\"https://somedomain.my.salesforce.com/apex/RedirectPage?siteType=lwr&basePath=/sitesDemo\",\"enableJavaScript\":true,\"forceLogin\":false,\"javaScriptLoadingDelayInMilliseconds\":1000,\"customLoginSequence\":{},\"postLoginSequence\":{\"name\":\"Handle maintenance page\",\"url\":\"*\",\"urlContainsValue\":\"maintenanceandavailable.jsp\",\"steps\":[{\"name\":\"Dismiss maintenance modal\",\"waitDelayInMilliseconds\":500,\"actions\":[{\"type\":\"click\",\"elementIdentifier\":{\"identifier\":\"maintenance-confirm-button\",\"type\":\"default\",\"findType\":\"classname\"}}]}]}}"
},
```

This post-login action configuration can be translated as the following instruction to the source crawler: "When a page whose URL contains `maintenanceandavailable.jsp` is encountered, wait 500 milliseconds, and then click the element with the class name `maintenance-confirm-button`."

For more details on the post-login action parameters, see [Configure an action](https://docs.coveo.com/en/3289#configure-an-action).

. Click **Save**.

#### "Crawling Module" subtab

If your source is a [Crawling Module source](https://docs.coveo.com/en/1612/), and if you have [more than one Crawling Module linked to this organization](https://docs.coveo.com/en/3271#deploying-multiple-crawling-module-instances), select the one with which you want to pair your source.
If you change the Crawling Module instance paired with your source, a successful [rebuild](https://docs.coveo.com/en/3390#refresh-rescan-or-rebuild-sources) is required for your change to apply.

#### "Identification" subtab

The **Identification** subtab contains general information about the source.

##### Name

The source name.
It can't be modified once it's saved.
##### Project

Use the **Project** selector to associate your source with one or more Coveo [projects](https://docs.coveo.com/en/n7ef0517/).

### "Items" tab

On the **Items** tab, you can specify how the source handles items based on their file type or content type.

#### File types

File types let you define how the source handles [items](https://docs.coveo.com/en/210/) based on their file extension or content type.
For each file type, you can specify whether to index the item content and [metadata](https://docs.coveo.com/en/218/), only the item metadata, or neither.

You should fine-tune the file type configurations with the objective of indexing only the content that's relevant to your users.

**Example**

Your repository contains `.pdf` files, but you don't want them to appear in search results.
You click **Extensions** and then, for the `.pdf` extension, you change the **Default action** and **Action on error** values to `Ignore item`.

For more details about this feature, see [File type handling](https://docs.coveo.com/en/l3qg9275/).

> **Tip**
>
> With [file type handling](https://docs.coveo.com/en/l3qg9275/), using the `Index metadata` default action on HTML items lets you index basic metadata for those items.
> On the other hand, web scraping is used to index custom metadata from the page content, which must first be retrieved.
> These metadata indexing mechanisms are complementary, and you can use them together within a source.
>
> If there are some items for which you only need to index basic metadata, make sure you don't have a web scraping rule that matches those items.
> This will prevent unnecessary processing and potential issues with retrieving protected page content.

#### Content and images

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option.
The extracted text is processed as item data, meaning that it's fully searchable and will appear in the item [Quick view](https://docs.coveo.com/en/2760#search-result-quick-view).

> **Note**
>
> When OCR is enabled, ensure the source's relevant [file type configurations](https://docs.coveo.com/en/l3qg9275/) index the item content.
> Indexing the item's metadata only or ignoring the item will prevent OCR from being applied.

See [Enable optical character recognition](https://docs.coveo.com/en/2937/) for details on this feature.

### "Content security" tab

Select who will be able to access the source items through a Coveo-powered [search interface](https://docs.coveo.com/en/2741/).
For details on the content security options, see [Content security](https://docs.coveo.com/en/1779/).

### "Access" tab

On the **Access** tab, specify whether each group (and API key, if applicable) in your [Coveo organization](https://docs.coveo.com/en/185/) can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration, while Group B can only view it.

For more information, see [Custom access level](https://docs.coveo.com/en/3151#custom-access-level).

### Build the source

. Finish adding or editing your source:

** When you're done editing the source and want to make your changes effective, click **Add and build source**/**Save and rebuild source**.
** When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click **Add source**/**Save**.
On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click **Launch build** or **Start required rebuild** when you're ready to make your changes effective and index your content.

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, follow the progress of your source addition or modification.

. Once the source is built or rebuilt, [review its content in the Content Browser](https://docs.coveo.com/en/2053/).

. Optionally, consider [editing or adding mappings](https://docs.coveo.com/en/1640/).

### Index metadata

To use [metadata](https://docs.coveo.com/en/218/) values in [search interface](https://docs.coveo.com/en/2741/) [facets](https://docs.coveo.com/en/198/) or result templates, the metadata must be [mapped](https://docs.coveo.com/en/217/) to [fields](https://docs.coveo.com/en/200/).
Coveo automatically [maps](https://docs.coveo.com/en/217/) only a subset of the metadata it extracts.
You must map any additional metadata to fields manually.

> **Note**
>
> Not clear on the purpose of indexing metadata?
> Watch [this video](https://www.youtube.com/watch?v=BmmmVJ3AWi0).

. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **View and map metadata** in the Action bar.

. Review the default [metadata](https://docs.coveo.com/en/218/) that your source is extracting from your content.

. Map any currently _not indexed_ metadata that you want to use in facets or result templates to fields.

.. Click the metadata and then, at the top right, click **Add to Index**.

.. In the **Apply a mapping on all item types of a source** panel, select the field you want to map the metadata to, or [add a new field](https://docs.coveo.com/en/1833#add-a-field) if none of the existing fields are appropriate.

> **Note**
>
> For advanced mapping configurations, like applying a mapping to a specific item type, see [Manage mappings](https://docs.coveo.com/en/1640#manage-mappings).

.. Click **Apply mapping**.

. Return to the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page.

. To reindex your source with your new mappings, click your source, and then click **More** > **Rebuild** in the Action bar.

. Once the source is rebuilt, review your item field values.
They should now include the values of the metadata you selected to index.

.. On the [**Sources**](https://platform.cloud.coveo.com/admin/#/orgid/content/sources/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/sources/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/sources/)) page, click your source, and then click **More** > **Open in Content Browser** in the Action bar.

.. Select the card of the item for which you want to inspect properties, and then click **Properties** in the Action bar.

.. In the panel that appears, select the **Fields** tab.

. If needed, extract and map additional metadata.

**More on custom metadata extraction**
To extract custom metadata, you can use the following methods:

* Extract [metadata included in the XML sitemap file](https://docs.coveo.com/en/2656/).
* Create [web scraping](https://docs.coveo.com/en/2767/) configurations that contain [metadata extraction rules](https://docs.coveo.com/en/mahe0350#web-scraping-configuration-editing-modes) using CSS or XPath selectors.
* Extract [JSON-LD `