Add or Edit a Web Source

Members of the Administrators and Content Managers built-in groups can use a Web source to make the content of a website searchable.

The Web source type behaves similarly to the bots of web search engines such as Google. The source only needs a starting URL; it then automatically discovers all the pages of the site by following the site navigation and the hyperlinks appearing in the pages. Consequently, only reachable pages are indexed (in a random order). By default, the source does not include pages that are not under the URL root.

By default, a Web source starts a rescan every day to retrieve item changes (addition, modification, or deletion) (see Edit a Source Schedule).
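The discovery behavior can be pictured as a breadth-first traversal of the site's link graph. The following is a conceptual sketch, not Coveo's actual implementation; it assumes the third-party requests and beautifulsoup4 packages:

  from urllib.parse import urljoin

  import requests
  from bs4 import BeautifulSoup

  def discover(start_url: str, max_depth: int = 100) -> set[str]:
      seen, frontier = {start_url}, [(start_url, 0)]
      while frontier:
          url, depth = frontier.pop(0)
          html = requests.get(url, timeout=60).text  # the content that gets indexed
          if depth >= max_depth:
              continue  # index this page, but do not follow its links
          for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
              link = urljoin(url, anchor["href"])
              # By default, pages that are not under the URL root are skipped.
              if link.startswith(start_url) and link not in seen:
                  seen.add(link)
                  frontier.append((link, depth + 1))
      return seen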

Source Features Summary

Features                    Supported                   Additional information
Web page version            N/A
Searchable content type     Web pages (complete)
Content update              Rescan, Rebuild             Refresh is not available (see Limitations).
Content security options    Secured, Private, Shared

Leading Practices

  • Plan your source names

    You cannot easily change source names once they are created, so before creating sources, draw up a list of planned sources and settle on a descriptive, concise, and consistent naming syntax (see Source Name).

  • If possible, create one source per website you want to make searchable, as this is the most stable and intuitive configuration. However, if you want to index many websites (i.e., more than 50) or if you have reached your source limit, consider creating sources that retrieve content from more than one website (see Content Limits). To optimize time and resource consumption, try to balance the size of your sources: a source may contain several websites with a few dozen pages each, or one or two larger websites. You can also leverage the Crawling limit rate parameter to increase the crawling speed for sites you own (see Crawling limit rate). Contact the Coveo Support team for assistance if needed.

  • Because refresh is not available for a Web source, ensure the rescan schedule is set at a frequency that is a good compromise between more recent search results and acceptable performance and resource impact (see Edit a Source Schedule).

  • When a connector exists for the technology powering the website, create a source based on that connector instead, as it typically indexes content, metadata, and permissions more effectively (see Available Connectors).

    For example, if you want to make the content of an Atlassian Confluence-powered site searchable, create a Confluence source, not a Web source.

Features

Notable Web source features are:

  • Supported authentication:

    • Basic authentication

    • Manual form authentication

    • Automatic form authentication

  • Available metadata

    For meta names containing colons (:), you must specify the origin explicitly in the mapping, since the colon is the delimiter for the origin of the metadata (see Mapping Rule Syntax Reference).

    og:title:crawler

    The out-of-the-box metadata for each item include:

    • description, keywords, author, Content-type, and every other meta tag included in the head of the page.

      The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

      For example, in the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.

    • Metadata prefixed with RequestHeader and ResponseHeader: all request and response headers in separate metadata.

    • coveo_AllRequestHeaders and coveo_AllResponseHeaders: all headers as a JSON dictionary.
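    The meta tag rule above can be illustrated with a short extraction sketch using Python's standard html.parser (an illustration, not the Coveo Cloud crawler's implementation):

      from html.parser import HTMLParser

      RECOGNIZED_KEYS = ("name", "property", "itemprop", "http-equiv")

      class MetaExtractor(HTMLParser):
          def __init__(self):
              super().__init__()
              self.metadata = {}

          def handle_starttag(self, tag, attrs):
              if tag != "meta":
                  return
              attrs = dict(attrs)
              for key in RECOGNIZED_KEYS:
                  # The content attribute is kept only when the tag is keyed
                  # with one of the four recognized attributes.
                  if key in attrs and "content" in attrs:
                      self.metadata[attrs[key]] = attrs["content"]

      parser = MetaExtractor()
      parser.feed('<head><meta property="og:title" content="The Article Title"/></head>')
      print(parser.metadata)  # {'og:title': 'The Article Title'}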
  • Web scraping

    Using the web scraping feature, you can exclude sections of a page, extract metadata from the page, and even create separate index items from specific sections of a single web page (see Web Scraping Configuration).

  • JavaScript support

    The crawler can run the underlying JavaScript in website pages to dynamically render content to index.

  • Robots.txt crawl-delay and page restrictions support

    The source does not support other robots.txt parameters, such as visit-time and request-rate.

Limitations

  • Refresh is not available.

  • Indexing page permissions, if any, is not supported.

  • Only pages reachable through website page links are indexed.

    The Sitemap source may be a better solution when the website features a sitemap file (see Add or Edit a Sitemap Source).

  • Although the MaxPageSizeInBytes parameter is set to 0 (unlimited size) by default in the source JSON configuration, the Coveo Cloud indexing pipeline can only handle web pages up to 512 MB (see Edit a Source JSON Configuration). Larger pages are indexed by reference, i.e., their content is ignored by the Coveo Cloud crawler, and only their metadata and path are searchable (see Indexing by Reference). As a result, no Quick View is available for these larger items (see Search Result Quick View).

  • Crawling performance depends heavily on the responding web server.

  • Pause and resume source operations are not yet supported (see Resume a Paused Source Update). Thus, Web source operations cannot be paused on error.

Add or Edit a Web Source

  1. If not already in the Add/Edit a Web Source panel, go to the panel:

    • To add a source, in the main menu, under Content, select Sources > Add Source button > Web.

      OR

    • To edit a source, in the main menu, under Content, select Sources > source row > Edit in the Action bar.

  2. In the Configuration tab, enter appropriate values for the available parameters:

    • Source name

      A descriptive name for your source under 255 characters (not already in use for another source in the organization).

      Once a source is created, you cannot change its name. To achieve the same result as a name change, you would have to create a new source with a similar configuration and give it the desired name, and then delete the original source.

      Take the time to plan and pick good source names.

      This name appears in the list of sources in this administration console, but may also be used by developers in search page JavaScript code, for example, to specify the scope of a search interface.

      Use a short and descriptive name, using letters, numbers, - and _ characters, and avoid spaces and other special characters.

    • Site URL

      The URL of a starting website page, typically a home page, from which the crawler starts discovering the website following links found in pages.

      You can enter more than one starting website page, for example, to allow the crawler to see links leading to all the website pages that you want to index.

      Avoid crawling more than one site in a given source. Instead, create one source for each website. This way, you can optimize source parameters for each website.

      If you want to index only one or a few specific pages of a site, such as for a test, enter the pages to index in the Site URL box, and then in the Content to Include section set the Maximum depth parameter value to 0, instructing the crawler to only index the specified pages, and none of the linked pages.

    • User agent

      The user agent string that you want Coveo Cloud to send with HTTP requests to identify itself when downloading pages (see User agent).

      The default is: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+https://www.coveo.com/bot.html).

    • Paired Crawling Module

      If your source is a Crawling Module source and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source (see Deploying Multiple Crawling Modules). If you change the Crawling Module instance with which your source is paired, you must rebuild your source.

    • Optical character recognition (OCR)

      Check this box if you want Coveo Cloud to extract text from image files or PDF files containing images (see Enable Optical Character Recognition). OCR-extracted text is processed as item data, meaning that it is fully searchable and will appear in the item Quick View (see Search Result Quick View).

      Since the OCR feature is available at an extra charge, you must first contact Coveo Sales to add this feature to your organization license. You can then enable it for your source.

    • Index

      When adding a source, if you have more than one logical (non-Elasticsearch) index in your organization, select the index in which the retrieved content will be stored (see Leverage Many Coveo Indexes). If your organization only has one index, this drop-down menu is not visible and you have no decision to make.

      • To add a source storing content in an index other than the default, you need the View access level on the Logical Index domain (see Privilege Management and Logical Indexes Domain).

      • Once the source is added, you cannot switch to a different index.

    • Content security

      Select a content security option to determine who can see items from this source in a search interface.

  3. In the Authentication section, when the website you want to make searchable uses one of the supported authentication types to secure access to its content, expand the Authentication section to configure the source credentials allowing your Coveo Cloud organization to access the secured content (see Supported Authentication Schemes and Source Credentials Leading Practices).

    The Web source type supports the following authentication types:

    • Basic authentication

      (Only when indexing HTTPS URLs) Select this option when the desired website uses the standard basic authentication scheme (see Basic access authentication).

    • Manual form authentication

      Select this option when the desired website presents users with a form to fill in to log in. You must specify the form input names and values.

    • Automatic form authentication

      Select this option when the desired website presents users with a form to fill in to log in. This is a simplified version of the Manual form authentication. A web automation framework supporting JavaScript rendering executes the authentication process by autodetecting the form inputs.

    Depending on the selected authentication method, configure the appropriate parameters (a credential verification sketch follows this list):

    • When selecting Basic authentication:

      When the Coveo Cloud crawler follows links requiring basic authentication while indexing your website, it only provides the basic authentication credentials you entered if the link belongs to the same scheme, domain, or subdomain as the starting Site URL. Conversely, if the link does not belong to one of these, the Coveo Cloud crawler does not try to authenticate. If you want the Coveo Cloud crawler to authenticate and index a site from a different scheme, domain, and/or subdomain, you must include its address under Site URL.

      Your starting address is https://www.example.com. The Coveo Cloud crawler does not send the basic authentication credentials you entered if the link requiring them belongs to:

      • A different scheme, i.e., it uses HTTP instead of HTTPS
      • A different domain, such as https://www.mysite.com
      • A different subdomain, such as https://www.intranet.example.com

      Since you want your basic authentication credentials to be provided when the Coveo Cloud crawler follows a link starting with https://www.intranet.example.com, you enter this URL under Site URL.

      1. In the Username box, enter the source credentials username as you would when you log in to the website you are making searchable.

      2. In the Password box, enter the corresponding password.

    • When selecting Manual form authentication:

      1. In the Form URL box, enter the website login page URL.

      2. (Optional) When there is more than one form on the login page, enter the Form name.

      3. Click the Action method drop-down menu, and then select the HTTP method to use to submit the authentication request. Available options are POST or GET.

      4. Click the Content type drop-down menu, and then select the content-type headers for HTTP POST requests used by the form authentication process.

      5. (Optional) When the authentication request should be sent to another URL than the specified Form URL, enter this other URL under Action URL. Otherwise, leave empty.

      6. Inspect the form login page HTML code to locate the <input name='abc' type='text' /> element corresponding to each parameter, and then enter the input name attribute values under Username input name and Password input name.

        Based on the following HTML code:

        <input name="login" type="email" />

        <input name="pwd" type="password" />

        login is the username input name and pwd is the password input name.

      7. Under Username input value and Password input value, enter respectively the username and password parameter values.

      8. When your form uses other parameters than username and password, it is recommended to select the Autodetect hidden inputs check box. Otherwise, you must enter the Input name and Input value for each parameter, as with the username and password.

        Under Other inputs, input values are displayed in clear text. You must therefore make sure to enter your sensitive information, i.e., username and password, above the Other inputs section, in the Username input name, Password input name, Username input value, and Password input value boxes (see steps 6 and 7).


      9. Under Confirmation method, select the method to use to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text.

        Depending on the selected confirmation method, enter the appropriate value:

        • When selecting Redirection to, enter the URL where you redirect users when the login fails.

          https://mycompany.com/login/failed.html

        • When selecting Missing cookie, in the Value input, enter the name of the cookie that is set when an authentication is successful.

          ASP.NET_SessionId

        • When selecting Missing text, in the Value input, enter a string that the website shows only to authenticated users.

          • Hello, jsmith@mycompany.com!
          • Log out
        • When selecting Text, in the Value input, enter a string that appears on the page when a login fails.

          • An error has occurred.
          • Your username or password is invalid.
    • When selecting Automatic form authentication:

      1. In the Form URL box, enter the website login page URL.

      2. In the Username and Password boxes, enter the credentials to use to log in.

      3. Under Confirmation method, select the method to use to determine whether the authentication request failed: Redirection to, Missing cookie, Missing text, or Text.

        Depending on the selected confirmation method, enter the appropriate value:

        • When selecting Redirection to, enter the URL where you redirect users when the login fails.

          https://mycompany.com/login/failed.html

        • When selecting Missing cookie, in the Value input, enter the name of the cookie that is set when an authentication is successful.

          ASP.NET_SessionId

        • When selecting Missing text, in the Value input, enter a string that the website shows only to authenticated users.

          • Hello, jsmith@mycompany.com!
          • Log out
        • When selecting Text, in the Value input, enter a string to display on the page when a login fails.

          • An error has occurred.

          • Your username or password is invalid.

      4. If the website relies heavily on asynchronous requests, under Loading delay, enter the maximum delay in milliseconds to wait for the JavaScript to execute on each web page.

      5. (Optional) If your automatic form authentication configuration does not work, or if the web page requires specific actions during the login process, you might have to enter a Custom login sequence configuration. Contact the Coveo Support team for assistance.
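    Before saving the source, you can verify outside of Coveo that the credentials and confirmation values you are about to enter actually work. The following is a minimal sketch using the third-party requests package; the URLs, input names, credentials, and confirmation values are the examples from the steps above, not real endpoints:

      import requests

      # --- Basic authentication: a 200 status code confirms the credentials.
      r = requests.get("https://www.example.com/protected.html",
                       auth=("jsmith", "s3cret"), timeout=60)
      print(r.status_code)

      # --- Form authentication: POST the same inputs the source will send.
      session = requests.Session()
      resp = session.post(
          "https://www.example.com/login",        # Form URL (or Action URL, if different)
          data={"login": "jsmith@mycompany.com",  # Username input name and value
                "pwd": "s3cret"},                 # Password input name and value
          timeout=60,
      )

      # Each confirmation method maps to one check; any True means the login failed.
      failed = any([
          resp.url == "https://mycompany.com/login/failed.html",  # Redirection to
          "ASP.NET_SessionId" not in session.cookies,             # Missing cookie
          "Log out" not in resp.text,                             # Missing text
          "Your username or password is invalid." in resp.text,   # Text
      ])
      print("login failed:", failed)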

  4. Expand the Content to Include section and consider changing the default value of any of the following parameters to fine-tune how web pages included in this source are crawled.

    • Maximum depth

      Enter the maximum number of link levels followed below the Site URL root page to include in the source.

      Value   Crawling depth
      0       Home page content is included in the source, but not the content of its linked pages.
      1       Content of the home page and its linked pages is included in the source, but not the content of subpage links.
      100     (Default) Content up to the 100th link level is included in the source.
    • Pages to include

      Select how the web source follows links to web pages from external domains and includes them:

      • Exclude external pages (default)

        The typical desired behavior where linked web pages that are not part of the domain of the specified Site URL are not included.

        Your Site URL is http://www.mycompany.com and one of its pages contains a link to an https://en.wikipedia.org/ page. The linked Wikipedia page is not crawled and included in your searchable content.

      • Include external pages, but not their subpages

        Linked web pages that are not part of the domain of the specified Site URL are indexed, but not any pages linked within those external pages.

        Subdomains are considered external pages, meaning that some items may not be indexed if they are located in subdomains.

      • Include external pages and their subpages

        Pages linked in included external pages are also included.

        Including pages linked in external pages can lead anywhere on the Internet, infinitely discovering linked pages on other websites.

        If you do select this option, you should add one or more filters, preferably Inclusion filters, to restrict the discovery to identifiable sites.

    • Inclusion filters

      Enter a filter to apply, and then indicate whether the filter is a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are included.

      • You can test your regexes to ensure they match the desired URLs with tools such as Regex101.

      • You can customize regexes to meet your use case focusing on aspects such as:

        • Case insensitivity

        • Capturing groups

        • Trailing slash inclusion

        • File extension

        You want to index HTML pages on your company staging and dev websites without taking the case sensitivity or the trailing slash (/) into account, so you use the following regex:

        (?i)^.*(company-(dev|staging)).*html.?$

        The regex matches the following URLs:

        • http://company-dev/important/document.html/
        • http://ComPanY-DeV/important/document.html/ (because of (?i), the case insensitivity flag)
        • http://company-dev/important/document.html (with or without a trailing / because of .?)
        • http://company-staging/important/document.html/ (because of dev|staging)

        but does not match the following ones:

        • http://besttech-dev/important/document.html/ (besttech is not included in the regex)
        • http://company-dev/important/document.pdf/ (only html files are included)
        • http://company-prod/important/document.html/ (prod is not included in the regex)
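        Besides tools such as Regex101, you can also sanity-check a filter against sample URLs locally. The following minimal sketch uses Python's re module with the example regex and URLs above:

          import re

          # The inclusion filter regex from the example above.
          pattern = re.compile(r"(?i)^.*(company-(dev|staging)).*html.?$")

          urls = [
              "http://company-dev/important/document.html/",      # included
              "http://ComPanY-DeV/important/document.html/",      # included
              "http://company-dev/important/document.html",       # included
              "http://company-staging/important/document.html/",  # included
              "http://besttech-dev/important/document.html/",     # excluded
              "http://company-dev/important/document.pdf/",       # excluded
              "http://company-prod/important/document.html/",     # excluded
          ]
          for url in urls:
              print(url, "->", "included" if pattern.match(url) else "excluded")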

      When you specify an inclusion filter, the page specified in the Site URL box must be part of the inclusion filter scope; otherwise, no items are indexed because the starting page is excluded and the crawling process stops. If the Site URL redirects to another URL, both URLs must be part of the inclusion filter scope.

      The www.mycompany.com website you crawl contains versions in several languages and you want to have one source per language. For the US English source, your parameter values could be as shown in the following table.

      Parameter           Value
      Site URL            www.mycompany.com/en-us/welcome.html
      Inclusion filters   www.mycompany.com/en-us/*
    • Exclusion filters

      Enter a filter to apply, and then indicate whether the filter is a Wildcard or a Regex (regular expression) pattern. Pages matching the specified URL expression are ignored.

      • Exclusion filters also apply to shortened and redirected URLs.

      • Ensure the Site URL you specified is not excluded by one of your exclusion filters.

      • By default, if pages are only accessible via excluded pages, those pages will also be excluded.

        You can still include pages that are only referenced in excluded pages by setting the ExpandBeforeFiltering hidden parameter to true in the source JSON configuration (see Add a Hidden Source Parameter). However, setting the parameter to true can significantly reduce the crawling speed since the crawler fetches many pages that can be rejected in the end.

        "ExpandBeforeFiltering": {
          "sensitivity": false,
          "value": "true"
        }
        

        sensitivity is a Boolean parameter.

      • There is no point in indexing the search page of your website, so you exclude its URL:

        www.mycompany.com/en-us/search.html

      • You do not want to index ZIP files that are linked from website pages:

        www.mycompany.com/en-us/*.zip

    • Query parameters to ignore

      Enter query string parameters that the source should ignore when determining whether a URL corresponds to a distinct item.

      By default, the source considers the whole URL when determining whether a page is a distinct item. The URLs of the website you index can contain one or more query parameters after the hostname and path. Some parameters change the content of the page, and therefore legitimately identify a distinct URL. Others do not affect the content; when such parameters are not entered here to be ignored, the source may include the same page more than once, creating search result duplicates.

      The URL of a website page for which you get search result duplicates looks as follows:

      http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123

      The values of param1 and param2 can change for the /v1/getitdone.html page without affecting its content while the lang value changes the language in which the page appears. You want to ignore the param1 and param2 query parameters to eliminate search result duplicates, not lang. You enter one parameter name per line:

      param1

      param2
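      The resulting deduplication logic can be pictured with the following minimal sketch (an illustration, not Coveo's implementation), which canonicalizes the example URL above by dropping the ignored parameters:

        from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

        IGNORED = {"param1", "param2"}  # the query parameters to ignore, as above

        def canonical(url: str) -> str:
            parts = urlsplit(url)
            kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
            return urlunsplit(parts._replace(query=urlencode(sorted(kept))))

        # Both URLs resolve to the same item once param1 and param2 are ignored,
        # while a different lang value would still produce a distinct item.
        a = canonical("http://www.mysite.com/v1/getitdone.html?lang=en&param1=abc&param2=123")
        b = canonical("http://www.mysite.com/v1/getitdone.html?lang=en&param1=xyz&param2=456")
        print(a == b)  # True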

    • Additional content

      Select the JavaScript-rendered check box only when some website content you want to include is dynamically rendered by JavaScript. By default, the web source does not execute the JavaScript code in crawled website pages.

      Selecting the JavaScript-rendered check box may significantly increase the time needed to crawl pages.

      When the JavaScript takes longer than normal to execute or makes asynchronous calls, consider increasing the Loading delay parameter value to ensure the pages with the longest rendering times are indexed with all their rendered content. Enter the time in milliseconds allowed for dynamic content to be retrieved before the content is indexed. When the value is 0 (default), the crawler does not wait after the page is loaded.
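      To estimate how long your pages take to render, and thus pick an appropriate Loading delay, you can reproduce the wait with a headless browser. The following is a minimal sketch using the third-party playwright package; the URL and the 2000 ms value are placeholders:

        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("https://www.example.com")  # placeholder URL
            page.wait_for_timeout(2000)           # a 2000 ms loading delay
            html = page.content()                 # the fully rendered HTML that would be indexed
            browser.close()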

  5. In the Crawling Settings section, specify how you want the Coveo Cloud crawler to behave when going through the desired websites.

    • Crawling limit rate

      The number of milliseconds between each request sent to retrieve content from a specified domain. The default value is 1 request per 1000 milliseconds, which is the highest rate at which Coveo Cloud can crawl a public website.

      If you want to increase the crawling speed for a site you own, for instance if you need to retrieve the content of a large website, enter a lower number. For this crawling speed to apply, however, Coveo must verify that you are the owner of the site.

      • If your source is of the Cloud type, create an empty text file named coveo-ownership-orgid.txt, replacing orgid with your Coveo Cloud organization ID (see Organization ID). Then, upload this file at the root of the website you want to index. Changing the default number of milliseconds between each request has no effect if you do not also provide the expected text file proving your ownership.

      • If your source retrieves the content of an internal website via the Coveo On-Premises Crawling Module, the specified crawling rate applies automatically, as Coveo detects that the crawled site has a private IP address (see Coveo On-Premises Crawling Module, Content Retrieval Methods, and Private IPv4 Addresses). You therefore do not need to do anything to prove ownership.

      If your site has robots.txt directives that include a Crawl-delay parameter with a different value, the slower crawling speed applies (see Crawl-delay Directive). See also the Respect robots.txt directives option.
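      Once you have uploaded the ownership file, you can confirm that it is reachable at the site root. A minimal sketch (the domain and organization ID are placeholders):

        import requests

        org_id = "myorganizationid"  # placeholder: your Coveo Cloud organization ID
        resp = requests.get(f"https://www.example.com/coveo-ownership-{org_id}.txt",
                            timeout=10)
        print(resp.status_code)  # 200 means the file is reachable at the site root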

    • Respect URL casing

      Deselect this check box when web page URLs that you include are not case sensitive, meaning one unique page can be accessed with different URL casings.

      When web page URLs are not case sensitive (Respect URL casing check box deselected), inclusion and exclusion filters are also not case sensitive.

      The file system of the website you index is case insensitive, and the page URLs typically use camel casing. However, some links to these pages (which the source follows when crawling) use all-lowercase URLs. When the Respect URL casing check box is selected (the default), you get page duplicates in your source, one copy for each URL casing variant found. Deselect the Respect URL casing option to eliminate duplicates.

    • Respect robots.txt directives

      Deselect this check box only when you want the crawler to bypass restrictions specified in the website robots.txt file (see Robots exclusion standard).

    • Respect noindex directives

      Deselect this check box if you want the Coveo Cloud crawler to index pages that have a noindex directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

    • Respect nofollow directives

      Deselect this check box if you want the Coveo Cloud crawler to follow links in pages that have a nofollow directive in their meta tag or in their X-Robots-Tag HTTP response header (see Noindex and Nofollow Directives).

    • Respect nofollow anchors

      Deselect this check box if you want the Coveo Cloud crawler to follow links that have a rel="nofollow" attribute (see Noindex and Nofollow Directives).
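      A quick way to check whether a given page carries the directives honored by the three options above is to fetch it and inspect both locations. A crude sketch (the URL is a placeholder, and the meta check is deliberately simplistic):

        import requests

        resp = requests.get("https://www.example.com/page.html", timeout=60)
        in_header = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
        in_meta = "robots" in resp.text.lower() and "noindex" in resp.text.lower()
        print("noindex directive present:", in_header or in_meta)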

    • Request retry count

      Enter the maximum number of retries for a URL if an error is encountered. The default is 3. When the value is 0, there are no retries.

    • Request retry delay

      Enter the minimum delay in milliseconds to wait between a failed HTTP request and the next retry. The default is 1000 milliseconds.

    • Request timeout

      Enter the web request timeout value in seconds. The default is 60 seconds. When the value is 0, there is no timeout.

      Consider increasing the value if a target website sometimes responds slowly. This prevents avoidable errors.
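      Conceptually, the three request settings above combine as in the following sketch (an illustration of the semantics, not Coveo's implementation):

        import time

        import requests

        RETRY_COUNT = 3    # Request retry count
        RETRY_DELAY = 1.0  # Request retry delay (1000 ms)
        TIMEOUT = 60       # Request timeout (seconds)

        def fetch(url: str) -> requests.Response:
            for attempt in range(1 + RETRY_COUNT):  # one try plus up to three retries
                try:
                    return requests.get(url, timeout=TIMEOUT)
                except requests.RequestException:
                    if attempt == RETRY_COUNT:
                        raise  # retries exhausted
                    time.sleep(RETRY_DELAY)  # wait at least this long before retrying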

  6. In the Web scraping section, in the JSON configuration box, enter a custom JSON configuration to precisely include page sections or extract metadata from the website pages (see Web Scraping Configuration).

  7. In the Access tab, determine whether each group and API key can view or edit the source configuration (see Understanding Resource Access):
    1. In the Access Level column, select View or Edit for each available group.
    2. On the left-hand side of the tab, if available, click Groups or API Keys to switch lists.

    If you remove the Edit access level from all the groups of which you are a member, you will not be able to edit the source again after saving. Only administrators and members of other groups that have Edit access on this resource will be able to do so. To keep your ability to edit this resource, you must grant the Edit access level to at least one of your groups.

  8. Optionally, consider editing or adding mappings (see Adding and Managing Source Mappings).

    You can only manage mapping rules once you build the source (see Add or Edit a Source).

  9. Complete your source addition or edit:

    • Click Add Source/Save when you want to save your source configuration changes without starting a build/rebuild, such as when you plan to make other changes soon.

      On the Sources page, you must click Start initial build or Start required rebuild in the source Status column to add the source content or make your changes effective, respectively.

      OR

    • Click Add and Build Source/Save and Rebuild Source when you are done editing the source and want to make changes effective.

      Back on the Sources page, you can review the progress of your Web source addition or modification (see Adding and Managing Sources).

    Once the source is built or rebuilt, you can review its content in the Content Browser (see Inspect Items With the Content Browser).

What’s Next?