Adding a Web Source

You can add the content of a website to your Coveo organization. Coveo indexes the website pages to make them searchable, either by you alone when you make the source private, or by all members of the Coveo organization when you share the source.

Web Crawler source limitations:

  • Doesn’t support incremental refresh. A source full refresh is required to update the source content (see Modify a Source Schedule).

  • Doesn’t include website content dynamically rendered by JavaScript, meaning the source doesn’t execute the JavaScript code in crawled website pages.

Source Features Summary

Features                                        Supported   Additional information
Web page version                                N/A
Searchable content type                                     Web pages (complete)
Content update            Incremental refresh
                          Full refresh (1)
                          Rebuild
Permission types          Secured
                          Private
                          Shared

1: A full refresh doesn’t immediately remove deleted web pages from the index. It removes a page only once the page has returned a 404 error three times in a row. A rebuild, on the other hand, eliminates deleted web pages from the index.
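
The removal rule in note 1 can be pictured with a small sketch. This is not Coveo code: `PageState`, `record_refresh`, and `REMOVAL_THRESHOLD` are hypothetical names, and the sketch assumes that any response other than 404 resets the count, which is one reading of "three times in a row".

```python
from dataclasses import dataclass

# Number of consecutive 404 responses before removal, per note 1.
REMOVAL_THRESHOLD = 3


@dataclass
class PageState:
    """Hypothetical per-page bookkeeping, not a Coveo data structure."""
    consecutive_404s: int = 0
    indexed: bool = True


def record_refresh(state: PageState, status_code: int) -> PageState:
    """Update a page's state after one full refresh has requested it."""
    if status_code == 404:
        state.consecutive_404s += 1
        if state.consecutive_404s >= REMOVAL_THRESHOLD:
            state.indexed = False
    else:
        # Assumption: any other response breaks the streak.
        state.consecutive_404s = 0
    return state
```

Under these assumptions, a page that returns 404, 404, 200 stays in the index with its count reset, and only three 404 responses in consecutive full refreshes remove it.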

Add a Web source

To edit a Web source, see Edit the Source Configuration to Re-index its Content or Re-authorize the Access, and then follow the steps below, starting from step 4.

  1. If not already done, log in to your Coveo organization.

  2. In the navigation bar on the left, under Search Content, select Sources, and then click Add Source.

  3. On the Add Source page, click Web.

    When you create a source, you become the owner of the source.

  4. In the Add/Edit a Web Source dialog box:

    1. In the Source Name box, enter a name of your choice that describes the content of the source.

    2. In the URLs box, enter one or more website addresses that you want to make searchable, including the protocol (http:// or https://), and then press the Enter key or click Add.

      • To add a URL, click Add.

      • To remove a URL, click Delete.

        Create one source per website. If you choose to include more than one address, ensure that all parameters are applicable to all addresses specified in the URLs box.

    3. In the Inclusion Filters box, enter one or more filter rules that use wildcards to define patterns for including referenced web pages located outside of the starting address(es).

      • To add a filter rule, click Add.

        Regular Expressions (regex) are NOT supported in inclusion filter rules.

        • The starting address is http://www.MyCompany.com. This website links to some pages on the related career website (http://career.MyCompany.com) that you also want to index. You can add the following inclusion filter pattern to index the job opportunities as well:

          http://career.MyCompany.com/jobs/*

        • Your business website also links to several related websites, one for each of your product lines. You want to index the products on each of these sites, but not the entire sites, so you could use a filter like the following:

          */products*

      • To remove a filter rule, click Delete.
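
To preview which addresses a set of inclusion filters would accept before starting a crawl, a minimal sketch such as the following may help. It is not Coveo's matching code. It assumes that `*` matches any run of characters (including none), that matching is case-sensitive, and that pages under a starting address are always in scope. The helper names `wildcard_to_regex` and `is_included` are hypothetical.

```python
import re
from typing import Iterable


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a filter where '*' is the only wildcard into a regex."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def is_included(url: str,
                starting_addresses: Iterable[str],
                inclusion_filters: Iterable[str]) -> bool:
    """Return True if the URL is under a starting address or matches a filter."""
    if any(url.startswith(start) for start in starting_addresses):
        return True
    return any(wildcard_to_regex(f).fullmatch(url) for f in inclusion_filters)


starts = ["http://www.MyCompany.com"]
filters = ["http://career.MyCompany.com/jobs/*", "*/products*"]

print(is_included("http://career.MyCompany.com/jobs/1234", starts, filters))
print(is_included("http://career.MyCompany.com/about", starts, filters))
```

With the two example filters above, the job posting address is accepted and the career site's other pages are not.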

    4. Review whether you need to change the default value of the Skip addresses with parameters (domain.com?parameters) option.

      Select this check box to prevent the source from indexing pages whose addresses contain a query part that can return similar content. This avoids indexing duplicate pages and saves disk space. Clear this check box when the same address with different parameters returns different content. This option is cleared by default.
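
As an illustration of what "addresses with parameters" means, the sketch below treats an address as having parameters when its query part (the text after `?`) is not empty. The function names are hypothetical and the logic is an assumption about the option's intent, not a description of how the source is implemented.

```python
from urllib.parse import urlsplit


def has_query_parameters(url: str) -> bool:
    """Return True when the URL carries a non-empty query part."""
    return bool(urlsplit(url).query)


def should_skip(url: str, skip_addresses_with_parameters: bool) -> bool:
    """Skip a URL only if the option is selected and the URL has parameters."""
    return skip_addresses_with_parameters and has_query_parameters(url)


print(should_skip("http://www.MyCompany.com/page?sort=asc", True))
print(should_skip("http://www.MyCompany.com/page", True))
```

Running a list of known site addresses through a check like this can show how many pages would be left out before you select the check box.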

    5. In the Exclusion Filters box, enter one or more filter rules using wildcards to define filtering patterns to prevent indexing one or more subsections under the starting address(es).

      • To add a filter rule, click Add.

        Regular Expressions (regex) are NOT supported in exclusion filter rules.

        • The starting address of a secured website source containing human resources content is https://corp.MyCompany.com/dfs/dept/HR. You can use the following exclusion filter to prevent indexing items related to retired employees, which are all under the same folder:

          https://corp.MyCompany.com/dfs/dept/HR/employees/retired/*

        • Your website has several archives with old and obsolete information that you don’t want to index. You know the URL will always contain /obs, so you can use the following exclusion filter with two wildcards:

          */obs*

      • To remove a filter rule, click Delete.
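
A broad pattern such as */obs* can exclude more than intended. The sketch below uses Python's `fnmatch.fnmatchcase` as a stand-in for the filter matching. This is an assumption: `fnmatchcase` also gives `?` and `[...]` a special meaning, which may differ from how the source treats those characters. `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Iterable

EXCLUSION_FILTERS = [
    "https://corp.MyCompany.com/dfs/dept/HR/employees/retired/*",
    "*/obs*",
]


def is_excluded(url: str, exclusion_filters: Iterable[str]) -> bool:
    """Return True when the URL matches at least one exclusion filter."""
    return any(fnmatchcase(url, pattern) for pattern in exclusion_filters)


candidates = [
    "http://www.MyCompany.com/obs/2009/news.html",
    "http://www.MyCompany.com/observatory/index.html",
    "https://corp.MyCompany.com/dfs/dept/HR/employees/active/jdoe.html",
]
for url in candidates:
    print(url, is_excluded(url, EXCLUSION_FILTERS))
```

Under these assumptions, the /observatory/ page is excluded along with the /obs/ archive, because */obs* matches any address containing /obs. A narrower pattern such as */obs/* would avoid that, if every archive address has a slash after obs.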

    6. In the User Agent box, enter the name used by the Web source to identify itself to the website when downloading pages. Leave empty to use the default value (CoveoEnterpriseSearch).
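
If you want to check how your website responds to a given user agent (for example, whether a firewall or server rule blocks it), you can build a request that carries the same header. The sketch below only constructs the request and sends nothing. `build_request` is a hypothetical helper, and the empty-string fallback mirrors the "leave empty to use the default value" behavior described above.

```python
import urllib.request

# Default value named in the User Agent step.
DEFAULT_USER_AGENT = "CoveoEnterpriseSearch"


def build_request(url: str, user_agent: str = "") -> urllib.request.Request:
    """Build a request whose User-Agent header falls back to the default."""
    agent = user_agent or DEFAULT_USER_AGENT
    return urllib.request.Request(url, headers={"User-Agent": agent})


request = build_request("http://www.MyCompany.com")
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would send it, which lets you compare the response with the one a regular browser receives.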

    7. In the Security drop-down menu, select whether you want the website content to be Shared or Private (see Source Permission Types).

    8. If your website is secured with the basic authentication type, click Basic Authentication and then enter the Username and Password of an account that can access all the content that you want to index.

      If you configured the source to be Shared within your Coveo organization, users can see in search results all the items that the indexing account can access. Exclusion Filters are a good way to prevent the disclosure of potentially sensitive content (see step 4, substep 5).

      The following authentication methods aren’t supported:

      • Cookies

      • Session IDs

      • Forms (Login page)
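
For context on the Basic Authentication option in step 8, HTTP basic authentication sends the username and password in an Authorization header, Base64-encoded as `username:password`. The sketch below shows that encoding with the example credentials from the HTTP specification. `basic_auth_header` is a hypothetical helper, not part of Coveo.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("Aladdin", "open sesame"))
# prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```

Base64 is an encoding, not encryption, so basic authentication is only as private as the connection it travels over. This is one reason to prefer an https:// starting address for a secured website.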

    9. Click Start Indexing (or Refresh Index when editing the source).

  5. Back on the Sources page, you can review the progress of your Web source addition (see Review the State of Sources Available to You).

What’s Next?

What's Next for Me?