Add a File System source

A File System source allows members with the required privileges to retrieve and make searchable the content of files shared over a network via the Coveo Crawling Module.

Example

Your company has a shared network drive on which letter, PowerPoint presentation, and email signature templates are available to all employees. You decide to index the whole drive to make its content searchable via your Coveo-powered search page.

When you have the required privileges, you can add files shared over a network to a Coveo organization.

Tip
Leading practice

The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.

Source key characteristics

The following table presents the key characteristics of a File System source.

Features | Supported | Additional information
Windows Server version | 2022, 2019, and 2016 |
Content update operations: refresh | No |
Content update operations: rescan | Yes | Takes place every day by default
Content update operations: rebuild | Yes |
Content security options: Same users and groups as in your content system | Yes |
Content security options: Specific users and groups | Yes |
Content security options: Everyone | Yes |

Add a File System source

Before you start, ensure that the content to index and make searchable is shared over a network.

Also ensure that the Coveo Crawling Module is installed on a server that has access to the file system whose content you want to retrieve.

Follow the instructions below to add a File System source.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.

  2. In the Add a source of content panel, click the File System source tile.

  3. Configure your source.

Tip
Leading practice

It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually.

See About non-production organizations for more information and best practices regarding sandbox organizations.

"Configuration" tab

In the Add a File System Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

If you haven’t already installed the Coveo Crawling Module on a server that has access to the file system whose content you want to retrieve, click Download Crawling Module to do so.

General information

Source name

Enter a name for your source.

Tip
Leading practice

A source name can’t be modified once it’s saved. Choose a short, descriptive name that uses only letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.
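The naming advice above can be checked before you save a source. The following is a minimal sketch, not a Coveo tool: the function name is hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# Assumption: "letters" means ASCII letters; adjust the pattern if your policy differs.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_recommended_source_name(name: str) -> bool:
    """Return True when the name uses only letters, digits, hyphens, and underscores."""
    return _SAFE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("shared-drive_templates", "Shared Drive (templates)"):
        print(candidate, is_recommended_source_name(candidate))
```

A name with spaces or parentheses fails the check, which matches the recommendation to avoid them.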

Path

Enter the network path to a file system folder or a file.

Examples
  • For content located on the server where the Crawling Module is installed: C:\Users\adminuser\Documents

  • For content located on a server other than the one where the Crawling Module is installed, but on the same network: file:\\my-server\Users\adminuser\Documents

To exclude certain folders or files, first configure and save your source with a broad path. Then, see Refine the content to index.
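To double-check which kind of path you're about to enter, you can classify it with a short script. This is only an illustrative sketch: the describe_path helper is hypothetical, and it assumes that a leading file: prefix can be dropped before the path is parsed.

```python
from pathlib import PureWindowsPath


def describe_path(raw: str) -> str:
    """Classify a Windows path as a local drive, a network share, or unknown."""
    # Assumption: a leading "file:" prefix can be dropped before parsing.
    text = raw[5:] if raw.lower().startswith("file:") else raw
    drive = PureWindowsPath(text).drive
    if drive.startswith("\\\\"):
        # A UNC drive looks like \\server\share; the server is the third piece.
        server = drive.split("\\")[2]
        return f"network share on {server}"
    if drive.endswith(":"):
        return f"local drive {drive}"
    return "unknown"


if __name__ == "__main__":
    print(describe_path(r"C:\Users\adminuser\Documents"))
    print(describe_path(r"file:\\my-server\Users\adminuser\Documents"))
```

PureWindowsPath only parses the string, so the script runs on any operating system and never touches the file system.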

Paired Crawling Module

If more than one Crawling Module is linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

Optical Character Recognition (OCR)

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option.

The extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick view. See Enable optical character recognition for details on this feature.

Project

Use the Project selector to associate your source with one or more Coveo projects.

"Authentication" section

The File System source supports the Native and Active Directory on-premises authentication methods.

Regardless of the method you choose, you must enter the Username and Password of a dedicated administrator account that has access to the content you want to index. See Source credentials leading practices.

If you selected Active Directory on-premises, fill in the additional boxes that appear. If you selected Native, skip to the "Content to include" section.

Active Directory username and Active Directory password

Enter credentials to grant Coveo access to your Active Directory.

Expand well-known SIDs

Select this option if you want the users that are included in your Active Directory well-known security identifiers to be granted access to the indexed content. Expect an increase in the duration of the security identity provider refresh operation. Supported well-known SIDs are: Everyone, Authenticated Users, Domain Admins, Domain Users, and Anonymous Users.

Tip
Leading practice

If your entire content is secured with Everyone or Authenticated Users, we recommend selecting the Everyone content security option instead. The result will be the same, that is, all users will be able to access the indexed content through your search interface, and Coveo’s update operations will be more efficient.

Expand trusted domains

Select this option to have Coveo connect to your root domain to get the security identities of your other domains through the root domain.

If your environment contains more than one domain, you can establish a bidirectional or outbound cross-link relationship between the root domain of your Crawling Module server and your additional domains. When you do so, these domains trust your root domain, and Coveo can get their security identities through this root domain.

However, when enabling this option, you should expect an increase in the duration of the security identity provider refresh operation. Moreover, if a linked domain is unreachable, Coveo stops the security identity provider refresh operation.

Enable TLS

Select this option to use a TLS protocol to retrieve your security identities. If you do, we strongly recommend selecting StartTLS if you can. Since LDAPS is a much older protocol, you should only select this value if StartTLS is incompatible with your environment.

Email attributes

By default, Coveo retrieves the email address associated with each security identity from the mail attribute. Optionally, you can specify additional or different attributes to check. If an attribute contains more than one value, Coveo uses the first one.

"Content to include" section

By default, when you select the Same users and groups as in your content system option in the Content Security tab, only NTFS permission entries are indexed and replicated in your search interface. Check the Share permissions box if you also want to index and enforce share permissions. The NTFS and share permission systems are combined, and therefore each end user must be allowed to access an item in both permission models to see this item in their search results.
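The combination rule can be pictured as a set intersection. The sketch below is a deliberate simplification (it ignores deny entries and group expansion), and the function and user names are hypothetical.

```python
def effective_readers(ntfs_allowed: set[str], share_allowed: set[str]) -> set[str]:
    """Return the users allowed by BOTH permission models."""
    return ntfs_allowed & share_allowed


if __name__ == "__main__":
    ntfs = {"alice", "bob", "carol"}
    share = {"bob", "carol", "dave"}
    # Only users present in both sets would see the item in their results.
    print(sorted(effective_readers(ntfs, share)))
```

In this example, alice has NTFS access but no share access, so she wouldn't see the item once the Share permissions box is checked.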

For further information on share and NTFS permissions, see Share and NTFS Permissions on a File Server.

For more information on sources that index permissions and on how Coveo handles these permissions, see Coveo management of security identities and item permissions.

"Content security" tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.

"Access" tab

In the Access tab, specify whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration, while Group B can only view it.

For more information, see Custom access level.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click Add source/Save.

    • When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.

      Note

      On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

      Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.

      Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Once your source is done building or rebuilding, review the metadata Coveo is retrieving from your content.

    1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View and map metadata in the Action bar.

    2. If you want to use metadata that isn’t currently indexed in a facet or result template, map it to a field.

      1. Click the metadata and then, at the top right, click Add to Index.

      2. In the Apply a mapping on all item types of a source panel, select the field you want to map the metadata to, or add a new field if none of the existing fields are appropriate.

        Notes
        • For details on configuring a new field, see Add or edit a field.

        • For advanced mapping configurations, like applying a mapping to a specific item type, see Manage mappings.

      3. Click Apply mapping.

    3. Depending on the source type you use, you may be able to extract additional metadata from your content. You can then map that metadata to a field, just like you did for the default metadata.

      More on custom metadata extraction and indexing

      Some source types let you define rules to extract metadata beyond the default metadata Coveo discovers during the initial source build.

      For example:

      Source type | Custom metadata extraction methods
      Push API | Define metadata key-value pairs in the addOrUpdate section of the PUT request payload used to upload push operations to an Amazon S3 file container.
      REST API, GraphQL API | In the JSON configuration (REST API | GraphQL API) of the source, define metadata names (REST API | GraphQL API) and specify where to locate the metadata values in the JSON API response Coveo receives.
      Database | Add <CustomField> elements in the XML configuration. Each element defines a metadata name and the database field used to populate the metadata.
      Web |
      Sitemap |

      Some source types automatically map metadata to default or user created fields, making the mapping process unnecessary. Some source types automatically create mappings and fields for you when you configure metadata extraction.

      See your source type documentation for more details.

    4. When you’re done reviewing and mapping metadata, return to the Sources (platform-ca | platform-eu | platform-au) page.

    5. To reindex your source with your new mappings, click Launch rebuild in the source Status column.

    6. Once the source is rebuilt, you can review its content in the Content Browser.

Refine the content to index

You may want to avoid indexing certain subfolders, or to index only a few of them. To do so:

  1. If not already done, create and save your source with a broad path.

  2. In your source JSON configuration, enter an address filter to refine the targeted content.

    Important

    Your path must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns.

  3. Build or rebuild your source.
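The inclusion and exclusion rule described in the Important note can be sketched as follows. This only illustrates the logic: the expression and allowed keys are assumptions, so check them against the source JSON configuration reference, and Python's fnmatch stands in for whatever wildcard matching Coveo actually applies.

```python
from fnmatch import fnmatchcase


def is_indexed(path: str, patterns: list[dict]) -> bool:
    """Return True if the path matches an inclusion pattern and no exclusion pattern."""
    included = any(fnmatchcase(path, p["expression"]) for p in patterns if p["allowed"])
    excluded = any(fnmatchcase(path, p["expression"]) for p in patterns if not p["allowed"])
    return included and not excluded


# Hypothetical filter: index everything under Users except any Archive subfolder.
PATTERNS = [
    {"expression": r"\\my-server\Users\*", "allowed": True},
    {"expression": r"*\Archive\*", "allowed": False},
]

if __name__ == "__main__":
    print(is_indexed(r"\\my-server\Users\adminuser\Documents\letter.docx", PATTERNS))
    print(is_indexed(r"\\my-server\Users\adminuser\Archive\old.docx", PATTERNS))
```

A path that matches no inclusion pattern at all is skipped, even when no exclusion pattern applies to it.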

Limitations

The File System source has limitations regarding item URIs and switching Crawling Modules.

Changing item URIs

When indexing content with the Crawling Module, make sure you don’t change the space character encoding in an item’s URI, as Coveo uses URIs to distinguish items.

For example, an item whose URI would change from example.com/my first item to example.com/my%20first%20item wouldn’t be recognized as the same by Coveo. As a result, it would be indexed twice, and the older version wouldn’t be deleted.
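The effect can be reproduced with any store keyed by URI. The sketch below uses a plain Python dictionary as a stand-in for the index; it isn't how Coveo stores items, only an illustration of why the two spellings don't collide.

```python
from urllib.parse import quote


def index_items(uris: list[str]) -> dict[str, str]:
    """Store one entry per distinct URI string, as a URI-keyed index would."""
    index: dict[str, str] = {}
    for position, uri in enumerate(uris, start=1):
        index[uri] = f"version {position}"
    return index


if __name__ == "__main__":
    raw_uri = "example.com/my first item"
    encoded_uri = quote(raw_uri)  # spaces become %20
    index = index_items([raw_uri, encoded_uri])
    print(encoded_uri)
    print(len(index))  # two entries for what is logically one item
```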

Item URIs are displayed in the Content Browser (platform-ca | platform-eu | platform-au). We recommend you check where these URIs come from before making changes that affect space character encoding. Depending on your source type, the URI may be an item’s URL, or it may be built out of pieces of metadata by your source mapping rules. For example, your item URIs may consist of the main site URL plus the item filename, due to a mapping rule such as example.com/%[filename]. In such a case, changing space encoding in the item filename could impact the URI.

Switching Crawling Modules

Changing the Crawling Module paired with a File System source may create duplicate items in your index.

This is because the new Crawling Module doesn’t have access to the previous Crawling Module’s database, and therefore doesn’t know which items have already been indexed. As a result, the new Crawling Module indexes your content again, creating duplicates.

To avoid this, instead of changing the Crawling Module paired with your source, you can duplicate this source, and then edit the duplicate source to pair it with the desired Crawling Module. Then, you can delete the original source.

Alternatively, you can delete the duplicate items created by the Crawling Module switch. However, this can only be done via the /olderthan Push API call.

To make this call, you need your organization and source IDs. You also need the ordering ID of the rebuild operation that followed the Crawling Module change. You can get it from either the Log Browser (platform-ca | platform-eu | platform-au) or the Crawling Module logs, under The initial ordering ID for the current refresh operation is:.
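As a rough sketch, the three values could be assembled into a request like this. The base URL, the path layout, the orderingId query parameter name, the HTTP method, and the bearer-token header are all assumptions to verify against the Push API reference; the code only builds the request and doesn't send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: base URL of the Push API; it may differ for your region.
BASE_URL = "https://api.cloud.coveo.com/push/v1"


def build_olderthan_request(organization_id: str, source_id: str,
                            ordering_id: int, token: str) -> Request:
    """Build (without sending) a request that targets items older than ordering_id."""
    # Assumption: path layout and query parameter name; check the Push API reference.
    path = (f"/organizations/{quote(organization_id, safe='')}"
            f"/sources/{quote(source_id, safe='')}/documents/olderthan")
    query = urlencode({"orderingId": ordering_id})
    return Request(f"{BASE_URL}{path}?{query}", method="DELETE",
                   headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    request = build_olderthan_request("myorg", "mysource-id", 1506700606240, "<API_KEY>")
    print(request.get_method(), request.full_url)
```

Inspect the printed URL before sending anything: this kind of call deletes items, so a wrong ordering ID could remove content you meant to keep.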

What’s next?