Add a File System source

A File System source allows members with the required privileges to retrieve and make searchable the content of files shared over a network via the Coveo Crawling Module.

Example

Your company has a shared network drive on which letter, PowerPoint presentation, and email signature templates are available to all employees. You decide to index the whole drive to make its content searchable via your Coveo-powered search page.

When you have the required privileges, you can add files shared over a network to a Coveo organization.

Leading practice

The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.

Source key characteristics

The following table presents the key characteristics of a File System source.

Features	Supported	Additional information
Windows Server version	2025, 2022, 2019, and 2016	
Content update operations: refresh		
Content update operations: rescan		Takes place every day by default
Content update operations: rebuild		
Content security options: Same users and groups as in your content system		
Content security options: Specific users and groups		
Content security options: Everyone		

Add a File System source

Before you start, ensure that the content to index and make searchable is shared over a network.

Also ensure that the Coveo Crawling Module is installed on a server that has access to the file system whose content you want to retrieve.

Follow the instructions below to add a File System source.

On the Sources page, click Add source.
In the Add a source of content panel, click the File System source tile.
Configure your source.

Leading practice

It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually.

See About non-production organizations for more information and best practices regarding sandbox organizations.

"Configuration" tab

In the Add a File System Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General information

Name

Enter a name for your source.

Leading practice

A source name can’t be modified once it’s saved, so choose a short, descriptive name that uses only letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.
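
The naming guidance above can be checked before you create a source. The following is a minimal Python sketch, not part of Coveo. The function name `is_recommended_source_name` is hypothetical, and the check assumes that "letters" means unaccented ASCII letters.

```python
import re

# Assumption: only ASCII letters, digits, hyphens, and underscores are accepted.
_RECOMMENDED_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_recommended_source_name(name: str) -> bool:
    """Return True if the name uses only the recommended characters."""
    return _RECOMMENDED_NAME.fullmatch(name) is not None


print(is_recommended_source_name("shared-drive_templates"))    # True
print(is_recommended_source_name("Shared drive (templates)"))  # False
```

Because the name can't be changed after saving, running a check like this in a provisioning script is cheaper than recreating the source later.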

Path

Enter the network path to a file system folder or a file.

Examples

For content located on the server where the Crawling Module is installed: C:\Users\adminuser\Documents

For content located on a different server than the one where the Crawling Module is installed, but on the same network: file:\\my-server\Users\adminuser\Documents

To exclude certain folders or files, first configure and save your source with a broad path. Then, see Refine the content to index.
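
To tell the two path forms in the examples above apart before pasting a value into the Path box, a small helper can be useful. This is a minimal Python sketch under stated assumptions: the function name `describe_path` is hypothetical, and the handling of the `file:` prefix simply mirrors the example above rather than any documented Coveo parsing rule.

```python
from pathlib import PureWindowsPath


def describe_path(raw: str) -> str:
    """Classify a path as 'local' (drive letter), 'network' (UNC share), or 'unknown'."""
    # Assumption: an optional "file:" prefix may precede a UNC path, as in the example above.
    text = raw[len("file:"):] if raw.lower().startswith("file:") else raw
    drive = PureWindowsPath(text).drive
    if drive.startswith("\\\\"):
        # UNC paths expose "\\server\share" as their drive component.
        return "network"
    if len(drive) == 2 and drive.endswith(":"):
        return "local"
    return "unknown"


print(describe_path(r"C:\Users\adminuser\Documents"))             # local
print(describe_path(r"file:\\my-server\Users\adminuser\Documents"))  # network
```

`PureWindowsPath` only parses the string, so the helper runs on any operating system and never touches the file system.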

Project

Use the Project selector to associate your source with one or more Coveo projects.

"Authentication" section

In the Authentication section, provide credentials for Coveo to use to access your file system.

Depending on your environment, select either of the following:

Use Crawling Module identity: Coveo will use the administrator credentials of the server where the selected Crawling Module is installed.
Use specific credentials: Coveo will use the credentials you provide to access your content. For example, you may need to select this option if your Crawling Module is installed on your server with a non-administrator user identity, and Coveo must use administrator credentials to crawl your files.

"Crawling Module" section

If you haven’t already installed the Coveo Crawling Module on a server that has access to the content to index, click Download to do so.

If you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.

"Items" tab

In the Items tab, you can specify how the source handles items based on their file type or content type.

File types

File types let you define how the source handles items based on their file extension or content type. For each file type, you can specify whether to index the item content and metadata, only the item metadata, or neither.

You should fine-tune the file type configurations with the objective of indexing only the content that’s relevant to your users.

Example

Your repository contains .pdf files, but you don’t want them to appear in search results. You click Extensions and then, for the .pdf extension, you change the Default action and Action on error values to Ignore item.

For more details about this feature, see Customize the indexing process.

Content and images

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option. The extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick view.

Note

When OCR is enabled, ensure the source’s relevant file type configurations index the item content. Indexing the item’s metadata only or ignoring the item will prevent OCR from being applied.

See Enable optical character recognition for details on this feature.

"Content security" tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on the content security options, see Content security.

If you select "Same users and groups as in your content system"

If you select Same users and groups as in your content system, Coveo will retrieve the Active Directory permissions with which your file system is secured in order to replicate them in your search interface. Therefore, each user will only see in their search results the content they can access in the original file system.

To enable this option, you must allow Coveo to connect to your file system with Active Directory on-premises credentials.

Active Directory username and password

Enter credentials to grant Coveo access to your Active Directory.

The credentials must belong to a dedicated administrator account that has access to the content you want to index. See Source credentials leading practices.

Email attributes

By default, Coveo retrieves the email address associated with each security identity from the mail attribute. Optionally, you can specify additional or different attributes to check. Should an attribute contain more than one value, Coveo uses the first one.
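
The selection rule described above can be pictured with a short Python sketch. It is an illustration only, not Coveo's implementation: the function name `pick_email`, the dictionary shape of the directory entry, and the idea that attributes are checked in the order you list them are all assumptions.

```python
from typing import Dict, List, Optional, Sequence


def pick_email(entry: Dict[str, List[str]],
               attributes: Sequence[str] = ("mail",)) -> Optional[str]:
    """Return the first value of the first listed attribute that has a value."""
    for attribute in attributes:  # assumption: attributes are checked in listed order
        values = entry.get(attribute) or []
        if values:
            # A multi-valued attribute contributes only its first value.
            return values[0]
    return None


entry = {"mail": [], "proxyAddresses": ["a.smith@example.com", "asmith@example.com"]}
print(pick_email(entry, ("mail", "proxyAddresses")))  # a.smith@example.com
```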

Enable Transport Layer Security (TLS)

Select this option to use a TLS protocol to retrieve your security identities. If you do, we strongly recommend selecting StartTLS if you can. Since LDAPS is a much older protocol, you should only select this value if StartTLS is incompatible with your environment.

Expand well-knowns

Select this option if you want the users that are included in your Active Directory well-known security identifiers to be granted access to the indexed content. Supported well-known SIDs are: Everyone, Authenticated Users, Domain Admins, Domain Users, and Anonymous Users.

When enabling this option, you can expect an increase in the duration of the security identity provider refresh operation.

Leading practice

If your entire content is secured with Everyone or Authenticated Users, we recommend selecting the Everyone content security option instead. The result will be the same, that is, all users will be able to access the file system content through your search interface, and Coveo’s update operations will be more efficient.

Expand trusted domains

Select this option to have Coveo connect to your root domain to get the security identities of your other domains through the root domain.

If your environment contains more than one domain, you can establish a bidirectional or outbound cross-link relationship between the root domain of your Crawling Module server and your additional domains. When you do so, these domains trust your root domain, and Coveo can get their security identities through this root domain.

When enabling this option, you can expect an increase in the duration of the security identity provider refresh operation. Moreover, if a linked domain is unreachable, Coveo stops the security identity provider refresh operation.

Permissions to index

By default, only NTFS permission entries are indexed and replicated in your search interface. Select Share and NTFS permissions if you also want to index and enforce share permissions. When you index NTFS and share permissions, Coveo combines these systems. Therefore, each end user must be allowed to access an item in both permission models to see this item in their search results.

For further information on share and NTFS permissions, see Share and NTFS Permissions on a File Server.
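
The way the two permission models combine can be sketched in a few lines of Python. This is a deliberately simplified illustration: the function name `can_see_item` is hypothetical, and each permission model is reduced to a flat set of allowed user names, whereas real NTFS and share permissions also involve groups and deny entries.

```python
from typing import Optional, Set


def can_see_item(user: str,
                 ntfs_allowed: Set[str],
                 share_allowed: Optional[Set[str]] = None) -> bool:
    """Return True if the user may see the item in search results.

    share_allowed is None when only NTFS permissions are indexed (the default).
    """
    if user not in ntfs_allowed:
        return False
    if share_allowed is None:
        return True
    # With "Share and NTFS permissions", both models must allow the user.
    return user in share_allowed


ntfs = {"alice", "bob"}
share = {"alice"}
print(can_see_item("bob", ntfs))         # True
print(can_see_item("bob", ntfs, share))  # False
```

In this simplified model, indexing share permissions can only narrow access, never widen it, because a user must pass both checks.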

"Access" tab

In the Access tab, specify whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration, while Group B can only view it.

For more information, see Custom access level.

Completion

Finish adding or editing your source:

When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click Add source/Save.

When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.

Note

On the Sources page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

Back on the Sources page, you can follow the progress of your source addition or modification.

Once the source is built or rebuilt, you can review its content in the Content Browser.

Once your source is done building or rebuilding, review the metadata Coveo is retrieving from your content.

On the Sources page, click your source, and then click More > View and map metadata in the Action bar.
If you want to use metadata that isn’t currently indexed in a facet or result template, map it to a field.
1. Click the metadata and then, at the top right, click Add to Index.
2. In the Apply a mapping on all item types of a source panel, select the field you want to map the metadata to, or add a new field if none of the existing fields are appropriate. For advanced mapping configurations, like applying a mapping to a specific item type, see Manage mappings.
3. Click Apply mapping.

Depending on the source type you use, you may be able to extract additional metadata from your content. You can then map that metadata to a field, just like you did for the default metadata.

More on custom metadata extraction and indexing

Some source types let you define rules to extract metadata beyond the default metadata Coveo discovers during the initial source build.

For example:

Source type	Custom metadata extraction methods
Push API	Define metadata key-value pairs in the `addOrUpdate` section of the `PUT` request payload used to upload push operations to an Amazon S3 file container.
REST API and GraphQL API	In the JSON configuration of the source, define metadata names and specify where to locate the metadata values in the JSON API response Coveo receives.
Database	Add `<CustomField>` elements in the XML configuration. Each element defines a metadata name and the database field used to populate it.
Web	Create web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract metadata from JSON-LD `<script>` tags.
Sitemap	Extract metadata included in the XML sitemap file. Create web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract JSON-LD `<script>` tag metadata. Extract `<meta>` tag content using the `IndexHtmlMetadata` JSON parameter.

Some source types automatically map metadata to default or user-created fields, making the mapping process unnecessary. Some source types automatically create mappings and fields for you when you configure metadata extraction.

See your source type documentation for more details.

When you’re done reviewing and mapping metadata, return to the Sources page.
To reindex your source with your new mappings, click Launch rebuild in the source Status column.
Once the source is rebuilt, you can review its content in the Content Browser.

Refine the content to index

You may want to avoid indexing certain subfolders, or to index only a few of them. To do so:

If not already done, create and save your source with a broad path.
In your source JSON configuration, enter an address filter to refine the targeted content.

Your path must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns.
Build or rebuild your source.
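
The inclusion and exclusion rule above can be tried out locally before you edit the JSON configuration. The following Python sketch is an approximation only: the function name `is_indexed` and the sample patterns are hypothetical, and it uses Python's `fnmatch` wildcards with case-insensitive matching, which may not behave exactly like the pattern types Coveo supports.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_indexed(path: str, include: Iterable[str], exclude: Iterable[str] = ()) -> bool:
    """Return True if path matches at least one inclusion pattern and no exclusion pattern."""
    candidate = path.lower()  # assumption: matching is case-insensitive
    included = any(fnmatchcase(candidate, pattern.lower()) for pattern in include)
    excluded = any(fnmatchcase(candidate, pattern.lower()) for pattern in exclude)
    return included and not excluded


# Hypothetical patterns: index the Documents share, but skip any Archive subfolder.
include = [r"\\my-server\Users\adminuser\Documents\*"]
exclude = [r"*\Archive\*"]

print(is_indexed(r"\\my-server\Users\adminuser\Documents\Templates\letter.docx", include, exclude))  # True
print(is_indexed(r"\\my-server\Users\adminuser\Documents\Archive\old.docx", include, exclude))       # False
```

Testing a handful of representative paths this way helps confirm that an exclusion pattern isn't broader than intended before you launch a rebuild.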

Limitation

The File System source has limitations regarding item URIs and switching Crawling Modules.

Changing item URIs

When indexing content with the Crawling Module, make sure you don’t change how space characters are encoded in an item’s URI, because Coveo uses URIs to distinguish items.

For example, an item whose URI changes from example.com/my first item to example.com/my%20first%20item wouldn’t be recognized as the same item by Coveo. As a result, it would be indexed twice, and the older version wouldn’t be deleted.

Item URIs are displayed in the Content Browser. We recommend you check where these URIs come from before making changes that affect space character encoding. Depending on your source type, the URI may be an item’s URL, or it may be built out of pieces of metadata by your source mapping rules. For example, your item URIs may consist of the main site URL plus the item filename, due to a mapping rule such as example.com/%[filename]. In such a case, changing space encoding in the item filename could impact the URI.
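
The effect of a space encoding change can be reproduced with Python's standard library. This sketch is illustrative: the helper names are hypothetical, and treating URIs as exact strings is an assumption drawn from the example above.

```python
from urllib.parse import quote, unquote

raw_uri = "example.com/my first item"
encoded_uri = quote(raw_uri)  # percent-encodes spaces; "/" is kept by default

print(encoded_uri)  # example.com/my%20first%20item


def same_exact_uri(a: str, b: str) -> bool:
    """Assumption: items are told apart by exact URI string comparison."""
    return a == b


def differ_only_by_encoding(a: str, b: str) -> bool:
    """Flag URI pairs that decode to the same text but are not identical strings."""
    return a != b and unquote(a) == unquote(b)


print(same_exact_uri(raw_uri, encoded_uri))           # False
print(differ_only_by_encoding(raw_uri, encoded_uri))  # True
```

A check like `differ_only_by_encoding` could be run over a list of URIs exported before and after a planned change to spot items that would end up indexed twice.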

Switching Crawling Modules

Changing the Crawling Module paired with a File System source may create duplicate items in your index.

This is because the new Crawling Module doesn’t have access to the previous Crawling Module’s database, and therefore doesn’t know which items have already been indexed. As a result, the new Crawling Module indexes your content again, creating duplicates.

To avoid this, instead of changing the Crawling Module paired with your source, you can duplicate this source, and then edit the duplicate source to pair it with the desired Crawling Module. Then, you can delete the original source.

Alternatively, you can delete the duplicate items created by the Crawling Module switch. However, this can only be done via the "Delete old items" Push API call.

To make this call, you need your organization and source IDs. You also need the ordering ID of the rebuild operation that followed the Crawling Module change. You can get it from either the Log Browser or the Crawling Module logs, under The initial ordering ID for the current refresh operation is:.
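
The pieces listed above can be assembled into a request as in the following Python sketch. Treat every specific as an assumption to confirm against the Push API reference before use: the base URL, the `/documents/olderthan` path, the `orderingId` query parameter, and the bearer token header are written from memory of the Push API, and the function name `build_delete_old_items_request` and the placeholder IDs are hypothetical. The sketch only builds the request and doesn't send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: default Push API base URL; regional deployments use a different host.
DEFAULT_BASE_URL = "https://api.cloud.coveo.com/push/v1"


def build_delete_old_items_request(org_id: str, source_id: str, ordering_id: int,
                                   api_key: str, base_url: str = DEFAULT_BASE_URL) -> Request:
    """Build (without sending) a 'Delete old items' request for the given ordering ID."""
    # Assumption: endpoint path and parameter name; verify in the Push API reference.
    path = (f"{base_url}/organizations/{quote(org_id, safe='')}"
            f"/sources/{quote(source_id, safe='')}/documents/olderthan")
    query = urlencode({"orderingId": ordering_id})
    return Request(f"{path}?{query}", method="DELETE",
                   headers={"Authorization": f"Bearer {api_key}"})


# Placeholder values for illustration only.
request = build_delete_old_items_request("myorg", "myorg-abc123", 1700000000000, "xx-placeholder")
print(request.get_method(), request.full_url)
```

Printing the method and URL first lets you compare them with the API reference and double-check the ordering ID before passing the request to `urllib.request.urlopen`, since the deletion can't be undone.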

What’s next?

Schedule source updates.
Consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.