Add a File System source
A File System source allows members with the required privileges to retrieve and make searchable the content of files shared over a network via the Coveo Crawling Module.
Your company has a shared network drive on which letter, PowerPoint presentation, and email signature templates are available to all employees. You decide to index the whole drive to make its content searchable via your Coveo-powered search page.
When you have the required privileges, you can add files shared over a network to a Coveo organization.
Leading practice
The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.
Source key characteristics
Features | Supported | Additional information
---|---|---
Windows Server version | 2022, 2019, and 2016 |
Content security options | |
Add a File System source
Before you start, ensure that the content to index and make searchable is shared over a network.
Also ensure that the Coveo Crawling Module is installed on a server that has access to the file system whose content you want to retrieve.
Follow the instructions below to add a File System source.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.
-
In the Add a source of content panel, click the File System source tile.
-
Configure your source.
Leading practice
It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually. See About non-production organizations for more information and best practices regarding sandbox organizations.
"Configuration" tab
In the Add a File System Source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.
If you haven’t already installed the Coveo Crawling Module on a server that has access to the file system whose content you want to retrieve, click Download Crawling Module to do so.
General information
Source name
Enter a name for your source.
Leading practice
A source name can’t be modified once it’s saved, therefore be sure to use a short and descriptive name, using letters, numbers, hyphens ( |
Path
Enter the network path to a file system folder or a file.
-
For a file located on the server where the Crawling Module is installed:
C:\Users\adminuser\Documents
-
For a file located on a server other than the one where the Crawling Module is installed, but on the same network:
file:\\my-server\Users\adminuser\Documents
To exclude certain folders or files, first configure and save your source with a broad path. Then, see Refine the content to index.
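Before saving the source, it can help to confirm that the path resolves from the server where the Crawling Module is installed. The following is a minimal sketch, not a Coveo tool: the function name `check_source_path` is hypothetical, and the result only reflects what the account running the script can see.

```python
import os


def check_source_path(path: str) -> dict:
    """Report whether a path resolves to a folder or a file.

    Hypothetical helper: run it on the Crawling Module server, ideally
    under the same account as the one entered in the source configuration.
    """
    return {
        "exists": os.path.exists(path),
        "is_folder": os.path.isdir(path),
        "is_file": os.path.isfile(path),
    }


if __name__ == "__main__":
    # Example path taken from the documentation above.
    print(check_source_path(r"C:\Users\adminuser\Documents"))
```

A result where `exists` is false usually points to a typo in the path or to missing access rights for the account running the check.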
Paired Crawling Module
If your source is a Crawling Module source, and if you have more than one Crawling Module linked to this organization, select the one with which you want to pair your source. If you change the Crawling Module instance paired with your source, a successful rebuild is required for your change to apply.
Optical Character Recognition (OCR)
If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option.
The extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick view. See Enable optical character recognition for details on this feature.
Project
If you have the Enterprise edition, use the Project selector to associate your source with one or multiple Coveo projects.
"Authentication" section
The File System source supports the following authentication methods.
-
Active Directory on-premises
Select this option to allow Coveo to connect to your file system with Active Directory on-premises credentials.
Selecting this option also allows Coveo to retrieve the Active Directory permissions with which your file system is secured in order to replicate them in your search interface. However, for this feature to be enforced, you must also select the Same users and groups as in your content system option in the Content security tab.
-
Native
Select this option to allow Coveo to connect to your file system with Active Directory on-premises credentials. This option does not allow selecting the Same users and groups as in your content system option in the Content security tab.
Regardless of the method you choose, you must enter the Username and Password of a dedicated administrator account that has access to the content you want to index. See Source credentials leading practices. If you selected Active Directory on-premises, fill in the additional boxes that appear. If you selected Native, skip to the "Content to include" section.
Active Directory username and Active Directory password
Enter credentials to grant Coveo access to your Active Directory.
Expand well-known SIDs
Select this option if you want the users that are included in your Active Directory well-known security identifiers to be granted access to the indexed content.
Expect an increase in the duration of the security identity provider refresh operation.
Supported well-known SIDs are: Everyone, Authenticated Users, Domain Admins, Domain Users, and Anonymous Users.
Leading practice
If your entire content is secured with |
Expand trusted domains
Select this option to have Coveo connect to your root domain to get the security identities of your other domains through the root domain.
If your environment contains more than one domain, you can establish a bidirectional or outbound cross-link relationship between the root domain of your Crawling Module server and your additional domains. When you do so, these domains trust your root domain, and Coveo can get their security identities through this root domain.
However, when enabling this option, you should expect an increase in the duration of the security identity provider refresh operation. Moreover, if a linked domain is unreachable, Coveo stops the security identity provider refresh operation.
Enable TLS
Select this option to use a TLS protocol to retrieve your security identities. If you do, we strongly recommend selecting StartTLS if you can. Since LDAPS is a much older protocol, you should only select this value if StartTLS is incompatible with your environment.
Email attributes
By default, Coveo retrieves the email address associated with each security identity from the mail attribute.
Optionally, you can specify additional or different attributes to check.
Should an attribute contain more than one value, Coveo uses the first one.
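The "first value wins" rule can be sketched as follows. This is an illustration only: the function name `resolve_email` and the dictionary shape are hypothetical, and the sketch assumes that configured attributes are checked in the order listed, which the text above does not state.

```python
def resolve_email(identity_attributes, attribute_names=("mail",)):
    """Return the email for a security identity, or None if none is found.

    Assumption: attributes are checked in the order given. When an
    attribute holds several values, only the first one is used.
    """
    for name in attribute_names:
        values = identity_attributes.get(name)
        if not values:
            continue  # attribute missing or empty, try the next one
        if isinstance(values, str):
            return values
        return values[0]
    return None


# Hypothetical directory entry with a multi-valued attribute.
entry = {"mail": ["jdoe@example.com", "john.doe@example.com"]}
print(resolve_email(entry))  # jdoe@example.com
```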
"Content to include" section
By default, when you select the Same users and groups as in your content system option in the Content security tab, only NTFS permission entries are indexed and replicated in your search interface. Check the Share permissions box if you also want to index and enforce share permissions. The NTFS and share permission systems are combined, and therefore each end user must be allowed to access an item in both permission models to see this item in their search results.
For further information on share and NTFS permissions, see Share and NTFS Permissions on a File Server.
For more information on sources that index permissions and on how Coveo handles these permissions, see Coveo management of security identities and item permissions.
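The effect of combining the two permission models can be pictured as an intersection. The sketch below is a simplified model under stated assumptions: it uses flat sets of user names and ignores groups and deny entries, and the function and variable names are hypothetical.

```python
def can_see_item(user, ntfs_allowed, share_allowed, share_permissions_enabled):
    """Return True if `user` would see the item in search results.

    Simplified model: NTFS permissions always apply. Share permissions
    apply only when the Share permissions box is checked.
    """
    if user not in ntfs_allowed:
        return False
    if share_permissions_enabled and user not in share_allowed:
        return False
    return True


ntfs = {"alice", "bob"}   # users allowed by NTFS permissions
share = {"alice"}         # users allowed by share permissions

print(can_see_item("bob", ntfs, share, share_permissions_enabled=False))  # True
print(can_see_item("bob", ntfs, share, share_permissions_enabled=True))   # False
```

In this model, checking the Share permissions box can only narrow the set of users who see an item, never widen it.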
"Content security" tab
Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.
"Access" tab
In the Access tab, set whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.
For example, when creating a new source, you could decide that members of Group A can edit its configuration while Group B can only view it.
See Custom access level for more information.
Completion
-
Finish adding or editing your source:
-
When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to do other changes soon, click Add source/Save.
-
When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.
Note: On the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.
Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.
Once the source is built or rebuilt, you can review its content in the Content Browser.
-
-
Once your source is done building or rebuilding, review the metadata Coveo is retrieving from your content.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View and map metadata in the Action bar.
-
If you want to use metadata that isn’t currently indexed in a facet or result template, map it to a field.
-
Click the metadata and then, at the top right, click Add to Index.
-
In the Apply a mapping on all item types of a source panel, select the field you want to map the metadata to, or add a new field if none of the existing fields are appropriate.
Notes
-
For details on configuring a new field, see Add or edit a field.
-
For advanced mapping configurations, like applying a mapping to a specific item type, see Manage mappings.
-
-
Click Apply mapping.
-
-
Depending on the source type you use, you may be able to extract additional metadata from your content. You can then map that metadata to a field, just like you did for the default metadata.
More on custom metadata extraction and indexing
Some source types let you define rules to extract metadata beyond the default metadata Coveo discovers during the initial source build.
For example:
Source type | Custom metadata extraction methods
---|---
 | Define metadata key-value pairs in the addOrUpdate section of the PUT request payload used to upload push operations to an Amazon S3 file container.
REST API and GraphQL API | In the JSON configuration (REST API or GraphQL API) of the source, define metadata names (REST API or GraphQL API) and specify where to locate the metadata values in the JSON API response Coveo receives.
 | Add <CustomField> elements in the XML configuration. Each element defines a metadata name and the database field to use to populate the metadata with.
 | Configure web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract metadata from JSON-LD <script> tags.
 | Configure web scraping configurations that contain metadata extraction rules using CSS or XPath selectors. Extract JSON-LD <script> tag metadata. Extract <meta> tag content using the IndexHtmlMetadata JSON parameter.
Some source types automatically map metadata to default or user-created fields, making the mapping process unnecessary. Others automatically create mappings and fields for you when you configure metadata extraction.
See your source type documentation for more details.
-
-
When you’re done reviewing and mapping metadata, return to the Sources (platform-ca | platform-eu | platform-au) page.
-
To reindex your source with your new mappings, click Launch rebuild in the source Status column.
-
Once the source is rebuilt, you can review its content in the Content Browser.
-
Refine the content to index
You may want to avoid indexing certain subfolders, or to index only a few of them. To do so:
-
If not already done, create and save your source with a broad path.
-
In your source JSON configuration, enter an address filter to refine the targeted content.
Your path must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns.
-
Build or rebuild your source.
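The inclusion and exclusion rule can be reasoned about with a small model. The sketch below is an approximation under stated assumptions: the keys `expression` and `allowed` and the wildcard semantics of Python's `fnmatch` are illustrative choices, not a confirmed description of how Coveo evaluates addressPatterns. Check the source JSON configuration reference for the actual format.

```python
from fnmatch import fnmatchcase

# Hypothetical filters: include everything, then exclude any Archive folder.
address_patterns = [
    {"expression": "*", "allowed": True},
    {"expression": "*\\Archive\\*", "allowed": False},
]


def is_indexed(path, patterns):
    """Return True if `path` matches an inclusion and no exclusion."""
    included = any(
        p["allowed"] and fnmatchcase(path, p["expression"]) for p in patterns
    )
    excluded = any(
        not p["allowed"] and fnmatchcase(path, p["expression"]) for p in patterns
    )
    return included and not excluded


print(is_indexed(r"C:\Users\adminuser\Documents\letter.docx", address_patterns))       # True
print(is_indexed(r"C:\Users\adminuser\Documents\Archive\old.docx", address_patterns))  # False
```

Note that a path matching no inclusion pattern at all is rejected, even if it matches no exclusion pattern either.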
Limitation
The File System source has limitations regarding item URIs and switching Crawling Modules.
Changing item URIs
When indexing content with the Crawling Module, make sure that the encoding of space characters in an item’s URI doesn’t change, as Coveo uses URIs to distinguish items.
For example, an item whose URI changes from example.com/my first item to example.com/my%20first%20item wouldn’t be recognized as the same item by Coveo.
As a result, it would be indexed twice, and the older version wouldn’t be deleted.
Item URIs are displayed in the Content Browser (platform-ca | platform-eu | platform-au).
We recommend you check where these URIs come from before making changes that affect space character encoding.
Depending on your source type, the URI may be an item’s URL, or it may be built out of pieces of metadata by your source mapping rules.
For example, your item URIs may consist of the main site URL plus the item filename, due to a mapping rule such as example.com/%[filename].
In such a case, changing space encoding in the item filename could impact the URI.
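To spot URIs that differ only by encoding, you could compare their decoded forms. The following is a minimal sketch, assuming you have exported a list of item URIs yourself; the function name `encoding_variants` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote


def encoding_variants(uris):
    """Group URIs that are distinct strings but identical once decoded."""
    groups = defaultdict(list)
    for uri in uris:
        groups[unquote(uri)].append(uri)
    # Keep only the groups that contain more than one distinct spelling.
    return {
        decoded: variants
        for decoded, variants in groups.items()
        if len(set(variants)) > 1
    }


uris = [
    "example.com/my first item",
    "example.com/my%20first%20item",
    "example.com/other",
]
print(encoding_variants(uris))
# {'example.com/my first item': ['example.com/my first item', 'example.com/my%20first%20item']}
```

Each group returned corresponds to a case where the index would treat two spellings as two separate items.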
Switching Crawling Modules
Changing the Crawling Module paired with a File System source may create duplicate items in your index.
This is because the new Crawling Module doesn’t have access to the previous Crawling Module’s database, and therefore doesn’t know which items have already been indexed. As a result, the new Crawling Module indexes your content again, creating duplicates.
To avoid this, instead of changing the Crawling Module paired with your source, you can duplicate this source, and then edit the duplicate source to pair it with the desired Crawling Module. Then, you can delete the original source.
Alternatively, you can delete the duplicate items created by the Crawling Module switch.
However, this can only be done via the /olderthan Push API call.
To make this call, you need your organization and source IDs.
You also need the ordering ID of the rebuild operation that followed the Crawling Module change.
You can get it from either the Log Browser (platform-ca | platform-eu | platform-au) or the Crawling Module logs, under "The initial ordering ID for the current refresh operation is:".
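Gathering the pieces for this call can be sketched as follows. This is an illustration under assumptions: the log line layout after the marker text, the sample ordering ID, the base URL, and the path and query parameter layout used in `build_olderthan_url` are not confirmed by this article, so verify them against the Push API reference before sending any request. The sketch only builds a URL and sends nothing.

```python
import re
from urllib.parse import quote, urlencode

LOG_MARKER = "The initial ordering ID for the current refresh operation is:"


def extract_ordering_id(log_text):
    """Return the first ordering ID found after the marker, or None."""
    match = re.search(re.escape(LOG_MARKER) + r"\s*(\d+)", log_text)
    return int(match.group(1)) if match else None


def build_olderthan_url(base_url, organization_id, source_id, ordering_id):
    """Assemble the URL for the call (path and parameter names are assumed)."""
    path = (
        f"/organizations/{quote(organization_id, safe='')}"
        f"/sources/{quote(source_id, safe='')}/documents/olderthan"
    )
    return f"{base_url}{path}?{urlencode({'orderingId': ordering_id})}"


# Hypothetical log line and identifiers.
log_line = f"INFO {LOG_MARKER} 1506700606240"
ordering_id = extract_ordering_id(log_line)
print(build_olderthan_url(
    "https://api.example.invalid/push/v1", "myorg", "mysource", ordering_id
))
```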
What’s Next?
-
Consider subscribing to deactivation notifications to receive an alert when a Crawling Module component becomes obsolete and stops the content crawling process.