Google Drive source configuration leading practices
Google Drives can contain large volumes of content, making it essential to index only the items relevant to your search interface users. By excluding unnecessary content from indexing, you reduce the indexing time and improve search relevance.
This article presents Google Drive source scoping features and indexing strategies that can significantly improve indexing performance.
Index only required user drives
On the Add a Google Drive source page, in the Users to include section, All is the default value. You can instead select Specific and specify the users whose drives you want to index.
Alternatively, you can select All and exclude specific users' drives by adding exclusion address patterns to your source JSON configuration.
Suppose you want to index all users' drives except for those belonging to bfranklin@abc.com and jwhite@abc.com.
- On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.
- In the Edit configuration with JSON panel, locate the addressPatterns object.
- Add two "allowed": false address patterns to obtain the following configuration:

  "addressPatterns": [
    {
      "allowed": false,
      "expression": ".*User:bfranklin@abc.com.*",
      "patternType": "RegEx"
    },
    {
      "allowed": false,
      "expression": ".*User:jwhite@abc.com.*",
      "patternType": "RegEx"
    },
    {
      "allowed": true,
      "expression": "*",
      "patternType": "Wildcard"
    }
  ]

  The final "allowed": true pattern is an inclusion pattern that ensures all remaining users' drives are indexed. If you omit this pattern, no items will be indexed.
- Click Save to apply your change for future update operations or Save and rebuild source.
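To reason about which drives such a configuration would keep, the pattern logic can be sketched in Python. This is a minimal sketch, not Coveo's implementation: it assumes that an item is indexed when its address matches at least one allowed pattern and no disallowed pattern, and the sample item addresses are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Address patterns copied from the configuration above.
ADDRESS_PATTERNS = [
    {"allowed": False, "expression": ".*User:bfranklin@abc.com.*", "patternType": "RegEx"},
    {"allowed": False, "expression": ".*User:jwhite@abc.com.*", "patternType": "RegEx"},
    {"allowed": True, "expression": "*", "patternType": "Wildcard"},
]


def matches(pattern, address):
    """Return True if the address matches a single pattern entry."""
    if pattern["patternType"] == "RegEx":
        return re.search(pattern["expression"], address) is not None
    # Wildcard patterns: "*" stands for any run of characters.
    return fnmatchcase(address, pattern["expression"])


def is_indexed(address, patterns=ADDRESS_PATTERNS):
    """ASSUMPTION: exclusions win, and at least one inclusion must match."""
    excluded = any(matches(p, address) for p in patterns if not p["allowed"])
    included = any(matches(p, address) for p in patterns if p["allowed"])
    return included and not excluded


# Hypothetical item addresses, for illustration only.
SAMPLES = [
    "googledrive://Root:GoogleDrive/User:asmith@abc.com/File:1",
    "googledrive://Root:GoogleDrive/User:bfranklin@abc.com/File:2",
    "googledrive://Root:GoogleDrive/User:jwhite@abc.com/File:3",
]

for address in SAMPLES:
    print(address, "->", "indexed" if is_indexed(address) else "skipped")
```

Under the same assumption, calling is_indexed with only the two exclusion patterns rejects every address, which mirrors the warning about omitting the inclusion pattern.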
Index only specific shared drives
By default, the Google Drive source only indexes a user’s My Drive content. However, you can configure the source to index content from the user’s Shared drives as well, or exclusively from the Shared drives.
- On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.
- In the Edit configuration with JSON panel, set the IndexUserDrives and IndexSharedDrives parameters to true or false, depending on your requirements.
For maximum control over indexed content and easier source management, consider creating a dedicated user account and indexing only the handpicked drives you share with this account.
Index only items modified within a rolling period
By default, the Google Drive source indexes all items, regardless of when they were last modified. You can configure the source to index only items modified within a rolling period.
- On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.
- Configure the ItemsModificationPeriod parameter to specify the rolling period. See Configuring unlisted parameters and ItemsModificationPeriod for details.
- Click Save to apply your change for future update operations or Save and rebuild source.
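To illustrate what "rolling" means here, the following Python sketch checks whether an item's last modification date falls inside a window that ends at the time of the update. It is only a conceptual illustration: the function name, the 365-day window, and the dates are hypothetical, and the actual value format of ItemsModificationPeriod is described in the referenced documentation, not here.

```python
from datetime import datetime, timedelta, timezone


def within_rolling_period(last_modified, period, now=None):
    """Return True if last_modified falls inside the window ending at `now`."""
    now = now or datetime.now(timezone.utc)
    return last_modified >= now - period


# Hypothetical 365-day window, evaluated at two different update times.
period = timedelta(days=365)
modified = datetime(2023, 6, 1, tzinfo=timezone.utc)
first_update = datetime(2024, 1, 1, tzinfo=timezone.utc)
later_update = datetime(2024, 9, 1, tzinfo=timezone.utc)

print(within_rolling_period(modified, period, now=first_update))
print(within_rolling_period(modified, period, now=later_update))
```

Because the window moves forward with each update, the same item qualifies at the first update time and no longer qualifies at the later one.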
Exclude MIME types
The Google Drive source lets you specify Google Drive supported MIME types that you want to exclude from indexing.
- On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.
- Configure the MimeTypesToIgnore parameter to specify the MIME types to exclude. See Configuring unlisted parameters and MimeTypesToIgnore for details.
- Click Save to apply your change for future update operations or Save and rebuild source.
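Before choosing which MIME types to exclude, it can help to estimate how much content each exclusion would remove. The Python sketch below does this for a small, hypothetical inventory of items. It assumes an exact, case-insensitive comparison of MIME types, which may differ from how MimeTypesToIgnore is actually evaluated, and the exclusion list is only an example.

```python
from collections import Counter

# Example exclusion list (hypothetical choice of types to ignore).
MIME_TYPES_TO_IGNORE = {
    "application/vnd.google-apps.form",
    "application/vnd.google-apps.drawing",
    "video/mp4",
}


def is_excluded(mime_type, ignored=MIME_TYPES_TO_IGNORE):
    """ASSUMPTION: exact, case-insensitive match against the ignore list."""
    return mime_type.lower() in ignored


# Hypothetical inventory of (item name, MIME type) pairs.
ITEMS = [
    ("Q3 report", "application/vnd.google-apps.document"),
    ("Budget", "application/vnd.google-apps.spreadsheet"),
    ("Feedback survey", "application/vnd.google-apps.form"),
    ("Kickoff recording", "video/mp4"),
    ("Townhall recording", "video/mp4"),
]

kept = [name for name, mime in ITEMS if not is_excluded(mime)]
skipped = Counter(mime for _, mime in ITEMS if is_excluded(mime))

print("Kept:", kept)
print("Skipped per MIME type:", dict(skipped))
```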
Split your source and parallelize updates
Google limits the number of Drive API calls per minute, but the limit is generous, and there's no per-day limit.
You can take advantage of the generous per-minute limit by splitting your Google Drive source into smaller ones and scheduling their updates to run in parallel.
Note
The Google Drive source supports refresh and rescan scheduled source updates. You should understand the differences between the two source update types and set scheduled update frequencies that make sense in your context.
One way to split your source is by using address patterns to organize user drives into alphabetical groups.
You have a Google Drive source that indexes content from All user drives. Now, you want to split this source into multiple sources.
Suppose you want the first source to index My Drive content from users whose email addresses start with the letters a–d. The second source would then cover users whose email addresses start with e–h, and so on.
- On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Duplicate.
- In the Duplicate a source panel, enter a name (that contains "a–d") for your new source. Then, click Duplicate source.
- Return to the Sources (platform-ca | platform-eu | platform-au) page, click your new Google Drive source, and then click More > Edit configuration with JSON.
- In the Edit configuration with JSON panel, locate the addressPatterns object.
- Enter the following addressPatterns configuration:

  "addressPatterns": [
    {
      "allowed": true,
      "expression": ".*/Root:GoogleDrive($|/User:[a-d]).*",
      "patternType": "RegEx"
    }
  ]
- Click Save.
- Repeat the previous steps for the remaining groups of users. For example, in your next source, you would use ".*/Root:GoogleDrive($|/User:[e-h]).*" in the address pattern.
- At the end of the process, delete the original source.
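When there are many groups, writing each addressPatterns block by hand is error-prone. The following Python sketch generates one block per alphabetical group and checks which group would claim a given address. The helper names, the group size of four letters, and the sample address format are assumptions made for illustration; only the expression template comes from the steps above.

```python
import json
import re
import string


def letter_groups(size=4):
    """Split a-z into consecutive (first, last) letter ranges."""
    letters = string.ascii_lowercase
    return [
        (letters[i], letters[min(i + size, len(letters)) - 1])
        for i in range(0, len(letters), size)
    ]


def address_patterns(first, last):
    """Build the addressPatterns list for users whose address starts in [first-last]."""
    expression = f".*/Root:GoogleDrive($|/User:[{first}-{last}]).*"
    return [{"allowed": True, "expression": expression, "patternType": "RegEx"}]


def matching_groups(address, groups):
    """Return the groups whose expression matches a (hypothetical) item address."""
    return [
        group
        for group in groups
        if re.search(address_patterns(*group)[0]["expression"], address)
    ]


for first, last in letter_groups():
    config = {"addressPatterns": address_patterns(first, last)}
    print(f"Source {first}-{last}:")
    print(json.dumps(config, indent=2))
```

Note that an email address beginning with a digit would not match any of these letter ranges, so one of the sources may need a broader character class to avoid leaving such drives unindexed.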
A word on indexing pipeline extensions
Indexing pipeline extensions (IPEs) offer a powerful way to customize the indexing process. However, whereas your Google Drive source configurations are applied in the crawling stage of the Coveo indexing pipeline, IPEs are applied later, in the document processing manager (DPM).
Using an IPE to reject items doesn’t reduce the number of items crawled and, therefore, only adds to the total time required before source items are indexed.
Use IPEs only as a last resort, when native filtering within the Google Drive source isn’t possible.