Google Drive source configuration leading practices
Google Drives can contain large volumes of content, making it essential to index only the items relevant to your search interface users. By excluding unnecessary content from indexing, you reduce the indexing time and improve search relevance.
This article presents Google Drive source scoping features and indexing strategies that can significantly improve indexing performance.
Index only required user drives
In the source configuration Content to index subtab, under Select users to index, select Specific users and enter the emails of the users whose drives you want to index.
Alternatively, you can select All users and exclude specific users' drives by adding URL exclusions.
Suppose you want to index all users' drives except for those belonging to bfranklin@abc.com and jwhite@abc.com.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, then click Edit in the Action bar.
-
In the Configuration tab, select the Content to index subtab.
-
In the Exclusions section, click Add rule.
-
In the new rule dropdown menu, select contains.
-
In the text box next to the dropdown menu, enter the following:
User:bfranklin@abc.com
-
Click Add rule. Repeat the previous steps to add a contains exclusion rule for the following text:
User:jwhite@abc.com
-
In the Inclusions section, ensure Include all non-excluded items is selected. This inclusion rule ensures that all remaining users' drives are indexed.
-
Click Save and rebuild source.
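The effect of these two contains rules can be sketched in Python. This is only an illustration of the matching logic, not a description of how the source applies it internally. The URL layout in `sample_urls` and the function name `is_excluded` are hypothetical and were chosen for this example.

```python
# Substrings entered in the two "contains" exclusion rules.
EXCLUSION_SUBSTRINGS = ["User:bfranklin@abc.com", "User:jwhite@abc.com"]


def is_excluded(url: str) -> bool:
    """Return True when the URL contains at least one exclusion substring."""
    return any(fragment in url for fragment in EXCLUSION_SUBSTRINGS)


# Hypothetical item URLs; the real URL layout may differ.
sample_urls = [
    "https://drive.example/Root:GoogleDrive/User:bfranklin@abc.com/report.docx",
    "https://drive.example/Root:GoogleDrive/User:jwhite@abc.com/notes.txt",
    "https://drive.example/Root:GoogleDrive/User:asmith@abc.com/plan.pdf",
]

# With "Include all non-excluded items" selected, everything else is kept.
indexed = [url for url in sample_urls if not is_excluded(url)]
```

In this sketch, only the item under the asmith@abc.com drive remains in `indexed`.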
Index only specific shared drives
By default, the Google Drive source only indexes the specified users' My Drive content. However, you can configure the source to index content from the users' Shared drives as well, or only from the Shared drives.
To index only shared drives content:
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click Edit in the Action bar.
-
In the Configuration tab, select the Content to index subtab.
-
Under Select users to index, choose Specific users and enter the emails of the users whose shared drives you want to index.
Note: The source indexes a shared drive only if the specified user is set as the manager of that shared drive.
-
Under Select the type of drives to index, check only the Index shared drives box.
-
Click Save and rebuild source.
For maximum control over indexed content and easier source management, consider creating a dedicated user account and indexing only the handpicked drives you share with this account.
Index only items modified within a rolling period
By default, the Google Drive source indexes all items, regardless of when they were last modified. You can configure the source to index only items modified within a rolling period.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click Edit in the Action bar.
-
In the Content to index subtab, in the Additional exclusions section, use the two controls in the Exclude older content section to specify the number of units and the time period respectively.
-
Click Save and rebuild source.
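To make the idea of a rolling period concrete, here is a minimal Python sketch of the cutoff computation. It assumes the period is counted back from the time of the update. The function name `is_too_old` and the unit names `days` and `weeks` are assumptions for this example; the units offered in the source configuration may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_too_old(last_modified: datetime, amount: int, unit: str,
               now: Optional[datetime] = None) -> bool:
    """Return True when an item was last modified before the rolling cutoff."""
    if unit not in ("days", "weeks"):
        raise ValueError(f"unsupported unit in this sketch: {unit}")
    now = now or datetime.now(timezone.utc)
    # The cutoff moves forward every time the source is updated,
    # which is what makes the period "rolling".
    cutoff = now - timedelta(**{unit: amount})
    return last_modified < cutoff


# Example with a fixed "now" so the result is reproducible.
fixed_now = datetime(2024, 6, 30, tzinfo=timezone.utc)
old_item = datetime(2024, 1, 1, tzinfo=timezone.utc)
recent_item = datetime(2024, 6, 1, tzinfo=timezone.utc)

old_is_excluded = is_too_old(old_item, 90, "days", now=fixed_now)
recent_is_excluded = is_too_old(recent_item, 90, "days", now=fixed_now)
```

With a 90-day period, the item modified in January falls outside the window, while the one modified in June stays inside it.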
Exclude MIME types
The Google Drive source lets you specify Google Drive supported MIME types that you want to exclude from indexing.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click Edit in the Action bar.
-
In the Content to index subtab, in the Additional exclusions section, use the Exclude MIME types dropdown menu to select all types you want to exclude. You can also add your own MIME types to the exclusion list.
-
Click Save and rebuild source.
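As an illustration of what a MIME type exclusion list does, the following Python sketch filters a small set of items. The item names, the chosen MIME types, and the helper `keep_item` are hypothetical; they aren't taken from the source configuration.

```python
# Hypothetical exclusion list, similar to what you might select in the dropdown menu.
EXCLUDED_MIME_TYPES = {"video/mp4", "image/png", "application/zip"}


def keep_item(mime_type: str) -> bool:
    """Return True when the item's MIME type isn't in the exclusion list."""
    return mime_type not in EXCLUDED_MIME_TYPES


# Hypothetical (name, MIME type) pairs for a few Drive items.
items = [
    ("Quarterly plan", "application/vnd.google-apps.document"),
    ("Team offsite", "video/mp4"),
    ("Budget", "application/vnd.google-apps.spreadsheet"),
    ("Logo", "image/png"),
]

kept = [name for name, mime_type in items if keep_item(mime_type)]
```

Here, only the Google Docs document and the Google Sheets spreadsheet remain in `kept`.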
Split your source and parallelize updates
Google limits the number of Drive API calls per minute, but the limit is generous, and there’s no per-day limit.
You can take advantage of the generous per-minute limit by splitting your Google Drive source into smaller ones and scheduling their updates to run in parallel.
Note
The Google Drive source supports refresh and rescan scheduled source updates. You should understand the differences between the two source update types and set scheduled update frequencies that make sense in your context.
One way to split your source is by using URL inclusions to organize user drives into alphabetical groups.
Suppose you have a Google Drive source that indexes content from All users, and you want to split it into multiple sources.
The first source would index My Drive content from users whose email addresses start with the letters a-d. The second source would then cover users whose email addresses start with e-h, and so on.
-
On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Duplicate.
-
In the Duplicate a source panel, enter a name (that contains "a-d") for your new source. Then, click Duplicate source.
-
Return to the Sources (platform-ca | platform-eu | platform-au) page, click your new Google Drive source, and then click Edit in the Action bar.
-
In the Configuration tab, select the Content to index subtab.
-
In the Exclusions section, ensure you have no rules configured.
-
In the Inclusions section, click Include non-excluded items that match at least one rule.
-
Click Add rule.
-
Under Include an item if its URL, in the dropdown menu, select matches regex rule.
-
In the text box next to the dropdown menu, enter the following:
.*/Root:GoogleDrive($|/User:[a-d]).*
-
Click Save.
-
Repeat the previous steps for the remaining groups of users. For example, in your next source, you would use
.*/Root:GoogleDrive($|/User:[e-h]).*
for your matches regex rule.
-
At the end of the process, delete the original source.
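Before deleting the original source, it can help to check that the groups cover every user. The Python sketch below builds one pattern per group from the regex shown above and reports which groups a URL falls into. The groups after e-h, the sample URLs, and the helper names are assumptions for this example. Python's `re` module is case-sensitive by default, which may not reflect how the source evaluates the rule.

```python
import re


def group_pattern(first: str, last: str) -> str:
    """Build the inclusion regex for one alphabetical group of users."""
    return rf".*/Root:GoogleDrive($|/User:[{first}-{last}]).*"


# Hypothetical split; only a-d and e-h come from the example above.
GROUPS = [("a", "d"), ("e", "h"), ("i", "l"), ("m", "p"), ("q", "t"), ("u", "z")]
PATTERNS = {f"{a}-{b}": re.compile(group_pattern(a, b)) for a, b in GROUPS}


def matching_groups(url: str) -> list:
    """Return the names of all groups whose regex matches the URL."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(url)]


# Hypothetical URLs; the real URL layout may differ.
root_url = "https://drive.example/Root:GoogleDrive"
user_url = "https://drive.example/Root:GoogleDrive/User:bfranklin@abc.com/report.docx"
digit_url = "https://drive.example/Root:GoogleDrive/User:9lives@abc.com/notes.txt"

root_groups = matching_groups(root_url)
user_groups = matching_groups(user_url)
digit_groups = matching_groups(digit_url)
```

In this sketch, the root URL matches every group because of the `$` alternative, a user URL matches exactly one group, and an email address that starts with a digit matches none. An empty result like the last one points to drives that no split source would index.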
A word on indexing pipeline extensions
Indexing pipeline extensions (IPEs) offer a powerful way to customize the indexing process. However, whereas your Google Drive source configurations are applied in the crawling stage of the Coveo indexing pipeline, indexing pipeline extensions are applied in the document processing manager (DPM).
Using an IPE to reject items doesn’t reduce the number of items crawled and, therefore, only adds to the total time required before source items are indexed.
Use IPEs only as a last resort, when native filtering within the Google Drive source isn’t possible.