Google Drive source configuration leading practices

Google Drives can contain large volumes of content, making it essential to index only the items relevant to your search interface users. By excluding unnecessary content from indexing, you reduce the indexing time and improve search relevance.

This article presents Google Drive source scoping features and indexing strategies that can significantly improve indexing performance.

Index only required user drives

On the Add a Google Drive source page, in the Users to include section, All is the default value. You can instead select Specific and specify the users whose drives you want to index.

Alternatively, you can select All and exclude specific users' drives by adding exclusion address patterns to your source JSON configuration.

Example

Suppose you want to index all users' drives except for those belonging to bfranklin@abc.com and jwhite@abc.com.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.

  2. In the Edit configuration with JSON panel, locate the addressPatterns object.

  3. Add two "allowed": false address patterns to obtain the following configuration:

        "addressPatterns": [
          {
            "allowed": false,
            "expression": ".*User:bfranklin@abc.com.*",
            "patternType": "RegEx"
          },
          {
            "allowed": false,
            "expression": ".*User:jwhite@abc.com.*",
            "patternType": "RegEx"
          },
          {
            "allowed": true,
            "expression": "*",
            "patternType": "Wildcard"
          }
        ]

     The last pattern ("allowed": true) is an inclusion pattern that ensures all remaining users' drives are indexed. If you omit this pattern, no items will be indexed.

  4. Click Save to apply your change for future update operations or Save and rebuild source.
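
To sanity-check exclusion patterns before rebuilding, you can evaluate them against a few sample addresses. The following Python sketch does this for the configuration above. It rests on assumptions that this article doesn't confirm: the sample item addresses are hypothetical (the patterns only require that an address contain a User:<email> segment), Python's re and fnmatch modules stand in for the source's RegEx and Wildcard matching, and an item is treated as indexed when it matches at least one inclusion pattern and no exclusion pattern.

```python
import fnmatch
import re

# Patterns copied from the example configuration above.
ADDRESS_PATTERNS = [
    {"allowed": False, "expression": ".*User:bfranklin@abc.com.*", "patternType": "RegEx"},
    {"allowed": False, "expression": ".*User:jwhite@abc.com.*", "patternType": "RegEx"},
    {"allowed": True, "expression": "*", "patternType": "Wildcard"},
]


def matches(pattern, address):
    """Check one pattern against an address (Python stand-in for the real matcher)."""
    if pattern["patternType"] == "RegEx":
        return re.search(pattern["expression"], address) is not None
    return fnmatch.fnmatchcase(address, pattern["expression"])


def is_indexed(address, patterns=ADDRESS_PATTERNS):
    """Assumed rule: at least one inclusion matches and no exclusion matches."""
    excluded = any(matches(p, address) for p in patterns if not p["allowed"])
    included = any(matches(p, address) for p in patterns if p["allowed"])
    return included and not excluded


# Hypothetical item addresses, for illustration only.
SAMPLES = [
    "googledrive://abc.com/Root:GoogleDrive/User:bfranklin@abc.com/File:1",
    "googledrive://abc.com/Root:GoogleDrive/User:jwhite@abc.com/File:2",
    "googledrive://abc.com/Root:GoogleDrive/User:asmith@abc.com/File:3",
]

for address in SAMPLES:
    print(address, "->", "indexed" if is_indexed(address) else "excluded")
```

With the patterns listed in this order, a first-match-wins evaluation would give the same results, so the sketch doesn't depend on which of the two rules the source actually applies.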

Index only specific shared drives

By default, the Google Drive source only indexes a user’s My Drive content. However, you can configure the source to index content from the user’s Shared drives as well, or exclusively from the Shared drives.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.

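  2. In the Edit configuration with JSON panel, set the IndexUserDrives and IndexSharedDrives parameters to true or false, depending on your requirements. For example, to index Shared drives exclusively, set IndexUserDrives to false and IndexSharedDrives to true.

  3. Click Save to apply your change for future update operations or Save and rebuild source.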

For maximum control over indexed content and easier source management, consider creating a dedicated user account and indexing only the handpicked drives that you share with this account.

Index only items modified within a rolling period

By default, the Google Drive source indexes all items, regardless of when they were last modified. You can configure the source to index only items modified within a rolling period.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.

  2. Configure the ItemsModificationPeriod parameter to specify the rolling period. See Configuring unlisted parameters and ItemsModificationPeriod for details.

  3. Click Save to apply your change for future update operations or Save and rebuild source.
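
Conceptually, a rolling period is a cutoff that moves forward with time rather than a fixed date. The Python sketch below illustrates that idea only. It isn't the connector's implementation, and the function name, the 90-day period, and the sample dates are hypothetical. For the actual value format of ItemsModificationPeriod, refer to its documentation.

```python
from datetime import datetime, timedelta, timezone


def within_rolling_period(last_modified, period, now=None):
    """Return True if last_modified falls within `period` before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return last_modified >= now - period


period = timedelta(days=90)  # hypothetical rolling period
modified = datetime(2024, 1, 15, tzinfo=timezone.utc)  # hypothetical item date

# Evaluated on March 1, 2024, the item is inside the 90-day window.
print(within_rolling_period(modified, period, now=datetime(2024, 3, 1, tzinfo=timezone.utc)))  # True

# Evaluated on June 1, 2024, the same item has fallen outside the window.
print(within_rolling_period(modified, period, now=datetime(2024, 6, 1, tzinfo=timezone.utc)))  # False
```

The same item qualifies at the first evaluation date but not at the second, which is what distinguishes a rolling period from a fixed cutoff date.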

Exclude MIME types

The Google Drive source lets you specify Google Drive supported MIME types that you want to exclude from indexing.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Edit configuration with JSON.

  2. Configure the MimeTypesToIgnore parameter to specify the MIME types to exclude. See Configuring unlisted parameters and MimeTypesToIgnore for details.

  3. Click Save to apply your change for future update operations or Save and rebuild source.

Split your source and parallelize updates

Google limits the number of Drive API calls per minute, but the limit is generous, and there's no per-day limit.

You can take advantage of the generous per-minute limit by splitting your Google Drive source into smaller ones and scheduling their updates to run in parallel.

Note

The Google Drive source supports refresh and rescan scheduled source updates. You should understand the differences between these two update types and set scheduled update frequencies that make sense in your context.

One way to split your source is by using address patterns to organize user drives into alphabetical groups.

Example

You have a Google Drive source that indexes content from All user drives. Now, you want to split this source into multiple sources.

Suppose you want the first source to index My Drive content from users whose email addresses start with the letters a–d. The second source would then cover users whose email addresses start with the letters e–h, and so on.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click your Google Drive source, and then click More > Duplicate.

  2. In the Duplicate a source panel, enter a name (that contains "a–d") for your new source. Then, click Duplicate source.

  3. Return to the Sources (platform-ca | platform-eu | platform-au) page, click your new Google Drive source, and then click More > Edit configuration with JSON.

  4. In the Edit configuration with JSON panel, locate the addressPatterns object.

  5. Enter the following addressPatterns configuration:

        "addressPatterns": [
          {
            "allowed": true,
            "expression": ".*/Root:GoogleDrive($|/User:[a-d]).*",
            "patternType": "RegEx"
          }
        ]

  6. Click Save.

  7. Repeat the previous steps for the remaining groups of users. For example, in your next source, you would use ".*/Root:GoogleDrive($|/User:[e-h]).*" in the address pattern.

  8. At the end of the process, delete the original source.
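
When you split a source into many alphabetical groups, generating the addressPatterns blocks with a script can help avoid copy-paste mistakes. The Python sketch below builds one configuration per group from the expression shown in step 5. The helper names, the group size of four letters, and the sample addresses are hypothetical, and the matching check uses Python's re module as a stand-in for the source's RegEx engine.

```python
import json
import re
import string


def letter_groups(size=4):
    """Split a-z into consecutive groups: 'abcd', 'efgh', ..., 'yz'."""
    letters = string.ascii_lowercase
    return [letters[i:i + size] for i in range(0, len(letters), size)]


def address_patterns_for(group):
    """Build the addressPatterns list for one group, following step 5."""
    char_class = f"[{group[0]}-{group[-1]}]"
    return [{
        "allowed": True,
        "expression": f".*/Root:GoogleDrive($|/User:{char_class}).*",
        "patternType": "RegEx",
    }]


def groups_matching(address, size=4):
    """Return the groups whose pattern matches the given item address."""
    return [
        group for group in letter_groups(size)
        if re.search(address_patterns_for(group)[0]["expression"], address)
    ]


# Print the JSON fragment to paste into each duplicated source.
for group in letter_groups():
    print(f"Source {group[0]}-{group[-1]}:")
    print(json.dumps({"addressPatterns": address_patterns_for(group)}, indent=2))

# Hypothetical addresses: the first belongs to one group, the second to none.
print(groups_matching("googledrive://abc.com/Root:GoogleDrive/User:evan@abc.com/File:1"))
print(groups_matching("googledrive://abc.com/Root:GoogleDrive/User:42ops@abc.com/File:1"))
```

In this sketch, an address whose user segment starts with a digit or any other character outside a–z matches no group, so it's worth checking that every user is covered by one of the new sources before you delete the original one.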

A word on indexing pipeline extensions

Indexing pipeline extensions (IPEs) offer a powerful way to customize the indexing process. However, whereas your Google Drive source configurations are applied in the crawling stage of the Coveo indexing pipeline, IPEs are applied later, in the document processing manager (DPM).

Using an IPE to reject items doesn’t reduce the number of items crawled and, therefore, only adds to the total time required before source items are indexed.

Use IPEs only as a last resort, when native filtering within the Google Drive source isn’t possible.