Add an Amazon S3 source

Amazon Simple Storage Service (S3) is a cloud-based storage service designed to store, manage, and distribute large quantities of data worldwide. Members with the required privileges can add the content of Amazon S3 buckets to a Coveo organization. Coveo then indexes the Amazon S3 files to make them searchable.

Tip
Leading practice

The number of items that a source processes per hour (crawling speed) depends on various factors, such as network bandwidth and source configuration. See About crawling speed for information on what can impact crawling speed, as well as possible solutions.

Source key characteristics

Features | Supported | Additional information
Amazon S3 version | Latest cloud version | Following available Amazon S3 releases
Indexable content[1] | Buckets[2] and objects (folders and files) |
Content update operations: refresh | No |
Content update operations: rescan | Yes | Takes place every day by default
Content update operations: rebuild | Yes |
Content security options: Same users and groups as in your content system | No |
Content security options: Specific users and groups | Yes |
Content security options: Everyone | Yes |

Add an Amazon S3 source

Follow the instructions below to add an Amazon S3 source.

  1. On the Sources (platform-ca | platform-eu | platform-au) page, click Add source.

  2. In the Add a source of content panel, click the Amazon S3 source tile.

  3. Configure your source.

Tip
Leading practice

It’s best to create or edit your source in your sandbox organization first. Once you’ve confirmed that it indexes the desired content, you can copy your source configuration to your production organization, either with a snapshot or manually.

See About non-production organizations for more information and best practices regarding sandbox organizations.

"Configuration" tab

In the Add an Amazon S3 source panel, the Configuration tab is selected by default. It contains your source’s general and authentication information, as well as other parameters.

General information

Source name

Enter a name for your source.

Tip
Leading practice

A source name can’t be modified once it’s saved. Use a short, descriptive name made up of letters, numbers, hyphens (-), and underscores (_). Avoid spaces and other special characters.

Amazon S3 bucket URL

Enter the address of one or more Amazon S3 buckets using one of the following formats:

  • Virtual-host style

Examples
  • http://<BUCKET>.s3.amazonaws.com/

  • http://<BUCKET>.s3.<AWS_REGION>.amazonaws.com/

where you replace <BUCKET> with the name of your actual bucket, and <AWS_REGION> with your region-specific endpoint.

  • Path style

Examples
  • http://s3.amazonaws.com/<BUCKET>

  • http://s3.<AWS_REGION>.amazonaws.com/<BUCKET>

where you replace <BUCKET> with the name of your actual bucket, and <AWS_REGION> with your region-specific endpoint.
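The two URL formats above can be sketched as small helper functions. This is only an illustration of the string patterns listed in this section; the function names and the sample bucket name are hypothetical and aren't part of Coveo or AWS.

```python
from typing import Optional


def virtual_host_url(bucket: str, region: Optional[str] = None) -> str:
    """Build a virtual-host style URL: the bucket name is part of the host."""
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    return f"http://{host}/"


def path_style_url(bucket: str, region: Optional[str] = None) -> str:
    """Build a path style URL: the bucket name is the first path segment."""
    host = f"s3.{region}.amazonaws.com" if region else "s3.amazonaws.com"
    return f"http://{host}/{bucket}"


# "my-example-bucket" and "eu-west-1" are placeholder values.
print(virtual_host_url("my-example-bucket"))
print(virtual_host_url("my-example-bucket", "eu-west-1"))
print(path_style_url("my-example-bucket"))
print(path_style_url("my-example-bucket", "eu-west-1"))
```

Either style can be entered in the Amazon S3 bucket URL box; the helpers only show where <BUCKET> and <AWS_REGION> go in each format.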

Notes
  • To exclude certain subfolders, first configure and save your source with a broad URL. Then, see Refine the content to index.

  • If a region isn’t specified in the URL, the source uses the US Standard (us-east-1) region endpoint by default.

  • When the URL points to a folder inside a bucket, only keys starting with that prefix will be crawled.

  • Replace any spaces in the bucket name with %20. For example, http://s3.<AWS_REGION>.amazonaws.com/doc example bucket should be replaced with http://s3.<AWS_REGION>.amazonaws.com/doc%20example%20bucket.
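The notes above describe three behaviors: falling back to us-east-1 when no region is given, crawling only keys that start with the folder prefix, and encoding spaces as %20. The sketch below models them in plain Python. It isn't Coveo's implementation; the helper names are hypothetical, and the region detection assumes hosts shaped like the examples in this section.

```python
from typing import List
from urllib.parse import urlsplit

DEFAULT_REGION = "us-east-1"  # US Standard, used when the URL names no region


def encode_spaces(url: str) -> str:
    """Replace every space in the URL with %20."""
    return url.replace(" ", "%20")


def region_of(url: str) -> str:
    """Return the region named in the host, or the default region if none.

    Assumes hosts shaped like [<bucket>.]s3[.<region>].amazonaws.com.
    """
    labels = (urlsplit(url).hostname or "").split(".")
    if "s3" in labels:
        idx = labels.index("s3")
        # s3.<region>.amazonaws.com has four labels starting at "s3"
        if len(labels) - idx == 4:
            return labels[idx + 1]
    return DEFAULT_REGION


def keys_under_prefix(keys: List[str], prefix: str) -> List[str]:
    """Keep only the object keys that start with the folder prefix."""
    return [key for key in keys if key.startswith(prefix)]


print(encode_spaces("http://s3.amazonaws.com/doc example bucket"))
print(region_of("http://s3.amazonaws.com/my-example-bucket"))
print(region_of("http://my-example-bucket.s3.eu-west-1.amazonaws.com/"))
print(keys_under_prefix(["docs/a.pdf", "img/b.png", "docs/sub/c.txt"], "docs/"))
```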

Optical character recognition (OCR)

If you want Coveo to extract text from image files or PDF files containing images, enable the appropriate option.

The extracted text is processed as item data, meaning that it’s fully searchable and will appear in the item Quick view. See Enable optical character recognition for details on this feature.

Project

If you have the Enterprise edition, use the Project selector to associate your source with one or multiple Coveo projects.

"Authentication" section

Fill in the appropriate boxes depending on whether your S3 bucket content is secured or public:

If your S3 bucket content is secured

If your S3 bucket content is secured, meaning not accessible by anonymous users, enter the AWS Access Key ID and AWS Secret Access Key values linked to an AWS Identity and Access Management (IAM) account. The IAM account must have at least the read permission on the bucket content to index.

See the Console Access section in the Understanding and Getting Your Security Credentials article for more details.
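As an illustration of "at least the read permission", the sketch below builds a minimal read-only IAM policy document for one bucket. This page doesn't list the exact actions Coveo needs, so the choice of s3:ListBucket and s3:GetObject is an assumption, and the function name and bucket name are hypothetical.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an IAM policy document granting list and read access to one bucket.

    Assumption: listing the bucket and reading its objects is sufficient.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading applies to the objects inside the bucket
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [f"{bucket_arn}/*"]},
        ],
    }


# "my-example-bucket" is a placeholder bucket name.
print(json.dumps(read_only_policy("my-example-bucket"), indent=2))
```

A policy like this would be attached to the IAM account whose AWS Access Key ID and AWS Secret Access Key are entered in the source configuration.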

If your S3 bucket content is public

If your S3 bucket content is public, meaning anonymous users can access the content, you may leave the AWS Access Key ID and AWS Secret Access Key boxes empty.

However, you must grant the List bucket permission to Everyone (public access) to prevent getting an authentication error such as: Coveo is unable to authenticate to access the specified bucket through the Amazon S3 API and consequently cannot perform any action regarding your source. Edit the configuration to review the provided AWS Access Key ID and AWS Secret Access Key ID.

Tip

Before building the source, in a browser, test your bucket URL (without a path), and validate that it returns an XML file listing the bucket content (keys). If you get a short Access denied XML error, the source will give an authentication error.
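The browser check described in this tip comes down to telling a bucket listing apart from an error document. The sketch below does that on two sample XML strings, so it runs without network access. The sample documents are abbreviated illustrations of the two response shapes, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def local_name(tag: str) -> str:
    """Strip the XML namespace, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def check_bucket_response(xml_text: str) -> Tuple[str, List[str]]:
    """Classify a bucket URL response as a key listing or an error."""
    root = ET.fromstring(xml_text)
    root_name = local_name(root.tag)
    if root_name == "ListBucketResult":
        keys = [el.text or "" for el in root.iter() if local_name(el.tag) == "Key"]
        return "listing", keys
    if root_name == "Error":
        code = next((el.text or "" for el in root if local_name(el.tag) == "Code"), "")
        return "error", [code]
    return "unknown", []


# Abbreviated sample responses, for illustration only.
listing_xml = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>my-example-bucket</Name>"
    "<Contents><Key>docs/a.pdf</Key></Contents>"
    "<Contents><Key>img/b.png</Key></Contents>"
    "</ListBucketResult>"
)
error_xml = "<Error><Code>AccessDenied</Code><Message>Access Denied</Message></Error>"

print(check_bucket_response(listing_xml))
print(check_bucket_response(error_xml))
```

A "listing" result corresponds to the XML file listing the bucket keys; an "error" result corresponds to the short Access denied response that leads to an authentication error in the source.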

"Content security" tab

Select who will be able to access the source items through a Coveo-powered search interface. For details on this parameter, see Content security.

"Access" tab

In the Access tab, set whether each group (and API key, if applicable) in your Coveo organization can view or edit the current source.

For example, when creating a new source, you could decide that members of Group A can edit its configuration while Group B can only view it.

See Custom access level for more information.

Completion

  1. Finish adding or editing your source:

    • When you want to save your source configuration changes without starting a build/rebuild, such as when you know you want to make other changes soon, click Add source/Save.

    • When you’re done editing the source and want to make changes effective, click Add and build source/Save and rebuild source.

      Note

      If you saved without building, then on the Sources (platform-ca | platform-eu | platform-au) page, you must click Launch build or Start required rebuild in the source Status column to add the source content or to make your changes effective, respectively.

      Back on the Sources (platform-ca | platform-eu | platform-au) page, you can follow the progress of your source addition or modification.

      Once the source is built or rebuilt, you can review its content in the Content Browser.

  2. Once your source is done building or rebuilding, review the metadata Coveo is retrieving from your content.

    1. On the Sources (platform-ca | platform-eu | platform-au) page, click your source, and then click More > View and map metadata in the Action bar.

    2. To use metadata that isn’t currently indexed in a facet or result template, map it to a field.

      1. Click the metadata and then, at the top right, click Add to Index.

      2. In the Apply a mapping on all item types of a source panel, select the field you want to map the metadata to, or add a new field if none of the existing fields are appropriate.

        Notes
        • For details on configuring a new field, see Add or edit a field.

        • For advanced mapping configurations, like applying a mapping to a specific item type, see Manage mappings.

      3. Click Apply mapping.

    3. Depending on the source type you use, you may be able to extract additional metadata from your content. You can then map that metadata to a field, just like you did for the default metadata.

      More on custom metadata extraction and indexing

      Some source types let you define rules to extract metadata beyond the default metadata Coveo discovers during the initial source build.

      For example:

      Source type | Custom metadata extraction methods
      Push API | Define metadata key-value pairs in the addOrUpdate section of the PUT request payload used to upload push operations to an Amazon S3 file container.
      REST API, GraphQL API | In the JSON configuration (REST API or GraphQL API) of the source, define metadata names (REST API or GraphQL API) and specify where to locate the metadata values in the JSON API response Coveo receives.
      Database | Add <CustomField> elements in the XML configuration. Each element defines a metadata name and the database field used to populate the metadata.
      Web, Sitemap |

      Some source types automatically map metadata to default or user-created fields, making the mapping process unnecessary. Others automatically create mappings and fields for you when you configure metadata extraction.

      See your source type documentation for more details.

    4. When you’re done reviewing and mapping metadata, return to the Sources (platform-ca | platform-eu | platform-au) page.

    5. To reindex your source with your new mappings, click Launch rebuild in the source Status column.

    6. Once the source is rebuilt, you can review its content in the Content Browser.

Refine the content to index

You may want to avoid indexing certain subfolders, or to index only a few of them. To do so:

  1. If not already done, create and save your source with a broad bucket URL.

  2. In your source JSON configuration, enter an address filter to refine the targeted content.

    Important

    Your bucket URL must match one of your inclusion addressPatterns and not match any of your exclusion addressPatterns.

  3. Build or rebuild your source.
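The rule in the Important note above (match at least one inclusion pattern and no exclusion pattern) can be sketched as follows. This is a simplified model, not Coveo's matching engine: it uses shell-style wildcards from Python's fnmatch module, and the "expression" and "allowed" keys as well as the sample URLs are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical pattern entries: "allowed" True marks an inclusion pattern,
# False marks an exclusion pattern.
patterns: List[Dict] = [
    {"expression": "http://s3.amazonaws.com/my-example-bucket*", "allowed": True},
    {"expression": "http://s3.amazonaws.com/my-example-bucket/archive/*", "allowed": False},
]


def is_indexed(url: str, address_patterns: List[Dict]) -> bool:
    """Return True if the URL matches an inclusion and no exclusion pattern."""
    included = any(
        p["allowed"] and fnmatchcase(url, p["expression"]) for p in address_patterns
    )
    excluded = any(
        not p["allowed"] and fnmatchcase(url, p["expression"]) for p in address_patterns
    )
    return included and not excluded


# The bucket URL itself must pass the filter, as the Important note requires.
print(is_indexed("http://s3.amazonaws.com/my-example-bucket", patterns))
print(is_indexed("http://s3.amazonaws.com/my-example-bucket/docs/a.pdf", patterns))
print(is_indexed("http://s3.amazonaws.com/my-example-bucket/archive/old.pdf", patterns))
```

Checking the bucket URL against the patterns this way before rebuilding can help catch a filter that accidentally excludes the starting address.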

Required privileges

You can assign privileges to allow access to specific tools in the Coveo Administration Console. The following table indicates the privileges required to view or edit elements of the Sources (platform-ca | platform-eu | platform-au) page and associated panels. See Manage privileges and Privilege reference for more information.

Note

The Edit all privilege isn’t required to create sources. When granting privileges for the Sources domain, you can grant a group or API key the View all or Custom access level, instead of Edit all, and then select the Can Create checkbox to allow users to create sources. See Can Create ability dependence for more information.

Actions | Service | Domain | Required access level
View sources, view source update schedules, and subscribe to source notifications | Content | Fields | View
 | Content | Sources | View
 | Organization | Organization | View
Edit sources, edit source update schedules, and view the View and map metadata subpage | Content | Fields | Edit
 | Content | Sources | Edit
 | Content | Source metadata | View
 | Organization | Organization | View

What’s next?


1. An access key is needed to connect to the Amazon Web Services (AWS) service through the software development kit (SDK). The access key is a way to authenticate from the SDK as an Identity and Access Management (IAM) account. The number of requests is unlimited, but you’re charged for every request to your Amazon S3 buckets.
2. Amazon S3 Requester Pays buckets aren’t supported.