Keep an index up to date

Once created, a source must periodically retrieve data from its original content repository to maintain up-to-date items (and possibly permissions) in your index. Most sources support three different ways to update their content: refresh, rescan, or rebuild.

This article offers guidelines on scheduling and triggering source updates and monitoring the indexing process. The following diagram and table outline how each source update operation affects the Coveo indexing pipeline at various stages.

[Diagram: Coveo indexing pipeline activity with source updates]
Indexing pipeline stage | Refresh | Rescan | Rebuild
--- | --- | --- | ---
Crawling | N/A | Crawl the entire content repository | Crawl the entire content repository
DPM | Receive and process an incremental update retrieved directly from the indexed system (for example, using an API) | Receive and process only new or updated crawled content | Receive and process all crawled content
Indexing | Add or overwrite new and updated items (with possible limitations); keep unchanged items; remove deleted items (with possible limitations) | Add or overwrite new and updated items; keep unchanged items; remove deleted items | Add all items; remove all previously indexed items

Note

Each connector available in the Coveo Platform implements its own rebuild, rescan, and possibly refresh mechanisms.

However, to leverage a Push source, you must essentially design and host your own custom connector relying on the Push API. This implies that you’re fully responsible for implementing and scheduling the execution of your Push source update mechanisms. The Push API lets you set the status of your Push source so that your activity logs remain coherent.
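
As a rough illustration of setting a Push source status from a custom connector, the following minimal sketch builds (but doesn't send) the corresponding HTTP requests. The base URL, the path shape, the `statusType` parameter, and its accepted values are assumptions here and should be checked against the Push API reference. The `build_status_request` helper, the organization and source IDs, and the API key placeholder are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: base URL and path shape; verify against the Push API reference.
PUSH_API_BASE = "https://api.cloud.coveo.com/push/v1"

# Assumption: status values accepted by the Push API.
VALID_STATUSES = {"REBUILD", "REFRESH", "INCREMENTAL", "IDLE"}


def build_status_request(org_id: str, source_id: str, status: str, api_key: str) -> Request:
    """Build (but don't send) a request that sets the status of a Push source."""
    if status not in VALID_STATUSES:
        raise ValueError(f"Unsupported status: {status!r}")
    url = (
        f"{PUSH_API_BASE}/organizations/{quote(org_id, safe='')}"
        f"/sources/{quote(source_id, safe='')}/status?"
        + urlencode({"statusType": status})
    )
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return Request(url, method="POST", headers=headers)


# Hypothetical sequence: flag the update, push items (not shown), then go idle.
start = build_status_request("myorg", "mysource", "REFRESH", "<API_KEY>")
done = build_status_request("myorg", "mysource", "IDLE", "<API_KEY>")
print(start.get_method(), start.full_url)
print(done.get_method(), done.full_url)
```

Sending each request (for example, with `urllib.request.urlopen`) before and after an update would let the activity logs reflect what your connector is doing.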

Trigger a full source rebuild

A full source rebuild re-indexes an entire content repository from scratch. This type of source update can weigh heavily on the computing resources of the indexed system, as well as on those of the Coveo Platform, especially when a large content repository must be re-indexed. For this reason, you can only trigger a source rebuild manually.

Typically, you should only rebuild a source when:

  • You create that source, and its initial build process didn’t start automatically or was canceled.

  • You modify one or more of the following settings on that source:

    • Its basic configuration (for example, you edit the web scraping configuration of a Sitemap source).

    • Its mappings (for example, you remove a mapping from a source).

    • Its indexing pipeline extensions (IPEs) (for example, you associate a new post-conversion IPE with a source).

      Note

      If you modify an IPE that applies to many sources, you should rebuild each of those sources.

  • You enable or disable one of the following options on a field that’s associated with that source:

    • Search operator

    • Displayable in results

    • Free text search

    • Ranking

    • Stemming

    Any time you change a setting that must apply to all items in a source (that is, to all unmodified, modified, and new items), you should rebuild that source.
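
The rebuild criteria above can be summarized as a small decision helper. This is only a sketch: the setting and option labels, as well as the `needs_rebuild` function, are hypothetical names chosen for illustration and aren't part of any Coveo API.

```python
# Hypothetical labels for the source settings and field options listed above.
SOURCE_SETTINGS = {"basic_configuration", "mappings", "indexing_pipeline_extensions"}
FIELD_OPTIONS = {
    "search_operator",
    "displayable_in_results",
    "free_text_search",
    "ranking",
    "stemming",
}


def needs_rebuild(changed_settings, changed_field_options=(), initial_build_done=True):
    """Return True when the described change should be followed by a rebuild."""
    if not initial_build_done:
        # The source was created but never fully built.
        return True
    if SOURCE_SETTINGS & set(changed_settings):
        return True
    return bool(FIELD_OPTIONS & set(changed_field_options))


print(needs_rebuild({"mappings"}))
print(needs_rebuild(set(), {"stemming"}))
# "update_schedule" is simply a label that isn't among the listed triggers.
print(needs_rebuild({"update_schedule"}))
```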

Schedule and trigger incremental source updates

You'll typically schedule incremental updates on each of your sources to ensure that your indexed content remains up to date. When you need some content repository changes to become available in your index quickly, you can also trigger an incremental source update manually.

There are two types of incremental source updates:

  • Refresh

    Retrieves an incremental update directly from an indexed system (for example, by leveraging an exposed REST API).

  • Rescan

    Re-crawls an entire content repository to retrieve new and updated content.

A refresh consumes fewer computing resources than a rescan. However, while all connectors support rescans, several don’t support refreshes at all, or have limited refresh capabilities. For example, upon refreshing, some data, metadata, or permissions may not be updated, or deleted items may not be removed from the index.

Consequently, a source that supports refreshing should normally have both a refresh and a rescan schedule. The rescan will clean up the index after several refreshes have occurred.
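
To make the cleanup role of a rescan concrete, here is a toy model in which the index and the repository are plain dictionaries. It's a simplified sketch, not how any Coveo connector is implemented. The `refresh` and `rescan` functions and the `handles_deletions` flag are hypothetical, and the refresh limitation modeled here (missed deletions) is only one of the possible limitations mentioned above.

```python
def refresh(index, changes, handles_deletions=False):
    """Apply an incremental change feed; a value of None marks a deleted item."""
    updated = dict(index)
    for item_id, content in changes.items():
        if content is None:
            # A limited refresh mechanism may not remove deleted items.
            if handles_deletions:
                updated.pop(item_id, None)
        else:
            updated[item_id] = content  # add or overwrite
    return updated


def rescan(index, repository):
    """Re-crawl the whole repository and reconcile the index with it."""
    # Remove items that no longer exist in the repository.
    updated = {k: v for k, v in index.items() if k in repository}
    for item_id, content in repository.items():
        if updated.get(item_id) != content:
            updated[item_id] = content  # add or overwrite new and updated items
    return updated


index = {"a": "v1", "b": "v1"}
repository = {"a": "v2", "c": "v1"}        # "a" updated, "b" deleted, "c" added
changes = {"a": "v2", "b": None, "c": "v1"}

after_refresh = refresh(index, changes)    # "b" lingers in the index
after_rescan = rescan(after_refresh, repository)
print(after_refresh)
print(after_rescan)
```

In this model, the stale item left behind by the refresh only disappears once the rescan runs.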

Note

To figure out whether a given source connector supports refresh and, if so, whether that connector has any refresh limitations, see the documentation on managing sources based on that connector.

Determine incremental source update frequency

Important

Non-generic Coveo connectors supporting incremental source updates (for example, SharePoint, Jive, and YouTube) have default schedules that are adapted to the system they connect to, as well as to crawler performance.

These default schedules fit most use cases and should only be modified to reduce the load on an indexed system or to ensure that a small source is updated more often.

The refresh and rescan frequency of a source should depend on two factors:

  • How often the original content repository changes.

  • How quickly original content repository changes need to become available in your index.

In addition, when a source supports refresh, its rescan frequency should depend on the quality of those refreshes. A source that has a highly reliable refresh mechanism may only require a monthly rescan to handle edge-case indexing issues. In contrast, a source with a flawed refresh mechanism might require a daily rescan schedule.

Examples
  • A Jira source that indexes an instance with many active users could refresh every 15 to 30 minutes, and rescan every Sunday.

  • A Sitemap source that indexes a relatively stable intranet could refresh every weekday, and rescan every Sunday.

  • A source that doesn’t support refresh but indexes content that changes several times a day could rescan every 4 to 6 hours.

  • A source that indexes entirely static content shouldn’t have a refresh or rescan schedule at all.
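
The examples above can be approximated with a small scheduling helper. This is a sketch under simplifying assumptions: it uses fixed intervals rather than calendar rules such as "every Sunday", and the `due_update` function and its parameters are hypothetical rather than part of the Coveo Platform.

```python
from datetime import datetime, timedelta


def due_update(now, last_refresh, last_rescan, refresh_every, rescan_every):
    """Return 'rescan', 'refresh', or None; an interval of None means unscheduled."""
    # A rescan takes priority, since it also covers what a refresh would do.
    if rescan_every is not None and now - last_rescan >= rescan_every:
        return "rescan"
    if refresh_every is not None and now - last_refresh >= refresh_every:
        return "refresh"
    return None


now = datetime(2024, 1, 8, 12, 0)

# Busy Jira-like source: refresh every 15 minutes, rescan weekly.
busy = due_update(now, now - timedelta(minutes=20), now - timedelta(days=2),
                  timedelta(minutes=15), timedelta(days=7))

# Same source, but the last rescan was more than a week ago.
overdue = due_update(now, now - timedelta(minutes=20), now - timedelta(days=8),
                     timedelta(minutes=15), timedelta(days=7))

# Entirely static content: no refresh or rescan schedule at all.
static = due_update(now, now - timedelta(days=90), now - timedelta(days=90),
                    None, None)

print(busy, overdue, static)
```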

Note

Updating a source taxes computing resources in the Coveo Platform and in the indexed system itself. Therefore, you should be careful to schedule source updates to be performed only as frequently as required.

Monitor the indexing process

There are two ways you can monitor the indexing process:

What’s next?

The Explore indexed content article explains how you can navigate and inspect indexed content through the Coveo Administration Console.