Keeping an Up-To-Date Index

Once created, a source must periodically retrieve data from its original content repository to maintain up-to-date items (and possibly permissions) in your index. Most sources support three different ways to update their content: refresh, rescan, and rebuild (see Refresh VS Rescan VS Rebuild).

This article offers guidelines on scheduling and triggering source updates, and monitoring the overall indexing process. The following diagram and table outline how each source update type affects the Coveo™ Cloud indexing pipeline at various stages (see About the Indexing Process).

Coveo Cloud indexing pipeline activity diagram (with source updates)

Indexing pipeline stage: Crawling
  • Refresh: N/A
  • Rescan and rebuild: Crawl the entire content repository

Indexing pipeline stage: DPM
  • Refresh: Receive and process an incremental update fetched directly from the indexed system (e.g., using an API)
  • Rescan: Receive and process only new/updated crawled content
  • Rebuild: Receive and process all crawled content

Indexing pipeline stage: Indexing
  • Refresh:
    • Add/overwrite new/updated items (with possible limitations)
    • Keep unchanged items
    • Remove deleted items (with possible limitations)
  • Rescan:
    • Add/overwrite new/updated items
    • Keep unchanged items
    • Remove deleted items
  • Rebuild:
    • Add all items
    • Remove all previously indexed items

Each connector available in the Coveo Platform implements its own rebuild, rescan, and possibly refresh mechanisms.

However, to leverage a Push source, you must essentially design and host your own custom connector relying on the Push API (see Push API). This implies that you’re fully responsible for implementing and scheduling the execution of your Push source update mechanisms. The Push API allows you to set the status of your Push source so that your activity logs remain coherent (see Updating the Status of a Push Source).
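As a rough illustration of setting a Push source status, the sketch below only builds the HTTP request and doesn't send it. The base URL, the `statusType` query parameter, and the list of status values are assumptions recalled from the Push API reference and should be verified there; `build_status_request` and the identifiers passed to it are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumed values; check them against the Push API reference before use.
PUSH_API_BASE = "https://api.cloud.coveo.com/push/v1"
STATUS_TYPES = {"REBUILD", "REFRESH", "INCREMENTAL", "IDLE"}


def build_status_request(organization_id, source_id, status_type, api_key):
    """Build (but do not send) a request that sets a Push source status."""
    if status_type not in STATUS_TYPES:
        raise ValueError(f"Unknown status type: {status_type!r}")
    url = (
        f"{PUSH_API_BASE}/organizations/{quote(organization_id, safe='')}"
        f"/sources/{quote(source_id, safe='')}/status"
        f"?{urlencode({'statusType': status_type})}"
    )
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return Request(url, method="POST", headers=headers)


# Placeholder identifiers; a custom connector would use its own values.
request = build_status_request("myorg", "mysource-id", "REBUILD", "placeholder-key")
print(request.get_method(), request.full_url)
```

A custom connector would typically send one such request before it starts pushing items, and another with an idle status once the update completes, so that the activity logs reflect what the connector is doing.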

Trigger Full Source Rebuilds

A full source rebuild re-indexes an entire content repository from scratch. This type of source update can weigh heavily on the computing resources of the indexed system, as well as on those of the Coveo Platform, especially when a large content repository must be re-indexed. For this reason, you can only trigger a source rebuild manually (see Refresh, Rescan, or Rebuild Sources).

Typically, you should only rebuild a source when:

  • You create that source, assuming its initial build process didn’t start automatically, or was somehow canceled.

  • You modify one or more of the following settings on that source:

    • Its basic configuration (e.g., you edit the web scraping configuration of a Sitemap source).

    • Its mappings (e.g., you remove a mapping from a source).

    • Its indexing pipeline extensions (IPEs) (e.g., you associate a new post-conversion IPE to a source).

      If you modify an IPE that applies to many sources, you should rebuild each of those sources.

  • You enable or disable one of the following options on a field that’s associated with that source:

    • Search operator

    • Displayable in results

    • Free text search

    • Ranking

    • Stemming

    Essentially, anytime you change a setting that must apply to all items in a source (i.e., to all unmodified, modified, and new items), you should rebuild that source.
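The checklist above can be expressed as a small helper, for instance in a deployment script that decides whether to remind an administrator to rebuild. This is only a sketch: the setting labels and the `needs_rebuild` function are hypothetical names chosen for illustration, not part of any Coveo API.

```python
# Setting changes that the guidelines above say call for a full rebuild.
# The string labels are hypothetical names chosen for this sketch.
REBUILD_TRIGGERS = {
    "basic_configuration",
    "mappings",
    "indexing_pipeline_extensions",
    "field_search_operator",
    "field_displayable_in_results",
    "field_free_text_search",
    "field_ranking",
    "field_stemming",
}


def needs_rebuild(changed_settings, initial_build_completed=True):
    """Return True when the guidelines suggest rebuilding the source."""
    if not initial_build_completed:
        # The initial build never started or was canceled.
        return True
    return any(change in REBUILD_TRIGGERS for change in changed_settings)


print(needs_rebuild({"mappings"}))          # True
print(needs_rebuild({"update_schedule"}))   # False
```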

Schedule and Trigger Incremental Source Updates

You will typically configure incremental update schedules on each of your sources to ensure that your indexed content remains up to date (see Edit a Source Schedule). When you need some content repository changes to become available in your index quickly, you can also trigger an incremental source update manually (see Refresh, Rescan, or Rebuild Sources).

There are two types of incremental source updates:

  • Refresh

    Retrieves an incremental update directly from an indexed system (e.g., by leveraging an exposed REST API).

  • Rescan

    Re-crawls an entire content repository to retrieve new and updated content.

A refresh consumes fewer computing resources than a rescan. However, while all connectors support rescans, several don’t support refreshes at all, or have limited refresh capabilities (e.g., some data, metadata, and/or permissions may not be updated, or deleted items may not be removed from the index). Therefore, a source supporting refresh should normally have both a refresh and a rescan schedule, the purpose of the latter being to clean things up after several refreshes have occurred.
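To see why a rescan is useful even when refreshes run regularly, consider the toy model below. It represents the repository and the index as plain dictionaries and assumes a refresh mechanism that can't report deletions; the `refresh` and `rescan` functions are simplified stand-ins written for this sketch, not actual connector behavior.

```python
def refresh(index, changes, handles_deletions=True):
    """Apply an incremental change feed fetched from the indexed system."""
    for item_id, content in changes:
        if content is None:  # None marks a deletion in this sketch
            if handles_deletions:
                index.pop(item_id, None)
        else:
            index[item_id] = content  # add or overwrite


def rescan(index, repository):
    """Re-crawl the whole repository, then reconcile the index with it."""
    for item_id, content in repository.items():
        if index.get(item_id) != content:
            index[item_id] = content  # add or overwrite new/updated items
    for item_id in list(index):
        if item_id not in repository:
            del index[item_id]  # remove items deleted at the source


repository = {"a": "v1", "b": "v1"}
index = dict(repository)

# The repository changes: "a" is updated, "b" is deleted, "c" is created.
repository = {"a": "v2", "c": "v1"}
changes = [("a", "v2"), ("b", None), ("c", "v1")]

# A limited refresh that ignores deletions leaves "b" in the index.
refresh(index, changes, handles_deletions=False)
stale_after_refresh = sorted(set(index) - set(repository))

# The scheduled rescan cleans up what the refresh missed.
rescan(index, repository)
stale_after_rescan = sorted(set(index) - set(repository))

print(stale_after_refresh, stale_after_rescan)  # ['b'] []
```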

To find whether a given source connector supports refresh and, that being the case, whether that connector has any refresh limitations, see the documentation on managing sources based on that connector (see Connector Directory).

Determine Incremental Source Update Frequency

Non-generic Coveo connectors supporting incremental source updates (e.g., SharePoint, Jive, YouTube) have default schedules that are adapted to the system they connect to, as well as to crawler performance. Those default schedules fit most use cases and should typically only be modified to reduce the load on an indexed system, or to ensure that a small source is updated more often.

The refresh and rescan frequency of a source should essentially depend on two factors:

  • How often the original content repository changes.

  • How quickly original content repository changes need to become available in your index.

In addition, when a source supports refresh, its rescan frequency should depend on the quality of those refreshes. A source that has a highly reliable refresh mechanism may only require a monthly rescan to deal with a few corner case indexing issues. On the other hand, a source that has a very flawed refresh mechanism may require a daily rescan schedule.

For example:

  • A Jira source that indexes an instance with many active users could refresh every 15-30 minutes, and rescan every Sunday.

  • A Sitemap source that indexes a relatively stable intranet could refresh every weekday, and rescan every Sunday.

  • A source that doesn’t support refresh, but indexes content that changes several times a day, could be rescanned every 4-6 hours.

  • A source that indexes entirely static content shouldn’t have a refresh and/or rescan schedule at all.
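The reasoning above can be summarized as a simple lookup. The sketch below is a heuristic only: the `suggest_schedule` function is hypothetical, and the returned values are lifted from the guidelines and examples in this section rather than from any connector's actual default schedule.

```python
def suggest_schedule(supports_refresh, content_is_static=False,
                     refresh_is_reliable=True):
    """Return illustrative schedule hints; values are placeholders."""
    if content_is_static:
        # Static content needs no scheduled updates at all.
        return {"refresh": None, "rescan": None}
    if not supports_refresh:
        # Rescans are the only way to pick up changes.
        return {"refresh": None, "rescan": "every 4-6 hours"}
    # Refreshes carry most of the load; rescans clean up what they miss.
    rescan = "monthly" if refresh_is_reliable else "daily"
    return {"refresh": "every 15-30 minutes", "rescan": rescan}


print(suggest_schedule(supports_refresh=True, refresh_is_reliable=False))
```

Actual frequencies should still be tuned to how often the repository changes and how quickly those changes must appear in the index.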

Updating a source taxes computing resources in the Coveo Platform and in the indexed system itself. Therefore, you should schedule source updates only as frequently as required.

Monitor the Indexing Process

There are two ways you can monitor the indexing process:

  • Inspect the indexing pipeline logs

    The Log Browser allows you to review the status of each item going through the indexing process. This information can be extremely useful when troubleshooting indexing issues (see Review Item Logs).

  • Subscribe to source notifications

    You can subscribe to email notifications to be alerted when certain activities are triggered on a specific source, such as when a refresh, rescan, or rebuild fails, succeeds, or aborts (see Manage Source Notification Subscriptions).

What’s Next?

The next article in this section explains how you can navigate and inspect indexed content through the Coveo Administration Console (see Exploring Indexed Content).
