About the Indexing Process
Each piece of data considered for indexing (web page, file, database record, Salesforce object, etc.) must go through the entire Coveo™ Cloud indexing pipeline to end up as an item in your index (see Coveo Cloud Indexing Pipeline).
In the following diagram, the indexing pipeline stages over which you can exercise control have a bright orange background.
This article provides an overview of each of those stages.
Crawling
At the crawling stage of the indexing pipeline, raw data is retrieved from various content repositories, and sent to the document processing manager (DPM) queue. To specify what content repositories need to be crawled at this stage, you must create and configure sources in your Coveo organization (see Add or Edit a Source).
The Coveo Platform offers a broad range of source connectors, many of which are designed to index data residing in specific systems such as Jira or SharePoint, while others are more generic (see Connector Types). Sources indexing secured content can retrieve permissions and security identities to replicate secured systems in your index (see Coveo Cloud Management of Security Identities and Item Permissions). In general, sources retrieving Cloud-accessible content rely on Coveo cloud-hosted crawlers, whereas sources retrieving content behind firewalls rely on the Coveo On-Premises Crawling Module, or on the Push API directly (see Content Retrieval Methods).
Later in this guide, you will learn how to determine what connectors are required for your search project (see Preparing for Indexing).
Further on, you will learn how to take advantage of the crawling flexibility of your sources (see Set the Crawling Scope and Refine/Enhance Crawled Content).
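The general rule above for choosing a content retrieval method can be sketched as a small decision function. This is only an illustration: `RetrievalMethod`, `candidate_retrieval_methods`, and the repository names are hypothetical and are not part of any Coveo API.

```python
from enum import Enum
from typing import List


class RetrievalMethod(Enum):
    CLOUD_CRAWLER = "Coveo cloud-hosted crawler"
    CRAWLING_MODULE = "Coveo On-Premises Crawling Module"
    PUSH_API = "Push API"


def candidate_retrieval_methods(cloud_accessible: bool) -> List[RetrievalMethod]:
    """Return the retrieval methods that generally apply to a repository."""
    if cloud_accessible:
        return [RetrievalMethod.CLOUD_CRAWLER]
    # Content behind a firewall: Crawling Module, or the Push API directly.
    return [RetrievalMethod.CRAWLING_MODULE, RetrievalMethod.PUSH_API]


# Hypothetical repositories, for illustration only.
repositories = {
    "public-website": True,
    "intranet-file-share": False,
}
for name, accessible in repositories.items():
    methods = ", ".join(m.value for m in candidate_retrieval_methods(accessible))
    print(f"{name}: {methods}")
```

In practice, the connector type you pick for a source also influences which retrieval methods are available, so treat this as a starting point only.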
Applying Extensions
At the applying extensions stages of the indexing pipeline, indexing pipeline extensions (IPEs) are executed. An IPE is a custom Python 3 script that either runs before (pre-conversion) or after (post-conversion) the processing and mapping stages of the indexing pipeline (see Processing and Mapping). Each source can have its own set of pre-conversion and post-conversion IPEs (see Apply an Extension to a Source). IPEs typically alter or reject candidate items before they can reach the index.
Later in this guide, you will learn how and when you should use pre-conversion and post-conversion IPEs (see Further Refine/Enhance Content Using Indexing Pipeline Extensions).
IPEs can slow down the indexing pipeline and make it more difficult to troubleshoot. Therefore, you should use IPEs only when they are actually required.
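To make the pre-conversion and post-conversion distinction concrete, here is a minimal sketch of where each kind of extension sits relative to the processing and mapping stages. It is a local simulation, not an actual IPE: `CandidateItem`, both extension functions, `run_pipeline`, and the metadata keys are hypothetical stand-ins, and they don't reproduce the object that Coveo Cloud exposes to extension scripts.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CandidateItem:
    """Hypothetical stand-in for a candidate item moving through the pipeline."""
    uri: str
    metadata: Dict[str, str] = field(default_factory=dict)
    rejected: bool = False


def pre_conversion_extension(item: CandidateItem) -> None:
    """Reject items that should never reach the index (runs before processing)."""
    if item.metadata.get("status") == "draft":
        item.rejected = True


def post_conversion_extension(item: CandidateItem) -> None:
    """Alter an item after processing and mapping by adding a normalized key."""
    title = item.metadata.get("title", "")
    item.metadata["normalizedtitle"] = title.strip().lower()


def run_pipeline(item: CandidateItem) -> bool:
    """Return True if the item would be committed, False if it was rejected."""
    pre_conversion_extension(item)
    if item.rejected:
        return False
    # The processing and mapping stages would run here.
    post_conversion_extension(item)
    return not item.rejected


published = CandidateItem(
    "https://example.com/a", {"title": "  Hello World ", "status": "published"}
)
draft = CandidateItem("https://example.com/b", {"title": "WIP", "status": "draft"})
print(run_pipeline(published), published.metadata["normalizedtitle"])
print(run_pipeline(draft))
```

Rejecting as early as possible, as the pre-conversion function does here, means a discarded item never goes through the later stages.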
Optical Character Recognition
At the optical character recognition (OCR) stage of the indexing pipeline, Coveo Cloud extracts text from images and PDF files in sources for which the OCR feature has been enabled (see Enable Optical Character Recognition). OCR-extracted text is processed as item data, meaning that it’s fully searchable, and will appear in the item Quick View (see Search Result Quick View).
Processing
At the processing stage of the indexing pipeline, candidate items are converted to a format suitable for indexing, and automatic language detection occurs, if applicable. You can exercise no direct control over this stage.
The indexer can convert various standard file formats (see Supported File Formats). Candidate items whose format isn’t supported can still be indexed by reference. The index also supports a wide array of languages (see Supported Languages - Coveo Cloud), many of which have their own stemmer (see About Stemming).
Mapping
At the mapping stage of the indexing pipeline, candidate item metadata is associated with fields in the index. You can exercise granular control over this stage through the mapping configuration of each of your sources (see Manage Source Mappings).
Later in this guide, you will learn how you can concatenate metadata using custom mapping rules, and how to create and customize fields (see Define Custom Mapping Rules to Populate Fields).
The mapping stage merely establishes which metadata key-value pairs are going to populate which fields. Fields are actually populated at the indexing stage (see Indexing).
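The split between establishing a mapping and populating fields can be illustrated with a short sketch. It assumes a `%[key]` placeholder notation for referring to metadata in a rule, and the `resolve_rule` and `populate_fields` helpers and the sample data are hypothetical. Check the actual rule syntax against Manage Source Mappings before relying on it.

```python
import re
from typing import Dict

# Assumed placeholder notation: %[metadata_key]
PLACEHOLDER = re.compile(r"%\[(\w+)\]")


def resolve_rule(rule: str, metadata: Dict[str, str]) -> str:
    """Replace each %[key] placeholder with the matching metadata value."""
    return PLACEHOLDER.sub(lambda m: metadata.get(m.group(1), ""), rule).strip()


def populate_fields(mappings: Dict[str, str], metadata: Dict[str, str]) -> Dict[str, str]:
    """Apply the mapping plan; this stands in for the indexing stage."""
    return {name: resolve_rule(rule, metadata) for name, rule in mappings.items()}


# Mapping stage: only the plan (field -> rule) is established.
mappings = {
    "author": "%[firstname] %[lastname]",  # concatenates two metadata values
    "title": "%[title]",
}

# Indexing stage: fields are populated from the candidate item's metadata.
metadata = {"firstname": "Ada", "lastname": "Lovelace", "title": "Notes"}
fields = populate_fields(mappings, metadata)
print(fields)
```

In this sketch, a placeholder whose metadata key is missing resolves to an empty string. That is a simplification chosen for the example, not a statement about how the index behaves.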
Indexing
At the indexing stage of the indexing pipeline, fields are populated with metadata as determined at the mapping stage (see Mapping), and the fully processed item is committed to the index.
You can exercise no direct control over this stage.
The next article in this section outlines the steps you should follow before you start indexing content (see Preparing for Indexing).