---
title: About the indexing process
slug: '2684'
canonical_url: https://docs.coveo.com/en/2684/
collection: project-guide
source_format: adoc
---
# About the indexing process
Each piece of data considered for [indexing](https://docs.coveo.com/en/204/) (including web pages, files, database records, and Salesforce objects, among others) must go through the entire [Coveo indexing pipeline](https://docs.coveo.com/en/184/) to end up as an [item](https://docs.coveo.com/en/210/) in your index.

This article and the following diagram provide an overview of the main indexing pipeline stages.

![Flowchart showing the steps of the Coveo indexing pipeline with extensions](https://docs.coveo.com/en/assets/images/index-content/indexing-pipeline-flowchart-with-extension.png)

For detailed information on all indexing pipeline stages, see [Coveo indexing pipeline](https://docs.coveo.com/en/1893/).

## Crawling

At the _crawling_ stage of the indexing pipeline, raw data is retrieved from various content repositories and sent to the [document processing manager (DPM)](https://docs.coveo.com/en/191/) queue.
To specify which content repositories need to be crawled at this stage, you must [create and configure](https://docs.coveo.com/en/3390/) [sources](https://docs.coveo.com/en/246/) in your [Coveo organization](https://docs.coveo.com/en/185/).

The [Coveo Platform](https://docs.coveo.com/en/186/) offers [a broad range](https://docs.coveo.com/en/1702#connector-types) of source [connectors](https://docs.coveo.com/en/2734/), many of which are designed to index data residing in specific systems such as Jira or SharePoint, while others are more generic.
Sources indexing secured content can retrieve [permissions](https://docs.coveo.com/en/223/) and [security identities](https://docs.coveo.com/en/240/) to [replicate secured systems](https://docs.coveo.com/en/1719/) in your index.
In general, sources that [retrieve](https://docs.coveo.com/en/1612/) cloud-accessible content rely on [Coveo crawlers](https://docs.coveo.com/en/2121/), whereas sources that retrieve content from behind a firewall rely on the Coveo Crawling Module or use the Push API directly.

In the [Prepare to index content](https://docs.coveo.com/en/2680/) article, you'll learn how to determine which connectors are required for your search project.

In the [Apply indexing techniques](https://docs.coveo.com/en/2721/) article, you'll learn how to [take advantage of the crawling flexibility](https://docs.coveo.com/en/2721#set-the-crawling-scope-and-refine-or-enhance-crawled-content) of your sources.

## Applying extensions

[Indexing pipeline extensions (IPEs)](https://docs.coveo.com/en/206/) are executed during the _applying extensions_ stages of the indexing pipeline.
An IPE is a custom Python 3 script that runs either before (_pre-conversion_) or after (_post-conversion_) the [processing](#processing) and [mapping](#mapping) stages of the indexing pipeline.
Each source can have [its own set](https://docs.coveo.com/en/1936/) of pre-conversion and post-conversion IPEs.
IPEs typically alter or reject candidate items before they can reach the index.
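
The following is a minimal sketch of the alter-or-reject behavior described above, not an actual IPE. The `CandidateItem` class, the `run_extensions` helper, and the `status`, `title`, and `title_length` metadata keys are all hypothetical names introduced for illustration. A real IPE works with objects supplied by the platform, which this sketch doesn't attempt to reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateItem:
    """Hypothetical stand-in for a candidate item moving through the pipeline."""
    uri: str
    metadata: dict = field(default_factory=dict)
    rejected: bool = False


def pre_conversion_extension(item: CandidateItem) -> CandidateItem:
    """Reject items flagged as drafts before they're converted (assumed rule)."""
    if item.metadata.get("status") == "draft":
        item.rejected = True
    return item


def post_conversion_extension(item: CandidateItem) -> CandidateItem:
    """Alter the item by adding a derived metadata value (assumed rule)."""
    title = item.metadata.get("title", "")
    item.metadata["title_length"] = len(title)
    return item


def run_extensions(item: CandidateItem, extensions) -> CandidateItem:
    """Apply extensions in order, stopping as soon as one rejects the item."""
    for extension in extensions:
        item = extension(item)
        if item.rejected:
            break
    return item
```

In this model, a rejected item never reaches the later extensions, which mirrors the idea that a rejected candidate item doesn't reach the index.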

Later in this guide, you'll learn [how and when to use](https://docs.coveo.com/en/2721#further-refine-or-enhance-content-using-indexing-pipeline-extensions) pre-conversion and post-conversion IPEs.

> **Leading practice**
>
> IPEs can slow down the indexing pipeline and make it harder to troubleshoot.
> Use IPEs only when necessary.

See also:

* [Indexing pipeline extension overview](https://docs.coveo.com/en/1556/)

* [Use the Extensions API](https://docs.coveo.com/en/156/)

## Optical character recognition

At the _optical character recognition_ (OCR) stage of the indexing pipeline, the [Coveo Platform](https://docs.coveo.com/en/186/) extracts text from images and PDF files in sources for which the [OCR feature](https://docs.coveo.com/en/2937/) has been enabled.
OCR-extracted text is processed as item data, which means that it's fully searchable and appears in the item [Quick view](https://docs.coveo.com/en/2760#search-result-quick-view).

## Processing

At the _processing_ stage of the indexing pipeline, candidate items are converted to a format suitable for indexing, and automatic language detection occurs, if applicable.
You have no direct control over this stage.

The indexer can convert various standard [file formats](https://docs.coveo.com/en/1689/).
Candidate items whose format isn't supported can still be indexed by reference.
The index also supports a [wide array of languages](https://docs.coveo.com/en/1956/), many of which have their own [stemmer](https://docs.coveo.com/en/1576/).

## Mapping

At the _mapping_ stage of the indexing pipeline, candidate item metadata is associated with [fields](https://docs.coveo.com/en/200/) in the index.
You can exercise [granular control](https://docs.coveo.com/en/1640/) over this stage through the [mapping](https://docs.coveo.com/en/217/) configuration of each of your sources.

Elsewhere in this guide, you can learn how to concatenate metadata using [custom mapping rules](https://docs.coveo.com/en/2721#define-custom-mapping-rules-to-populate-fields) and how to [create and customize fields](https://docs.coveo.com/en/2721#create-and-populate-custom-fields).

> **Note**
>
> The mapping stage merely establishes which metadata key-value pairs are going to populate which fields.
> Fields are actually populated at the [indexing](#indexing) stage.
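
The split between deciding what goes where (mapping) and actually writing the values (indexing) can be illustrated with a simplified model. This is a sketch only, not how the platform implements either stage. The `%[key]` placeholder format is assumed here for illustration, and `resolve_rule`, `plan_mappings`, `populate_fields`, and the sample metadata keys are hypothetical names. Refer to the mapping rule syntax reference for the syntax the platform actually accepts.

```python
import re

# Assumed placeholder format for this sketch: %[metadata_key]
PLACEHOLDER = re.compile(r"%\[([^\]]+)\]")


def resolve_rule(rule: str, metadata: dict) -> str:
    """Replace each %[key] placeholder with the metadata value (empty if missing)."""
    resolved = PLACEHOLDER.sub(lambda m: str(metadata.get(m.group(1), "")), rule)
    return resolved.strip()


def plan_mappings(rules: dict, metadata: dict) -> dict:
    """Mapping stage (simplified): establish which value is destined for which field."""
    return {name: resolve_rule(rule, metadata) for name, rule in rules.items()}


def populate_fields(item: dict, plan: dict) -> dict:
    """Indexing stage (simplified): write the planned values into the item's fields."""
    populated = dict(item)  # copy so the input item is left untouched
    populated.update(plan)
    return populated
```

For example, a rule such as `"%[firstname] %[lastname]"` concatenates two metadata values into a single field value in the plan, and nothing is written to the item until `populate_fields` runs.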

See also:

* [Mapping Rule Syntax Reference](https://docs.coveo.com/en/1839/)

* [Add or edit a body mapping](https://docs.coveo.com/en/1847/)

## Indexing

At the _indexing_ stage of the indexing pipeline, fields are populated with metadata as determined at the [mapping](#mapping) stage, and the fully processed item is committed to the index.

You have no direct control over this stage.

## What's next?

The [Prepare to index content](https://docs.coveo.com/en/2680/) article outlines the steps you should follow before you start indexing content.