---
title: Chunking strategy
slug: p9ub0044
canonical_url: https://docs.coveo.com/en/p9ub0044/
collection: leverage-machine-learning
source_format: adoc
---

# Chunking strategy

When a [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) builds, it creates [embeddings](https://docs.coveo.com/en/ncc87383/) for the [indexed](https://docs.coveo.com/en/204/) items specified in the [model settings](https://docs.coveo.com/en/oaie5476/). To create the [embeddings](https://docs.coveo.com/en/ncc87383/), the [model](https://docs.coveo.com/en/1012/) uses a process called chunking to break large pieces of text into smaller segments called chunks. Each chunk is mapped as a distinct vector in the [embedding](https://docs.coveo.com/en/ncc87383/) vector space. The [embeddings](https://docs.coveo.com/en/ncc87383/) are used by the [model](https://docs.coveo.com/en/1012/) for semantic content retrieval to find the most relevant chunks in response to a query.

The success of a [RAG](https://docs.coveo.com/en/p8ie0159/) system depends, in part, on the quality of the chunks. The more coherent and contextually focused the chunks are, the better the semantic alignment between the query intent and the chunks that are retrieved by the [model](https://docs.coveo.com/en/1012/). This results in more relevant content retrieval, less content ambiguity, and ultimately better responses from the [RAG](https://docs.coveo.com/en/p8ie0159/) system.

There are many ways to segment text into chunks. The method that's used to create the chunks is referred to as the chunking strategy. This article describes the chunking strategies that are available for a [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/), and provides guidance to help you choose the best strategy for your use case. It also provides information on the [index data stream](#chunking-data-stream) that's used to create the chunks.
The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) offers two chunking strategies to choose from and [configure for your CPR model](https://docs.coveo.com/en/oaie5476#set-the-chunking-strategy):

* [Structure-aware chunking](#structure-aware-chunking)
* [Fixed-size chunking](#fixed-size-chunking)

> **Note**
>
> [CPR](https://docs.coveo.com/en/oaie9196/) [models](https://docs.coveo.com/en/1012/) created after the release of structure-aware chunking (October 2025) use the structure-aware chunking strategy by default.
> [CPR](https://docs.coveo.com/en/oaie9196/) [models](https://docs.coveo.com/en/1012/) created before October 2025 use the fixed-size chunking strategy by default.
>
> To view a [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/)'s active chunking strategy, on the [**Models**](https://platform.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/)) page, click the [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/), and then click **View JSON** in the Action bar.
> The chunking strategy appears in the `strategy` parameter of the `chunkerConfig` object under `extraConfig`.

> **Important**
>
> Structure-aware chunking is specifically optimized for large language [models](https://docs.coveo.com/en/1012/) (LLMs) and [RAG](https://docs.coveo.com/en/p8ie0159/) systems, and is the recommended chunking strategy.
> You should use structure-aware chunking unless you have a [specific use case that requires](https://docs.coveo.com/en/p9ub0044#choosing-a-chunking-strategy) the use of fixed-size chunking.
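For reference, the `strategy` parameter described in the note above sits in the `chunkerConfig` object under `extraConfig` of the model JSON. The fragment below is a minimal illustrative sketch only; the strategy value shown and any surrounding fields are assumptions, not guaranteed to match your model's actual JSON:

```json
{
  "extraConfig": {
    "chunkerConfig": {
      "strategy": "STRUCTURE_AWARE"
    }
  }
}
```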
> **Note**
>
> You can [configure each CPR model](https://docs.coveo.com/en/oaie5476#set-the-chunking-strategy) to use a different chunking strategy depending on your specific needs.

## Structure-aware chunking

Structure-aware chunking uses a dynamic algorithm to determine the optimal chunk based on semantic boundaries, [token limits](#model-embedding-limits), text formatting, and structure.

> **Note**
>
> Review the [main considerations](https://docs.coveo.com/en/p9ub0044#choosing-a-chunking-strategy) when choosing between chunking strategies.

The following elements are taken into consideration when determining the chunk boundaries:

* Headings and sections
* Paragraph structure
* Inline formatting
* Shifts in subject or focus
* Whitespace

By using semantic, formatting, and structural markers to set the boundaries of each chunk, instead of a fixed word count, chunks are more coherent and contextually focused. This is especially true when using the [Markdown data stream](#chunking-data-stream), which preserves the item's structure and formatting, to create chunks. When using the [body text data stream](#chunking-data-stream), structure-aware chunking can still perceive elements like headings, lists, and paragraphs by using newline characters and indentation patterns, but not as effectively as when using the Markdown data stream.

Because each chunk is created with a focus on maintaining semantic coherence and focus, the size of each chunk varies, while respecting [token limits](#model-embedding-limits).

![Structure-aware chunking | Coveo](https://docs.coveo.com/en/assets/images/leverage-machine-learning/structure-aware-chunking.png)

Complex items containing tables, hierarchical information, and structured data benefit significantly from this approach. Text that belongs together stays together, improving the contextual relevance of each chunk.
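Coveo's actual structure-aware algorithm isn't public, but the core idea of respecting structural boundaries can be illustrated with a greatly simplified sketch that splits Markdown at heading lines, keeping each heading with its section:

```python
import re

def split_on_headings(markdown_text: str) -> list[str]:
    """Split Markdown into sections, keeping each heading with its body.

    A greatly simplified illustration of structure-aware splitting; the
    real algorithm also weighs token limits, paragraphs, lists, tables,
    inline formatting, and shifts in subject.
    """
    # Zero-width split immediately before each ATX heading line.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```

For example, a document containing a `# Overview` section followed by a `## Details` section would yield two chunks, each beginning at its heading, so related text stays together.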
Semantic boundaries, which are natural breakpoints in the text where the subject shifts or completes, also help dictate where chunks begin and end. Tables, lists, and sections are preserved within a single chunk whenever possible, instead of being split across multiple chunks.

**Example**

For an item that contains basic structured information separated by headings, structure-aware chunking analyzes the item and creates four distinct chunks that focus on specific sections, while keeping elements such as tables in the same chunk. [Fixed-size chunking](#fixed-size-chunking), however, creates three chunks of 250 words each, with no regard for semantic or structural boundaries.

![Chunk comparison between fixed-size and structure-aware chunking | Coveo](https://docs.coveo.com/en/assets/images/leverage-machine-learning/chunk-comparison.png)

Unlike with fixed-size chunking, there's no content overlap between chunks. Content overlap can sometimes lead to contradictory information during content retrieval because the same text can appear in more than one chunk and in different contexts. Therefore, the risk of contradictory information from retrieved chunks is reduced when using structure-aware chunking.

Structure-aware chunking is specifically optimized for large language models (LLMs) and [RAG](https://docs.coveo.com/en/p8ie0159/) systems. Chunks created using semantic and structural markers, and with no content overlap, improve the semantic alignment between the query intent and the chunks retrieved by the [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/). In the context of your [RAG](https://docs.coveo.com/en/p8ie0159/) system, this results in more relevant, comprehensive, and coherent chunks that you can use in your LLM-powered application to generate higher-quality responses.

Structure-aware chunking requires more processing than fixed-size chunking, and typically results in more chunks per item.
This impacts the number of chunks that count toward the [model embedding limits](#model-embedding-limits).

> **Tip**
>
> Given the increase in processing and chunk count with structure-aware chunking, follow [best practices](https://docs.coveo.com/en/oaod5329#best-practices) by keeping items concise and focused.

## Fixed-size chunking

As the name implies, fixed-size chunking creates chunks by splitting text into segments of a fixed number of words.

> **Note**
>
> Review the [main considerations](https://docs.coveo.com/en/p9ub0044#choosing-a-chunking-strategy) when choosing between chunking strategies.

The text is split using a rolling window of 250 whitespace-delimited words. For a given item, the first chunk contains the first 250 words, the second chunk contains the next 250 words, and so on. The last chunk contains the remaining words, which may be fewer than 250 words.

![Fixed-size chunking | Coveo](https://docs.coveo.com/en/assets/images/leverage-machine-learning/fixed-size-chunking.png)

Semantic boundaries (natural breakpoints in the text where the subject shifts or completes), formatting, and text structure, such as headings, paragraphs, and lists, aren't taken into consideration when creating the chunks. To compensate, chunks are created with up to 10% overlap between consecutive chunks. This preserves context continuity so that important context isn't lost when the text is separated into chunks. Content overlap, however, can lead to contradictory information during content retrieval because the same text can appear in different contexts.

Fixed-size chunking requires less processing than structure-aware chunking, and typically results in fewer chunks per item, which may be a consideration given the [model embedding limits](#model-embedding-limits).

## Choosing a chunking strategy

The chunking strategy you choose for your [model](https://docs.coveo.com/en/1012/) impacts how the text is segmented into chunks.
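The fixed-size rolling window described in the previous section is mechanical enough to sketch in a few lines. This is an illustrative sketch only, not Coveo's actual implementation; the parameter values mirror the documented 250-word window and up-to-10% overlap:

```python
def fixed_size_chunks(text: str, size: int = 250, overlap: int = 25) -> list[str]:
    """Split text into fixed-size word chunks with overlap.

    Illustrative sketch of fixed-size chunking: 250 whitespace-delimited
    words per chunk, with up to 10% (25 words) repeated between
    consecutive chunks to preserve context continuity.
    """
    words = text.split()
    chunks = []
    step = size - overlap  # each new window starts 225 words after the last
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # final window reached the end of the item
            break
    return chunks
```

Note how the boundaries fall wherever the word count dictates, regardless of headings or paragraphs, and how the overlapping words appear in two chunks, which is the source of the potential context ambiguity discussed above.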
> **Important**
>
> Structure-aware chunking is specifically optimized for large language [models](https://docs.coveo.com/en/1012/) (LLMs) and [RAG](https://docs.coveo.com/en/p8ie0159/) systems, and is the recommended chunking strategy.
> You should use structure-aware chunking unless you have a specific use case that requires the use of fixed-size chunking.

> **Note**
>
> You can [configure each CPR model](https://docs.coveo.com/en/oaie5476#set-the-chunking-strategy) to use a different chunking strategy depending on your specific needs.

Choosing between [structure-aware](#structure-aware-chunking) and [fixed-size](#fixed-size-chunking) chunking comes down to the following considerations:

* **Dataset size**: Because structure-aware chunking creates chunks dynamically based on semantic and structural markers instead of a fixed word count, it typically results in more chunks per item than fixed-size chunking. The following image shows a simplified example of the number of chunks created for an item using both chunking strategies.

  ![Difference between fixed-size and structure-aware chunking | Coveo](https://docs.coveo.com/en/assets/images/leverage-machine-learning/fixed-size-vs-structure-aware-chunking.png)

  This may impact the number of chunks that count toward the [model embedding limits](#model-embedding-limits). If your dataset is too large and the embedding limits for chunks are exceeded, fixed-size chunking may be more appropriate.
> **Note**
>
> When embedding limits are exceeded, the [model](https://docs.coveo.com/en/1012/) build fails and an error appears on the [**Models**](https://platform.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/ai-and-ml/models/)) page and [model information tab](https://docs.coveo.com/en/1894/) of the [Coveo Administration Console](https://docs.coveo.com/en/183/).

* **Model refresh schedule**: Structure-aware chunking requires more processing than fixed-size chunking. As a result, model build times can be longer with structure-aware chunking depending on the size of your dataset. For models with large datasets that require daily or frequent refreshes, fixed-size chunking may be more appropriate.

In summary, you should choose structure-aware chunking unless your dataset is too large and exceeds the [chunk embedding limits](#model-embedding-limits), or if your model requires daily or frequent refreshes that can't accommodate the longer processing times.

> **Important**
>
> Modifying the chunking strategy initiates an automatic [model](https://docs.coveo.com/en/1012/) [rebuild](https://docs.coveo.com/en/2712/).
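The guidance above boils down to a simple decision rule, which can be expressed as a short sketch. The function and the strategy names it returns are illustrative labels, not API values:

```python
def recommended_strategy(exceeds_embedding_limits: bool,
                         needs_frequent_refreshes: bool) -> str:
    """Encode the documented guidance: prefer structure-aware chunking
    unless dataset size or refresh cadence rules it out.

    Illustrative helper only; the strategy names are descriptive labels.
    """
    if exceeds_embedding_limits or needs_frequent_refreshes:
        return "fixed-size"
    return "structure-aware"
```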
[cols="1,1,1",options="header"]
|===
| Consideration | Structure-aware chunking | Fixed-size chunking

| Chunk size
| Dynamically determined based on semantic boundaries, text formatting, and structure
| Fixed 250 whitespace-delimited words per chunk

| Content overlap between chunks
| No
| Yes

| Number of chunks
| Results in more chunks per item and approximately 60% more chunks overall on average, depending on the dataset
| Results in fewer chunks per item and overall

| Model build time
| Requires build times up to 3x longer (depending on the dataset) due to algorithm complexity
| Requires shorter build times

| Best suited for
a| * A [model](https://docs.coveo.com/en/1012/) with a smaller dataset that doesn't exceed the [embedding limits](#model-embedding-limits)
* A [model](https://docs.coveo.com/en/1012/) with a refresh schedule that can accommodate the longer processing times
a| * A [model](https://docs.coveo.com/en/1012/) with a large dataset that exceeds the [embedding limits](#model-embedding-limits)
* A [model](https://docs.coveo.com/en/1012/) that requires daily or frequent refreshes
|===

## Chunking data stream

When items are [indexed](https://docs.coveo.com/en/204/), the [indexing pipeline](https://docs.coveo.com/en/184/) processes each item into different data streams that are used for specific purposes. The [data streams](https://docs.coveo.com/en/2891/) that pertain to the chunking process are the _body text_ and _body Markdown_ data streams:

* **Body text**: Contains all the item's body content in text format. This data stream is primarily used during [indexing](https://docs.coveo.com/en/1893#indexing) to add the item contents to the unified index to make the content searchable. However, it can also be used by your [model](https://docs.coveo.com/en/1012/) to create chunks in the absence of the body Markdown data stream.
* **Body Markdown**: Contains all the item's body content in Markdown format.
It preserves the item's formatting and structure using Markdown, and is used solely for the purpose of creating chunks for [embeddings](https://docs.coveo.com/en/ncc87383/).

For a given item, the [model](https://docs.coveo.com/en/1012/) uses either the body text or body Markdown data stream to create the chunks. If a Markdown data stream exists for an item, the [model](https://docs.coveo.com/en/1012/) automatically uses that data stream to create the chunks. There's no configuration required to use the Markdown [data stream](https://docs.coveo.com/en/2891/). If a Markdown data stream isn't available for an item, the [model](https://docs.coveo.com/en/1012/) uses the body text data stream instead to create the chunks.

> **Notes**
>
> * The Markdown data stream is processed for PDF files only.
> All other file types are processed only with body text and body HTML data streams.
>
> * A PDF file that's already indexed won't have a Markdown data stream until it's re-indexed.
> To make sure all of your PDF files are processed to include a Markdown data stream, [rebuild your source](https://docs.coveo.com/en/2039#rebuild).
>
> * To optimize indexing performance, the processing time for an item's Markdown data stream is limited to 15 minutes.
> If the limit is reached, the Markdown data stream will be truncated.
> In this case, the [model](https://docs.coveo.com/en/1012/) still uses the truncated body Markdown data stream to create the chunks.

**Example**

When a PDF file is [indexed](https://docs.coveo.com/en/204/), the [indexing pipeline](https://docs.coveo.com/en/184/) processes the item and creates three data streams for the body content: HTML, text, and Markdown. Since a Markdown data stream exists for the item, the [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) uses it to create the [embeddings](https://docs.coveo.com/en/ncc87383/).
If the item didn't have a Markdown data stream, the [model](https://docs.coveo.com/en/1012/) would use the text data stream instead.

![Chunker data streams | Coveo](https://docs.coveo.com/en/assets/images/leverage-machine-learning/chunker-data-stream.png)

The HTML data stream is used to render an HTML version of the item to be used by the [quickview](https://docs.coveo.com/en/3311/) component of a search interface.

> **Tip**
>
> You can apply an [indexing pipeline extension (IPE)](https://docs.coveo.com/en/1645/) script to modify the original file or any of the data streams.

### Advantages of the Markdown data stream

The Markdown data stream is the preferred data stream to use when creating chunks, which is why it's automatically used by your [model](https://docs.coveo.com/en/1012/) when available.

> **Note**
>
> There's no configuration required to use the Markdown [data stream](https://docs.coveo.com/en/2891/).
> If it exists for an [indexed](https://docs.coveo.com/en/204/) item, the [model](https://docs.coveo.com/en/1012/) always uses it to create chunks instead of the body text data stream.

When using [structure-aware chunking](#structure-aware-chunking), the model takes advantage of the structure and formatting present in the Markdown data stream to create more coherent and semantically focused chunks. When using the body text data stream, it can still perceive elements like headings, lists, and paragraphs by using newline characters and indentation patterns, but not as effectively as when using the Markdown data stream.

While [fixed-size chunking](#fixed-size-chunking) doesn't leverage an item's structure and formatting to create chunks, using the Markdown data stream is still beneficial as the Markdown formatting is preserved in the chunk.
When chunks are created, the format of the data stream used is preserved in the chunk. This applies to chunks created using both structure-aware chunking and fixed-size chunking, and when using either the body Markdown or body text data stream. Chunks created using the body Markdown data stream retain the Markdown formatting, while chunks created using the body text data stream are plain text.

Since large language models (LLMs) are trained on structured text, a chunk that preserves an item's structure and formatting improves an LLM's reasoning and retrieval capabilities, and ultimately provides better responses from a [RAG](https://docs.coveo.com/en/p8ie0159/) system. This is why the [model](https://docs.coveo.com/en/1012/) uses the Markdown data stream whenever possible to create chunks, no matter which chunking strategy is used.

## Model embedding limits

The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) converts your content's body text into numerical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/ncc87383/). It does this by breaking the text up into smaller segments called chunks, and each chunk is mapped as a distinct vector. For more information, see [Embeddings](https://docs.coveo.com/en/oaie5277#embeddings).

Due to the amount of processing required for embeddings, the model is subject to the following embedding limits, depending on the [chunking strategy](https://docs.coveo.com/en/p9ub0044/).

> **Note**
>
> For a given [model](https://docs.coveo.com/en/1012/), the same chunking strategy is used for all sources and item types.
[cols="1,1,1",options="header"]
|===
| Limit | Structure-aware chunking | Fixed-size chunking

| Chunk size
| Average of 300 tokens per chunk (minimum: 200 tokens; maximum: 400 tokens)
| 250 whitespace-delimited words per chunk

| Maximum number of items or chunks
2+a| Up to 15 million items or 50 million chunks

> **Notes**
>
> * The maximum number of items depends on the [item allocation of your product plan](https://docs.coveo.com/en/l2590456#generative-ai-solutions).
>
> * Your CPR implementation must include a [Semantic Encoder (SE) model](https://docs.coveo.com/en/nb6a0483/).
> If you have more than one CPR model in your Coveo organization, each CPR model must use only the items that are used by the SE model.

| Maximum number of chunks per item
2+a| 1000 (default)

> **Note**
>
> The default setting of 1000 is suitable for the majority of use cases.
> If required, you can [set a custom value between `1` and `1000`](https://docs.coveo.com/en/oaie5476#set-the-maximum-chunks-per-item).

> **Important**
>
> The [model](https://docs.coveo.com/en/1012/) will embed the item's text until the maximum chunks per item limit is reached.
> The remaining text in the item won't be embedded and therefore won't be used by the [model](https://docs.coveo.com/en/1012/).
>
> To make sure that each item's text is fully embedded, follow [best practices](https://docs.coveo.com/en/oaod5329#best-practices) by keeping items concise and focused.
|===
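To gauge whether an item stays under the chunks-per-item limit, you can do a back-of-envelope estimate from the documented averages (about 300 tokens per structure-aware chunk, 250 words per fixed-size chunk). The helper below is a hypothetical sketch, not a Coveo API; actual chunk counts vary with item structure and overlap:

```python
import math

def estimated_chunks_per_item(item_length: int, strategy: str) -> int:
    """Rough chunk count per item from the documented averages.

    item_length is tokens for "structure-aware" (avg 300 tokens/chunk)
    and whitespace-delimited words for "fixed-size" (250 words/chunk).
    Hypothetical estimate only; real counts depend on item structure.
    """
    per_chunk = {"structure-aware": 300, "fixed-size": 250}[strategy]
    return math.ceil(item_length / per_chunk)

def fits_default_limit(item_length: int, strategy: str,
                       max_chunks: int = 1000) -> bool:
    # Compare the estimate against the default 1000 chunks-per-item limit.
    return estimated_chunks_per_item(item_length, strategy) <= max_chunks
```

For instance, an item of roughly 90,000 tokens would yield around 300 structure-aware chunks and fit comfortably under the default limit, whereas a 500,000-word item under fixed-size chunking would exceed it, and any text past the limit wouldn't be embedded.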