---
title: Passage Retrieval (CPR) content requirements and best practices
slug: oaod5329
canonical_url: https://docs.coveo.com/en/oaod5329/
collection: leverage-machine-learning
source_format: adoc
---

# Passage Retrieval (CPR) content requirements and best practices

[Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) uses your enterprise content as the data from which to retrieve passages.
The content that you choose to use for CPR, and the quality of that content, directly affects the quality of the passages that CPR retrieves.

A CPR implementation requires you to [create a CPR model](https://docs.coveo.com/en/oaie5476/).
When creating the [model](https://docs.coveo.com/en/1012/), you must specify the [indexed](https://docs.coveo.com/en/204/) content that the model will use to retrieve passages.

This article describes the requirements and best practices for the content that you choose to use for CPR.

> **Note**
>
> A [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) implementation must include both a [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) and a [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/).
> The same SE model can be used with multiple CPR models.
> Both the CPR and SE models must be configured to use the same content.
>
> See [CPR overview](https://docs.coveo.com/en/oaie5277#cpr-overview) for information on how CPR and SE work together in the context of a user query to retrieve passages.

## How CPR uses your content

Before deciding on the content to use for CPR, it's important to have a basic understanding of how your content is used to retrieve passages.

When an [item](https://docs.coveo.com/en/210/) is [indexed](https://docs.coveo.com/en/204/), the item's content is mapped to the [`body`](https://docs.coveo.com/en/1847#item-body) [field](https://docs.coveo.com/en/200/) in the Coveo [index](https://docs.coveo.com/en/204/).
The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) uses a pre-trained sentence transformer language model to convert your indexed content's [body](https://docs.coveo.com/en/3313/) text into mathematical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/oaie5277#embeddings).

When a user enters a [query](https://docs.coveo.com/en/231/), the model uses the vector space to retrieve the most relevant content.
This retrieval is based on semantic similarity using [embeddings](https://docs.coveo.com/en/ncc87383/), which are created from your content's [body](https://docs.coveo.com/en/3313/) text.

In summary, the [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) parses your content's [body](https://docs.coveo.com/en/3313/) text when creating the embeddings, and uses only that content to retrieve passages.
An item's [body](https://docs.coveo.com/en/3313/) data, therefore, should be as clean, focused, and relevant as possible.
The better the data, the better the embeddings.
For best results, adhere to the requirements and best practices detailed in this article when choosing the content to use for CPR.

For more information on how your content is used to retrieve passages, see [CPR processes](https://docs.coveo.com/en/oaie5277#cpr-processes).

> **Note**
>
> The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) uses only the content in an item's `body` field.
> Content in other searchable fields, such as `title`, `author`, `source`, and `date`, isn't embedded by the [model](https://docs.coveo.com/en/1012/) and therefore isn't considered during passage retrieval.
> Passages are retrieved solely based on the semantic similarity between the query and the content of an item's body.
> For example, even if a query matches terms in an item's `title` field, the [CPR](https://docs.coveo.com/en/oaie9196/) model won't retrieve passages from that item unless the body content is semantically relevant.

## Requirements

* The content you want to use must be [indexed](https://docs.coveo.com/en/204/) in your Coveo [organization](https://docs.coveo.com/en/185/) before creating the [model](https://docs.coveo.com/en/1012/).

  You don't have to use all the content in your index.
  In fact, best practices dictate that you should choose a reasonably sized dataset to keep the content focused and relevant.
  When creating the model, you can choose to use a subset of your indexed content by selecting the [sources](https://docs.coveo.com/en/246/) that contain the [items](https://docs.coveo.com/en/210/), and then further filtering the source dataset.
  For more information, see [Choose your content](#choose-your-content).

  > **Note**
  >
  > If the indexed items you want to use aren't [optimized for use](#optimize-your-content) with the model, re-index the items with the proper configuration.

* An indexed item must contain a unique value in the [`permanentid`](https://docs.coveo.com/en/1913/) field in order for the item's content to be embedded and used by the model.

  > **Note**
  >
  > By default, an item indexed using a standard Coveo source automatically contains a value in its `permanentid` field that Coveo uses as the item's unique identifier.
  > However, if you're using a custom source, such as Push API, you must make sure that the items that you want to use contain a unique value in the `permanentid` field.
  > If not, you must [map unique metadata to the item's `permanentid` field](https://docs.coveo.com/en/1913#taking-advantage-of-the-permanentid-field).
  >
  > To verify whether an item contains a unique value in the `permanentid` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).

* The indexed item's `language` field must be English.

  > **Tip**
  >
  > By default, only English content is supported.
  > However, Coveo offers beta support for languages other than English.
  > Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

  > **Note**
  >
  > To verify an item's `language` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).

### Supported file types

Coveo has tested and supports the following file types for use with the [model](https://docs.coveo.com/en/1012/):

* HTML

* PDF

> **Notes**
>
> * Other text-based file types that are [supported at ingestion](https://docs.coveo.com/en/1689/) but aren't listed above may also provide good results. However, they're not officially supported by Coveo for use with the model.
>
> * PDFs with single-column, uninterrupted, paragraph-based text sections provide the best results.
>   Text in tables and multi-column layouts is embedded, but parsing that text is less predictable.
>
> * You can use the [optical character recognition (OCR) source feature](https://docs.coveo.com/en/2937/) to extract text from images in PDFs and image files.
>   Otherwise, text from images won't be embedded or used by the model.
>
> * Video files aren't supported.

## Best practices

This section describes best practices for choosing the content to use for the [model](https://docs.coveo.com/en/1012/) and for optimizing that content for best results.

### Choose your content

When deciding on the content to use, consider the following:

* Prioritize content that's designed to answer questions, such as knowledge base articles, support documents, FAQs, community answers, and product documentation.

* Prioritize content that's written in a conversational tone.

* Prioritize shorter documents that are focused on a single topic.

  > **Note**
  >
  > Avoid using very long documents that cover multiple topics.
  > This may result in text being embedded as semantically similar, even though the context or topic is different.

* Content should be written using a single language.

  > **Tip**
  >
  > By default, only English content is supported.
  > However, Coveo offers beta support for languages other than English.
  > Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

  > **Note**
  >
  > For a list of beta-supported languages, see [Supported languages for machine learning models](https://docs.coveo.com/en/1956#supported-languages-for-machine-learning-models).

* Avoid multiple documents with similar content.

* Choose a reasonably sized dataset to keep the content focused and current.
  > **Important**
  >
  > Keep the [model embedding limits](#model-embedding-limits) in mind when choosing the content for your model.

### Optimize your content

To optimize your content for the [model](https://docs.coveo.com/en/1012/), follow these best practices:

* Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the [`body`](https://docs.coveo.com/en/3313/) data when the [items](https://docs.coveo.com/en/210/) are [indexed](https://docs.coveo.com/en/204/).

* Review the `body` data and source mappings to make sure that the body contains the desired content.

> **Notes**
>
> * For Web and Sitemap sources, you can use a [web scraping configuration](https://docs.coveo.com/en/mc1f3573/) to remove boilerplate content and select the web content to index.
>
> * For all other types of sources, you can [edit an item's body mapping](https://docs.coveo.com/en/1847#add-or-edit-an-item-body-mapping) to make sure that the body contains the desired content.

## When is an answer not generated?

Adhering to the requirements and best practices outlined in this article greatly improves the relevancy of the passages that are exposed to your LLM.
In certain cases, however, an answer can't be generated for a given user [query](https://docs.coveo.com/en/231/).
This can be caused by [insufficient relevant content](#insufficient-relevant-content).

### Insufficient relevant content

The passages ([chunks](https://docs.coveo.com/en/n9de0370#chunking)) that are used in the passage retrieval flow must meet a minimum relevancy threshold with the user [query](https://docs.coveo.com/en/231/).
The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) verifies this when retrieving the most relevant passages during [second-stage content retrieval](https://docs.coveo.com/en/n9de0370#second-stage-content-retrieval).
If none of the passages identified during second-stage content retrieval meets [CPR](https://docs.coveo.com/en/oaie9196/)’s minimum relevancy threshold with the user query, no passages are made available to the LLM application.

## Model embedding limits

The [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) converts your content's body text into numerical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/ncc87383/).
It does this by breaking the text up into smaller segments called chunks, and each chunk is mapped as a distinct vector.
For more information, see [Embeddings](https://docs.coveo.com/en/oaie5277#embeddings).

Due to the amount of processing required for embeddings, the model is subject to the following embedding limits, depending on the [chunking strategy](https://docs.coveo.com/en/p9ub0044/).

> **Note**
>
> For a given [model](https://docs.coveo.com/en/1012/), the same chunking strategy is used for all sources and item types.

| Limit | Structure-aware chunking | Fixed-size chunking |
| --- | --- | --- |
| Chunk size | Average of 300 tokens per chunk (minimum: 200 tokens; maximum: 400 tokens) | 250 whitespace-delimited words per chunk |
| Maximum number of items or chunks | Up to 15 million items or 50 million chunks | Up to 15 million items or 50 million chunks |
| Maximum number of chunks per item | 1000 (default) | 1000 (default) |

> **Notes**
>
> * The maximum number of items depends on the [item allocation of your product plan](https://docs.coveo.com/en/l2590456#generative-ai-solutions).
>
> * Your CPR implementation must include a [Semantic Encoder (SE) model](https://docs.coveo.com/en/nb6a0483/).
>   If you have more than one CPR model in your Coveo organization, each CPR model must use only the items that are used by the SE model.
>
> * The default maximum of 1000 chunks per item is suitable for the majority of use cases.
>   If required, you can [set a custom value between `1` and `1000`](https://docs.coveo.com/en/oaie5476#set-the-maximum-chunks-per-item).

> **Important**
>
> The [model](https://docs.coveo.com/en/1012/) will embed the item's text until the maximum chunks per item limit is reached.
> The remaining text in the item won't be embedded and therefore won't be used by the [model](https://docs.coveo.com/en/1012/).
>
> To make sure that each item's text is fully embedded, follow [best practices](https://docs.coveo.com/en/oaod5329#best-practices) by keeping items concise and focused.

## What's next?

[Create a CPR model](https://docs.coveo.com/en/oaie5476/).
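
Before creating the model, it can help to estimate how an item's body text would fare against the embedding limits described in this article. The following is a minimal sketch, not Coveo's implementation: it assumes fixed-size chunking simply splits the body into groups of 250 whitespace-delimited words, and the function names (`fixed_size_chunks`, `words_left_unembedded`) are hypothetical helpers introduced here for illustration.

```python
from typing import List

# Figures taken from the embedding limits in this article.
WORDS_PER_CHUNK = 250        # fixed-size chunking: words per chunk
MAX_CHUNKS_PER_ITEM = 1000   # default maximum number of chunks per item


def fixed_size_chunks(body: str, words_per_chunk: int = WORDS_PER_CHUNK) -> List[str]:
    """Split body text into chunks of at most `words_per_chunk` words.

    Assumption: words are delimited by whitespace and chunks don't overlap.
    The real chunking behavior may differ.
    """
    words = body.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def words_left_unembedded(
    body: str,
    max_chunks: int = MAX_CHUNKS_PER_ITEM,
    words_per_chunk: int = WORDS_PER_CHUNK,
) -> int:
    """Estimate how many words fall beyond the per-item chunk limit."""
    total_words = len(body.split())
    return max(0, total_words - max_chunks * words_per_chunk)


# A 600-word body yields 3 chunks (250 + 250 + 100 words).
sample_body = "word " * 600
sample_chunks = fixed_size_chunks(sample_body)
```

A nonzero result from `words_left_unembedded` suggests that the tail of the item wouldn't be embedded under these assumptions, which is a signal to split the item into shorter, single-topic documents.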