Passage Retrieval (CPR) content requirements and best practices

Passage Retrieval (CPR) uses your enterprise content as the data from which to retrieve passages. The content that you choose to use for CPR, and the quality of that content, have a direct impact on the quality of the passages that CPR retrieves.

A CPR implementation requires you to create a CPR model. When creating the model, you must specify the indexed content that the model will use to retrieve passages.

This article describes the requirements and best practices regarding the content that you choose to use for CPR.

Note

A Passage Retrieval (CPR) implementation must include both a CPR model and a Semantic Encoder (SE) model. The same SE model can be used with multiple CPR models. Both the CPR and SE models must be configured to use the same content.

See CPR overview for information on how CPR and SE work together in the context of a user query to retrieve passages.

How CPR uses your content

Before deciding on the content to use for CPR, it’s important to have a basic understanding of how your content is used to retrieve passages.

When an item is indexed, the item’s content is typically mapped to the body field in the Coveo index. The CPR model uses a pre-trained sentence transformer language model to convert your indexed content’s body text into numerical representations (vectors) in a process called embedding. When a user enters a query, the model uses this vector space to retrieve the most relevant content, based on the semantic similarity between the query and the embeddings created from your content’s body text.

In summary, the CPR model parses your content’s body text when creating the embeddings, and uses only that content to retrieve passages. An item’s body data, therefore, should be as clean, focused, and relevant as possible. The better the data, the better the embeddings. For best results, you should adhere to the requirements and best practices detailed in this article when choosing the content to use for CPR.

For more information on how your content is used to retrieve passages, see CPR processes.
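To make the embedding-and-retrieval idea concrete, here’s a minimal sketch in Python. The vectors, passages, and similarity function below are purely illustrative; CPR’s actual sentence transformer model and vector space are internal to Coveo.

```python
import math

def cosine_similarity(a, b):
    """Semantic similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: in practice, a sentence transformer maps each item's body
# text to high-dimensional vectors at indexing time.
passages = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Quarterly sales report": [0.1, 0.8, 0.3],
}

# Made-up embedding of the user query "I forgot my password".
query_vector = [0.85, 0.15, 0.05]

# Retrieval: pick the passage whose embedding is most similar to the query's.
best = max(passages, key=lambda p: cosine_similarity(query_vector, passages[p]))
print(best)  # → How to reset your password
```

Because retrieval compares only these body-text embeddings, anything missing from (or polluting) the body field directly shapes what the model considers similar.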

Important

The CPR model uses only the content in an item’s body field. The model doesn’t use the content in other searchable fields, such as title, author, source, and date. This means that the content in fields other than body will be taken into account only for keyword-based (lexical) retrieval. The content won’t be used by the models for embeddings or passage retrieval.

Requirements

Supported file types

Coveo has tested and supports the following file types for use with the model:

  • HTML

  • PDF

Notes
  • Other text-based file types that are supported at ingestion but aren’t listed above may also provide good results; however, they’re not officially supported by Coveo for use with the model.

  • PDFs with single-column, uninterrupted, paragraph-based text provide the best results. Text in tables and multi-column layouts is embedded, but parsing it is less predictable.

  • You can use the optical character recognition (OCR) source feature to extract text from images in PDFs and image files. Otherwise, text from images won’t be embedded or used by the model.

  • Video files aren’t supported.

Best practices

This section describes best practices for choosing the content to use with the model and for optimizing that content for best results.

Choose your content

When deciding on the content to use, consider the following:

  • Prioritize content that’s designed to answer questions, such as knowledge base articles, support documents, FAQs, community answers, and product documentation.

  • Prioritize content that’s written in a conversational tone.

  • Prioritize shorter documents that are focused on a single topic.

    Note

    Avoid using very long documents that cover multiple topics. This may result in text being embedded as semantically similar, even though the context or topic is different.

  • Content should be written using a single language.

    Note

    By default, only English content is supported. However, Coveo offers beta support for languages other than English; learn more about multilingual content retrieval and answer generation. For a list of beta-supported languages, see Supported languages for machine learning models.

  • Avoid multiple documents with similar content.

  • Choose a reasonably sized dataset to keep the content focused and current.

Important

Keep the model embedding limits in mind when choosing the content for your model.

Optimize your content

To optimize your content for the model, follow these best practices:

  • Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the body data when the items are indexed.

  • Review the body data and source mappings to make sure that the body contains the desired content.
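As a sketch of the first bullet, the following Python stdlib snippet strips common boilerplate elements from an HTML item so that only the main text remains. The tag list and the sample page are assumptions for illustration; in a Coveo deployment you’d typically achieve this through your source and indexing pipeline configuration rather than your own parser.

```python
from html.parser import HTMLParser

# Assumed boilerplate elements to exclude from the body text.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside"}

class BodyTextExtractor(HTMLParser):
    """Collects page text while skipping boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside boilerplate elements
        self.parts = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we're outside all boilerplate elements.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

page = """
<html><body>
  <nav>Home &gt; Docs</nav>
  <article><p>How to reset your password.</p></article>
  <footer>© Example Corp</footer>
</body></html>
"""
parser = BodyTextExtractor()
parser.feed(page)
print(" ".join(parser.parts))  # → How to reset your password.
```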


When is an answer not generated?

Adhering to the requirements and best practices outlined in this article greatly improves the relevancy of the passages that are exposed to your LLM. In certain cases, however, an answer can’t be generated for a given user query. This can be caused by insufficient relevant content.

Insufficient relevant content

The passages (chunks) that are used in the passage retrieval flow must meet a minimum relevancy threshold with respect to the user query. The CPR model verifies this when retrieving the most relevant passages during second-stage content retrieval. If none of the passages identified during second-stage content retrieval meet CPR’s minimum relevancy threshold, no passages are made available to the LLM application.
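The threshold check amounts to a simple filter, as in this sketch. The threshold value and the scores below are made up; CPR’s actual relevancy scoring is internal to Coveo.

```python
RELEVANCY_THRESHOLD = 0.5  # illustrative value; the real threshold is internal to CPR

def filter_passages(scored_passages, threshold=RELEVANCY_THRESHOLD):
    """Keep only passages whose relevancy score meets the threshold.

    Returns an empty list when no passage qualifies, in which case
    no passages are exposed to the LLM application for this query.
    """
    return [passage for passage, score in scored_passages if score >= threshold]

candidates = [
    ("Reset your password from the login page.", 0.82),
    ("Company holiday schedule.", 0.31),
]
print(filter_passages(candidates))  # → ['Reset your password from the login page.']
```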

Model embedding limits

The CPR model converts your content’s body text into numerical representations (vectors) in a process called embedding. To do this, it breaks the text into smaller segments called chunks and embeds each chunk as a distinct vector. For more information, see Embeddings.

Due to the amount of processing required for embeddings, the model is subject to the following embedding limits:

Note

The same chunking strategy is used for all sources and item types.

  • Up to 15 million items or 50 million chunks

  • 1000 chunks per item

    Important

    This means that a given item can have a maximum of 1000 chunks. If an item contains a lot of text (for example, more than 200,000 words or 250 pages), the model embeds the item’s text until the 1000-chunk limit is reached. The remaining text isn’t embedded and therefore isn’t used by the model.

  • 250 words per chunk

    Note

    There can be an overlap of up to 10% between chunks. In other words, the last 10% of the previous chunk can be the first 10% of the next chunk.
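The limits above can be sketched as a simple word-based chunker. This is illustrative only: the 250-word chunk size, 10% overlap, and 1000-chunk cap come from this article, but Coveo’s actual chunking strategy is internal and more sophisticated than naive word splitting.

```python
CHUNK_WORDS = 250    # documented maximum words per chunk
OVERLAP_WORDS = 25   # up to 10% overlap between consecutive chunks
MAX_CHUNKS = 1000    # documented per-item chunk limit

def chunk_item(body_text):
    """Split an item's body text into overlapping chunks, capped per item."""
    words = body_text.split()
    step = CHUNK_WORDS - OVERLAP_WORDS  # advance so consecutive chunks overlap by 10%
    chunks, start = [], 0
    while start < len(words) and len(chunks) < MAX_CHUNKS:
        chunks.append(" ".join(words[start:start + CHUNK_WORDS]))
        start += step
    # Any text beyond the 1000-chunk limit is never embedded.
    return chunks

very_long_item = "word " * 300_000  # well past the per-item limit
print(len(chunk_item(very_long_item)))  # → 1000
```

With these numbers, an item stops contributing new embeddings once the cap is hit, which is why very long items should be split into shorter, single-topic documents.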

What’s next?