---
title: Semantic Encoder (SE) content requirements and best practices
slug: nbo90598
canonical_url: https://docs.coveo.com/en/nbo90598/
collection: leverage-machine-learning
source_format: adoc
---

# Semantic Encoder (SE) content requirements and best practices

> **Important**
>
> * A [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is only supported for use as part of a [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) implementation.
>
> * The [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is available as a paid product extension.
>   Contact [Coveo Sales](https://www.coveo.com/en/contact) or your Account Manager to add SE to your [organization](https://docs.coveo.com/en/185/) license.

A [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) creates [embeddings](https://docs.coveo.com/en/n9de0370#embeddings) for your [indexed](https://docs.coveo.com/en/204/) [item](https://docs.coveo.com/en/210/) content.
The [embeddings](https://docs.coveo.com/en/ncc87383/) are then used by the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) to retrieve items for a given query based on semantic similarity (see [What does an SE model do?](https://docs.coveo.com/en/nb890247#what-does-an-se-model-do)).
In the context of generating answers using [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or retrieving passages using [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/), a list of the most relevant content is sent to the [RGA](https://docs.coveo.com/en/nbtb6010/) or [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) for answer generation or passage retrieval (see [RGA overview](https://docs.coveo.com/en/n9de0370#rga-overview) or [CPR overview](https://docs.coveo.com/en/oaie5277#cpr-overview)).
The quality of that content has a direct impact on the quality of the embeddings, the relevance of the retrieved items, and the quality of the answers generated by RGA or the passages retrieved by CPR.

When [creating an SE model](https://docs.coveo.com/en/nb890247/), you must specify the indexed content that the model will use.
This article describes the requirements and best practices for the content that you choose.

> **Note**
>
> When an [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is used in a [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) implementation, the same [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) can be used with multiple [RGA](https://docs.coveo.com/en/nbtb6010/) or [CPR](https://docs.coveo.com/en/oaie9196/) models.
> The [RGA](https://docs.coveo.com/en/nbtb6010/) and [CPR](https://docs.coveo.com/en/oaie9196/) models must be configured to use the same content as the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/).
>
> See [RGA overview](https://docs.coveo.com/en/n9de0370#rga-overview) or [CPR overview](https://docs.coveo.com/en/oaie5277#cpr-overview) for information on how SE works with RGA or CPR in the context of a search session.
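
To make the idea of retrieval by semantic similarity more concrete, here is a minimal sketch in Python. It is an illustration only, not Coveo's implementation: the three-dimensional vectors are hand-made stand-ins for real embeddings, the item names are hypothetical, and cosine similarity is an assumed similarity measure (this article doesn't specify which measure the SE model uses).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_items(query_vector, item_vectors):
    """Return item IDs ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), item_id)
        for item_id, vector in item_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


# Hypothetical, hand-made vectors standing in for title/body embeddings.
ITEM_VECTORS = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.1, 0.9, 0.1],
    "install-guide": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a query such as "how do I change my password".
QUERY_VECTOR = [0.8, 0.2, 0.1]

ranking = rank_items(QUERY_VECTOR, ITEM_VECTORS)
print(ranking)  # ['reset-password', 'billing-faq', 'install-guide']
```

The item whose vector points in nearly the same direction as the query vector ranks first, even though no keywords are compared. This is why noisy or unfocused text in an item can pull its vector away from the queries it should match.
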
## How SE uses your content

Before deciding on the content to use, it's important to understand how the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses your content.

When an [item](https://docs.coveo.com/en/210/) is [indexed](https://docs.coveo.com/en/204/), its content is [mapped](https://docs.coveo.com/en/217/) to the [`title`](https://docs.coveo.com/en/1839#item-title-selection) and [`body`](https://docs.coveo.com/en/1847#item-body) fields in the Coveo index.
The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses a pre-trained sentence transformer language model to convert your indexed content's title and body text into mathematical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/n9de0370#embeddings).
When a user enters a [query](https://docs.coveo.com/en/231/), the model uses the vector space to retrieve the most relevant content.
This retrieval is based on semantic similarity using [embeddings](https://docs.coveo.com/en/ncc87383/), which are created from your content's title and body text.

In summary, the [SE](https://docs.coveo.com/en/nbtb0041/) model parses only your content's title and body text when creating the embeddings.
Therefore, an item's title and [body](https://docs.coveo.com/en/3313/) data should be as clean, focused, and relevant as possible.
The better the data, the better the embeddings, and the better the content retrieval.
For more information, see [What does an SE model do?](https://docs.coveo.com/en/nb890247#what-does-an-se-model-do)

> **Note**
>
> The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses only the content in an item's `body` and `title` fields.
> Content in other searchable fields, such as `author`, `source`, and `date`, isn't embedded by the [model](https://docs.coveo.com/en/1012/) and therefore isn't considered for vector-based content retrieval.

## Requirements

* The content you want to use must be [indexed](https://docs.coveo.com/en/204/) in your Coveo [organization](https://docs.coveo.com/en/185/) before creating the [model](https://docs.coveo.com/en/1012/).
  You don't have to use all the content in your index.
  In fact, best practices dictate that you should choose a reasonably sized dataset to keep the content focused and relevant.
  When creating the model, you can choose to use a subset of your indexed content by selecting the [sources](https://docs.coveo.com/en/246/) that contain the [items](https://docs.coveo.com/en/210/), and then further filtering the source dataset.
  For more information, see [Choose your content](#choose-your-content).

> **Note**
>
> If the indexed items you want to use aren't [optimized for use](#optimize-your-content) with the model, re-index the items with the proper configuration.

* An indexed item must contain a unique value in the [`permanentid`](https://docs.coveo.com/en/1913/) field in order for the item's content to be embedded and used by the model.

> **Note**
>
> By default, an item indexed using a standard Coveo source automatically contains a value in its `permanentid` field that Coveo uses as the item's unique identifier.
> However, if you're using a custom source, such as Push API, you must make sure that the items that you want to use contain a unique value in the `permanentid` field.
> If not, you must [map unique metadata to the item's `permanentid` field](https://docs.coveo.com/en/1913#taking-advantage-of-the-permanentid-field).
>
> To verify whether an item contains a unique value in the `permanentid` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).

* The indexed item's `language` field is English.

> **Tip**
>
> By default, only English content is supported.
> However, Coveo offers beta support for languages other than English.
> Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

> **Note**
>
> To verify an item's `language` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).

### Supported file types

Coveo has tested and supports the following file types for use with the [model](https://docs.coveo.com/en/1012/):

* HTML
* PDF

> **Notes**
>
> * Other text-based file types that are [supported at ingestion](https://docs.coveo.com/en/1689/) but aren't listed above may also provide good results.
>   However, they're not officially supported by Coveo for use with the model.
>
> * PDFs with single-column uninterrupted paragraph-based text sections provide best results.
>   Text in tables and multi-column layouts is embedded, but parsing that text is less predictable.
>
> * You can use the [optical character recognition (OCR) source feature](https://docs.coveo.com/en/2937/) to extract text from images in PDFs and image files.
>   Otherwise, text from images won't be embedded or used by the model.
>
> * Video files aren't supported.

## Best practices

This section describes best practices for choosing the content to use for the [model](https://docs.coveo.com/en/1012/) and for optimizing that content for best results.

### Choose your content

When deciding on the content to use, consider the following:

* Prioritize content that's designed to answer questions, such as knowledge base articles, support documents, FAQs, community answers, and product documentation.

* Prioritize content that's written in a conversational tone.

* Prioritize shorter documents that are focused on a single topic.

> **Note**
>
> Avoid using very long documents that cover multiple topics.
> This may result in text being embedded as semantically similar, even though the context or topic is different.

* Content should be written using a single language.

> **Tip**
>
> By default, only English content is supported.
> However, Coveo offers beta support for languages other than English.
> Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

> **Note**
>
> Other languages are supported as a beta.
> For a list of beta-supported languages, see [Supported languages for machine learning models](https://docs.coveo.com/en/1956#supported-languages-for-machine-learning-models).

* Avoid multiple documents with similar content.

* Choose a reasonably sized dataset to keep the content focused and current.

> **Important**
>
> Keep the [model embedding limits](#model-embedding-limits) in mind when choosing the content for your model.
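
When sizing content against the embedding limits, a rough estimate of how many chunks an item needs can help. The sketch below uses the limits listed under [Model embedding limits](#model-embedding-limits) (11 chunks per item, 500 words per chunk). It assumes a simple sliding window with a fixed 20% overlap. That is an assumption made for illustration: this article only states that the overlap can be up to 20%, and the model's actual chunking strategy may differ. The function names are hypothetical.

```python
import math

CHUNK_WORDS = 500         # words per chunk (limit stated in this article)
MAX_CHUNKS_PER_ITEM = 11  # chunks per item (limit stated in this article)
OVERLAP_PERCENT = 20      # ASSUMPTION: fixed overlap; the article says "up to 20%"

# Number of new (non-overlapping) words that each additional chunk contributes.
STRIDE_WORDS = CHUNK_WORDS * (100 - OVERLAP_PERCENT) // 100


def estimate_chunks(word_count):
    """Estimate the chunks needed for an item under the sliding-window assumption."""
    if word_count <= 0:
        return 0
    if word_count <= CHUNK_WORDS:
        return 1
    return 1 + math.ceil((word_count - CHUNK_WORDS) / STRIDE_WORDS)


def max_embedded_words():
    """Approximate number of words covered before the per-item chunk limit is hit."""
    return CHUNK_WORDS + (MAX_CHUNKS_PER_ITEM - 1) * STRIDE_WORDS


def is_fully_embedded(word_count):
    """True if the estimated chunk count fits within the per-item limit."""
    return estimate_chunks(word_count) <= MAX_CHUNKS_PER_ITEM


for words in (300, 1200, 4500, 6000):
    print(words, estimate_chunks(words), is_fully_embedded(words))
```

Under these assumptions, about 4500 words fit within 11 chunks, which is in the same range as the "more than 4000 words" guidance given with the limits. Items estimated to exceed the limit are candidates for splitting into shorter, single-topic documents.
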
### Optimize your content

To optimize your content for the [model](https://docs.coveo.com/en/1012/), follow these best practices:

* Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the [`body`](https://docs.coveo.com/en/3313/) data when the [items](https://docs.coveo.com/en/210/) are [indexed](https://docs.coveo.com/en/204/).

* Review the `title` and `body` data and source [mappings](https://docs.coveo.com/en/217/) to make sure that the title and body contain the desired content.

> **Notes**
>
> * For Web and Sitemap sources, you can use a [web scraping configuration](https://docs.coveo.com/en/mc1f3573/) to remove boilerplate content and select the web content to index.
>
> * For all other types of sources, you can edit an item's [title](https://docs.coveo.com/en/1839#item-title-selection) and [body](https://docs.coveo.com/en/1847#add-or-edit-an-item-body-mapping) [mappings](https://docs.coveo.com/en/217/) to make sure that the title and body contain the desired content.

## Model embedding limits

The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) converts your content's title and body text into numerical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/ncc87383/).
It does this by breaking the text up into smaller segments called chunks, and each chunk is mapped as a distinct vector.
For more information, see [Embeddings](https://docs.coveo.com/en/n9de0370#embeddings).

The model is subject to the following embedding limits based on the selected [chunking strategy](https://docs.coveo.com/en/oaie5476#set-the-chunking-strategy):

> **Note**
>
> For a given [model](https://docs.coveo.com/en/1012/), the same chunking strategy is used for all sources and item types.
* Up to 15 million items or 50 million chunks

> **Note**
>
> The maximum number of items depends on the [item allocation of your product plan](https://docs.coveo.com/en/l2590456#generative-ai-solutions).

* 11 chunks per item

> **Important**
>
> This limit is sufficient for the SE [model](https://docs.coveo.com/en/1012/) to capture an item's main concepts.
> However, if an item is very long (for example, more than 4000 words or 5 pages), the [model](https://docs.coveo.com/en/1012/) embeds the item's text only until the 11-chunk limit is reached.
> The remaining text won't be embedded and therefore won't be used by the [model](https://docs.coveo.com/en/1012/).
>
> To make sure that each item's text is fully embedded, follow [best practices](https://docs.coveo.com/en/nbo90598#best-practices) by keeping items concise and focused.

* 500 words per chunk

> **Note**
>
> There can be an overlap of up to 20% between chunks.
> In other words, the last 20% of the previous chunk can be the first 20% of the next chunk.

## What's next?

[Create an SE model](https://docs.coveo.com/en/nb890247/).
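
Before creating the model, it may help to sanity-check a sample of items against the requirements in this article (a unique `permanentid` value and an English `language` field). The following sketch assumes the items are available as plain Python dictionaries with `permanentid` and `language` keys, and that English is recorded as the string `"English"`. Both are hypothetical conventions chosen for illustration, not a Coveo API response format, and `check_items` is a made-up helper name.

```python
from collections import Counter


def check_items(items):
    """Report requirement violations found in a list of item dictionaries.

    ASSUMPTION: each item is a dict with "permanentid" and "language" keys,
    and English is recorded as the string "English".
    """
    ids = [item.get("permanentid") for item in items]
    id_counts = Counter(pid for pid in ids if pid)
    return {
        # Positions of items with no permanentid value at all.
        "missing_permanentid": [i for i, pid in enumerate(ids) if not pid],
        # permanentid values shared by more than one item.
        "duplicate_permanentid": sorted(
            pid for pid, count in id_counts.items() if count > 1
        ),
        # Positions of items whose language isn't English.
        "non_english": [
            i for i, item in enumerate(items) if item.get("language") != "English"
        ],
    }


# Hypothetical sample data.
SAMPLE_ITEMS = [
    {"permanentid": "a1", "language": "English"},
    {"permanentid": "a1", "language": "English"},
    {"permanentid": "", "language": "French"},
]

report = check_items(SAMPLE_ITEMS)
print(report)
```

An empty list for every key suggests that the sample meets these two requirements. Any entries point to items to fix or re-index before they're selected for the model.
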