---
title: Semantic Encoder (SE) content requirements and best practices
slug: nbo90598
canonical_url: https://docs.coveo.com/en/nbo90598/
collection: leverage-machine-learning
source_format: adoc
---
# Semantic Encoder (SE) content requirements and best practices
> **Important**
>
> * A [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is only supported for use as part of a [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) implementation.
> 
> * The [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is available as a paid product extension.
> Contact [Coveo Sales](https://www.coveo.com/en/contact) or your Account Manager to add SE to your [organization](https://docs.coveo.com/en/185/) license.

A [Semantic Encoder (SE)](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) creates [embeddings](https://docs.coveo.com/en/n9de0370#embeddings) for your [indexed](https://docs.coveo.com/en/204/) [item](https://docs.coveo.com/en/210/) content.
The [embeddings](https://docs.coveo.com/en/ncc87383/) are then used by the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) to retrieve items for a given query based on semantic similarity (see [What does an SE model do?](https://docs.coveo.com/en/nb890247#what-does-an-se-model-do)).

In the context of generating answers using [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or retrieving passages using [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/), a list of the most relevant content is sent to the [RGA](https://docs.coveo.com/en/nbtb6010/) or [CPR](https://docs.coveo.com/en/oaie9196/) [model](https://docs.coveo.com/en/1012/) for answer generation or passage retrieval (see [RGA overview](https://docs.coveo.com/en/n9de0370#rga-overview) or [CPR overview](https://docs.coveo.com/en/oaie5277#cpr-overview)).
The quality of that content has a direct impact on the quality of the embeddings, the relevance of the retrieved items, and the quality of the answers generated by RGA or the passages retrieved by CPR.

When [creating an SE model](https://docs.coveo.com/en/nb890247/), you must specify the indexed content that the model will use.
This article describes the requirements and best practices for the content that you choose.

> **Note**
>
> When an [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) is used in a [Relevance Generative Answering (RGA)](https://docs.coveo.com/en/n9de0370/) or [Passage Retrieval (CPR)](https://docs.coveo.com/en/oaie5277/) implementation, the same [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) can be used with multiple [RGA](https://docs.coveo.com/en/nbtb6010/) or [CPR](https://docs.coveo.com/en/oaie9196/) models.
> The [RGA](https://docs.coveo.com/en/nbtb6010/) and [CPR](https://docs.coveo.com/en/oaie9196/) models must be configured to use the same content as the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/).
> 
> See [RGA overview](https://docs.coveo.com/en/n9de0370#rga-overview) or [CPR overview](https://docs.coveo.com/en/oaie5277#cpr-overview) for information on how SE works with RGA or CPR in the context of a search session.

## How SE uses your content

Before deciding on the content to use, it's important to understand how the [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses your content.

When an [item](https://docs.coveo.com/en/210/) is [indexed](https://docs.coveo.com/en/204/), its content is [mapped](https://docs.coveo.com/en/217/) to the [`title`](https://docs.coveo.com/en/1839#item-title-selection) and [`body`](https://docs.coveo.com/en/1847#item-body) fields in the Coveo index.
The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses a pre-trained sentence transformer language model to convert your indexed content's title and body text into mathematical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/n9de0370#embeddings).
When a user enters a [query](https://docs.coveo.com/en/231/), the model uses the vector space to retrieve the most relevant content.
This retrieval is based on semantic similarity using [embeddings](https://docs.coveo.com/en/ncc87383/), which are created from your content's title and body text.

In summary, the [SE](https://docs.coveo.com/en/nbtb0041/) model parses only your content's title and body text when creating the embeddings.
Therefore, an item's title and [body](https://docs.coveo.com/en/3313/) data should be as clean, focused, and relevant as possible.
The better the data, the better the embeddings, and the better the content retrieval.
For more information, see [What does an SE model do?](https://docs.coveo.com/en/nb890247#what-does-an-se-model-do)

> **Note**
>
> The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) uses only the content in an item's `body` and `title` fields.
> Content in other searchable fields, such as `author`, `source`, and `date`, isn't embedded by the [model](https://docs.coveo.com/en/1012/) and therefore isn't considered for vector-based content retrieval.
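
To make the idea of retrieval by semantic similarity more concrete, the following Python sketch ranks items by cosine similarity between a query vector and item vectors. It's an illustration only: the vectors, item names, and function names are hypothetical, and cosine similarity is assumed here as a common similarity measure, not as a description of how the SE model is implemented.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, item_vectors):
    """Return item IDs sorted from most to least similar to the query."""
    scores = {
        item_id: cosine_similarity(query_vector, vector)
        for item_id, vector in item_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 3-dimensional embeddings of each item's title and body text.
# Real embeddings typically have many more dimensions.
item_vectors = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.1, 0.9, 0.1],
    "install-guide": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user query about a forgotten password.
query_vector = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query_vector, item_vectors)
print(ranking)
```

In this toy example, the item whose vector points in nearly the same direction as the query vector ranks first. Text that never reaches the `title` or `body` fields has no vector, so it can't be retrieved this way.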

## Requirements

* The content you want to use must be [indexed](https://docs.coveo.com/en/204/) in your Coveo [organization](https://docs.coveo.com/en/185/) before creating the [model](https://docs.coveo.com/en/1012/).

You don't have to use all the content in your index.
In fact, best practices dictate that you should choose a reasonably sized dataset to keep the content focused and relevant.
When creating the model, you can choose to use a subset of your indexed content by selecting the [sources](https://docs.coveo.com/en/246/) that contain the [items](https://docs.coveo.com/en/210/), and then further filtering the source dataset.
For more information, see [Choose your content](#choose-your-content).

> **Note**
>
> If the indexed items you want to use aren't [optimized for use](#optimize-your-content) with the model, re-index the items with the proper configuration.

* An indexed item must contain a unique value in the [`permanentid`](https://docs.coveo.com/en/1913/) field in order for the item's content to be embedded and used by the model.

> **Note**
>
> By default, an item indexed using a standard Coveo source automatically contains a value in its `permanentid` field that Coveo uses as the item's unique identifier.
> However, if you're using a custom source, such as Push API, you must make sure that the items that you want to use contain a unique value in the `permanentid` field.
> If not, you must [map unique metadata to the item's `permanentid` field](https://docs.coveo.com/en/1913#taking-advantage-of-the-permanentid-field).
> 
> To verify if an item contains a unique value in the `permanentid` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).

* An indexed item's `language` field must be set to English.

> **Tip**
>
> By default, only English content is supported.
> However, Coveo offers beta support for languages other than English.
> Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

> **Note**
>
> To verify an item's `language` field, you can use the [**Content Browser**](https://platform.cloud.coveo.com/admin/#/orgid/content/browser/) ([platform-ca](https://platform-ca.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-eu](https://platform-eu.cloud.coveo.com/admin/#/orgid/content/browser/) | [platform-au](https://platform-au.cloud.coveo.com/admin/#/orgid/content/browser/)) page of the [Coveo Administration Console](https://docs.coveo.com/en/183/) to [check the item's properties](https://docs.coveo.com/en/1712/).
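
If you have many items to review, the `permanentid` and `language` requirements above can be checked in bulk. The following Python sketch assumes that you already have the item properties available as a list of dictionaries. The record shape, the `uri` key, and the function names are hypothetical and aren't part of any Coveo API.

```python
from collections import Counter


def is_english(language):
    """Accept either a single string or a list of language values."""
    values = [language] if isinstance(language, str) else (language or [])
    return any(str(value).strip().lower() == "english" for value in values)


def find_ineligible_items(items):
    """Group item URIs by the requirement they fail."""
    # Count how many items share each non-empty permanentid value.
    id_counts = Counter(
        item["permanentid"] for item in items if item.get("permanentid")
    )
    problems = {
        "missing_permanentid": [],
        "duplicate_permanentid": [],
        "not_english": [],
    }
    for item in items:
        permanent_id = item.get("permanentid")
        if not permanent_id:
            problems["missing_permanentid"].append(item["uri"])
        elif id_counts[permanent_id] > 1:
            problems["duplicate_permanentid"].append(item["uri"])
        if not is_english(item.get("language")):
            problems["not_english"].append(item["uri"])
    return problems


# Hypothetical records, for example copied from the Content Browser.
items = [
    {"uri": "https://example.com/a", "permanentid": "a1", "language": ["English"]},
    {"uri": "https://example.com/b", "permanentid": "a1", "language": "English"},
    {"uri": "https://example.com/c", "permanentid": "", "language": ["French"]},
]

problems = find_ineligible_items(items)
print(problems)
```

In this sample, the first two items share a `permanentid` value, and the third item has no `permanentid` value and isn't in English.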

### Supported file types

Coveo has tested and supports the following file types for use with the [model](https://docs.coveo.com/en/1012/):

* HTML

* PDF

> **Notes**
>
> * Other text-based file types that are [supported at ingestion](https://docs.coveo.com/en/1689/) but aren't listed above may also provide good results. However, they're not officially supported by Coveo for use with the model.
> 
> * PDFs with single-column, uninterrupted, paragraph-based text provide the best results.
> Text in tables and multi-column layouts is embedded, but it's parsed less predictably.
> 
> * You can use the [optical character recognition (OCR) source feature](https://docs.coveo.com/en/2937/) to extract text from images in PDFs and image files.
> Otherwise, text from images won't be embedded or used by the model.
> 
> * Video files aren't supported.

## Best practices

This section describes best practices when it comes to choosing the content to use for the [model](https://docs.coveo.com/en/1012/) and how to optimize the content for best results.


### Choose your content

When deciding on the content to use, consider the following:

* Prioritize content that's designed to answer questions, such as knowledge base articles, support documents, FAQs, community answers, and product documentation.

* Prioritize content that's written in a conversational tone.

* Prioritize shorter documents that are focused on a single topic.

> **Note**
>
> Avoid using very long documents that cover multiple topics.
> This may result in text being embedded as semantically similar, even though the context or topic is different.

* Content should be written in a single language.

> **Tip**
>
> By default, only English content is supported.
> However, Coveo offers beta support for languages other than English.
> Learn more about [multilingual content retrieval and answer generation](https://docs.coveo.com/en/p5ne0024/).

> **Note**
>
> For a list of beta-supported languages, see [Supported languages for machine learning models](https://docs.coveo.com/en/1956#supported-languages-for-machine-learning-models).

* Avoid multiple documents with similar content.

* Choose a reasonably sized dataset to keep the content focused and current.

> **Important**
>
> Keep the [model embedding limits](#model-embedding-limits) in mind when choosing the content for your model.

### Optimize your content

To optimize your content for the [model](https://docs.coveo.com/en/1012/), follow these best practices:

* Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the [`body`](https://docs.coveo.com/en/3313/) data when the [items](https://docs.coveo.com/en/210/) are [indexed](https://docs.coveo.com/en/204/).
* Review the `title` and `body` data and source [mappings](https://docs.coveo.com/en/217/) to make sure that the title and body contain the desired content.

> **Notes**
>
> * For Web and Sitemap sources, you can use a [web scraping configuration](https://docs.coveo.com/en/mc1f3573/) to remove boilerplate content and select the web content to index.
> 
> * For all other types of sources, you can edit an item's [title](https://docs.coveo.com/en/1839#item-title-selection) and [body](https://docs.coveo.com/en/1847#add-or-edit-an-item-body-mapping) [mappings](https://docs.coveo.com/en/217/) to make sure that the title and body contain the desired content.
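
To illustrate the effect of removing boilerplate, the following Python sketch extracts the text of a sample HTML page while skipping common boilerplate elements. It's a minimal illustration that uses only the standard library. It isn't how Coveo processes content, and the list of tags and the class and function names are assumptions. For indexed content, use the web scraping configuration or the mappings described above.

```python
from html.parser import HTMLParser

# Assumed set of elements that usually hold boilerplate rather than content.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that isn't nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when no boilerplate element is open.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def extract_main_text(html):
    """Return the page text without the assumed boilerplate elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_page = (
    "<html><body>"
    "<header>Products | Support | Sign in</header>"
    "<main><h1>Reset your password</h1><p>Open Settings.</p></main>"
    "<footer>Terms of use</footer>"
    "</body></html>"
)

main_text = extract_main_text(sample_page)
print(main_text)
```

Comparing the extracted text with the original page gives a rough preview of what a cleaner `body` would contain: the article's heading and instructions, without the menu and footer text that would otherwise be embedded alongside them.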

## Model embedding limits
The [SE](https://docs.coveo.com/en/nbtb0041/) [model](https://docs.coveo.com/en/1012/) converts your content's title and body text into numerical representations ([vectors](https://docs.coveo.com/en/nccf9008/)) in a process called [embedding](https://docs.coveo.com/en/ncc87383/).
To do this, it splits the text into smaller segments called chunks and maps each chunk to a distinct vector.
For more information, see [Embeddings](https://docs.coveo.com/en/n9de0370#embeddings).

The model is subject to the following embedding limits based on the selected [chunking strategy](https://docs.coveo.com/en/oaie5476#set-the-chunking-strategy):

> **Note**
>
> For a given [model](https://docs.coveo.com/en/1012/), the same chunking strategy is used for all sources and item types.

* Up to 15 million items or 50 million chunks

> **Note**
>
> The maximum number of items depends on the [item allocation of your product plan](https://docs.coveo.com/en/l2590456#generative-ai-solutions).

* 11 chunks per item

> **Important**
>
> This limit is sufficient for the SE [model](https://docs.coveo.com/en/1012/) to capture an item's main concepts.
> However, if an item contains a lot of text, such as more than 4,000 words or 5 pages, the [model](https://docs.coveo.com/en/1012/) embeds the item's text only until the 11-chunk limit is reached.
> The remaining text won't be embedded and therefore won't be used by the [model](https://docs.coveo.com/en/1012/).
> 
> To make sure that each item's text is fully embedded, follow [best practices](https://docs.coveo.com/en/nbo90598#best-practices) by keeping items concise and focused.

* 500 words per chunk

> **Note**
>
> There can be an overlap of up to 20% between chunks.
> In other words, the last 20% of the previous chunk can be the first 20% of the next chunk.
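
The limits above can be combined into a rough estimate of how much of an item's text can be embedded. The following Python sketch assumes that every chunk is filled to the 500-word maximum and that consecutive chunks overlap by a fixed ratio. These are simplifying assumptions, and the function and constant names are hypothetical. Actual chunks may be shorter, depending on the chunking strategy, so the practical threshold can be lower than this estimate.

```python
import math

MAX_CHUNKS_PER_ITEM = 11
MAX_WORDS_PER_CHUNK = 500
MAX_OVERLAP_RATIO = 0.20
MAX_TOTAL_CHUNKS = 50_000_000


def stride_words(overlap_ratio):
    """New words contributed by each chunk after the first one."""
    return MAX_WORDS_PER_CHUNK - round(MAX_WORDS_PER_CHUNK * overlap_ratio)


def max_embedded_words(overlap_ratio=MAX_OVERLAP_RATIO):
    """Upper bound on the distinct words embedded for a single item."""
    return MAX_WORDS_PER_CHUNK + (MAX_CHUNKS_PER_ITEM - 1) * stride_words(overlap_ratio)


def chunks_needed(word_count, overlap_ratio=MAX_OVERLAP_RATIO):
    """Estimated number of full-size chunks needed to cover an item."""
    if word_count <= 0:
        return 0
    if word_count <= MAX_WORDS_PER_CHUNK:
        return 1
    remaining = word_count - MAX_WORDS_PER_CHUNK
    return 1 + math.ceil(remaining / stride_words(overlap_ratio))


def is_fully_embedded(word_count, overlap_ratio=MAX_OVERLAP_RATIO):
    """True if the estimate stays within the per-item chunk limit."""
    return chunks_needed(word_count, overlap_ratio) <= MAX_CHUNKS_PER_ITEM


# If every item used all 11 chunks, the 50 million chunk limit would be
# reached before the 15 million item limit.
MAX_ITEMS_AT_FULL_CHUNKING = MAX_TOTAL_CHUNKS // MAX_CHUNKS_PER_ITEM

print(max_embedded_words())         # with 20% overlap
print(max_embedded_words(0.0))      # with no overlap
print(chunks_needed(6000), is_fully_embedded(6000))
print(MAX_ITEMS_AT_FULL_CHUNKING)
```

Under these assumptions, an item is fully embedded only if it has at most 4,500 words with a 20% overlap, or 5,500 words with no overlap. A 6,000-word item would need an estimated 15 chunks, so its final portion wouldn't be embedded.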

## What's next?

[Create an SE model](https://docs.coveo.com/en/nb890247/).