Semantic Encoder (SE) content requirements and best practices
An SE model creates embeddings for your indexed item content. The model then uses these embeddings to retrieve items for a given query based on semantic similarity (see What does an SE model do?).
In the context of generating answers with Relevance Generative Answering (RGA), a list of the most relevant content is sent to the RGA model for answer generation (see RGA overview). The quality of that content has a direct impact on the quality of the embeddings, the relevance of the retrieved items, and the quality of the answers generated by RGA.
When creating an SE model, you must specify the indexed content that the model will use. This article describes the requirements and best practices regarding the content that you choose.
Note
An optimal Relevance Generative Answering (RGA) implementation includes both an RGA model and an SE model. For best results, both models should be configured to use the same content. See RGA overview for information on how RGA and SE work together in the context of a search session to generate answers.
How SE uses your content
Before deciding on the content to use, it’s important to understand how the SE model uses your content.
When an item is indexed, its content is typically mapped to the title and body fields in the Coveo index.
The SE model uses a pre-trained sentence transformer language model to convert your indexed content’s title and body text into mathematical representations (vectors) in a process called embedding.
When a user enters a query, the model references the vector space to retrieve the most relevant content.
This retrieval is based on the semantic similarity between the query and the embeddings created from your content’s title and body text.
In summary, the SE model parses only your content’s title and body text when creating the embeddings. Therefore, an item’s title and body data should be as clean, focused, and relevant as possible. The better the data, the better the embeddings, and the better the content retrieval. For more information, see What does an SE model do?.
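To build intuition for this retrieval step, the following toy sketch ranks items by the similarity of their title and body text to a query. It uses a simple bag-of-words cosine similarity rather than the pre-trained sentence transformer an actual SE model uses, so it only matches exact words, not true semantic meaning; the item IDs and text are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words Counter. A real SE model uses a
    pre-trained sentence transformer to produce dense vectors instead."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Indexing time: each item's title and body text is embedded.
items = {  # hypothetical item IDs and text
    "reset-password": "How to reset your password. Open account settings and choose Reset.",
    "billing-faq": "Billing FAQ. Update your credit card and view invoices.",
}
index = {item_id: embed(text) for item_id, text in items.items()}

# Query time: embed the query, then rank items by similarity.
query = embed("how do I reset my password")
ranked = sorted(index, key=lambda item_id: cosine(query, index[item_id]), reverse=True)
print(ranked)  # the password article ranks first
```

Because only the title and body text feed the embeddings, noise in those fields (boilerplate, unrelated topics) directly degrades this ranking.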
Requirements
-
The content you want to use must be indexed in your Coveo organization before creating the model.
You don’t have to use all the content in your index. In fact, best practices dictate that you should choose a reasonably sized dataset to keep the content focused and relevant. When creating the model, you can choose to use a subset of your indexed content by selecting the sources that contain the items, and then further filtering the source dataset. For more information, see Choose your content.
Note
If the indexed items you want to use aren’t optimized for use with the model, re-index the items with the proper configuration.
-
An indexed item must contain a unique value in the permanentid field in order for the item’s content to be embedded and used by the model.
Note
By default, an item indexed using a standard Coveo source automatically contains a value in its permanentid field that Coveo uses as the item’s unique identifier. However, if you’re using a custom source, such as a Push API source, you must make sure that the items that you want to use for answer generation contain a unique value in the permanentid field. If not, you must map unique metadata to the item’s permanentid field.
To verify that an item contains a unique value in the permanentid field, you can use the Content Browser (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console to check the item’s properties.
-
The indexed item’s language field value must be English.
Note
To verify an item’s language field value, you can use the Content Browser (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console to check the item’s properties.
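As an illustration of the permanentid requirement above, the following sketch derives a stable, unique permanentid for an item pushed through a custom source by hashing the item’s canonical URI. The helper name, field layout, and URIs are hypothetical; the requirement is only that the value be unique per item and stable across re-indexing, and the actual Push API payload shape is documented separately.

```python
import hashlib

def make_permanentid(uri):
    # Hypothetical helper: hashing a stable, canonical URI is one way to
    # obtain a unique, reproducible permanentid across re-indexing.
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()

# Illustrative metadata for an item pushed through a custom source.
item = {
    "documentId": "https://example.com/kb/reset-password",  # hypothetical URI
    "title": "How to reset your password",
    "permanentid": make_permanentid("https://example.com/kb/reset-password"),
}
print(item["permanentid"][:12])  # stable hex digest, unique per URI
```

Deriving the value from the URI (rather than, say, a timestamp) keeps it identical every time the same item is re-indexed.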
Supported file types
Coveo has tested and supports the following file types for use with the model:
-
HTML
-
PDF
Best practices
This section describes best practices for choosing the content to use for the model and for optimizing that content for best results.
Choose your content
When deciding on the content to use, consider the following:
-
Prioritize content that’s designed to answer questions such as knowledge base articles, support documents, FAQs, community answers, and product documentation.
-
Prioritize content that’s written in a conversational tone.
-
Prioritize shorter documents that are focused on a single topic.
Note
Avoid using very long documents that cover multiple topics. This may result in text being embedded as semantically similar, even though the context or topic is different.
-
Content should be written using a single language (English).
-
Avoid multiple documents with similar content.
-
Choose a reasonably sized dataset to keep the content focused and current.
Keep the model embedding limits in mind when choosing the content for your model.
Optimize your content
To optimize your content for the model, follow these best practices:
-
Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the body data when the items are indexed.
-
Review the title and body data and source mappings to make sure that the title and body fields contain the desired content.
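As a minimal sketch of the first practice, the following standalone parser drops the text inside <header>, <footer>, and <nav> elements and keeps the rest. In a real deployment this cleanup would typically happen at indexing time (for example, through a web scraping configuration or an indexing pipeline extension); the HTML here is invented for illustration.

```python
from html.parser import HTMLParser

class BoilerplateStripper(HTMLParser):
    """Collect page text while skipping <header>, <footer>, and <nav> subtrees."""
    SKIP = {"header", "footer", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

html = """
<html><body>
  <header>Acme Corp | Home | Products | Support</header>
  <article>Resetting your password takes three steps.</article>
  <footer>Copyright 2024 Acme Corp</footer>
</body></html>
"""
parser = BoilerplateStripper()
parser.feed(html)
clean = " ".join(parser.parts)
print(clean)  # only the article text remains
```

Stripping navigation and footer text this way keeps the body data focused on the content that should drive the embeddings.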
Model embedding limits
The SE model converts your content’s body text to numerical representations (vectors) in a process called embedding. It does this by breaking the text up into smaller segments called chunks, and each chunk is mapped as a distinct vector. For more information, see Embeddings.
Due to the amount of processing required for embeddings, the model is subject to the following embedding limits:
Note
The same chunking strategy is used for all sources and item types.
-
Up to 5 million items or 50 million chunks
Note
The maximum number of items depends on the item allocation of your product plan.
-
11 chunks per item
This means that for a given item, there can be a maximum of 11 chunks. This limit is sufficient for the SE model to capture an item’s main concepts through embeddings. However, if an item is very long (for example, more than 4,000 words or 5 pages), the model embeds the item’s text only until the 11-chunk limit is reached. The remaining text isn’t embedded and therefore isn’t used by the model. Use shorter, more focused items to make sure that an item’s entire text is embedded.
-
500 words per chunk
Note
There can be an overlap of up to 20% between chunks. In other words, the last 20% of the previous chunk can be the first 20% of the next chunk.
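The arithmetic behind these limits can be sketched as follows. With 500-word chunks, a 20% overlap (so each new chunk adds about 400 new words), and an 11-chunk cap, roughly the first 4,500 words of an item can be embedded. The exact chunk boundaries are internal to the model, so treat this only as an approximation of the documented limits.

```python
def chunk_words(words, chunk_size=500, overlap=0.2, max_chunks=11):
    """Sketch of the documented limits: 500-word chunks, up to 20% overlap,
    at most 11 chunks per item. Real chunk boundaries are internal to the
    SE model; this only illustrates the arithmetic."""
    stride = int(chunk_size * (1 - overlap))  # 400 new words per chunk
    chunks = []
    for start in range(0, len(words), stride):
        if len(chunks) == max_chunks:
            break  # text beyond this point is never embedded
        chunks.append(words[start:start + chunk_size])
    return chunks

words = [f"w{i}" for i in range(6000)]  # a 6,000-word item
chunks = chunk_words(words)
embedded = {w for chunk in chunks for w in chunk}
print(len(chunks), len(embedded))  # 11 chunks cover the first 4,500 words
```

Under these assumptions, anything past word 4,500 of a long item never reaches the model, which is why shorter, single-topic items are recommended.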