Semantic Encoder (SE) content requirements and best practices

Important

A Semantic Encoder (SE) model creates embeddings for your indexed item content. The embeddings are then used by the SE model to retrieve items for a given query based on semantic similarity (see What does an SE model do?).

When generating answers with Relevance Generative Answering (RGA) or retrieving passages with Passage Retrieval (CPR), a list of the most relevant content is sent to the RGA or CPR model (see RGA overview or CPR overview). The quality of that content directly impacts the quality of the embeddings, the relevance of the retrieved items, and the quality of the answers generated by RGA or the passages retrieved by CPR.

When creating an SE model, you must specify the indexed content that the model will use. This article describes the requirements and best practices regarding the content that you choose.

Note

When an SE model is used in a Relevance Generative Answering (RGA) or Passage Retrieval (CPR) implementation, the same SE model can be used with multiple RGA or CPR models. The RGA and CPR models must be configured to use the same content as the SE model.

How SE uses your content

Before deciding on the content to use, it’s important to understand how the SE model uses your content.

When an item is indexed, its content is mapped to the title and body fields in the Coveo index. The SE model uses a pre-trained sentence transformer language model to convert the title and body text of your indexed content into mathematical representations (vectors) in a process called embedding. When a user enters a query, the model uses this vector space to retrieve the most relevant content based on semantic similarity.

In summary, the SE model parses only your content’s title and body text when creating the embeddings. Therefore, an item’s title and body data should be as clean, focused, and relevant as possible. The better the data, the better the embeddings, and the better the content retrieval. For more information, see What does an SE model do?
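Conceptually, this vector-space retrieval can be sketched as follows. This is a minimal, illustrative sketch only: the hard-coded vectors stand in for the embeddings a sentence transformer model would produce from each item's title and body text, and the item titles are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    # Semantic similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the embeddings a sentence transformer would create
# from each item's title and body text (illustrative values only).
index = {
    "How to reset your password": np.array([0.9, 0.1, 0.0]),
    "Quarterly sales report":     np.array([0.0, 0.2, 0.9]),
}

query_vector = np.array([0.8, 0.2, 0.1])  # embedding of the user's query

# Retrieve items ranked by semantic similarity to the query.
ranked = sorted(index.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
print(ranked[0][0])  # → How to reset your password
```

Because retrieval depends entirely on these vectors, cleaner title and body text yields embeddings that separate topics more sharply.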

Note

The SE model uses only the content in an item’s body and title fields. Content in other searchable fields, such as author, source, and date, isn’t embedded by the model and therefore isn’t considered for vector-based content retrieval.

Requirements

Supported file types

Coveo has tested and supports the following file types for use with the model:

  • HTML

  • PDF

Notes
  • Other text-based file types that are supported at ingestion may also provide good results; however, they're not officially supported by Coveo for use with the model.

  • PDFs with single-column, uninterrupted, paragraph-based text sections provide the best results. Text in tables and multi-column layouts is embedded, but parsing that text is less predictable.

  • You can use the optical character recognition (OCR) source feature to extract text from images in PDFs and image files. Otherwise, text from images won’t be embedded or used by the model.

  • Video files aren’t supported.

Best practices

This section describes best practices for choosing the content to use for the model and for optimizing that content for best results.

Choose your content

When deciding on the content to use, consider the following:

  • Prioritize content that’s designed to answer questions, such as knowledge base articles, support documents, FAQs, community answers, and product documentation.

  • Prioritize content that’s written in a conversational tone.

  • Prioritize shorter documents that are focused on a single topic.

    Note

    Avoid using very long documents that cover multiple topics. This may result in text being embedded as semantically similar, even though the context or topic is different.

  • Content should be written in a single language.

    Tip

    By default, only English content is supported. However, Coveo offers beta support for languages other than English. For the list of beta-supported languages, see Supported languages for machine learning models, and learn more about multilingual content retrieval and answer generation.

  • Avoid multiple documents with similar content.

  • Choose a reasonably sized dataset to keep the content focused and current.

Important

Keep the model embedding limits in mind when choosing the content for your model.

Optimize your content

To optimize your content for the model, follow these best practices:

  • Ensure that boilerplate content, such as headers, footers, and extra navigation elements, is removed from the body data when the items are indexed.

  • Review the title and body data and source mappings to make sure that the title and body contain the desired content.

Notes
  • For Web and Sitemap sources, you can use a web scraping configuration to remove boilerplate content and select the web content to index.

  • For all other types of sources, you can edit an item’s title and body mappings to make sure that the title and body contain the desired content.
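To illustrate what boilerplate removal accomplishes, here's a minimal sketch using Python's standard library. The tag names treated as boilerplate are assumptions for the example; in practice, this cleanup is done through your source's web scraping configuration or mappings rather than custom code.

```python
from html.parser import HTMLParser

class BoilerplateStripper(HTMLParser):
    """Collect text while skipping assumed boilerplate elements
    (header, footer, nav) so only the main content remains."""
    SKIP = {"header", "footer", "nav", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

html = ("<nav>Home > Docs</nav>"
        "<article>Reset your password in Settings.</article>"
        "<footer>© Example</footer>")
stripper = BoilerplateStripper()
stripper.feed(html)
print(" ".join(stripper.parts))  # → Reset your password in Settings.
```

Without this cleanup, the navigation and footer text would be embedded along with the article, diluting the vectors with text that's irrelevant to the item's topic.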

Model embedding limits

The SE model converts your content’s title and body text into mathematical representations (vectors) in a process called embedding. To do so, it breaks the text up into smaller segments called chunks, and each chunk is embedded as a distinct vector. For more information, see Embeddings.

The model is subject to the following embedding limits based on the selected chunking strategy:

Note

For a given model, the same chunking strategy is used for all sources and item types.

  • Up to 15 million items or 50 million chunks

    Note

    The maximum number of items depends on the item allocation of your product plan.

  • 11 chunks per item

    Important

    This limit is sufficient for the SE model to capture an item’s main concepts. However, if an item contains a lot of text (for example, more than 4,000 words or 5 pages), the model embeds the item’s text until the 11-chunk limit is reached. The remaining text won’t be embedded and therefore won’t be used by the model.

    To make sure that each item’s text is fully embedded, follow best practices by keeping items concise and focused.

  • 500 words per chunk

    Note

    There can be an overlap of up to 20% between chunks. In other words, the last 20% of the previous chunk can be the first 20% of the next chunk.
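The per-item limits above can be sketched as follows. This is an illustrative reading of the documented limits (500-word chunks, up to 20% overlap, 11 chunks embedded per item), not Coveo's actual chunking implementation.

```python
def chunk_words(words, chunk_size=500, overlap=0.2):
    """Split a list of words into chunks of at most `chunk_size` words,
    where each chunk repeats the last `overlap` fraction of the previous
    one. An illustrative sketch, not Coveo's actual implementation."""
    step = int(chunk_size * (1 - overlap))  # 400 new words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1000)]   # a 1,000-word item
chunks = chunk_words(words)
embedded = chunks[:11]   # only the first 11 chunks per item are embedded
print(len(chunks))       # → 3
print(chunks[1][0])      # → w400 (chunk 2 starts inside chunk 1's last 20%)
```

At roughly 400 new words per chunk after overlap, the 11-chunk limit corresponds to the ~4,000-word ceiling mentioned above, which is why keeping items concise ensures they're fully embedded.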

What’s next?