Relevance Generative Answering (RGA) model card

What’s a model card?

A model card is a document that provides a summary of key information about a Coveo Machine Learning (Coveo ML) model. It details the model’s purpose, intended use, performance, and limitations.

Model details

The Coveo Relevance Generative Answering (RGA) model provides Coveo customers' end users with generative answers to queries performed in real time. RGA is primarily designed to enhance end user search experience in Coveo-powered search solutions. The RGA model uses text data from a customer’s index to generate answers that are relevant, personalized, and secure.

  • Development team: Coveo ML team

  • Initial release date: December 14, 2023

  • Model version: 0.290.0

  • Activation: The RGA model is created and assigned to query pipelines using the Coveo Administration Console.

Intended use

  • Intended purpose: To enhance an end user’s search experience by providing a generated answer to a search query using natural language.

  • Intended output: The answer is generated using only the customer’s content that’s indexed to the Coveo Platform. The indexing process is managed by the customer’s administrator.

  • Intended users: End users of Coveo customers.

Factors

The RGA model generates an answer through a combination of factors. The first set of factors involves content retrieval: retrieving the right documents through a hybrid approach that combines lexical and semantic search, business rules, and behavioral analytics, and then retrieving the most relevant text chunks through semantic search. The second set of factors pertains to generating the answer from the prompt instructions and the retrieved text chunks.
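To make these two stages more concrete, here’s a minimal Python sketch of a retrieve-then-generate pipeline of this kind. It’s an illustration only, not Coveo’s implementation: all names are hypothetical, the hybrid fusion is reduced to a simple weighted sum, and the business rules and behavioral analytics a real pipeline applies are omitted.

```python
import math
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list[float]  # produced by an embedding model (placeholder here)

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, used for the semantic part of the hybrid score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def lexical_score(query: str, text: str) -> float:
    """Toy lexical score: fraction of query terms found in the text."""
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def retrieve_chunks(query: str, query_emb: list[float],
                    chunks: list[Chunk], k: int = 4,
                    alpha: float = 0.5) -> list[Chunk]:
    """Stage 1: hybrid retrieval blending lexical and semantic signals.
    A real system would also factor in business rules and behavioral
    analytics; those adjustments are omitted here."""
    scored = sorted(
        chunks,
        key=lambda c: alpha * lexical_score(query, c.text)
                      + (1 - alpha) * cosine(query_emb, c.embedding),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, retrieved: list[Chunk]) -> str:
    """Stage 2: assemble the grounding prompt passed to the generator."""
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(retrieved))
    return (
        "Answer the question using only the passages below. "
        "Cite passages by number; if they don't contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
```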

Training data

The data used to train the RGA model is proprietary and tailored to each Coveo customer’s organization. Specifically, the RGA model uses text data from selected content within the customer’s index. RGA breaks down each document into text chunks and creates embeddings from these chunks.
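As a rough illustration of that chunking-and-embedding step, the sketch below splits a document into overlapping chunks and pairs each chunk with a vector. The chunk size, the overlap, and the toy embedding function are assumptions made for the example, not Coveo’s actual parameters or models.

```python
import hashlib

def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping, roughly chunk_size-character chunks.
    Real chunkers typically respect sentence or section boundaries; this one
    is character-based for brevity."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str, dim: int = 8) -> list[float]:
    """Toy stand-in for an embedding model: hashes character trigrams into a
    fixed-size vector. A real system uses a learned text-embedding model."""
    vec = [0.0] * dim
    for i in range(len(chunk) - 2):
        h = int(hashlib.md5(chunk[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def index_document(text: str) -> list[tuple[str, list[float]]]:
    """Break a document into chunks and pair each chunk with its embedding,
    mirroring the preparation step described in this model card."""
    return [(chunk, embed(chunk)) for chunk in split_into_chunks(text)]
```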

The selected content and its quality directly impact the quality of the answers generated by RGA. Higher quality data results in better embeddings and more relevant answers. For optimal results, it’s crucial to adhere to the requirements specified in Coveo’s documentation and best practices.

RGA’s generated answers, along with the text chunks used to produce them, can be inspected in Snowflake.
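For instance, an administrator could query these records from Python with the Snowflake connector. The table and column names below are hypothetical placeholders; refer to the schema of the actual data share for the real ones.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection parameters come from your Snowflake account (placeholders).
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
)

try:
    cur = conn.cursor()
    # Hypothetical table and column names; check the actual shared schema.
    cur.execute(
        "SELECT question, generated_answer, retrieved_chunks "
        "FROM rga_answers ORDER BY timestamp DESC LIMIT 10"
    )
    for question, answer, chunks in cur:
        print(question, "->", answer[:80])
finally:
    conn.close()
```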

Performance

The quality and performance of the RGA model are measured by examining how well the model operates on retrieval and generation metrics, evaluated both offline and online. Coveo also uses internal performance metrics, such as response time and average build time, to measure the overall reliability of the RGA model.

  • Retrieval metrics: Coveo uses offline retrieval metrics based on publicly available datasets (for example, datasets from the MTEB benchmark) to assess the effectiveness and relevance of the RGA model under controlled conditions, without the variability of real-time user interaction.

  • Generation metrics: Coveo uses generation metrics to evaluate the performance, quality, and accuracy of the RGA model’s outputs:

    • Coveo uses offline generation metrics based on internal or public datasets (for example, the public ASQA dataset) to assess the answering capabilities of the RGA model. In practice, Coveo uses the following:

      • The weighted mean[1] metric to assess whether the RGA model:

        • refrains from answering when the retrieved chunks don’t contain the answer, using soft-negative[2] and hard-negative[3] samples.

        • answers when expected, that is, when the retrieved chunks do contain the answer.

      • The precision[4], recall[5], and F1 score[6] metrics to evaluate the RGA model’s ability to accurately cite the chunks that are used.

      • The weighted mean[1] metric to evaluate the repeatability of answers over time for the same end user’s question.

    • Coveo uses aggregated information to assess the average online answer rate of queries performed on the Coveo Platform, based on the weighted mean[1] metric.

1. A measure that calculates the average value of a set of data points, where each data point contributes to the final average in proportion to its assigned weight. This weighting is particularly useful when certain data points are more important or relevant than others, as it allows for a more accurate representation of the overall data.

2. Soft negatives are examples that are somewhat similar to the positive answers, but aren’t exactly correct.

3. Hard negatives are examples that are very similar to the positive answers, and more challenging to distinguish from positive answers.

4. Precision measures the correctness of the generated outputs.

5. Recall measures the completeness of the generated outputs, or the ability of the model to not miss any important element of the target output.

6. The harmonic mean of precision and recall, which provides a single metric to evaluate the balance between precision and recall.
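To make these definitions concrete, the sketch below computes a weighted mean over per-sample scores and the citation precision, recall, and F1 score of a generated answer against the set of chunks that actually support it. The sample data and weights are invented for the example.

```python
def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Weighted mean: each value contributes in proportion to its weight [1]."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def citation_prf(cited: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Precision [4], recall [5], and F1 score [6] of the chunks an answer
    cites, measured against the chunks that actually support the answer."""
    true_pos = len(cited & relevant)
    precision = true_pos / len(cited) if cited else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: three evaluation samples, hard negatives weighted double.
scores = [1.0, 0.0, 1.0]   # 1 = the model behaved as expected on the sample
weights = [1.0, 2.0, 1.0]  # e.g., hard-negative samples count twice as much
print(weighted_mean(scores, weights))            # 0.5
print(citation_prf({"c1", "c2"}, {"c1", "c3"}))  # (0.5, 0.5, 0.5)
```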

Limitations

Some factors might degrade the RGA model's performance:

  • Quality of training data: The effectiveness of the RGA model largely depends on the quality of the training data. If the documents that the customer selects to form the dataset are biased, non-representative, irrelevant, incomplete, or inadequate, the model’s performance will suffer. For instance, the RGA model’s performance might be suboptimal if it’s trained on non-factual documents.

  • Risk of AI hallucinations: The output of the RGA model is based on a customer’s internal content. Therefore, if a customer’s dataset contains meticulously curated informational content that’s accurate and up-to-date, the risk of AI hallucination is drastically reduced. Conversely, if a customer’s internal content contains false or inaccurate information, the risk of AI hallucination increases.

  • Language limitations: The RGA model provides generated answers only in English.

  • Indirect feedback loop: The RGA model doesn’t directly take end user feedback (thumbs up/thumbs down) into account when generating an answer. However, all behavioral signals from an end user are taken into account by other ML models, such as Automatic Relevance Tuning (ART) and Dynamic Navigation Experience (DNE), which influence the ranking of the documents from which RGA extracts text chunks to generate answers.

Best practices