How to Evaluate RAG If You Don’t Have Ground Truth Data
Vector similarity search threshold, synthetic data generation, LLM-as-a-judge, and frameworks
Evaluating a Retrieval-Augmented Generation (RAG) model is much easier when you have ground truth data against which to compare. But what if you don’t? That’s where things get a bit trickier. However, even in the absence of ground truth, there are still ways to assess how well your RAG system is performing. Below, we’ll walk through three effective strategies, ways to create a ground truth dataset from scratch, metrics you can use to evaluate when you do have a dataset, and existing frameworks that can help you with this process.
Two types of RAG evaluations: Retrieval evaluation and Generation evaluation
Each strategy below will be tagged as either retrieval evaluation, generation evaluation, or both.
How to evaluate RAG if you don’t have ground truth data?
Vector Similarity Search Threshold
Type: Retrieval Evaluation
If you’re working with a vector database like Pinecone, you’re probably familiar with the idea of vector similarity. Essentially, the database retrieves information based on how close the vectors of your query are to the vectors of potential results. Even without a “correct” answer to measure against, you can still lean on metrics like cosine similarity to gauge the quality of the retrieved documents.
Cosine Distance. [Image provided by the author]
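To make the score concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_similarity` is ours for illustration and is not part of any vector database client; the vectors are the same toy values used in the Pinecone example in this section.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 1.5]
for candidate in ([1.0, 1.5], [2.0, 1.0], [0.1, 3.0]):
    # Values close to 1 mean the vectors point in nearly the same direction
    print(candidate, round(cosine_similarity(query, candidate), 6))
```

Because the score depends only on direction and not on vector length, a document vector that is a scaled copy of the query still scores close to 1.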
For example, Pinecone will return cosine similarity values that show how close each result is to your query.
from pinecone import Pinecone, ServerlessSpec

# Assumed setup: the API key and index name below are placeholders
pc = Pinecone(api_key="YOUR_API_KEY")
index_name = "example-index"

# Create a Pinecone (vector database) index that uses cosine similarity
pc.create_index(
    name=index_name,
    dimension=2,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to the index (the example vectors are assumed to be upserted already)
index = pc.Index(index_name)

# Retrieve the top 3 closest vectors
query_results = index.query(
    namespace="example-namespace1",
    vector=[1.0, 1.5],
    top_k=3,
    include_values=True
)

# query_results
# The "score" here is the cosine similarity value
# {'matches': [{'id': 'vec1', 'score': 1.0, 'values': [1.0, 1.5]},
#              {'id': 'vec2', 'score': 0.868243158, 'values': [2.0, 1.0]},
#              {'id': 'vec3', 'score': 0.850068152, 'values': [0.1, 3.0]}],
Because the similarity score is exposed, you can set a pass/fail threshold on retrieved documents. A higher threshold (such as 0.8 or above) is a stricter requirement and lets fewer documents through, while a lower threshold will bring in more data, which could be helpful or just noisy.
This process isn’t about finding a perfect number right away; it’s about trial and error. You’ll know you’ve hit the sweet spot when the results consistently feel useful for your specific application.
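As a rough sketch of that trial-and-error loop, the snippet below grades the matches from the example response against a few candidate thresholds. The function name `grade_retrieval` and the threshold values are illustrative assumptions, not part of the Pinecone API.

```python
def grade_retrieval(matches, threshold=0.8):
    """Split matches into passing and failing by similarity score."""
    passed = [m for m in matches if m["score"] >= threshold]
    failed = [m for m in matches if m["score"] < threshold]
    pass_rate = len(passed) / len(matches) if matches else 0.0
    return passed, failed, pass_rate

# Scores copied from the query_results example above
matches = [
    {"id": "vec1", "score": 1.0},
    {"id": "vec2", "score": 0.868243158},
    {"id": "vec3", "score": 0.850068152},
]

# Try several thresholds and see how many results survive each one
for threshold in (0.8, 0.86, 0.9):
    passed, failed, pass_rate = grade_retrieval(matches, threshold)
    print(threshold, [m["id"] for m in passed], f"pass rate = {pass_rate:.2f}")
```

Tracking the pass rate over a batch of real queries, rather than a single one, gives a better feel for whether a threshold is too strict or too loose.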
Using Multiple LLMs to Judge Responses
Type: Retrieval + Generation Evaluation
Another creative way to evaluate your RAG system is by leveraging multiple LLMs to judge responses. Even though LLMs can’t provide a perfect answer when ground truth data is missing, you can still use their feedback to compare the quality of the responses.
By comparing responses across different LLMs and seeing how they rank them, you can gauge the overall quality of the retrievals and generations. It’s not perfect, but it’s a creative way to get multiple perspectives on the quality of your system’s output.
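One simple way to combine the judges’ opinions is to average the rank each response receives. The sketch below assumes each judge LLM has already returned a best-to-worst ordering of the same candidate responses; the judge names, response IDs, and rankings are made up for illustration.

```python
from collections import defaultdict

def mean_ranks(rankings):
    """Average the position (1 = best) each response gets across judges."""
    positions = defaultdict(list)
    for ranking in rankings.values():
        for position, response_id in enumerate(ranking, start=1):
            positions[response_id].append(position)
    return {rid: sum(p) / len(p) for rid, p in positions.items()}

# Hypothetical output: each judge LLM ranks the same three candidate responses
rankings = {
    "judge_a": ["resp_2", "resp_1", "resp_3"],
    "judge_b": ["resp_2", "resp_3", "resp_1"],
    "judge_c": ["resp_1", "resp_2", "resp_3"],
}

scores = mean_ranks(rankings)
# Lower mean rank means the judges preferred that response overall
for response_id, rank in sorted(scores.items(), key=lambda kv: kv[1]):
    print(response_id, round(rank, 2))
```

A response that most judges place near the top is probably a good one; large disagreement between judges is itself a signal worth inspecting by hand.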
Human-in-the-Loop Feedback: Involving the Experts
Type: Retrieval + Generation Evaluation
Sometimes, the best way to evaluate a system is the old-fashioned way — by asking humans for their judgment. Getting feedback from domain experts can provide insights that even the best models can’t match.
Setting Up Rating Criteria
To make human feedback more reliable, it helps to establish clear and consistent rating criteria. You might ask your reviewers to rate things like:
Relevance: Does the retrieved information address the query? (Retrieval evaluation)
Correctness: Is the content factually accurate? (Retrieval evaluation)
Fluency: Does it read well, or does it feel awkward or forced? (Generation evaluation)
Completeness: Does it cover the question fully or leave gaps? (Retrieval + Generation evaluation)
With these criteria in place, you can get a more structured sense of how well your system is performing.
Getting a Baseline
One smart way to evaluate the quality of your human feedback is to check how well different reviewers agree with each other. You can use metrics like Pearson correlation to see how closely their judgements align. If your reviewers disagree a lot, it might mean your criteria aren’t clear enough. It could also be a sign that the task is more subjective than you anticipated.
Pearson correlation coefficient. [Image provided by author]
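Here is a small sketch of that agreement check: a plain-Python Pearson correlation applied to two reviewers’ scores for the same five responses. The ratings are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 1-5 ratings from two reviewers for the same five responses
reviewer_a = [5, 4, 2, 1, 3]
reviewer_b = [4, 5, 2, 1, 3]

print(round(pearson(reviewer_a, reviewer_b), 3))
```

A value near 1 means the reviewers rank responses similarly, a value near 0 means their scores are unrelated, and a negative value means they tend to disagree.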
Reducing Noise
Human feedback can be noisy, especially if the criteria are unclear or the task is subjective. Here are a couple of ways to deal with that:
Averaging the Scores: By averaging the ratings of multiple human reviewers, you can smooth out any individual biases or inconsistencies.
Focus on Agreement: Another approach is to only consider cases where your reviewers agree. This will give you a cleaner set of evaluations and help ensure the quality of your feedback.
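Both ideas fit in a few lines of Python. In this sketch, “agreement” is assumed to mean that the highest and lowest rating for an item differ by at most one point; that cutoff, the function names, and the ratings are illustrative choices.

```python
from statistics import mean

def average_scores(ratings):
    """Map each item to the mean of its reviewers' scores."""
    return {item: mean(scores) for item, scores in ratings.items()}

def agreed_items(ratings, max_spread=1):
    """Keep only items whose ratings differ by at most max_spread points."""
    return {
        item: scores
        for item, scores in ratings.items()
        if max(scores) - min(scores) <= max_spread
    }

# Hypothetical 1-5 ratings from three reviewers per item
ratings = {"q1": [5, 4, 5], "q2": [1, 5, 3], "q3": [2, 2, 3]}

print(average_scores(ratings))
print(sorted(agreed_items(ratings)))
```

Here the reviewers disagree sharply on `q2`, so it is dropped from the agreement-only set even though its average looks unremarkable.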
Creating a Ground Truth Dataset from Scratch
When it comes to evaluating a RAG system without ground truth data, another approach is to create your own dataset. It sounds daunting, but there are several strategies to make this process easier, from finding similar datasets to leveraging human feedback and even synthetically generating data. Let’s break down how you can do it.
Finding Similar Datasets Online
This might seem obvious, and most people who have concluded that they don’t have a ground truth dataset have already exhausted this option. But it’s still worth mentioning that there might be datasets out there that are similar to what you need. Perhaps a dataset comes from a different business domain than your use case but uses the same question-answer format you’re working with. Sites like Kaggle have a huge variety of public datasets, and you might be surprised at how many align with your problem space.
Example:
Stanford Question Answering Dataset
Amazon Question/Answer Dataset
Manually Creating Ground Truth Data
If you can’t find exactly what you need online, you can always create ground truth data manually. This is where human-in-the-loop feedback comes in handy. Remember the domain expert feedback we talked about earlier? You can use that feedback to build your own mini-dataset.
By curating a collection of human-reviewed examples — where the relevance, correctness, and completeness of the results have been validated — you create a foundation for expanding your dataset for evaluation.
There is also a great article from Katherine Munro on an experimental approach to agile chatbot development.
Training an LLM as a Judge
Once you have your minimal ground truth dataset, you can take things a step further by setting up an LLM to act as a judge and evaluate your model’s outputs.
But before relying on an LLM judge, you first need to ensure that it rates your model outputs accurately, or at least reliably. Here’s how you can approach that:
Build human-reviewed examples: Depending on your use case, 20 to 30 examples should be enough to get a good sense of how reliable the LLM is in comparison. Refer to the previous section for the criteria to rate and how to handle conflicting ratings.
Create your LLM judge: Prompt an LLM to give ratings based on the same criteria that you handed to your domain experts, then compare how the LLM’s ratings align with the human ratings. Again, you can use metrics like Pearson correlation to help evaluate. A high correlation score indicates that the LLM is performing well as a judge.
Apply prompt engineering best practices: Prompt engineering can make or break this process. Techniques like pre-warming the LLM with context or providing a few examples (few-shot learning) can dramatically improve the model’s accuracy when judging.
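The steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the prompt wording, the 1 to 5 scale, the `judge` helper, and the human ratings. `call_llm` stands for whatever function sends a prompt to your model and returns its text reply; a canned stand-in is used so the sketch runs offline.

```python
import math
import re

JUDGE_PROMPT = """You are grading the answer of a question-answering system.
Rate the answer for {criterion} on a scale from 1 (poor) to 5 (excellent).

Question: {question}
Retrieved context: {context}
Answer: {answer}

Reply with a single integer from 1 to 5."""

def judge(call_llm, question, context, answer, criterion="correctness"):
    """Ask an LLM for a 1-5 rating; call_llm is any function str -> str."""
    prompt = JUDGE_PROMPT.format(
        criterion=criterion, question=question, context=context, answer=answer
    )
    reply = call_llm(prompt)
    match = re.search(r"\b[1-5]\b", reply)  # first standalone digit from 1 to 5
    if match is None:
        raise ValueError(f"no rating found in reply: {reply!r}")
    return int(match.group())

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

CONTEXT = "After launching product ABC, company XYZ saw a 10% increase in revenue for 2024 Q1."

# Hypothetical human-reviewed examples ("human" is the expert's 1-5 rating)
examples = [
    {"question": "What product did company XYZ launch?", "answer": "Product ABC.", "human": 5},
    {"question": "What product did company XYZ launch?", "answer": "Product DEF.", "human": 1},
    {"question": "What was the change in revenue for Q1 2024?", "answer": "Revenue rose by 10%.", "human": 5},
    {"question": "What was the change in revenue for Q1 2024?", "answer": "Revenue went up.", "human": 3},
]

# Stand-in for a real model call: returns canned replies in order
canned_replies = iter(["5", "Rating: 2", "4", "3"])

def fake_llm(prompt):
    return next(canned_replies)

llm_ratings = [judge(fake_llm, ex["question"], CONTEXT, ex["answer"]) for ex in examples]
human_ratings = [ex["human"] for ex in examples]

print("LLM:", llm_ratings, "human:", human_ratings)
print("Pearson r =", round(pearson(llm_ratings, human_ratings), 3))
```

With a real model, you would swap `fake_llm` for your own client call and run the comparison over all of your human-reviewed examples before trusting the judge on unlabeled data.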
Creating Specific Datasets
Another way to boost the quality and quantity of your ground truth datasets is by segmenting your documents into topics or semantic groupings. Instead of looking at entire documents as a whole, break them down into smaller, more focused segments.
For example, let’s say you have a document (documentId: 123) that mentions:
“After launching product ABC, company XYZ saw a 10% increase in revenue for 2024 Q1.”
This one sentence contains two distinct pieces of information:
Launching product ABC
A 10% increase in revenue for 2024 Q1
Now, you can augment each topic into its own query and context. For example:
Query 1: “What product did company XYZ launch?”
Context 1: “Launching product ABC”
Query 2: “What was the change in revenue for Q1 2024?”
Context 2: “Company XYZ saw a 10% increase in revenue for Q1 2024”
By breaking the data into specific topics like this, you not only create more data points for training but also make your dataset more precise and focused. Plus, if you want to trace each query back to the original document for reliability, you can easily add metadata to each context segment. For instance:
Query 1: “What product did company XYZ launch?”
Context 1: “Launching product ABC (documentId: 123)”
Query 2: “What was the change in revenue for Q1 2024?”
Context 2: “Company XYZ saw a 10% increase in revenue for Q1 2024 (documentId: 123)”
This way, each segment is tied back to its source, making your dataset even more useful for evaluation and training.
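In code, keeping the source as a separate field (rather than appending it to the context string) makes it easier to filter or group by document later. The `EvalExample` class and `segment_examples` helper below are hypothetical names for a minimal sketch of that structure.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalExample:
    query: str
    context: str
    document_id: str  # metadata that ties the segment back to its source

def segment_examples(document_id, pairs):
    """Turn (query, context) pairs from one document into traceable examples."""
    return [EvalExample(query=q, context=c, document_id=document_id) for q, c in pairs]

examples = segment_examples(
    "123",
    [
        ("What product did company XYZ launch?", "Launching product ABC"),
        (
            "What was the change in revenue for Q1 2024?",
            "Company XYZ saw a 10% increase in revenue for Q1 2024",
        ),
    ],
)

for example in examples:
    print(asdict(example))
```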
Synthetically Creating a Dataset
If all else fails, or if you need more data than you can gather manually, synthetic data generation can be a game-changer. Using techniques like data augmentation or even GPT models, you can create new data points based on your existing examples. For instance, you can take a base set of queries and contexts and tweak them slightly to create variations.
For example, starting with the query:
“What product did company XYZ launch?”
You could synthetically generate variations like:
“Which product was introduced by company XYZ?”
“What was the name of the product launched by company XYZ?”
This can help you build a much larger dataset without the manual overhead of writing new examples from scratch.
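The simplest form of this is template-based augmentation, sketched below. The templates and the `expand_example` helper are illustrative assumptions; in practice you might ask an LLM to paraphrase the query instead of relying on fixed templates.

```python
# Hand-written paraphrase templates (an assumption; an LLM could produce these)
TEMPLATES = [
    "What product did {company} launch?",
    "Which product was introduced by {company}?",
    "What was the name of the product launched by {company}?",
]

def expand_example(company, context, templates=TEMPLATES):
    """Pair every query variation with the same context, since the answer is unchanged."""
    return [
        {"query": template.format(company=company), "context": context}
        for template in templates
    ]

variations = expand_example("company XYZ", "Launching product ABC")
for row in variations:
    print(row["query"], "->", row["context"])
```

Because every variation shares one context, a retriever that handles the original query but fails on a paraphrase shows up immediately in the results.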
There are also frameworks that can automate synthetic data generation for you; we’ll explore them in the last section.
Once You Have a Dataset: Time to Evaluate
Now that you’ve gathered or created your dataset, it’s time to dive into the evaluation phase. A RAG system involves two key areas: retrieval and generation. Both are important, and understanding how to assess each will help you fine-tune your system to better meet your needs.
Evaluating Retrieval: How Relevant is the Retrieved Data?
The retrieval step in RAG is crucial — if your model can’t pull the right information, it’s going to struggle with generating accurate responses. Here are two key metrics you’ll want to focus on:
Context Relevancy: This measures how well the retrieved context aligns with the query. Essentially, you’re asking: Is this information actually relevant to the question being asked? You can use your dataset to calculate relevance scores, either by human judgment or by comparing similarity metrics (like cosine similarity) between the query and the retrieved document.
Context Recall: Context recall focuses on how much relevant information was retrieved. It’s possible that the right document was pulled, but only part of the necessary information was included. To evaluate recall, you need to check whether the context your model pulled contains all the key pieces of information to fully answer the query. Ideally, you want high recall: your retrieval should grab the information you need and nothing critical should be left behind.
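As a rough sketch, both metrics can be approximated with simple counting once your dataset labels which chunks are relevant and which facts an answer needs. The function names, chunk IDs, and the verbatim substring check are simplifying assumptions; evaluation frameworks typically use an LLM for these judgments instead.

```python
def context_relevancy(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that the dataset marks as relevant to the query."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(key_facts, retrieved_texts):
    """Share of required facts found verbatim (case-insensitive) in the retrieved text."""
    if not key_facts:
        return 1.0
    haystack = " ".join(retrieved_texts).lower()
    found = sum(1 for fact in key_facts if fact.lower() in haystack)
    return found / len(key_facts)

# Hypothetical labels for one query
retrieved_ids = ["c1", "c7", "c9"]
relevant_ids = {"c1", "c2"}
key_facts = ["product ABC", "10% increase"]
retrieved_texts = ["Company XYZ launched product ABC in 2024."]

print("context relevancy:", round(context_relevancy(retrieved_ids, relevant_ids), 2))
print("context recall:", context_recall(key_facts, retrieved_texts))
```

In this toy case only one of three retrieved chunks is relevant, and the retrieved text covers the product launch but misses the revenue figure.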
Evaluating Generation: Is the Response Both Accurate and Useful?
Once the right information is retrieved, the next step is generating a response that not only answers the query but does so faithfully and clearly. Here are two critical aspects to evaluate:
Faithfulness: This measures whether the generated response accurately reflects the retrieved context. Essentially, you want to avoid hallucinations, where the model makes up information that wasn’t in the retrieved data. Faithfulness is about ensuring that the answer is grounded in the facts presented by the documents your model retrieved.
Answer Relevancy: This refers to how well the generated answer matches the query. Even if the information is faithful to the retrieved context, it still needs to be relevant to the question being asked. You don’t want your model to pull out correct information that doesn’t quite answer the user’s question.
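To illustrate the idea behind faithfulness, the sketch below scores an answer by the share of its sentences whose words mostly appear in the retrieved context. This is a crude lexical proxy that we are assuming for illustration only: it cannot detect paraphrases or negations, and the `faithfulness` function and its `min_overlap` cutoff are hypothetical. An LLM judge is the more common choice in practice.

```python
import re

def _words(text):
    """Lowercased set of word-like tokens (letters, digits, percent signs)."""
    return set(re.findall(r"[a-z0-9%]+", text.lower()))

def faithfulness(answer, context, min_overlap=0.8):
    """Share of answer sentences whose words mostly appear in the context."""
    context_words = _words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)

context = "After launching product ABC, company XYZ saw a 10% increase in revenue for 2024 Q1."
answer = (
    "Company XYZ saw a 10% increase in revenue. "
    "The increase was driven by strong sales in Europe."
)

print(faithfulness(answer, context))
```

The second sentence introduces a claim about Europe that the context never mentions, so it counts as unsupported and pulls the score down.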
Doing a Weighted Evaluation
Once you’ve assessed both retrieval and generation, you can go a step further by combining these evaluations in a weighted way. Maybe you care more about relevancy than recall, or perhaps faithfulness is your top priority. You can assign different weights to each metric depending on your specific use case.
For example:
Retrieval: 60% context relevancy + 40% context recall
Generation: 70% faithfulness + 30% answer relevancy
This kind of weighted evaluation gives you flexibility in prioritizing what matters most for your application. If your model needs to be 100% factually accurate (like in legal or medical contexts), you may put more weight on faithfulness. On the other hand, if completeness is more important, you might focus on recall.
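The weighting itself is just a weighted sum. Here is a minimal sketch using the example weights above; the metric values are made up, and the `weighted_score` helper is a hypothetical name.

```python
def weighted_score(metrics, weights):
    """Combine metric values in [0, 1] using weights that sum to 1."""
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must have the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(metrics[name] * weights[name] for name in weights)

# Made-up metric values for one evaluation run
retrieval = weighted_score(
    {"context_relevancy": 0.9, "context_recall": 0.5},
    {"context_relevancy": 0.6, "context_recall": 0.4},
)
generation = weighted_score(
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": 0.7, "answer_relevancy": 0.3},
)

print(f"retrieval = {retrieval:.2f}, generation = {generation:.2f}")
```

Validating that the weights sum to 1 keeps scores comparable across runs when you later adjust your priorities.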
Existing Frameworks to Simplify Your Evaluation Process
If creating your own evaluation system feels overwhelming, don’t worry — there are some great existing frameworks that have already done a lot of the heavy lifting for you. These frameworks come with built-in metrics designed specifically to evaluate RAG systems, making it easier to assess retrieval and generation performance. Let’s look at a few of the most helpful ones.
RAGAS (Retrieval-Augmented Generation Assessment)
RAGAS is a purpose-built framework designed to assess the performance of RAG models. It includes metrics that evaluate both retrieval and generation, offering a comprehensive way to measure how well your system is doing at each step. It also offers synthetic test data generation by employing an evolutionary generation paradigm.
Inspired by works like Evol-Instruct, Ragas achieves this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. — RAGAS documentation
ARES: Open-Source Framework Using Synthetic Data and LLM Judge
ARES is another powerful tool that combines synthetic data generation with LLM-based evaluation. ARES uses synthetic data — data generated by AI models rather than collected from real-world interactions — to build a dataset that can be used to test and refine your RAG system.
The framework also includes an LLM judge, which, as we discussed earlier, can help evaluate model outputs by comparing them to human annotations or other reference data.
Conclusion
Even without ground truth data, these strategies can help you effectively evaluate a RAG system. Whether you’re using vector similarity thresholds, multiple LLMs, LLM-as-a-judge, retrieval metrics, or frameworks, each approach gives you a way to measure performance and improve your model’s results. The key is finding what works best for your specific needs — and not being afraid to tweak things along the way. 🙂
How to Evaluate RAG If You Don’t Have Ground Truth Data was originally published in Towards Data Science on Medium.