How to Measure the Reliability of a Large Language Model’s Response

The basic principle of Large Language Models (LLMs) is simple: predict the next word (or token) in a sequence based on statistical patterns in their training data. Yet this seemingly simple capability turns out to be remarkably powerful in practice, enabling tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs have no memory, nor do they actually “understand” anything; they simply stick to their basic function of predicting the next word.

The process of next-word prediction is probabilistic: the LLM selects each word from a probability distribution. In the process, it often generates false, fabricated, or inconsistent content in an attempt to produce a coherent response, filling gaps with plausible-looking but incorrect information. This phenomenon is called hallucination, a well-known and essentially unavoidable feature of LLMs that warrants validation and corroboration of their outputs.
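To make this concrete, here is a minimal, purely illustrative sketch of probabilistic next-token selection over a toy vocabulary (the words and probabilities are invented for illustration; a real model samples from a distribution over tens of thousands of tokens at every step):

import random

# Toy next-token distribution (invented for illustration).
# A real LLM produces a distribution like this over its whole vocabulary at every step.
next_token_probs = {
    "Paris": 0.72,
    "Lyon": 0.15,
    "Berlin": 0.08,  # plausible-looking but wrong continuations still carry probability mass
    "Madrid": 0.05,
}

# Sampling (rather than always taking the most likely token) is what makes outputs
# vary between runs -- and occasionally drift into confident-sounding errors.
tokens = list(next_token_probs)
weights = list(next_token_probs.values())
print("Sampled next token:", random.choices(tokens, weights=weights, k=1)[0])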

Retrieval-augmented generation (RAG) methods, which let an LLM work with external knowledge sources, do minimize hallucinations to some extent, but they cannot completely eradicate them. Although advanced RAGs can provide in-text citations and URLs, verifying these references can be tedious and time-consuming. Therefore, we need an objective criterion for assessing the reliability or trustworthiness of an LLM’s response, whether it is generated from the model’s own knowledge or an external knowledge base (RAG).

In this article, we will discuss how the output of an LLM can be assessed for trustworthiness by a trustworthy language model that assigns a score to the LLM’s output. We will first discuss how a trustworthy language model can be used to score an LLM’s answers and explain their trustworthiness. Subsequently, we will develop an example RAG with LlamaParse and LlamaIndex that assesses the RAG’s answers for trustworthiness.

The entire code for this article is available in a Jupyter notebook on GitHub.

Assigning a Trustworthiness Score to an LLM’s Answer

To demonstrate how we can assign a trustworthiness score to an LLM’s response, I will use Cleanlab’s Trustworthy Language Model (TLM). TLM uses a combination of uncertainty quantification and consistency analysis to compute trustworthiness scores and explanations for LLM responses.
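As a rough intuition for the consistency part (a simplified sketch only, not Cleanlab’s actual algorithm), one can ask the same question several times at a non-zero temperature and measure how often the answers agree; the ask_llm callable below is a hypothetical stand-in for any LLM call:

from collections import Counter
from typing import Callable
import random

def consistency_score(ask_llm: Callable[[str], str], question: str, n_samples: int = 5) -> float:
    """Crude self-consistency proxy: the fraction of sampled answers that agree with
    the most common answer. Real trustworthiness scoring (e.g., TLM) is more involved."""
    answers = [ask_llm(question).strip().lower() for _ in range(n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / n_samples

# Usage with a dummy "LLM" that answers inconsistently (for illustration only):
dummy_llm = lambda q: random.choice(["6", "5", "5"])
print(consistency_score(dummy_llm, "How many vowels are there in the word 'Abracadabra'?"))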

Cleanlab offers free trial API keys, which can be obtained by creating an account on their website. We first need to install Cleanlab’s Python client:

pip install --upgrade cleanlab-studio

Cleanlab supports several proprietary models such as 'gpt-4o', 'gpt-4o-mini', 'o1-preview', 'claude-3-sonnet', 'claude-3.5-sonnet', 'claude-3.5-sonnet-v2', and others. Here is how TLM assigns a trustworthiness score to gpt-4o’s answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness.

from cleanlab_studio import Studio

studio = Studio("<CLEANLAB_API_KEY>")  # Get your API key from above
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"})  # GPT, Claude, etc.

# Set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")

# The TLM response contains the actual output ('response'), trustworthiness score, and explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")

The above code tested gpt-4o’s response to the question “How many vowels are there in the word ‘Abracadabra’.?”. The TLM’s output contains the model’s answer (response), the trustworthiness score, and an explanation. Here is the output of this code:

Model’s response = The word “Abracadabra” contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here’s one inconsistent alternate response that the model considered (which may not be accurate either):
5.

It can be seen how even one of the most advanced language models hallucinates on such a simple task and produces a wrong output. Here are the response and trustworthiness score for the same question from claude-3.5-sonnet-v2.

Model’s response = Let me count the vowels in ‘Abracadabra’:
A-b-r-a-c-a-d-a-b-r-a

The vowels are: A, a, a, a, a

There are 5 vowels in the word ‘Abracadabra’.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.

claude-3.5-sonnet-v2 produces the correct output. Let’s compare the two models’ responses to another question.

from cleanlab_studio import Studio
from IPython.display import display, Markdown

# Initialize Cleanlab Studio with your API key
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key

# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop through each model and evaluate
for model in models:
    tlm = studio.TLM(options={"log": ["explanation"], "model": model})
    out = tlm.prompt(prompt_text)

    md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}
"""
    display(Markdown(md_content))

Here are the responses of the two models:

Wrong outputs generated by gpt-4o and claude-3.5-sonnet-v2, reflected in their low trustworthiness scores

We can also generate trustworthiness scores for open-source LLMs. Let’s check the recent, much-hyped open-source LLM deepseek-R1. I will use DeepSeek-R1-Distill-Llama-70B, which is based on Meta’s Llama-3.3-70B-Instruct model and distilled from DeepSeek’s larger 671-billion-parameter Mixture of Experts (MoE) model. Knowledge distillation is a machine learning technique that transfers the learnings of a large pre-trained model, the “teacher model,” to a smaller “student model.”
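As a rough illustration of that idea (a minimal sketch under simplified assumptions, not DeepSeek’s actual training recipe), the student is typically trained to match the teacher’s temperature-softened output distribution, for example via a KL-divergence term:

import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives softer distributions."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions --
    a simplified stand-in for the soft-label part of a distillation objective."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy example over a 3-token vocabulary: the loss shrinks as the student's logits
# move toward the teacher's.
print(distillation_loss(teacher_logits=[2.0, 1.0, 0.1], student_logits=[1.5, 1.2, 0.3]))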

import streamlit as st
from langchain_groq.chat_models import ChatGroq
import os

os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]

# Initialize the Groq-hosted DeepSeek distilled model
model = "deepseek-r1-distill-llama-70b"
groq_llm = ChatGroq(model=model, temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"

# Get the response from the model
response = groq_llm.invoke(prompt)

# Initialize Cleanlab's Studio
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # for explanations

# Get the output containing the trustworthiness score and explanation
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())

md_content = f"""
## Model: {model}

**Response:** {response.content.strip()}

**Trustworthiness Score:** {output['trustworthiness_score']}

**Explanation:** {output['log']['explanation']}
"""
display(Markdown(md_content))

Here is the output of the deepseek-r1-distill-llama-70b model.

The correct output of the deepseek-r1-distill-llama-70b model with a high trustworthiness score

Developing a Trustworthy RAG

We will now develop a RAG to demonstrate how we can measure the trustworthiness of an LLM’s response in a RAG pipeline. This RAG will be developed by scraping data from given links, parsing it into markdown format, and creating a vector store.

The following libraries need to be installed to run the code in this section:

pip install llama-parse llama-index-core llama-index-embeddings-huggingface \
  llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

To render HTML content into PDF format, we also need to install the wkhtmltopdf command-line tool from its website.
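As an optional sanity check (assuming you want to confirm the installation before generating any PDFs), you can verify that the wkhtmltopdf binary is discoverable; if it is not on your PATH, you can instead point pdfkit to its full installation path, as done in the PDF-generation code later in this article:

import shutil

# Check whether the wkhtmltopdf binary is on the PATH.
# If this prints None, pass its full path to pdfkit.configuration() later on.
print("wkhtmltopdf found at:", shutil.which("wkhtmltopdf"))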

The following libraries will be imported:

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.cleanlab import CleanlabTLM
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent
from typing import Dict, List, ClassVar
import requests
from bs4 import BeautifulSoup
import pdfkit
import nest_asyncio
import os

nest_asyncio.apply()  # Required for LlamaParse to run inside a notebook's event loop

The next steps involve scraping data from the given URLs using Python’s BeautifulSoup library, saving the scraped data to PDF file(s) using pdfkit, and parsing the PDF(s) into markdown files using LlamaParse, a genAI-native document parsing platform built with LLMs and for LLM use cases.

We will first configure the LLM to be used by CleanlabTLM and the embedding model (the Hugging Face embedding model BAAI/bge-small-en-v1.5) that will be used to compute embeddings of the scraped data for the vector store.

options = {
    "model": "gpt-4o",
    "max_tokens": 512,
    "log": ["explanation"]
}
llm = CleanlabTLM(api_key="<CLEANLAB_API_KEY>", options=options)  # Get your free API key from https://cleanlab.ai/
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

We will now define a custom event handler, GetTrustworthinessScore, derived from a base event handler class. This handler is triggered at the end of an LLM completion and extracts the trustworthiness score from the response metadata; we register an instance of it with LlamaIndex’s root dispatcher so that it actually receives these events. A helper function, display_response, displays the LLM’s response along with its trustworthiness score.

# Event handler for the trustworthiness score
class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> Dict:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs.get("trustworthiness_score", 0.0)
            self.events.append(event)
        return {}

# Register the handler with LlamaIndex's root dispatcher
event_handler = GetTrustworthinessScore()
root_dispatcher = get_dispatcher()
root_dispatcher.add_event_handler(event_handler)

# Helper function to display the LLM's response and its trustworthiness score
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

We will now generate PDFs by scraping data from the given URLs. For demonstration, we will scrape data only from this Wikipedia article about large language models (Creative Commons Attribution-ShareAlike 4.0 License).

Note: Readers are advised to always double-check the status of the content/data they are about to scrape and ensure they are allowed to do so. 

The following piece of code scrapes data from the given URLs by making an HTTP request and using the BeautifulSoup Python library to parse the HTML content. The HTML is cleaned by converting protocol-relative URLs to absolute ones. Subsequently, the scraped content is converted into PDF file(s) using pdfkit.

##########################################
# PDF Generation from Multiple URLs
##########################################

# Configure the wkhtmltopdf path
wkhtml_path = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=wkhtml_path)

# Define URLs and assign document names
urls = {
    "LLMs": "https://en.wikipedia.org/wiki/Large_language_model"
}

# Directory to save PDFs
pdf_directory = "PDFs"
os.makedirs(pdf_directory, exist_ok=True)

pdf_paths = {}
for doc_name, url in urls.items():
    try:
        print(f"Processing {doc_name} from {url} ...")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        main_content = soup.find("div", {"id": "mw-content-text"})
        if main_content is None:
            raise ValueError("Main content not found")
        # Replace protocol-relative URLs with absolute URLs
        html_string = str(main_content).replace('src="//', 'src="https://').replace('href="//', 'href="https://')
        pdf_file_path = os.path.join(pdf_directory, f"{doc_name}.pdf")
        pdfkit.from_string(
            html_string,
            pdf_file_path,
            options={'encoding': 'UTF-8', 'quiet': ''},
            configuration=config
        )
        pdf_paths[doc_name] = pdf_file_path
        print(f"Saved PDF for {doc_name} at {pdf_file_path}")
    except Exception as e:
        print(f"Error processing {doc_name}: {e}")

After generating PDF(s) from the scraped data, we parse them using LlamaParse. We set the parsing instructions to extract the content in markdown format and parse the document(s) page by page, attaching the document name and page number to each page. These extracted entities (pages) are referred to as nodes. The parser iterates over the extracted nodes and updates each node’s metadata by appending a citation header, which facilitates later referencing.

##########################################
# Parse PDFs with LlamaParse and Inject Metadata
##########################################

# Define parsing instructions (if your parser supports it)
parsing_instructions = """Extract the document content in markdown.
Split the document into nodes (for example, by page).
Ensure each node has metadata for document name and page number."""

# Create a LlamaParse instance
parser = LlamaParse(
    api_key="<LLAMACLOUD_API_KEY>",  # Replace with your actual key
    parsing_instructions=parsing_instructions,
    result_type="markdown",
    premium_mode=True,
    max_timeout=600
)

# Directory to save combined Markdown files (one per PDF)
output_md_dir = os.path.join(pdf_directory, "markdown_docs")
os.makedirs(output_md_dir, exist_ok=True)

# List to hold all updated nodes for indexing
all_nodes = []
for doc_name, pdf_path in pdf_paths.items():
    try:
        print(f"Parsing PDF for {doc_name} from {pdf_path} ...")
        nodes = parser.load_data(pdf_path)  # Returns a list of nodes
        updated_nodes = []
        # Process each node: update metadata and inject a citation header into the text.
        for i, node in enumerate(nodes, start=1):
            # Copy existing metadata (if any) and add our own keys.
            new_metadata = dict(node.metadata) if node.metadata else {}
            new_metadata["document_name"] = doc_name
            if "page_number" not in new_metadata:
                new_metadata["page_number"] = str(i)
            # Build the citation header.
            citation_header = f"[{new_metadata['document_name']}, page {new_metadata['page_number']}]\n\n"
            # Prepend the citation header to the node's text.
            updated_text = citation_header + node.text
            new_node = node.__class__(text=updated_text, metadata=new_metadata)
            updated_nodes.append(new_node)
        # Save a single combined Markdown file for the document using the updated node texts.
        combined_texts = [node.text for node in updated_nodes]
        combined_md = "\n\n---\n\n".join(combined_texts)
        md_filename = f"{doc_name}.md"
        md_filepath = os.path.join(output_md_dir, md_filename)
        with open(md_filepath, "w", encoding="utf-8") as f:
            f.write(combined_md)
        print(f"Saved combined markdown for {doc_name} to {md_filepath}")
        # Add the updated nodes to the global list for indexing.
        all_nodes.extend(updated_nodes)
        print(f"Parsed {len(updated_nodes)} nodes from {doc_name}.")
    except Exception as e:
        print(f"Error parsing {doc_name}: {e}")

We now create a vector store and a query engine. We define a custom prompt template to guide the LLM’s behavior in answering questions. Finally, we create a query engine over the created index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity to the query, and the LLM uses these retrieved nodes to generate the final answer.

##########################################
# Create Index and Query Engine
##########################################
from llama_index.core import PromptTemplate

# Create an index from all nodes.
index = VectorStoreIndex.from_documents(documents=all_nodes)

# Define a custom prompt template that forces the inclusion of citations.
prompt_template = """
You are an AI assistant with expertise in the subject matter.
Answer the question using ONLY the provided context.
Answer in well-formatted Markdown with bullets and sections wherever necessary.
If the provided context does not support an answer, respond with "I don't know."
Context:
{context_str}
Question:
{query_str}
Answer:
"""

# Create a query engine that retrieves the top 3 nodes and uses the custom QA template.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    llm=llm,
    text_qa_template=PromptTemplate(prompt_template)
)
print("Combined index and query engine created successfully!")

Now let’s test the RAG with some queries and look at their corresponding trustworthiness scores.

query = "When is mixture of experts approach used?"
response = query_engine.query(query)
display_response(response)

Response to the query ‘When is mixture of experts approach used?’ (image by author)

query = "How do you compare Deepseek model with OpenAI's models?"
response = query_engine.query(query)
display_response(response)

Response to the query ‘How do you compare the Deepseek model with OpenAI’s models?’ (image by author)

Assigning a trustworthiness score to an LLM’s response, whether generated through direct inference or RAG, helps define the reliability of the AI’s output and prioritize human verification where needed. This is particularly important for critical domains where a wrong or unreliable response could have severe consequences.

That’s all folks! If you like the article, please follow me on Medium and LinkedIn.
