Enhancing RAG: Beyond Vanilla Approaches

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by incorporating external information retrieval mechanisms. While standard RAG implementations improve response relevance, they often struggle in complex retrieval scenarios. This article explores the limitations of a vanilla RAG setup and introduces advanced techniques to enhance its accuracy and efficiency.

The Challenge with Vanilla RAG

To illustrate RAG’s limitations, consider a simple experiment where we attempt to retrieve relevant information from a set of documents. Our dataset includes:

A primary document discussing best practices for staying healthy, productive, and in good shape.

Two additional documents on unrelated topics that nevertheless contain some similar words used in different contexts (hypothetical stand-ins are sketched after the main document below).

main_document_text = """
Morning Routine (5:30 AM – 9:00 AM)
Wake Up Early – Aim for 6-8 hours of sleep to feel well-rested.
Hydrate First – Drink a glass of water to rehydrate your body.
Morning Stretch or Light Exercise – Do 5-10 minutes of stretching or a short workout to activate your body.
Mindfulness or Meditation – Spend 5-10 minutes practicing mindfulness or deep breathing.
Healthy Breakfast – Eat a balanced meal with protein, healthy fats, and fiber.
Plan Your Day – Set goals, review your schedule, and prioritize tasks.
"""
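
The two unrelated documents are not reproduced in the original article. For a self-contained walkthrough, hypothetical stand-ins like the following are enough, since all that matters is that they reuse words such as "healthy" and "productive" in different contexts:

# Hypothetical stand-ins for the two unrelated documents (not from the
# original article): they share vocabulary with the main document but
# discuss entirely different subjects.
doc_text_1 = """
Keeping your codebase healthy requires regular refactoring sessions.
A productive development team reviews pull requests daily and keeps
technical debt in good shape.
"""

doc_text_2 = """
A healthy economy depends on productive industries and balanced trade.
Markets stay in good shape when supply chains remain stable.
"""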

Using a standard RAG setup, we query the system with:

What should I do to stay healthy and productive?

What are the best practices to stay healthy and productive?

Helper Functions

To enhance retrieval accuracy and streamline query processing, we implement a set of essential helper functions. These serve various purposes, from querying the ChatGPT API to computing document embeddings and similarity scores, and together they form the backbone of our RAG pipeline:

# Imports
import os
import json
import openai
import numpy as np
from scipy.spatial.distance import cosine
from google.colab import userdata

# Set up the OpenAI API key and client
os.environ["OPENAI_API_KEY"] = userdata.get('AiTeam')
client = openai.OpenAI()

def query_chatgpt(prompt, model="gpt-4o", response_format=openai.NOT_GIVEN):
    """Sends a prompt to the chat completions API and returns the response text."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # Adjust for more or less creativity
            response_format=response_format
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"

def get_embedding(text, model="text-embedding-3-large"):  # alternative: "text-embedding-ada-002"
    """Fetches the embedding for a given text using OpenAI's API."""
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding

def compute_similarity_metrics(embed1, embed2):
    """Computes the cosine similarity between two embeddings."""
    cosine_sim = 1 - cosine(embed1, embed2)  # Cosine similarity

    return cosine_sim

def fetch_similar_docs(query, docs, threshold=0.55, top=1):
    """Returns the top documents whose similarity to the query exceeds the threshold."""
    query_em = get_embedding(query)
    data = []
    for d in docs:
        # Compute the similarity between the document and the query
        similarity_results = compute_similarity_metrics(d["embedding"], query_em)
        if similarity_results >= threshold:
            data.append({"id": d["id"], "ref_doc": d.get("ref_doc", ""), "score": similarity_results})

    # Sort by similarity score, descending, and keep the top matches
    sorted_data = sorted(data, key=lambda x: x["score"], reverse=True)
    return sorted_data[:top]
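
The retrieval functions operate on a docs list of embedded documents, which the original article does not show being built. A minimal sketch, using the helpers above and the hypothetical doc_text_1 and doc_text_2 stand-ins from earlier, might look like this:

# Build the document store used by fetch_similar_docs (a sketch; this step
# is not shown in the original article). Each entry keeps an id, a reference
# to its source document, and its embedding.
docs = [
    {"id": "main_document_text", "ref_doc": "main_document_text",
     "embedding": get_embedding(main_document_text)},
    {"id": "doc_text_1", "ref_doc": "doc_text_1",
     "embedding": get_embedding(doc_text_1)},
    {"id": "doc_text_2", "ref_doc": "doc_text_2",
     "embedding": get_embedding(doc_text_2)},
]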

Evaluating the Vanilla RAG

To evaluate the effectiveness of a vanilla RAG setup, we conduct a simple test using predefined queries. Our goal is to determine whether the system retrieves the most relevant document based on semantic similarity. We then analyze the limitations and explore possible improvements.

# Testing Vanilla RAG

query = "what should I do to stay healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

query = "what are the best practices to stay healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

Advanced Techniques for Improved RAG

To further refine the retrieval process, we introduce advanced functions that enhance the capabilities of our RAG system. These functions generate structured information that aids in document retrieval and query processing, making our system more robust and context-aware.

To address these challenges, we implement three key enhancements:

1. Generating FAQs

By automatically creating a list of frequently asked questions related to a document, we expand the range of potential queries the model can match. These FAQs are generated once and stored alongside the document, providing a richer search space without incurring ongoing costs.

def generate_faq(text):
    prompt = f'''
    Given the following text: """{text}"""
    Ask relevant simple atomic questions ONLY (don't answer them) to cover all subjects addressed by the text.
    Return the result as a JSON object of the form {{"questions": ["q1", "q2", "q3", ...]}}.
    '''
    return query_chatgpt(prompt, response_format={"type": "json_object"})

2. Creating an Overview

A high-level summary of the document helps capture its core ideas, making retrieval more effective. By embedding the overview alongside the document, we provide additional entry points for relevant queries, improving match rates.

def generate_overview(text):
    prompt = f'''
    Given the following text: """{text}"""
    Generate an abstract of at most 3 lines that explains what it is about, using high-level terms that capture the main points.
    Use terms and words that an average person would be most likely to use.
    '''
    return query_chatgpt(prompt)

3. Query Decomposition

Instead of searching with broad user queries, we break them down into smaller, more precise sub-queries. Each sub-query is then compared against our enhanced document collection, which now includes:

The original document

The generated FAQs

The generated overview

By merging the retrieval results from these multiple sources, we significantly improve the likelihood of finding relevant information (a sketch of this merging step follows the decomposition function below).

def decompose_query(query):
    prompt = f'''
    Given the user query: """{query}"""
    Break it down into smaller, relevant subqueries
    that can retrieve the best information for answering the original query.
    Return them as a ranked JSON object of the form {{"subqueries": ["q1", "q2", "q3", ...]}}.
    '''
    return query_chatgpt(prompt, response_format={"type": "json_object"})
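
The article does not show the merging step itself. A minimal sketch, assuming the docs list and helpers defined above, could decompose the query, retrieve results for each sub-query, and deduplicate by document id while keeping the best score:

def fetch_docs_for_query(query, docs, threshold=0.55, top=2):
    """Sketch of merged retrieval (hypothetical helper, not from the original
    article): decompose the query, retrieve for each sub-query, and
    deduplicate hits by id, keeping the highest score per document."""
    subqueries = json.loads(decompose_query(query))["subqueries"]
    merged = {}
    for subq in subqueries:
        for hit in fetch_similar_docs(subq, docs, threshold=threshold, top=top):
            prev = merged.get(hit["id"])
            if prev is None or hit["score"] > prev["score"]:
                merged[hit["id"]] = hit
    # Return the merged hits ranked by similarity score, best first
    return sorted(merged.values(), key=lambda x: x["score"], reverse=True)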

Evaluating the Improved RAG

After implementing these techniques, we re-run our initial queries. This time, query decomposition generates several sub-queries, each focusing on a different aspect of the original question. As a result, our system successfully retrieves relevant information from both the FAQs and the original document, demonstrating a substantial improvement over the vanilla RAG approach.

# Testing Advanced Functions

## Generate an overview of the document
overview_text = generate_overview(main_document_text)
print(overview_text)
# Embed the overview and add it to the document store
docs.append({"id": "overview_text", "ref_doc": "main_document_text", "embedding": get_embedding(overview_text)})

## Generate FAQs for the document
main_doc_faq_arr = generate_faq(main_document_text)
print(main_doc_faq_arr)
faq = json.loads(main_doc_faq_arr)["questions"]

# Embed each FAQ question and add it to the document store
for i, f in enumerate(faq):
    docs.append({"id": f"main_doc_faq_{i}", "ref_doc": "main_document_text", "embedding": get_embedding(f)})

## Decompose the 1st query
query = "what should I do to stay healthy and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)["subqueries"]

## Compute the similarities between the subqueries and documents, including FAQs
for subq in subqueries_list:
    print("query = ", subq)
    r = fetch_similar_docs(subq, docs, threshold=0.55, top=2)
    print(r)
    print("=================================\n")

## Decompose the 2nd query
query = "what are the best practices to stay healthy and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)["subqueries"]

## Compute the similarities between the subqueries and documents, including FAQs
for subq in subqueries_list:
    print("query = ", subq)
    r = fetch_similar_docs(subq, docs, threshold=0.55, top=2)
    print(r)
    print("=================================\n")
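
With the merging sketch from earlier, the improved pipeline can also be exercised in a single call per query. This is again a sketch, assuming the hypothetical fetch_docs_for_query helper defined above:

# Hypothetical end-to-end retrieval using the merging sketch defined earlier
for query in ["what should I do to stay healthy and productive?",
              "what are the best practices to stay healthy and productive?"]:
    print("query = ", query)
    print("documents = ", fetch_docs_for_query(query, docs, threshold=0.55, top=2))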

Here are some of the FAQs that were generated:

{
  "questions": [
    "How many hours of sleep are recommended to feel well-rested?",
    "How long should you spend on morning stretching or light exercise?",
    "What is the recommended duration for mindfulness or meditation in the morning?",
    "What should a healthy breakfast include?",
    "What should you do to plan your day effectively?",
    "How can you minimize distractions during work?",
    "How often should you take breaks during work/study productivity time?",
    "What should a healthy lunch consist of?",
    "What activities are recommended for afternoon productivity?",
    "Why is it important to move around every hour in the afternoon?",
    "What types of physical activities are suggested for the evening routine?",
    "What should a nutritious dinner include?",
    "What activities can help you reflect and unwind in the evening?",
    "What should you do to prepare for sleep?"
  ]
}

Cost-Benefit Analysis

While these enhancements introduce an upfront processing cost—generating FAQs, overviews, and embeddings—this is a one-time cost per document. In contrast, a poorly optimized RAG system would lead to two major inefficiencies:

Frustrated users due to low-quality retrieval.

Increased query costs from retrieving excessive, loosely related documents.

For systems handling high query volumes, these inefficiencies compound quickly, making preprocessing a worthwhile investment.

Conclusion

By integrating document preprocessing (FAQs and overviews) with query decomposition, we create a more intelligent RAG system that balances accuracy and cost-effectiveness. This approach enhances retrieval quality, reduces irrelevant results, and ensures a better user experience.

As RAG continues to evolve, these techniques will be instrumental in refining AI-driven retrieval systems. Future research may explore further optimizations, including dynamic thresholding and reinforcement learning for query refinement.
