
MAS Is All You Need: Supercharge Your Retrieval-Augmented Generation (RAG) with a Multi-Agent System

How to build a Multi-Agent RAG with AG2 and ChromaDB

Retrieval-Augmented Generation (RAG) systems have improved rapidly in recent years. Broadly, their evolution can be divided into three phases: in the pre-LLM era, information retrieval systems relied primarily on traditional search algorithms and indexing techniques, and were limited in their ability to understand context and generate human-like responses. Then LLMs entered the scene, causing a drastic paradigm shift. Now agents have arrived, and another paradigm shift is happening.

But let’s take a step back: what is a RAG?

How RAG Works

To understand how a RAG system works, it can be helpful to compare its processes to those of a library.

Basic components of a RAG

- **Ingestion.** This phase is similar to stocking a library. Just as a librarian organizes books and creates an index, a RAG system prepares data by converting it into numerical representations called embeddings. These embeddings are stored in a vector database, making it easy to find relevant information later.
- **Retrieval.** When a user asks a question, it’s like asking a librarian for information. The RAG system uses the query to search the indexed data and retrieve the most relevant documents or pieces of information from the database. This process ensures that the system pulls in accurate and up-to-date content.
- **Generation.** With the retrieved information, the system generates a response by combining it with its internal knowledge. This is similar to how a librarian synthesizes information from multiple sources to provide an answer to a question.

It is important to clarify that, although the ingestion phase is not strictly a component of RAG, which stands for Retrieval-Augmented Generation, I always prefer to include ingestion as a crucial part of the process. Without proper organization of knowledge, the subsequent phases are unlikely to function effectively.

RAG systems traditionally operate through sequential workflows, where distinct pipelines handle the ingestion of data, retrieval of relevant information based on user queries, and generation of responses using the retrieved data. While this architecture is straightforward and effective for many applications, it poses significant limitations in scenarios that demand complex and non-linear interactions.
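This sequential ingestion → retrieval → generation flow can be sketched in a deliberately toy form. The word-overlap “embedding” below is a crude stand-in for a real embedding model, and the `generate()` stub stands in for an LLM call; only the shape of the pipeline matters here.

```python
# Toy sketch of the three RAG phases. The word-overlap "embedding" is a
# deliberately crude stand-in for a real embedding model.

def embed(text: str) -> set:
    # Ingestion: a real system would call an embedding model here.
    return set(text.lower().split())

def retrieve(query: str, store: dict, k: int = 1) -> list:
    # Retrieval: rank stored chunks by word overlap with the query.
    ranked = sorted(store, key=lambda doc: len(embed(query) & store[doc]), reverse=True)
    return ranked[:k]

def generate(query: str, chunks: list) -> str:
    # Generation: a real system would prompt an LLM with the retrieved chunks.
    return f"Answer to '{query}' based on: {'; '.join(chunks)}"

store = {chunk: embed(chunk) for chunk in [
    "Amsterdam has two large research universities",
    "Turin is a city in Northern Italy",
]}
```

Even at this scale, the limitation of a strictly linear pipeline is visible: each stage runs exactly once, in a fixed order, with no way to loop back or re-plan.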

For a comprehensive understanding of how to implement a traditional LLM-based Retrieval-Augmented Generation (RAG) system, I encourage you to read one of my previous articles.

Build your own RAG and run it locally on your laptop: ColBERT + DSPy + Streamlit

Unfortunately, progress in the field of Generative AI is rapid, and many aspects of that article are already outdated. However, it still serves as a valuable resource for understanding the fundamentals of the topic we are discussing. In this tutorial, we aim to combine Retrieval-Augmented Generation (RAG) systems with Multi-Agent Systems (MAS).

… Multi-Agent System… bla bla bla… I know, today, everyone is buzzing about Multi-Agent Systems (MAS) just like they once did about Generative AI, Reinforcement Learning, Machine Learning and Big Data (can you relate?). However, I will try to make this tutorial valuable for those who are approaching this field for the first time. By the end of the article, I will also share some of my thoughts regarding the limitations of multi-agent systems.

MAS = ?

In the context of artificial intelligence, an agent is defined as a system or program that perceives its environment, makes decisions, and takes actions autonomously to achieve specific goals. For example, a librarian can be considered an agent; it organizes books, researches information, and formulates responses to inquiries. Much like an AI agent, a librarian navigates through vast amounts of information, curating and providing access to knowledge while adapting to the needs of users. The agents we will develop primarily delegate the decision-making component to Large Language Models (LLMs), leveraging their advanced capabilities for processing and generating human-like text.

Photo by Xu Haiwei on Unsplash

A (LLM-based) Multi-Agent System (MAS) consists of a collection of such agents that collaborate to achieve common objectives or solve complex problems. In a MAS, each agent operates independently but can communicate, debate and coordinate with other agents to share information, delegate tasks, and enhance overall system performance.

Don’t worry, we are not going to write a Multi-Agent System (MAS) from scratch in Python. There are several frameworks available that simplify the development process. It is important to emphasize that the goal of this tutorial is not to build the ultimate Multi-Agent Retrieval-Augmented Generation system, but rather to demonstrate how easily we can construct a relatively complex system using the tools available to us.

Every piece of code shown here is also available in the GitHub repository for this article.

Ready? Let’s go!

Environment setup

We use Anaconda in this tutorial. If you do not have it on your machine, please, download it from the official website and install it (just follow the installation script instructions).

Then, within a terminal session, we can create the environment with some packages we will use during the process:

conda create -n "mas" python=3.12.8
conda activate mas
git clone https://github.com/ngshya/mas-is-all-you-need.git
cd mas-is-all-you-need
pip install -r requirements.txt

We need a .env file inside the project folder where we put the OpenAI API key and the ChromaDB configuration. For instance, mine looks like:

OPENAI_API_KEY="sk-proj-abcdefg…"
CHROMA_DB_HOST="localhost"
CHROMA_DB_PORT=8001
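The repository presumably loads this file with a helper such as python-dotenv’s `load_dotenv()` (an assumption; check the project code). Conceptually, loading a `.env` file amounts to this stdlib-only sketch:

```python
import os

def load_env(path: str = ".env") -> None:
    # Minimal .env loader (stand-in for python-dotenv's load_dotenv):
    # skips blanks and comments, strips optional surrounding quotes.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"').strip("'")
```

Note that every value ends up in `os.environ` as a string, so numeric settings like the port must be cast where they are used.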

Data Ingestion

In the repository, we have already prepared some sample data located in the kb folder. The contents of these text files come from Wikipedia. To facilitate the ingestion of the text files within this folder, we have implemented some functions in the tools_ingestion.py file:

- get_txt_file_content() reads the content of a text file.
- process_text() transforms a long text into chunks through an LLM call.
- text_to_list() reduces the output of the previous function into an actual Python list.
- save_chunks_to_db() saves the output of text_to_list() to a persistent DB (ChromaDB in our case).
- path_to_db() calls in sequence get_txt_file_content() → process_text() → text_to_list() → save_chunks_to_db().
- text_to_db() calls in sequence process_text() → text_to_list() → save_chunks_to_db().
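To make the composition of these helpers concrete, here is a simplified sketch with deterministic stand-ins: `process_text()` splits on blank lines instead of calling an LLM, `text_to_list()` is folded into it for brevity, and a plain Python list stands in for ChromaDB. The real implementations in tools_ingestion.py differ.

```python
# Deterministic stand-ins for the tools_ingestion.py pipeline.

def get_txt_file_content(path: str) -> str:
    # Read the raw content of a text file.
    with open(path, encoding="utf-8") as f:
        return f.read()

def process_text(text: str) -> list:
    # Stand-in chunker: split on blank lines. The real version
    # delegates chunking to an LLM call.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def save_chunks_to_db(chunks: list, db: list) -> None:
    # Stand-in for the ChromaDB upsert.
    db.extend(chunks)

def text_to_db(text: str, db: list) -> None:
    # process_text -> save_chunks_to_db
    save_chunks_to_db(process_text(text), db)

def path_to_db(path: str, db: list) -> None:
    # get_txt_file_content -> process_text -> save_chunks_to_db
    text_to_db(get_txt_file_content(path), db)
```

The point of the two top-level entry points is simply that `path_to_db()` is `text_to_db()` with a file read bolted on the front.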

I won’t comment on them here because they are already documented in the script. We can now start ChromaDB:

chmod +x start_chroma
./start_chroma

and in a separate terminal, in the same project folder, run the ingestion pipeline (remember to use the same Python environment you have created before):

chmod +x ingest
./ingest

If we look at the content of the ingest file, we can notice that the last line, about Turin, is commented out. There is a reason for this, and we will find out soon.

Retrieve

Before starting to build our MAS, we need to define some functions to retrieve information from ChromaDB. You can find the implementations of these functions in the tools_retrieve.py file. Basically, the function retrieve() connects to the DB, computes the embedding of the input query, searches for similar chunks, and returns the results. We can test this script by searching for “Universities in Amsterdam”:

python tools_retrieve.py --query "Universities in Amsterdam" --n_results 2

the output should be something similar to:

[
  {
    "uuid": "9a921695-9310-53b4-9f52-c42d7c6432ef",
    "distance": 0.5576044321060181,
    "source": "kb/cities/europe/amsterdam.txt",
    "last_update": "12 January 2025 07:11:18 UTC +0000",
    "chunk": "\n<context>Educational Institutions</context>\n<content>The University of Amsterdam (abbreviated as UvA, Dutch: Universiteit van Amsterdam) is a public research university located in Amsterdam, Netherlands. Established in 1632 by municipal authorities, it is the fourth-oldest academic institution in the Netherlands still in operation. The UvA is one of two large, publicly funded research universities in the city, the other being the Vrije Universiteit Amsterdam (VU). It is also part of the largest research universities in Europe with 31,186 students, 4,794 staff, 1,340 PhD students and an annual budget of €600 million. It is the largest university in the Netherlands by enrollment.</content>\n"
  },
  {
    "uuid": "5ce692ab-b762-53f7-84bc-f95fc6585015",
    "distance": 0.561765730381012,
    "source": "kb/cities/europe/amsterdam.txt",
    "last_update": "12 January 2025 07:11:18 UTC +0000",
    "chunk": "\n<context>University Structure and Achievements</context>\n<content>The main campus is located in central Amsterdam, with a few faculties located in adjacent boroughs. The university is organised into seven faculties: Humanities, Social and Behavioural Sciences, Economics and Business, Science, Law, Medicine, Dentistry. Close ties are harbored with other institutions internationally through its membership in the League of European Research Universities (LERU), the Institutional Network of the Universities from the Capitals of Europe (UNICA), European University Association (EUA) and Universitas 21. The University of Amsterdam has produced six Nobel Laureates and five prime ministers of the Netherlands.</content>\n"
  }
]
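Conceptually, what retrieve() does boils down to a nearest-neighbour search by embedding distance. The sketch below uses a hypothetical in-memory list of records mirroring the fields shown above, and assumes a cosine distance (the real metric depends on how the ChromaDB collection was configured):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def retrieve(query_embedding, records, n_results=2):
    # records: hypothetical in-memory stand-in for the ChromaDB collection;
    # each record carries an "embedding" plus metadata (uuid, source, chunk...).
    ranked = sorted(records, key=lambda r: cosine_distance(query_embedding, r["embedding"]))
    return [
        {**{k: v for k, v in r.items() if k != "embedding"},
         "distance": cosine_distance(query_embedding, r["embedding"])}
        for r in ranked[:n_results]
    ]
```

The real function additionally embeds the raw query string and talks to the ChromaDB server over HTTP, but the ranking logic is the same idea.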

Building MAS with AG2

AG2 (formerly known as AutoGen) is an innovative open-source programming framework designed to facilitate the development of AI agents and enhance collaboration among multiple agents to tackle complex tasks. Its primary goal is to simplify the creation and research of agentic AI. While the official AG2 website claims that the framework is ready to “build production-ready multi-agent systems in minutes,” I personally believe that there is still some work needed before it can be considered fully production-ready. However, it is undeniable that AG2 provides a very user-friendly environment for creating experiments aimed at research. It is important to emphasize that there are many other frameworks available for creating multi-agent systems. For example: Letta, LangGraph, CrewAI, etc.

In this tutorial we are going to implement a MAS with:

- Human → a proxy for human input.
- Agent Ingestion → responsible for ingesting information from text files or directly from text inputs.
- Agent Retrieve → responsible for extracting relevant information from the internal database to assist other agents in answering user questions.
- Agent Answer → responsible for providing answers to user queries using the information retrieved by Agent Retrieve.
- Agent Router → responsible for facilitating communication between the human user and the other agents.

Human will interact only with Agent Router, which will be responsible for an internal chat group that includes Agent Retrieve, Agent Answer, and Agent Ingestion. The agents inside the chat group combine their knowledge and tools to provide the best possible answer.

# Agents’ Topology

Human <-> Agent Router <-> [Agent Ingestion, Agent Retrieve, Agent Answer]

The complete code for the MA-RAG (Multi-Agent Retrieval-Augmented Generation) system can be found in the mas.py file. In this section, we will discuss some key components and features of the code that are particularly noteworthy.

Agents Definition

To define an agent in AG2, we use the ConversableAgent() class. For instance, to define the Agent Ingestion:

agent_ingestion = ConversableAgent(
    name="agent_ingestion",
    system_message=SYSTEM_PROMPT_AGENT_INGESTION,
    description=DESCRIPTION_AGENT_INGESTION,
    llm_config=llm_config,
    human_input_mode="NEVER",
    silent=False,
)

We specify:

- a name (agent_ingestion);
- the system prompt that defines the agent (SYSTEM_PROMPT_AGENT_INGESTION is a variable defined in prompts.py):

SYSTEM_PROMPT_AGENT_INGESTION = '''
You are the **Ingestion Agent** tasked with acquiring new knowledge from various sources. Your primary responsibility is to ingest information from text files or directly from text inputs.

### Key Guidelines:
- **No New Information**: You do not contribute new information to conversations; your role is strictly to ingest and store knowledge.
- **Evaluation of Information**: Before ingesting any new knowledge, carefully assess whether the information provided is genuinely novel and relevant.
- **Step-by-Step Approach**: Take a moment to reflect and approach each task methodically. Breathe deeply and focus on the process.

### Tools Available:
1. **`path_to_db()`**: Use this tool to ingest knowledge from a specified text file.
2. **`text_to_db()`**: Utilize this tool to ingest knowledge directly from provided text.

Your mission is to enhance the database with accurate and relevant information while ensuring that you adhere to the guidelines above.
'''

- the description that will help during the routing of messages (DESCRIPTION_AGENT_INGESTION is a variable defined in prompts.py):

DESCRIPTION_AGENT_INGESTION = '''
I am the **Ingestion Agent** responsible for acquiring new knowledge from text files or directly from user-provided text.
'''

- the configuration for the LLM:

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
            "temperature": 0.7,
        }
    ]
}

- whether to ask for human input every time a message is received (by setting human_input_mode="NEVER" the agent will never prompt for human input);
- whether to suppress printing of the messages sent (silent).

Similarly, we can define all other agents (human, agent_retrieve, agent_answer, agent_router).

Adding Tools

So far, we have defined various agents; however, as they are currently configured, these agents can only receive text inputs and respond with text outputs. They are not equipped to perform more complex tasks that require specific tools. For instance, an agent in its current state cannot access the database we created in the first part of this tutorial to conduct searches.

Photo by Kajetan Sumila on Unsplash

To enable this functionality, we need to “tell” the agent that it has access to a tool capable of performing certain tasks. Our preference for implementing a tool deterministically, rather than asking the agent to figure it out on its own, is based on efficiency and reliability. A deterministic approach reduces the likelihood of errors, as the process can be clearly defined and coded. Nevertheless, we will still give the agent the responsibility and autonomy to select which tool to use, determine the parameters for its use, and decide how to combine multiple tools to address complex requests. This balance between guidance and autonomy will enhance the agent’s capabilities while maintaining a structured approach.

I hope it is clear by now that, contrary to the claims made by many non-experts who suggest that agents are “so intelligent” that they can effortlessly handle complex tasks, there is actually a significant amount of work happening behind the scenes. The foundational tools that agents rely on require careful study, implementation, and testing. Nothing occurs “automagically,” even in the realm of generative AI. Understanding this distinction is crucial for appreciating the complexity and effort involved in developing effective AI systems. While these agents can perform impressive tasks, their capabilities are the result of meticulous engineering and thoughtful design rather than innate intelligence.

Remember the functions text_to_db() and path_to_db() we created before for the ingestion? We can “register” them to Agent Ingestion in this way:

register_function(
    path_to_db,
    caller=agent_ingestion,
    executor=agent_ingestion,
    name="path_to_db",
    description="Ingest new knowledge from a text file given its path.",
)

register_function(
    text_to_db,
    caller=agent_ingestion,
    executor=agent_ingestion,
    name="text_to_db",
    description="Ingest new knowledge from a piece of conversation.",
)

Similarly, we can add the retrieve tool to Agent Retrieve:

register_function(
    retrieve_str,
    caller=agent_retrieve,
    executor=agent_retrieve,
    name="retrieve_str",
    description="Retrieve useful information from internal DB.",
)

MAS Topology

So far, we have defined each agent, their roles, and the tools they can utilize. What remains is how these agents are organized and how they communicate with one another. We aim to create a topology in which the Human interacts with the Agent Router, which then participates in a nested chat group with other agents. This group collaborates to address the human query, autonomously determining the order of operations, selecting the appropriate tools, and formulating responses. In this setup, the Agent Router acts as a central coordinator that directs the flow of information among the agents (Agent Ingestion, Agent Retrieve, and Agent Answer). Each agent has a specific function: Agent Ingestion processes incoming data, Agent Retrieve accesses relevant information from the database, and Agent Answer proposes the final response based on the gathered insights.

To create a group chat, we can use the GroupChat() class.

group_chat = GroupChat(
    agents=[
        agent_router,
        agent_ingestion,
        agent_retrieve,
        agent_answer,
    ],
    messages=[],
    send_introductions=False,
    max_round=10,
    speaker_selection_method="auto",
    speaker_transitions_type="allowed",
    allowed_or_disallowed_speaker_transitions={
        agent_router: [agent_ingestion, agent_retrieve, agent_answer],
        agent_ingestion: [agent_router],
        agent_retrieve: [agent_answer],
        agent_answer: [agent_router],
    },
)

In this instantiation, we list the agents that will be part of the group (agents), decide that they don’t need to introduce themselves at the beginning of the chat (send_introductions), set the max rounds of conversation to 10 (max_round), delegate the selection of the speaker at each round to the chat manager (speaker_selection_method), and constrain the conversation transitions to a particular scheme (allowed_or_disallowed_speaker_transitions).
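The effect of the allowed_or_disallowed_speaker_transitions table is easy to state in isolation: at each round, the chat manager may only hand the floor to a neighbour of the last speaker. A minimal sketch with agent names standing in for the agent objects:

```python
# Speaker-transition table from the GroupChat above, with names for objects.
allowed_transitions = {
    "agent_router": ["agent_ingestion", "agent_retrieve", "agent_answer"],
    "agent_ingestion": ["agent_router"],
    "agent_retrieve": ["agent_answer"],
    "agent_answer": ["agent_router"],
}

def candidate_speakers(last_speaker: str) -> list:
    # The LLM-driven "auto" selection can only choose among these.
    return allowed_transitions.get(last_speaker, [])
```

So, for example, after Agent Retrieve speaks, only Agent Answer may take the next turn, which is exactly the retrieve-then-answer pattern we want to enforce.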

Having created the group, we need a group chat manager that manages the order of the conversation:

group_chat_manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
    silent=False,
    is_termination_msg=lambda msg: "(to human)" in msg["content"].lower(),
)

It is important to note the lambda function used for the is_termination_msg parameter. This function determines when the chat should terminate by checking if the last message contains the substring “(to human).” This mechanism is crucial because, in the system prompt for the Agent Router, it specifies: “Clearly indicate your message’s intended recipient. For example, use (to human) when addressing the user.” This approach provides a clear signal for when to exit the nested chat and return a response to the human user.
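In isolation, that predicate behaves like this (a plain-function restatement of the lambda above, with case-insensitive matching):

```python
def is_termination_msg(msg: dict) -> bool:
    # Terminate the nested chat once a message is addressed to the human,
    # per the convention set in the Agent Router system prompt.
    return "(to human)" in msg["content"].lower()
```

Any variation of the marker's casing still terminates the chat, but a message without the marker keeps the nested conversation going.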

Now, we need to make the group chat we have just created a nested chat that starts from Agent Router.

nested_chats = [
    {
        "recipient": group_chat_manager,
        "summary_method": "last_msg",
    }
]

agent_router.register_nested_chats(
    nested_chats,
    trigger=lambda sender: sender in [human],
)

By leveraging a structured communication framework and predefined transitions between agents, we ensure efficient collaboration while at the same time allowing flexibility in decision-making.

Let’s start chatting

We are really ready now. To start chatting with Agent Router:

chat_results = human.initiate_chat(
    agent_router,
    message=input("Ciao! How can I assist you today? 😊 : "),
    max_turns=100,
)

Or, we can run the Python script directly with:

python mas.py

and you should see something like

$ python mas.py [conda:mas]
Ciao! How can I assist you today? 😊 :

just type your message and you will receive the answer from Agent Router.

Here you can find an example of conversation. The chat started with the question

[Human] Is there a university in Philadelphia, and was it established before the University of California, Berkeley?

and the answer given by Agent Router was

[Agent Router] Yes, there is a university in Philadelphia known as the **University of Pennsylvania** (commonly referred to as Penn or UPenn). It is a private Ivy League research university and was chartered prior to the U.S. Declaration of Independence, making it one of the oldest institutions of higher education in the United States. The university first convened its board of trustees in 1749, which positions it as the fourth-oldest institution, though some argue it may be the fifth-oldest due to the establishment dates of other institutions. 📚

On the other hand, the **University of California, Berkeley** was founded in **1868**.

Given these dates, the University of Pennsylvania was established significantly earlier than the University of California, Berkeley. 🎓

**Sources**:
1. University of Pennsylvania Overview (last updated: 12 January 2025)
2. University of California, Berkeley Overview (last updated: 12 January 2025)

Note that the direct answer to the initial question was not provided in the text files we ingested at the beginning. In fact, if we look at the log of the nested chat at line 74, we can see that, inside the nested chat, Agent Retrieve autonomously makes two different queries with the retrieve tool, asking about universities in Philadelphia and about UC Berkeley, and then Agent Answer proposes the solution based on the retrieved chunks.

A similar mechanism is involved also when we add a new data source (remember the file about Turin we left out previously?):

[Human] What do you know about the University of Turin?

[Agent Router] 🤔 It seems I couldn’t find specific information about the University of Turin. The details I retrieved were mainly about other universities like the University of Barcelona and UC Berkeley. If there’s a particular aspect of the University of Turin you’re curious about, or if you’d like information on something else, just let me know! 😊

[Human] Information about Turin can be found in kb/cities/europe/turin.txt

[Agent Router] Here is some information about Turin:

### Overview
**Location**: Turin (Italian: Torino) is a city in Northern Italy, […]

In the background, what happened was that during the first interaction in the nested chat, Agent Answer realized that none of the retrieved information pertained to Turin (line 163). However, the second time, when the agents detected a new source of information, Agent Ingestion was called upon (line 179) to process the new data and save it in the database. This dynamic illustrates the collaborative nature of the agents within the system. Initially, Agent Answer’s inability to find relevant information highlights the importance of thorough data retrieval processes. Once a new source was identified, Agent Ingestion swiftly stepped in to ensure that valuable information was incorporated into the database, thereby enhancing the agents’ ability to respond effectively in future interactions.

You can get more examples in the log file where we tested how the system is resilient to external contradictory statements (line 34) and how new information directly from the conversation is saved by Agent Ingestion (lines 54, 352).

Beyond Toy MA-RAG

We have explored how to build a RAG system based on a Multi-Agent paradigm. What we presented is, of course, a simplification of how such a system needs to function in a production environment. We intentionally left out many important aspects (such as guardrails, token consumption, chat interface design, authentication, etc.) and there are numerous areas that require significant improvement. For instance, a complete pipeline for data ingestion and knowledge base updates is essential, as well as enhancing information retrieval methods that could leverage graph-based approaches rather than relying solely on embedding similarity. Moreover, the topology of the agents can be as complex as desired. For example, multiple chat groups could be created, each specialized in a particular aspect of the overall pipeline. Additionally, we could introduce oversight/judge roles to critically assess proposed plans and solutions. The possibilities are virtually limitless, and finding the right solution for a specific use case is often a form of art itself.

Conclusion

The rapid rise in popularity of MAS certainly has elements of a bubble, but it is also driven by the potential of such systems to tackle complex tasks that were previously unimaginable. Currently, we are still in a preliminary phase of this technology, even though platforms are emerging to facilitate the creation of MAS. Reflecting on this tutorial, it is evident that, in addition to the capabilities of LLMs, the management of the knowledge base is fundamentally important for a RAG system, even when enhanced by a MAS.

Moreover, while MAS unlocks new capabilities, it also introduces complexities in programming such systems. As we increase the number of agents linearly, the number of interactions between them can potentially grow quadratically. With each interaction comes the risk of ambiguities and inefficiencies that may propagate into subsequent interactions. In summary, there are numerous opportunities but also significant new risks. What we can do is strive to understand these systems deeply to be prepared for their challenges and possibilities.
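The quadratic growth is easy to make concrete: in a fully connected topology, n agents yield n(n-1)/2 possible pairwise communication channels, so doubling the agents roughly quadruples the channels that can produce ambiguity.

```python
def pairwise_channels(n_agents: int) -> int:
    # Fully connected topology: one channel per unordered pair of agents.
    return n_agents * (n_agents - 1) // 2
```

Going from 4 agents to 8 takes the channel count from 6 to 28, which is one reason constrained transition schemes like the one we used above are worth the effort.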

Reference

- https://github.com/ngshya/mas-is-all-you-need/tree/main
- https://ag2.ai/
- https://www.trychroma.com/

Contacts: LinkedIn

MAS Is All You Need: Supercharge Your Retrieval-Augmented Generation (RAG) with a Multi-Agent… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
