Beyond RAG: Precision Filtering in a Semantic World
Aligning expectations with reality by using traditional ML to bridge the gap in an LLM’s responses
Early on we all realized that LLMs only knew what was in their training data. Playing around with them was fun, sure, but they were and still are prone to hallucinations. Using such a product commercially in its “raw” form is, to put it nicely, dumb as rocks (the LLM, not you… possibly). To try to alleviate both issues, hallucinations and no knowledge of unseen/private data, there are two main avenues: train a custom LLM on your private data (a.k.a. the hard way), or use retrieval-augmented generation (a.k.a. the one we all basically took).
RAG is an acronym now widely used in the field of NLP and generative AI. It has evolved and led to many diverse new forms and approaches, such as GraphRAG, pivoting away from the naive approach most of us first started with. The me from two years ago would just parse raw documents into a simple RAG and then, on retrieval, provide this possible (most likely) junk context to the LLM, hoping that it would be able to make sense of it and use it to better answer the user’s question. Wow, how ignorance is bliss; also, don’t judge: we all did this. We all soon realized that “garbage in, garbage out,” as our first proofs-of-concept performed… well… not so great. From this, the open-source community put in a lot of effort to give us ways to build more sensible, commercially viable applications. These included, for example: reranking, semantic routing, guardrails, better document parsing, realigning the user’s question to retrieve more relevant documents, context compression, and the list could go on and on. On top of this, we all 1-upped our classical NLP skills and drafted guidelines for teams curating knowledge, so that the parsed documents stored in our databases were now all pretty and legible.
While working on a retrieval system that had about 16 (possible exaggeration) steps, one question kept coming up: can my stored context really answer this question? Or, to put it another way (and the framing I prefer): does this question really belong to the stored context? While the two questions seem similar, the distinction is that the first is localized (e.g. to the 10 retrieved docs) while the other is globalized (with respect to the entire subject/topic space of the document database). You can think of one as a fine-grained filter and the other as more general. I’m sure you’re probably wondering what the point of all this is. “I do cosine similarity thresholding on my retrieved docs, and everything works fine. Why are you trying to complicate things here?” OK, I made up that last thought-sentence; I know that you aren’t that mean.
To drive home my over-complication, here is an example. Say the user asks, “Who was the first man on the moon?” Now, let’s forget that the LLM could straight up answer this one, and assume we expect our RAG to provide context for the question… except all our docs are about products for a fashion brand! Silly example, agreed, but in production many of us have seen that users constantly ask questions that don’t align with any of the docs we have. “Yeah, but my system prompt tells the LLM to ignore questions that don’t fall within a topic category, and the cosine similarity will filter out weak context for these kinds of questions anyway,” or “I have catered for this using guardrails or semantic routing.” Sure, again, agreed. All these methods work, but they either act too late downstream (e.g. the first two examples) or aren’t completely tailored for this (e.g. the last two examples). What we really need is a fast classification method that can rapidly tell you whether the question is a “yea” or a “nay” for the docs to provide context for… even before retrieving them. If you’ve guessed where this is going, you’re part of the classical ML crew 😉 Yep, that’s right, good ole outlier detection!
Outlier detection combined with NLP? Clearly someone has wayyyy too much free time to play around.
When building a production-level RAG system, there are a few things we want to ensure: efficiency (how long a response usually takes), accuracy (is the response correct and relevant), and repeatability (sometimes overlooked, but super important; check out a caching library for this one). So how is an outlier detection (OD) method going to help with any of these? Let’s brainstorm quickly. If the OD sees a question and immediately says “nay, it’s an outlier” (I’m anthropomorphizing here), then many steps downstream can be skipped, making this route way more efficient. Say now that the OD says “yea, all safe”; well, with a little overhead we get a greater level of assurance that the topic spaces of both the question and the stored docs are aligned. With respect to repeatability, we’re in luck again, since classic ML methods are generally deterministic, so at least this additional step isn’t going to suddenly start apologizing and take us on a downward spiral of repetition and misunderstanding (I’m looking at you, ChatGPT).
Wow, this has been a little long-winded, sorry, but finally I can now start showing you the cool stuff.
Muzlin, a Python library and a project I am actively involved in, was developed exactly for these types of semantic filtering tasks, using simple ML for production-ready environments. Skeptical? Well, come on, let’s take a quick tour of what it can do for us.
The dataset we will be working with contains 5.18K rows from BEIR (SciFact, CC BY-SA 4.0). To create a vectorstore, we will use the scientific claim column.
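(The original post embeds a code gist here that isn’t reproduced in this copy. As a rough stand-in, loading the SciFact data might look something like the sketch below; the Hugging Face hub dataset name, config, and column name are my assumptions, not taken from the original gist.)

```python
# Rough stand-in for the original gist: pull the SciFact corpus from the
# Hugging Face hub. Dataset/config/split/column names are assumptions.
from datasets import load_dataset

corpus = load_dataset("BeIR/scifact", "corpus", split="corpus")
claims = [row["text"] for row in corpus]  # ~5.18K scientific claims/abstracts

print(len(claims))
print(claims[0][:120])
```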
So, with the data loaded (a bit of a small one, but hey, this is just a demo!) the next step is to encode it. There are many ways to do this, e.g. tokenizing, vector embeddings, graph node-entity relations, and more, but for this simple example let’s use vector embeddings. Muzlin has built-in support for all the popular brands (Apple, Microsoft, Google, OpenAI); well, I mean their associated embedding models, but you get me. Let’s go with, hmmm, Hugging Face, because, you know, it’s free and my current POC budget is… as shoestring as it gets.
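(Again, the embedded gist isn’t reproduced here. Below is a plain sentence-transformers version of the embedding step rather than Muzlin’s own encoder wrappers; the model choice is an assumption made to match the free, shoestring-budget theme.)

```python
# Not the Muzlin API: an equivalent embedding step with sentence-transformers.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, free, CPU-friendly
embeddings = np.asarray(encoder.encode(claims, show_progress_bar=True))

print(embeddings.shape)  # (n_claims, 384) for this model
```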
Sweet! If you can believe it, we’re already halfway there. Is it just me, or do so many of these LLM libraries leave you having to code an extra 1,000 lines with a million dependencies, only to break whenever your boss wants a demo? It’s not just me, right? Right? Anyways, rant aside, there are really just two more steps to having our filter up and running. The first is to use an outlier detection method to evaluate the embedded vectors. This constructs an unsupervised model that gives you a likelihood value for how probable any given vector is, whether it comes from our current embeddings or a new one.
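(The original gist fits the detector through Muzlin. Since Muzlin wraps PyOD models under the hood, here is a bare-bones sketch of the same idea using PyOD directly; the specific detector and contamination value are my assumptions, and the joblib dump simply mirrors the local artifact described below.)

```python
# Unsupervised outlier detection on the claim embeddings with PyOD directly
# (a sketch of the idea, not Muzlin's wrapper API).
from pyod.models.iforest import IForest
import joblib

od_model = IForest(contamination=0.05, random_state=42)  # contamination is a guess
od_model.fit(embeddings)

# Persist the fitted detector so inference can happen elsewhere.
joblib.dump(od_model, "outlier_detector.joblib")
```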
No jokes, that’s it. Your model is all done. Muzlin is fully scikit-learn compatible and Pydantically validated. What’s more, MLflow is also fully integrated for data logging. The example above doesn’t use it, so the result will automatically be saved as a joblib model in your local directory instead. Nifty, right? Currently only PyOD models are supported for this type of OD, but who knows what the future has in store.
Damn Daniel, why you making this so easy. Bet you’ve been leading me on and it’s all downhill from here.
In response to the above: s..u..r..e, that meme is getting way too old now. But otherwise, no jokes, the last step is at hand, and it’s about as easy as all the others.
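(The final gist isn’t reproduced either. Sticking with the plain-PyOD stand-in from above, the load-and-predict flow that Muzlin’s OutlierDetector class handles for you might look roughly like this; the question string is the moon-landing example from earlier, everything else is illustrative.)

```python
# Sketch of the inference step: load the trained detector and score a question.
import joblib

od_model = joblib.load("outlier_detector.joblib")  # trained elsewhere, loaded here

question = "Who was the first man on the moon?"
q_vec = encoder.encode([question])

label = od_model.predict(q_vec)[0]             # PyOD convention: 0 = inlier, 1 = outlier
score = od_model.decision_function(q_vec)[0]   # higher = more outlier-ish
print("outlier" if label == 1 else "inlier", score)
```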
Okay, okay, this was the longest script, but look… most of it is just there so you can play around with it. Let’s break down what’s happening here. First, the OutlierDetector class now expects a model. I swear it’s not a bug, it’s a feature! In production you don’t exactly want to train the model on the spot each time just to run inference, and often training and inference take place on different compute instances, especially on cloud compute. So the OutlierDetector class caters for this by letting you load an already trained model and inference on the go. YOLO. All you have to do now is encode a user’s question and predict using the OD model, and hey presto, well looky here, we gots ourselves a little outlier.
What does it mean now that the user’s question is an outlier? Cool thing: that’s all up to you to decide. The stored documents most likely do not have any context that would answer said question in any meaningful way. You can instead reroute it to either tell that Kyle from the testing team to stop messing around, or, more seriously, save tokens and return a default response like “I’m sorry Dave, I’m afraid I can’t do that” (oh HAL 9000, you’re so funny; also, please don’t space me).
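(To make that rerouting idea concrete, here is a tiny hypothetical sketch; retrieve and generate are placeholders for whatever retrieval and LLM-call steps your existing pipeline already has, and are not part of Muzlin.)

```python
# Hypothetical routing sketch: skip retrieval and generation for outlier questions.
DEFAULT_REPLY = "I'm sorry, Dave. I'm afraid I can't do that."

def answer(question: str, retrieve, generate) -> str:
    q_vec = encoder.encode([question])
    if od_model.predict(q_vec)[0] == 1:   # outlier: the stored docs won't help
        return DEFAULT_REPLY              # no retrieval, no LLM tokens spent
    docs = retrieve(q_vec)                # your existing retrieval step
    return generate(question, docs)       # your existing LLM call
```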
To sum everything up, integration is better (ha, a math joke for you math readers). But really, classical ML has been around way longer and is way more trustworthy in a production setting. I believe more tools should incorporate this ethos going forward on the generative AI roller-coaster ride we’re all on (side note: this ride costs way too many tokens). By using outlier detection, off-topic queries can quickly be rerouted, saving compute and generative costs. As an added bonus, I’ve even provided an option to do this with GraphRAGs too, heck yeah — nerds unite! Go forth, and enjoy the tools that open-source devs lose way too much sleep over to give away freely. Bon voyage, and remember to have fun!