Thin Agents: Creating Lean AI Services with Local Fine-Tuned LLMs

A Practical Guide to Building Simple, Lightweight, and Task-Specific AI Services with Rust, Unsloth, and Llama.cpp


1. Intro

Suppose that you are maintaining a major call center application that manages customer support tickets. The application provides an interface to edit, submit, and update support tickets; it routes tickets to the relevant department, tracks their status, and alerts the relevant personnel about important changes. The application comprises multiple backend services and is almost fully automated.

Although it is one of the application's most attractive features, the automatic routing service is also one of its pain points. The routing service uses text similarity search to determine which department a ticket should be routed to. Most of the time it works well, but it suffers from a relatively high rate of classification errors. Therefore, you decide to replace it with a new routing service that uses an LLM to better understand the customer's text and improve the application's routing feature.

The new routing service is an example of an increasingly popular use case in which LLMs are used to replace a very specific and narrow function in a wider system. In this use case, the components that use an LLM are not required to have too much knowledge, memory, or deep reasoning capabilities. Rather, they are required to be able to understand and reason about a very specific type of question or problem (such as which department should handle a certain support ticket and why).

Compared to the agentic AI vision that dominates the discourse these days, the idea of thin agents as LLM-powered microservices suggests a simpler and more pragmatic approach to the role of AI in distributed systems. It enables us to enhance specific functions in our application logic with relatively little overhead or “noise”, and to focus on specific tasks rather than on our application infrastructure as a whole.

It can be argued that even thin agents still involve the concerns that come with maintaining AI infrastructure. One could obviously rely on the big SaaS LLMs to avoid managing this infrastructure, and most of the time this is probably the best way to go. But sometimes it is not. When scale and usage are high, using the major LLM providers can get costly or create an unwanted dependency on a third-party service. More importantly, when the relevant task requires domain-specific knowledge and style (like understanding our support system and the relevant departments), even the big LLMs might not be good enough or sufficiently stable.

For these reasons, such use cases are usually pushed aside because they are perceived as complicated and resource-demanding. Even if we want to implement a thin LLM-powered service for a very specific and simple task, we still have to deal with fine-tuning, hosting, more expensive hardware, serving models at scale, and so on.

But that is not necessarily so. Over the past few years we have seen great progress in making AI development faster and leaner. The availability of smaller models, easier and faster fine-tuning methods, and lighter hosting and inference frameworks make it significantly more straightforward to develop and serve AI-powered software on commodity hardware.

This post demonstrates a simple approach to building a thin AI agent that can run performantly on your existing application infrastructure. It will be implemented as a Rust micro-service that operates on a locally hosted small LLM (Llama 3.2 1B). Using the Unsloth open source framework, the LLM will be fine-tuned for a specific task in less than 20 minutes and quantized to reduce its size to less than 1GB. The LLM will be served and wrapped using a Rust binding for Llama.cpp in order to achieve optimal performance on CPU.

The post will proceed as follows: Section 2 presents the task at hand — building an intelligent ticket routing service — and a high level view of the thin agent approach we will take to implement it. Section 3 shows how to fine-tune a small LLM for this purpose using Unsloth as a fine-tuning framework. Section 4 demonstrates how to use Llama.cpp (and its Rust binding) to efficiently load and use quantized GGUF models. Section 5 shows how to wrap it all in a Rust web service using Axum. Section 6 concludes.

2. The Idea of Thin Agents

The goal of the thin agents approach is to provide a highly available, cost efficient, and lightweight inference service for a specific task involving an LLM. It accomplishes this by using small, quantized models that were fine-tuned to handle a specific task and are served using an efficient inference framework. Using smaller, fine-tuned models reduces loading time and memory footprint while maintaining good performance and accuracy, even on CPU hardware. Accordingly, the operational nature of thin agents allows us to handle them like the other micro-services in our system, at least in terms of cost and maintainability.

As mentioned earlier, we are tasked with building an LLM-powered ticket routing service. The main requirement is to create an HTTP REST service that receives the description text from the body of a support ticket and returns a JSON string with three extracted properties: the name of the product, a brief description of the issue, and a classification of the issue that will help us route the ticket (either packaging and shipping, product defect, or general dissatisfaction).

For example, the following ticket description: “I just got the kettle I ordered last week and it’s not working. I see a light when I turn it on, but it’s not getting warm” should be processed into the following output:

{
  "product_name": "Kettle",
  "brief_description": "Unresponsive and not producing heat",
  "issue_classification": "Product Defect or Problem"
}

The general design will be fairly simple. We will use an object storage location to store the model file and deploy new models. The model file only needs to be loaded once, when the service is bootstrapped, and the size of the model (less than 1GB) also makes it possible to keep the model file locally in each service instance container, if that is necessary for some reason.

Thin AI Agents — image is by the author
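
As a rough illustration of this bootstrap step (this sketch is not taken from the companion repo; it assumes the reqwest crate with its blocking feature enabled, and the model URL and path are hypothetical configuration values), the startup logic for fetching the model could look something like this:

use std::{fs, path::Path};

/// Minimal bootstrap sketch: fetch the GGUF file from object storage unless a
/// local copy is already present (e.g. baked into the container image).
/// `model_url` and `model_path` are hypothetical configuration values.
fn ensure_model_file(model_url: &str, model_path: &str) -> anyhow::Result<()> {
    if !Path::new(model_path).exists() {
        // Download once at service startup; the file is under 1GB,
        // so this only adds some time to a cold start.
        let bytes = reqwest::blocking::get(model_url)?.bytes()?;
        fs::write(model_path, &bytes)?;
    }
    Ok(())
}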

There are several model serving frameworks that could be used to serve our model. I will use Llama.cpp, as it is one of the simplest and fastest to use, and it supports loading quantized models in GGUF format. We will write the service code in Rust using llama.cpp's Rust binding. This will give us a rather clean interface to llama.cpp and boost the performance of our service. As mentioned earlier, the service itself will be very lean and straightforward — most of the work will be carried out by the fine-tuned model itself, while the service simply wraps it in a thin serving layer.

3. Fine-tuning the Ticket Routing Service LLM with Unsloth

Llama 3.2 1B is a small model (it has only 1B parameters), which means it takes less space and puts less pressure on memory, but at the price of reduced accuracy and weaker reasoning capabilities. Fine-tuning and quantizing a small Llama will enable us to improve and stabilize its performance on our task despite its size, while making it even smaller.

Fine-tuning is often perceived as quite complicated and resource-demanding. Indeed, that was the case until the recent emergence of frameworks designed to make fine-tuning more straightforward and less demanding. One of these is the open source framework Unsloth, which has been getting quite a lot of traction recently due to the optimizations it adds and the ease of use it offers. We shall shortly see that 20 minutes on a free Colab notebook is enough to fine-tune a small Llama, though Unsloth also enables fine-tuning much bigger models, about twice as fast as standard approaches. Additionally, Unsloth provides multiple hands-on, beginner-friendly notebooks which really make fine-tuning a breeze, and which we will use as a starting point.

The most important part of fine-tuning is having a well-curated dataset that fits the problem at hand. The task our LLM needs to carry out is rather straightforward — it needs to analyze a text paragraph and extract a few important parts from it. Therefore, we will not need much sample data to fine-tune the model to perform well.

We start by creating a training dataset in the llama format. Each record, or conversation, in the llama format consists of a system, user, and assistant prompt. As you can see in the sample record below, the system prompt briefly explains the task and the required output, the user prompt provides the ticket text, and the assistant prompt contains the expected JSON result.

{
  "conversations": [
    {
      "role": "system",
      "content": "You are an AI designed to assist in interpreting customer support tickets.
        Your task is to analyze the text of a ticket and extract three details in JSON format:
        1. Product Name: Identify the product mentioned in the ticket.
           If no product name is mentioned, return 'Unknown'.
        2. Brief Description: Summarize the issue in a few words based on the ticket's content.
        3. Issue Classification: Classify the nature of the issue into one of the following categories:
           Packaging and Shipping, Product Defect or Problem, General Dissatisfaction.

        Return the information as a JSON string with 3 properties:
        `product_name`, `brief_description`, `issue_classification`."
    },
    {
      "role": "user",
      "content": "I received the phone, but the screen is cracked right out of the box."
    },
    {
      "role": "assistant",
      "content": "{\"product_name\": \"Phone\", \"brief_description\": \"Received product with cracked screen\", \"issue_classification\": \"Product Defect or Problem\"}"
    }
  ]
}

The structure and content of the training data we need are simple enough that we can generate multiple samples using one of the big commercial LLMs. Bigger LLMs can be very helpful in generating sample data when they are given a few good examples and proper instructions. You can see the generated dataset I used for fine-tuning here.

Unsloth offers sample notebooks for many open source models. The notebooks are very user friendly and work seamlessly on Colab. For our task, we will use the notebook named Llama3.2_(1B_and_3B), which already contains most of what is needed to fine-tune our model using any dataset in the llama format. Because the notebooks are routinely updated and contain many options that are not necessarily relevant to our use case, I prepared a more concise copy, which you can find here and in the companion repo, that contains just what we need for our purpose.

I believe the notebook is self-explanatory, so I will just highlight a couple of points that need to be adapted.

First, the second cell in the notebook is where you choose the model you want to fine-tune. You can use models from HuggingFace or from the Unsloth repo, which usually has smaller quantized versions.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # or choose "unsloth/Llama-3.2-3B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Second, search for the following line which loads your dataset:

dataset = Dataset.from_json("/content/finetune_data_sampels.json")

If you are training on Colab (like me), then you need to upload your training data file and then load it as in the example above. It's also possible to use other dataset formats or to load the dataset from an API. Follow the code and explanations in the relevant Unsloth notebooks to do so.

From this point you can go on and start the training. Be sure to set the proper training parameters in the cell that constructs the SFTTrainer. Finally, after the model has been fine-tuned, we need to quantize it and save it in GGUF format, which is optimized for fast loading and inference.

if True: model.save_pretrained_gguf("model", tokenizer,
    quantization_method = "q4_k_m")

As you can see in the lines above, the quantized model will be saved to a model directory using q4_k_m quantization. Make sure you understand what the different options in that cell mean, though q4_k_m is usually a good starting point or default. Quantizing and saving the model can take some time. After it's done, you can simply download the GGUF file, and that's a wrap for this important part.

4. Efficient Model Serving with Rust and Llama.cpp

Llama.cpp is an open source library that aims to “enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware — locally and in the cloud.” Its main contributor is also the creator of the GGUF model file format, which llama.cpp uses for fast loading and ease of reading. For these reasons, llama.cpp is widely regarded as a great fit for use cases involving local serving of smaller models on CPU.

Llama.cpp is written in C++, but its ecosystem offers binding libraries in many languages. We will use the Rust binding, as it will optimize the performance of our service, in addition to being part of a comprehensive ecosystem for creating performant micro-services. However, what follows can certainly be implemented using a different language binding.

I will be using the llama-cpp-rs library, as in my experience it offers the best developer experience and sufficient control over its functionality. The repository has a submodule that is linked to the original llama.cpp project (in order to keep using the most updated version) and therefore needs to be explicitly cloned. This can be expressed in the Cargo.toml file using the following dependency table:

[dependencies.llama-cpp-2]
git = "https://github.com/utilityai/llama-cpp-rs.git"
# the specific crate in the repo
package = "llama-cpp-2"
# the revision (that was on main in Jan 25)
rev = "5f4976820b54f0dcc5954f335827d31e1794ef2e"
# to pull the submodules and the llama.cpp lib
submodules = true

You might bump into some errors due to the C++ build dependencies (for example, on Mac, I had to install CMake to build it successfully). Also, note that llama.cpp is compute heavy, so it's better to build and run it in the release profile, even when running tests.

Note that there are two crates in the repo. The llama-cpp-2 crate provides the binding interface functions. However, in order to maintain a cleaner and simpler API design, we will wrap it in our own Rust module that hides llama.cpp's internal API and offers a simpler interface based on two main functions: a new(model_file_path) function that loads the model file, and a generate_text() function that creates a new llama session and generates the model's response to a prompt. You can find the wrapper file in the repo here; below is a shortened version of its main parts.

use llama_cpp_2::...; // binding imports elided (see the full wrapper file in the repo)

pub struct LlamaApp {
    backend: LlamaBackend,
    model: LlamaModel,
}

impl LlamaApp {
    /// Creates a new instance by loading a given model file from disk.
    pub fn new(model_path: &str) -> anyhow::Result<Self> {
        // Initialize the backend
        let backend = LlamaBackend::init()
            .context("Failed to initialize LLaMA backend")?;

        // Load the GGUF model file (model_params is constructed in the full file)
        let model = LlamaModel::load_from_file(&backend, model_path, &model_params)
            .with_context(|| format!("Unable to load model from {model_path}"))?;

        Ok(Self { backend, model })
    }

    /// Text generation function
    pub fn generate_text(
        &self,
        prompt: &str,
        ctx_size: u32,
        temp: f32,
        seed: Option<u32>,
    ) -> anyhow::Result<String> {
        // Create context parameters
        // (controls how many tokens the context can hold)
        let ctx_params = LlamaContextParams::default()
            .with_n_ctx(Some(NonZeroU32::new(ctx_size).unwrap()));

        // Create a context for this model
        let mut ctx = self
            .model
            .new_context(&self.backend, ctx_params)
            .context("Unable to create LLaMA context")?;

        // Build a sampler (decides how to pick tokens)
        let mut sampler = build_sampler(seed, temp);

        // The decoding loop is elided here: it tokenizes the prompt, feeds it to
        // the context, then repeatedly samples the next token and appends it to
        // output_text until max_generation_tokens is reached.
        while n_cur <= max_generation_tokens {
            // ...
        }

        Ok(output_text)
    }
}

As you can see, the LlamaApp struct wraps the actual model and is created by loading it from a GGUF file (which can be either local or remote). Calling the generate_text() function on a reference to the model creates a new context that can be adjusted with the relevant parameters.

As we shall see in the next section, this approach will help us ensure our service can scale and avoid synchronization: each new request creates its own context from a shared reference to the loaded model and backend.
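
To make that point concrete, here is a minimal sketch (not part of the companion repo) of sharing one loaded model across threads. It assumes the wrapper type is Send + Sync, which is also what Axum requires of shared state in the next section, and it uses a hypothetical local model path:

use std::{sync::Arc, thread};
use llama_cpp::LlamaApp; // the wrapper module shown above

fn main() -> anyhow::Result<()> {
    // Load the model once; the path is a hypothetical local GGUF file.
    let model = Arc::new(LlamaApp::new("model/llama-3.2-1b-q4_k_m.gguf")?);

    // Spawn a few workers that all share the same weights; each call to
    // generate_text() builds its own short-lived context, so no lock is needed.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                let prompt = format!("Ticket #{i}: the blender arrived with a missing lid.");
                model.generate_text(&prompt, 512, 1.0, Some(42))
            })
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap()?);
    }
    Ok(())
}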

5. Implementing the Thin Agent Micro-service

At this point it's just a matter of putting everything together. We have a small LLM that was fine-tuned to carry out the classification task at hand, and a Rust module that provides loading and inference functionality using llama.cpp. All we need to do is create a thin service that wraps this functionality as a REST micro-service: it loads the model, processes requests, and formats the output.

We will use Axum, one of the leading and most flexible web frameworks in Rust's ecosystem, to build our micro-service. Because most of the functionality of the service is executed by our fine-tuned LLM, all we need to do is write a thin layer that loads and serves it.

The ticket routing service will have one handler and one route: /classify_ticket. Axum provides an elegant way to store a reference to shared service state, which is often used for database connections, configuration, etc. In the service's initialization code, we will load the model from the file and store it in the service state.

use llama_cpp::LlamaApp;

let model_path = ...; // path to the GGUF model file

let llama = LlamaApp::new(&model_path)
    .expect("Failed to initialize LLaMA backend");
let model = Arc::new(llama);

let app = Router::new()
    .route("/classify_ticket", post(classify_ticket))
    .with_state(model);

let listener = tokio::net::TcpListener::bind("127.0.0.1:3000")
    .await
    .unwrap();
axum::serve(listener, app).await.unwrap();

We load the model weights into our llama.cpp wrapper struct — LlamaApp. Next, we wrap it in an Arc, a thread-safe reference counter that allows us to manage it as shared state. Finally, we create our route and attach the model as the service state.

As mentioned earlier, because most of the work is done by the LLM, all we need to do is create a thin handling layer that builds the prompt, triggers the model, and formats the response.

async fn classify_ticket(
    State(model): State<Arc<LlamaApp>>,
    Json(request): Json<ClassifyTicketRequest>,
) -> (StatusCode, Json<ClassifyTicketResponse>) {
    // Build the prompt in the same format used during fine-tuning
    let prompt = get_prompt(request.text);

    // Run inference (context size 512, temperature 1.0, fixed seed)
    let agent_response = model
        .generate_text(&prompt, 512, 1.0, Some(42))
        .expect("text generation failed");

    // Parse the model's JSON output into our response struct
    let agent_response: ClassifyTicketResponse =
        serde_json::from_str(&agent_response).expect("model returned invalid JSON");

    (StatusCode::OK, Json(agent_response))
}

We have defined ClassifyTicketResponse as a struct based on the schema that we fine-tuned the LLM to return. Therefore, our handler code consists of just a few lines in which we call the model, deserialize the answer, and return it.
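
The request and response types and the get_prompt() helper are not shown in the snippets above, so here is a minimal sketch of how they might look. The field names follow the JSON schema we fine-tuned the model on, while the exact prompt template is an assumption; in practice it should mirror the chat template applied during fine-tuning.

use serde::{Deserialize, Serialize};

// Sketch of the request/response types (names taken from the handler above);
// the response fields mirror the JSON schema the model was fine-tuned to emit.
#[derive(Deserialize)]
struct ClassifyTicketRequest {
    text: String,
}

#[derive(Serialize, Deserialize)]
struct ClassifyTicketResponse {
    product_name: String,
    brief_description: String,
    issue_classification: String,
}

// Hypothetical prompt builder: it should reproduce the same system/user
// structure (Llama 3 chat template) that the model saw during fine-tuning.
fn get_prompt(ticket_text: String) -> String {
    let system = "You are an AI designed to assist in interpreting customer support tickets. \
        Return a JSON string with 3 properties: `product_name`, `brief_description`, `issue_classification`.";
    format!(
        "<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>\
         <|start_header_id|>user<|end_header_id|>\n\n{ticket_text}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}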

Considering that we are running a small LLM on a CPU, we get pretty decent latency (~2 seconds per request), which can serve as the basis of a routing service that we can scale efficiently.
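
For a quick smoke test of the running service, a minimal client sketch (assuming the reqwest crate with its blocking and json features, plus serde_json) could look like this:

use serde_json::json;

fn main() -> anyhow::Result<()> {
    // Hypothetical smoke test against a locally running instance of the service.
    let response = reqwest::blocking::Client::new()
        .post("http://127.0.0.1:3000/classify_ticket")
        .json(&json!({
            "text": "I just got the kettle I ordered last week and it's not working."
        }))
        .send()?;

    // Expect a body shaped like:
    // {"product_name": "Kettle", "brief_description": "...", "issue_classification": "Product Defect or Problem"}
    println!("{}", response.text()?);
    Ok(())
}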

6. Conclusion

This post discussed the idea of a thin agent, which suggests a simpler and more pragmatic view of how AI-based services can power modern applications and integrate with common service infrastructure. It is based on the premise that a small fine-tuned LLM can efficiently carry out simple tasks, when they are clearly defined and specific, even with limited compute resources. Because LLM-powered services have deeper text-analysis capabilities, they can easily outperform similar services based on heuristics or more complex logic, though their infrastructure is typically harder to manage. If we can harness the power of LLMs without that infrastructure overhead, then this approach becomes quite attractive.

We saw that with the help of frameworks such as Unsloth, and a proper training dataset, fine-tuning can be quite straightforward and faster than is commonly thought. We also learned how to use llama.cpp to efficiently load and run inference on quantized models on commodity hardware and achieve satisfying performance. Finally, I showed how we can use the Rust ecosystem to create a thin serving layer around a task-specific LLM, providing a performant and efficient solution that benefits from the speed and safety of Rust with minimal overhead.

Thin AI agents represent a pragmatic and efficient approach to integrating AI into our service mesh. By using small, task-specific models, fine-tuning techniques, and efficient inference frameworks, developers can create intelligent micro-services that solve focused, yet complicated, problems without the complexity and overhead of more expensive AI infrastructure. This approach not only reduces computational and financial barriers, but also opens new possibilities for intelligent, lightweight software solutions.

Hope that was useful!

Some References

The companion repo can be found here.
Read about Unsloth in their documentation.
Read about llama.cpp here and about GGUF files.

Unless mentioned otherwise, all images in this post are by the author.

All views are mine

