A Multimodal AI Assistant: Combining Local and Cloud Models

Dalle-3's interpretation of "a quirky robot wearing a tool belt and puzzling over a question". Image generated by the author.

Use LangGraph, mlx and Florence2 to build an agent that answers complex image questions, with the option to run everything locally.

In this article we’ll use LangGraph in conjunction with several specialized models to build a rudimentary agent that can answer complex questions about an image, including captioning, bounding boxes and OCR. The original idea was to build this using local models only, but after some iteration I decided to add connections to cloud based models too (i.e. GPT4o-mini) for more reliable results. We’ll explore that aspect too, and all the code for this project can be found here.

Over the past year, multimodal large language models — LLMs that take reasoning and generative capabilities beyond text to media such as images, audio and video — have become increasingly powerful, accessible and usable within production ML systems.

Closed source, cloud-based models like GPT-4o, Claude Sonnet and Google Gemini are remarkably capable at reasoning over image inputs and are much cheaper and faster than the multimodal offerings just a few months ago. Meta has joined the party by releasing the weights of multiple competitive multimodal models in its Llama 3.2 series. In addition, cloud computing services like AWS Bedrock and Databricks Mosaic AI are now hosting many of these models, allowing developers to quickly try them out without having to manage hardware requirements and cluster scaling themselves. Finally, there is an interesting trend towards a myriad of smaller, specialized open source models which are available for download from repositories like Hugging Face. A smart agent with access to these models should be able to choose which ones to call in what order to get a good answer, which bypasses the need for a single giant, general model.

One recent example with fascinating image capabilities is Florence2. Released in June 2024, it is somewhat of a foundational model for image-specific tasks such as captioning, object detection, OCR and phrase grounding (identifying objects from provided descriptions). By LLM standards it's also small — 0.77B parameters for the most capable version — and therefore runnable on a modern laptop. Florence2 can beat massive multimodal LLMs like GPT4o at specialized image tasks, because while these larger models are great at answering general questions in text, they're not really designed to provide numerical outputs like bounding box coordinates. With the right training data at the instruction fine tuning stage they can certainly improve — GPT4o can be fine tuned to become good at object detection, for example — but many teams don't have the resources to do this. Intriguingly, Gemini is in fact advertised as being capable of object detection out of the box, but Florence2 is still more versatile in terms of the range of image tasks it can accomplish.

Reading about Florence2 spawned the idea for this project. If we can connect it to a text-only LLM that’s good at reasoning (Llama 3.2 3B for example) and a multimodal LLM that’s good at answering general questions about images (such as Qwen2-VL) then we could build a system that answers complicated questions over an image. It would do so by first planning which models to call with what inputs, then running those tasks and assembling the results. The agent orchestration library LangGraph, which I explored in a recent project article here, provides a great framework for designing such a system.

Also I recently purchased a new laptop: an M3 MacBook with 24GB of RAM. Such a machine can run the smallest versions of these models with reasonable latency, making local development of an image agent possible. This combination of increasingly capable hardware and shrinking models (or smart ways of compressing/quantizing large models) is very impressive! But it has practical challenges: for a start, when Florence2-base-ft, Llama-3.2-3B-Instruct-4bit, and Qwen2-VL-2B-Instruct-4bit are all loaded up, I barely have enough RAM for anything else. That's fine for development, but it would be a big problem for an application that people might actually find useful. Also, as we'll see, Llama-3.2-3B-Instruct-4bit is not great at producing reliable structured outputs, which caused me to switch to GPT4o-mini for the reasoning step during development of this project.

The image agent

But what exactly is this project? Let’s introduce it with a tour and demo of the system that we’ll build. The StateGraph (take a look at this article for an intro) looks like this, and every query consists of an image and text input.

Visualization of the control flow of the agent that we will build. Image generated by the author.

We proceed through the stages, each of which is associated with a prompt.

1. Planning. Here the goal is just to formulate a plan in text about how best to answer the questions with the tools available. The prompt contains a list of the tools and their various modes. A more complex system might use RAG at this stage to gather the list of tools most appropriate for the problem and craft a plan.
2. Structure the plan. The aim here is to create a list of plan components that the agent can step through. We take the plan text and force the model to generate a list that is consistently formatted according to the Pydantic models here (a minimal sketch of these models follows this list). It's useful to keep both the plan text and the structured plan for evaluation purposes.
3. Execute the plan. Each element in the structured plan contains a tool name and inputs. We then proceed to call these tools in sequence and gather their results. Our agent has just two available tools: special vision (which calls Florence2) and general vision (which calls Qwen2 or GPT4o), and the routing node is used to keep track of the current plan stage.
4. Assess the result. Once each step of the plan has been executed, a model gets to see the inputs and outputs and make an assessment of whether or not the question was answered. If not, we go back to the planning step and try to amend the old plan using these new insights. If yes, we proceed to the end. If the model revises the plan too many times, a timeout is triggered that allows the loop to break.
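The exact Pydantic models are defined in the repo; a minimal sketch, with field names inferred from the structured plans shown later in this article, might look like the following (the structure_plan node then post-processes the validated steps into a dictionary keyed by step number).

from pydantic import BaseModel, Field

class PlanStep(BaseModel):
    """A single step of the plan: which tool to call, in which mode, with what input."""
    tool_name: str = Field(description="'general_vision' or 'special_vision'")
    tool_mode: str = Field(description="e.g. 'conversation' or 'specific object detection'")
    tool_input: str = Field(description="The question or phrase passed to the chosen tool")

class PlanStructure(BaseModel):
    """The full plan as an ordered list of steps."""
    steps: list[PlanStep]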

There are many possible improvements and extensions here, for example in the current implementation the agent just spits out the results of all the prior steps, but an LLM could be called here to formulate them into a nice answer if desired.

As an example, just to see it in action, we give it the following image and input text, which is a fairly complex set of questions:

The image used in this example test, photograph taken by the author.

query = """
What city is in this photo? Tell me what the tallest building is, then find all the skyscrapers and bridges
"""

After going through all the steps, this is the result we get.

[
  {1: 'This photo is of Austin, Texas. The tallest building in the image is likely The Independent, also known as the "Jenga Tower."'},
  {2: '{"bboxes": [
    [961.9750366210938, 293.1780090332031, 1253.5550537109375, 783.6420288085938],
    [77.67500305175781, 41.65800094604492, 307.1150207519531, 401.64599609375],
    [335.7950134277344, 310.4700012207031, 474.4150085449219, 753.7739868164062],
    [534.1650390625, 412.6499938964844, 684.7350463867188, 774.2100219726562],
    [1365.885009765625, 510.114013671875, 1454.3150634765625, 774.2100219726562],
    [1824.76513671875, 583.9979858398438, 1927.5350341796875, 758.489990234375]
  ], "labels": ["skyscraper", "skyscraper", "skyscraper", "skyscraper", "skyscraper", "skyscraper"]}'},
  {3: '{"bboxes": [
    [493.5350341796875, 780.4979858398438, 2386.4150390625, 1035.1619873046875]
  ], "labels": ["bridge"]}'}
]

And we can plot these results to confirm the bounding boxes.

The agent’s answer to our multi-part question about this image, where we overlay the bounding box coordinates it provides.
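For reference, a minimal sketch of how such an overlay can be drawn with PIL, assuming the detection results have already been parsed from the JSON strings shown above:

import json
from PIL import Image, ImageDraw

def draw_detections(image: Image.Image, detection_json: str) -> Image.Image:
    # Overlay the bboxes/labels from a Florence2-style result string on a copy of the image
    result = json.loads(detection_json)
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=4)
        draw.text((x1, max(0, y1 - 20)), label, fill="red")
    return annotated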

Impressive! One could argue about whether or not it really found all the skyscrapers here but I feel like such a system has the potential to be quite powerful and useful, especially if we were to add the ability to crop the bounding boxes, zoom in and continue the conversation.

In the following sections, let’s dive into the main steps in a bit more detail. My hope is that some of them might be informative for your projects too.

The agent state, nodes and edges

My previous article contains a more detailed discussion of agents and LangGraph, so here I'll just touch on the agent state for this project. The AgentState is made accessible to all the nodes in the LangGraph graph, and it's where the information relevant to a query gets stored.

Each node can be told to write to one or more variables in the state, and by default they get overwritten. This is not the behavior we want for the plan output, which is supposed to be a list of results from each step of the plan. To ensure that this list gets appended to as the agent goes about its work, we use the add reducer, which you can read more about here.
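The full AgentState definition lives in the repo; a minimal sketch, assuming fields matching the state keys used by the nodes in this article, looks something like this:

from operator import add
from typing import Annotated, Any, TypedDict

class AgentState(TypedDict):
    # overwritten whenever a node returns a new value
    plan: str
    plan_structure: str
    current_step: int
    max_steps: int
    answer_assessment: str
    # appended to on every plan step, thanks to the add reducer
    plan_output: Annotated[list[Any], add]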

Each of the nodes in the graph above is a method in the class AgentNodes. They take in state, perform some action (typically calling an LLM) and output their updates to the state. As an example, here’s the node used to structure the plan, copied from the code here.

def structure_plan_node(self, state: dict) -> dict:
    messages = state["plan"]
    response = self.llm_structure.call(messages)
    final_plan_dict = self.post_process_plan_structure(response)
    final_plan = json.dumps(final_plan_dict)

    return {
        "plan_structure": final_plan,
        "current_step": 0,
        "max_steps": len(final_plan_dict),
    }

The routing node is also important because it’s visited multiple times over the course of plan execution. In the current code it’s very simple, just updating the current step state value so that other nodes know which part of the plan structure list to look at.

def routing_node(self, state: dict) -> dict:
    plan_stage = state.get("current_step", 0)
    return {"current_step": plan_stage + 1}

An extension here would be to add another LLM call in the routing node to check whether the output of the previous step of the plan warrants any modifications to the next steps, or early termination if the question has already been answered.

Finally we need to add two conditional edges, which use data stored in the AgentState to determine which node should be run next. For example, the choose_model edge looks at the name of the current step in the plan_structure object carried in AgentState and then uses a simple if statement to return the name of the corresponding node that should be called at that step.

def choose_model(state: dict) -> str:
    current_plan = json.loads(state.get("plan_structure"))
    current_step = state.get("current_step", 1)
    max_step = state.get("max_steps", 999)

    if current_step > max_step:
        return "finalize"
    else:
        step_to_execute = current_plan[str(current_step)]["tool_name"]
        return step_to_execute

The entire agent structure looks like this.

edges: AgentEdges = AgentEdges()
nodes: AgentNodes = AgentNodes()
agent: StateGraph = StateGraph(AgentState)

## Nodes
agent.add_node("planning", nodes.plan_node)
agent.add_node("structure_plan", nodes.structure_plan_node)
agent.add_node("routing", nodes.routing_node)
agent.add_node("special_vision", nodes.call_special_vision_node)
agent.add_node("general_vision", nodes.call_general_vision_node)
agent.add_node("assessment", nodes.assessment_node)
agent.add_node("response", nodes.dump_result_node)

## Edges
agent.set_entry_point("planning")
agent.add_edge("planning", "structure_plan")
agent.add_edge("structure_plan", "routing")
agent.add_conditional_edges(
    "routing",
    edges.choose_model,
    {
        "special_vision": "special_vision",
        "general_vision": "general_vision",
        "finalize": "assessment",
    },
)
agent.add_edge("special_vision", "routing")
agent.add_edge("general_vision", "routing")
agent.add_conditional_edges(
    "assessment",
    edges.back_to_plan,
    {
        "good_answer": "response",
        "bad_answer": "planning",
        "timeout": "response",
    },
)
agent.add_edge("response", END)

And it can be visualized in a notebook using the tutorial here.

The orchestration model

The planning, structure and assessment nodes are ideally suited to a text-based LLM that can reason and produce structured outputs. The most straightforward option here is to go with a large, versatile model like GPT4o-mini, which has the benefit of excellent support for JSON output from a Pydantic schema.

With the help of some LangChain functionality, we can make a class to call such a model.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class StructuredOpenAICaller:
    # model name referenced in __init__ below
    MODEL_NAME = "gpt-4o-mini"

    def __init__(
        self, api_key, system_prompt, output_model, temperature=0, max_tokens=1000
    ):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.output_model = output_model
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                ("human", "{query}"),
            ]
        )
        structured_llm = self.llm.with_structured_output(self.output_model)
        chain = prompt | structured_llm

        return chain

    def call(self, query):
        return self.chain.invoke({"query": query})

To set this up, we supply a system prompt and an output model (see here for some examples of these) and then we can just use the call method with an input string to get a response that conforms to the structure of the output model we specified. With the code set up like this we'd need to make a new instance of StructuredOpenAICaller for every different system prompt and output model used in the agent. I personally prefer this because it keeps track of the different models being used, but as the agent becomes more complex the class could be modified with another method to directly update the system prompt and output model in a single instance.
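As a usage sketch (the prompt object and output model here are illustrative placeholders, not the ones defined in the repo):

from pydantic import BaseModel

class AnswerAssessment(BaseModel):
    question_answered: bool
    reasoning: str

# AssessmentPrompt is a hypothetical prompt object exposing a system_template attribute
assessor = StructuredOpenAICaller(
    api_key="sk-...",
    system_prompt=AssessmentPrompt,
    output_model=AnswerAssessment,
)
assessment = assessor.call("Question: ...\nResults: ...")
print(assessment.question_answered, assessment.reasoning)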

Can we do this with local models too? On Apple Silicon, we can use the MLX library and MLX community on Hugging Face to easily experiment with open source models like Llama3.2. LangChain also has support for MLX integration, so we can follow the structure of the class above to set up a local model.

from typing import Any

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms.mlx_pipeline import MLXPipeline
from langchain_community.chat_models.mlx import ChatMLX

class StructuredLlamaCaller:
    MODEL_PATH = "mlx-community/Llama-3.2-3B-Instruct-4bit"

    def __init__(
        self,
        system_prompt: Any,
        output_model: Any,
        temperature: float = 0,
        max_tokens: int = 1000,
    ) -> None:
        self.system_prompt = system_prompt
        # this is the name of the Pydantic model that defines
        # the structure we want to output
        self.output_model = output_model
        self.loaded_model = MLXPipeline.from_model_id(
            self.MODEL_PATH,
            pipeline_kwargs={"max_tokens": max_tokens, "temp": temperature, "do_sample": False},
        )
        self.llm = ChatMLX(llm=self.loaded_model)
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.chain = self._set_up_chain()

    def _set_up_chain(self) -> Any:
        # Set up a parser that validates the output against the Pydantic model
        parser = PydanticOutputParser(pydantic_object=self.output_model)

        # Prompt, with the parser's format instructions baked in
        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    self.system_prompt.system_template,
                ),
                ("human", "{query}"),
            ]
        ).partial(format_instructions=parser.get_format_instructions())

        chain = prompt | self.llm | parser
        return chain

    def call(self, query: str) -> Any:
        return self.chain.invoke({"query": query})

There are a few interesting points here. For a start, we can just download the weights and config for Llama3.2 as we would any other Hugging Face model; under the hood they are then loaded into MLX using the MLXPipeline tool from LangChain. When the models are first downloaded they are automatically placed in the Hugging Face cache. Sometimes it's desirable to list the models and their cache locations, for example if you want to copy a model to a new environment. The scan_cache_dir utility helps here, and the function below wraps it to produce a convenient summary of what's in the cache.

import pandas as pd
from huggingface_hub import scan_cache_dir

def fetch_downloaded_model_details():
    hf_cache_info = scan_cache_dir()

    repo_paths = []
    size_on_disk = []
    repo_ids = []

    for repo in sorted(
        hf_cache_info.repos, key=lambda repo: repo.repo_path
    ):
        repo_paths.append(str(repo.repo_path))
        size_on_disk.append(repo.size_on_disk)
        repo_ids.append(repo.repo_id)

    repos_df = pd.DataFrame({
        "local_path": repo_paths,
        "size_on_disk": size_on_disk,
        "model_name": repo_ids,
    })

    repos_df.set_index("model_name", inplace=True)
    return repos_df.to_dict(orient="index")
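A quick usage sketch, printing what is currently cached and how much disk space each model takes:

downloaded = fetch_downloaded_model_details()
for model_name, details in downloaded.items():
    print(f"{model_name}: {details['size_on_disk'] / 1e9:.2f} GB at {details['local_path']}")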

Llama3.2 does not have built-in support for structured output like GPT4o-mini, so we need to use the prompt to force it to generate JSON. LangChain's PydanticOutputParser can help, although it is also possible to implement your own version of this as shown here.

In my experience, the version of Llama that I'm using here, namely Llama-3.2-3B-Instruct-4bit, is not reliable for structured output beyond the simplest schemas. It's reasonably good at the "plan generation" stage of our agent given a prompt with a few examples, but even with the help of the instructions provided by PydanticOutputParser, it often fails to turn that plan into JSON. Larger and/or less quantized versions of Llama will likely be better, but they would run into RAM issues if run alongside the other models in our agent. Therefore, going forward in this project, the orchestration model is set to GPT4o-mini.

A model for general vision: Qwen2-VL

To be able to answer questions like "What's going on in this image?" or "What city is this?", we need a multimodal LLM. Arguably Florence2 in image captioning mode might be able to give good responses to this type of question, but it's not really designed for conversational output.

The field of multimodal models small enough to run on a laptop is still in its infancy (a recently compiled list can be found here), but the Qwen2-VL series from Alibaba is a promising development. Furthermore, we can make use of MLX-VLM, an extension of MLX specifically designed for tuning and inference of vision models, to set up one of these models within our agent framework.

from mlx_vlm import load, apply_chat_template, generate

class QwenCaller:
    MODEL_PATH = "mlx-community/Qwen2-VL-2B-Instruct-4bit"

    def __init__(self, max_tokens=1000, temperature=0):
        self.model, self.processor = load(self.MODEL_PATH)
        self.config = self.model.config
        self.max_tokens = max_tokens
        self.temperature = temperature

    def call(self, query, image):
        messages = [
            {
                "role": "system",
                "content": ImageInterpretationPrompt.system_template,
            },
            {"role": "user", "content": query},
        ]
        prompt = apply_chat_template(self.processor, self.config, messages)
        output = generate(
            self.model,
            self.processor,
            image,
            prompt,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        return output

This class will load the smallest version of Qwen2-VL and then call it with an input image and prompt to get a textual response. For more detail about the functionality of this model and others that could be used in the same way, check out this list of examples on the MLX-VLM github page. Qwen2-VL is also apparently capable of generating bounding boxes and object pointing coordinates, so this capability could also be explored and compared with Florence2.
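A quick usage sketch, assuming a PIL image loaded from a local file (the path is illustrative):

from PIL import Image

image = Image.open("data/test_photo.jpg")  # hypothetical local path
qwen = QwenCaller()
answer = qwen.call("What city is shown in this photo?", image)
print(answer)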

Of course GPT-4o-mini also has vision capabilities and is likely more reliable than smaller local models. Therefore when building these sorts of applications it’s useful to add the ability to call a cloud based alternative, if anything just as a backup in case one of the local models fails. Note that input images must be converted to base64 before they can be sent to the model, but once that’s done we can also use the LangChain framework as shown below.

import base64
from io import BytesIO

from PIL import Image
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

def convert_PIL_to_base64(image: Image, format="jpeg"):
    buffer = BytesIO()
    # Save the image to this buffer in the specified format
    image.save(buffer, format=format)
    # Get binary data from the buffer
    image_bytes = buffer.getvalue()
    # Encode binary data to Base64
    base64_encoded = base64.b64encode(image_bytes)
    # Convert Base64 bytes to string (optional)
    return base64_encoded.decode("utf-8")

class OpenAIVisionCaller:
    MODEL_NAME = "gpt-4o-mini"

    def __init__(self, api_key, system_prompt, temperature=0, max_tokens=1000):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                (
                    "user",
                    [
                        {"type": "text", "text": "{query}"},
                        {
                            "type": "image_url",
                            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
                        },
                    ],
                ),
            ]
        )

        chain = prompt | self.llm | StrOutputParser()
        return chain

    def call(self, query, image):
        base64image = convert_PIL_to_base64(image)
        return self.chain.invoke({"query": query, "image_data": base64image})

A specialized model for vision: Florence2

Florence2 is seen as a specialist model in the context of our agent because, while it has many capabilities, its inputs must be selected from a list of predefined task prompts. Of course the model could be fine tuned to accept new prompts, but for our purposes the version downloaded directly from Hugging Face works well. The beauty of this model is that it uses a single training process and set of weights, yet achieves high performance in multiple image tasks that previously would have demanded their own models. The key to this success lies in its large and carefully curated training dataset, FLD-5B. To learn more about the dataset, model and training I recommend this excellent article.

In our context, we use the orchestration model to turn the query into a series of Florence task prompts, which we then call in a sequence. The options available to us include captioning, object detection, phrase grounding, OCR and segmentation. For some of these options (i.e. phrase grounding and region to segmentation) an input phrase is needed, so the orchestration model generates that too. In contrast, tasks like captioning need only the image. There are many use cases for Florence2, which are explored in code here. We restrict ourselves to object detection, phrase grounding, captioning and OCR, though it would be straightforward to add more by updating the prompts associated with plan generation and structuring.

Florence2 appears to be supported by the MLX-VLM package, but at the time of writing I couldn’t find any examples of its use and so opted for an approach that uses Hugging Face transformers as shown below.

from typing import Any, Optional
from unittest.mock import patch

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# fixed_get_imports is a small helper (defined in the project repo) that patches
# transformers' dynamic import check so Florence2 loads without flash_attn installed

def get_device_type():
    if torch.cuda.is_available():
        return "cuda"
    else:
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            return "mps"
        else:
            return "cpu"

class FlorenceCaller:
    MODEL_PATH: str = "microsoft/Florence-2-base-ft"
    # See https://huggingface.co/microsoft/Florence-2-base-ft for other modes
    # for Florence2
    TASK_DICT: dict[str, str] = {
        "general object detection": "<OD>",
        "specific object detection": "<CAPTION_TO_PHRASE_GROUNDING>",
        "image captioning": "<MORE_DETAILED_CAPTION>",
        "OCR": "<OCR_WITH_REGION>",
    }

    def __init__(self) -> None:
        self.device: str = get_device_type()  # 'cuda', 'mps' or 'cpu'

        with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
            self.model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
                self.MODEL_PATH, trust_remote_code=True
            )
            self.processor: AutoProcessor = AutoProcessor.from_pretrained(
                self.MODEL_PATH, trust_remote_code=True
            )
        self.model.to(self.device)

    def translate_task(self, task_name: str) -> str:
        return self.TASK_DICT.get(task_name, "<DETAILED_CAPTION>")

    def call(
        self, task_prompt: str, image: Any, text_input: Optional[str] = None
    ) -> Any:
        # Get the corresponding task code for the given prompt
        task_code: str = self.translate_task(task_prompt)

        # Prevent text_input for tasks that do not require it
        if task_code in [
            "<OD>",
            "<MORE_DETAILED_CAPTION>",
            "<OCR_WITH_REGION>",
            "<DETAILED_CAPTION>",
        ]:
            text_input = None

        # Construct the prompt based on whether text_input is provided
        prompt: str = task_code if text_input is None else task_code + text_input

        # Preprocess inputs for the model
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to(
            self.device
        )

        # Generate predictions using the model
        generated_ids = self.model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            early_stopping=False,
            do_sample=False,
            num_beams=3,
        )

        # Decode and process generated output
        generated_text: str = self.processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]

        parsed_answer: dict[str, Any] = self.processor.post_process_generation(
            generated_text, task=task_code, image_size=(image.width, image.height)
        )

        return parsed_answer[task_code]

On Apple Silicon, the device becomes mps and the latency of these model calls is tolerable. This code should also work on GPU and CPU, though this has not been tested.
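A quick usage sketch in phrase grounding mode, again with an illustrative image path:

from PIL import Image

florence = FlorenceCaller()
image = Image.open("data/test_photo.jpg")  # hypothetical local path
# phrase grounding: find regions matching the provided phrase
detections = florence.call("specific object detection", image, text_input="skyscraper")
print(detections)  # e.g. {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["skyscraper", ...]}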

Another example and some limitations

Let’s run through another example to see the agent outputs from each step. To call the agent on an input query and image we can use the Agent.invoke method, which follows the same process as described in my previous article to add each node output to a list of results in addition to saving outputs in a LangGraph InMemoryStore object.

We'll be using the following image, which presents an interesting challenge if we ask a tricky question like "Are there trees in this image? If so, find them and describe what they are doing".

Testing image for this section. Photo by Hannah Lim on Unsplash

from image_agent.agent.Agent import Agent
from image_agent.utils import load_secrets

secrets = load_secrets()

# use GPT4 for general vision mode
full_agent_gpt_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="gpt"
)

# use local model for general vision
full_agent_qwen_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="local"
)
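The exact signature of Agent.invoke is defined in the repo; the call pattern is roughly as follows (the argument names and image path are assumptions):

from PIL import Image

query = "Are there trees in this image? If so, find them and describe what they are doing"
image = Image.open("data/dogs.jpg")  # hypothetical local path

# argument names here are assumptions; see the repo for the actual invoke signature
result = full_agent_gpt_vision.invoke(query, image)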

In an ideal world the answer is straightforward: There are no trees.

However this turns out to be a difficult question for the agent, and it's interesting to compare the responses when it uses GPT-4o-mini vs. Qwen2 as the general vision model.

When we call full_agent_qwen_vision with this query, we get a bad result: both Qwen2 and Florence2 fall for the trick and report that trees are present (interestingly, if we change "trees" to "dogs", we get the right answer).

Plan:
Call generalist vision with the question 'Are there trees in this image? If so, what are they doing?'. Then call specialist vision in object specific mode with the phrase 'cat'.

Plan_structure:
{
  "1": {"tool_name": "general_vision", "tool_mode": "conversation", "tool_input": "Are there trees in this image? If so, what are they doing?"},
  "2": {"tool_name": "special_vision", "tool_mode": "specific object detection", "tool_input": "tree"}
}

Plan output:
[
  {1: 'Yes, there are trees in the image. They appear to be part of a tree line against a pink background.'},
  {2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]

Assessment:
The result adequately answers the user's question by confirming the presence of trees in the image and providing a description of their appearance and context. The output from both the generalist and specialist vision tools is consistent and informative.

Qwen2 seems to blindly follow the prompt's hint that there might be trees present. Florence2 also fails here, reporting a bounding box when it should not.

If asked "Are there trees in this image? If so, find them and describe what they're doing", both Qwen2 and Florence2 fall for the trick. Image generated by the author.

If asked "Are there dogs in this image? If so, find them and describe what they're doing", both the Qwen and GPT-based agents will produce the correct answer. Image generated by the author.

If we call full_agent_gpt_vision with the same query, GPT4o-mini doesn't fall for the trick, but the call to Florence2 hasn't changed so it still fails. We then see the query assessment step in action because the generalist and specialist models have produced conflicting results.

Node : general_vision
Task : plan_output
[
  {1: 'There are no trees in this image. It features a group of dogs sitting in front of a pink wall.'}
]

Node : special_vision
Task : plan_output
[
  {2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]

Node : assessment
Task : answer_assessment
The result contains conflicting information.
The first part states that there are no trees in the image, while the second part provides a bounding box and label indicating that a tree is present.
This inconsistency means the user's question is not adequately answered.

The agent then tries several times to restructure the plan, but Florence2 insists on producing a bounding box for "tree", which the answer assessment node always catches as inconsistent. This is a better result than the Qwen2 agent, but it points to a broader issue of false positives with Florence2. This could be addressed by having the routing node evaluate the plan after every step and then only call Florence2 if absolutely necessary.

With the basic building blocks in place, this system is ripe for experimentation, iteration and improvement and I may continue to add to the repo over the coming weeks. For now though, this article is long enough!

Thanks for making it to the end and I hope the project here prompts some inspiration for your own projects! The orchestration of multiple specialist models within agent frameworks is a powerful and increasingly accessible approach to putting LLMs to work on complex tasks. Clearly there is still a lot of room for improvement, and I for one look forward to seeing how ideas in this field develop over the coming year.

