Florence-2: Mastering Multiple Vision Tasks with a Single VLM Model
A Guided Exploration of Florence-2’s Zero-Shot Capabilities: Captioning, Object Detection, Segmentation and OCR.
Image annotations by Author. Original image from Pexels.
Introduction
In recent years, the field of computer vision has witnessed the rise of foundation models that enable image annotation without the need for training custom models. We’ve seen models like CLIP [2] for classification, GroundingDINO [3] for object detection, and SAM [4] for segmentation — each excelling in its domain. But what if we had a single model capable of handling all these tasks together?
In this tutorial, we introduce Florence-2 [1], a novel, open-source Vision-Language Model (VLM) designed to handle a diverse range of vision and multimodal tasks, including captioning, object detection, segmentation, and OCR.
Accompanied by a Colab notebook, we’ll explore Florence-2’s zero-shot capabilities to annotate an image of an old camera.
Florence-2
Background
Florence-2 was released by Microsoft in June 2024. It was designed to perform multiple vision tasks within a single model. It is open source and available on Hugging Face under the permissive MIT license.
Despite its relatively small size, with versions of 0.23B and 0.77B parameters, Florence-2 achieves state-of-the-art (SOTA) performance. Its compact size enables efficient deployment on devices with limited computing resources while ensuring fast inference speeds.
The model was pre-trained on an enormous, high quality dataset called FLD-5B, consisting of 5.4B annotations on 126 million images. This allows Florence-2 to excel in zero-shot performance on many tasks without requiring additional training.
The original open-source weights of Florence-2 support a range of tasks, each invoked with a dedicated task prompt; the most useful ones are demonstrated in the sections below.
Additional unsupported tasks can be added by fine-tuning the model.
Task Format
Inspired by Large Language Models (LLMs), Florence-2 was designed as a sequence-to-sequence model. It takes an image and text instructions as inputs, and outputs text results. The input or output text may represent plain text or a region in the image. The region format varies depending on the task:
Bounding boxes: '<X1><Y1><X2><Y2>' for object detection tasks. The tokens represent the coordinates of the top-left and bottom-right corners of the box.
Quad boxes: '<X1><Y1><X2><Y2><X3><Y3><X4><Y4>' for text detection, using the coordinates of the four corners that enclose the text.
Polygons: '<X1><Y1>...<Xn><Yn>' for segmentation tasks, where the coordinates represent the vertices of the polygon in clockwise order.
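To make the bounding box format concrete, the snippet below sketches how a pixel-space box might be converted into these location tokens. The helper name and the assumption of 1,000 quantization bins normalized by the image size are mine for illustration; the accompanying notebook ships its own conversion utilities.

# A minimal sketch of the kind of conversion performed by the notebook's helper
# (the exact implementation may differ). It assumes location tokens are pixel
# coordinates normalized by the image size and quantized to integer bins in [0, 999].
def bbox_to_loc_tokens(bbox, image_width, image_height):
    """bbox is (x1, y1, x2, y2) in pixels; returns a Florence-2 location string."""
    x1, y1, x2, y2 = bbox
    scaled = [
        min(999, int(x1 / image_width * 1000)),
        min(999, int(y1 / image_height * 1000)),
        min(999, int(x2 / image_width * 1000)),
        min(999, int(y2 / image_height * 1000)),
    ]
    return ''.join(f'<loc_{v}>' for v in scaled)

# Example: a box on a 1000x1000 image
print(bbox_to_loc_tokens((335, 412, 653, 832), 1000, 1000))
# '<loc_335><loc_412><loc_653><loc_832>'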
Architecture
Florence-2 is built using a standard encoder-decoder transformer architecture. Here’s how the process works:
The input image is embedded by a DaViT vision encoder [5].
The text prompt is embedded using BART [6], utilizing an extended tokenizer and word embedding layer.
Both the vision and text embeddings are concatenated.
These concatenated embeddings are processed by a transformer-based multi-modal encoder-decoder to generate the response.
During training, the model minimizes the cross-entropy loss, similar to standard language models.

An illustration of Florence-2's architecture. Source: link.
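To make this data flow concrete, here is a schematic sketch of the steps above in Python. The module names (vision_encoder, text_embedder, multimodal_encoder_decoder) are placeholders for illustration and do not correspond to Florence-2's actual internal API:

import torch

def florence2_forward_sketch(image_tensor, prompt_ids,
                             vision_encoder, text_embedder,
                             multimodal_encoder_decoder):
    # 1. A DaViT-style vision encoder turns the image into a sequence of visual tokens
    visual_tokens = vision_encoder(image_tensor)      # (batch, n_image_tokens, dim)
    # 2. A BART-style embedding layer turns prompt token ids into text embeddings
    text_tokens = text_embedder(prompt_ids)           # (batch, n_text_tokens, dim)
    # 3. Concatenate visual and textual tokens into one multimodal sequence
    multimodal_input = torch.cat([visual_tokens, text_tokens], dim=1)
    # 4. A standard encoder-decoder transformer generates the answer autoregressively
    return multimodal_encoder_decoder.generate(inputs_embeds=multimodal_input)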
Code implementation
Loading Florence-2 model and a sample image
After installing and importing the necessary libraries (as demonstrated in the accompanying Colab notebook), we begin by loading the Florence-2 model, processor and the input image of a camera:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model and processor:
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image:
image = Image.open(img_path)
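If GPU memory is limited, the smaller 0.23B checkpoint can be loaded instead ('microsoft/Florence-2-base'; fine-tuned '-ft' variants are also published on Hugging Face). The sketch below shows one way to do that with a CPU fallback; the rest of the tutorial assumes the large model on a GPU.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# The smaller checkpoint trades some accuracy for a lighter memory footprint.
model_id = 'microsoft/Florence-2-base'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype='auto'
).eval().to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)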
Auxiliary Functions
In this tutorial, we will use several auxiliary functions. The most important is the run_example core function, which generates a response from the Florence-2 model.
The run_example function combines the task prompt with any additional text input (if provided) into a single prompt. Using the processor, it converts the prompt and image into model inputs (token IDs and pixel values). The magic happens during the model.generate step, where the model's response is generated. Here's a breakdown of some key parameters:
max_new_tokens=1024: Sets the maximum length of the output, allowing for detailed responses.
do_sample=False: Ensures a deterministic response.
num_beams=3: Implements beam search with the top 3 most likely tokens at each step, exploring multiple potential sequences to find the best overall output.
early_stopping=False: Ensures beam search continues until all beams reach the maximum length or an end-of-sequence token is generated.
Lastly, the model’s output is decoded and post-processed with processor.batch_decode and processor.post_process_generation to produce the final text response, which is returned by the run_example function.
def run_example(image, task_prompt, text_input=''):
    # Combine the task prompt with the optional free-text input
    prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda', torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
        early_stopping=False,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer
Additionally, we utilize auxiliary functions to visualize the results (draw_bbox, draw_ocr_bboxes and draw_polygons) and to handle the conversion between bounding box formats (convert_bbox_to_florence-2 and convert_florence-2_to_bbox). These can be explored in the attached Colab notebook.
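For reference, a minimal stand-in for draw_bbox using matplotlib is sketched below. The call signature mirrors how it is used later in this post, but the body is an illustrative reimplementation, not the notebook's exact code, and it assumes the detection result is a dict with parallel 'bboxes' and 'labels' lists in pixel coordinates.

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_bbox(image, prediction):
    """Draw detection results on the image.

    `prediction` is assumed to look like:
    {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['camera', ...]}
    with coordinates in pixels.
    """
    fig, ax = plt.subplots()
    ax.imshow(image)
    for bbox, label in zip(prediction['bboxes'], prediction['labels']):
        x1, y1, x2, y2 = bbox
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                 linewidth=2, edgecolor='red', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1, label, color='white',
                bbox=dict(facecolor='red', alpha=0.6))
    ax.axis('off')
    plt.show()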
Tasks
Florence-2 can perform a variety of visual tasks. Let’s explore some of its capabilities, starting with image captioning.
1. Caption Generation Related Tasks:
1.1 Generate Captions
Florence-2 can generate image captions at various levels of detail, using the '<CAPTION>', '<DETAILED_CAPTION>' or '<MORE_DETAILED_CAPTION>' task prompts.
print(run_example(image, task_prompt='<CAPTION>'))
# Output: 'A black camera sitting on top of a wooden table.'

print(run_example(image, task_prompt='<DETAILED_CAPTION>'))
# Output: 'The image shows a black Kodak V35 35mm film camera sitting on top of a wooden table with a blurred background.'

print(run_example(image, task_prompt='<MORE_DETAILED_CAPTION>'))
# Output: 'The image is a close-up of a Kodak VR35 digital camera. The camera is black in color and has the Kodak logo on the top left corner. The body of the camera is made of wood and has a textured grip for easy handling. The lens is in the center of the body and is surrounded by a gold-colored ring. On the top right corner, there is a small LCD screen and a flash. The background is blurred, but it appears to be a wooded area with trees and greenery.'
The model accurately describes the image and its surroundings. It even identifies the camera's brand and model, demonstrating its OCR ability. However, the '<MORE_DETAILED_CAPTION>' output contains minor inconsistencies, which is expected from a zero-shot model.
1.2 Generate Caption for a Given Bounding Box
Florence-2 can generate captions for specific regions of an image defined by bounding boxes. For this, it takes the bounding box location as input. You can extract the category with '<REGION_TO_CATEGORY>' or a description with '<REGION_TO_DESCRIPTION>'.
For your convenience, I added a widget to the Colab notebook that enables you to draw a bounding box on the image, and code to convert it to Florence-2 format.
task_prompt = '<REGION_TO_CATEGORY>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera lens'

task_prompt = '<REGION_TO_DESCRIPTION>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera'
In this case, '<REGION_TO_CATEGORY>' identified the lens, while '<REGION_TO_DESCRIPTION>' returned a less specific description. However, this behavior may vary with different images.
2. Object Detection Related Tasks:
2.1 Generate Bounding Boxes and Text for Objects
Florence-2 can identify densely packed regions in the image and provide their bounding box coordinates with related labels or captions. To extract bounding boxes with labels, use the '<OD>' task prompt:
results = run_example(image, task_prompt='<OD>')
draw_bbox(image, results['<OD>'])
To extract bounding boxes with captions, use the '<DENSE_REGION_CAPTION>' task prompt:
results = run_example(image, task_prompt='<DENSE_REGION_CAPTION>')
draw_bbox(image, results['<DENSE_REGION_CAPTION>'])

The image on the left shows the results of the '<OD>' task prompt, while the image on the right demonstrates '<DENSE_REGION_CAPTION>'.
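If you need the raw detections rather than a plot, the post-processed result can be consumed directly. Assuming the '<OD>' entry is a dict with parallel 'bboxes' and 'labels' lists in pixel coordinates (worth verifying against your own output), a simple loop prints each detection:

results = run_example(image, task_prompt='<OD>')
detections = results['<OD>']  # assumed format: {'bboxes': [...], 'labels': [...]}

for bbox, label in zip(detections['bboxes'], detections['labels']):
    x1, y1, x2, y2 = bbox
    print(f"{label}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")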
2.2 Text Grounded Object Detection
Florence-2 can also perform text-grounded object detection. By providing specific object names or descriptions as input, Florence-2 detects bounding boxes around the specified objects.
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input="lens. camera. table. logo. flash.")
draw_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

CAPTION_TO_PHRASE_GROUNDING task with the text input: "lens. camera. table. logo. flash."
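Grounded boxes like these are convenient for cropping out the matched regions. A short sketch using PIL's crop, again assuming the same 'bboxes'/'labels' output structure, might look like this:

# Crop each grounded region and save it to disk for inspection.
grounding = results['<CAPTION_TO_PHRASE_GROUNDING>']  # assumed: {'bboxes': [...], 'labels': [...]}

for i, (bbox, label) in enumerate(zip(grounding['bboxes'], grounding['labels'])):
    crop = image.crop(tuple(bbox))  # (x1, y1, x2, y2) in pixels
    crop.save(f"{label.replace(' ', '_')}_{i}.png")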
3. Segmentation Related Tasks:
Florence-2 can also generate segmentation polygons grounded by text ('<REFERRING_EXPRESSION_SEGMENTATION>') or by bounding boxes ('<REGION_TO_SEGMENTATION>'):
results = run_example(image, task_prompt='<REFERRING_EXPRESSION_SEGMENTATION>', text_input="camera")
draw_polygons(image, results['<REFERRING_EXPRESSION_SEGMENTATION>'])

results = run_example(image, task_prompt='<REGION_TO_SEGMENTATION>', text_input="<loc_345><loc_417><loc_648><loc_845>")
draw_polygons(image, results['<REGION_TO_SEGMENTATION>'])

The image on the left shows the results of the REFERRING_EXPRESSION_SEGMENTATION task with 'camera' as the text input. The image on the right demonstrates the REGION_TO_SEGMENTATION task with a bounding box around the lens provided as input.
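If a downstream step needs a binary mask rather than polygons, the vertices can be rasterized with PIL's ImageDraw. The sketch below assumes the segmentation result holds a 'polygons' entry containing lists of flat [x1, y1, x2, y2, ...] vertex sequences in pixel coordinates; verify this against your own output before relying on it.

import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(seg_result, image_size):
    """Rasterize segmentation polygons into a single binary mask."""
    mask = Image.new('L', image_size, 0)
    drawer = ImageDraw.Draw(mask)
    for polygon_group in seg_result['polygons']:
        for polygon in polygon_group:
            # polygon is assumed to be a flat list: [x1, y1, x2, y2, ...]
            points = [(polygon[i], polygon[i + 1]) for i in range(0, len(polygon) - 1, 2)]
            if len(points) >= 3:
                drawer.polygon(points, outline=1, fill=1)
    return np.array(mask, dtype=bool)

mask = polygons_to_mask(results['<REGION_TO_SEGMENTATION>'], image.size)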
4. OCR Related Tasks:
Florence-2 demonstrates strong OCR capabilities. It can extract text from an image with the '<OCR>' task prompt, and extract both the text and its location with '<OCR_WITH_REGION>':
task_prompt = '<OCR_WITH_REGION>'
results = run_example(image, task_prompt)
draw_ocr_bboxes(image, results['<OCR_WITH_REGION>'])
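For plain text only, the '<OCR>' prompt returns a single string. For the region-aware variant, the post-processed result typically pairs each recognized string with an eight-value quad box; assuming parallel 'quad_boxes' and 'labels' lists, the output can be printed as follows:

# Plain text extraction
print(run_example(image, task_prompt='<OCR>'))

# Text with locations: each label is paired with a quad box
# [x1, y1, x2, y2, x3, y3, x4, y4] in pixel coordinates (assumed format).
ocr_results = run_example(image, task_prompt='<OCR_WITH_REGION>')['<OCR_WITH_REGION>']
for quad_box, text in zip(ocr_results['quad_boxes'], ocr_results['labels']):
    print(text, [round(v) for v in quad_box])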
Concluding Remarks
Florence-2 is a versatile Vision-Language Model (VLM), capable of handling multiple vision tasks within a single model. Its zero-shot capabilities are impressive across diverse tasks such as image captioning, object detection, segmentation and OCR. While Florence-2 performs well out-of-the-box, additional fine-tuning can further adapt the model to new tasks or improve its performance on unique, custom datasets.
Thank you for reading!
Congratulations on making it all the way here. Click 👍 to show your appreciation and raise the algorithm's self-esteem 🤓
Want to learn more?
Explore additional articles I've written
Subscribe to get notified when I publish articles
Follow me on LinkedIn
The full code is available as a Colab notebook (see [0] in the references).
References
[0] Code on Colab Notebook: link
[1] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
[2] CLIP: Learning Transferable Visual Models From Natural Language Supervision.
[3] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
[4] SAM2: Segment Anything in Images and Videos.
[5] DaViT: Dual Attention Vision Transformers.
[6] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.