Segmenting Water in Satellite Images Using Paligemma
Some insights on using Google’s latest Vision Language Model
Hutt Lagoon, Australia. Depending on the season, time of day, and cloud coverage, this lake changes from red to pink or purple. Source: Google Maps.
Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images, and audio. Some examples include CLIP and DALL-E from OpenAI, both released in 2021. CLIP understands images and text jointly, allowing it to perform tasks like zero-shot image classification. DALL-E, on the other hand, generates images from textual descriptions, allowing the automation and enhancement of creative processes in gaming, advertising, and literature, among other sectors.
Visual language models (VLMs) are a special case of multimodal models. VLMs generate language based on visual inputs. One prominent example is Paligemma, which Google introduced in May 2024. Paligemma can be used for Visual Question Answering, object detection, and image segmentation.
Some blog posts explore the capabilities of Paligemma in object detection, such as this excellent read from Roboflow:
Fine-tune PaliGemma for Object Detection with Custom Data
However, at the time I wrote this blog, the existing documentation on preparing data to use Paligemma for object segmentation was vague. That is why I wanted to evaluate whether it is easy to use Paligemma for this task. Here, I share my experience.
Brief introduction of Paligemma
Before going into detail on the use case, let’s briefly revisit the inner workings of Paligemma.
Architecture of Paligemma2. Source: https://arxiv.org/abs/2412.03555
Paligemma combines a SigLIP-So400m vision encoder with a Gemma language model to process images and text (see figure above). In the new version of Paligemma, released in December 2024, the vision encoder can process images at three different resolutions: 224px, 448px, or 896px. The vision encoder takes an image and outputs a sequence of image tokens, which are linearly projected and concatenated with the input text tokens. This combined sequence is then processed by the Gemma language model, which outputs text tokens. The Gemma model comes in different sizes, from 2B to 27B parameters.
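To get a feel for how the input resolution affects the length of the image-token sequence, here is a small sketch. It assumes the vision encoder cuts the image into non-overlapping 14×14-pixel patches and emits one token per patch. That patch size is my assumption and is not stated above, and the names PATCH_SIZE and num_image_tokens are purely illustrative.

```python
# Assumption: one image token per non-overlapping 14x14-pixel patch.
PATCH_SIZE = 14


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens for a square image of side `resolution`."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2


for res in (224, 448, 896):
    print(f"{res}px -> {num_image_tokens(res)} image tokens")
```

Under this assumption, doubling the resolution quadruples the number of image tokens the language model has to process, which is worth keeping in mind when choosing a checkpoint for a small GPU.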
An example of model output is shown in the following figure.
Example of an object segmentation output. Source: https://arxiv.org/abs/2412.03555
The Paligemma model was trained on various datasets such as WebLI, OpenImages, WIT, and others (see this Kaggle blog for more details). This means that Paligemma can identify objects without fine-tuning. However, such abilities are limited. That’s why Google recommends fine-tuning Paligemma in domain-specific use cases.
Input format
To fine-tune Paligemma, the input data needs to be in JSONL format. A dataset in JSONL format has each line as a separate JSON object, like a list of individual records. Each JSON object contains the following keys:
Image: The image’s name.
Prefix: This specifies the task you want the model to perform.
Suffix: This provides the ground truth the model learns to make predictions.
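As a minimal sketch of what such a file looks like in practice, the snippet below writes two records to a JSONL file and reads them back using only Python’s standard library. The file name train.jsonl, the helper names, and the record contents are placeholders for illustration.

```python
import json
import os
import tempfile

# Placeholder records: each one has the three keys described above.
records = [
    {"image": "some_filename.png",
     "prefix": "caption en",
     "suffix": "This is an image of a big, white boat traveling in the ocean."},
    {"image": "another_filename.jpg",
     "prefix": "How many people are in the image?",
     "suffix": "ten"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(len(loaded), "records with keys", sorted(loaded[0]))
```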
Depending on the task, you must change the JSON object’s prefix and suffix accordingly. Here are some examples:
Image captioning:
{"image": "some_filename.png",
"prefix": "caption en",
"suffix": "This is an image of a big, white boat traveling in the ocean."
}
(The prefix "caption en" indicates that the model should generate an English caption for the image.)
Question answering:
{"image": "another_filename.jpg",
"prefix": "How many people are in the image?",
"suffix": "ten"
}
Object detection:
{"image": "filename.jpeg",
"prefix": "detect airplane",
"suffix": "<loc0055><loc0115><loc1023><loc1023> airplane"
}
(The four <loc> tokens are the bounding box coordinates.)
If you have several categories to detect, separate them with a semicolon (;) in both the prefix and the suffix.
A complete and clear explanation of how to prepare the data for object detection in Paligemma can be found in this Roboflow post.
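To make the detection format more concrete, here is a hedged sketch that turns pixel bounding boxes into a prefix and suffix for several categories. Two things are my assumptions rather than something stated above: the coordinates are written in the order y_min, x_min, y_max, x_max, and each coordinate is binned by scaling it to 1024 bins and clamping to 1023. The helper names and the example boxes are made up for illustration.

```python
def to_loc_token(value: float, size: int) -> str:
    """Map a pixel coordinate to one of 1024 bins, written as <locNNNN>."""
    # Assumption: scale to [0, 1024) and clamp into the valid range 0..1023.
    bin_index = min(max(int(value / size * 1024), 0), 1023)
    return f"<loc{bin_index:04d}>"


def detection_suffix(boxes, width, height):
    """Build a suffix from (x_min, y_min, x_max, y_max, label) pixel boxes."""
    parts = []
    for x_min, y_min, x_max, y_max, label in boxes:
        # Assumption: y comes before x in the token order.
        tokens = (
            to_loc_token(y_min, height) + to_loc_token(x_min, width)
            + to_loc_token(y_max, height) + to_loc_token(x_max, width)
        )
        parts.append(f"{tokens} {label}")
    # Several objects are separated by a semicolon.
    return " ; ".join(parts)


# Hypothetical boxes on a 400x300 image.
boxes = [(50, 20, 200, 100, "airplane"), (0, 150, 400, 300, "boat")]
entry = {
    "image": "filename.jpeg",
    "prefix": "detect " + " ; ".join(["airplane", "boat"]),
    "suffix": detection_suffix(boxes, width=400, height=300),
}
print(entry["prefix"])
print(entry["suffix"])
```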
Image segmentation:
{"image": "filename.jpeg",
"prefix": "segment airplane",
"suffix": "<loc0055><loc0115><loc1023><loc1023><seg063><seg108><seg045><seg028><seg056><seg052><seg114><seg005><seg042><seg023><seg084><seg064><seg086><seg077><seg090><seg054> airplane"
}
Note that for segmentation, apart from the object’s bounding box coordinates, you need to specify 16 extra segmentation tokens representing a mask that fits within the bounding box. According to Google’s Big Vision repository, those tokens are codewords drawn from a codebook with 128 entries (<seg000>…<seg127>). How do we obtain these values? In my experience, it was challenging and frustrating to get them without proper documentation. I give more details later.
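Before worrying about how the tokens are produced, it helps to be able to check that a suffix is well formed. The sketch below splits a segmentation suffix into its four location bins, its 16 codeword indices, and the label, and raises an error if the counts or ranges are off. The function name and the regular expressions are my own; only the token format comes from the example above.

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")
SEG_RE = re.compile(r"<seg(\d{3})>")


def parse_segmentation_suffix(suffix: str):
    """Return (location bins, codeword indices, label) for one object."""
    locs = [int(v) for v in LOC_RE.findall(suffix)]
    segs = [int(v) for v in SEG_RE.findall(suffix)]
    # Whatever remains after removing all <...> tokens is the label.
    label = re.sub(r"<[^>]+>", "", suffix).strip()
    if len(locs) != 4:
        raise ValueError(f"expected 4 <loc> tokens, found {len(locs)}")
    if len(segs) != 16:
        raise ValueError(f"expected 16 <seg> tokens, found {len(segs)}")
    if any(v > 1023 for v in locs) or any(v > 127 for v in segs):
        raise ValueError("token index out of range")
    return locs, segs, label


suffix = (
    "<loc0055><loc0115><loc1023><loc1023>"
    "<seg063><seg108><seg045><seg028><seg056><seg052><seg114><seg005>"
    "<seg042><seg023><seg084><seg064><seg086><seg077><seg090><seg054> airplane"
)
locs, segs, label = parse_segmentation_suffix(suffix)
print(locs, len(segs), label)
```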
If you are interested in learning more about Paligemma, I recommend these blogs:
Welcome PaliGemma 2 – New vision language models by Google
Introducing PaliGemma: Google’s Latest Visual Language Model
Satellite images of water bodies
As mentioned above, Paligemma was trained on different datasets. Therefore, this model is expected to be good at segmenting “traditional” objects such as cars, people, or animals. But what about segmenting objects in satellite images? This question led me to explore Paligemma’s capabilities for segmenting water in satellite images.
Kaggle’s Satellite Image of Water Bodies dataset is suitable for this purpose. This dataset contains 2841 images with their corresponding masks.
Here’s an example of the water bodies dataset: The RGB image is shown on the left, while the corresponding mask appears on the right.
Some masks in this dataset were incorrect, and others needed further preprocessing. Faulty examples include masks with all values set to water, even though water covered only a small portion of the original image. Other masks did not correspond to their RGB images. In addition, when an image is rotated, the empty areas outside the rotated image appear in some masks as if they were water.
Example of a rotated mask. When reading this image in Python, the area outside the image appears as if it had water. In this case, image rotation is needed to correct this mask. Image made by the author.
Given these data limitations, I selected a sample of 164 images for which the masks did not have any of the problems mentioned above. This set of images is used to fine-tune Paligemma.
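A simple automated pre-screen can help with one of the failure modes described above, namely masks that are entirely marked as water. The sketch below is only an illustration of that idea and not the procedure used to pick the 164 images. It treats a mask as a nested list where any non-zero pixel means water, and the 0.99 threshold is an arbitrary value chosen for the example. Flagged masks would still need a visual check.

```python
def water_fraction(mask):
    """Fraction of pixels labeled as water (non-zero) in a 2D mask."""
    total = sum(len(row) for row in mask)
    water = sum(1 for row in mask for px in row if px)
    return water / total if total else 0.0


def is_suspicious(mask, max_fraction=0.99):
    """Flag masks that are (almost) entirely water for manual review."""
    # Hypothetical threshold: tune it after looking at a few examples.
    return water_fraction(mask) >= max_fraction


# Toy masks: 255 marks water, 0 marks everything else.
all_water = [[255, 255], [255, 255]]
partly_water = [[0, 255], [0, 0]]
print(is_suspicious(all_water), is_suspicious(partly_water))
```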
Preparing the JSONL dataset
As explained in the previous section, Paligemma needs entries that represent the object’s bounding box coordinates in normalized image space (<loc0000>…<loc1023>) plus 16 extra segmentation tokens, each taken from a codebook of 128 codewords (<seg000>…<seg127>). Obtaining the bounding box coordinates in the desired format was easy, thanks to Roboflow’s explanation. But how do we obtain the 16 segmentation codewords from the masks? There was no clear documentation or examples in the Big Vision repository that I could use for my use case. I naively thought that the process of creating the segmentation tokens was similar to that of making the bounding boxes. However, this led to an incorrect representation of the water masks and, in turn, to wrong predictions.
While I was writing this blog (beginning of December 2024), Google announced the second version of Paligemma. Following this event, Roboflow published a nice overview of preparing data to fine-tune Paligemma2 for different applications, including image segmentation. I used part of their code to finally obtain the correct segmentation codewords. What was my mistake? First, the masks need to be resized to a tensor of shape [None, 64, 64, 1]. Then, a pre-trained variational auto-encoder (VAE) is used to convert the annotation masks into text labels. Although the usage of a VAE model was briefly mentioned in the Big Vision repository, there is no explanation or example of how to use it.
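The resizing step can be sketched without any deep learning library. The snippet below finds the bounding box of a binary mask, crops the mask to that box, and resizes the crop to 64×64 with nearest-neighbor sampling, then adds batch and channel dimensions to reach the shape [1, 64, 64, 1]. Cropping to the bounding box before resizing and the choice of nearest-neighbor sampling are my assumptions; the encoding with the pre-trained VAE is left out because it requires the model weights. All function names are illustrative, and in practice you would probably do this with an array library rather than nested lists.

```python
def mask_bbox(mask):
    """Return (y_min, x_min, y_max, x_max) of non-zero pixels, max exclusive."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_and_resize(mask, size=64):
    """Crop to the bounding box, then nearest-neighbor resize to size x size."""
    y0, x0, y1, x1 = mask_bbox(mask)
    h, w = y1 - y0, x1 - x0
    return [
        [1 if mask[y0 + (i * h) // size][x0 + (j * w) // size] else 0
         for j in range(size)]
        for i in range(size)
    ]


# Toy 8x10 mask: a rectangle of water with one missing corner pixel.
mask = [[0] * 10 for _ in range(8)]
for y in range(2, 6):
    for x in range(3, 9):
        mask[y][x] = 1
mask[2][3] = 0

bbox = mask_bbox(mask)
resized = crop_and_resize(mask)
# Add batch and channel dimensions: shape [1, 64, 64, 1].
tensor = [[[[v] for v in row] for row in resized]]
print(bbox, len(tensor), len(tensor[0]), len(tensor[0][0]), len(tensor[0][0][0]))
```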
The workflow I use to prepare the data to fine-tune Paligemma is shown below:
Steps to convert one original mask from the filtered water bodies dataset to a JSON object. This process is repeated over the 164 images of the train set and the 21 images of the test dataset to build the JSONL dataset.
As observed, the number of steps needed to prepare the data for Paligemma is large, so I don’t share code snippets here. However, if you want to explore the code, you can visit this GitHub repository. The script convert.py has all the steps mentioned in the workflow shown above. I also added the selected images so you can play with this script immediately.
When decoding the segmentation codewords back into segmentation masks, we can see how these masks cover the water bodies in the images:
Resulting masks when decoding the segmentation codewords in the train set. Image made by the author using this Notebook.
How is Paligemma at segmenting water in satellite images?
Before fine-tuning Paligemma, I tried its segmentation capabilities on the models uploaded to Hugging Face. This platform has a demo where you can upload images and interact with different Paligemma models.
Default Paligemma model at segmenting water in satellite images.
The current version of Paligemma is generally good at segmenting water in satellite images, but it’s not perfect. Let’s see if we can improve these results!
There are two ways to fine-tune Paligemma: either through Hugging Face’s Transformers library or by using Big Vision and JAX. I went for the latter option. Big Vision provides a Colab notebook, which I modified for my use case. You can open it by going to my GitHub repository:
I used a batch size of 8 and a learning rate of 0.003. I ran the training loop twice, which translates to 158 training steps. The total running time using a T4 GPU machine was 24 minutes.
The results were not as expected. Paligemma did not produce predictions in some images, and in others, the resulting masks were far from the ground truth. I also obtained segmentation codewords with more than 16 tokens in two images.
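When inspecting results like these, it is handy to sort the raw model outputs into categories before trying to decode any masks. The sketch below counts the segmentation tokens in each output string and labels it as having no prediction, a well-formed prediction, or a malformed one. It assumes a single object per image, so exactly 16 tokens are expected. The category names and the example outputs are invented for illustration and are not actual predictions from my model.

```python
import re
from collections import Counter

SEG_RE = re.compile(r"<seg\d{3}>")


def triage(output: str) -> str:
    """Classify one model output by its number of segmentation tokens."""
    n_tokens = len(SEG_RE.findall(output))
    if n_tokens == 0:
        return "no_prediction"
    # Assumption: one object per image, hence exactly 16 tokens.
    if n_tokens == 16:
        return "ok"
    return "malformed"


box = "<loc0000><loc0000><loc1023><loc1023>"
# Hypothetical outputs: empty, well formed, and one with 18 tokens.
outputs = [
    "",
    box + "<seg001>" * 16 + " water",
    box + "<seg001>" * 18 + " water",
]
summary = Counter(triage(o) for o in outputs)
print(dict(summary))
```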
Results of the fine-tuning where there were predictions. Image made by the author.
It’s worth mentioning that I used the first Paligemma version. Perhaps the results would improve with Paligemma2 or with further tweaks to the batch size or learning rate. In any case, these experiments are out of the scope of this blog.
The demo results show that the default Paligemma model is better at segmenting water than my fine-tuned model. In my opinion, U-Net is a better architecture if the aim is to build a model specialized in segmenting objects. For more information on how to train such a model, you can read my previous blog post:
Other limitations:
I want to mention some other challenges I encountered when fine-tuning Paligemma using Big Vision and JAX.
1. Setting up different model configurations is difficult because there’s still little documentation on those parameters.
2. The first version of Paligemma has been trained to handle images of different aspect ratios resized to 224×224. Make sure to resize your input images to this size only. This will prevent exceptions from being raised.
3. When fine-tuning with Big Vision and JAX, you might run into JAX GPU-related problems. Ways to overcome this issue are:
a. Reducing the samples in your training and validation datasets.
b. Increasing the batch size from 8 to 16 or higher.
4. The fine-tuned model has a size of ~5GB. Make sure to have enough space in your Drive to store it.
Takeaway messages
Discovering a new AI model is exciting, especially in this age of multimodal algorithms transforming our society. However, working with state-of-the-art models can sometimes be challenging due to the lack of available documentation. Therefore, the launch of a new AI model should be accompanied by comprehensive documentation to ensure its smooth and widespread adoption, especially among professionals who are still inexperienced in this area.
Despite the difficulties I encountered fine-tuning Paligemma, the current pre-trained models are powerful at doing zero-shot object detection and image segmentation, which can be used for many applications, including assisted ML labeling.
Are you using Paligemma in your Computer Vision projects? Share your experience fine-tuning this model in the comments!
I hope you enjoyed this post. Once more, thanks for reading!
You can contact me via LinkedIn at:
https://www.linkedin.com/in/camartinezbarbosa/
Acknowledgments: I want to thank José Celis-Gil for all the fruitful discussions on data preprocessing and modeling.
Segmenting Water in Satellite Images Using Paligemma was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.