
Cannot use the same PaliGemma object detection finetuning regime for Gemma3 #38

@Hemanth21k

Description

I recently came across this repository and wanted to contribute towards the idea of adding object detection capability to Gemma3.
But when I ran inference with the previously trained weights on the validation dataset, I noticed that the object detection performance is VERY POOR, so I started to look into WHY. Here is what I have found.

First of all, when I tested the original Gemma3 weights, the 'it' models seemed to perform poorly compared to the 'pt' models even on basic instructions, which makes some sense, since the 'it' checkpoints go through an additional instruction-tuning step on top of the 'pt' weights.

Moving forward, the most important data format we should look at is the PaliGemma labelling scheme, where x,y coordinates are quantized into 1024 bins and written as the dedicated tokens <loc0000> .... <loc1023>.

So an average bounding box given as (top_left_x, top_left_y), (bottom_right_x, bottom_right_y) is encoded as four location tokens, ordered (y_min, x_min, y_max, x_max) in PaliGemma's convention, and would look something like <loc0012><loc0023><loc0120><loc0150>.
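
As an illustration, here is a minimal sketch of that encoding (my own reconstruction, not the official big_vision code; the exact rounding may differ slightly):

    # Quantize pixel coordinates into 1024 bins and render them as <locXXXX> strings.
    def box_to_loc_tokens(ymin, xmin, ymax, xmax, img_h, img_w):
        def to_bin(value, size):
            return min(int(value / size * 1024), 1023)
        return "".join(
            f"<loc{to_bin(v, s):04d}>"
            for v, s in [(ymin, img_h), (xmin, img_w), (ymax, img_h), (xmax, img_w)]
        )

    print(box_to_loc_tokens(10, 18, 96, 120, 800, 800))
    # -> <loc0012><loc0023><loc0122><loc0153>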

If you read the PaliGemma paper, there is a section where they discuss adding these 1024 tokens to the vocabulary for the object detection objective and performing further finetuning on object detection datasets. Getting the optimization right involved multiple steps, because these tokens were added to the vocabulary AFTER the original question-and-answer pretraining, so they had to carefully tune both the tokenizer and the model weights.

However, the Gemma3 paper does not mention these tokens at all.
Hence I decided to check whether they at least carried over the PaliGemma vocabulary and extended it for Gemma3.

The following results are the output of a simple script I wrote to check the vocabulary of both models:


PaliGemma Tokenizer (huggingface: "google/paligemma2-3b-mix-448"):
Total vocabulary: 257153
Checking <loc0000> .... <loc1023> tokens in PaliGemma vocabulary: True

Gemma3 Tokenizer (huggingface: "google/gemma-3-4b-it"):
Total vocabulary: 262145
Checking <loc0000> .... <loc1023> tokens in Gemma3 vocabulary: False
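
For reference, here is a minimal version of that check (assuming the transformers library and the checkpoint names above):

    from transformers import AutoTokenizer

    LOC_TOKENS = [f"<loc{i:04d}>" for i in range(1024)]

    for name in ["google/paligemma2-3b-mix-448", "google/gemma-3-4b-it"]:
        vocab = AutoTokenizer.from_pretrained(name).get_vocab()
        print(f"{name}: total vocabulary {len(vocab)}, "
              f"all <loc0000>...<loc1023> present: {all(t in vocab for t in LOC_TOKENS)}")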


As you can see above, when the vocabularies (the token-to-embedding mappings) of both models are queried for these tokens, Gemma3 does not contain the tokens required to perform this object detection task.

What does this mean?

PaliGemma can perform the object detection task thanks to the special <locxxxx> tokens present in its training data and vocabulary.
Gemma3, on the other hand, does not have the tokens necessary to perform the intended training.

So how is the model still predicting the bounding boxes after training?

Since the <locxxxx> tokens are NOT present in the Gemma3 vocabulary, the tokenizer splits each of them into several ordinary sub-tokens, and the model is forced to spell the coordinates out piece by piece.
This is comparable to asking the model to predict bounding boxes as plain text, giving an output such as {x:200, y:250, h:40, w:30}, and the evidence suggests that this is a task where most multimodal models fail to perform.
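
You can see this directly by tokenizing the same string with both tokenizers (same checkpoints as above):

    from transformers import AutoTokenizer

    pg = AutoTokenizer.from_pretrained("google/paligemma2-3b-mix-448")
    g3 = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

    s = "<loc0012><loc0023><loc0120><loc0150>"
    print(len(pg.tokenize(s)))  # expect 4 -- one dedicated token per coordinate
    print(len(g3.tokenize(s)))  # expect many more -- the string is spelled out in pieces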

Finetuning on such a task with <locxxxx> strings will only make it extra hard to predict bounding boxes accurately, since each box now takes many more output tokens, leading to poorer performance compared to conventional training on the {x:200, y:250, h:40, w:30} format.

What should we do to make it work?

Incorporate these special tokens into the vocabulary and finetune the model. There are several ways to do this without fully retraining the model, BUT it is important NOT to MESS UP the existing weights. Alternatively, we can simply follow the PaliGemma finetuning recipe.
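
A rough sketch of the first option (my assumption of the setup; the Gemma3ForConditionalGeneration class name is from recent transformers releases):

    import torch
    from transformers import AutoTokenizer, Gemma3ForConditionalGeneration

    model_id = "google/gemma-3-4b-it"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = Gemma3ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Register the 1024 PaliGemma-style location tokens as new vocabulary entries.
    loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]
    tok.add_tokens(loc_tokens)

    # Grow the embedding/output matrices; only the freshly initialized rows
    # (plus, e.g., LoRA adapters) need training, leaving the original weights intact.
    model.resize_token_embeddings(len(tok))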

In other words, maybe we're halfway to PaliGemma-3 yay (Google AI, if you're reading this, I'm open to work haha.)

References:
PaliGemma: https://arxiv.org/abs/2407.07726
Gemma3: https://blog.google/technology/developers/gemma-3/
Gemma3 technical report: https://arxiv.org/abs/2503.19786
Roboflow, finetuning PaliGemma: https://blog.roboflow.com/how-to-fine-tune-paligemma/
