By Revanth Reddy Tondapu

Fine-Tuning Florence 2: Empowering Vision-Language Models



In this blog post, we will explore how to fine-tune Florence 2, a Vision Language Model (VLM) released by Microsoft. Given an image and a text prompt, the model can answer questions about that image, such as "What date is shown in this image?" Fine-tuning is essential for sharpening the model's ability to answer such queries accurately. We'll guide you through fine-tuning Florence 2 on a custom dataset, specifically the Document Visual Question Answering (DocVQA) dataset, and by the end you'll know how to train the model to understand and answer questions about document images.


Why Fine-Tuning is Necessary

Without fine-tuning, smaller models like Florence 2 might struggle to provide accurate answers when asked questions about images. Fine-tuning involves training the model on specific datasets to enhance its ability to understand and respond accurately. For example, fine-tuning can teach the model to detect anomalies in medical images or predict dates in documents.


Steps to Fine-Tune Florence 2

  1. Configuration

  2. Initial Model Testing

  3. Dataset Preparation and Embedding

  4. Training the Model

  5. Saving the Model

Let's dive into each step in detail.


Step 1: Configuration

First, install the necessary libraries:

pip install datasets flash_attn timm einops transformers pillow huggingface_hub

Log in to Hugging Face to store your fine-tuned model:

huggingface-cli login
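
If you are working in a notebook, you can log in programmatically instead of using the CLI. A minimal sketch using the huggingface_hub library (you will need an access token with write permission):

from huggingface_hub import login

# Prompts for your Hugging Face access token, or pass token="hf_..." directly
login()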

Here's the initial configuration code:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, get_scheduler
import torch
import os
from torch.optim import AdamW  # use torch's AdamW (transformers' AdamW is deprecated)
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm

# Load the dataset
data = load_dataset("HuggingFaceM4/DocumentVQA")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True, revision='refs/pr/6'
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True, revision='refs/pr/6'
)

# Clear CUDA cache
torch.cuda.empty_cache()
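
Before testing the model, it helps to peek at the dataset. A quick sanity check (the fields question, answers, and image follow the DocVQA schema used later in this post):

# Inspect the splits and one training example
print(data)                      # split names and sizes
sample = data['train'][0]
print(sample['question'])        # question text
print(sample['answers'])         # list of acceptable answers
print(sample['image'].size)      # PIL image dimensions (width, height)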

Step 2: Initial Model Testing

Before fine-tuning, let's see how the model performs with some examples from the dataset:

def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

for idx in range(3):
    print(run_example("DocVQA", 'What do you see in this image?', data['train'][idx]['image']))
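
The parsed output from post_process_generation is a dictionary keyed by the task prompt, so you can also pull out just the answer text. A small sketch, assuming that dictionary layout (the .get fallback keeps it safe if the key differs):

# Extract only the generated text from the parsed output
result = run_example("DocVQA", 'What do you see in this image?', data['train'][0]['image'])
print(result.get("DocVQA", result))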

Step 3: Dataset Preparation and Embedding

We wrap the raw data in a PyTorch Dataset that pairs each question (prefixed with the <DocVQA> task token) with its first answer and the corresponding image:

class DocVQADataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        question = "<DocVQA>" + example['question']
        first_answer = example['answers'][0]
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return question, first_answer, image

train_dataset = DocVQADataset(data['train'].select(range(1000)))
val_dataset = DocVQADataset(data['validation'].select(range(100)))

# Collate function: preprocess a batch of questions and images with the processor
def collate_fn(batch):
    questions, answers, images = zip(*batch)
    inputs = processor(text=list(questions), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, answers

batch_size = 1
num_workers = 0
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn, num_workers=num_workers)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, num_workers=num_workers)
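
Before training, it is worth pulling a single batch to confirm the collate function works and to see what the processor produces. A minimal sketch:

# Inspect one batch from the training loader
inputs, answers = next(iter(train_loader))
print(inputs["input_ids"].shape)     # (batch_size, text sequence length)
print(inputs["pixel_values"].shape)  # (batch_size, channels, height, width)
print(answers)                       # ground-truth answer strings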

Step 4: Training the Model

Now, let's train the model:

def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            inputs, answers = batch

            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)

            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss

            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

            train_loss += loss.item()

        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, answers = batch

                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)

                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss

                val_loss += loss.item()

        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")

        # Save model checkpoint
        output_dir = f"./model_checkpoints/epoch_{epoch+1}"
        os.makedirs(output_dir, exist_ok=True)
        model.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)
        
# Freeze the image encoder for this tutorial so its weights are not updated
for param in model.vision_tower.parameters():
    param.requires_grad = False

train_model(train_loader, val_loader, model, processor, epochs=1)
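
Once training finishes, you can rerun the earlier helper on a few validation examples to see whether the answers improve. A minimal sketch, using the same <DocVQA> prompt format the model was trained on:

# Compare the fine-tuned model's predictions against the ground truth
for idx in range(3):
    example = data['validation'][idx]
    prediction = run_example("<DocVQA>", example['question'], example['image'])
    print("Question:    ", example['question'])
    print("Prediction:  ", prediction)
    print("Ground truth:", example['answers'])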

Step 5: Saving the Model to the Hugging Face Hub

Finally, push the fine-tuned model and processor to the Hugging Face Hub (replace USERNAME with your Hugging Face username):

model.push_to_hub("USERNAME/Florence-2-FT-DocVQA")
processor.push_to_hub("USERNAME/Florence-2-FT-DocVQA")
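
Once the upload completes, the fine-tuned checkpoint can be loaded back exactly like the base model:

# Load the fine-tuned model and processor from the Hub
ft_model = AutoModelForCausalLM.from_pretrained(
    "USERNAME/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
ft_processor = AutoProcessor.from_pretrained(
    "USERNAME/Florence-2-FT-DocVQA", trust_remote_code=True
)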

Conclusion

Fine-tuning Florence 2 enables the model to accurately respond to complex queries about images. By following the steps outlined in this tutorial, you can train the model using your custom datasets and improve its performance. We hope you found this guide helpful. Stay tuned for more tutorials on AI and machine learning!

Happy fine-tuning! 🚀

If you have any questions or need further assistance, feel free to reach out. Don't forget to like, share, and subscribe for more content!
