
Fine-Tuning T5Gemma-2

A hands-on, end-to-end guide to fine-tuning T5Gemma-2 (270M–270M) for LaTeX OCR, showing how to correctly train and run inference with a multimodal encoder–decoder model using a small dataset.
Jan 13, 2026 · 9 min read

T5Gemma 2 is a family of lightweight, open-weight encoder–decoder models from Google, built on Gemma 3, that support multilingual and multimodal inputs. 

With up to a 128K context window across 140+ languages and parameter-efficient design choices like tied embeddings and merged attention, these models are well-suited for text generation and image understanding tasks while remaining small enough to run on a laptop.

In this tutorial, we will learn how to fine-tune an encoder–decoder model on a LaTeX OCR dataset. The goal is to achieve strong performance using a minimal number of training samples. 

1. Setting Up the A100 GPU Environment on RunPod

Although it is possible to fine-tune this model on Kaggle or Google Colab, doing so often leads to unstable sessions, resource disconnects, and significantly slower training. To avoid these friction points and keep the setup simple and reliable, we will use an NVIDIA A100 GPU.

You can rent an A100 on RunPod for around $1.39 per hour, and the full training process in this tutorial should take well under 30 minutes. This setup gives you consistent performance without fighting memory constraints.

Start by going to RunPod and creating a new pod using the latest PyTorch image. Select a 1× A100 machine.

Next, edit the pod configuration and add an environment variable called HF_TOKEN. This token is required to:

  • Load gated models from Hugging Face
  • Push your fine-tuned model back to the Hugging Face Hub
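If you prefer to authenticate from inside the notebook rather than relying on the pod environment variable alone, a minimal sketch using the huggingface_hub client (installed as a dependency of transformers) looks like this:

import os
from huggingface_hub import login

# Log in with the HF_TOKEN environment variable set on the pod
login(token=os.environ["HF_TOKEN"])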

Once the pod is ready, launch the notebook and install the required Python packages. Make sure you are using the latest version of transformers.

!pip -q install -U accelerate datasets pillow sentencepiece safetensors
!pip install --quiet "transformers==5.0.0rc1"
!pip install --quiet --no-deps trl

Now, import the libraries and utilities that we will use throughout the notebook.

import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, set_seed

Before training, we apply a few configurations optimized for A100 GPUs. Setting a seed ensures reproducibility, and enabling TF32 improves performance without affecting bf16 training stability.

set_seed(42)

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

print("CUDA:", torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else None)
print("bf16 supported:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())
Output:

CUDA: True NVIDIA A100 80GB PCIe
bf16 supported: True

2. Preprocessing the LaTeX OCR Dataset for Low-Data Training

In this tutorial, we use the LaTeX OCR dataset available on Hugging Face. Each example in the dataset consists of:

  • an image containing a mathematical expression, and
  • the corresponding LaTeX source text as the target output.

The dataset provides multiple configurations. To keep training efficient and aligned with our goal of learning from limited data, we will explicitly control the dataset size.

First, load the training and validation splits. To simulate a low-data fine-tuning scenario, we randomly shuffle the dataset and select a small subset:

  • 1,000 samples for training
  • 200 samples for validation

This keeps training fast while still allowing the model to generalize.

DATASET_NAME = "full"
raw_train = load_dataset("linxy/LaTeX_OCR", name=DATASET_NAME, split="train")
raw_val   = load_dataset("linxy/LaTeX_OCR", name=DATASET_NAME, split="validation")

train_ds = raw_train.shuffle(seed=42).select(range(1000))
val_ds   = raw_val.shuffle(seed=42).select(range(200)) 

print(train_ds, val_ds)
print("Columns:", train_ds.column_names)

Output:

Dataset({
    features: ['image', 'text'],
    num_rows: 1000
}) Dataset({
    features: ['image', 'text'],
    num_rows: 200
})
Columns: ['image', 'text']

To better understand the data, let’s look at a single training example. First, inspect the image.

train_ds[10]["image"]

Next, examine the raw LaTeX string associated with that image.

train_ds[10]["text"]
'G ( \\beta , \\tilde { \\mu } ) = \\left( \\frac { \\pi \\mu \\Gamma ( \\frac { \\lambda } { \\lambda + 1 } ) } { 2 \\Gamma ( \\frac { 1 } { \\lambda + 1 } ) } \\right) ^ { \\frac { 1 } { 2 \\lambda } } g _ { 0 } ( \\beta ) g _ { S } ( \\beta , Z ) ,'

Finally, we can render this LaTeX expression directly in the notebook to verify that the text matches the image content.

from IPython.display import display, Math, Latex

latex = train_ds[10]["text"]
display(Math(latex))
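Before moving on, it can also help to check how long the LaTeX targets are, since we will cap the decoder length later. A quick, optional sketch over the 1,000-sample subset:

# Inspect target lengths (in characters) to sanity-check the decoder length cap
lengths = sorted(len(t) for t in train_ds["text"])
print("target length (chars) - min:", lengths[0],
      "median:", lengths[len(lengths) // 2],
      "max:", lengths[-1])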

3. Initializing the T5Gemma-2 Multimodal Encoder–Decoder

Now that the dataset is ready, the next step is to load the T5Gemma-2 model and its corresponding processor. We will use the 270M–270M variant, which provides a strong balance between capability and efficiency and fits comfortably on a consumer GPU.

MODEL_ID = "google/t5gemma-2-270m-270m"

processor = AutoProcessor.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,   # A100 -> bf16
    device_map="auto",
)

The tokenizer is accessed through the processor. Some encoder–decoder models do not define a padding token by default, so we add one if necessary and resize the model’s token embeddings accordingly.

tokenizer = processor.tokenizer
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))

For LaTeX OCR, we guide the model using a short textual prompt paired with the input image. This prompt clearly specifies the task and constrains the output format.

Finally, we define maximum sequence lengths for the encoder input and decoder target. These values are sufficient for most mathematical expressions while keeping memory usage under control.

PROMPT = "<start_of_image> Convert this image to LaTeX. Output only LaTeX."
MAX_INPUT_LEN  = 128
MAX_TARGET_LEN = 256

4. Running Baseline Inference: Zero-Shot Performance Check

Before starting fine-tuning, it is useful to run a baseline inference using the pretrained T5Gemma-2 model. This helps us understand how the model behaves on the LaTeX OCR task without any task-specific training and gives us a reference point for later improvements.

Here’s a concise outline of the process:

  1. Choose a sample image from the training dataset.
  2. Prepare model inputs by combining the image and textual prompt into the required tensor format.
  3. Run inference in evaluation mode with gradient computation disabled to conserve memory and enhance generation speed.
  4. Use beam search with a limited number of beams and mild repetition penalties for cleaner output.
  5. Decode the generated tokens back into text once the process is complete.

image = train_ds[20]["image"]

# prepare inputs
model_inputs = processor(text=PROMPT, images=image, return_tensors="pt")
model_inputs = {k: v.to("cuda") for k, v in model_inputs.items()}

# run inference
model = model.eval()
with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=100,
        do_sample=False,
        num_beams=3,    
        repetition_penalty=1.2,
        no_repeat_ngram_size=4,
        early_stopping=True,
    )

# decode
pred = processor.decode(generation[0], skip_special_tokens=True)

print("\n--- Model output ---")
print(pred)

Example output from the pretrained model:

--- Model output ---
Use this command to change this image's format:\usepackage[T1]{fontenc}\usepackage[utf8]{inputenc}\DeclareUnicodeUTF8{1234567890123}\begin{document}<tex>$\begin{equation}\begin{aligned}P=&\frac{1+(-)\frac{1}{2}\left(F_{1}(-)\frac{F_{2}(-

Now, let’s look at the ground-truth LaTeX for the same image.

train_ds[20]["text"]

Ground-truth:

'{ \\cal P } = \\frac { 1 + ( - ) ^ { F } { \\cal I } _ { 4 } ( - ) ^ { F _ { L } } } { 2 } .'

As expected, the pretrained model does not produce a correct LaTeX transcription. Instead, it generates generic LaTeX boilerplate and unrelated fragments. This behavior is normal, as the base model has not been trained specifically for LaTeX OCR.

5. Building a Custom Image-Text Data Collator

In this section, we build a custom image–text data collator to correctly batch images, prompts, and LaTeX targets for fine-tuning. Since T5Gemma-2 is a multimodal encoder–decoder model, the collator plays a critical role in ensuring that images and text are aligned correctly and passed to the model in the expected format.

Specifically, the collator:

  • Loads images from different possible formats and converts them to a consistent RGB representation
  • Ensures each training example contains exactly one image, wrapped in the structure expected by the processor
  • Attaches a fixed instruction prompt to every image to clearly define the OCR task
  • Tokenizes LaTeX targets separately with controlled truncation and padding
  • Masks padding tokens in the labels so they do not contribute to the training loss
  • Avoids truncation on the input side to prevent image–token mismatches

Together, these steps ensure stable training, correct loss computation, and reliable convergence when fine-tuning the model on LaTeX OCR data.

from typing import Any, Dict, List
import torch
from PIL import Image as PILImage

tokenizer = processor.tokenizer
tokenizer.padding_side = "right"

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})

pad_id = tokenizer.pad_token_id

PROMPT = "<start_of_image> Convert this equation image to LaTeX. Output only LaTeX."
MAX_TARGET_LEN = 256

def collate_fn(examples: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
    images, prompts, targets = [], [], []

    for ex in examples:
        im = ex["image"]
        if isinstance(im, PILImage.Image):
            im = im.convert("RGB")
        elif isinstance(im, dict) and "path" in im:
            im = PILImage.open(im["path"]).convert("RGB")
        else:
            raise ValueError(f"Unexpected image type: {type(im)}")

        # IMPORTANT: one image per sample -> nested list
        images.append([im])

        prompts.append(PROMPT)
        targets.append(ex["text"])

    # ✅ NO truncation here (prevents image-token mismatch)
    model_inputs = processor(
        text=prompts,
        images=images,
        padding=True,
        truncation=False,
        return_tensors="pt",
    )

    labels = tokenizer(
        targets,
        padding=True,
        truncation=True,
        max_length=MAX_TARGET_LEN,
        return_tensors="pt",
    )["input_ids"]

    labels[labels == pad_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
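Before wiring the collator into the trainer, it is worth running it once on a couple of samples to confirm that the batch looks right. This is an optional sanity check, not part of the training loop:

# Build one batch from two training samples and inspect the tensor shapes
batch = collate_fn([train_ds[0], train_ds[1]])
for key, value in batch.items():
    print(key, tuple(value.shape))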

6. Training Configuration and Optimization Settings

In this section, we define the training configuration and optimization settings for fine-tuning T5Gemma-2 on the LaTeX OCR task. The setup is intentionally simple and optimized for fast training on a single A100 GPU, using a small dataset and a single training epoch. 

To reduce overhead and speed things up, we disable evaluation and checkpoint saving and focus only on efficient forward and backward passes.

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5gemma2-latex-ocr-1k",

    # --- core training ---
    num_train_epochs=1,
    per_device_train_batch_size=8, 
    gradient_accumulation_steps=1,

    learning_rate=1e-4,
    warmup_steps=15,
    lr_scheduler_type="linear",

    # --- precision / speed ---
    bf16=True,    
    fp16=False,
    tf32=True,     

    # --- memory ---
    gradient_checkpointing=True,

    # --- stop extra work (this is the big speed win) ---
    eval_strategy="no",           
    predict_with_generate=False,          
    save_strategy="no",                 
    report_to="none",

    # --- dataloader ---
    dataloader_num_workers=0,         
    remove_unused_columns=False,

    # --- logging ---
    logging_steps=10,
)
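With 1,000 training samples and a per-device batch size of 8 (no gradient accumulation), one epoch corresponds to 125 optimizer steps, so the 15 warmup steps cover roughly the first 12% of training.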

Finally, we initialize the Seq2SeqTrainer. We explicitly pass our custom data collator so the trainer can correctly construct multimodal batches that combine images, instruction prompts, and LaTeX target sequences during training.

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
)
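As an optional check before launching training, you can print the number of trainable parameters to confirm that the full encoder–decoder is being updated:

# Count trainable parameters (both encoder and decoder are updated here)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_trainable / 1e6:.1f}M")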

7. Fine-Tuning T5Gemma-2 on LaTeX OCR

With everything set up, we can now start fine-tuning the model. Training is launched with a single call to the trainer.

trainer.train()

During training, the loss decreases gradually, indicating that the model is learning to better map equation images to their corresponding LaTeX representations. 

Even with a small dataset and just one epoch, the model begins to adapt quickly to the OCR task.
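If you want to review the recorded loss values after training finishes, the trainer keeps them in its log history (logged every 10 steps per the configuration above):

# Print the training loss recorded at each logging step
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"step {entry['step']:>4}: loss {entry['loss']:.4f}")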

8. Evaluation After Fine-Tuning

After fine-tuning, we re-run inference on both a training sample and a validation sample to see how the model’s outputs have changed. 

Compared to the baseline, which mostly produced generic LaTeX boilerplate, the fine-tuned model now generates structured LaTeX that closely matches the shape and symbols of the target equations.

We start by testing on the training example we inspected earlier in section 2 (index 10). The image is passed through the processor, the model generates output tokens, and the tokens are then decoded back into LaTeX.

# pick a sample
image = train_ds[10]["image"]

# prepare inputs
model_inputs = processor(text=PROMPT, images=image, return_tensors="pt")
model_inputs = {k: v.to("cuda") for k, v in model_inputs.items()}

# run inference
model = model.eval()
with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=100,
        do_sample=False,
        num_beams=3,    
        repetition_penalty=1.2,
        no_repeat_ngram_size=4,
        early_stopping=True,
    )

# decode
pred = processor.decode(generation[0], skip_special_tokens=True)
print(pred)

As you can see, we now get proper LaTeX output, not random or unrelated text. In the training example, the prediction is largely aligned with the ground truth, with the remaining errors mostly in fractions, indices, and a few misplaced tokens.

G ( \beta , \tilde { \mu } ) = \left( \frac { \pi \mu \Gamma } { 2 \Gamma } \lambda _ { + 1 } ^ { \lambda } \right) \right) ^ { \frac { 1 } { 9 \Gamma } g _ { 0 } ( \beta ) g _ { S } ( \theta , Z ) , \right) , \qquad G ( \bar { \

Next, we test the model on a validation sample to check generalization.

image = val_ds[10]["image"]

# prepare inputs
model_inputs = processor(text=PROMPT, images=image, return_tensors="pt")
model_inputs = {k: v.to("cuda") for k, v in model_inputs.items()}

# run inference
model = model.eval()
with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=100,
        do_sample=False,
        num_beams=3,    
        repetition_penalty=1.2,
        no_repeat_ngram_size=4,
        early_stopping=True,
    )

# decode
pred = processor.decode(generation[0], skip_special_tokens=True)
print(pred)

In the validation example, the model still follows the correct LaTeX structure and symbols, although it occasionally makes mistakes in bracket placement, terms inside parentheses, and longer expressions.

f ( p , p ^ { \prime } ) = \ln \left\{ \frac { ( p _ { i } - p _ { j } ) ^ { 2 } } { \left( \frac { \psi ( p _ j ) - g ( p ; p _ { s } ) \psi ( \psi _ { i ] } ) \right\} . . . g ( \psi ; \psi ) - g \left( p ; \psi _

When compared with the ground truth, the overall structure is clearly aligned, and the model is producing a close approximation rather than unrelated output.

print(val_ds[10]["text"])
f ( p , p ^ { \prime } ) = \ln \left\{ \frac { ( p _ { i } - p _ { j } ) ^ { 2 } } { ( p _ { i } + p _ { j } ) ^ { 2 } } \right\} \left[ \psi ( p _ { j } ) - g ( p _ { i } , p _ { j } ) \psi ( p _ { i } ) \right] .

Overall, the post-fine-tuning results show a clear improvement. The model is no longer guessing generic LaTeX templates and instead produces equation-like LaTeX that closely resembles the dataset targets, even with a small training set and a short fine-tuning run.
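If you want a rough quantitative sense of the improvement, you can compare a prediction against its ground truth with a simple character-level similarity score. This is only an illustrative sketch using Python's standard library, not a proper OCR metric such as normalized edit distance or BLEU:

# Rough character-level similarity between the validation prediction and its ground truth
from difflib import SequenceMatcher

ground_truth = val_ds[10]["text"]
similarity = SequenceMatcher(None, pred, ground_truth).ratio()
print(f"Character-level similarity to ground truth: {similarity:.2f}")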

9. Saving and Publishing the Fine-Tuned T5Gemma-2 Model

Once training is complete, the first step is to save the fine-tuned model locally so it can be reused later for inference.

trainer.save_model()
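Since we did not pass the processor to the trainer, trainer.save_model() may not include the processor files, so it is a good idea to save the processor into the same directory to make the local checkpoint self-contained. The reload snippet below is a sketch that assumes the output_dir defined in the training arguments:

# Save the processor alongside the model weights
processor.save_pretrained("t5gemma2-latex-ocr-1k")

# Reload both later for local inference
from transformers import AutoProcessor, AutoModelForSeq2SeqLM

reloaded_processor = AutoProcessor.from_pretrained("t5gemma2-latex-ocr-1k")
reloaded_model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5gemma2-latex-ocr-1k",
    dtype=torch.bfloat16,
    device_map="auto",
)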

Next, we push the model to the Hugging Face Hub so others can access it, reuse it, and build on top of it.

trainer.push_to_hub()

During inspection of the repository files, you may notice that the processor configuration is not always included when pushing the model through the trainer. 

Since the processor is required to correctly handle both images and text, we explicitly push it separately to ensure the model can be loaded and used without extra setup.

processor.push_to_hub(repo_id="kingabzpro/t5gemma2-latex-ocr-1k")

With this step, the repository contains everything needed to load the model and processor with a single call. You can now visit kingabzpro/t5gemma2-latex-ocr-1k on Hugging Face to access the fine-tuned model and start using it for LaTeX OCR or further experimentation.

10. Loading the Model for Inference with Pipelines

Now that the fine-tuned model is published on the Hugging Face Hub, we can load it directly for inference using the pipeline API. This is the simplest way to test the model without manually handling processors, tokenizers, or generation logic.

We load the model from the Hub and create an image-text-to-text pipeline:

from transformers import pipeline

generator = pipeline(
    "image-text-to-text",
    model="kingabzpro/t5gemma2-latex-ocr-1k",
)

Next, we run inference on a validation sample using the same instruction prompt as before.

generator(
    val_ds[10]["image"],
    text="<start_of_image> Convert this image to LaTeX. Output only LaTeX.",
    generate_kwargs={"do_sample": False, "max_new_tokens": 100},
)

The pipeline echoes the input prompt and then appends the generated LaTeX. Note that this particular generation degenerates into repeated \begin{array} fragments, likely because the pipeline call uses plain greedy decoding without the beam search and repetition penalties we applied earlier.

[{'input_text': '<start_of_image> Convert this image to LaTeX. Output only LaTeX.',
  'generated_text': '<start_of_image> Convert this image to LaTeX. Output only LaTeX.f ( p , p ^ { \\prime } ) = \\ln \\left\\{ \\begin{array} { \\begin{array} { \\begin{array} { \\begin{array} { \\begin{array} { \\end{array} \\right\\} \\begin{array} { \\begin{array} { \\begin{array} { \\end{array} \\right\\} \\begin{array} { \\begin{array} { \\begin{array} {'}]

Let’s try another validation sample and post-process the output to keep just the LaTeX string.

preds = generator(
    val_ds[30]["image"],
    text="<start_of_image> Convert this image to LaTeX. Output only LaTeX.",
    generate_kwargs={"do_sample": False, "max_new_tokens": 100},
)

prompt = preds[0]["input_text"]
gen = preds[0]["generated_text"]

# remove the prompt if the model echoed it
if gen.startswith(prompt):
    gen = gen[len(prompt):]

# remove any leftover special tokens / separators
gen = gen.replace("<start_of_image>", "").strip()
if gen.startswith("."):
    gen = gen[1:].strip()

print("\nCLEAN PREDICTED LaTeX:\n", gen)

After this post-processing, the output is a bare LaTeX string with the prompt and special tokens removed, although the greedy generation still repeats some sub-expressions:

CLEAN PREDICTED LaTeX:
 T _ { M N } = \left\{ g N \nu \partial _ { M P _ { } } \cdot \left\{ g ^ { N } \nu \partial _ { M P _ { } } \cdot \left\{ g ^ { N } \nu \partial _ { M P _ { } } \cdot \left\{ g ^ { N } \nu \partial _ { M P _ { } } \cdot \left\{ g ^ {

At this point, the model can be used as a drop-in LaTeX OCR system. You can deploy it behind an API, integrate it into a document processing pipeline, or continue fine-tuning it with more data for even better accuracy.
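As a minimal sketch of how you might wrap the pipeline into a reusable helper before putting it behind an API (the function name and cleanup logic here are illustrative, mirroring the post-processing above):

def image_to_latex(image, max_new_tokens=100):
    """Run the fine-tuned pipeline on a PIL image and return a cleaned LaTeX string."""
    prompt = "<start_of_image> Convert this image to LaTeX. Output only LaTeX."
    preds = generator(
        image,
        text=prompt,
        generate_kwargs={"do_sample": False, "max_new_tokens": max_new_tokens},
    )
    latex_out = preds[0]["generated_text"]
    # Strip the echoed prompt and any leftover special tokens
    if latex_out.startswith(prompt):
        latex_out = latex_out[len(prompt):]
    return latex_out.replace("<start_of_image>", "").strip()

print(image_to_latex(val_ds[30]["image"]))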

If you run into any issues while running the code above, please check out the helper notebook. It contains the complete code along with outputs at each step to guide you through the process.

Conclusion

When I started training the model, I treated it like any large language model with an image encoder. After failing multiple times, I realized that this approach does not work for sequence-to-sequence, encoder–decoder models. 

I had to rethink the entire setup, including the data collator, the trainer, and training arguments, and even how inference is performed.

In this tutorial, we walked through an end-to-end workflow for fine-tuning T5Gemma-2 on a LaTeX OCR task, starting from environment setup and dataset inspection to custom data collation, efficient training, and post-training evaluation. 

Using a small dataset and a single A100 GPU, we showed that an encoder–decoder multimodal model can quickly learn to generate structured, meaningful LaTeX from equation images. 

By the end, the fine-tuned model moved well beyond generic boilerplate output and produced equation-like LaTeX that closely matches the ground truth, demonstrating how accessible and effective fine-tuning modern open models can be for real-world OCR and document understanding tasks.

If you’re looking for more hands-on examples of fine-tuning LLMs, I recommend checking out the Fine-Tuning with Llama 3 course.

T5Gemma-2 FAQs

What distinguishes T5Gemma 2 from standard Gemma models?

Unlike standard Gemma models, which use a decoder-only architecture (like GPT), T5Gemma 2 uses an encoder–decoder architecture similar to T5. This structure is optimized for sequence-to-sequence tasks, making it well-suited for translation, summarization, and converting images to text (OCR).

Can I run T5Gemma 2 on a consumer laptop?

Yes. The T5Gemma 2 (270M) variant is highly efficient and requires less than 2GB of VRAM for inference. It runs smoothly on most modern laptops with consumer-grade GPUs (like NVIDIA RTX series) or even on standard CPUs, unlike larger LLMs that require enterprise hardware.

Why is T5Gemma 2 preferred for tasks like LaTeX OCR?

T5Gemma 2 is multimodal by design, allowing it to ingest image features and output structured text. Its encoder–decoder framework helps it strictly adhere to output formats (like LaTeX syntax) with fewer hallucinations compared to decoder-only models, which often struggle to maintain structure in OCR tasks.

What library versions are required to fine-tune T5Gemma 2?

To fine-tune T5Gemma 2, you must use transformers version 5.0.0 (or newer) and the latest trl library. Older versions of Hugging Face Transformers do not support the specific tied-embedding architecture used by the T5Gemma family.

