Tinker Tutorial: Fine-Tuning LLMs With Thinking Machines Lab

A practical guide to implementing LoRA via Tinker AI to train Qwen3-8B on Chain-of-Thought (CoT) financial data, optimizing for generalization.
Dec 5, 2025  · 15 min read

Fine-tuning large language models usually means dealing with distributed GPU infrastructure, managing cluster failures, and debugging complex training scripts. Tinker is here to solve that.

Released by Mira Murati's Thinking Machines Lab on October 1, 2025, Tinker is a training API that handles all the infrastructure complexity while giving you full control over your algorithms and data. You write simple Python scripts built around four core functions, and Tinker runs them across distributed GPUs for a wide range of open-source models, from Llama 70B to Qwen 235B.

The platform uses Low-Rank Adaptation (LoRA) fine-tuning to reduce costs and supports everything from supervised learning to reinforcement learning. Research teams at Princeton, Stanford, and Berkeley already use it for their work.

In this tutorial, I will walk you through installing Tinker, understanding its API, and fine-tuning a complete financial Q&A model using the Qwen3-8B base model. If you want to learn more about fine-tuning with some hands-on practice, I recommend checking out this course about Fine-Tuning with Llama 3.

What is Tinker?

Thinking Machines Lab built Tinker for people who want to customize how models learn without becoming infrastructure experts.

Who uses Tinker?

The platform targets three groups: 

  • Researchers exploring new training methods
  • Developers building AI products that need custom behavior
  • Builders who want production-quality results without enterprise resources

Each group shares a common need: they know what they want their model to do, but standard fine-tuning interfaces don't give them enough control.

For instance, academic teams could use it to test novel algorithms. A chemistry lab might want to train models on domain-specific reasoning patterns that don't fit typical instruction-tuning templates. A startup building a financial advisor bot could train its model to follow specific output formats and reasoning chains. 

What all of these use cases have in common is that they need to modify the training process itself, not just swap datasets.

What makes Tinker different?

Most platforms optimize for one of two things: ease of use or flexibility. Tinker's approach is that these two don't have to conflict. The platform gives you low-level access to training through four core operations, but handles everything else automatically:

  • Training loop control: Write your own loss functions, gradient accumulation strategies, or sampling patterns.
  • Model updates: Specify exactly how weights should change during optimization.
  • Evaluation: Generate outputs or compute probabilities at any point in training.
  • State management: Save and resume training with full control over what gets persisted.

Tinker exposes these capabilities through four API primitives that you'll learn in the hands-on section.

This works because most training complexity comes from infrastructure, not from the algorithms themselves. Running a custom training loop on your laptop is straightforward. Running it across 100 GPUs with failure recovery and resource scheduling is hard. Tinker handles the second part, so you can focus on the first.

Where Tinker fits

The AI tooling ecosystem splits roughly into three layers: cloud compute providers that give you raw GPUs, managed platforms that run predefined workflows, and frameworks that help you build training systems from scratch. Tinker sits between the managed platforms and the frameworks. You get more control than platforms like Hugging Face AutoTrain, but less infrastructure work than setting up a custom cluster on Google Cloud Platform (GCP).

Tinker Pricing and credits

Tinker uses transparent per-token pricing that varies by model size and operation type. Training Qwen3-8B costs $0.40 per million tokens. The good news: new users receive $150 in credits when they're cleared from the waitlist, which more than covers several experimental training runs like the one in this tutorial.

Tinker Pricing

Source: Thinking Machines Lab
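
To put the pricing in perspective, here is a rough back-of-the-envelope estimate for the run in this tutorial. The 2,000-token average sequence length is an assumption for illustration, not a figure from the dataset card, so plug in your own numbers:

# Rough cost estimate for LoRA training on Qwen3-8B at $0.40 per million tokens
# The 2,000-token average is assumed for illustration; measure your own dataset.
examples = 7_244
avg_tokens_per_example = 2_000   # assumed average sequence length
epochs = 4
price_per_million = 0.40         # USD per million training tokens for Qwen3-8B

total_tokens = examples * avg_tokens_per_example * epochs
cost = total_tokens / 1_000_000 * price_per_million
print(f"~{total_tokens / 1e6:.0f}M tokens -> ~${cost:.2f}")  # ~58M tokens -> ~$23.18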

Tinker Example: Fine-Tuning Qwen3-8B for Financial Q&A

In this section, I will walk you through building a complete fine-tuning workflow using Tinker's API. We'll fine-tune Qwen3-8B on financial question-answering data using LoRA, learning all four core API primitives while seeing production best practices in action.

For a more detailed overview of the model, make sure to check out this article covering Qwen3.

Important note: Tinker is currently in beta with waitlist-based access. If you have access, you can follow along and run the code. If you're on the waitlist, you can still read through to understand how straightforward the fine-tuning process is with Tinker. When you get access, you'll be ready to start immediately.

Training results and checkpoint selection

Let's look at the training results first. Understanding what happened helps you spot these patterns when you run your own fine-tuning jobs.

We fine-tuned Qwen3-8B on the FinCoT dataset, a chain-of-thought financial reasoning dataset, using LoRA with rank 32. After filtering for sequences not longer than 10,000 tokens, we had 7,244 training examples and 500 examples for validation to work with. 

Training ran for 904 iterations across four epochs, taking approximately three hours to complete, while Tinker handled all the distributed GPU coordination. 

An epoch is one complete pass through the entire training dataset. Since there were 226 batches per epoch (7,244 examples divided by a 32 batch size), the four epochs totaled 904 iterations.

Analyzing training results

The following chart tracks three metrics: training loss (blue), validation loss (red), and learning rate (green).

Tinker training progress

Early training (iterations 0-400) looks healthy: both the training and validation losses drop together sharply, falling from over 1.4 to below 0.8 and staying within about 0.1 of each other. 

This period demonstrates that the model is effectively learning generalizable patterns in the data. The learning rate (green line) ramps up smoothly during the first 200 iterations, as warmup stabilizes training. 

Around iteration 400-600, things shift. Training loss keeps dropping, but validation loss plateaus around 0.75-0.8, widening the gap to 0.13 as the model starts to overfit. By iteration 600-900, the divergence becomes obvious: training loss plummets to 0.39 while validation loss rises to 0.8, showing the model has shifted from learning reasoning patterns to memorizing examples. 

This pattern is common when fine-tuning reasoning models. If you're interested in how reasoning capabilities develop during training, feel free to check out this tutorial on fine-tuning DeepSeek R1, which explores similar dynamics.

The model saved at checkpoint-400 offers the best generalization, with a training loss of 0.68, a validation loss of 0.73, and a small gap of just 0.05. Despite having a higher training loss than the final checkpoint, it handles new questions better. The lowest training loss rarely produces the best model, which is why validation metrics matter.

We'll verify this later by comparing the model at checkpoint-400 against the base model using an LLM judge. For now, let's walk through building this fine-tuning workflow step by step.

Step 1: Install Tinker and set up your environment

Note: The following steps break down a script that is about 300 lines long. Because the explanations are interleaved with chunks of code, we recommend opening the full script in a new tab and keeping it next to you as you follow along, so you always have the full picture.

You'll need Python 3.11 or later and a stable internet connection. Since Tinker handles the GPU training on their servers, a good connection is more important than having a powerful computer.

Start by signing into the Tinker console and creating an API key. Store this key in a .env file in your project directory:

TINKER_API_KEY=your_key_here

Once you're set up, the Tinker console becomes your central hub for monitoring training runs and managing checkpoints.

Tinker training-runs

The Tinker console tracks all your training runs with key details: unique run IDs, base model, LoRA rank, and last request times. You can search, filter, and manage multiple experiments from this interface.

Install the required packages. Tinker has peer dependencies that don't install automatically, so specify them explicitly:

pip install tinker transformers torch datasets python-dotenv numpy

The transformers and torch packages provide tokenization and tensor operations that Tinker uses internally. The datasets library handles loading training data from Hugging Face.

Now set up your imports and a retry helper function. When training on remote servers, you might see temporary connection errors that don't necessarily mean anything is actually wrong. Your training job is still running fine on Tinker's infrastructure. This retry wrapper handles those transient failures automatically:

import time
import numpy as np
from dotenv import load_dotenv
from datasets import load_dataset
import tinker
from tinker import types

def with_retry(future, max_attempts=3, delay=5):
    """Simple retry logic for API futures"""
    for attempt in range(max_attempts):
        try:
            return future.result()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)

This function takes a future object (an object representing an operation that hasn't finished yet) from Tinker's API and tries to get the result up to three times, waiting five seconds between each try. Most temporary failures resolve within this window.

Step 2: Initialize the ServiceClient and load the dataset

The ServiceClient is your entry point to Tinker. It finds available models and handles authentication:

load_dotenv()
service_client = tinker.ServiceClient()

You won't use it much after this initial setup. It mainly exists to create training and sampling clients for specific models.

Now it is time to load the FinCoT dataset.  It contains financial questions paired with step-by-step reasoning chains and final answers. Unlike simple Q&A datasets, FinCoT teaches the model to show its work before providing answers, an important skill for financial advisory applications where users need to understand the reasoning behind recommendations.

dataset = load_dataset("TheFinAI/FinCoT")

Tinker dataset

The dataset includes an SFT (supervised fine-tuning) split with complete reasoning chains and an RL split for reinforcement learning. Use the SFT split for training and sample from the RL split for validation to ensure the model generalizes to different examples.

train_data_raw = dataset["SFT"]  # 7,686 examples with reasoning chains
val_data_raw = dataset["RL"].shuffle(seed=42).select(range(500))  # 500 validation examples

print(f"Loaded {len(train_data_raw)} training examples")
print(f"Loaded {len(val_data_raw)} validation examples")

Step 3: Create your LoRA training client

Now create a LoRA training client for Qwen3-8B:

training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-8B",
    rank=32
)

Let's break down what's happening here:

  • Qwen/Qwen3-8B: This is a model with 8 billion parameters, powerful enough for real tasks but still efficient to train. It's a good sweet spot for learning. You can see all available models in Tinker's model lineup.
  • LoRA (Low-Rank Adaptation): Instead of updating all 8 billion parameters, LoRA trains small "adapter" layers that modify the base model's behavior. Think of it like adding custom lenses to a camera rather than rebuilding the entire camera. The base model stays frozen on Tinker's servers; you're only training these tiny adapters.
  • rank=32: This controls how many trainable parameters your LoRA adapters contain. A rank of 32 works well for datasets with 5,000-10,000 examples. Larger datasets might need rank=64 or rank=128 for more adaptation capacity.

For a deeper dive into how LoRA works mathematically and how to tune rank parameters, check out Tinker's LoRA primer or DataCamp's comprehensive guide on mastering Low-Rank Adaptation.
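
To get a feel for how small these adapters are, here is a quick back-of-the-envelope calculation. Treating an adapted weight as a square 4,096 x 4,096 matrix (Qwen3-8B's hidden size) is a simplification for illustration:

# LoRA adds two low-rank matrices A (d x r) and B (r x d) per adapted weight matrix,
# so each adapted d x d matrix contributes roughly 2 * r * d trainable parameters.
d = 4096   # Qwen3-8B hidden size
r = 32     # LoRA rank used in this tutorial

params_per_matrix = 2 * r * d
print(f"{params_per_matrix:,} trainable parameters per adapted matrix")  # 262,144
print(f"vs. {d * d:,} parameters in the frozen matrix itself")           # 16,777,216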

Language models don't work with words directly. They work with "tokens," which represent pieces of text. A tokenizer breaks text into these pieces. For example, "financial" might become one token, while "understanding" might split into "under" and "standing." 

The tokenizer converts text to numbers (token IDs) that the model can process. By getting the tokenizer from the training client, you're making sure you use the exact same tokenization that Qwen3-8B expects.

tokenizer = training_client.get_tokenizer()
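
A quick way to build intuition is to run a sentence through the tokenizer and look at the round trip (the exact splits depend on Qwen's vocabulary, so treat them as illustrative):

# Encode a sample question into token IDs, then decode them back to text
sample = "What are the main risks associated with investing in stocks?"
token_ids = tokenizer.encode(sample, add_special_tokens=False)

print(f"{len(token_ids)} tokens: {token_ids[:10]}...")
print(tokenizer.decode(token_ids))  # should reproduce the original sentence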

Step 4: Transform data into Tinker's format

Tinker needs your training data in a format called types.Datum. This format separates the model input from the loss function configuration, giving you precise control over which tokens contribute to training.

When fine-tuning language models, you typically want the model to learn how to generate good answers, not memorize the questions. You do this by assigning different weights to different parts of the input:

  • Prompt tokens (the question): Weight = 0.0 (don't learn from these)
  • Completion tokens (the answer): Weight = 1.0 (learn from these)

Let's build the data transformation function step by step. Start with the function signature:

def prepare_datum(example, max_length=10000):

Format the conversation using Qwen3's chat template

Different models expect different formatting for multi-turn conversations. Qwen3 uses special tokens like <|im_start|> and <|im_end|> to mark message boundaries. Split each message into "observation" parts (role headers) and "action" parts (actual content):

    # Qwen3 chat format - split into observation and action parts
    user_ob = "<|im_start|>user\n"
    user_ac = f"{example['Question']}<|im_end|>"
    
    assistant_ob = "\n<|im_start|>assistant\n"
    # Include both reasoning process AND final answer
    assistant_ac = f"{example['Reasoning_process']}\n\nFinal Answer: {example['Final_response']}<|im_end|>"

The FinCoT dataset provides structured reasoning, so we concatenate the Reasoning_process and Final_response fields. This teaches the model to show its work before answering.

Tokenize each part separately 

To track where weights should apply, each of the message parts needs to be tokenized separately.

    # Tokenize each part separately to track weight boundaries
    user_ob_tokens = tokenizer.encode(user_ob, add_special_tokens=False)
    user_ac_tokens = tokenizer.encode(user_ac, add_special_tokens=False)
    assistant_ob_tokens = tokenizer.encode(assistant_ob, add_special_tokens=False)
    assistant_ac_tokens = tokenizer.encode(assistant_ac, add_special_tokens=False)
    
    # Combine all tokens
    all_tokens = user_ob_tokens + user_ac_tokens + assistant_ob_tokens + assistant_ac_tokens

Filter sequences that are too long

Some FinCoT examples include extremely detailed reasoning chains that exceed Qwen3-8B's comfortable context window. While the model technically supports 32,768 tokens, very long sequences can cause training instability:

    # Check if sequence exceeds max length
    if len(all_tokens) > max_length:
        return None  # Skip this example

Filtering examples above 10,000 tokens removes only 5-6% of data, while significantly improving training reliability.

Assign weights to control what the model learns from

Set weights to 0.0 for the question (so the model doesn't learn to predict questions) and 1.0 for the answer (so it learns to generate good responses):

    # Weights: only train on assistant's answer (action part)
    weights = np.array(
        [0.0] * len(user_ob_tokens) +
        [0.0] * len(user_ac_tokens) +
        [0.0] * len(assistant_ob_tokens) +
        [1.0] * len(assistant_ac_tokens)
    )

Shift tokens and weights for next-token prediction

Language models predict the next token in a sequence. When the model sees tokens 0-9, it should predict token 10. So your targets are always the input shifted one position forward:

    # CRITICAL: Shift tokens AND weights for next-token prediction
    input_tokens_model = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    weights_shifted = weights[1:]

It is important to note that you must shift the weights along with the tokens. If you don't shift the weights, the loss calculation will be misaligned: the model would get trained on predicting the first answer token, but the weight at that position would still be 0.0 from the prompt. This alignment bug can prevent the model from learning properly.
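
A tiny toy example (with made-up token IDs) makes the alignment easy to see:

# Toy sequence: 3 prompt tokens followed by 2 answer tokens (IDs are made up)
tokens  = [11, 22, 33, 44, 55]
weights = [0.0, 0.0, 0.0, 1.0, 1.0]

inputs  = tokens[:-1]    # [11, 22, 33, 44] - what the model sees
targets = tokens[1:]     # [22, 33, 44, 55] - what it should predict next
shifted = weights[1:]    # [0.0, 0.0, 1.0, 1.0] - loss applies only where the targets are 44 and 55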

Return the formatted data

Finally, the prepare_datum() function returns the formatted data, which is ready for processing.

    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens_model),
        loss_fn_inputs=dict(weights=weights_shifted, target_tokens=target_tokens),
    )
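
Before processing everything, it's worth sanity-checking a single example. This optional check follows the same .data access pattern used in the training loop later on:

# Quick sanity check on one example before processing the full dataset
sample_datum = prepare_datum(train_data_raw[0])
if sample_datum is not None:
    sample_weights = np.array(sample_datum.loss_fn_inputs["weights"].data)
    print(f"{len(sample_weights)} target positions, {int((sample_weights > 0).sum())} contribute to the loss")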

By applying the defined function to the raw training and validation subsets, we can now process the entire dataset:

# Process and filter training data
processed_train = [prepare_datum(example) for example in train_data_raw]
training_data = [d for d in processed_train if d is not None]
skipped = len(train_data_raw) - len(training_data)

print(f"Processed {len(training_data)} examples (skipped {skipped} too-long sequences)")
# Output: Processed 7,244 examples (skipped 442 too-long sequences)

# Process validation data
val_data_raw_processed = [prepare_datum(example) for example in val_data_raw]
val_data = [d for d in val_data_raw_processed if d is not None]

print(f"Validation set: {len(val_data)} examples")

Step 5: Train the model with the training loop

The training loop uses two of Tinker's core API primitives: forward_backward and optim_step. Let's build it step by step with validation tracking to catch overfitting early.

Define a validation loss function

First, define a validation loss function. Unlike training loss, validation loss tells you how well your model generalizes to unseen data. This is crucial for detecting overfitting.

def compute_validation_loss(val_data, batch_size=100):
    """Compute loss on validation set (forward only, no backward)"""
    batch_indices = np.random.choice(
        len(val_data), size=min(batch_size, len(val_data)), replace=False
    )
    batch = [val_data[i] for i in batch_indices]
    
    # Forward pass only (no backward!)
    fwd_future = training_client.forward(batch, loss_fn="cross_entropy")
    fwd_result = with_retry(fwd_future)
    
    # Calculate per-token loss
    loss_sum = fwd_result.metrics["loss:sum"]
    total_completion_tokens = sum(
        np.sum(np.array(val_data[i].loss_fn_inputs["weights"].data) > 0)
        for i in batch_indices
    )
    return loss_sum / total_completion_tokens if total_completion_tokens > 0 else 0

The forward() method computes the loss without calculating gradients, making it efficient for evaluation where you only need to measure performance, not update weights. Computing validation loss adds minimal overhead (it's just a forward pass, no gradient computation) but provides early warning when the gap between training and validation loss exceeds 0.2, signaling that your model is memorizing rather than learning.
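
If you want that warning to be automatic, a small helper along these lines can be called whenever you compute validation loss (the 0.2 threshold is just the rule of thumb mentioned above):

def check_overfit_gap(train_loss, val_loss, threshold=0.2):
    """Warn when the train/validation gap suggests memorization rather than learning."""
    gap = val_loss - train_loss
    if gap > threshold:
        print(f"⚠️ Train/val gap is {gap:.2f} (> {threshold}); consider an earlier checkpoint.")
    return gap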

Configure training parameters

We need to define the number of epochs as well as the learning rate. LoRA requires significantly higher learning rates than full fine-tuning, typically 10-20x larger.

# Training configuration
n_samples = len(training_data)  # 7,244
n_epochs = 4  # More epochs for smaller dataset
batch_size = 32

# Calculate optimal LR for Qwen3-8B with LoRA
# Formula: base_lr * lora_multiplier * (2000 / hidden_size) ** exponent
# For Qwen: base=5e-5, multiplier=10, hidden_size=4096, exponent=0.0775
learning_rate = 5e-5 * 10.0 * (2000 / 4096) ** 0.0775  # ≈ 4.7e-4

The formula accounts for model size (Qwen3-8B has a hidden size of 4,096) and uses empirically determined exponents that vary by model family, as documented in the Tinker cookbook. For Qwen models, the exponent is 0.0775, while Llama models use 0.781. In our case, this produces a learning rate of around 4.7e-4, which gives stable convergence without gradient explosions.
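
If you plan to reuse the formula for other base models, you can wrap it in a small helper (a hypothetical convenience function, not part of the Tinker API):

def lora_learning_rate(hidden_size, exponent, base_lr=5e-5, lora_multiplier=10.0):
    """Compute a LoRA learning rate from the scaling formula above."""
    return base_lr * lora_multiplier * (2000 / hidden_size) ** exponent

print(f"{lora_learning_rate(4096, 0.0775):.2e}")  # Qwen3-8B -> ~4.73e-04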

Set up learning rate warmup

Starting with the full learning rate can cause gradient explosions early in training when the model hasn't adapted to the task. Warmup gradually increases the learning rate from near-zero to the target value over the first 200 iterations. You can see this clearly in the training visualization from earlier, where the learning rate (green line) ramps up smoothly during the first 200 iterations.

# Warmup configuration
warmup_steps = 200

# Calculate iterations from epochs
num_iterations = n_epochs * (n_samples // batch_size)  # 4 * 226 = 904
checkpoint_interval = 200
validation_interval = 50

Start the training loop

After initializing lists to track losses, we can start the training loop:

losses = []
per_token_losses = []
val_losses = []

for iteration in range(num_iterations):
    # Sample random batch
    batch_indices = np.random.choice(len(training_data), size=batch_size, replace=False)
    batch = [training_data[i] for i in batch_indices]
    
    # Apply learning rate warmup
    if iteration < warmup_steps:
        current_lr = learning_rate * (iteration + 1) / warmup_steps
    else:
        current_lr = learning_rate

The warmup logic gradually scales up the learning rate during the first 200 iterations, preventing early instability.

Call Tinker's core training primitives

In each iteration, Tinker’s core primitives are called as follows:

    # API Primitive 1: forward_backward - compute gradients
    fwdbwd_future = training_client.forward_backward(batch, loss_fn="cross_entropy")

    # API Primitive 2: optim_step - update parameters
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=current_lr))

    # Wait for results with retry logic
    fwdbwd_result = with_retry(fwdbwd_future)
    optim_result = with_retry(optim_future)

forward_backward computes gradients by figuring out how wrong the model's predictions are and in which direction to adjust each parameter. It uses cross-entropy loss, which measures how well the model's predicted probability distribution matches the actual next tokens. optim_step actually updates the model parameters based on those gradients using the Adam optimizer.

Notice how both methods return "future" objects immediately. This lets Tinker process multiple operations simultaneously, which speeds things up. Calling with_retry(future) waits for that specific operation to finish.

Track and display loss metrics

We calculate both total and per-token losses and append them to the respective lists:

    # Track loss
    loss_sum = fwdbwd_result.metrics["loss:sum"]
    total_completion_tokens = sum(
        np.sum(np.array(training_data[i].loss_fn_inputs["weights"].data) > 0)
        for i in batch_indices
    )
    per_token_loss = loss_sum / total_completion_tokens

    losses.append(loss_sum)
    per_token_losses.append(per_token_loss)

    if iteration % 10 == 0:
        warmup_indicator = "🔥" if iteration < warmup_steps else ""
        print(f"{warmup_indicator} Iteration {iteration} | Train: {per_token_loss:.4f} | LR: {current_lr:.6f}")

Per-token loss normalizes the loss by the number of answer tokens, making it easier to interpret across different batch sizes. A good training run typically starts with per-token loss around 1.2-1.5 and decreases below 0.8 by checkpoint-400. If your loss starts above 2.0 or doesn't drop after 50 iterations, check your data preparation and learning rate.

Add periodic validation checks and checkpoints

Saving weights every 200 iterations creates snapshots you can compare. As you saw in the training results earlier, checkpoint-400 generalizes better than the final checkpoint despite higher training loss. Always save multiple checkpoints and evaluate them on held-out data. The checkpoints persist on Tinker's servers after your script finishes.

    # Compute validation loss periodically
    if iteration % validation_interval == 0:
        val_loss = compute_validation_loss(val_data)
        val_losses.append((iteration, val_loss))
        gap = val_loss - per_token_loss
        print(f"  📊 Iteration {iteration} | Train: {per_token_loss:.4f} | Val: {val_loss:.4f} | Gap: {gap:+.4f}")

    # Checkpoint every 200 iterations
    if iteration > 0 and iteration % checkpoint_interval == 0:
        training_client.save_weights_for_sampler(name=f"fincot-checkpoint-{iteration}")

Tinker checkpoints

Tinker organizes checkpoints into two categories: full-state checkpoints (for resuming training if interrupted) and sampler weights (lightweight checkpoints for inference). Notice checkpoint-400 appears in both formats, making it easy to both continue training and deploy the model.

If your training run gets interrupted, you can resume from any saved checkpoint using the full-state weights, allowing you to continue training without starting over.
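
Once the loop finishes, you can reproduce a chart like the one shown earlier and pick the checkpoint closest to the lowest validation loss, using the lists you logged above. This sketch uses matplotlib, which is not in the install command from Step 1:

import matplotlib.pyplot as plt

# Plot per-token training loss and the periodically logged validation loss
val_iters, val_values = zip(*val_losses)
plt.plot(per_token_losses, label="Training loss")
plt.plot(val_iters, val_values, label="Validation loss")
plt.xlabel("Iteration")
plt.ylabel("Per-token loss")
plt.legend()
plt.savefig("training_progress.png")

# The saved checkpoint nearest the lowest validation loss is usually the one to keep
best_iter, best_val = min(val_losses, key=lambda pair: pair[1])
print(f"Lowest validation loss {best_val:.4f} at iteration {best_iter}")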

For complete API reference on these methods, see Tinker's training and sampling documentation.

Step 6: Save your fine-tuned model

After training completes, save the final model weights:

sampling_client = training_client.save_weights_and_get_sampling_client(
    name="financial-qa-qwen3-8b-lora"
)

The weights are saved on Tinker's servers, not your computer. The LoRA adapters typically require only tens of megabytes. Tinker can optionally export them for use with external inference providers if you want to deploy the model outside their infrastructure.

Step 7: Test the model with sample questions

Finally, test your fine-tuned model using the sample API primitive. First, configure sampling parameters:

sampling_params = types.SamplingParams(
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
    stop_sequences=["<|im_end|>"]
)

Let's explain these parameters:

  • max_tokens: Maximum length of the generated response (200 tokens is roughly 150 words).
  • temperature: Controls randomness in generation. Lower values (0.1-0.5) make output more focused and deterministic; higher values (0.7-1.0) allow for greater creativity. A value of 0.7 balances coherence with variety.
  • top_p (nucleus sampling): Only considers the smallest set of tokens whose cumulative probability exceeds this value. 0.9 means "only consider tokens that together account for 90% of the probability mass." This prevents the model from choosing very unlikely tokens.
  • stop_sequences: Tells the model to stop generating when it hits this sequence. <|im_end|> marks the end of a message in Qwen's chat format.

Generate responses for test questions:

questions = [
    "What are the main risks associated with investing in stocks?",
    "How does diversification help reduce portfolio risk?"
]

for question in questions:
    # Wrap the question in the same Qwen3 chat format used during training
    prompt_text = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    model_input = types.ModelInput.from_ints(tokenizer.encode(prompt_text, add_special_tokens=False))

    response = with_retry(
        sampling_client.sample(
            prompt=model_input,
            num_samples=1,
            sampling_params=sampling_params
        )
    )

    answer = tokenizer.decode(response.sequences[0].tokens)
    print(f"Q: {question}")
    print(f"A: {answer}\n")

Here are two example outputs from the fine-tuned model (truncated for illustration):

Q: What are the main risks associated with investing in stocks?
A: Investing in stocks carries several key risks. Market risk represents the potential for stock prices to decline due to economic downturns or market volatility. Company-specific risk arises from poor management decisions, competitive pressures, or operational failures that affect individual stocks. Liquidity risk occurs when you cannot sell shares quickly without significantly impacting the price. Additionally, investors face inflation risk, where returns may not keep pace with rising prices...

Q: How does diversification help reduce portfolio risk?
A: Diversification reduces portfolio risk by spreading investments across different asset classes, sectors, and geographic regions. When you hold multiple uncorrelated assets, losses in one investment can be offset by gains in others. For example, if technology stocks decline but healthcare stocks rise, a diversified portfolio experiences less volatility than one concentrated in technology alone. The key principle is that different assets respond differently to market conditions...

The response object contains a sequences list because you can request multiple samples per prompt. Access the tokens through sequences[0].tokens and decode them back to text using the same tokenizer.
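
For instance, following the same call shape as above, you can request several candidate answers for one prompt and compare them:

# Draw three candidate answers for the same prompt and print each one
response = with_retry(
    sampling_client.sample(
        prompt=model_input,
        num_samples=3,
        sampling_params=sampling_params
    )
)

for i, seq in enumerate(response.sequences, start=1):
    print(f"--- Sample {i} ---")
    print(tokenizer.decode(seq.tokens))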

You've now successfully demonstrated all four core API primitives:

  1. forward_backward: Compute gradients from your training data
  2. optim_step: Update model parameters with the Adam optimizer
  3. save_weights_and_get_sampling_client: Persist your fine-tuned model
  4. sample: Generate predictions from the fine-tuned model

This same workflow pattern extends to more advanced training paradigms. For reinforcement learning or preference optimization, you'd use the same primitives but with different loss functions and data formats. If you're interested in aligning models with human preferences, check out preference fine-tuning techniques, which build on the same foundation. The Tinker cookbook provides examples for these advanced scenarios.

But how do you know if the fine-tuning actually improved performance? Rather than relying on subjective impressions, you need rigorous evaluation. In the next section, you'll see results from an LLM-as-judge evaluation that compares the fine-tuned model against the base Qwen3-8B across 10 diverse financial questions.

Evaluating the fine-tuned model with an LLM judge

To measure performance objectively, we used LLM-as-judge: GPT-4o evaluated checkpoint-400 against base Qwen3-8B on 10 diverse financial questions, scoring each response on accuracy, clarity, completeness, and financial terminology.

The fine-tuned model scored 8.5/10 compared to 6.5 for the base model, winning all 10 comparisons. The judge's verdicts reveal why: the fine-tuned model delivers complete explanations with confident, structured reasoning, while the base model hedges with "I think" and "maybe" and leaves thoughts incomplete.

This validates checkpoint-400’s better real-world performance. The lower training loss at iteration 900 would have given us memorization, not reasoning.

Check the comparison script and detailed results to run similar evaluations on your models.
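
If you would rather roll your own comparison, the core of an LLM-as-judge script is short. The sketch below uses the OpenAI Python client with a simplified rubric; the prompt wording and output parsing are illustrative rather than a copy of the linked comparison script:

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in your environment

def judge(question, answer_a, answer_b):
    """Ask GPT-4o to compare two answers to the same financial question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Score each answer from 1-10 on accuracy, clarity, completeness, and "
        "use of financial terminology, then state which answer is better and why."
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return result.choices[0].message.content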

Conclusion

You've seen how Tinker simplifies fine-tuning without sacrificing control. Four core API primitives (forward_backward, optim_step, save_weights_and_get_sampling_client, and sample) give you the building blocks for custom training workflows while Tinker handles the distributed infrastructure.

The financial Q&A model we built demonstrates production best practices: validation tracking caught overfitting early, checkpoint-400 outperformed the final checkpoint because we optimized for generalization, and LLM-as-judge evaluation confirmed a 2-point improvement over the base model. These patterns apply whether you're fine-tuning 8B or 70B parameter models.

The platform is in beta with waitlist access, but the concepts you've learned transfer directly once you gain access. The training runs I did for this tutorial (including failed and experimental ones) used only a fraction of the $150 starter credits; the full grant is enough to run this tutorial roughly six times over, plus additional experiments, which gives plenty of room to learn and iterate on your own fine-tuning projects.

What's next? Try fine-tuning on your domain-specific data, experiment with different LoRA ranks and learning rates, or explore advanced training paradigms like reinforcement learning from human feedback. The Tinker cookbook offers examples for these scenarios. Start thinking about your use case now—what behavior do you want your model to learn?

If you want to develop a broader skillset in training and fine-tuning the latest AI models for production, make sure to check out the Associate AI Engineer for Developers study track.


Bex Tuychiev's photo
Author
Bex Tuychiev
LinkedIn

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastıc style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the makıng. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn. 
