Track
NVIDIA Nemotron-3 is NVIDIA’s new open model family built for reasoning, coding, chat, and agentic AI workflows. It includes different model sizes, such as Nano, Super, and Ultra, so developers can choose between smaller, efficient models and larger, high-performance models.
The key update with Nemotron-3 is its focus on efficiency. The models are designed to deliver strong performance while keeping inference and fine-tuning more practical. The Nano version is especially useful for hands-on experimentation because it can run on more accessible GPU setups compared with larger models.
In this guide, we will fine-tune NVIDIA Nemotron-3-Nano-4B on a psychology question-answering dataset. We will use Low-Rank Adaptation (LoRA), Transformers Reinforcement Learning (TRL), and Hugging Face to prepare the data, train the model, save the adapter, push it to Hugging Face, and compare the responses before and after fine-tuning.
To get started with finding the latest open-source AI models, building AI agents, and fine-tuning LLMs, I recommend enrolling in our Hugging Face Fundamentals skill track.
1. Setting Up the Environment
Nemotron-3 Nano uses a hybrid architecture, so the Mamba-related packages need to be installed correctly. In a Jupyter notebook, we first remove the existing PyTorch stack and reinstall the CUDA 12.8 build of PyTorch 2.7.1, which works with the pinned mamba_ssm and causal_conv1d versions used here.
We also install the core fine-tuning libraries, including transformers, trl, accelerate, datasets, peft, and huggingface_hub.
%%capture
!pip install -U packaging ninja
# Replace the current PyTorch stack with the CUDA 12.8 build that works with these Mamba kernel pins.
!pip uninstall -y torch torchvision torchaudio triton
!pip install "torch==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" --index-url https://download.pytorch.org/whl/cu128
!pip install -U "transformers==4.56.2" tokenizers "trl==0.22.2" accelerate datasets peft pandas tqdm huggingface_hub safetensors
!pip install -U --no-build-isolation "mamba_ssm==2.2.5" "causal_conv1d==1.5.2"
After installing the packages, check that CUDA is available and that PyTorch can detect your GPU. This notebook is tuned for a 24GB GPU, so it will warn you if your GPU has less VRAM.
import os
import platform
import torch
print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"PyTorch CUDA build: {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")
if not torch.cuda.is_available():
raise RuntimeError(
"CUDA is not available. Select a RunPod PyTorch image with GPU support."
)
for idx in range(torch.cuda.device_count()):
props = torch.cuda.get_device_properties(idx)
total_gb = props.total_memory / 1024**3
print(
f"GPU {idx}: {props.name} ({total_gb:.1f} GB VRAM, capability {props.major}.{props.minor})"
)
if torch.cuda.get_device_properties(0).total_memory < 24 * 1024**3:
print(
"Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM. Reduce batch sizes on smaller GPUs."
)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
Output:
Python: 3.12.3
PyTorch: 2.7.1+cu128
PyTorch CUDA build: 12.8
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 (23.6 GB VRAM, capability 8.6)
Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM. Reduce batch sizes on smaller GPUs.
Set your Hugging Face token as an environment variable named HF_TOKEN. This lets the notebook download the Nemotron-3 model and later push the fine-tuned LoRA adapter to Hugging Face.
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
raise ValueError(
"Set HF_TOKEN in the RunPod environment before running this notebook."
)
login(token=hf_token)
print("Logged in to Hugging Face.")
2. Loading and Processing the Dataset
Next, we will load the psychology question-answering dataset from Hugging Face. The dataset contains a question column and two response columns: response_j and response_k. For this guide, we will use response_j as the target answer for supervised fine-tuning.
We first load the dataset, shuffle it for reproducibility, and create train, validation, and test splits.
from datasets import DatasetDict, load_dataset
DATASET_ID = "jkhedri/psychology-dataset"
TRAIN_LIMIT = 8000
VALIDATION_LIMIT = 800
TEST_LIMIT = 300
SEED = 42
raw_dataset = load_dataset(DATASET_ID)
raw_train = raw_dataset["train"].shuffle(seed=SEED)
split_1 = raw_train.train_test_split(test_size=0.15, seed=SEED)
split_2 = split_1["test"].train_test_split(test_size=0.33, seed=SEED)
def maybe_limit(split, limit):
if limit is None:
return split
return split.select(range(min(limit, len(split))))
dataset = DatasetDict(
{
"train": maybe_limit(split_1["train"], TRAIN_LIMIT),
"validation": maybe_limit(split_2["train"], VALIDATION_LIMIT),
"test": maybe_limit(split_2["test"], TEST_LIMIT),
}
)
dataset
Output:
DatasetDict({
train: Dataset({
features: ['question', 'response_j', 'response_k'],
num_rows: 8000
})
validation: Dataset({
features: ['question', 'response_j', 'response_k'],
num_rows: 800
})
test: Dataset({
features: ['question', 'response_j', 'response_k'],
num_rows: 300
})
})
Before formatting the dataset for training, check the column names and view one example. This confirms that the dataset loaded correctly and contains the expected question and response fields.
dataset["train"].column_names, dataset["train"][0]
Output:
(
['question', 'response_j', 'response_k'],
{
'question': "I'm experiencing anxiety about social situations and don't know how to cope.",
'response_j': "Social anxiety can be a difficult and isolating experience, but there are effective treatments available. Let's work on developing coping mechanisms, such as deep breathing and mindfulness, and exposure therapy to gradually confront your fears. We can also explore ways to improve social skills and build self-confidence.",
'response_k': "Just avoid social situations. It's not worth the anxiety and discomfort. You can also try using alcohol or drugs to help you feel more comfortable in social settings."
}
)
3. Formatting the Dataset for TRL Fine-Tuning
Now we will convert the dataset into the prompt-completion format expected by TRL. Each example will include a system prompt, the user’s psychology question, and the target assistant response from response_j.
The system prompt tells the model how to respond: be supportive, avoid hidden reasoning traces, give practical suggestions, and avoid acting like a licensed mental health professional.
SYSTEM_PROMPT = """/no_think
You are a supportive psychology question-answering assistant.
Do not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.
Respond with empathy, practical coping suggestions, and clear next steps.
Give a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.
Do not diagnose the user or claim to replace a licensed mental health professional.
If the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.
Keep the answer safe, specific, and directly relevant to the user's question without being overly brief."""
CHAT_TEMPLATE_KWARGS = {"enable_thinking": False}
USER_TEMPLATE = "Question:\n\n{question}"
def clean_text(value):
return " ".join(str(value).strip().split())
def to_prompt_completion(example):
question = clean_text(example["question"])
answer = clean_text(example["response_j"])
return {
"prompt": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": USER_TEMPLATE.format(question=question)},
],
"completion": [
{"role": "assistant", "content": answer},
],
"chat_template_kwargs": CHAT_TEMPLATE_KWARGS,
}
sft_dataset = dataset.map(
to_prompt_completion, remove_columns=dataset["train"].column_names
)
sft_dataset["train"][0]
Output:
{
'prompt': [
{
'role': 'system',
'content': "/no_think\nYou are a supportive psychology question-answering assistant.\nDo not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.\nRespond with empathy, practical coping suggestions, and clear next steps.\nGive a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.\nDo not diagnose the user or claim to replace a licensed mental health professional.\nIf the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.\nKeep the answer safe, specific, and directly relevant to the user's question without being overly brief."
},
{
'role': 'user',
'content': "Question:\n\nI'm experiencing anxiety about social situations and don't know how to cope."
}
],
'completion': [
{
'role': 'assistant',
'content': "Social anxiety can be a difficult and isolating experience, but there are effective treatments available. Let's work on developing coping mechanisms, such as deep breathing and mindfulness, and exposure therapy to gradually confront your fears. We can also explore ways to improve social skills and build self-confidence."
}
],
'chat_template_kwargs': {'enable_thinking': False}
}
4. Loading the Nemotron-3 Base Model
Next, we will load the NVIDIA Nemotron-3 Nano 4B BF16 tokenizer and base model from Hugging Face. We also set the output directory for the LoRA adapter and limit the sequence length to 1024 tokens to keep training manageable on a 24GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
OUTPUT_DIR = "./nemotron-3-nano-4b-bf16-psychology-qa-lora"
MAX_SEQ_LENGTH = 1024
tokenizer = AutoTokenizer.from_pretrained(
MODEL_ID,
token=hf_token,
trust_remote_code=True,
use_fast=True,
)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
token=hf_token,
trust_remote_code=True,
dtype=torch.bfloat16,
device_map="auto",
attn_implementation="eager",
)
base_model.config.use_cache = False
base_model.config.pad_token_id = tokenizer.pad_token_id
base_model.config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.pad_token_id = tokenizer.pad_token_id
base_model.generation_config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.use_cache = False
base_model.generation_config.do_sample = False
base_model.generation_config.top_p = None
base_model.generation_config.min_new_tokens = None
base_model.generation_config.repetition_penalty = 1.08
base_model.generation_config.no_repeat_ngram_size = 4
5. Creating Generation Helper Functions
Before fine-tuning, we will create a few helper functions to test the model’s responses. These functions build the chat prompt, generate an answer, remove any unwanted thinking tags, and store the results in a small comparison table.
import gc
import pandas as pd
from tqdm.auto import tqdm
def clear_cuda_cache():
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
def build_messages(question, system_prompt=SYSTEM_PROMPT):
return [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": USER_TEMPLATE.format(question=clean_text(question)),
},
]
def remove_thinking_text(text):
text = text.strip()
while "<think>" in text and "</think>" in text:
start = text.find("<think>")
end = text.find("</think>", start) + len("</think>")
text = (text[:start] + text[end:]).strip()
if "</think>" in text:
text = text.split("</think>")[-1].strip()
return text.replace("<think>", "").replace("</think>", "").strip()
def generate_answer(
model, tokenizer, question, system_prompt=SYSTEM_PROMPT, max_new_tokens=180
):
messages = build_messages(question, system_prompt)
device = next(model.parameters()).device
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
**CHAT_TEMPLATE_KWARGS,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
)
inputs = {key: value.to(device) for key, value in inputs.items()}
input_len = inputs["input_ids"].shape[-1]
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
use_cache=False,
repetition_penalty=1.08,
no_repeat_ngram_size=4,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
)
decoded = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True).strip()
return remove_thinking_text(decoded)
def generate_sample_table(model, tokenizer, examples, output_column):
rows = []
model.eval()
for ex in tqdm(examples, desc=f"Generating {output_column}", leave=False):
rows.append(
{
"question": clean_text(ex["question"]),
"reference_response_j": clean_text(ex["response_j"]),
output_column: generate_answer(model, tokenizer, ex["question"]),
}
)
return pd.DataFrame(rows)
6. Running a Pre-Fine-Tuning Sample Evaluation
Before training, we will generate a few responses from the base Nemotron-3 model. This gives us a baseline so we can later compare how the model responds before and after LoRA fine-tuning.
Here, we select three examples from the test set and generate answers using the helper function we created earlier.
sample_examples = [dataset["test"][idx] for idx in range(min(3, len(dataset["test"])))]
pre_samples = generate_sample_table(
base_model,
tokenizer,
sample_examples,
"base_model_answer"
)
pre_samples
The output is a small table with the original question, the reference answer from response_j, and the answer generated by the base model. This table will be useful later when we compare it with the fine-tuned model’s responses.

7. Configuring LoRA and Training Settings
Now we will prepare the model for LoRA fine-tuning. We enable gradient checkpointing to reduce memory usage, then create a LoRA configuration that targets all linear layers in the model.
from peft import LoraConfig
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False
lora_config = LoraConfig(
r=32,
lora_alpha=64,
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
target_modules="all-linear",
)
Next, we define the supervised fine-tuning settings using SFTConfig. These settings control the batch size, learning rate, number of epochs, evaluation frequency, saving strategy, and BF16 training.
from trl import SFTConfig, SFTTrainer
training_args = SFTConfig(
output_dir=OUTPUT_DIR,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=8,
learning_rate=5e-5,
weight_decay=0.01,
lr_scheduler_type="linear",
warmup_ratio=0.05,
num_train_epochs=2,
logging_steps=50,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=100,
save_total_limit=2,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
gradient_checkpointing=True,
bf16=True,
fp16=False,
tf32=True,
max_length=MAX_SEQ_LENGTH,
packing=False,
completion_only_loss=True,
remove_unused_columns=False,
dataloader_num_workers=4,
optim="adamw_torch_fused",
report_to="none",
seed=SEED,
)
8. Training and Saving the LoRA Adapter
Now we can create the SFTTrainer, attach the LoRA configuration, and start fine-tuning. Before training, we also check how many parameters are trainable to confirm that the LoRA adapter was attached correctly.
trainer = SFTTrainer(
model=base_model,
args=training_args,
train_dataset=sft_dataset["train"],
eval_dataset=sft_dataset["validation"],
peft_config=lora_config,
processing_class=tokenizer,
)
trainable_params = sum(
param.numel() for param in trainer.model.parameters() if param.requires_grad
)
all_params = sum(param.numel() for param in trainer.model.parameters())
if trainable_params == 0:
raise RuntimeError(
"No trainable LoRA parameters were attached. Check target_modules before training."
)
print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"All parameters visible to trainer: {all_params:,}")
print(f"Trainable percentage: {100 * trainable_params / all_params:.4f}%")
train_result = trainer.train()
trainer.model.eval()
trainer.model.config.use_cache = False
trainer.model.generation_config.use_cache = False
train_result
During training, the training loss and validation loss should gradually reduce. This usually means the model is learning the response style from the dataset.

After training, save the LoRA adapter and tokenizer locally:
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
Then push the fine-tuned adapter to Hugging Face:
HUB_REPO_ID = "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora"
trainer.model.push_to_hub(HUB_REPO_ID, private=False)
tokenizer.push_to_hub(HUB_REPO_ID, private=False)
The fine-tuned adapter is now saved locally and uploaded to Hugging Face under the HUB_REPO_ID.

Source: kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora
9. Comparing Responses Before and After Fine-Tuning
Finally, we will generate answers from the fine-tuned model and compare them with the base model outputs. This helps us see whether LoRA fine-tuning improved the model’s alignment with the reference responses.
post_samples = generate_sample_table(
trainer.model,
tokenizer,
sample_examples,
"fine_tuned_answer"
)
comparison = pre_samples[
["question", "reference_response_j", "base_model_answer"]
].merge(
post_samples[["question", "fine_tuned_answer"]],
on="question",
how="left",
)
for idx, row in comparison.iterrows():
print("=" * 100)
print(f"Sample {idx + 1}")
print("=" * 100)
print("\nQUESTION:\n")
print(row["question"])
print("\nREFERENCE RESPONSE_J:\n")
print(row["reference_response_j"])
print("\nBASE MODEL ANSWER:\n")
print(row["base_model_answer"])
print("\nFINE-TUNED ANSWER:\n")
print(row["fine_tuned_answer"])
print("\n")

The fine-tuned model became more aligned with the reference response style. It was more concise and stayed closer to the dataset answers. However, the base model sometimes gave more detailed and practical responses.
For example, the fine-tuned model improved alignment on stress management and concentration-related questions, but the base model gave a stronger response for the sleep-related example because it included more helpful detail.
Overall, the fine-tuned model is better if your goal is to match the reference dataset style. If your goal is maximum helpfulness, the base model may still perform better in some cases because it can give warmer and more detailed answers.
If you have issues running the code above, refer to the notebook in the Hugging Face repo: fine-tune-nemotron-nano.ipynb
Final Thoughts
Even after fine-tuning 100+ LLMs, this model took more setup work than expected. The main challenge was the mamba_ssm dependency, which can easily break or conflict with an existing local Python environment.
Because of that, I recommend using a clean environment for this workflow. In my case, the easiest path was to rebuild the environment, install the correct PyTorch version, pin the Mamba-related packages, and then run the notebook from there.
Another limitation is quantization. For this setup, I could not simply load the model in 4-bit and fine-tune it like a standard QLoRA workflow, like in my Qwen3.5 Small tutorial. I had to load the full BF16 model and then fine-tune it with LoRA. For a 4B model, this is still manageable on a 24GB GPU, but for 12B models and above, memory usage can quickly become a problem.
That said, consumer GPU fine-tuning has become much more accessible. With a 24GB card like the RTX 3090, it is now possible to adapt strong open models to a specific style or domain without needing a large training cluster.
Overall, the Nemotron-3 Nano is a capable model, but it needs careful environment setup. Once the dependencies are working, it fine-tunes well and can adapt to a new response style with a relatively small number of examples.

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.

