심리학 Q&A 데이터로 NVIDIA Nemotron-3-Nano 미세 조정하기

Hugging Face에서 모델을 다운로드한 뒤 LoRA와 TRL을 사용해 RTX 3090 GPU로 심리학 Q&A 데이터셋에 NVIDIA Nemotron-3-Nano-4B를 미세 조정하는 방법을 배워보세요.

업데이트됨 2026년 4월 29일 · 6분 읽다

NVIDIA Nemotron-3는 추론, 코딩, 대화, 에이전트형 AI 워크플로우를 위해 설계된 NVIDIA의 새로운 오픈 모델 패밀리입니다. Nano, Super, Ultra 등 다양한 크기의 모델이 포함되어 있어, 개발자는 소형이면서 효율적인 모델과 대형 고성능 모델 중에서 선택할 수 있습니다.

Nemotron-3의 핵심 업데이트는 효율성에 대한 집중입니다. 추론과 미세 조정을 더 실용적으로 유지하면서도 강력한 성능을 제공하도록 설계되었습니다. 특히 Nano 버전은 더 큰 모델에 비해 접근하기 쉬운 GPU 환경에서 실행할 수 있어 실험에 유용합니다.

이 가이드에서는 NVIDIA Nemotron-3-Nano-4B를 심리학 질문-답변 데이터셋에 미세 조정합니다. Low-Rank Adaptation (LoRA), Transformers Reinforcement Learning (TRL), 그리고 Hugging Face를 사용해 데이터를 준비하고, 모델을 학습하며, 어댑터를 저장하고, Hugging Face에 업로드한 뒤, 미세 조정 전후의 응답을 비교합니다.

최신 오픈소스 AI 모델 찾기, AI 에이전트 빌드, LLM 미세 조정을 시작하려면 저희의 Hugging Face Fundamentals 스킬 트랙을 추천합니다.

1. 환경 설정

Nemotron-3 Nano는 하이브리드 아키텍처를 사용하므로 Mamba 관련 패키지를 올바르게 설치해야 합니다. Jupyter 노트북에서 먼저 기존 PyTorch 스택을 제거하고, 여기에서 고정된 mamba_ssm 및 causal_conv1d 버전과 호환되는 CUDA 12.8 빌드의 PyTorch 2.7.1을 다시 설치합니다.

또한 핵심 미세 조정 라이브러리인 transformers, trl, accelerate, datasets, peft, huggingface_hub 등을 설치합니다.

%%capture
!pip install -U packaging ninja

# Replace the current PyTorch stack with the CUDA 12.8 build that works with these Mamba kernel pins.
!pip uninstall -y torch torchvision torchaudio triton

!pip install "torch==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" --index-url https://download.pytorch.org/whl/cu128

!pip install -U "transformers==4.56.2" tokenizers "trl==0.22.2" accelerate datasets peft pandas tqdm huggingface_hub safetensors

!pip install -U --no-build-isolation "mamba_ssm==2.2.5" "causal_conv1d==1.5.2"

패키지 설치 후 CUDA 사용 가능 여부와 PyTorch가 GPU를 감지하는지 확인하세요. 이 노트북은 24GB GPU에 맞춰져 있어, VRAM이 더 적은 GPU에서는 경고가 표시됩니다.

import os
import platform
import torch

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"PyTorch CUDA build: {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")

if not torch.cuda.is_available():
   raise RuntimeError(
       "CUDA is not available. Select a RunPod PyTorch image with GPU support."
   )

for idx in range(torch.cuda.device_count()):
   props = torch.cuda.get_device_properties(idx)
   total_gb = props.total_memory / 1024**3
   print(
       f"GPU {idx}: {props.name} ({total_gb:.1f} GB VRAM, capability {props.major}.{props.minor})"
   )

if torch.cuda.get_device_properties(0).total_memory < 24 * 1024**3:
   print(
       "Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM. Reduce batch sizes on smaller GPUs."
   )

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

출력:

Python: 3.12.3
PyTorch: 2.7.1+cu128
PyTorch CUDA build: 12.8
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 (23.6 GB VRAM, capability 8.6)
Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM. Reduce batch sizes on smaller GPUs.

HF_TOKEN이라는 환경 변수로 Hugging Face 토큰을 설정하세요. 이렇게 하면 노트북이 Nemotron-3 모델을 다운로드하고, 이후 미세 조정된 LoRA 어댑터를 Hugging Face에 푸시할 수 있습니다.

from huggingface_hub import login

hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
   raise ValueError(
       "Set HF_TOKEN in the RunPod environment before running this notebook."
   )

login(token=hf_token)
print("Logged in to Hugging Face.")

2. 데이터셋 로딩 및 전처리

다음으로 Hugging Face에서 심리학 질문-답변 데이터셋을 로드합니다. 데이터셋에는 question 열과 두 개의 응답 열인 response_j와 response_k가 있습니다. 이 가이드에서는 감독 학습 미세 조정의 타깃 답변으로 response_j를 사용합니다.

먼저 데이터셋을 로드하고, 재현성을 위해 셔플한 뒤, 학습/검증/테스트 분할을 만듭니다.

from datasets import DatasetDict, load_dataset

DATASET_ID = "jkhedri/psychology-dataset"
TRAIN_LIMIT = 8000
VALIDATION_LIMIT = 800
TEST_LIMIT = 300
SEED = 42

raw_dataset = load_dataset(DATASET_ID)
raw_train = raw_dataset["train"].shuffle(seed=SEED)

split_1 = raw_train.train_test_split(test_size=0.15, seed=SEED)
split_2 = split_1["test"].train_test_split(test_size=0.33, seed=SEED)


def maybe_limit(split, limit):
    if limit is None:
        return split
    return split.select(range(min(limit, len(split))))


dataset = DatasetDict(
    {
        "train": maybe_limit(split_1["train"], TRAIN_LIMIT),
        "validation": maybe_limit(split_2["train"], VALIDATION_LIMIT),
        "test": maybe_limit(split_2["test"], TEST_LIMIT),
    }
)

dataset

출력:

DatasetDict({
    train: Dataset({
        features: ['question', 'response_j', 'response_k'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['question', 'response_j', 'response_k'],
        num_rows: 800
    })
    test: Dataset({
        features: ['question', 'response_j', 'response_k'],
        num_rows: 300
    })
})

학습에 맞게 포맷팅하기 전에 열 이름을 확인하고 예시 하나를 살펴봅니다. 이를 통해 데이터셋이 올바르게 로드되었고 기대한 질문과 응답 필드를 포함하는지 확인할 수 있습니다.

dataset["train"].column_names, dataset["train"][0]

출력:

(
    ['question', 'response_j', 'response_k'],
    {
        'question': "I'm experiencing anxiety about social situations and don't know how to cope.",
        'response_j': "Social anxiety can be a difficult and isolating experience, but there are effective treatments available. Let's work on developing coping mechanisms, such as deep breathing and mindfulness, and exposure therapy to gradually confront your fears. We can also explore ways to improve social skills and build self-confidence.",
        'response_k': "Just avoid social situations. It's not worth the anxiety and discomfort. You can also try using alcohol or drugs to help you feel more comfortable in social settings."
    }
)

3. TRL 미세 조정을 위한 데이터셋 포맷팅

이제 TRL이 기대하는 프롬프트-완성 형식으로 데이터셋을 변환합니다. 각 예시는 시스템 프롬프트, 사용자의 심리학 질문, 그리고 response_j에서 가져온 타깃 어시스턴트 응답을 포함합니다.

시스템 프롬프트는 모델의 응답 방식을 안내합니다. 지지적이고 공감적으로 답하며, 숨은 추론 흔적을 피하고, 실용적인 제안을 제공하고, 공인 정신건강 전문가처럼 행동하지 않도록 지시합니다.

SYSTEM_PROMPT = """/no_think
You are a supportive psychology question-answering assistant.
Do not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.
Respond with empathy, practical coping suggestions, and clear next steps.
Give a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.
Do not diagnose the user or claim to replace a licensed mental health professional.
If the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.
Keep the answer safe, specific, and directly relevant to the user's question without being overly brief."""

CHAT_TEMPLATE_KWARGS = {"enable_thinking": False}
USER_TEMPLATE = "Question:\n\n{question}"


def clean_text(value):
   return " ".join(str(value).strip().split())


def to_prompt_completion(example):
   question = clean_text(example["question"])
   answer = clean_text(example["response_j"])

   return {
       "prompt": [
           {"role": "system", "content": SYSTEM_PROMPT},
           {"role": "user", "content": USER_TEMPLATE.format(question=question)},
       ],
       "completion": [
           {"role": "assistant", "content": answer},
       ],
       "chat_template_kwargs": CHAT_TEMPLATE_KWARGS,
   }


sft_dataset = dataset.map(
   to_prompt_completion, remove_columns=dataset["train"].column_names
)

sft_dataset["train"][0]

출력:

{
   'prompt': [
       {
           'role': 'system',
           'content': "/no_think\nYou are a supportive psychology question-answering assistant.\nDo not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.\nRespond with empathy, practical coping suggestions, and clear next steps.\nGive a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.\nDo not diagnose the user or claim to replace a licensed mental health professional.\nIf the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.\nKeep the answer safe, specific, and directly relevant to the user's question without being overly brief."
       },
       {
           'role': 'user',
           'content': "Question:\n\nI'm experiencing anxiety about social situations and don't know how to cope."
       }
   ],
   'completion': [
       {
           'role': 'assistant',
           'content': "Social anxiety can be a difficult and isolating experience, but there are effective treatments available. Let's work on developing coping mechanisms, such as deep breathing and mindfulness, and exposure therapy to gradually confront your fears. We can also explore ways to improve social skills and build self-confidence."
       }
   ],
   'chat_template_kwargs': {'enable_thinking': False}
}

4. Nemotron-3 베이스 모델 로딩

다음으로 Hugging Face에서 NVIDIA Nemotron-3 Nano 4B BF16 토크나이저와 베이스 모델을 로드합니다. 또한 LoRA 어댑터의 출력 디렉터리를 설정하고, 24GB GPU에서 학습을 관리하기 위해 시퀀스 길이를 1024 토큰으로 제한합니다.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
OUTPUT_DIR = "./nemotron-3-nano-4b-bf16-psychology-qa-lora"
MAX_SEQ_LENGTH = 1024

tokenizer = AutoTokenizer.from_pretrained(
   MODEL_ID,
   token=hf_token,
   trust_remote_code=True,
   use_fast=True,
)

if tokenizer.pad_token is None:
   tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
   MODEL_ID,
   token=hf_token,
   trust_remote_code=True,
   dtype=torch.bfloat16,
   device_map="auto",
   attn_implementation="eager",
)

base_model.config.use_cache = False
base_model.config.pad_token_id = tokenizer.pad_token_id
base_model.config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.pad_token_id = tokenizer.pad_token_id
base_model.generation_config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.use_cache = False
base_model.generation_config.do_sample = False
base_model.generation_config.top_p = None
base_model.generation_config.min_new_tokens = None
base_model.generation_config.repetition_penalty = 1.08
base_model.generation_config.no_repeat_ngram_size = 4

5. 응답 생성을 위한 헬퍼 함수 만들기

미세 조정 전에 모델의 응답을 테스트하기 위한 몇 가지 헬퍼 함수를 만듭니다. 이 함수들은 채팅 프롬프트를 구성하고, 답변을 생성하며, 원치 않는 thinking 태그를 제거하고, 결과를 작은 비교 표에 저장합니다.

import gc
import pandas as pd
from tqdm.auto import tqdm


def clear_cuda_cache():
   gc.collect()
   if torch.cuda.is_available():
       torch.cuda.empty_cache()


def build_messages(question, system_prompt=SYSTEM_PROMPT):
   return [
       {"role": "system", "content": system_prompt},
       {
           "role": "user",
           "content": USER_TEMPLATE.format(question=clean_text(question)),
       },
   ]


def remove_thinking_text(text):
   text = text.strip()
   while "<think>" in text and "</think>" in text:
       start = text.find("<think>")
       end = text.find("</think>", start) + len("</think>")
       text = (text[:start] + text[end:]).strip()

   if "</think>" in text:
       text = text.split("</think>")[-1].strip()

   return text.replace("<think>", "").replace("</think>", "").strip()


def generate_answer(
   model, tokenizer, question, system_prompt=SYSTEM_PROMPT, max_new_tokens=180
):
   messages = build_messages(question, system_prompt)
   device = next(model.parameters()).device

   inputs = tokenizer.apply_chat_template(
       messages,
       tokenize=True,
       **CHAT_TEMPLATE_KWARGS,
       add_generation_prompt=True,
       return_dict=True,
       return_tensors="pt",
   )

   inputs = {key: value.to(device) for key, value in inputs.items()}
   input_len = inputs["input_ids"].shape[-1]

   with torch.no_grad():
       outputs = model.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           do_sample=False,
           use_cache=False,
           repetition_penalty=1.08,
           no_repeat_ngram_size=4,
           pad_token_id=tokenizer.pad_token_id,
           eos_token_id=tokenizer.eos_token_id,
       )

   decoded = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True).strip()

   return remove_thinking_text(decoded)


def generate_sample_table(model, tokenizer, examples, output_column):
   rows = []
   model.eval()

   for ex in tqdm(examples, desc=f"Generating {output_column}", leave=False):
       rows.append(
           {
               "question": clean_text(ex["question"]),
               "reference_response_j": clean_text(ex["response_j"]),
               output_column: generate_answer(model, tokenizer, ex["question"]),
           }
       )

   return pd.DataFrame(rows)

6. 미세 조정 전 샘플 평가 실행

학습 전에 베이스 Nemotron-3 모델로 몇 가지 응답을 생성합니다. 이렇게 하면 나중에 LoRA 미세 조정 전후의 응답을 비교할 수 있는 기준선을 얻을 수 있습니다.

여기서는 테스트 세트에서 예시 세 개를 선택해 앞서 만든 헬퍼 함수를 사용해 답변을 생성합니다.

sample_examples = [dataset["test"][idx] for idx in range(min(3, len(dataset["test"])))]

pre_samples = generate_sample_table(
   base_model,
   tokenizer,
   sample_examples,
   "base_model_answer"
)

pre_samples

출력은 원래 질문, response_j의 기준 답변, 베이스 모델이 생성한 답변을 포함하는 작은 표입니다. 이 표는 나중에 미세 조정 모델의 응답과 비교할 때 유용합니다.

7. LoRA 및 학습 설정 구성

이제 LoRA 미세 조정에 대비해 모델을 준비합니다. 메모리 사용량을 줄이기 위해 그래디언트 체크포인팅을 활성화하고, 모델의 모든 선형 계층을 대상으로 하는 LoRA 구성을 만듭니다.

from peft import LoraConfig

base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False

lora_config = LoraConfig(
   r=32,
   lora_alpha=64,
   lora_dropout=0.1,
   bias="none",
   task_type="CAUSAL_LM",
   target_modules="all-linear",
)

다음으로 SFTConfig를 사용해 감독 학습 미세 조정 설정을 정의합니다. 이 설정은 배치 크기, 학습률, 에폭 수, 평가 빈도, 저장 전략, BF16 학습 등을 제어합니다.

from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
   output_dir=OUTPUT_DIR,
   per_device_train_batch_size=8,
   per_device_eval_batch_size=8,
   gradient_accumulation_steps=8,
   learning_rate=5e-5,
   weight_decay=0.01,
   lr_scheduler_type="linear",
   warmup_ratio=0.05,
   num_train_epochs=2,
   logging_steps=50,
   eval_strategy="steps",
   eval_steps=50,
   save_strategy="steps",
   save_steps=100,
   save_total_limit=2,
   load_best_model_at_end=True,
   metric_for_best_model="eval_loss",
   greater_is_better=False,
   gradient_checkpointing=True,
   bf16=True,
   fp16=False,
   tf32=True,
   max_length=MAX_SEQ_LENGTH,
   packing=False,
   completion_only_loss=True,
   remove_unused_columns=False,
   dataloader_num_workers=4,
   optim="adamw_torch_fused",
   report_to="none",
   seed=SEED,
)

8. LoRA 어댑터 학습 및 저장

이제 SFTTrainer를 생성하고 LoRA 구성을 연결한 뒤 미세 조정을 시작합니다. 학습 전에 학습 가능한 파라미터 수를 확인해 LoRA 어댑터가 올바르게 연결되었는지 검증합니다.

trainer = SFTTrainer(
   model=base_model,
   args=training_args,
   train_dataset=sft_dataset["train"],
   eval_dataset=sft_dataset["validation"],
   peft_config=lora_config,
   processing_class=tokenizer,
)

trainable_params = sum(
   param.numel() for param in trainer.model.parameters() if param.requires_grad
)

all_params = sum(param.numel() for param in trainer.model.parameters())

if trainable_params == 0:
   raise RuntimeError(
       "No trainable LoRA parameters were attached. Check target_modules before training."
   )

print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"All parameters visible to trainer: {all_params:,}")
print(f"Trainable percentage: {100 * trainable_params / all_params:.4f}%")

train_result = trainer.train()

trainer.model.eval()
trainer.model.config.use_cache = False
trainer.model.generation_config.use_cache = False

train_result

학습 중에는 트레이닝 손실과 검증 손실이 점진적으로 감소하는 것이 일반적입니다. 이는 모델이 데이터셋의 응답 스타일을 학습하고 있음을 의미합니다.

학습 후 LoRA 어댑터와 토크나이저를 로컬에 저장합니다:

trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

그런 다음 미세 조정된 어댑터를 Hugging Face에 푸시합니다:

HUB_REPO_ID = "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora"

trainer.model.push_to_hub(HUB_REPO_ID, private=False)
tokenizer.push_to_hub(HUB_REPO_ID, private=False)

미세 조정된 어댑터는 이제 로컬에 저장되었으며, HUB_REPO_ID로 지정된 Hugging Face에 업로드되었습니다.

출처: kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora

9. 미세 조정 전후 응답 비교

마지막으로 미세 조정된 모델로 답변을 생성하고 베이스 모델의 출력과 비교합니다. 이를 통해 LoRA 미세 조정이 기준 응답과의 정렬을 개선했는지 확인할 수 있습니다.

post_samples = generate_sample_table(
   trainer.model,
   tokenizer,
   sample_examples,
   "fine_tuned_answer"
)

comparison = pre_samples[
   ["question", "reference_response_j", "base_model_answer"]
].merge(
   post_samples[["question", "fine_tuned_answer"]],
   on="question",
   how="left",
)

for idx, row in comparison.iterrows():
   print("=" * 100)
   print(f"Sample {idx + 1}")
   print("=" * 100)
   print("\nQUESTION:\n")
   print(row["question"])
   print("\nREFERENCE RESPONSE_J:\n")
   print(row["reference_response_j"])
   print("\nBASE MODEL ANSWER:\n")
   print(row["base_model_answer"])
   print("\nFINE-TUNED ANSWER:\n")
   print(row["fine_tuned_answer"])
   print("\n")

미세 조정된 모델은 기준 응답 스타일과의 정렬이 더 잘 맞았습니다. 더 간결해졌고 데이터셋의 답변에 더 가깝게 유지되었습니다. 그러나 베이스 모델이 때때로 더 상세하고 실용적인 응답을 제공하기도 했습니다.

예를 들어, 스트레스 관리와 집중력 관련 질문에서는 미세 조정 모델의 정렬이 향상되었지만, 수면 관련 예시에서는 베이스 모델이 더 도움이 되는 세부 정보를 포함해 더 강한 응답을 제공했습니다.

전반적으로 기준 데이터셋의 스타일을 맞추는 것이 목표라면 미세 조정 모델이 더 적합합니다. 반면 최대한의 유용성이 목표라면, 베이스 모델이 더 따뜻하고 상세한 답변을 제공하는 경우가 있어 일부 상황에서는 여전히 더 나은 성능을 보일 수 있습니다.

위 코드 실행에 문제가 있다면 Hugging Face 저장소의 노트북을 참고하세요: fine-tune-nemotron-nano.ipynb

마무리 생각

LLM 100개 이상을 미세 조정해봤지만, 이번 모델은 예상보다 설정 작업이 더 많았습니다. 주된 과제는 기존 로컬 Python 환경과 쉽게 충돌하거나 깨질 수 있는 mamba_ssm 의존성이었습니다.

그 때문에 이 워크플로우에는 깨끗한 환경 사용을 권장합니다. 제 경우, 환경을 재구성하고 올바른 PyTorch 버전을 설치한 뒤, Mamba 관련 패키지를 고정하고 그 상태에서 노트북을 실행하는 것이 가장 쉬웠습니다.

또 다른 제약은 양자화입니다. 이 설정에서는 제 Qwen3.5 Small 튜토리얼처럼 4비트로 모델을 단순 로드해 표준 QLoRA 워크플로우로 미세 조정할 수 없었습니다. 전체 BF16 모델을 로드한 다음 LoRA로 미세 조정해야 했습니다. 4B 모델의 경우 24GB GPU에서도 충분히 관리 가능하지만, 12B 이상 모델부터는 메모리 사용량이 빠르게 문제가 될 수 있습니다.

그럼에도 소비자용 GPU에서의 미세 조정은 훨씬 더 접근 가능해졌습니다. RTX 3090 같은 24GB 카드로도 이제는 강력한 오픈 모델을 특정 스타일이나 도메인에 맞게 조정할 수 있으며, 대규모 학습 클러스터가 필요하지 않습니다.

종합하면, Nemotron-3 Nano는 역량 있는 모델이지만 세심한 환경 설정이 필요합니다. 의존성이 정상 동작하면, 비교적 적은 예시 수만으로도 새로운 응답 스타일에 잘 적응하도록 미세 조정할 수 있습니다.

주제

인공지능

대규모 언어 모델