Zero-Shot Classification: How It Works and When to Use It

Learn what zero-shot classification is, how it works under the hood with NLI models, how it compares to few-shot and fine-tuning, and how to apply it with Hugging Face Transformers.
Jun 11, 2026 · 15 min read

What happens when you need to classify text into a new category, but you don't have a single labeled example to train on?

Well, you can't go the traditional classifier route. Traditional classifiers expect labeled examples for every category you want to predict, which means weeks of annotation work before you can start training. And as soon as a new category shows up, you're back to labeling.

Zero-shot classification is what you should look at. It skips the entire labeling step by letting a model assign labels it has never seen during training. In practice, this means you can sort customer feedback into categories like "billing complaint" or "feature request" without preparing a single training example for either.

In this article, I'll walk you through how zero-shot classification works, how it compares to traditional and few-shot approaches, and how to apply it to real NLP tasks using Hugging Face Transformers.

What is Hugging Face exactly? Enroll in our Hugging Face Fundamentals track to build AI agents and fine-tune LLMs.

What Is Zero-Shot Classification?

Zero-shot classification is a machine learning approach where a model assigns labels to data without being trained specifically on those labels.

The key part here is the term "zero-shot". It means the model sees zero training examples for the categories you want to predict. You give it a piece of text and a list of possible labels, and it picks the best match based on what it already knows.

All of this knowledge comes from pretraining. Large models pick up a broad understanding of language and concepts from massive text corpora, and that general knowledge is what they fall back on when you ask them to classify something new.

How Zero-Shot Classification Works

The workflow behind zero-shot classification is simple once you break it into steps. You only need four:

  1. Input the text or data: This is whatever you want to classify - a customer review, a support ticket, a news headline, a chunk of documentation.

  2. Provide the candidate labels: You give the model a list of possible categories. These are just plain words or short phrases, like "product question", "refund request", "technical issue", or "general inquiry".

  3. Model evaluates the relationship between input and labels: The model looks at each label and scores how well it fits the input based on its pretrained understanding of language.

  4. The most likely label is selected: The highest-scoring label is returned as the prediction, often with a confidence score for each candidate.

The whole point is that you can swap the candidate labels at any time without retraining. If tomorrow you want to add a new category, just add it to the list. It’s kind of mind-blowing if you’ve only worked with supervised learning.

Zero-Shot Classification vs Traditional Classification

Both approaches solve the same problem (assigning labels to data), but they get there in different ways.

Traditional classification

Traditional classifiers learn from labeled examples. You collect a dataset where each item is tagged with the correct category, train a model on it, and the model learns the patterns that separate one class from another.

This has worked well for decades, but has two big constraints:

  • You need labeled training data: Often, a lot of it. Collecting and annotating that data takes time and money.
  • The label set is fixed: Once the model is trained, it can only predict the categories it saw during training. Adding a new class means going back and retraining.

Zero-shot classification

Zero-shot works the opposite way. There's no task-specific training step at all. The model uses what it learned during pretraining to evaluate any label you throw at it.

That gives you two advantages:

  • No task-specific training: You entirely skip data collection, annotation, and training.
  • Flexible label set: You can change the categories on the go. Add new ones, remove old ones, rename them - the model handles it without any retraining.

The tradeoffs

This flexibility doesn't come for free. A traditional classifier trained on a solid labeled dataset will usually outperform a zero-shot model on that specific task. It has seen exactly the kind of examples you care about and has tuned itself to them.

Zero-shot models are generalists. They're good at a lot of things, but they're rarely the best at any one thing. So the choice comes down to what you need.

If you have labeled data and care about top accuracy on a fixed set of categories, train a traditional classifier. If you don't have labeled data, or your categories change often, zero-shot is the faster path to a working solution.

The Role of Foundation Models and LLMs

Zero-shot classification became popular because large pretrained models got good enough to handle it.

Before foundation models, you couldn't just give a model a list of labels it had never seen and expect reasonable predictions. The model didn't know enough about language to make the connection. Pretraining on huge text corpora changed that. A model that has read a good chunk of the internet has already encountered words like "refund" or "complaint" in countless contexts, so matching them to new inputs becomes possible.

A few families of models are behind most zero-shot workflows today:

  • BERT and its variants: BERT-style models learn deep representations of text during pretraining. Variants like RoBERTa and DeBERTa pushed this further with better training methods and larger datasets.
  • NLI-based models: These are models fine-tuned on natural language inference tasks. They're behind most off-the-shelf zero-shot pipelines, and in the next section, I’ll explain why.
  • Modern LLMs: Large language models like the GPT family or Claude can handle zero-shot classification through prompting. You describe the task in plain language, list the categories, and the model picks one.

The common thread is scale and generality. A model trained on a narrow task can only do that task. A model trained on broad text data can be redirected toward many tasks without ever seeing labeled examples for them.

Natural Language Inference (NLI) and Zero-Shot Classification

Most zero-shot classifiers you'll see in practice are built on NLI models. This is the part that surprises people, so it's worth slowing down.

Natural language inference is its own task. Given two sentences (a premise and a hypothesis), the model decides the relationship between them. The output is one of three labels:

  • Entailment: The hypothesis follows from the premise.
  • Contradiction: The hypothesis contradicts the premise.
  • Neutral: The two sentences are unrelated, or there's not enough information to decide.

For example, if the premise is "The team finished the project two weeks early" and the hypothesis is "The team delivered on time", an NLI model should predict entailment. If the hypothesis is "The team missed the deadline", it should predict a contradiction.

This setup turns out to be a great fit for zero-shot classification. If you treat your input as the premise, you can turn each candidate label into a hypothesis.

Let's say you want to classify the sentence "My package never arrived" into one of three categories: shipping issue, billing issue, or product question. The model doesn't see these as labels. It sees them as hypotheses, usually wrapped in a simple template like "This text is about {label}":

  • Premise: "My package never arrived" | Hypothesis: "This text is about a shipping issue"

  • Premise: "My package never arrived" | Hypothesis: "This text is about a billing issue"

  • Premise: "My package never arrived" | Hypothesis: "This text is about a product question"

The NLI model scores each pair. The hypothesis with the highest entailment score wins, and that label becomes the prediction.

The model never had to learn what "shipping issue" or "billing issue" means as a label. It only had to learn what entailment looks like in general, and it picked that up during NLI fine-tuning.

This is why NLI-based zero-shot works so well. The model is doing the task it was trained for (judging entailment), and you're just framing your classification problem as a series of entailment questions.
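The framing above can be sketched in a few lines of plain Python. This is a minimal illustration, not the internals of any particular library: the entailment logits are made-up numbers standing in for what an NLI model would return, and `build_pairs` and `pick_label` are hypothetical helper names.

```python
import math

def build_pairs(premise, labels, template="This text is about a {}"):
    """Turn each candidate label into a (premise, hypothesis) pair."""
    return [(premise, template.format(label)) for label in labels]

def softmax(values):
    """Normalize raw scores so they are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(labels, entailment_logits):
    """Return the label whose hypothesis has the highest entailment score."""
    scores = softmax(entailment_logits)
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], scores

labels = ["shipping issue", "billing issue", "product question"]
pairs = build_pairs("My package never arrived", labels)

# Made-up entailment logits, one per pair; a real NLI model would produce these.
entailment_logits = [3.1, -1.2, -0.4]
prediction, scores = pick_label(labels, entailment_logits)

for (premise, hypothesis), score in zip(pairs, scores):
    print(f"{hypothesis!r}: {score:.3f}")
print("Prediction:", prediction)
```

Swapping the `labels` list is all it takes to classify into different categories, because the only thing that changes is the set of hypotheses.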

Zero-Shot vs Few-Shot vs Fine-Tuned Models

Zero-shot isn't the only way to get a model to do a classification task without a full training run. Let me compare it with the alternatives.

Zero-shot

Zero-shot means no examples at all. You hand the model an input and a list of candidate labels, and it makes a prediction based on what it learned during pretraining.

The model has never seen labeled data for your specific task. It's working entirely from general knowledge.

Few-shot

Few-shot gives the model a couple of examples, usually inside the prompt. You show it two, five, maybe ten examples of inputs paired with the correct labels, then ask it to classify a new input the same way.

The model isn't being retrained here. It's still using its pretrained weights. The examples just serve as a reference - a quick "this is what I mean by these categories" before you ask for a prediction.
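To make the difference concrete, here is a rough sketch of a prompt builder. It assumes you're prompting an LLM with plain text; the `build_prompt` function and the prompt wording are hypothetical, not a format any specific model requires. With no examples it produces a zero-shot prompt, and with a few examples it produces a few-shot one.

```python
def build_prompt(text, labels, examples=()):
    """Build a classification prompt; with no examples it is a zero-shot prompt."""
    lines = [
        "Classify the text into exactly one of these categories: "
        + ", ".join(labels) + ".",
        "",
    ]
    # Each example is an (input, correct label) pair shown to the model as a reference.
    for example_text, example_label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Category: {example_label}")
        lines.append("")
    # The new input goes last, with the category left blank for the model to fill in.
    lines.append(f"Text: {text}")
    lines.append("Category:")
    return "\n".join(lines)

labels = ["billing complaint", "feature request"]
examples = [
    ("I was charged twice this month.", "billing complaint"),
    ("Please add a dark mode to the app.", "feature request"),
]

zero_shot_prompt = build_prompt("Can you support CSV export?", labels)
few_shot_prompt = build_prompt("Can you support CSV export?", labels, examples)
print(few_shot_prompt)
```

The model weights are identical in both cases. Only the prompt changes.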

Fine-tuning

Fine-tuning is a dedicated training process. You take a pretrained model, feed it a labeled dataset for your task, and update its weights until it gets good at predicting your specific categories.

This is the heaviest of the three. You need labeled data, training infrastructure, and time. In return, the model becomes specialized for your task.

Comparing the three

The three approaches differ across three things that matter in practice: accuracy, flexibility, and cost.

Approach | Accuracy | Flexibility | Cost
Zero-shot | Lowest of the three, but still good for general use | Highest, you can change labels anytime | Lowest, no data and no training
Few-shot | Better than zero-shot, especially with well-chosen examples | High, you can change examples and labels in the prompt | Low, you only need a few examples
Fine-tuning | Highest on the task it was trained for | Lowest, retraining required for new categories | Highest, includes data collection, annotation, and training

Zero-shot learning compared to alternatives

A few things are worth pulling out from this table.

Accuracy isn't a clean ranking. Fine-tuning wins when you have enough labeled data and your task is stable. But on a brand new task with no examples to learn from, fine-tuning isn't even an option, and zero-shot becomes the only realistic choice.

Flexibility moves in the opposite direction. Zero-shot lets you change categories whenever you want. Fine-tuned models are locked into the label set they were trained on.

Cost is the most obvious one. Zero-shot costs almost nothing to set up. Few-shot adds a small annotation step. Fine-tuning is a project on its own and needs dedicated infrastructure.

The workflow most teams settle on is starting with zero-shot to see if the task is even doable. If accuracy isn't good enough, move to few-shot. If that's still not enough and the task matters, fine-tune.

Zero-Shot Classification in NLP

Zero-shot classification shows up in a lot of NLP workflows, especially the ones where labeled data is hard to come by or the categories keep shifting. Here are the most common applications.

Sentiment analysis

Sentiment analysis is the textbook starting point. You feed the model a piece of text and ask it to pick from labels like "positive", "negative", and "neutral".

The interesting part is how easy it is to go beyond the standard three. A traditional sentiment classifier is locked into whatever it was trained on. With zero-shot, you can use more specific labels like "frustrated", "satisfied", "confused", or "excited" and the model handles them without any retraining. This is great for product feedback or social media monitoring where the emotional categories you care about depend on the context.

Topic classification

Topic classification sorts documents into subject areas. News articles into "politics", "sports", "technology", "finance". Support tickets into "billing", "shipping", "account access", "feature request".

Zero-shot makes this trivial to set up. You don't need a labeled dataset for every new topic. If your product launches a new feature and you want to track tickets about it, you just add "new feature feedback" to your candidate label list and you're done.

Intent detection

Intent detection figures out what a user is trying to do. It's the engine behind most chatbots and voice assistants. When someone types "I need to change my password", the model needs to recognize the intent as "password reset" rather than, say, "general security question".

This is where zero-shot shines. Real products keep adding new user intents over time, and retraining an intent classifier every time the product team adds a feature is a lot of work. Zero-shot lets you keep the intent list current without changing the model.

Content moderation

Content moderation flags problematic text - things like "hate speech", "spam", "harassment", or "misinformation". Platforms have to keep their policies up to date, and the categories change as new types of abuse appear.

Zero-shot is a great fit here. Moderation teams can adjust label definitions or add new categories as policies evolve, without going back to the engineering team for a retraining cycle. It's usually paired with traditional classifiers for high-volume cases, but zero-shot handles the long tail of categories that don't have enough labeled examples to train on.

Zero-Shot Classification with Hugging Face Transformers

Hugging Face's transformers library is the easiest way to try zero-shot classification in Python. The pipeline API hides almost all of the model-loading work, so you can go from zero to predictions in a couple of lines of code.

The first time you run this code snippet, the model will be downloaded, so it’ll take some time.

Here's a complete example you can run:

from transformers import pipeline
from pprint import pprint

# Load a zero-shot classification pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# The text you want to classify
text = "My package never arrived and customer support hasn't responded in three days."

# The labels you want the model to choose from
candidate_labels = [
    "shipping issue",
    "billing issue",
    "product question",
    "general inquiry"
]

# Run the classification
result = classifier(text, candidate_labels)

pprint(result)

The output is a dictionary with three fields: sequence (the original input), labels (the candidate labels sorted from most to least likely), and scores (the confidence score for each label, in the same order).

Figure: Pipeline output

"shipping issue" wins with a confidence of 0.82. The model never saw labeled examples of shipping complaints during training. The prediction comes from bart-large-mnli, which was fine-tuned on NLI data and scores each label as a hypothesis, as covered in the earlier section.

There are a few things worth pointing out about this workflow:

  • The model is an NLI model: bart-large-mnli is BART fine-tuned on the MultiNLI dataset. When you call the pipeline, it converts your labels into hypotheses behind the scenes and runs them through the model.

  • You can change labels at any time: Nothing about the model needs to change. Swap the candidate_labels list for a different one and you're classifying into different categories.

  • Multi-label is one flag away: Set multi_label=True when calling the pipeline, and the labels won't compete with each other. Each label gets an independent probability, so a single input can belong to multiple categories.

result = classifier(text, candidate_labels, multi_label=True)
pprint(result)

Figure: Multi-label output
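Here's a simplified sketch of why the scores behave differently in the two modes. It assumes the NLI model returns an entailment logit and a contradiction logit for each label. The logits below are made-up values, the helper names are hypothetical, and the actual pipeline internals may differ in detail.

```python
import math

def softmax(values):
    """Normalize raw scores so they are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def single_label_scores(logits_per_label):
    """Labels compete: softmax over the entailment logits of all labels."""
    return softmax([logits["entailment"] for logits in logits_per_label])

def multi_label_scores(logits_per_label):
    """Labels are independent: entailment vs. contradiction, one label at a time."""
    return [
        softmax([logits["contradiction"], logits["entailment"]])[1]
        for logits in logits_per_label
    ]

labels = ["shipping issue", "billing issue", "general inquiry"]
# Made-up logits for illustration only.
nli_logits = [
    {"entailment": 3.0, "contradiction": -2.0},
    {"entailment": -1.5, "contradiction": 2.5},
    {"entailment": 2.0, "contradiction": -1.0},
]

single = single_label_scores(nli_logits)
multi = multi_label_scores(nli_logits)
for label, s, m in zip(labels, single, multi):
    print(f"{label}: single={s:.3f} multi={m:.3f}")
```

In single-label mode the scores sum to 1, so a strong second label pulls the top score down. In multi-label mode each score stands on its own, which is why two labels can both score high for the same input.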

That's the whole workflow. Load the pipeline, define your labels, call the classifier. Much faster and easier than building a classification model from scratch.

Common Mistakes with Zero-Shot Classification

Zero-shot classification is easy to set up, which is exactly why it's easy to misuse. Here are the mistakes that come up most often in real projects.

Choosing vague labels

Labels are how you communicate the task to the model. Vague labels lead to vague predictions.

"good" and "bad" won't tell the model much. "customer is happy with the product" and "customer is reporting a problem with the product" give the model something to work with. The more your labels look like meaningful phrases, the better the model will score them against your inputs.

Also, avoid labels that overlap. If "complaint" and "negative feedback" both appear in your candidate list, the model will split its confidence between them and neither will come out as a clear winner.
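One rough way to spot overlapping labels is to look at the gap between the top two scores. The sketch below assumes a result shaped like the pipeline output (a dictionary with labels and scores). The scores are made up for illustration, and `top_margin`, `is_ambiguous`, and the 0.15 cutoff are hypothetical choices you'd tune on your own data.

```python
def top_margin(result):
    """Gap between the best and second-best scores in a pipeline-style result."""
    ranked = sorted(result["scores"], reverse=True)
    return ranked[0] - ranked[1]

def is_ambiguous(result, min_margin=0.15):
    """Flag predictions where no label is a clear winner."""
    return top_margin(result) < min_margin

# Hypothetical results shaped like the pipeline output; the scores are made up.
overlapping = {
    "labels": ["complaint", "negative feedback", "feature request"],
    "scores": [0.46, 0.44, 0.10],
}
distinct = {
    "labels": ["customer is reporting a problem", "customer is requesting a feature"],
    "scores": [0.91, 0.09],
}

print(is_ambiguous(overlapping))
print(is_ambiguous(distinct))
```

If many inputs come back flagged as ambiguous between the same two labels, that's a hint the labels mean nearly the same thing and should be merged or reworded.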

Assuming zero-shot matches fine-tuning performance

This is the most common one. Zero-shot is good, but it's not magic.

A fine-tuned model trained on a few thousand labeled examples for your specific task will almost always beat a zero-shot model on that task. If you're seeing 85% accuracy with zero-shot and you need 95% for production, label tweaking alone is unlikely to close that gap. At that point, fine-tuning is the answer.

Use zero-shot when you don't have labeled data, when your categories change often, or when "good enough" really is good enough. Don't pick it just because it's faster to set up and then act surprised when it underperforms on a high-stakes task.

Evaluating on overly narrow datasets

It goes something like this: you build a zero-shot classifier, test it on 50 examples you wrote yourself, and it gets every one right. You ship it, and then it hits production and everyone complains about how bad it is.

The 50 examples you wrote are not representative of what real users send. They're cleaner, more obvious, and more aligned with how you think about the categories. Evaluate on data that looks like what the model will actually see - user-generated text, with typos, slang, and edge cases. If you don't have that data yet, sample a few hundred real inputs and label them by hand before you trust the numbers.
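Once you have a hand-labeled sample, the evaluation itself is short. Here's a minimal sketch, assuming you already have the classifier's top label for each input. The `evaluate` function is a hypothetical helper and the data below is placeholder data, not real results.

```python
from collections import Counter

def evaluate(predictions, gold_labels):
    """Return overall accuracy and a count of (gold, predicted) mistakes."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    errors = Counter((g, p) for p, g in zip(predictions, gold_labels) if p != g)
    return correct / len(gold_labels), errors

# Placeholder data: in practice, gold_labels come from hand-labeling real inputs
# and predictions come from the zero-shot classifier's top label for each input.
gold_labels = ["shipping issue", "billing issue", "billing issue", "general inquiry"]
predictions = ["shipping issue", "billing issue", "general inquiry", "general inquiry"]

accuracy, errors = evaluate(predictions, gold_labels)
print(f"Accuracy: {accuracy:.2f}")
for (gold, predicted), count in errors.most_common():
    print(f"{gold} -> {predicted}: {count}")
```

The error counts are often more useful than the accuracy number, because they show which pairs of categories the model confuses.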

Ignoring domain-specific language

General-purpose zero-shot models know general-purpose language. They don't know your industry's jargon.

Medical terminology, legal language, finance acronyms, and engineering specs all have narrow vocabularies that pretrained models have seen, but not deeply. If you ask a general zero-shot model to classify a sentence full of ICD-10 codes or SQL error messages, expect mixed results.

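You have two options here. Either rewrite your labels in plain language the model is more likely to understand, or switch to a model that was pretrained on text from your domain. BioBERT for medical text, FinBERT for financial text, and similar domain-specific models often outperform general ones on specialized tasks. Keep in mind that NLI-based zero-shot pipelines expect a model fine-tuned for NLI, so a domain-pretrained model typically needs that fine-tuning step before you can use it this way.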

Why Zero-Shot Classification Is Important in Modern AI

Zero-shot classification is one of the clearest demonstrations of how foundation models have changed what's possible in AI.

A decade ago, every classification task started with the same question: do we have labeled data, and if not, how do we get it? Annotation projects took months, needed budget, vendor coordination, and quality control. The model couldn't even be built until the data existed.

That assumption is no longer true.

A foundation model trained on broad text data already knows enough about language to handle a new classification task the moment you describe it. The label list is the only thing you need to define.

This shift matters for a few reasons that connect to bigger trends in AI:

  • It shows what foundation models are actually for: The point of pretraining at scale was to build a model general enough to be redirected toward many tasks without retraining. Zero-shot classification is one of the cleanest examples of that promise in action.
  • It cuts the dependence on supervised datasets: Labeled data is still useful, but it's no longer a must-have to create a working classifier. Teams that don't have the budget for annotation projects can still deploy valuable AI features.
  • It changes deployment speed: A classification feature that used to take a quarter to ship can now be prototyped in an afternoon. This affects what teams choose to build, because the cost of trying something has dropped to almost nothing.

The broader trend is the move from task-specific models to general-purpose ones. Fine-tuning still has its place, and so does training from scratch when the task demands it. But the default starting point has shifted. You reach for a pretrained model first and only specialize when the data justifies it.

Conclusion

Zero-shot classification is a reminder that not every machine learning problem needs its own training run anymore. A model that already understands language can be redirected toward a new task by changing the labels you give it. There’s no need for examples or fine-tuning.

That flexibility is the whole point. The categories you care about today might not be the ones you care about next month, and zero-shot models allow you to change your mind.

So before you start a data labeling project, run a zero-shot baseline. You'll know in an afternoon whether the task is solved, partially solved, or still needs the heavier approach. Nothing more to it.

If you’re new to LLMs, enroll in our Large Language Models (LLMs) Concepts course. It teaches you everything you need to know about LLM applications, training methodologies, and the latest research.


Author
Dario Radečić
Senior Data Scientist based in Croatia. Top Tech Writer with over 700 articles published, generating more than 10M views. Book Author of Machine Learning Automation with TPOT.

FAQs

What is zero-shot classification in simple terms?

Zero-shot classification is a way to assign labels to text without training a model on those specific labels. You give the model an input and a list of possible categories, and it picks the best match based on what it learned during pretraining. The "zero" refers to the number of training examples needed for each category, which is none.

When should I use zero-shot classification instead of training my own model?

Use zero-shot when you’re trying things out, don't have labeled data, when your categories change often, or when you need to ship a prototype quickly. It's also the right choice for exploratory work where you're still figuring out what the categories should be. If you have a stable label set and enough labeled examples, fine-tuning a traditional classifier will usually give you better accuracy.

How accurate is zero-shot classification?

Accuracy depends on the model you use, how clear your labels are, and how well the task fits general language understanding. A zero-shot model on a well-defined task can land anywhere from solid to production-ready, but it will almost always lose to a fine-tuned model trained on enough labeled data for the same task. Treat zero-shot as a strong baseline.

What's the difference between zero-shot classification and NLI?

Natural language inference (NLI) is a task where a model decides if one sentence entails, contradicts, or is neutral toward another. Zero-shot classification uses NLI models as the engine behind the scenes: your input becomes the premise, and each candidate label is turned into a hypothesis. The label whose hypothesis gets the highest entailment score wins.

Can I use zero-shot classification for non-English text?

Yes, but you need a multilingual model. Models like xlm-roberta-large-xnli are trained on multilingual NLI data and can handle zero-shot classification across dozens of languages. Stick with bart-large-mnli and similar English-only models for English text, and switch to a multilingual variant when your inputs aren't in English.
