Stemming and Lemmatization in Python

This tutorial covers stemming and lemmatization from a practical standpoint using the Python Natural Language Toolkit (NLTK) package.

Updated Jun 1, 2026 · 12 min read


Modern English is considered a weakly inflected language, yet many English words are still formed from another word; for example, “normality” is derived from “norm,” which is the root form. All inflected languages consist of words with common root forms, but the degree of inflection varies by language.

When working with text, it’s sometimes necessary to apply normalization techniques that reduce derived words to their root form. This reduces variation in the corpus and brings its words closer to a predefined standard, which improves processing efficiency because the computer has fewer features to deal with.

Two of the most popular text normalization techniques in Natural Language Processing (NLP) are stemming and lemmatization.

Researchers have studied these techniques for years; NLP practitioners typically use them to prepare words, text, and documents for further processing in a number of tasks.

This tutorial will cover stemming and lemmatization from a practical standpoint using the Python Natural Language Toolkit (NLTK) package.

Check out this DataLab workbook for an overview of all the code in this tutorial. To edit and run the code, create a copy of the workbook.

What Is Stemming?

Stemming is a technique used to reduce an inflected word to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced to the common word stem “program.” In other words, “program” can stand in for all three inflected words.

This text-processing technique is often useful for dealing with sparsity and standardizing vocabulary. It reduces redundancy, since a word stem and its inflected forms usually share the same meaning, and it lets NLP models learn the links between inflected words and their stem, which helps a model understand their usage in similar contexts.

Stemming algorithms work from a list of frequent prefixes and suffixes found in inflected words and chop the matching beginning or end off the word. This can occasionally produce word stems that are not real words, a limitation covered below.
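
To make the chopping idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is not the Porter algorithm or any NLTK function: the suffix list, the minimum stem length, and the name naive_stem are all made up for illustration, and it handles suffixes only.

```python
# Toy suffix-stripping stemmer. SUFFIXES and MIN_STEM_LENGTH are
# illustrative choices, not taken from any published algorithm.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")  # checked in this order
MIN_STEM_LENGTH = 3


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for word in ["programming", "programmers", "programs", "because", "news"]:
    print(f"{word:15}{naive_stem(word)}")
```

Even this tiny rule set shows the trade-off: “programs” becomes “program,” but “programming” becomes the non-word “programm,” and “news” is wrongly cut to “new.”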

Advantages of stemming

Improved model performance: Stemming reduces the number of unique words an algorithm has to process, which can improve its performance and make it run faster and more efficiently.
Grouping similar words: Words with a similar meaning can be grouped together, even if they have distinct forms. This can be a useful technique in tasks such as document classification, where it’s important to identify key topics or themes within a document.
Easier to analyze and understand: Since stemming typically reduces the size of the vocabulary, it’s much easier to analyze, compare, and understand texts. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.

Disadvantages of stemming

Overstemming / false positives: This is when a stemming algorithm reduces separate inflected words to the same word stem even though they are not related; for example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings in the modern day are from widely different domains. Treating them as synonyms will reduce relevance in search results.
Understemming / false negatives: This is when a stemming algorithm reduces inflected words to different word stems, but they should be the same. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
Language challenges: As the target language's morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design. For example, an Italian stemmer is more complicated than an English stemmer because Italian has a higher number of verb inflections. A Russian stemmer is even more complex because of its additional noun declensions.
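
The overstemming and understemming cases described above can be checked with a short sketch like the one below. It assumes NLTK is installed; PorterStemmer needs no extra downloads, and distinct_stems is a helper name introduced here for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()


def distinct_stems(words):
    """Return the set of stems the Porter stemmer produces for the words."""
    return {ps.stem(word) for word in words}


overstemmed = distinct_stems(["universal", "university", "universe"])
understemmed = distinct_stems(["alumnus", "alumnae", "alumni"])

print(overstemmed)   # a single shared stem for words with different meanings
print(understemmed)  # several stems for words that should be grouped
```

Counting distinct stems per group of words is a quick way to spot both problems on your own vocabulary before committing to a stemmer.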

What Is Lemmatization?

Lemmatization is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.

As opposed to stemming, lemmatization relies on accurately determining the intended part-of-speech and the meaning of a word based on its context. This means it takes into consideration where the inflected word falls within a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Stanford NLP Group

In other words, lemmatizing a document relies on a vocabulary and morphological analysis to remove only the inflectional endings and return the base or dictionary form of each word, known as the “lemma.” For example, you can expect a lemmatization algorithm to map “runs,” “running,” and “ran” to the lemma “run.”
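
As a rough illustration of the idea, the sketch below looks up lemmas in a tiny hand-written lexicon keyed by word and part of speech. LEMMA_LEXICON and toy_lemmatize are hypothetical names, and the lexicon is made up for this example; real lemmatizers rely on far larger vocabularies plus morphological rules.

```python
# Hypothetical mini-lexicon for illustration only. Keys are
# (inflected form, part of speech); values are dictionary forms (lemmas).
LEMMA_LEXICON = {
    ("runs", "verb"): "run",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word and POS; fall back to the word itself."""
    return LEMMA_LEXICON.get((word.lower(), pos), word)


examples = [("runs", "verb"), ("running", "verb"), ("ran", "verb"),
            ("saw", "verb"), ("saw", "noun")]
for word, pos in examples:
    print(f"{word:10}{pos:6}{toy_lemmatize(word, pos)}")
```

Because the key includes the part of speech, the same surface form can map to different lemmas, which is something a suffix-chopping stemmer cannot do.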

Advantages of lemmatization

Accuracy: Lemmatization does not merely cut words off as stemming algorithms do. Words are analyzed based on their part of speech (POS), so context is taken into consideration when producing lemmas. Lemmatization also produces real dictionary words.

Disadvantages of lemmatization

Time-consuming: Compared to stemming, lemmatization is a slow and time-consuming process. This is because lemmatization involves performing morphological analysis and deriving the meaning of words from a dictionary.


Stemming and Lemmatization Examples in Python with NLTK

Now you have an overview of stemming and lemmatization. In this section, we are going to get hands-on and demonstrate examples of both techniques using Python and a library called NLTK.

The Python NLTK package

The Natural Language Toolkit (NLTK) is a Python library used to build programs capable of processing natural language. The library can perform different operations such as tokenizing, stemming, classification, parsing, tagging, semantic reasoning, sentiment analysis, and more.

The latest version is NLTK 3.9.x, and it requires Python 3.10 or higher (up to 3.14), but you don't have to worry about this since it comes preinstalled in the DataLab workbook — just import nltk and you're good to go.

Python stemming example

One of the most popular stemming algorithms is the “Porter stemmer.” It was first proposed by Martin Porter in a 1980 paper titled "An algorithm for suffix stripping" and has become one of the most common algorithms for stemming in English.

Let’s see how it works:

import nltk
from nltk.stem import PorterStemmer

# word_tokenize (used in the next example) needs the punkt_tab resource
# Use punkt_tab in NLTK 3.9+; punkt is deprecated
nltk.download("punkt_tab")

# Initialize the Porter stemmer
ps = PorterStemmer()

# Example inflections to reduce
example_words = ["program", "programming", "programer", "programs", "programmed"]

# Perform stemming
print("{0:20}{1:20}".format("--Word--", "--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ps.stem(word)))

"""
--Word--            --Stem--            
program             program             
programming         program             
programer           program             
programs            program             
programmed          program

"""

This is a pretty simple example; these are the results we expected from the Porter stemmer, as described in the “What Is Stemming?” section above.

Let’s try a trickier example:

import string
from nltk.tokenize import word_tokenize

example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."

# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)

# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, ps.stem(word)))

"""
--Word--            --Stem--            
Python              python              
programmers         programm            
often               often               
tend                tend                
like                like                
programming         program             
in                  in                  
python              python              
because             becaus              
its                 it                  
like                like                
english             english             
We                  we                  
call                call                
people              peopl               
who                 who                 
program             program             
in                  in                  
python              python              
pythonistas         pythonista
"""

Here you can see that some of the output words are not part of the English dictionary (e.g., “becaus,” “peopl,” and “programm”). Another thing to notice is that context is not taken into consideration. For instance, “programming” was reduced to “program,” which can be a noun or a verb, so the stem alone is ambiguous.

Python lemmatization example

The motivation behind context-sensitive lemmatizers was to improve performance on unseen and ambiguous words. In our lemmatization example, we will use a popular lemmatizer called the WordNet lemmatizer.

WordNet is a large, free, and publicly available lexical database for the English language that aims to establish structured semantic relationships between words.

Let’s see it in action:

import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data the lemmatizer relies on
nltk.download("wordnet")
nltk.download("omw-1.4")

# Initialize the WordNet lemmatizer
wnl = WordNetLemmatizer()

# Example inflections to reduce
example_words = ["program", "programming", "programer", "programs", "programmed"]

# Perform lemmatization
print("{0:20}{1:20}".format("--Word--", "--Lemma--"))
for word in example_words:
    # pos="v" tells the lemmatizer to treat words as verbs
    # Other options: "n" (noun), "a" (adjective), "r" (adverb)
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))

"""
--Word--            --Lemma--           
program             program             
programming         program             
programer           programer           
programs            program             
programmed          program
"""

An input word passed to our lemmatizer remains unchanged if it cannot be found in WordNet under the given part of speech. This means context must be provided, which is done through the part-of-speech parameter, pos, of wnl.lemmatize.

Notice that the word “programer” was not reduced to “program” by our lemmatizer: we told it to treat every word as a verb, and it found no verb entry for that word, so it returned the input unchanged.
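
Hard-coding pos="v" for every word is a simplification. One common approach, sketched below, is to choose the pos value per token from a part-of-speech tag. The helper penn_to_wordnet_pos is a name introduced here, and the sketch assumes Penn Treebank-style tags (the kind nltk.pos_tag produces for English); the tagged tokens are written by hand so that no tagger model is needed.

```python
def penn_to_wordnet_pos(penn_tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag to one of the pos codes "n", "v", "a", "r"."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if penn_tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if penn_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return default  # anything else falls back to the default code


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged_tokens = [
    ("programmers", "NNS"),
    ("like", "VBP"),
    ("programming", "VBG"),
    ("quickly", "RB"),
]

for token, tag in tagged_tokens:
    print(f"{token:15}{tag:6}{penn_to_wordnet_pos(tag)}")
```

The returned code can then be passed as the pos argument, for example wnl.lemmatize(token, pos=penn_to_wordnet_pos(tag)). Falling back to "n" mirrors the lemmatizer's own default of treating words as nouns.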

Let’s pass our lemmatizer something more complicated to see how it fares:

import string
from nltk.tokenize import word_tokenize

example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."

# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)

# Perform lemmatization, treating every token as a verb
print("{0:20}{1:20}".format("--Word--", "--Lemma--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--           
Python              Python              
programmers         programmers         
often               often               
tend                tend                
like                like                
programming         program             
in                  in                  
python              python              
because             because             
its                 its                 
like                like                
english             english             
We                  We                  
call                call                
people              people              
who                 who                 
program             program             
in                  in                  
python              python              
pythonistas         pythonistas 
"""

All words returned by the lemmatization algorithm are in the English dictionary, except “pythonistas,” which is an informal term for Python programmers.

Stemming vs. Lemmatization

You’ve seen how to implement both techniques, but how do they compare?

Stemming and lemmatization are both text-processing techniques that aim to reduce inflected words to a common base root. Despite sharing this overarching objective, the two techniques are not the same.

The main differences between stemming and lemmatization lie in how each technique arrives at the objective of reducing inflected words to a common base root.

Stemming algorithms attempt to find the common base roots of various inflections by cutting off the endings or beginnings of the word. The cut is based on a list of common prefixes and suffixes typically found in inflected words. Because this chopping is indiscriminate, it sometimes yields meaningful word stems and sometimes does not.

On the other hand, lemmatization algorithms attempt to find common base roots from inflected words by conducting a morphological analysis. To accurately reduce inflections, a detailed dictionary must be kept so the algorithm can search through to link an inflected word back to its lemma.

The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
Stanford NLP, IR Book

The crude heuristic approach taken by stemming algorithms typically means they’re fast and efficient but not always accurate. In contrast, lemmatization algorithms sacrifice speed and efficiency for accuracy, which results in meaningful base roots.

For example, a stemming algorithm may reduce “saw” down to “s.” A lemmatization algorithm will consider whether “saw” is a noun (the hand tool for cutting) or a verb (to see) based on the context in which it is used before deciding to return a lemma – if it’s a noun it will return “saw,” and if it’s a verb it will return “see.”

These points may be pretty clear to you by now, so here’s the million-dollar question: “Should I use stemming or lemmatization for text preprocessing?”

Like most things software related, it depends.

Do you care about speed and efficiency? If so, choose stemming.

Is context important for your application? If you said “yes,” then use lemmatization.

Which technique you use completely depends on the application you are working on and your goals for the project. You may want to run experiments with both techniques and compare the results to see which approach resulted in the outcomes that most align with your project goals.
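
One way to set up such an experiment is sketched below: a small harness that applies each candidate normalizer to the same tokens and reports the resulting vocabulary size and elapsed time. The function compare_normalizers and the sample tokens are made up for illustration; only the Porter stemmer is wired in so the sketch runs without extra downloads, and a lemmatizer could be added to the dictionary in the same way.

```python
import time
from typing import Callable, Dict, List

from nltk.stem import PorterStemmer


def compare_normalizers(
    tokens: List[str],
    normalizers: Dict[str, Callable[[str], str]],
) -> Dict[str, dict]:
    """Apply each normalizer to the tokens; record vocabulary size and time."""
    report = {}
    for name, normalize in normalizers.items():
        start = time.perf_counter()
        normalized = [normalize(token) for token in tokens]
        elapsed = time.perf_counter() - start
        report[name] = {
            "vocabulary_size": len(set(normalized)),
            "seconds": elapsed,
        }
    return report


# Made-up sample tokens, repeated so the timing is less noisy.
tokens = ["program", "programs", "programming", "programmed", "people", "because"] * 1000

report = compare_normalizers(
    tokens,
    {
        "lowercase only": str.lower,
        "porter stemmer": PorterStemmer().stem,
        # A lemmatizer's lemmatize method could be added here as another entry.
    },
)

for name, stats in report.items():
    print(f"{name:20}{stats['vocabulary_size']:>5}{stats['seconds']:>12.4f}s")
```

Vocabulary size and run time are only proxies. For a real project, the more telling comparison is how each option affects your downstream metric, such as classification accuracy or search relevance.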

Something we have not touched on much in this tutorial is how lemmatization algorithms are created, because libraries such as spaCy and NLTK already provide lemmatizers for a range of languages. However, if you had to create your own lemmatizer for an unavailable language (e.g., Akan), you would need a good understanding of the target language. Stemming algorithms are much easier to build for such scenarios.

Aspect          Stemming                        Lemmatization
Speed           Fast                            Slow
Accuracy        Lower                           Higher
Output          May not be a real word          Always a real word
Uses context?   No                              Yes
Best for        Search engines, spam filters    Chatbots, sentiment analysis, QA

Conclusion

To summarize, stemming and lemmatization are techniques used for text processing in NLP. They both aim to reduce inflections down to common base root words, but each takes a different approach in doing so.

Lemmatization is generally more accurate than stemming. It returns real dictionary words for the words it recognizes and can take context into consideration. For most modern NLP pipelines, lemmatization is preferred when accuracy matters. However, if you're working at scale or building a quick prototype, stemming may be good enough and considerably faster.

Author

Kurtis Pykes

