The modern English language is considered a weakly inflected language. This means there are many words in English derived from another word; for example, the inflected word “normality” is derived from the word “norm,” which is the root form. All inflected languages consist of words with common root forms, but the degree of inflection varies based on the language.
When working with text, sometimes it’s necessary to apply normalization techniques to get words to their root form from their derived versions. This helps reduce randomness and bring the words in the corpus closer to the predefined standard, improving the processing efficiency since the computer has fewer features to deal with.
Two of the most popular text normalization techniques in Natural Language Processing (NLP) are stemming and lemmatization.
Researchers have studied these techniques for years; NLP practitioners typically use them to prepare words, text, and documents for further processing in a number of tasks.
This tutorial will cover stemming and lemmatization from a practical standpoint using the Python Natural Language ToolKit (NLTK) package.
Check out this DataLab workbook for an overview of all the code in this tutorial. To edit and run the code, create a copy of the workbook.
What Is Stemming?
Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program.” In other words, “program” can stand in for all three inflected forms.
Performing this text-processing technique is often useful for dealing with sparsity and/or standardizing vocabulary. Not only does it help reduce redundancy, since a word stem and its inflected forms usually share the same meaning, it also allows NLP models to learn links between inflected words and their word stem, which helps the model understand their usage in similar contexts.
Stemming algorithms function by taking a list of frequent prefixes and suffixes found in inflected words and chopping off the end or beginning of the word. This can occasionally result in word stems that are not real words — a limitation we'll cover below.
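The suffix-chopping idea described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the Porter algorithm: the suffix list, the minimum stem length, and the function name `naive_stem` are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer: an illustrative sketch, not the Porter algorithm.
# The suffix list and the minimum stem length are arbitrary assumptions.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")  # longer suffixes come first
MIN_STEM_LENGTH = 3


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word  # no suffix matched, so the word is returned as-is


for example in ["programming", "programmer", "programs", "running"]:
    print(example, "->", naive_stem(example))
# programming -> programm
# programmer -> programm
# programs -> program
# running -> runn
```

Note that “programm” and “runn” are not real words, which is exactly the limitation mentioned above. Real stemmers add extra rules and conditions to reduce such cases, but they do not eliminate them.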
Advantages of stemming
- Improved model performance: Stemming reduces the number of unique words that need to be processed by an algorithm, which can improve its performance. Additionally, it can also make the algorithm run faster and more efficiently.
- Grouping similar words: Words with a similar meaning can be grouped together, even if they have distinct forms. This can be a useful technique in tasks such as document classification, where it’s important to identify key topics or themes within a document.
- Easier to analyze and understand: Since stemming typically reduces the size of the vocabulary, it’s much easier to analyze, compare, and understand texts. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.
Disadvantages of stemming
- Overstemming / false positives: This is when a stemming algorithm reduces separate inflected words to the same word stem even though they are not related; for example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings in the modern day are from widely different domains. Treating them as synonyms will reduce relevance in search results.
- Understemming / false negatives: This is when a stemming algorithm reduces inflected words to different word stems, but they should be the same. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
- Language challenges: As the target language's morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design. For example, an Italian stemmer is more complicated than an English stemmer because Italian has a higher number of verb inflections. A Russian stemmer is even more complex because of its additional noun declensions.
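One way to spot overstemming and understemming is to group words by the stem they receive and inspect the groups. The sketch below assumes a hypothetical helper, `conflation_classes`, that accepts any stemmer function. To keep it free of third-party dependencies, the stemmer here is a hard-coded lookup of assumed stems chosen to mirror the behaviour described in the list above; substitute a real stemmer to see actual values.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def conflation_classes(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group words by the stem that the given stemmer assigns to them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


# Hypothetical stand-in for a real stemmer: a table of assumed stems,
# hard-coded only so the example runs without any external library.
ASSUMED_STEMS = {
    "universal": "univers", "university": "univers", "universe": "univers",
    "alumnus": "alumnu", "alumni": "alumni", "alumnae": "alumna",
}

classes = conflation_classes(ASSUMED_STEMS, lambda word: ASSUMED_STEMS[word])
for stem, members in classes.items():
    print(stem, members)
```

A single group that mixes words from unrelated domains points to overstemming, while related words scattered across several one-word groups point to understemming.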
What Is Lemmatization?
Lemmatization is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.
As opposed to stemming, lemmatization relies on accurately determining the intended part-of-speech and the meaning of a word based on its context. This means it takes into consideration where the inflected word falls within a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Stanford NLP Group
In other words, lemmatizing a document means doing things correctly since it involves using a vocabulary and performing morphological analysis of words to remove only the inflectional endings and return the base or dictionary form of a word, which is known as the “lemma.” For example, you can expect a lemmatization algorithm to map “runs,” “running,” and “ran” to the lemma, “run.”
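To make “a vocabulary plus morphological analysis” more concrete, here is a minimal sketch of a dictionary-backed lemmatizer. Everything in it is a hypothetical stand-in: the three-word vocabulary, the irregular-form table, the suffix rules, and the name `toy_lemmatize` are assumptions for illustration, and real lemmatizers rely on far larger lexicons.

```python
# Minimal dictionary-based lemmatizer sketch. The vocabulary, irregular-form
# table, and suffix rules are tiny hypothetical stand-ins for a real lexicon.
VOCABULARY = {"run", "study", "program"}
IRREGULAR_FORMS = {"ran": "run"}
# (suffix to remove, replacement) pairs, tried in order
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def _candidates(word):
    """Yield possible base forms produced by the suffix rules."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)] + replacement
            yield base
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(base) > 2 and base[-1] == base[-2]:
                yield base[:-1]


def toy_lemmatize(word: str) -> str:
    """Return a lemma only if it is confirmed by the vocabulary."""
    word = word.lower()
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    if word in VOCABULARY:
        return word
    for candidate in _candidates(word):
        if candidate in VOCABULARY:
            return candidate
    return word  # unknown words are returned unchanged


for example in ["runs", "running", "ran", "pythonistas"]:
    print(example, "->", toy_lemmatize(example))
# runs -> run
# running -> run
# ran -> run
# pythonistas -> pythonistas
```

The key difference from plain suffix stripping is the vocabulary check: a candidate is accepted only when it is a known dictionary form, so the function never invents a non-word.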
Advantages of lemmatization
- Accuracy: Lemmatization does not merely cut words off as you see in stemming algorithms. Analysis of words is conducted based on the word’s POS to take context into consideration when producing lemmas. Also, lemmatization leads to real dictionary words being produced.
Disadvantages of lemmatization
- Time-consuming: Compared to stemming, lemmatization is a slow and time-consuming process. This is because lemmatization involves performing morphological analysis and deriving the meaning of words from a dictionary.
Stemming and Lemmatization Examples in Python with NLTK
Now you have an overview of stemming and lemmatization. In this section, we are going to get hands-on and demonstrate examples of both techniques using Python and a library called NLTK.
The Python NLTK package
Natural Language Tool Kit (NLTK) is a Python library used to build programs capable of processing natural language. The library can perform different operations such as tokenizing, stemming, classification, parsing, tagging, semantic reasoning, sentiment analysis, and more.
The latest version is NLTK 3.9.x, and it requires Python 3.10 or higher (up to 3.14), but you don't have to worry about this since it comes preinstalled in the DataLab workbook — just import nltk and you're good to go.
Python stemming example
One of the most popular stemming algorithms is the “Porter stemmer,” first proposed by Martin Porter in a 1980 paper titled "An algorithm for suffix stripping." The algorithm described in that paper has become one of the most common for stemming English text.
Let’s see how it works:
import nltk
from nltk.stem import PorterStemmer
nltk.download("punkt_tab") # Use punkt_tab in NLTK 3.9+; punkt is deprecated
# Initialize Python porter stemmer
ps = PorterStemmer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ps.stem(word)))
"""
--Word--            --Stem--
program             program
programming         program
programer           program
programs            program
programmed          program
"""
This is a pretty simple example; these are the results we expected from the Porter stemmer, as described in the “What Is Stemming?” section above.
Let’s try a trickier example:
import string
from nltk.tokenize import word_tokenize
example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."
# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))
# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, ps.stem(word)))
"""
--Word--            --Stem--
Python              python
programmers         programm
often               often
tend                tend
like                like
programming         program
in                  in
python              python
because             becaus
its                 it
like                like
english             english
We                  we
call                call
people              peopl
who                 who
program             program
in                  in
python              python
pythonistas         pythonista
"""
Here you can see that some of the output stems are not English dictionary words (e.g., “becaus,” “peopl,” and “programm”). Another thing to notice is that context is not taken into consideration. For instance, “programming” was reduced to “program,” which can be a noun or a verb; in other words, the stems are ambiguous.
Python lemmatization example
The motivation behind context-sensitive lemmatizers was to improve the performance on unseen and ambiguous words. In our lemmatization example, we will be using a popular lemmatizer called the WordNet lemmatizer.
WordNet is a large, free, and publicly available lexical database for the English language that aims to establish structured semantic relationships between words.
Let’s see it in action:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
nltk.download("omw-1.4")
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]
# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in example_words:
    # pos="v" tells the lemmatizer to treat words as verbs
    # Other options: "n" (noun), "a" (adjective), "r" (adverb)
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--
program             program
programming         program
programer           programer
programs            program
programmed          program
"""
A word passed to our lemmatizer is returned unchanged if no matching entry can be found in WordNet for the given part of speech. This means context must be provided, which is done through the part-of-speech parameter, pos, of wnl.lemmatize.
Notice that “programer” was not reduced to “program” by our lemmatizer: we told it to treat every word as a verb, and it found no matching verb in WordNet, so it returned the word unchanged.
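In practice, the part of speech differs from token to token, so a common pattern is to run a tagger first and translate each tag into one of the four pos letters the lemmatizer accepts. The sketch below assumes Penn Treebank tags (the tag set used by NLTK's default English tagger); the helper name `penn_to_wordnet_pos` and the hand-tagged sample tokens are hypothetical and exist only so the example runs without any downloads.

```python
def penn_to_wordnet_pos(tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag to a WordNet pos letter: n, v, a, or r."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return default  # fallback for every other tag (an assumption of this sketch)


# Hand-tagged tokens, standing in for the output of a real POS tagger
tagged_tokens = [("programmers", "NNS"), ("like", "VBP"), ("programming", "VBG"), ("often", "RB")]

for token, tag in tagged_tokens:
    # With NLTK set up, the mapped letter would be passed on as:
    # wnl.lemmatize(token, pos=penn_to_wordnet_pos(tag))
    print("{0:15}{1:6}{2}".format(token, tag, penn_to_wordnet_pos(tag)))
```

Falling back to "n" for unmapped tags is a design choice rather than a rule; it simply matches the lemmatizer's own default of treating words as nouns.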
Let’s pass our lemmatizer something more complicated to see how it fares…
example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."
# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))
word_tokens = word_tokenize(example_sentence_no_punct)
# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--
Python              Python
programmers         programmers
often               often
tend                tend
like                like
programming         program
in                  in
python              python
because             because
its                 its
like                like
english             english
We                  We
call                call
people              people
who                 who
program             program
in                  in
python              python
pythonistas         pythonistas
"""
All words returned by the lemmatization algorithm are English dictionary words, except for “pythonistas,” which is an informal term for Python programmers.
Stemming vs. Lemmatization
You’ve seen how to implement both techniques, but how do they compare?
Stemming and lemmatization are both text-processing techniques that aim to reduce inflected words to a common base root. Despite the correlation in the overarching objective, the two techniques are not the same.
The main differences between stemming and lemmatization lie in how each technique arrives at the objective of reducing inflected words to a common base root.
Stemming algorithms attempt to find the common base roots of various inflections by cutting off the endings or beginnings of the word. The chop is based on a list of common prefixes and suffixes that can typically be found in inflected words. This indiscriminate chopping sometimes produces meaningful word stems, but other times it does not.
On the other hand, lemmatization algorithms attempt to find common base roots from inflected words by conducting a morphological analysis. To accurately reduce inflections, a detailed dictionary must be kept so the algorithm can search through to link an inflected word back to its lemma.
The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
Stanford NLP, IR Book
The crude heuristic approach taken by stemming algorithms typically means they’re fast and efficient but not always accurate. In contrast, lemmatization algorithms sacrifice speed and efficiency for accuracy, resulting in meaningful base roots.
For example, a stemming algorithm may reduce “saw” down to “s.” A lemmatization algorithm will consider whether “saw” is a noun (the hand tool for cutting) or a verb (to see) based on the context in which it is used before deciding to return a lemma – if it’s a noun it will return “saw,” and if it’s a verb it will return “see.”
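The “saw” case can be sketched as a lookup keyed by both the surface form and its part of speech. The mini-lexicon and the function name `lemma_for` below are hypothetical, hand-written purely to illustrate the idea; a real lemmatizer would consult a full lexical database and a POS tagger.

```python
# Hypothetical mini-lexicon keyed by (surface form, part of speech)
LEXICON = {
    ("saw", "n"): "saw",   # the cutting tool
    ("saw", "v"): "see",   # past tense of "to see"
    ("saws", "n"): "saw",
}


def lemma_for(word: str, pos: str) -> str:
    """Look up the lemma for a word given its part of speech."""
    # Entries missing from the lexicon are returned unchanged
    return LEXICON.get((word.lower(), pos), word)


print(lemma_for("saw", "n"))  # saw
print(lemma_for("saw", "v"))  # see
```

Because the part of speech is part of the key, the same string can resolve to different lemmas, which is something a context-free stemmer cannot do.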
These points may be pretty clear to you by now, so here’s the million dollar question – “should I use stemming or lemmatization for text preprocessing?”
Like most things software-related, it depends.
Do you care about speed and efficiency? If so, choose stemming.
Is context important for your application? If you said “yes,” then use lemmatization.
Which technique you use completely depends on the application you are working on and your goals for the project. You may want to run experiments with both techniques and compare the results to see which approach resulted in the outcomes that most align with your project goals.
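A small harness can make such an experiment repeatable. The sketch below is one possible setup, not a standard benchmark: `evaluate_normalizer` is a hypothetical helper that accepts any token-to-token function, and `crude_stem` and `table_lemma` are deliberately simplistic stand-ins for a real stemmer and lemmatizer.

```python
import time
from typing import Callable, Dict, List


def evaluate_normalizer(tokens: List[str], normalize: Callable[[str], str]) -> Dict[str, float]:
    """Report the vocabulary size and wall-clock time for one normalizer."""
    start = time.perf_counter()
    normalized = [normalize(token) for token in tokens]
    elapsed = time.perf_counter() - start
    return {"vocabulary_size": len(set(normalized)), "seconds": elapsed}


# Hypothetical stand-in for a stemmer: lowercases and strips a trailing "s"
def crude_stem(token: str) -> str:
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


# Hypothetical stand-in for a lemmatizer: a tiny hand-written lookup table
LEMMA_TABLE = {"ran": "run", "runs": "run", "running": "run", "mice": "mouse"}


def table_lemma(token: str) -> str:
    token = token.lower()
    return LEMMA_TABLE.get(token, token)


tokens = ["Runs", "ran", "running", "run", "mice", "mouse", "buses"]

for name, normalizer in [("stem", crude_stem), ("lemma", table_lemma)]:
    print(name, evaluate_normalizer(tokens, normalizer))
```

In a real project you would replace the two stand-ins with an actual stemmer and lemmatizer, and add a downstream metric (such as classifier accuracy) alongside vocabulary size and timing.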
Something we have not touched on much in this tutorial is how lemmatization algorithms are created; this is because several libraries (such as spaCy and NLTK) already provide lemmatizers for a range of languages. However, if you had to create your own lemmatizer for an unsupported language (e.g., Akan), you would need a good understanding of the target language. Stemming algorithms are much easier to build for such scenarios.
| | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slow |
| Accuracy | Lower | Higher |
| Output | May not be a real word | Always a real word |
| Uses context? | No | Yes |
| Best for | Search engines, spam filters | Chatbots, sentiment analysis, QA |
Conclusion
To summarize, stemming and lemmatization are techniques used for text processing in NLP. They both aim to reduce inflections down to common base root words, but each takes a different approach in doing so.
Lemmatization is much more accurate than stemming. It always returns real dictionary words and takes context into consideration. For most modern NLP pipelines, lemmatization is preferred when accuracy matters. However, if you're working at scale or building a quick prototype, stemming may be good enough and considerably faster.
FAQs
Do I need to download anything extra to use NLTK's stemmer and lemmatizer?
For the Porter Stemmer, no extra downloads are needed beyond installing NLTK itself. For the WordNet Lemmatizer, you'll need to run nltk.download("wordnet") and nltk.download("omw-1.4") once before using it. For tokenization, you'll also need nltk.download("punkt_tab") in NLTK 3.9+.
Why does my lemmatizer return the same word I put in without changing it?
This usually means one of two things: either the word wasn't found in WordNet's dictionary, or you didn't specify the correct pos (part-of-speech) parameter. For example, wnl.lemmatize("running") returns "running" by default because it assumes a noun. Pass pos="v" to treat it as a verb and get "run" back instead.
Is stemming or lemmatization better for search engines?
Stemming is generally preferred for search engines because speed matters more than perfect accuracy at that scale. When a user searches "running shoes," you want results for "run," "runner," and "runs" returned quickly. The occasional non-dictionary stem doesn't hurt the experience much. Lemmatization is overkill for most search use cases.
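The usual pattern is to apply the same stemmer when indexing documents and when processing the query, so that different inflections meet at the same index key. Here is a minimal sketch of that idea using a toy inverted index; `toy_stem`, `build_index`, `search`, and the sample documents are all hypothetical, and the stemmer is far cruder than anything a production search engine would use.

```python
from collections import defaultdict
from typing import Dict, Set


def toy_stem(word: str) -> str:
    """Very crude stemmer: strip "ing" or "s", then undo consonant doubling."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2]:
        word = word[:-1]  # e.g. "runn" -> "run"
    return word


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each stem to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


DOCUMENTS = {
    1: "She runs every morning",
    2: "Running shoes on sale",
    3: "Shoe repair shop",
}

index = build_index(DOCUMENTS)
print(search(index, "running shoes"))  # {2}
print(search(index, "run"))            # {1, 2}
```

Because the query and the documents pass through the same function, it does not matter that some stems are not dictionary words; they only need to be consistent.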
Can I use these techniques with languages other than English?
Yes, but with varying difficulty. NLTK has limited support for other languages out of the box. For broader multilingual support, libraries like spaCy or Stanza are better options — spaCy in particular has strong lemmatization support for languages including German, French, Spanish, Portuguese, and more. Building your own stemmer or lemmatizer for a low-resource language (like Akan) requires deep knowledge of that language's morphology.
Does lemmatization always produce better NLP model results than stemming?
Not necessarily. It depends on your task and dataset. For tasks where exact word meaning matters — like question answering, sentiment analysis, or chatbots — lemmatization typically performs better. For simpler tasks like document classification or keyword extraction, stemming often produces results that are just as good, at a fraction of the processing cost. The safest approach is to test both and compare results on your specific data.
