Stemming and Lemmatization in Python

This tutorial covers stemming and lemmatization from a practical standpoint using the Python Natural Language Toolkit (NLTK) package.

Updated Feb 28, 2023 · 12 min read

Modern English is considered a weakly inflected language, meaning its words take relatively few inflected forms compared with many other languages. Even so, many English words are formed from another word; for example, “normality” is derived from the word “norm,” which is the root form. All inflected languages consist of words with common root forms, but the degree of inflection varies based on the language.

“In linguistic morphology, inflection is a process of word formation, in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.”
– (Source: Wikipedia)

When working with text, it’s sometimes necessary to apply normalization techniques that bring words back from their derived versions to their root form. This reduces variation in the vocabulary and brings the words in the corpus closer to a predefined standard, which improves processing efficiency because the computer has fewer features to deal with.

Two popular text normalization techniques in the field of Natural Language Processing (NLP), the application of computational techniques to analyze and synthesize natural language and speech, are stemming and lemmatization. Researchers have studied these techniques for years; NLP practitioners typically use them to prepare words, text, and documents for further processing in a number of tasks.

This tutorial will cover stemming and lemmatization from a practical standpoint using the Python Natural Language Toolkit (NLTK) package.

Check out this DataLab workbook for an overview of all the code in this tutorial. To edit and run the code, create a copy of the workbook.

Stemming

Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program.” In other words, “program” can be used as a synonym for the prior three inflected words.

Performing this text-processing technique is often useful for dealing with sparsity and/or standardizing vocabulary. Not only does it help reduce redundancy, since a word stem and its inflected forms usually share the same meaning, it also allows NLP models to learn links between inflected words and their word stem, which helps the model understand their usage in similar contexts.

Stemming algorithms function by taking a list of frequent prefixes and suffixes found in inflected words and chopping off the end or beginning of the word. This can occasionally result in word stems that are not real words, so the approach has clear advantages but is not without its limitations.
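
To make the chopping idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is not the Porter algorithm or any NLTK stemmer; the suffix list, the minimum stem length, and the function name toy_stem are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the general idea.
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ["ing", "ers", "er", "ed", "s"]  # checked in this order
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for word in ["programs", "programming", "programmed", "running", "because"]:
    print(f"{word:15}{toy_stem(word)}")
```

Because the rules never consult a dictionary, some of the stems this sketch prints are not real words, which is the limitation described above.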

Advantages of Stemming

Improved model performance: Stemming reduces the number of unique words that need to be processed by an algorithm, which can improve its performance. Additionally, it can also make the algorithm run faster and more efficiently.
Grouping similar words: Words with a similar meaning can be grouped together, even if they have distinct forms. This can be a useful technique in tasks such as document classification, where it’s important to identify key topics or themes within a document.
Easier to analyze and understand: Since stemming typically reduces the size of the vocabulary, it’s much easier to analyze, compare, and understand texts. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.
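
As a rough illustration of the grouping and vocabulary-reduction points above, the following sketch collects tokens under their Porter stem using NLTK's PorterStemmer. The token list is hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical token list, chosen only for illustration
tokens = ["program", "programs", "programming", "programmed", "python", "pythons"]

# Map each stem to the distinct surface forms that share it
groups = defaultdict(set)
for token in tokens:
    groups[ps.stem(token)].add(token)

for stem, forms in groups.items():
    print(f"{stem:12}{sorted(forms)}")

print("unique tokens:", len(set(tokens)), "unique stems:", len(groups))
```

A downstream model would then see one feature per stem instead of one per surface form.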

Disadvantages of Stemming

Overstemming / False positives: This is when a stemming algorithm reduces separate inflected words to the same word stem even though they are not related; for example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings in the modern day are from widely different domains. Treating them as synonyms will reduce relevance in search results.
Understemming / False negatives: This is when a stemming algorithm reduces inflected words to different word stems, but they should be the same. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
Language challenges: As the target language's morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design; for example, an Italian stemmer is more complicated than an English stemmer because there is a higher number of verb inflections. A Russian stemmer is even more complex due to more noun declensions.
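
The overstemming and understemming cases above can be checked directly. This is a small sketch using NLTK's PorterStemmer with the two word groups mentioned in the list; it only counts how many distinct stems each group produces.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Word groups taken from the examples in the list above
cases = {
    "overstemming": ["universal", "university", "universe"],
    "understemming": ["alumnus", "alumnae", "alumni"],
}

distinct_stems = {}
for label, words in cases.items():
    stems = {word: ps.stem(word) for word in words}
    distinct_stems[label] = set(stems.values())
    print(label, stems)
    print("  distinct stems:", len(distinct_stems[label]))
```

One distinct stem for unrelated words signals overstemming, while several distinct stems for words that should match signals understemming.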

Lemmatization

Lemmatization is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.

As opposed to stemming, lemmatization relies on accurately determining the intended part-of-speech and the meaning of a word based on its context. This means it takes into consideration where the inflected word falls within a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.

“Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma”

– (Source: Stanford NLP Group)

In other words, lemmatizing a document typically means “doing things properly”: it involves using a vocabulary and performing morphological analysis of words to remove only the inflectional endings and return the base or dictionary form of a word, which is known as the “lemma.” For example, you can expect a lemmatization algorithm to map “runs,” “running,” and “ran” to the lemma “run.”
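
The idea of combining a vocabulary with morphological rules can be sketched with a toy verb lemmatizer. This is not how NLTK or any specific library implements lemmatization; the vocabulary, the irregular-form table, the suffix list, and the function names are tiny hand-made assumptions.

```python
# Toy verb lemmatizer. The vocabulary, the irregular-form table, and the
# suffix list are tiny hand-made assumptions, not real lexical data.
VOCABULARY = {"run", "walk", "program"}
IRREGULAR_FORMS = {"ran": "run"}
SUFFIXES = ["ing", "ed", "s"]


def candidate_bases(word):
    """Yield possible base forms obtained by removing an inflectional ending."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            yield base
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(base) >= 2 and base[-1] == base[-2]:
                yield base[:-1]


def toy_lemmatize(word):
    """Return the dictionary form of a verb, or the word unchanged if unknown."""
    word = word.lower()
    if word in VOCABULARY:
        return word
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for base in candidate_bases(word):
        # Accept a candidate only if it is an entry in the vocabulary
        if base in VOCABULARY:
            return base
    return word


for word in ["runs", "running", "ran", "programmed", "pythonistas"]:
    print(f"{word:15}{toy_lemmatize(word)}")
```

Unlike a stemmer, this sketch only ever returns an entry from its vocabulary or the original word, so it cannot produce a truncated non-word.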

Advantages of Lemmatization

Accuracy: Lemmatization does not merely cut words off as stemming algorithms do. Words are analyzed based on their part of speech (POS) so that context is taken into consideration when producing lemmas. Lemmatization also produces real dictionary words.

Disadvantages of Lemmatization

Time-consuming: Compared to stemming, lemmatization is a slow and time-consuming process. This is because lemmatization involves performing morphological analysis and deriving the meaning of words from a dictionary.

Hands-on Stemming and Lemmatization Examples in Python with NLTK

Now you have an overview of stemming and lemmatization. In this section, we are going to get hands-on and demonstrate examples of both techniques using Python and a library called NLTK.

A brief primer to the Python NLTK package

The Natural Language Toolkit (NLTK) is a Python library used to build programs capable of processing natural language. The library can perform different operations such as tokenizing, stemming, classification, parsing, tagging, semantic reasoning, sentiment analysis, and more.

At the time of writing, the latest version is NLTK 3.8.1, and it requires Python 3.7, 3.8, 3.9, 3.10, or 3.11. You don’t have to worry about this in the DataLab workbook since NLTK comes preinstalled – just import nltk and you’re good to go.

Python Stemming example

One of the most popular stemming algorithms is the “Porter stemmer.” It was first proposed by Martin Porter in a 1980 paper titled "An algorithm for suffix stripping," and it has since become one of the most common algorithms for stemming in English.

Let’s see how it works:

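import nltk
from nltk.stem import PorterStemmer

# Tokenizer data used by word_tokenize in the next example
# (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")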

# Initialize Python porter stemmer
ps = PorterStemmer()

# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]

# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ps.stem(word)))

"""
--Word--            --Stem--            
program             program             
programming         program             
programer           program             
programs            program             
programmed          program

"""

This is a pretty simple example; these are the results we expected from the Porter stemmer, as mentioned in the “Stemming” section above.

Let’s try a trickier example:

import string
from nltk.tokenize import word_tokenize

example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."

# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)

# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, ps.stem(word)))

"""
--Word--            --Stem--            
Python              python              
programmers         programm            
often               often               
tend                tend                
like                like                
programming         program             
in                  in                  
python              python              
because             becaus              
its                 it                  
like                like                
english             english             
We                  we                  
call                call                
people              peopl               
who                 who                 
program             program             
in                  in                  
python              python              
pythonistas         pythonista
"""

Here you can see that some of the output stems are not English dictionary words (e.g., “becaus,” “peopl,” and “programm”). Another thing to notice is that context is not taken into consideration. For instance, “programmers” is a plural noun, but it was reduced to “programm,” while “programming” was reduced to “program,” which can be a noun or a verb – in other words, the stems are ambiguous.

Python Lemmatization example

The motivation behind context-sensitive lemmatizers was to improve performance on unseen and ambiguous words. In our lemmatization example, we will use a popular lemmatizer, the WordNet lemmatizer.

WordNet is a large, free, and publicly available lexical database for the English language that aims to establish structured semantic relationships between words.

Let’s see it in action:

from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
nltk.download("omw-1.4")

# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]

# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))

"""
--Word--            --Lemma--           
program             program             
programming         program             
programer           programer           
programs            program             
programmed          program
"""

Words passed to our lemmatizer remain unchanged if they cannot be found in WordNet. The lemmatizer also does not infer context on its own, so context must be provided through the part-of-speech parameter, pos, of wnl.lemmatize (if omitted, it defaults to noun).

Notice that the word “programer” was not cut down to “program” by our lemmatizer: this is because we told our lemmatizer to lemmatize only verbs.
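
In running text the part of speech changes from word to word, so a fixed pos="v" is a simplification. One common approach is to tag each token first and translate the tag into one of the single-letter values that lemmatize accepts ("n", "v", "a", "r"). The sketch below assumes Penn Treebank style tags, such as those returned by nltk.pos_tag; the helper name penn_to_wordnet is hypothetical, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's default


# Hand-written (token, tag) pairs for illustration; a tagger would produce these
tagged_tokens = [
    ("programmers", "NNS"),
    ("like", "VBP"),
    ("programming", "VBG"),
    ("often", "RB"),
]

for token, tag in tagged_tokens:
    print(f"{token:15}{tag:6}{penn_to_wordnet(tag)}")
```

Each mapped value could then be passed along as wnl.lemmatize(token, pos=penn_to_wordnet(tag)), so that nouns, verbs, adjectives, and adverbs are each looked up under the right category.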

Let’s pass our lemmatizer something more complicated to see how it fares…

example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."

# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))

word_tokens = word_tokenize(example_sentence_no_punct)

# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--           
Python              Python              
programmers         programmers         
often               often               
tend                tend                
like                like                
programming         program             
in                  in                  
python              python              
because             because             
its                 its                 
like                like                
english             english             
We                  We                  
call                call                
people              people              
who                 who                 
program             program             
in                  in                  
python              python              
pythonistas         pythonistas 
"""

All words returned by the lemmatization algorithm are in the English dictionary, except “pythonistas,” which is an informal term for Python programmers. Note that plural nouns such as “programmers” are left as they are because every token was lemmatized as a verb.

Stemming vs Lemmatization

You’ve seen how to implement both techniques, but how do they compare?

Stemming and lemmatization are both text-processing techniques that aim to reduce inflected words to a common base root. Despite the correlation in the overarching objective, the two techniques are not the same.

The main differences between stemming and lemmatization lie in how each technique arrives at the objective of reducing inflected words to a common base root.

Stemming algorithms attempt to find the common base roots of various inflections by cutting off the endings or beginnings of the word. The chop is based on a list of common prefixes and suffixes that can typically be found in inflected words. This indiscriminate chopping sometimes produces meaningful word stems, but other times it does not.

On the other hand, lemmatization algorithms attempt to find common base roots from inflected words by conducting a morphological analysis. To accurately reduce inflections, a detailed dictionary must be kept so the algorithm can look up an inflected word and link it back to its lemma.

“The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.”

– Source: Stanford NLP, IR Book.

The crude heuristic approach taken by stemming algorithms typically means they’re fast and efficient but not always accurate. In contrast, lemmatization algorithms sacrifice speed and efficiency for accuracy, resulting in meaningful base roots.

For example, a stemming algorithm may reduce “saw” down to “s.” A lemmatization algorithm will consider whether “saw” is a noun (the hand tool for cutting) or a verb (to see) based on the context in which it is used before deciding to return a lemma – if it’s a noun it will return “saw,” and if it’s a verb it will return “see.”
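
The “saw” example can be sketched as a lookup keyed by both the word and its part of speech. The table and the function name lemma_for are hand-made assumptions for illustration; a real lemmatizer would draw on a full lexical database and a part-of-speech tagger.

```python
# Hand-made lookup table keyed by (word, part of speech); illustrative only
LEMMA_TABLE = {
    ("saw", "noun"): "saw",
    ("saw", "verb"): "see",
}


def lemma_for(word: str, pos: str) -> str:
    """Return the lemma for a word given its part of speech, else the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lemma_for("saw", "noun"))
print(lemma_for("saw", "verb"))
```

The same surface form maps to two different lemmas depending on the part of speech, which is information a stemmer never looks at.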

These points may be pretty clear to you by now, so here’s the million dollar question – “should I use stemming or lemmatization for text preprocessing?”

Like most things software related, it depends.

Do you care about speed and efficiency? If so, choose stemming.

Is context important for your application? If you said “yes,” then use lemmatization.

Which technique you use completely depends on the application you are working on and your goals for the project. You may want to run experiments with both techniques and compare the results to see which approach resulted in the outcomes that most align with your project goals.
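
One simple way to start such an experiment is to measure how much each normalizer shrinks the vocabulary. The sketch below is a minimal harness under stated assumptions: the function name vocabulary_size, the token list, and the stand-in normalizers are all hypothetical, and any callable that maps a token to a string (for example ps.stem or wnl.lemmatize from the examples above) could be plugged in instead.

```python
from typing import Callable, Iterable


def vocabulary_size(tokens: Iterable[str], normalize: Callable[[str], str]) -> int:
    """Count the distinct tokens that remain after applying a normalizer."""
    return len({normalize(token) for token in tokens})


# Hypothetical tokens; in practice this would be your own tokenized corpus
tokens = ["Program", "program", "programs", "Python", "python"]

# Stand-in normalizers; swap in a real stemmer or lemmatizer to compare them
normalizers = {
    "none": lambda token: token,
    "lowercase": str.lower,
    "lowercase + strip trailing s": lambda token: token.lower().rstrip("s"),
}

for name, normalize in normalizers.items():
    print(f"{name:32}{vocabulary_size(tokens, normalize)}")
```

Vocabulary size is only a proxy; the more meaningful comparison is the metric of the downstream task, such as classification accuracy or search relevance.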

Something we have not touched on much in this tutorial is how lemmatization algorithms are created; this is because several libraries (such as spaCy and NLTK) already provide lemmatizers for different languages. However, if you had to create your own lemmatizer for an unsupported language (e.g., Akan), you would need a good knowledge and understanding of the target language. Stemming algorithms are much easier to build for such scenarios.

Wrap up

To summarize, stemming and lemmatization are techniques used for text processing in NLP. They both aim to reduce inflections down to common base root words, but each takes a different approach in doing so. The stemming approach is much faster than lemmatization, but it’s cruder and can occasionally produce meaningless base roots. Lemmatization, by contrast, is much more accurate than stemming at finding meaningful dictionary words, and it takes context into consideration.
