Stemming and Lemmatization in Python
Modern English is considered a weakly inflected language. Even so, many English words are formed from another word; for example, “normality” is built on “norm,” which is its root form. All inflected languages consist of words with common root forms, but the degree of inflection varies from language to language.
“In linguistic morphology, inflection is a process of word formation, in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.”
– (Source: Wikipedia)
When working with text, sometimes it’s necessary to apply normalization techniques to get words to their root form from their derived versions. This helps reduce randomness and bring the words in the corpus closer to the predefined standard, improving the processing efficiency since the computer has fewer features to deal with.
Two popular text normalization techniques in the field of Natural Language Processing (NLP), the application of computational techniques to analyze and synthesize natural language and speech, are stemming and lemmatization. Researchers have studied these techniques for years; NLP practitioners typically use them to prepare words, text, and documents for further processing in a number of tasks.
This tutorial will cover stemming and lemmatization from a practical standpoint using the Python Natural Language Toolkit (NLTK) package.
Check out this DataLab workbook for an overview of all the code in this tutorial. To edit and run the code, create a copy of the workbook.
Stemming
Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program.” In other words, “program” can stand in for all three inflected forms.
Performing this text-processing technique is often useful for dealing with sparsity and/or standardizing vocabulary. Not only does it help reduce redundancy, since a word stem and its inflected forms usually share the same meaning, it also allows NLP models to learn links between inflected words and their stem, which helps the model understand their usage in similar contexts.
Stemming algorithms work by taking a list of frequent prefixes and suffixes found in inflected words and chopping them off the beginning or end of a word. This can occasionally produce stems that are not real words, so the approach has clear advantages but also some limitations.
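To make the chopping idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length, and the name toy_stem are all assumptions made up for this illustration, and it handles suffixes only.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
# The suffix list and minimum stem length are assumptions chosen for this example.
SUFFIXES = ("ing", "ers", "er", "ed", "s")  # tried in this order
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word  # no suffix matched, so the word is left as it is


for word in ["programs", "programmed", "programming", "because", "running"]:
    print(f"{word:15}{toy_stem(word)}")
```

Even on this handful of words, the sketch returns stems such as “programm” and “runn,” which are not real words. Real stemmers add many more rules to limit this, but they cannot remove the problem entirely.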
Advantages of Stemming
- Improved model performance: Stemming reduces the number of unique words that need to be processed by an algorithm, which can improve its performance. Additionally, it can also make the algorithm run faster and more efficiently.
- Grouping similar words: Words with a similar meaning can be grouped together, even if they have distinct forms. This can be a useful technique in tasks such as document classification, where it’s important to identify key topics or themes within a document.
- Easier to analyze and understand: Since stemming typically reduces the size of the vocabulary, it’s much easier to analyze, compare, and understand texts. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.
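The vocabulary reduction described above can be sketched in a few lines. The normalizer here, strip_plural, is a hypothetical stand-in for a real stemmer, and the token list is invented for the example.

```python
def strip_plural(word):
    """Hypothetical normalizer: lowercase and drop a single trailing "s"."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Invented tokens that differ only in case or in a plural ending.
tokens = ["Program", "programs", "program", "Stem", "stems", "stem", "text", "texts"]

raw_vocabulary = set(tokens)
normalized_vocabulary = {strip_plural(token) for token in tokens}

print(len(raw_vocabulary), "unique tokens before normalization")
print(len(normalized_vocabulary), "unique tokens after normalization")
```

Every unique token is a separate feature for a downstream model, so collapsing eight surface forms into three means fewer features to process.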
Disadvantages of Stemming
- Overstemming / False positives: This is when a stemming algorithm reduces separate inflected words to the same word stem even though they are not related; for example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings in the modern day are from widely different domains. Treating them as synonyms will reduce relevance in search results.
- Understemming / False negatives: This is when a stemming algorithm reduces inflected words to different word stems even though they should share one. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
- Language challenges: As the target language's morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design. For example, an Italian stemmer is more complicated than an English stemmer because Italian has a higher number of verb inflections. A Russian stemmer is even more complex because of its additional noun declensions.
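One way to quantify the first two disadvantages is to count word pairs that a stemmer wrongly merges or wrongly keeps apart. The sketch below assumes you supply the word groups by hand. The helper stemming_errors and the truncating stand-in truncate_stem are hypothetical names for this illustration; truncate_stem is not the Porter stemmer, although it happens to make the same kind of mistakes on these words.

```python
from itertools import combinations


def truncate_stem(word, length=7):
    """Hypothetical stand-in stemmer: keep only the first `length` characters."""
    return word.lower()[:length]


def stemming_errors(stem, groups):
    """Count pairs that a stemmer wrongly merges or wrongly splits.

    `groups` is a list of word lists. Words in the same list should share a
    stem, and words in different lists should not.
    """
    labelled = [(word, index) for index, group in enumerate(groups) for word in group]
    overstemmed = understemmed = 0
    for (word_a, group_a), (word_b, group_b) in combinations(labelled, 2):
        same_stem = stem(word_a) == stem(word_b)
        if same_stem and group_a != group_b:
            overstemmed += 1  # false positive: unrelated words merged
        elif not same_stem and group_a == group_b:
            understemmed += 1  # false negative: related words kept apart
    return overstemmed, understemmed


groups = [
    ["universal"], ["university"], ["universe"],  # distinct meanings
    ["alumnus", "alumni", "alumnae"],             # same meaning
]
print(stemming_errors(truncate_stem, groups))  # prints (3, 3)
```

Because stemming_errors accepts any callable, the same check could be run with a real stemmer's stem method to compare algorithms on word groups that matter for your application.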
Lemmatization
Lemmatization is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.
As opposed to stemming, lemmatization relies on accurately determining the intended part-of-speech and the meaning of a word based on its context. This means it takes into consideration where the inflected word falls within a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.
“Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma”
– (Source: Stanford NLP Group)
In other words, lemmatizing a document typically means “doing things properly”: using a vocabulary and morphological analysis to remove only the inflectional endings and return the base or dictionary form of each word, the “lemma.” For example, you can expect a lemmatization algorithm to map “runs,” “running,” and “ran” to the lemma “run.”
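The combination of a vocabulary, a list of irregular forms, and suffix rules can be sketched as a toy lemmatizer. Everything below (the tiny lexicon, the rule lists, and the name toy_lemmatize) is an assumption made for illustration; it is not how NLTK or any other library implements lemmatization.

```python
# Tiny hand-made lexicon, keyed by part of speech ("v" = verb, "n" = noun).
# A real lemmatizer relies on a far larger dictionary.
VOCABULARY = {
    "v": {"run", "see", "program"},
    "n": {"run", "saw", "program"},
}
# Irregular forms that no suffix rule can recover.
EXCEPTIONS = {
    ("ran", "v"): "run",
    ("saw", "v"): "see",
}
# Inflectional endings to try for each part of speech.
SUFFIXES = {
    "v": ("ing", "ed", "s"),
    "n": ("s",),
}


def toy_lemmatize(word, pos):
    """Return the dictionary form of `word` for the given part of speech."""
    word = word.lower()
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    if word in VOCABULARY[pos]:
        return word
    for suffix in SUFFIXES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            options = [candidate]
            if len(candidate) > 2 and candidate[-1] == candidate[-2]:
                options.append(candidate[:-1])  # undo doubling: "runn" -> "run"
            for option in options:
                if option in VOCABULARY[pos]:  # accept only dictionary words
                    return option
    return word  # unknown words are returned unchanged


for word, pos in [("runs", "v"), ("running", "v"), ("ran", "v"), ("saw", "v"), ("saw", "n")]:
    print(f"{word:10}{pos:4}{toy_lemmatize(word, pos)}")
```

The part of speech changes the answer: “saw” maps to “see” as a verb but stays “saw” as a noun. A candidate is accepted only if it appears in the vocabulary, which is why the output is always either a dictionary word or the unchanged input.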
Advantages of Lemmatization
- Accuracy: Lemmatization does not merely cut words off as stemming algorithms do. Each word is analyzed based on its part of speech (POS) so that context is taken into consideration when producing lemmas. Lemmatization also produces real dictionary words.
Disadvantages of Lemmatization
- Time-consuming: Compared to stemming, lemmatization is a slow and time-consuming process. This is because lemmatization involves performing morphological analysis and deriving the meaning of words from a dictionary.
Hands-on Stemming and Lemmatization Examples in Python with NLTK
Now you have an overview of stemming and lemmatization. In this section, we are going to get hands-on and demonstrate examples of both techniques using Python and a library called NLTK.
A brief primer to the Python NLTK package
The Natural Language Toolkit (NLTK) is a Python library used to build programs capable of processing natural language. The library can perform different operations such as tokenizing, stemming, classification, parsing, tagging, semantic reasoning, sentiment analysis, and more.
At the time of writing, the latest version is NLTK 3.8.1, which requires Python 3.7, 3.8, 3.9, 3.10, or 3.11. You don’t have to worry about this, since it comes preinstalled in the DataLab workbook: just import nltk and you’re good to go.
Python Stemming example
One of the most popular stemming algorithms is the “Porter stemmer.” It was first proposed by Martin Porter in a 1980 paper titled "An algorithm for suffix stripping," and it has become one of the most common algorithms for stemming in English.
Let’s see how it works:
import nltk
from nltk.stem import PorterStemmer
nltk.download("punkt")
# Initialize Python porter stemmer
ps = PorterStemmer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ps.stem(word)))
"""
--Word--            --Stem--
program             program
programming         program
programer           program
programs            program
programmed          program
"""
This is a pretty simple example; these are the results we expected from the Porter stemmer, as described in the “Stemming” section above.
Let’s try a trickier example:
import string
from nltk.tokenize import word_tokenize
example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."
# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))
# Create tokens
word_tokens = word_tokenize(example_sentence_no_punct)
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, ps.stem(word)))
"""
--Word--            --Stem--
Python              python
programmers         programm
often               often
tend                tend
like                like
programming         program
in                  in
python              python
because             becaus
its                 it
like                like
english             english
We                  we
call                call
people              peopl
who                 who
program             program
in                  in
python              python
pythonistas         pythonista
"""
Here you can see that some of the output words are not part of the English dictionary (e.g., “becaus,” “peopl,” and “programm”). Another thing to notice is that context is not taken into consideration. For instance, “programming” was reduced to “program,” which can be a noun or a verb; in other words, the stems are ambiguous.
Python Lemmatization example
The motivation behind context-sensitive lemmatizers was to improve performance on unseen and ambiguous words. In our lemmatization example, we will be using a popular lemmatizer called the WordNet lemmatizer.
WordNet is a large, free, and publicly available lexical database for the English language that aims to establish structured semantic relationships between words.
Let’s see it in action:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
nltk.download("omw-1.4")
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]
# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--
program             program
programming         program
programer           programer
programs            program
programmed          program
"""
Input words passed to our lemmatizer remain unchanged if they cannot be found in WordNet. The lemmatizer also needs to know how each word is being used, which we tell it through the part-of-speech parameter, pos, of wnl.lemmatize.
Notice that “programer” was not reduced to “program” by our lemmatizer: we told it to treat every word as a verb, and “programer” is not a verb form it can find in WordNet.
Let’s pass our lemmatizer something more complicated to see how it fares…
example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."
# Remove punctuation
example_sentence_no_punct = example_sentence.translate(str.maketrans("", "", string.punctuation))
word_tokens = word_tokenize(example_sentence_no_punct)
# Perform lemmatization
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))
"""
--Word--            --Lemma--
Python              Python
programmers         programmers
often               often
tend                tend
like                like
programming         program
in                  in
python              python
because             because
its                 its
like                like
english             english
We                  We
call                call
people              people
who                 who
program             program
in                  in
python              python
pythonistas         pythonistas
"""
All words returned by the lemmatization algorithm are in the English dictionary, except “pythonistas,” which is an informal term for Python programmers.
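Passing pos="v" for every token treats each word as a verb, which is why “programmers” comes back unchanged in the output above. A common refinement is to tag each token first and convert the tag into the letter the lemmatizer expects. The helper below is a minimal sketch: the name penn_to_wordnet_pos is hypothetical, the noun fallback is an assumption, and the tagged tokens are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. "VBG") to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, also used here as the fallback for every other tag


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged_tokens = [("programmers", "NNS"), ("like", "VBP"), ("programming", "VBG")]
for token, tag in tagged_tokens:
    print(f"{token:15}{tag:6}{penn_to_wordnet_pos(tag)}")
```

With NLTK, the tags would typically come from nltk.pos_tag(word_tokens), which needs its tagger model downloaded first, and each token would then be lemmatized with wnl.lemmatize(token, pos=penn_to_wordnet_pos(tag)).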
Stemming vs Lemmatization
You’ve seen how to implement both techniques, but how do they compare?
Stemming and lemmatization are both text-processing techniques that aim to reduce inflected words to a common base root. Despite sharing this objective, the two techniques are not the same.
The main differences between stemming and lemmatization lie in how each technique arrives at the objective of reducing inflected words to a common base root.
Stemming algorithms attempt to find the common base roots of various inflections by cutting off the endings or beginnings of the word. The chop is based on a list of common prefixes and suffixes that can typically be found in inflected words. This indiscriminate chopping sometimes produces meaningful word stems, but other times it does not.
On the other hand, lemmatization algorithms attempt to find common base roots from inflected words by conducting a morphological analysis. To accurately reduce inflections, a detailed dictionary must be kept that the algorithm can search to link an inflected word back to its lemma.
“The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.”
– Source: Stanford NLP, IR Book.
The crude heuristic approach taken by stemming algorithms typically means they’re fast and efficient but not always accurate. In contrast, lemmatization algorithms sacrifice speed and efficiency for accuracy, resulting in meaningful base roots.
For example, a stemming algorithm may reduce “saw” down to “s.” A lemmatization algorithm will consider whether “saw” is a noun (the hand tool for cutting) or a verb (to see) based on the context in which it is used before deciding to return a lemma – if it’s a noun it will return “saw,” and if it’s a verb it will return “see.”
These points may be pretty clear to you by now, so here’s the million dollar question – “should I use stemming or lemmatization for text preprocessing?”
Like most things software related, it depends.
Do you care about speed and efficiency? If so, choose stemming.
Is context important for your application? If you said “yes,” then use lemmatization.
Which technique you use completely depends on the application you are working on and your goals for the project. You may want to run experiments with both techniques and compare the results to see which approach resulted in the outcomes that most align with your project goals.
Something we have not touched on much in this tutorial is how lemmatization algorithms are created; this is because several libraries (such as spaCy and NLTK) already provide lemmatizers for a range of languages. However, if you had to create your own lemmatizer for a language that is not covered (e.g., Akan), you would need a good knowledge and understanding of the target language. Stemming algorithms are much easier to build for such scenarios.
Wrap up
To summarize, stemming and lemmatization are techniques used for text processing in NLP. They both aim to reduce inflections down to common base root words, but each takes a different approach in doing so. The stemming approach is much faster than lemmatization, but it’s cruder and can occasionally produce meaningless base roots. Lemmatization, by contrast, is much more accurate than stemming in terms of finding meaningful dictionary words, and it takes context into consideration.