Exploratory Data Analysis (EDA) for Natural Language Processing using WordCloud
What is WordCloud?
You have probably seen a cloud filled with words in different sizes, where the size of each word represents its frequency or importance. This is called a tag cloud or word cloud. In this tutorial, you will learn how to create a word cloud of your own in Python and customize it as you see fit. This tool is handy for exploring text data and making your reports more lively.
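To make the idea of "size represents frequency" concrete, here is a minimal sketch of the word counting that sits underneath a word cloud. It uses only the standard library; the helper name word_frequencies and the sample sentence are made up for illustration and are not part of the wordcloud package or the wine dataset.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, not taken from the wine dataset
sample = "Ripe fruit and soft tannins. Fruit forward, with ripe cherry fruit."
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest
print(freqs.most_common(2))  # [('fruit', 3), ('ripe', 2)]
```

A real word cloud also drops common stop words such as "and" or "with" before counting, which this sketch leaves out.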
In this tutorial we will use a wine review dataset taken from the Wine Enthusiast website to learn how to:
- Create a basic wordcloud from one or several text documents
- Adjust the color, size, and number of words in your wordcloud
- Mask your wordcloud into any shape of your choice
- Mask your wordcloud into any color pattern of your choice
Prerequisites
You will need to install the packages below:
The numpy library is one of the most popular and helpful libraries for handling multi-dimensional arrays and matrices. It is also used in combination with the pandas library to perform data analysis.
The Python os module is built in, so you don't have to install it. To read more about handling files with the os module, this DataCamp tutorial will be helpful.
For visualization, matplotlib is a basic library on which many other libraries build their plotting, including seaborn and wordcloud, which you will use in this tutorial. The pillow library is a package that enables image reading. Its tutorial can be found here. Pillow is a maintained fork of PIL, the Python Imaging Library. You will need this library to read in an image as the mask for the wordcloud.
wordcloud can be a little tricky to install. If you only need it for plotting a basic wordcloud, then pip install wordcloud or conda install -c conda-forge wordcloud would be sufficient. However, the latest version with the ability to mask the cloud into any shape of your choice requires a different method of installation, as below:
git clone https://github.com/amueller/word_cloud.git
cd word_cloud
pip install .
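Before moving on, it can help to confirm that the packages above are importable. The following is a small sketch using the standard library's importlib; the helper name missing_packages is hypothetical. Note that pillow is imported under the name PIL, so that is the name checked here.

```python
from importlib.util import find_spec

# Import names of the packages this tutorial relies on.
# pillow is installed as "pillow" but imported as "PIL".
REQUIRED = ["numpy", "pandas", "matplotlib", "PIL", "wordcloud"]


def missing_packages(names):
    """Return the names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```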
Dataset:
This tutorial uses the wine review dataset from Kaggle. This collection is a great dataset for learning with no missing values (which will take time to handle) and a lot of text (wine reviews), categorical, and numerical data.
Now let's get started!
First things first, load all the necessary libraries:
%%capture
!pip install -r requirements.txt
# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
# % matplotlib inline
If you have more than 10 libraries, organizing them by section (such as basic libs, visualization, models, etc.) with comments will keep your code clean and easy to follow.
Now, use pandas read_csv to load in the dataframe. Notice the use of index_col=0, which tells pandas to use the first column of the file as the row index rather than reading it in as a separate column.
# Load in the dataframe
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)
# Looking at first 5 rows of the dataset
df.head()
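To see what index_col=0 changes, here is a small sketch that reads the same CSV text with and without it. The three-line CSV is a made-up stand-in for the wine file, whose first column is assumed to be an unnamed row index.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column is an unnamed row index
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index_col = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the index
print(list(with_index_col.columns))     # ['country', 'points']

# Without it, pandas keeps the column and names it 'Unnamed: 0'
print(list(without_index_col.columns))  # ['Unnamed: 0', 'country', 'points']
```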
You can print out some basic information about the dataset using print() combined with .format() to get a nicely formatted printout.
print("There are {} observations and {} features in this dataset. \n".format(df.shape[0],df.shape[1]))
print("There are {} types of wine in this dataset such as {}... \n".format(len(df.variety.unique()),
", ".join(df.variety.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}... \n".format(len(df.country.unique()),
", ".join(df.country.unique()[0:5])))
df[["country", "description","points"]].head()
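As an alternative to .format(), the same kind of summary can be written with f-strings. The sketch below runs on a tiny made-up dataframe (the name toy and its values are hypothetical) and also shows one subtlety: len(series.unique()) counts a missing value as its own entry, while series.nunique() ignores it.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine review dataframe
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", None],
    "variety": ["White Blend", "Portuguese Red", "Pinot Gris", "Riesling", "Riesling"],
})

n_rows, n_cols = toy.shape
print(f"There are {n_rows} observations and {n_cols} features in this dataset.")

# unique() keeps the missing value as an entry; nunique() skips it
n_with_missing = len(toy["country"].unique())
n_without_missing = toy["country"].nunique()
print(f"unique(): {n_with_missing}, nunique(): {n_without_missing}")

# Drop missing values before joining, since join() only accepts strings
examples = ", ".join(toy["country"].dropna().unique()[:2])
print(f"Countries include {examples}...")
```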
To make comparisons between groups of a feature, you can use groupby() and compute summary statistics.
With the wine dataset, you can group by country and look at either the summary statistics for all countries' points and price or select the most popular and expensive ones.
# Group by country
country = df.groupby("country")
# Summary statistic of all countries
country.describe().head()
This selects the 5 countries with the highest average points among all 44 countries:
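# numeric_only=True averages only the numeric columns (points, price);
# recent pandas versions raise an error if text columns are included
country.mean(numeric_only=True).sort_values(by="points",ascending=False).head()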
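The group-then-rank pattern is easier to follow on a dataframe small enough to check by hand. This is a sketch on made-up data (the dataframe toy and its values are hypothetical), passing numeric_only=True so that the text column is skipped when averaging.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine review dataframe
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "France"],
    "points": [88, 90, 85, 87, 92],
    "price": [20.0, 30.0, 15.0, 25.0, 60.0],
    "description": ["a", "b", "c", "d", "e"],
})

by_country = toy.groupby("country")

# Average the numeric columns per country, then rank by average points
avg = by_country.mean(numeric_only=True).sort_values(by="points", ascending=False)
print(avg.head(2))
```

Here France (92.0) ranks first and Italy (89.0) second, which you can verify directly from the five rows.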
You can plot the number of wines by country using the plot method of the pandas DataFrame and Matplotlib. If you are not familiar with Matplotlib, I suggest taking a quick look at this tutorial.