Live Training - Working with Categorical Data in Python (Webinar)

Analyzing Categorical Data from the General Social Survey in Python

Welcome to your webinar workspace! In this session, we will introduce you to categorical variables in Python. We will be using a subset of data from the General Social Survey.

The following code block imports the main packages we will be using: pandas, NumPy, and Plotly. We also import Matplotlib, and the mosaic function from statsmodels for a special type of categorical plot.

We will read in our data and preview it as an interactive table. Please follow along with the code and feel free to ask any questions!

# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Read in csv as a DataFrame and preview it
df = pd.read_csv("gss_sub.csv") 
df
df.info()

Above, we see that our DataFrame contains float64 data (numerical) as well as a number of object columns. The object data type is what pandas uses here to store strings.

Inspecting individual columns

To inspect categorical columns, use the .describe() method with the include parameter to select a particular data type (in this case "O", for object). For each selected column, this returns the count of non-missing values, the number of unique values, the most frequent value (top), and how often that value occurs (freq).

df.describe()
df.describe(include="O")
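As a small illustration of what that summary contains, here is a sketch on a made-up DataFrame. The `toy` frame and its `age` and `party` columns are hypothetical and are not taken from gss_sub.csv.

```python
import pandas as pd

# Hypothetical data for illustration only (not from gss_sub.csv)
toy = pd.DataFrame({
    "age": [25.0, 40.0, 31.0, 58.0],
    "party": pd.Series(["dem", "rep", "dem", "ind"], dtype="object"),
})

# Summarize only the object (string) columns
summary = toy.describe(include="O")
print(summary)

# Individual statistics can be looked up by row label and column name
print(summary.loc["top", "party"], summary.loc["freq", "party"])
```

The numeric `age` column is left out of the summary because only object columns were requested.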

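The .value_counts() method gives more insight into the distribution of a column. By default it returns the count of each value. Passing normalize=True returns proportions instead, which we can multiply by 100 to get percentages.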

df["environment"].value_counts()
df["environment"].value_counts(normalize=True)
df["environment"].value_counts(normalize=True)*100
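One detail worth knowing is that .value_counts() leaves out missing values unless told otherwise. The sketch below shows this on a made-up Series; the `responses` data is hypothetical and does not come from the survey file.

```python
import pandas as pd

# Hypothetical responses for illustration only (not from gss_sub.csv)
responses = pd.Series(
    ["agree", "disagree", "agree", None, "neutral", "agree"],
    name="opinion",
)

# Missing values are excluded by default
counts = responses.value_counts()

# dropna=False adds a row for the missing values
counts_with_na = responses.value_counts(dropna=False)

# Proportions of the non-missing values, expressed as percentages
percentages = responses.value_counts(normalize=True) * 100

print(counts)
print(counts_with_na)
print(percentages.round(1))
```

Because the missing value is dropped before normalizing, the percentages here are shares of the five non-missing responses, not of all six rows.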

Manipulating categorical data

Let's convert our object columns to categories

  • The categorical data type can be useful for several reasons:
    • It saves memory when a column contains only a few distinct values.
    • You can specify a precise order for the categories when the default order (e.g., alphabetical) is incorrect.
    • It can be compatible with other Python libraries.
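The first two points can be sketched on a made-up column. Everything below (the `levels` list and the Series built from it) is hypothetical, and the exact memory figures will vary by pandas version and platform.

```python
import pandas as pd

# Hypothetical answers repeated many times (not from gss_sub.csv)
levels = ["low", "medium", "high"]
as_object = pd.Series(levels * 10_000, dtype="object")
as_category = as_object.astype("category")

# Memory: the category version stores each distinct string only once
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# Order: by default the categories are sorted alphabetically
print(list(as_category.cat.categories))

# An ordered categorical type lets us state the intended order explicitly
ordered_type = pd.CategoricalDtype(categories=levels, ordered=True)
as_ordered = as_object.astype(ordered_type)
print(as_ordered.min(), as_ordered.max())
```

With the explicit order in place, comparisons and sorting follow low < medium < high rather than the alphabetical order.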

Let's take our existing categorical variables and convert them from strings to categories. Here, we use .select_dtypes() to return only the object columns, build a dictionary that maps each of those column names to the "category" type, and pass that dictionary to .astype().

# Create a dictionary of column and data type mappings
conversion_dict={k: "category" for k in df.select_dtypes(include="object").columns}
conversion_dict
# Convert our DataFrame and check the data types
df = df.astype(conversion_dict)
df.info()
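To see the effect of this pattern in isolation, here is a sketch on a small made-up DataFrame. The `toy` frame and its column names are hypothetical stand-ins and are not the columns of gss_sub.csv.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (not from gss_sub.csv)
toy = pd.DataFrame({
    "year": [2000.0, 2002.0, 2004.0],
    "degree": pd.Series(["bachelor", "high school", "bachelor"], dtype="object"),
    "marital": pd.Series(["married", "single", "married"], dtype="object"),
})

# Map every object column to the "category" type
conversion = {col: "category" for col in toy.select_dtypes(include="object").columns}
toy = toy.astype(conversion)

# Only the object columns changed; the float column is untouched
print(toy.dtypes)

# The .cat accessor exposes the categories of a converted column
print(toy["degree"].cat.categories)
```

Columns that are not named in the dictionary keep their original type, which is why .astype() with a dictionary is convenient for converting a subset of columns in one step.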