Analyzing Categorical Data from the General Social Survey in Python
Welcome to your webinar workspace! In this session, we will introduce you to categorical variables in Python. We will be using a subset of data from the General Social Survey.
The following code block imports the main packages we will be using: pandas, NumPy, and Plotly. It also imports Matplotlib and statsmodels, which we will use for a special type of categorical plot.
We will read in our data and preview it as an interactive table. Please follow along with the code and feel free to ask any questions!
# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
# Read in csv as a DataFrame and preview it
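The code for this step is not shown here, so the following is only a minimal sketch of reading a CSV into a DataFrame and previewing it. The rows, the `year` and `degree` columns, and the in-memory text standing in for the real file are all invented for illustration; only `labor_status` is a column name taken from this session.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file: a few invented rows held in memory
csv_text = """year,labor_status,degree
1972,Working fulltime,Bachelor
1972,Retired,High school
1973,Keeping house,High school
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows
print(df.head())
```

With a real file, you would pass its path to `pd.read_csv()` instead of the `StringIO` object.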
Inspecting our data
What types of data are in our dataset?
One of the simplest ways to get an overview of the types of data you are working with is to use the .info() method, which returns a summary of your data, including:
- The column names.
- The number of non-null values per column.
- The data types.
- The memory usage of the DataFrame.
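As a rough illustration of the summary described above, here is a sketch that calls `.info()` on a small toy DataFrame. The values, and every column name other than `labor_status`, are made up; the string columns are created with an explicit `object` dtype to mirror the data described in this session.

```python
import pandas as pd

# Toy stand-in for the survey data (invented values)
df = pd.DataFrame({
    "year": [1972.0, 1972.0, 1973.0, 1974.0],
    "labor_status": pd.Series(
        ["Working fulltime", "Retired", "Keeping house", "Working fulltime"],
        dtype="object",
    ),
    "degree": pd.Series(
        ["High school", "Bachelor", None, "High school"], dtype="object"
    ),
})

# Prints column names, non-null counts, dtypes, and memory usage.
# memory_usage="deep" also measures the strings stored in object columns.
df.info(memory_usage="deep")
```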
Above we see that our DataFrame contains a float64 column (numerical data), as well as a number of object columns. Columns with the object data type typically contain strings.
Inspecting individual columns
To inspect a categorical column, use the .describe() method with the include parameter to select a particular data type (in this case "O", for object). This returns the count, the number of unique values, the mode, and the frequency of the mode.
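A small sketch of this call on toy data may help. The values below are invented, and the shortened category labels are placeholders rather than the survey's real labels.

```python
import pandas as pd

# Toy data (invented values); the string column uses the object dtype
df = pd.DataFrame({
    "year": [1972.0, 1972.0, 1973.0, 1974.0],
    "labor_status": pd.Series(
        ["Working", "Retired", "Working", "Working"], dtype="object"
    ),
})

# Summarize only the object columns: count, unique, top (mode), freq
summary = df.describe(include="O")
print(summary)
```

In the result, `top` is the most frequent value and `freq` is how often it occurs.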
The .value_counts() method can give you greater insight into the distribution and structure of a column.
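For instance, a minimal sketch on an invented column (the labels are placeholders) could look like this. The `normalize` and `dropna` arguments are optional extras not mentioned above; they return proportions and keep missing values in the tally.

```python
import pandas as pd

# Invented values, including one missing response
labor_status = pd.Series(
    ["Working", "Retired", "Working", None, "Working"], dtype="object"
)

# Raw counts per category (missing values are excluded by default)
counts = labor_status.value_counts()
print(counts)

# Proportions instead of counts, with missing values included
shares = labor_status.value_counts(normalize=True, dropna=False)
print(shares)
```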
Manipulating categorical data
Let's convert our object columns to categories
The categorical variable type can be useful, especially here:
- It saves memory when a column has only a few distinct values.
- You can specify a precise order for the categories when the default order (e.g., alphabetical) is incorrect.
- It can be compatible with other Python libraries.
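To make the point about ordering concrete, here is a sketch using an ordered categorical type. The `degree` values and their ranking are assumptions chosen for illustration, not taken from the survey data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented education levels stored as plain strings
degree = pd.Series(
    ["Bachelor", "High school", "Graduate", "High school"], dtype="object"
)

# Define an explicit order; alphabetical order would put "Bachelor" first
degree_type = CategoricalDtype(
    categories=["High school", "Bachelor", "Graduate"], ordered=True
)
degree_cat = degree.astype(degree_type)

# Sorting and min/max now follow the declared order, not the alphabet
sorted_levels = degree_cat.sort_values().tolist()
print(sorted_levels)
print(degree_cat.min(), degree_cat.max())
```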
Let's take our existing categorical variables and convert them from strings to categories. Here, we use .select_dtypes() to return only the object columns, build a dictionary that maps each of those columns to the category type, and pass that dictionary to .astype().
# Create a dictionary of column and data type mappings
# Convert our DataFrame and check the data types
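The two steps above might be sketched as follows. The toy DataFrame and the name `conversion_dict` are assumptions for illustration; the real data has different columns and far more rows.

```python
import pandas as pd

# Toy data with repeated strings so the memory effect is visible
df = pd.DataFrame({
    "year": [1972.0, 1973.0] * 500,
    "labor_status": pd.Series(
        ["Working fulltime", "Retired"] * 500, dtype="object"
    ),
    "degree": pd.Series(["High school", "Bachelor"] * 500, dtype="object"),
})
before = df.memory_usage(deep=True).sum()

# Create a dictionary of column and data type mappings
conversion_dict = {
    col: "category" for col in df.select_dtypes(include="object").columns
}

# Convert our DataFrame and check the data types
df = df.astype(conversion_dict)
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"Memory before: {before} bytes, after: {after} bytes")
```

The byte counts printed by this toy example will not match the figures for the survey data.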
Already we can see that the memory usage of the DataFrame has been nearly halved, from 7 MB to 4 MB! This can help when working with large quantities of data, such as the survey we are working with.
Cleaning up the labor_status column
To analyze the relationship between employment and attitudes over time, we need to clean up the labor_status column. We can preview the existing categories using the .categories attribute, which is available through the column's .cat accessor.
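As a sketch of that preview, assuming the column has already been converted to the category type (the labels below are invented placeholders):

```python
import pandas as pd

# Invented labels standing in for the real labor_status values
labor_status = pd.Series(
    ["Working fulltime", "Retired", "Keeping house", "Working fulltime"],
    dtype="category",
)

# The .cat accessor exposes the categories as an Index
categories = labor_status.cat.categories
print(categories)
```

When categories are inferred from strings like this, pandas stores them in sorted order.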
Let's collapse some of these categories. The easiest way to do this is to replace the values inside the column using a dictionary, and then reset the data type back to a category.
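One way this could look is sketched below. The category labels and the mapping in `collapse_map` are hypothetical, and the explicit conversion to `object` before replacing is a cautious choice on my part, since replacing values directly on a categorical column behaves differently across pandas versions.

```python
import pandas as pd

# Invented labels; the real categories may differ
labor_status = pd.Series(
    ["Working fulltime", "Working parttime", "Retired", "Working fulltime"],
    dtype="category",
)

# Hypothetical mapping from detailed labels to a collapsed label
collapse_map = {
    "Working fulltime": "Working",
    "Working parttime": "Working",
}

# Replace the values as plain strings, then set the type back to category
collapsed = (
    labor_status.astype("object").replace(collapse_map).astype("category")
)
print(collapsed.cat.categories)
```

Values that do not appear in the dictionary, such as "Retired" here, are left unchanged.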