Analyzing Categorical Data from the General Social Survey in Python
Welcome to your webinar workspace! In this session, we will introduce you to categorical variables in Python. We will be using a subset of data from the General Social Survey.
The following code block imports the main packages we will be using: pandas, NumPy, and Plotly. We also import Matplotlib, along with the mosaic function from statsmodels for a special type of categorical plot.
We will read in our data and preview it as an interactive table. Please follow along with the code and feel free to ask any questions!
# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
# Read in csv as a DataFrame and preview it
df = pd.read_csv("gss_sub.csv")
df
Inspecting our data
What types of data are in our dataset?
One of the simplest ways to get an overview of the types of data you are working with is to use the .info() method, which returns a summary of your data, including:
- The column names.
- The number of non-null values per column.
- The data types.
- The memory usage of the DataFrame.
df.info()
Above, we see that our DataFrame contains a float64 column (numerical data), as well as a number of object columns. In pandas, the object data type is typically used for columns of strings.
Inspecting individual columns
By default, the .describe() method summarizes only the numeric columns. To inspect categorical columns, use the include parameter to select a particular data type (in this case "O", for object). This returns the count, the number of unique values, the mode, and the frequency of the mode.
df.describe()
df.describe(include="O")
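As a small illustration of what include="O" returns, here is a minimal sketch on a toy DataFrame. The column values below are hypothetical stand-ins and are not taken from the GSS file; the string column is created with the object data type explicitly so that it mirrors the columns described above.

```python
import pandas as pd

# Hypothetical toy data standing in for the GSS subset
toy = pd.DataFrame({
    "age": [25.0, 40.0, 31.0, 58.0],
    "environment": pd.Series(
        ["Too little", "About right", "Too little", "Too much"],
        dtype="object",
    ),
})

# Summarize only the object (string) columns
summary = toy.describe(include="O")
print(summary)

# The rows of the summary are count, unique, top (the mode), and freq
print(summary.loc["top", "environment"])
```

Only the environment column appears in the summary, because age is numeric.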
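The .value_counts() method gives you greater insight into the distribution of values in a column. Passing normalize=True returns proportions instead of counts, and multiplying the result by 100 converts the proportions to percentages.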
df["environment"].value_counts()
df["environment"].value_counts(normalize=True)
df["environment"].value_counts(normalize=True)*100
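One detail worth knowing is how missing values are treated. The sketch below uses a hypothetical Series (not the actual GSS responses) to show that .value_counts() leaves out missing values unless you pass dropna=False, which also affects the proportions you get from normalize=True.

```python
import numpy as np
import pandas as pd

# Hypothetical responses, including one missing value
responses = pd.Series(
    ["Too little", "About right", "Too little", np.nan, "Too much"],
    dtype="object",
    name="environment",
)

# Counts exclude missing values by default
counts = responses.value_counts()

# Percentages are computed over the non-missing responses only
shares = responses.value_counts(normalize=True) * 100

# Pass dropna=False to count missing values as their own group
with_missing = responses.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```

If a column has many missing values, comparing the two versions is a quick way to see how much the percentages depend on them.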
Manipulating categorical data
Let's convert our object columns to categories
The category data type can be useful here for several reasons:
- It saves memory when a column has only a few distinct values.
- It lets you specify a precise order for the categories when the default order (e.g., alphabetical) is incorrect.
- It can be compatible with other Python libraries.
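The first two points can be checked with a short sketch. The labels and their order below are assumptions chosen for illustration, not values read from the GSS file, and the exact byte counts will vary by platform and pandas version.

```python
import pandas as pd

# Hypothetical column with only three distinct values, repeated many times
labels = ["Too little", "About right", "Too much"]
as_object = pd.Series(labels * 10_000, dtype="object")
as_category = as_object.astype("category")

# Compare memory usage; deep=True counts the strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Without an explicit order, the categories are sorted alphabetically
print(list(as_category.cat.categories))

# An ordered dtype lets us state the order we actually want (assumed here)
ordered_type = pd.CategoricalDtype(categories=labels, ordered=True)
as_ordered = as_object.astype(ordered_type)
print(as_ordered.min(), as_ordered.max())
```

With an ordered category, comparisons, sorting, and methods such as .min() and .max() follow the order you specified rather than alphabetical order.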
Let's take our existing categorical variables and convert them from strings to categories. Here, we use .select_dtypes() to return only the object columns, then build a dictionary that maps each of those column names to the category data type.
# Create a dictionary of column and data type mappings
conversion_dict = {k: "category" for k in df.select_dtypes(include="object").columns}
conversion_dict
# Convert our DataFrame and check the data types
df = df.astype(conversion_dict)
df.info()
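To confirm that a conversion like this touches only the object columns, you could check the data types directly instead of reading the .info() output. The sketch below repeats the same pattern on a hypothetical toy DataFrame; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two object columns
toy = pd.DataFrame({
    "age": [25.0, 40.0, 31.0],
    "environment": pd.Series(["Too little", "About right", "Too much"], dtype="object"),
    "degree": pd.Series(["Bachelor", "High school", "Bachelor"], dtype="object"),
})

# Map every object column to the category data type
conversion_dict = {col: "category" for col in toy.select_dtypes(include="object").columns}
converted = toy.astype(conversion_dict)

# The numeric column is unchanged; the object columns are now categories
print(converted.dtypes)
```

Passing a dictionary to .astype() leaves any column not named in the dictionary as it was.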