The General Social Survey (GSS)
About Dataset The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.
The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations. (Source)
This dataset is a CSV version of the Cumulative Data File, a cross-sectional sample of the GSS from 1972 to the present.
https://www.kaggle.com/datasets/norc/general-social-survey?select=gss.csv
# Import packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
# Read in csv as a DataFrame and preview it
df = pd.read_csv('gss_sub.csv')
df
Data Validation and Cleaning
df.info()
Above we see that our DataFrame contains a float64 column (numerical data), as well as a number of object columns. Object columns here contain strings.
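If you want to confirm exactly which columns fall into each group, select_dtypes can list them; a quick sketch:
# List the numeric and object (string) columns separately
print(df.select_dtypes(include='number').columns.tolist())
print(df.select_dtypes(include='object').columns.tolist())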
Inspecting individual columns
# Inspect the distribution of responses in the environment column
df['environment'].value_counts(normalize=True)
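Since plotly.express is already imported, one way to eyeball these proportions is a bar chart. A minimal sketch (the response and proportion labels below are names we assign, not columns from the dataset):
# Plot the share of responses to the environment spending question
env_props = df['environment'].value_counts(normalize=True).reset_index()
env_props.columns = ['response', 'proportion']
px.bar(env_props, x='proportion', y='response', orientation='h',
       title='Spending on the environment: share of responses')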
Manipulating categorical data
- The categorical variable type can be useful here for several reasons:
- It saves memory when a column contains only a few distinct values.
- You can specify a precise order for the categories when the default (alphabetical) order is not meaningful (see the sketch after this list).
- It is compatible with many other Python libraries.
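For the ordering point, here is a minimal sketch with made-up labels (not necessarily the exact GSS response codes) showing how an ordered categorical behaves:
# Build an ordered categorical so comparisons follow a meaningful order,
# not alphabetical order
example = pd.Series(['TOO LITTLE', 'ABOUT RIGHT', 'TOO MUCH', 'TOO LITTLE'])
ordered = pd.Categorical(example,
                         categories=['TOO LITTLE', 'ABOUT RIGHT', 'TOO MUCH'],
                         ordered=True)
print(ordered.min(), '->', ordered.max())  # TOO LITTLE -> TOO MUCH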
# Create a dictionary of column and data type mappings
conversion_dict = {k: 'category' for k in df.select_dtypes(include='object').columns}
conversion_dict
# Convert our DataFrame and check the data types
df = df.astype(conversion_dict)
df.info()
Already we can see that the memory usage of the DataFrame has been halved, from 7 MB to 4 MB! This can make a real difference when working with large datasets such as this survey.
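To quantify the savings yourself, memory_usage(deep=True) gives per-column byte counts; a rough sketch of the check after conversion:
# Total deep memory usage of the DataFrame, in MB
mem_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f'Memory usage after conversion: {mem_mb:.1f} MB')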
Cleaning up the labor_status column
# Preview the existing categories in the labor_status column
df['labor_status'].cat.categories
We want to collapse some of these categories. The easiest way to do this is to replace the values inside the column using a dictionary, and then reset the data type back to a category.
# Create a dictionary of categories to collapse
new_labor_status = {"UNEMPL, LAID OFF": "UNEMPLOYED",
"TEMP NOT WORKING": "UNEMPLOYED",
"WORKING FULLTIME": "EMPLOYED",
"WORKING PARTTIME": "EMPLOYED"
}
# Replace the values in the column and reset as a category
df['labor_status_clean'] = df['labor_status'].replace(new_labor_status).astype('category')
# Preview the new column
df['labor_status_clean'].value_counts()
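As a quick sanity check (a sketch, not part of the original analysis), cross-tabulating the old and new columns confirms that each original label maps to the intended collapsed category:
# Verify the mapping between original and collapsed labor status labels
pd.crosstab(df['labor_status'], df['labor_status_clean'])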
Reordering categories