Live Training - Working with Categorical Data in Python (Webinar)
  • The General Social Survey (GSS)

    About the dataset: The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

    The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

    Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations. (Source)

    This dataset is a CSV version of the Cumulative Data File, a cross-sectional sample of the GSS from 1972 to the present.

    https://www.kaggle.com/datasets/norc/general-social-survey?select=gss.csv

    # Import packages
    import pandas as pd
    import numpy as np
    import plotly.express as px
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic
    
    # Read in csv as a DataFrame and preview it
    df = pd.read_csv('gss_sub.csv')
    df

    Data Validation and Cleaning

    df.info()

    Above we see that our DataFrame contains a float64 column (numerical data), as well as a number of object columns. The object data type holds strings.
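As a quick follow-up to the dtype summary above, the sketch below lists the object columns and counts their distinct values, which hints at which columns are good candidates for the categorical type. It uses a small made-up DataFrame (`toy`, with invented values) rather than the real survey file, so the column names and contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "year": [1972.0, 1972.0, 1973.0, 1973.0],
    "degree": pd.Series(["HIGH SCHOOL", "BACHELOR", "HIGH SCHOOL", None], dtype=object),
    "marital_status": pd.Series(["MARRIED", "MARRIED", "DIVORCED", "MARRIED"], dtype=object),
})

# Names of the object (string) columns
object_cols = toy.select_dtypes(include="object").columns.tolist()

# Number of distinct non-missing values in each object column
unique_counts = toy[object_cols].nunique()

print(object_cols)
print(unique_counts)
```

Columns with only a handful of distinct values relative to the number of rows tend to benefit most from conversion.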

    Inspecting individual columns

    df['environment'].value_counts(normalize = True)

    Manipulating categorical data

    • The categorical data type is useful here for several reasons:
      • It saves memory when a column has only a few distinct values.
      • You can specify a precise order for the categories when the default order (e.g., alphabetical) is incorrect.
      • It can be used with other Python libraries.
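The ordering point in the list above can be sketched on a toy Series. The values `LOW`, `MEDIUM`, and `HIGH` are hypothetical and not taken from the survey; the sketch only shows that a plain conversion sorts categories alphabetically, while `pd.CategoricalDtype` with `ordered=True` lets us state the intended order.

```python
import pandas as pd

# Hypothetical toy values, not taken from the GSS data
s_obj = pd.Series(["LOW", "HIGH", "MEDIUM", "LOW"])

# Default conversion: categories are sorted alphabetically
s_cat = s_obj.astype("category")
default_order = s_cat.cat.categories.tolist()

# Explicit, ordered categories
levels = pd.CategoricalDtype(["LOW", "MEDIUM", "HIGH"], ordered=True)
s_ord = s_obj.astype(levels)
explicit_order = s_ord.cat.categories.tolist()

print(default_order)
print(explicit_order)

# With an ordered categorical, min/max and comparisons follow the stated order
lowest, highest = s_ord.min(), s_ord.max()
print(lowest, highest)
```

Because the dtype is ordered, sorting the Series or taking its minimum and maximum respects `LOW < MEDIUM < HIGH` rather than alphabetical order.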
    # Create a dictionary of column and data type mappings
    conversion_dict = {k: 'category' for k in df.select_dtypes(include='object').columns}
    conversion_dict
    
    # Convert our DataFrame and check the data types
    df = df.astype(conversion_dict)
    df.info()

    Already we can see that the memory usage of the DataFrame has dropped from about 7 MB to about 4 MB, a reduction of nearly half! This helps when working with large quantities of data, such as the survey we are working with here.
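To measure the saving directly rather than reading it off `df.info()`, one option is `DataFrame.memory_usage(deep=True)`, which also counts the contents of the strings. By default `df.info()` gives only a lower-bound estimate for object columns (shown with a trailing `+`). The sketch below repeats the conversion pattern on a made-up DataFrame (`toy`, with invented values), so the sizes it prints are not those of the survey data.

```python
import pandas as pd

# Made-up data: one numeric column and one repetitive string column
toy = pd.DataFrame({
    "year": [1972.0, 1973.0, 1974.0, 1975.0] * 10_000,
    "labor_status": pd.Series(
        ["WORKING FULLTIME", "RETIRED", "KEEPING HOUSE", "SCHOOL"] * 10_000,
        dtype=object,
    ),
})

# Total bytes before conversion, including the string contents
before = toy.memory_usage(deep=True).sum()

# Same dictionary pattern as above: map every object column to 'category'
conversion = {k: "category" for k in toy.select_dtypes(include="object").columns}
toy_cat = toy.astype(conversion)

# Total bytes after conversion
after = toy_cat.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

Passing `memory_usage='deep'` to `df.info()` reports the same deep measurement in its summary line.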

    Cleaning up the labor_status column

    df['labor_status'].cat.categories

    Let's collapse some of these categories. The easiest way to do this is to replace the values inside the column using a dictionary, and then reset the data type back to a category.

    # Create a dictionary of categories to collapse
    new_labor_status = {"UNEMPL, LAID OFF": "UNEMPLOYED", 
                        "TEMP NOT WORKING": "UNEMPLOYED",
                        "WORKING FULLTIME": "EMPLOYED",
                        "WORKING PARTTIME": "EMPLOYED"
                       }
    
    # Replace the values in the column and reset as a category
    df['labor_status_clean'] = df['labor_status'].replace(new_labor_status).astype('category')
    
    # Preview the new column
    df['labor_status_clean'].value_counts()
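Some newer pandas releases may emit a deprecation warning when `replace` changes the set of categories on a categorical column. One possible variant, sketched here on made-up values rather than the real survey column, is to convert to `object` first, replace, and then convert back. This is an alternative under that assumption, not necessarily what the webinar used.

```python
import pandas as pd

# Made-up labor status values stored as a categorical Series
status = pd.Series(
    ["WORKING FULLTIME", "WORKING PARTTIME", "UNEMPL, LAID OFF",
     "TEMP NOT WORKING", "RETIRED"],
    dtype="category",
)

# Same mapping as above: several raw labels collapse into two
collapse_map = {
    "UNEMPL, LAID OFF": "UNEMPLOYED",
    "TEMP NOT WORKING": "UNEMPLOYED",
    "WORKING FULLTIME": "EMPLOYED",
    "WORKING PARTTIME": "EMPLOYED",
}

# Work on plain objects, then restore the categorical type
status_clean = status.astype(object).replace(collapse_map).astype("category")

counts = status_clean.value_counts()
print(status_clean.cat.categories.tolist())
print(counts)
```

Labels that are not keys in the mapping, such as `RETIRED` here, pass through unchanged.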
    

    Reordering categories