
A common problem when creating models to generate business value from data is that the datasets can be so large that it can take days for the model to generate predictions. Ensuring that your dataset is stored as efficiently as possible is crucial for allowing these models to run on a more reasonable timescale without having to reduce the size of the dataset.

You've been hired by a major online data science training provider called Training Data Ltd. to clean up one of their largest customer datasets. This dataset will eventually be used to predict whether their students are looking for a new job or not, information that they will then use to direct them to prospective recruiters.

You've been given access to customer_train.csv, which is a subset of their entire customer dataset, so you can create a proof-of-concept of a much more efficient storage solution. The dataset contains anonymized student information, and whether they were looking for a new job or not during training:

| Column | Description |
|--------|-------------|
| `student_id` | A unique ID for each student. |
| `city` | A code for the city the student lives in. |
| `city_development_index` | A scaled development index for the city. |
| `gender` | The student's gender. |
| `relevant_experience` | An indicator of the student's relevant work experience. |
| `enrolled_university` | The type of university course enrolled in (if any). |
| `education_level` | The student's education level. |
| `major_discipline` | The educational discipline of the student. |
| `experience` | The student's total work experience (in years). |
| `company_size` | The number of employees at the student's current employer. |
| `company_type` | The type of company of the student's current employer. |
| `last_new_job` | The number of years between the student's current and previous jobs. |
| `training_hours` | The number of hours of training completed. |
| `job_change` | An indicator of whether the student is looking for a new job (1) or not (0). |
# Start your code here!
import pandas as pd
from pandas.api.types import CategoricalDtype

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
# Loading the data.
ds_jobs = pd.read_csv(filepath_or_buffer="customer_train.csv", encoding="UTF-8")
ds_jobs.head()
ds_jobs.tail()
ds_jobs.sample(30)
print(ds_jobs.shape)
print(ds_jobs.size)
ds_jobs.info()
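# A quick baseline (a sketch, beyond the brief) of per-column memory usage
# before any conversions, so the savings from each step below can be compared
# against it. memory_usage(deep=True) also counts the Python strings held by
# object columns.
baseline_memory = ds_jobs.memory_usage(deep=True)
print(baseline_memory)
print(f"Total memory (deep): {baseline_memory.sum() / 1024 ** 2:.2f} MB")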
# Columns containing integers must be stored as 32-bit integers (int32).
# Solution

# Looking at the DataFrame, the student_id, last_new_job, training_hours
# and job_change columns contain integer-like values.
# A closer examination with the .info() method shows that only
# student_id, training_hours and job_change are stored as int64;
# last_new_job is stored as object.
# Inspecting last_new_job with .unique() and .nunique() reveals
# non-integer values such as '>4', 'never' and NaN. These could be mapped
# to suitable integers (a possible mapping is sketched after the
# inspection below), but for now we leave the column as-is.

# Finding the number of unique values of last_new_job column
print("Number of unique values contined in the last_new_job column:", ds_jobs.last_new_job.nunique())


# What are the unique values in the last_new_job column?
print("List of unique values contined in the last_new_job column:", ds_jobs.last_new_job.unique())
# So we will change only the student_id, training_hours, job_change columns from int64 to int32

ds_jobs['student_id'] = ds_jobs['student_id'].astype('int32')

ds_jobs['training_hours'] = ds_jobs['training_hours'].astype('int32')

ds_jobs['job_change'] = ds_jobs['job_change'].astype('int32')

ds_jobs.info()

# After calling .info() on the DataFrame, we see that the dtype of these
# columns has changed from int64 to int32, and memory usage has been reduced
# from 2.0+ MB to 1.8+ MB, a reduction of roughly 10%.
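# A side note (a sketch, beyond the brief): pandas can also pick the smallest
# safe integer type automatically. pd.to_numeric with downcast='integer'
# chooses int8/int16/int32 from the value range, which may downcast further
# than int32 for small-valued columns. The preview below leaves ds_jobs as-is.
downcast_preview = ds_jobs[['student_id', 'training_hours', 'job_change']].apply(
    pd.to_numeric, downcast='integer'
)
print(downcast_preview.dtypes)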
# Columns containing floats must be stored as 16-bit floats (float16).
# Column to change: city_development_index (19158 non-null, float64).

ds_jobs['city_development_index'] = ds_jobs['city_development_index'].astype('float16')

ds_jobs.info()

# After calling .info() on the DataFrame, we see that the dtype of
# city_development_index has changed from float64 to float16, and memory
# usage has been reduced from 1.8+ MB to 1.7+ MB, a reduction of roughly 6%.
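# A sanity check (a sketch, beyond the brief) of the precision lost by the
# float16 conversion: float16 keeps roughly three significant decimal digits,
# which is plenty for an index on a 0-1 scale. Re-read the original column
# and compare it against the converted values.
original_cdi = pd.read_csv("customer_train.csv", usecols=["city_development_index"])
max_rounding_error = (
    original_cdi["city_development_index"]
    - ds_jobs["city_development_index"].astype("float64")
).abs().max()
print(f"Maximum absolute rounding error after float16 conversion: {max_rounding_error:.6f}")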
# Columns containing nominal categorical data must be stored as the category data type

# Which columns contain categorical data?
# Solution: iterate over all columns and count the unique values in each
# to identify good candidates for conversion to the category dtype.

columns = ds_jobs.columns
print(f"{'Column Name':<30}Number of Unique Values")
possible_categorical = []
for col in columns:
    n_val = ds_jobs[col].nunique()
    print(f"{col:<30}{n_val}")
    if n_val <= 10:
        possible_categorical.append(col)

print('Possible Categorical Columns:')
for p_cat in possible_categorical:
    print(p_cat)
# What are the unique values of the entries in the possible categorical columns?
for c4c in possible_categorical:
    print('Unique Values for {}: {}'.format(c4c, ds_jobs[c4c].unique()))
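# A sketch (beyond the brief) of why these columns benefit from the category
# dtype: a category column stores each distinct label once plus a small
# integer code per row, so the saving is largest when the number of unique
# values is far below the number of rows.
n_rows = len(ds_jobs)
for c4c in possible_categorical:
    ratio = ds_jobs[c4c].nunique() / n_rows
    print(f"{c4c:<25}unique/rows ratio: {ratio:.4f}")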
# `gender`, `relevant_experience`, `enrolled_university`, `education_level`,
# `major_discipline`, `company_size`, `company_type`, `last_new_job` and
# `job_change` are good candidates for the category dtype.
# We already changed job_change from int64 to int32, so we leave it as-is
# for now; whether to convert it further from int32 to category is a design
# decision (e.g. any additional memory saving) that we would confirm with an SME.

# ds_jobs['job_change'] = ds_jobs['job_change'].astype("category")
# The step above is commented out: the brief asks for integer columns to be
# stored as int32, and converting job_change further to category was not
# judged worthwhile here.

ds_jobs.info()
# We will focus on the following columns instead:
# `city`, `gender`, `relevant_experience`, `enrolled_university`,
# `education_level`, `major_discipline`, `company_size`, `company_type`
# and `last_new_job`. Although `city` has far more than 10 unique values,
# its codes repeat across many rows, so it is still a nominal categorical
# column and benefits from the category dtype.

ds_jobs['city'] = ds_jobs['city'].astype('category')
ds_jobs['gender'] = ds_jobs['gender'].astype('category')
ds_jobs['relevant_experience'] = ds_jobs['relevant_experience'].astype('category')
ds_jobs['enrolled_university'] = ds_jobs['enrolled_university'].astype('category')
ds_jobs['education_level'] = ds_jobs['education_level'].astype('category')
ds_jobs['major_discipline'] = ds_jobs['major_discipline'].astype('category')
ds_jobs['company_size'] = ds_jobs['company_size'].astype('category')
ds_jobs['company_type'] = ds_jobs['company_type'].astype('category')
ds_jobs['last_new_job'] = ds_jobs['last_new_job'].astype('category')
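# An equivalent, more compact way to express the conversions above
# (idempotent here, since these columns are already category dtype):
nominal_columns = [
    'city', 'gender', 'relevant_experience', 'enrolled_university',
    'education_level', 'major_discipline', 'company_size',
    'company_type', 'last_new_job',
]
for col in nominal_columns:
    ds_jobs[col] = ds_jobs[col].astype('category')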

ds_jobs.info()

# After these changes, memory usage has dropped from 1.7+ MB to 712.7+ KB
# (about 0.7 MB), a reduction of roughly 59%, which is significant.
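# CategoricalDtype was imported at the top but has not been used. As a sketch
# of a possible extension (not required by the brief), ordinal columns such as
# education_level could be stored as ordered categories, which keeps the
# memory saving and also enables sorting and range comparisons. The label list
# below is an assumption: it must match the values printed by the .unique()
# inspection above, otherwise non-matching labels become NaN.
education_order = CategoricalDtype(
    categories=['Primary School', 'High School', 'Graduate', 'Masters', 'Phd'],
    ordered=True,
)
education_ordered = ds_jobs['education_level'].astype(education_order)
print(education_ordered.dtype)
print(education_ordered.cat.categories)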