Data Scientist Professional Case Study
Company Background
Snow Cone Games is a digital retailer for video games. The website includes a simple genre filter as well as more advanced filtering. The product manager has recently learned that only 23% of buyers search for games using the genre filter, while 71% use the advanced filters to find the types of games that appeal to them. The product manager hopes that new categories based on different game attributes can help players find what they are looking for faster.
Customer Question
The product manager would like to know:
- Can you use the data from 40,000 games to identify groupings of games that are most similar to each other?
Success Criteria
Ideally, there would be no more than five groupings. Any more would be much harder to display on the page, so there would need to be a good reason to exceed five.
Dataset
The data you will use for this analysis can be accessed here: "data/video_games.csv"
1. Imports
# Base modules
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from datetime import datetime
# Pre-processing
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
# ML
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
2. Setup
# Seed
seed = 42
np.random.seed(seed)
# Plots
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['figure.autolayout'] = True
plot_title_fontsize = 14
plot_labels_fontsize = 12
sns.set()
# Custom function
def statistical_overview(series: pd.Series, of: float = 1.5, evf: float = 3.0) -> None:
    """Displays a statistical description for a numeric Pandas Series.

    Args:
        series: A Pandas Series object
        of: Outlier Factor (inner fence multiplier)
        evf: Extreme Value Factor (outer fence multiplier)
    """
    total_values = len(series)
    q1 = np.round(series.quantile(0.25), 2)
    q3 = np.round(series.quantile(0.75), 2)
    iqr = q3 - q1
    lower_outer_fence = np.round(q1 - evf * iqr, 2)
    lower_inner_fence = np.round(q1 - of * iqr, 2)
    upper_inner_fence = np.round(q3 + of * iqr, 2)
    upper_outer_fence = np.round(q3 + evf * iqr, 2)
    # Outliers lie between the inner and outer fences; extreme values lie beyond the outer fences
    outliers_count = series[((lower_outer_fence < series) & (series <= lower_inner_fence)) |
                            ((upper_inner_fence < series) & (series <= upper_outer_fence))].count()
    non_outliers_count = total_values - outliers_count
    extreme_values_count = series[(series < lower_outer_fence) |
                                  (series > upper_outer_fence)].count()
    non_extreme_values_count = total_values - extreme_values_count
    print(f"Min: {np.round(np.min(series), 2)}")
    print(f"Max: {np.round(np.max(series), 2)}")
    print(f"Mean: {np.round(np.nanmean(series), 2)}")
    print(f"Median: {np.round(np.nanmedian(series), 2)}")
    print(f"Q1: {q1}")
    print(f"Q3: {q3}")
    print(f"IQR: {iqr}\n")
    print(f"Lower outer fence: {lower_outer_fence}")
    print(f"Lower inner fence: {lower_inner_fence}")
    print(f"Upper inner fence: {upper_inner_fence}")
    print(f"Upper outer fence: {upper_outer_fence}\n")
    print(f"Outliers: {outliers_count}")
    print(f"Non-outliers: {non_outliers_count}")
    print(f"Extreme values: {extreme_values_count}")
    print(f"Non-extreme values: {non_extreme_values_count}\n")
    print(f"Unbiased skew: {np.round(series.skew(), 2)}\n")
    return None
3. Ingest data, check missing values
3.1. Import CSV file
# Read in data
df = pd.read_csv('data/video_games.csv')
# Sample and show 10 random rows from df
display(df.sample(10))
3.2. Inspect DataFrame
# General info
df.info()
print()
# Descriptive statistics
df.describe(include='all')
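The custom statistical_overview helper defined in the setup section can complement describe() with fence and outlier counts per column. A minimal, illustrative sketch (looping over whichever numeric columns the file contains):
# Illustrative sketch: run the custom overview on each numeric column
for col in df.select_dtypes(include=np.number).columns:
    print(f"--- {col} ---")
    statistical_overview(df[col])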
3.3. Inspect missing values
# Function to create custom palette (https://stackoverflow.com/a/60917129/1494932)
def colors_from_values(values, palette_name):
    """Map each value to a palette colour according to its position between the min and max."""
    normalized = (values - min(values)) / (max(values) - min(values))
    indices = np.round(normalized * (len(values) - 1)).astype(np.int32)
    palette = sns.color_palette(palette_name, len(values))
    return np.array(palette).take(indices, axis=0)
# Missing values by column
missing_val = df.isna().sum()
print(f"Missing values by column:\n\n{missing_val}")
# Bar plot to visualize missing values
sns.barplot(x=missing_val.values, y=missing_val.index, orient='h',
            palette=colors_from_values(missing_val, 'Blues_d'))
plt.title('Missing values by column', fontsize=plot_title_fontsize)
plt.xlabel('Number of missing values', fontsize=plot_labels_fontsize)
plt.ylabel('Columns', fontsize=plot_labels_fontsize)
plt.show()
3.4. Remove all rows with missing values
# Drop all rows with NaN values
df_all_nan_dropped = df.dropna()
print('Number of rows remaining after dropping all NaN values indiscriminately: '
      f"{df_all_nan_dropped.shape[0]}")
Following this approach, only 8,840 of the 40,833 rows would remain.
This motivates a different approach, based on a deeper analysis, that aims to salvage and leverage as much data as possible before feeding it (or a subset of it) to a K-Means algorithm for clustering; a rough outline of that pipeline is sketched below.
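The sketch below is illustrative only: the 50% missingness threshold, the median imputation, and the two-component PCA are assumptions to be revisited during the analysis, not final modelling choices. It simply shows how the data could be salvaged and then clustered with no more than five groupings, as the success criteria require.
# Illustrative sketch of the intended pipeline (assumptions flagged in comments)
# Keep numeric attributes and drop columns that are mostly missing
# (the 0.5 threshold is an assumption, to be refined)
numeric = df.select_dtypes(include=np.number)
numeric = numeric.loc[:, numeric.isna().mean() < 0.5]
# Fill the remaining gaps with column medians instead of dropping rows (assumption)
numeric = numeric.fillna(numeric.median())
# Standardise and reduce dimensionality before clustering (two components is an assumption)
scaled = StandardScaler().fit_transform(numeric)
reduced = PCA(n_components=2, random_state=seed).fit_transform(scaled)
# Compare candidate cluster counts, capped at five per the success criteria
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
    score = silhouette_score(reduced, labels, sample_size=5000, random_state=seed)
    print(f"k={k}: silhouette={score:.3f}")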