Data Scientist Associate Practical Exam Submission
Use this template to complete your analysis and write up your summary for submission.
Task 1
The Electric Mopeds dataset contains 8 columns and 1,500 rows. Upon initial analysis, there are 150 missing values in the web browser column.
- Owned: Same as the description taking the form of a dummy variable with two values, 0 or 1. No missing values.
- Make Model: Same as the description, with no missing values and 6 unique values.
- Review Month: Values don't match the description; formats range from the short month name alone to the short month name with a day. There are 12 different month names but 332 unique values. To clean, I extracted the last 3 characters of each string (the month abbreviation that follows the "-"), then confirmed that there are now 12 months in short format and no missing values.
- Web Browser: Same as description with 150 missing values, 6 different web browsers. Almost half of the observations belong to Chrome. The missing values in this column were replaced with "unknown"
- Reviewer Age: This column was a string datatype and needed to be an integer to satisfy the discrete variable type. Although there were no nulls, 105 observations held the placeholder '-', which I replaced with the average age, which was just over 30. I also validated that the minimum age was indeed 16. There are 36 unique ages.
- Primary Use: Same as the description with no missing values, 2 primary uses.
- Value for Money: Different from the description, as the data type is a string but needs to be a discrete variable. I cleaned this column by splitting each string on the "/" and assigning the first element to the value_for_money column as an integer. There are no null values and 10 unique ratings.
- Overall Rating: Same as the description, without missing values. The datatype is float, which satisfies the description of being a continuous variable.
After completing the data validation process, the dataset contains 1500 rows, 8 columns, and no null values.
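The checks listed above can also be written as assertions so they are easy to re-run. This is only a minimal sketch: `validate_cleaned` is a hypothetical helper, the `owned` column name is an assumption, and the toy DataFrame is illustrative data rather than the real dataset.

```python
import pandas as pd

# The 12 short-format month names expected after cleaning
MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

def validate_cleaned(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("null values remain")
    if not set(df["owned"].unique()) <= {0, 1}:
        problems.append("owned has values other than 0 and 1")
    if not set(df["review_month"].unique()) <= MONTHS:
        problems.append("review_month has non-month values")
    if df["reviewer_age"].min() < 16:
        problems.append("reviewer_age below 16")
    if not pd.api.types.is_integer_dtype(df["value_for_money"]):
        problems.append("value_for_money is not an integer column")
    return problems

# Illustrative rows only (not taken from the real file)
toy = pd.DataFrame({
    "owned": [1, 0, 1],
    "review_month": ["Jan", "Oct", "Dec"],
    "web_browser": ["Chrome", "unknown", "Firefox"],
    "reviewer_age": [16, 32, 45],
    "value_for_money": [5, 7, 10],
})
print(validate_cleaned(toy))
```

Returning a list of messages instead of raising on the first failure makes it possible to see every failed check in one pass.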
#EDA and Validation of Data
#Checking each individual variable
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import numpy as np
Loading the Data & Understanding the Variables Initially
#loading the data
df = pd.read_csv("electric_bike_ratings_2212.csv")
df.info()
#Understanding variables
variables_df = pd.DataFrame(columns = ['Variable','# of unique values','Values'])
for i, var in enumerate(df.columns):
    print(i,var)
    variables_df.loc[i] = [var,df[var].nunique(),df[var].unique().tolist()]
print(variables_df)
print(variables_df[['Variable','# of unique values']])
#view first 10 rows of data
display(df.head(10))
Checking missing values in each of the 8 columns
df.isna().sum()
Analyze the Categorical Variables
col_list = ['make_model','review_month','web_browser','primary_use']
for item in col_list:
    print(df[item].value_counts())
Cleaning review_month and web_browser columns
#Extract the last 3 characters of the review_month column and check value_counts to ensure only 12 distinct values exist
df['review_month'] = (df.review_month.str[-3:])
print(df['review_month'].value_counts())
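To illustrate why slicing the last three characters collapses the mixed formats, here is a small sketch. The raw values are hypothetical examples of the two formats described in the summary (month alone, and day followed by "-" and month); they are not copied from the dataset.

```python
import pandas as pd

# Hypothetical raw values: some with a day prefix, some without
raw = pd.Series(["Oct", "21-Oct", "3-Jan", "Jan"])

# Same slicing as used on df['review_month']
by_slice = raw.str[-3:]

# Alternative: split on "-" and keep the last piece
by_split = raw.str.split("-").str[-1]

print(by_slice.tolist())   # ['Oct', 'Oct', 'Jan', 'Jan']
print(by_slice.nunique())  # 2
print(by_slice.equals(by_split))
```

Slicing works only if every value ends in a three-letter abbreviation; splitting on "-" would be the safer choice if some months were spelled out in full.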
#Checking nulls exclusively for the web_browser column
print("There are ",df.web_browser.isna().sum()," null values in the web_browser column")
df['web_browser'] = df['web_browser'].fillna("unknown")
#Validate that null values in the web_browser were filled with "unknown"
df.info()
df.web_browser.value_counts()
Convert Value_for_Money & Reviewer_Age to Int Type & Analyze All Numerical Values
#First keep the part of the string to the left of the '/' and convert it to a discrete (integer) variable
df['value_for_money'] = (df.value_for_money.str.split('/').str[0]).astype(int)
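As a quick illustration of the split-and-cast step, the sketch below applies the same expression to a few made-up ratings. The "score/10" format is an assumption based on the description of the column; the exact denominator in the real data is not shown here.

```python
import pandas as pd

# Hypothetical raw ratings in an assumed "score/10" string format
raw = pd.Series(["5/10", "10/10", "7/10"])

# Split on "/", keep the first element, and cast to integer
scores = raw.str.split("/").str[0].astype(int)

print(scores.tolist())  # [5, 10, 7]
print(scores.dtype)
```

The cast to `int` raises an error if any value lacks a numeric first element, which doubles as a check that the column contains no unexpected entries.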
#Analyze reviewer_age object before conversion to discrete variable, noticed '-' values
print(df['reviewer_age'].value_counts())
#Convert the column to numeric; the '-' placeholder becomes a missing value (NaN)
df['reviewer_age'] = pd.to_numeric(df['reviewer_age'], errors='coerce')
#Compute the average from the valid ages only, so the placeholders do not pull it down
average_age = round(df['reviewer_age'].mean())
#Fill the missing ages with the rounded average and convert the column to integer type
df['reviewer_age'] = df['reviewer_age'].fillna(average_age).astype(int)
#Describe numerical variables & confirm that the reviewer_age column replacement worked with the average
print(df.describe())
print(df['reviewer_age'].value_counts())
print(df.reviewer_age.min())
df.info()
Task 2
In the first graph, Count of Owned, category 1 ("owns the moped") has the most observations, with 890. The remaining 610 observations come from reviewers who do not own the moped. The visual below therefore shows that the observations are not balanced across the categories of the owned variable.
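The imbalance can be quantified as proportions. The sketch below rebuilds a column with the reported counts (890 owners, 610 non-owners) rather than reading the file; the series name `owned` is assumed to match the dataset's column.

```python
import pandas as pd

# Rebuild the owned column from the counts reported above
owned = pd.Series([1] * 890 + [0] * 610, name="owned")

counts = owned.value_counts()
shares = owned.value_counts(normalize=True).round(3)

print(counts.loc[1], counts.loc[0])  # 890 610
print(shares.loc[1], shares.loc[0])  # 0.593 0.407
```

Roughly 59% of the reviews come from owners and 41% from non-owners.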
Analyzing the Owned and Review Columns