📝 Task List
Your written report should include code, output, and written summaries of the following:
- Data Validation:
  - Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
  - Include two different graphics showing single variables only to demonstrate the characteristics of data
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings
- Model Development
  - Include your reasons for selecting the models you use as well as a statement of the problem type
  - Code to fit the baseline and comparison models
- Model Evaluation
  - Describe the performance of the two models based on an appropriate metric
- Business Metrics
  - Define a way to compare your model performance to the business
  - Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake
Good Day.
Data Validation
The first step is to validate the data provided so that the analysis is not distorted by incorrect or missing values. The dataset has 947 rows and 9 columns. After reviewing every variable, I found missing values and some inconsistent category labels, so I cleaned the dataset before proceeding with further analysis. After cleaning, 895 rows remain. The analysis of each column follows:
- recipe: identifier of each recipe. Correct integer type, no missing values, and no repeated values. No cleaning needed.
- calories: correct float type, but 52 values are missing. Since this is only about 5.5% of the rows (52 of 947), I dropped those rows so they do not interfere with further analysis. No negative values found.
- carbohydrate: correct float type, but 52 values are missing. Since this is only about 5.5% of the rows, I dropped those rows. No negative values found.
- sugar: correct float type, but 52 values are missing. Since this is only about 5.5% of the rows, I dropped those rows. No negative values found.
- protein: correct float type, but 52 values are missing. Since this is only about 5.5% of the rows, I dropped those rows. No negative values found.
- category: correct object type, but it had 11 groups instead of 10. I changed 'Chicken Breast' to 'Chicken' to unify the categories, since they can be considered the same. After the change it has 10 categories and no missing values.
- servings: correct object type, but 2 groups were not consistent with a plain number of servings. I changed '4 as a snack' to '4' and '6 as a snack' to '6'. After the changes it has 4 categories and no missing values.
- high_traffic: correct object type, but the only value present was 'High' and many rows were NaN. I assumed the missing entries mean low traffic and replaced every NaN with 'Low'. After the change it has 2 categories and no missing values.
Finally, note that the missing values in calories, carbohydrate, sugar and protein occur in the same 52 rows, so dropping them removes about 5.5% of the data in total (947 to 895 rows), not 5.5% per column.
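The cleaning steps listed above can be sketched in pandas as follows. This is a minimal illustration and not the code used for the report: the rows in `raw` are made-up values, and the helper name `clean_recipes` is hypothetical. Only the column names and the replacement rules come from the description above.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the recipe data (values are invented for illustration).
raw = pd.DataFrame({
    "recipe": [1, 2, 3, 4, 5],
    "calories": [35.5, np.nan, 914.3, 97.0, 27.1],
    "carbohydrate": [38.6, np.nan, 42.7, 30.6, 1.9],
    "sugar": [0.7, np.nan, 3.1, 38.6, 0.8],
    "protein": [0.9, np.nan, 2.9, 0.02, 0.5],
    "category": ["Potato", "Pork", "Chicken Breast", "Beverages", "Chicken"],
    "servings": ["4", "6", "4 as a snack", "6 as a snack", "2"],
    "high_traffic": ["High", "High", np.nan, "High", np.nan],
})


def clean_recipes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules described in the column list (hypothetical helper)."""
    nutrition = ["calories", "carbohydrate", "sugar", "protein"]
    # Drop rows where any nutritional value is missing.
    out = df.dropna(subset=nutrition).copy()
    # Merge the extra category into its parent category.
    out["category"] = out["category"].replace({"Chicken Breast": "Chicken"})
    # Keep only the number of servings.
    out["servings"] = out["servings"].str.replace(" as a snack", "", regex=False)
    # Assumption from the report: a missing traffic flag means low traffic.
    out["high_traffic"] = out["high_traffic"].fillna("Low")
    return out


clean = clean_recipes(raw)

# Validation checks mirroring the ones described above.
no_duplicates = clean["recipe"].is_unique
no_negatives = bool((clean[["calories", "carbohydrate", "sugar", "protein"]] >= 0).all().all())
print(clean.shape)  # (4, 8)
print(no_duplicates, no_negatives)
```

Dropping on the four nutritional columns together removes each affected row once, which matches the observation that the missing values share the same rows.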
Data Scientist Professional Practical Exam Submission - Daniel Bustos
Use this template to write up your summary for submission. Code in Python or R needs to be included.
# Start coding here...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
#Initial check of the data set
df=pd.read_csv('INFO_LTE.csv', header=0, sep="|")
df.info()
df.describe()
df['PERIODO'].unique()
# Count the numbers of samples per "CATEGORIA"
df['CATEGORIA'].value_counts()
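# Filter out categories with fewer than 100 samples
category_counts = df['CATEGORIA'].value_counts()
categories_to_keep = category_counts[category_counts >= 100].index
# .copy() makes df_filtered an independent frame, so the column edits below do not raise SettingWithCopyWarning
df_filtered = df[df['CATEGORIA'].isin(categories_to_keep)].copy()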
df_filtered['CATEGORIA'].value_counts()
# Shorten the "CATEGORIA" labels by removing the plan-name text.
# regex=False makes every pattern a literal string; this matters for "+", which is a regex quantifier.
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("POSTPAGO VIVA WIFI", "", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("SERVICIO VIVA WIFI", "", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("PLAN ILIMITADO", "", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("VIVA WIFI", "", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("PREPAGO ", "", regex=False)
# Turn "+ 30 MIN" into "XX XX" in two steps, then remove it
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("30 MIN", "XX", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("+", "XX", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("XX XX", "", regex=False)
df_filtered['CATEGORIA'] = df_filtered['CATEGORIA'].str.replace("POSTPAGO ", "POSTPAGO", regex=False)
df_filtered['CATEGORIA'].value_counts()
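The label cleanup above depends on literal substring replacement and on the order of the replacements. A minimal sketch on made-up labels shows both points; the strings in `labels` and the `removals` list are illustrative assumptions, not values taken from INFO_LTE.csv.

```python
import pandas as pd

# Made-up labels in the style of the CATEGORIA column (not taken from the real file).
labels = pd.Series([
    "POSTPAGO VIVA WIFI 4 MBPS",
    "PREPAGO VIVA WIFI 2 MBPS",
    "VIVA WIFI 6 MBPS + 30 MIN",
])

# Order matters: the longer phrase goes first, otherwise removing "VIVA WIFI"
# would leave a stray "POSTPAGO" behind.
removals = ["POSTPAGO VIVA WIFI", "VIVA WIFI", "PREPAGO ", "+ 30 MIN"]

cleaned = labels
for text in removals:
    # regex=False treats "+" as a plain character rather than a regex quantifier.
    cleaned = cleaned.str.replace(text, "", regex=False)

# Trim the leftover spaces at both ends of each label.
cleaned = cleaned.str.strip()
print(cleaned.tolist())  # ['4 MBPS', '2 MBPS', '6 MBPS']
```

The final `str.strip()` is an addition in this sketch. The plotting code in this report lists labels with a leading space in `plot_order`, so it would need matching changes if the labels were stripped.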
# Plotting the filtered dataframe to show GB Consumed by Category with outliers
plot_order=[' 1 MBPS',' 2 MBPS', ' 4 MBPS',' 6 MBPS',' 10 MBPS.',' 12 MBPS',' 15 MBPS',' 16 MBPS',' 17 MBPS',' 22 MBPS',' 23 MBPS',' POSTPAGO']
plt.figure(figsize=(12, 10))
sns.boxplot(x="CATEGORIA", y="GB_CONSUMIDOS", data=df_filtered, orient='v', order=plot_order, showfliers=True, showmeans=True)
plt.axhline(1000)
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.title('GB Consumed by Category - Filtered Data')
plt.show()
# Plotting the filtered dataframe to show GB Consumed by Category with NO outliers
plot_order=[' 1 MBPS',' 2 MBPS', ' 4 MBPS',' 6 MBPS',' 10 MBPS.',' 12 MBPS',' 15 MBPS',' 16 MBPS',' 17 MBPS',' 22 MBPS',' 23 MBPS',' POSTPAGO']
plt.figure(figsize=(12, 10))
sns.boxplot(x="CATEGORIA", y="GB_CONSUMIDOS", data=df_filtered, orient='v', order=plot_order, showfliers=False, showmeans=True)
plt.axhline(1000)
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.title('GB Consumed by Category - Filtered Data - No outliers')
plt.show()
# Plotting the filtered dataframe as a strip plot to show every individual GB Consumed value by Category
plot_order=[' 1 MBPS',' 2 MBPS', ' 4 MBPS',' 6 MBPS',' 10 MBPS.',' 12 MBPS',' 15 MBPS',' 16 MBPS',' 17 MBPS',' 22 MBPS',' 23 MBPS',' POSTPAGO']
plt.figure(figsize=(12, 10))
sns.stripplot(y="GB_CONSUMIDOS", data=df_filtered, x='CATEGORIA', order=plot_order)
plt.axhline(1000)
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.title('GB Consumed by Category - Filtered Data')
plt.show()
# create df only for February 2024
# df_filtered_feb = df_filtered[df_filtered['PERIODO'] == '202402']
# df_filtered_feb.info()
df_filtered.info()
# The February-only subset is commented out above, so summarize the full filtered data
df_filtered.describe()
df_filtered['PERIODO'].unique()