Data Scientist: Recipe Site Traffic Prediction
Hey!
I have an analysis task for you from the sales team. You can see the background and request in the email below. They were quite skeptical about how we could help them, so this is a great opportunity to start to improve processes and make the team more efficient. I would like you to perform the analysis and write a short report for me. I want to be able to review your code as well as read your thought process for each step. I also want you to prepare and deliver the presentation for the sales team - you are ready for the challenge!
The goal of the task is to predict the popularity of recipes that will be displayed on the organization's website to improve user experience.
Develop a model that predicts popular recipes with 80% accuracy while minimizing the display of unpopular recipes.
Find attached the data and more details about what I expect you to do.
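A note on the target metric: "minimizing display of unpopular recipes" maps naturally to precision on the high-traffic class — of the recipes the model flags as popular, how many truly are. A minimal sketch with made-up labels (not from the dataset), using scikit-learn's precision_score:

```python
from sklearn.metrics import precision_score

# Hypothetical true labels (1 = high traffic) and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# Precision = of the recipes we flagged as popular, how many truly were
prec = precision_score(y_true, y_pred)
print(f"Precision: {prec:.2f}")  # Precision: 0.80
```

Hitting the 80% goal therefore means keeping this score at or above 0.80, not just overall accuracy.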
IMPORT LIBRARIES
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
DATA VALIDATION
# Load the CSV file into a pandas DataFrame
df = pd.read_csv('recipe_site_traffic_2212.csv')
# Display the first few rows of the DataFrame
display(df.head())
# Display summary information about the DataFrame
display(df.info())
# Display descriptive statistics of the DataFrame
display(df.describe())
# Check for duplicated rows
duplicates = df.duplicated()
# Count the total number of duplicates
total_duplicates = duplicates.sum()
# Print the total number of duplicate rows
print(f"Total duplicates: {total_duplicates}")
Data Summary and Duplicates Check
Display Summary and Statistics
- Summary information about the dataset is displayed.
- Descriptive statistics are shown.
Check for Duplicates
- We check the sum of duplicated rows in the DataFrame; zero was returned, indicating there are no duplicate rows.
# Print unique values of the 'servings' column
print(df['servings'].unique())
# Print unique values of the 'category' column
print('category unique values: ', df['category'].unique())
# Print unique values of the 'high_traffic' column
print('high_traffic unique values: ', df['high_traffic'].unique())
Print Unique Values:
- The code prints the unique values of the 'servings', 'category', and 'high_traffic' columns.
- This helps us understand the distinct values present in these columns.
- We discovered strings in the 'servings' column ('4 as a snack', '6 as a snack') that need to be cleaned.
- We also discovered a 'Chicken Breast' category which, according to the data description, should not be there.
# Replace 'Chicken Breast' with 'Chicken' in the 'category' column
df['category'] = df['category'].replace('Chicken Breast', 'Chicken')
print('category unique values: ', df['category'].unique())
# Convert 'servings' column to strings and clean values
df['servings'] = df['servings'].astype(str)
df['servings'] = df['servings'].str.replace('as a snack', '')
df['servings'] = df['servings'].str.strip()
df['servings'] = df['servings'].astype('int64')
# Print unique values of the 'servings' column and its data type
print(df['servings'].unique())
df['servings'].dtype
Explanation:
Replace 'Chicken Breast' in 'category' Column:
- The code replaces occurrences of 'Chicken Breast' with 'Chicken' in the 'category' column.
- This adjustment maintains consistency and aligns with the data description.
- Unique values of the modified 'category' column are displayed.
Convert and Clean 'servings' Column:
- The 'servings' column is converted to strings for uniform handling.
- Values containing 'as a snack' are cleaned with str.replace(), and leading/trailing spaces are stripped.
- The column is converted back to integers for accurate numerical analysis.
Print Cleaned 'servings' Values and Data Type:
- Unique values of the cleaned 'servings' column are printed.
- The data type of the 'servings' column is displayed to confirm that the changes have been applied.
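As a design note, the two replace/strip steps could also be collapsed into a single regex extraction, which would guard against other unexpected trailing text as well. A sketch on a made-up Series mirroring the raw values:

```python
import pandas as pd

# Made-up sample mirroring the raw 'servings' values
servings = pd.Series(['1', '2', '4 as a snack', '6 as a snack'])

# Extract the leading digits and cast to integer in one pass
cleaned = servings.str.extract(r'(\d+)', expand=False).astype('int64')
print(cleaned.tolist())  # [1, 2, 4, 6]
```

With expand=False, str.extract returns a Series of the captured digits, so a single astype completes the cleaning.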
# Convert 'High' to 1 and NaN to 0
df['high_traffic'] = df['high_traffic'].apply(lambda x: 1 if x == 'High' else 0)
# Convert 'category' and 'high_traffic' columns to categorical data type
df[['category', 'high_traffic']] = df[['category', 'high_traffic']].astype('category')
# Display 'high_traffic' column
display(df['high_traffic'])
# Display summary information about the DataFrame
df.info()
Explanation:
Convert 'High' to 1 and NaN to 0:
- The code uses the apply() function to convert 'High' values to 1 and NaN values to 0 in the 'high_traffic' column, preparing the column for modelling.
Convert Columns to Categorical Data Type:
- The 'category' and 'high_traffic' columns are converted to the categorical data type using astype().
Display 'high_traffic' Column:
- The 'high_traffic' column is displayed to observe the conversion changes.
Display Summary Information:
- Summary information about the DataFrame is displayed using info() to show data types and non-null counts of each column.
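As an aside, the same encoding can be done without apply(): Series.eq('High') returns False for NaN, so casting the comparison to int yields the 1/0 column directly and is fully vectorized. A sketch on a made-up Series:

```python
import pandas as pd
import numpy as np

# Made-up sample mirroring the raw 'high_traffic' values
high_traffic = pd.Series(['High', np.nan, 'High', np.nan])

# eq('High') yields False for NaN, so casting to int gives 1/0 directly
encoded = high_traffic.eq('High').astype(int)
print(encoded.tolist())  # [1, 0, 1, 0]
```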
# Check for columns with empty strings
columns_with_empty_strings = df.columns[df.applymap(lambda x: x == '').any()]
# Print the columns with empty strings
print('columns with empty strings are:\n', columns_with_empty_strings)
print('\n')
# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number])
# Check if numeric columns contain negative values
contains_negative = (numeric_cols < 0).any().any()
if contains_negative:
    print("Numeric columns contain negative values.")
else:
    print("Numeric columns do not contain negative values.")
print('\n')
# Print sum of null values per column
print("Sum of Null Values per Column:\n", df.isna().sum())
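Depending on what the null counts reveal, a common next step is either dropping the affected rows or imputing, for example with a per-category median. A sketch on a small hypothetical frame (the values and column names here are illustrative, not taken from the actual dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with gaps in a numeric column
demo = pd.DataFrame({
    'category': ['Pork', 'Dessert', 'Pork', 'Dessert', 'Pork'],
    'calories': [120.0, np.nan, 300.0, 210.0, np.nan],
})

# Impute with a group-wise median, which respects per-category differences
demo['calories'] = demo['calories'].fillna(
    demo.groupby('category')['calories'].transform('median'))
print(demo['calories'].tolist())  # [120.0, 210.0, 300.0, 210.0, 210.0]
```

Which option is appropriate depends on how many rows are affected and whether the missingness looks random; that judgment belongs in the report.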