Let's Make Some Drinks
But what?!
My task here is to create a program that takes ingredients as input and returns recipes that contain those ingredients.
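To make the goal concrete, here is a minimal sketch of that kind of ingredient lookup. It runs on a tiny made-up dataframe, not the real dataset; the column names 'Cocktail Name' and 'Ingredients' and the helper `find_recipes` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real data; the column names are assumptions.
toy = pd.DataFrame({
    'Cocktail Name': ['Negroni', 'Gin Fizz', 'Daiquiri'],
    'Ingredients': [
        '1 oz Gin, 1 oz Campari, 1 oz Sweet Vermouth',
        '2 oz Gin, 1 oz Lemon Juice, Soda Water',
        '2 oz Rum, 1 oz Lime Juice, 0.75 oz Simple Syrup',
    ],
})


def find_recipes(df, wanted, column='Ingredients'):
    """Return the rows whose ingredient text mentions every item in `wanted`."""
    mask = pd.Series(True, index=df.index)
    for item in wanted:
        # Case-insensitive literal substring match; missing text never matches.
        mask &= df[column].str.contains(item, case=False, regex=False, na=False)
    return df[mask]


matches = find_recipes(toy, ['gin', 'campari'])
print(matches['Cocktail Name'].tolist())
```

Plain substring matching is the simplest thing that could work, but it is loose: searching for 'gin' would also match 'Ginger Beer'. Splitting the ingredient text into individual items first would be stricter.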
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cocktails = pd.read_csv('cocktails.csv')
Cleaning & EDA
I'll start by looking at the dataframe's summary information, checking the data types and the number of missing values in each column.
cocktails.info()
cocktails.head()
Looks like every column has the object data type, so the entries are probably all text with no numeric data. There are also lots of missing values. I won't need to impute any of them, since the data isn't numeric and I won't be conducting any statistical analysis. Now I'll try to create a bar graph to visualize the missing data. Maybe not totally necessary, but good practice at least.
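Before plotting, a quick numeric view of the gaps can help. The sketch below shows one way to get the share of missing values per column; it uses a small made-up dataframe (its columns and values are assumptions), and the same expression should work on any dataframe.

```python
import numpy as np
import pandas as pd

# Toy data with some gaps; names and values are made up for illustration.
toy = pd.DataFrame({
    'Cocktail Name': ['Negroni', 'Gin Fizz', 'Daiquiri', 'Mojito'],
    'Garnish': ['Orange peel', np.nan, 'Lime wheel', 'Mint sprig'],
    'Notes': [np.nan, np.nan, np.nan, 'House favorite'],
})

# isna() gives a boolean frame; the mean of each column is its missing fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the emptiest columns at the top, which makes candidates for dropping easy to spot.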
missing = cocktails.isna().sum()
complete = cocktails.notna().sum()
complete
Make a dataframe showing filled vs. null values
null_v_complete = pd.concat([complete, missing], axis=1)
null_v_complete['Category'] = null_v_complete.index
null_v_complete.columns = ['Filled', 'Null', 'Category']
null_v_complete = null_v_complete[['Category', 'Filled', 'Null']].reset_index(drop=True)
null_v_complete.info()
null_v_complete
Wow, just making this 'simple' dataframe above was incredibly difficult for me, oof. And now I'll try to visualize the missing data in the original dataframe.
Only two columns don't have any null values: the cocktail name and the cocktail ingredients. Thankfully, those are the most important columns. Is there any column that I really don't need at all?
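As an aside, a filled-vs-null summary table like the one built earlier can also be assembled in a single chained expression. This is just one possible alternative, shown on a made-up dataframe (the toy columns are assumptions); it is not necessarily better than the step-by-step version.

```python
import numpy as np
import pandas as pd

# Toy data with some gaps; names and values are made up for illustration.
toy = pd.DataFrame({
    'Cocktail Name': ['Negroni', 'Gin Fizz', 'Daiquiri', 'Mojito'],
    'Garnish': ['Orange peel', np.nan, 'Lime wheel', 'Mint sprig'],
    'Notes': [np.nan, np.nan, np.nan, 'House favorite'],
})

summary = (
    # Both Series share the column names as their index, so they align.
    pd.DataFrame({'Filled': toy.notna().sum(), 'Null': toy.isna().sum()})
    .rename_axis('Category')  # name the index so it becomes a labeled column
    .reset_index()            # move it out of the index into 'Category'
)
print(summary)
```

Building the frame from a dict names the columns up front, so there is no separate renaming or reordering step.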
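# Set the style before creating the figure, otherwise it has no effect on it.
# On matplotlib >= 3.6 the style is named 'seaborn-v0_8-darkgrid';
# older versions use 'seaborn-darkgrid'.
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))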
bar1 = ax.bar(null_v_complete['Category'], null_v_complete['Filled'], label='Filled')
bar2 = ax.bar(null_v_complete['Category'], null_v_complete['Null'], bottom=null_v_complete['Filled'], label='Null')
ax.set_title('Filled vs Null Values by Category')
ax.set_xlabel('Category')
ax.set_ylabel('Count')
# Rotate the existing tick labels instead of re-setting them by hand.
ax.tick_params(axis='x', labelrotation=60)
ax.spines[['top', 'right']].set_visible(False)
ax.bar_label(bar1, label_type='center', color='white', weight='bold')
ax.bar_label(bar2, label_type='center', color='white', weight='bold', padding=3)
ax.axhline(400, color='b', linewidth=1, linestyle='dashed')
ax.legend(loc=2)
plt.show()
So, I looked through the dataset and its info, and the 'Notes' column has a huge number of null values. Also, when I looked at the entries that did have values, the information wasn't really pertinent: things like "this is an anniversary drink" or "credit for the photo to..". So my first task is to drop that column altogether.
The 'Bar/Company' column also has lots of null values, and 'Location' is missing about half of its values. While the bar name and location would be cool, for this example I'm going to remove those columns since there are so many empty values. For the task of filtering the data to use as a recipe generator, I might as well get rid of every column except for cocktail name, ingredients, garnish, glassware, and preparation.
cocktails.drop(['Bartender', 'Bar/Company', 'Location', 'Notes'], axis=1, inplace=True)
cocktails.info()
cocktails.head()
Ok, now let's check for any duplicate data
for col in cocktails.columns:
    col_vals = cocktails[col].nunique()
    print(f'Unique values in {col} column = {col_vals}')
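Unique-value counts hint at repeats but don't show which rows are repeated. A minimal sketch of checking rows directly with `duplicated()` follows; it uses a made-up dataframe, and the column names are assumptions that may not match the real file.

```python
import pandas as pd

# Toy data: one exact repeat, plus one name reused with a different recipe.
toy = pd.DataFrame({
    'Cocktail Name': ['Negroni', 'Negroni', 'Daiquiri', 'Daiquiri'],
    'Ingredients': [
        'Gin, Campari, Sweet Vermouth',
        'Gin, Campari, Sweet Vermouth',
        'Rum, Lime Juice, Simple Syrup',
        'Rum, Lime Juice, Sugar',
    ],
})

# Rows that exactly repeat an earlier row across every column.
full_dupes = toy.duplicated().sum()

# Every row whose name appears more than once, even if the recipe differs.
name_dupes = toy[toy.duplicated(subset='Cocktail Name', keep=False)]

# Drop only the exact repeats, keeping the first occurrence of each.
deduped = toy.drop_duplicates()

print(full_dupes, len(name_dupes), len(deduped))
```

A repeated name with different ingredients may be a legitimate variation from another bar, so only exact repeats are dropped here.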