
Searching for Superior Scotch

My mission for this project is to:

A. Clean a dataset

B. Perform exploratory data analysis:

- look at the distribution of review scores
- compute descriptive statistics of prices per review score
- identify the Scotch whiskies with the highest review scores

C. Answer a few questions:

1. How do the 3 scotch categories perform according to review scores?
2. Which category is produced and sold in the greatest quantity?
3. Which scotches provide the best overall value?
4. My friend was upset that a particular bottle wasn't in the best values list, so why not?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

scotch = pd.read_csv('scotch_reviews_2020.csv')

View the data

scotch.info()
scotch.head()

Clean the data

Rename the awkwardly named columns, and check whether every price is listed in US dollars. If so, I'll rename the price column to USD and drop the currency column.

scotch['currency'].unique()
scotch['currency'].nunique()
scotch.rename(columns={'review.point': 'points', 'price': 'USD', 'description.1.2247.': 'review'}, inplace=True)
scotch.drop('currency', axis=1, inplace=True)
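
The currency check can also be written as an explicit guard, so the column is only renamed and dropped when every row really is in dollars. A minimal sketch on a made-up three-row frame (the `toy` data and the `'$'` value are assumptions for illustration, not taken from the real file):

```python
import pandas as pd

# Made-up rows standing in for the real dataset (assumption for illustration).
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'price': ['100', '250', '75'],
    'currency': ['$', '$', '$'],
})

# Only rename/drop if the currency column holds a single value, '$'.
currencies = toy['currency'].unique()
if len(currencies) == 1 and currencies[0] == '$':
    toy = toy.rename(columns={'price': 'USD'}).drop(columns='currency')

print(toy.columns.tolist())
```

If the guard fails, the frame is left untouched, which makes a mixed-currency file easy to spot before any prices are compared.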

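Clean up the price column: remove the commas, convert the values to numbers, drop the rows that still can't be parsed, and change the data type to integer.

# Strip thousands separators so values like '1,500' are not coerced to NaN
scotch['USD'] = scotch['USD'].astype(str).str.replace(',', '', regex=False)
# Anything that still isn't a plain number becomes NaN
scotch['USD'] = pd.to_numeric(scotch['USD'], errors='coerce')
scotch['USD'].isna().sum()
scotch.dropna(subset=['USD'], inplace=True)
scotch['USD'].isna().sum()
scotch['USD'] = scotch['USD'].astype(int)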
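
Thousands separators are a common reason `pd.to_numeric` with `errors='coerce'` turns an otherwise valid price into NaN. A small sketch of the difference, using made-up price strings (the values in `prices` are assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up price strings (assumption): one with a thousands comma,
# one that is not a plain number, and one missing value.
prices = pd.Series(['60', '1,500', '15,000/set', None])

# Strip commas first so '1,500' survives the conversion instead of becoming NaN.
cleaned = pd.to_numeric(prices.str.replace(',', '', regex=False), errors='coerce')

# Entries that still fail to parse ('15000/set' and the missing value) are dropped.
cleaned = cleaned.dropna().astype(int)

print(cleaned.tolist())
```

Without the `str.replace` step, the `'1,500'` entry would be coerced to NaN and dropped along with the genuinely unparseable ones.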

Check for rows with duplicate names

scotch.duplicated(subset='name').sum()
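
A count alone doesn't show which bottles are repeated. One way to inspect and then remove them, sketched on made-up rows (the `toy` frame and the keep-first policy are assumptions; the right policy depends on why the names repeat):

```python
import pandas as pd

# Made-up rows (assumption): the same bottle name appears twice.
toy = pd.DataFrame({
    'name': ['Bottle A', 'Bottle B', 'Bottle A'],
    'points': [95, 90, 93],
})

# keep=False flags every row that shares a name, so all copies can be inspected.
dupes = toy[toy.duplicated(subset='name', keep=False)]

# One possible policy: keep the first occurrence of each name.
deduped = toy.drop_duplicates(subset='name', keep='first')

print(len(dupes), len(deduped))
```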