In this brief exploration, we look at a dataset of nutrition information for Starbucks drinks. Our goal is to build a basic multiple linear regression model that estimates calorie counts from fat, carbohydrates, fiber, protein, and sodium.
# These are the libraries we'll need
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
Let's start by reading the CSV file into a DataFrame called "drinks".
drinks = pd.read_csv("starbucks-menu-nutrition-drinks.csv")
drinks
Next, we'll reload the file so that the first column becomes the index and the "-" placeholders are read as missing values. We then drop every row that contains a missing value and rename the remaining columns for easier referencing.
drinks = pd.read_csv("starbucks-menu-nutrition-drinks.csv", index_col=0, na_values=["-"])
drinks = drinks.dropna(axis=0)
drinks.columns = ["calories", "fat", "carbs", "fiber", "protein", "sodium"]
We can see that the dataframe is much easier to work with now.
drinks
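The cleaning step above relies on `na_values=["-"]` to turn the "-" placeholders into proper missing values. Without it, pandas would read those columns as text rather than numbers. Here is a minimal sketch of that behavior on a made-up three-row sample. The drink names and column headers are hypothetical and do not come from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample (NOT the real file): "-" marks missing data.
sample_csv = io.StringIO(
    "name,Calories,Fat (g)\n"
    "Drink A,100,2\n"
    "Drink B,-,-\n"
    "Drink C,250,9\n"
)

# Read "-" as NaN and use the first column as the index.
sample = pd.read_csv(sample_csv, index_col=0, na_values=["-"])
print(sample.isna().sum())   # number of missing values per column

# Drop every row that has at least one missing value.
clean = sample.dropna(axis=0)
print(clean.dtypes)          # columns are numeric because "-" became NaN
```

Checking `isna().sum()` before calling `dropna` is a quick way to see how many rows the cleaning step will discard.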
Let's sort the drinks by calories in descending order. In this dataset, at least, Starbucks Signature Hot Chocolate has the most calories.
drinks.sort_values(by="calories", ascending=False)
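If only the top few rows matter, `nlargest` and `idxmax` are possible alternatives to sorting the whole frame. The sketch below uses a small stand-in frame with invented values. In the notebook, the same calls would be applied to the cleaned `drinks` frame.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the cleaned `drinks` frame.
demo = pd.DataFrame(
    {"calories": [130, 430, 210], "carbs": [21, 45, 33]},
    index=["Drink A", "Drink B", "Drink C"],
)

# The two highest-calorie rows, already in descending order.
top = demo.nlargest(2, "calories")
print(top)

# Index label (the drink name) of the single highest-calorie row.
print(demo["calories"].idxmax())
```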
Let's explore the relationship between calories and carbohydrates. The scatter plot and fitted regression line show a strong positive relationship.
sns.regplot(x="carbs", y="calories", data=drinks)
plt.title("Correlation Between Calories and Carbs")
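The plot gives a visual impression of the relationship. A Pearson correlation coefficient puts a number on it. This is a minimal sketch with invented values in a stand-in frame. In the notebook, the same `corr` call would be made on the `carbs` and `calories` columns of `drinks`.

```python
import pandas as pd

# Hypothetical stand-in values; in the notebook, use the real `drinks` frame.
demo = pd.DataFrame({
    "carbs": [10, 20, 30, 40, 50],
    "calories": [60, 110, 150, 210, 250],
})

# Series.corr computes the Pearson correlation by default.
r = demo["carbs"].corr(demo["calories"])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, while a value close to 0 indicates little or no linear relationship.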
Here we fit a multiple linear regression model with calorie count as the dependent variable and fat, carbs, fiber, protein, and sodium as independent variables, then display a summary of the model. The R-squared value of 0.997 means the model explains about 99.7% of the variance in calories, which indicates a very good fit to this dataset.
lm_drinks = smf.ols(formula="calories ~ fat + carbs + fiber + protein + sodium", data=drinks).fit()
lm_drinks.summary()
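Each coefficient in the summary can be read as the change in calories per unit of that nutrient. A common rule of thumb is roughly 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. Coefficients near those values would therefore be a reassuring sign, assuming the columns are measured in grams. The sketch below shows how to pull out the coefficients, the R-squared value, and a prediction from a fitted model. It runs on synthetic data generated from that rule of thumb, not on the Starbucks file, and the names `demo`, `lm_demo`, and `new_drink` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Starbucks file): calories are generated
# from the 9/4/4 rule of thumb plus a little noise so the example is runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fat": rng.uniform(0, 20, 50),
    "carbs": rng.uniform(0, 60, 50),
    "protein": rng.uniform(0, 15, 50),
})
demo["calories"] = (9 * demo["fat"] + 4 * demo["carbs"]
                    + 4 * demo["protein"] + rng.normal(0, 2, 50))

lm_demo = smf.ols("calories ~ fat + carbs + protein", data=demo).fit()

print(lm_demo.params)     # fitted intercept and per-gram coefficients
print(lm_demo.rsquared)   # share of variance in calories explained

# Estimate calories for a hypothetical drink using the fitted model.
new_drink = pd.DataFrame({"fat": [5], "carbs": [30], "protein": [8]})
pred = lm_demo.predict(new_drink)
print(pred)
```

The same attributes, `params` and `rsquared`, and the same `predict` method are available on `lm_drinks`. A prediction from that model needs a frame with all five predictor columns.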