Starbucks & Multi-linear Regression (Python)
  • In this short walkthrough, we explore a dataset of nutrition information for Starbucks drinks. Our goal is to build a basic multiple linear regression model that estimates calorie counts from fat, carbohydrates, fiber, protein, and sodium.

    # These are the libraries we'll need
    import pandas as pd
    import statsmodels.formula.api as smf
    import seaborn as sns
    import matplotlib.pyplot as plt

    Let's start by reading the CSV file into a DataFrame called "drinks".

    drinks = pd.read_csv("starbucks-menu-nutrition-drinks.csv")
    drinks

    Next, we'll re-read the file, this time using the first column as the index and treating "-" as a missing value. We then drop the rows that contain missing values and rename the remaining columns for easier referencing.

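    # Use the first column as the index and parse "-" as NaN
    drinks = pd.read_csv("starbucks-menu-nutrition-drinks.csv", index_col=0, na_values=["-"])
    # Drop every row that has at least one missing value
    drinks = drinks.dropna(axis=0)
    drinks.columns = ["calories", "fat", "carbs", "fiber", "protein", "sodium"]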
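
    To make the cleaning step concrete, here is a minimal sketch on a tiny in-memory CSV. The three rows and their column names are invented for illustration and are not taken from the real file; only the `read_csv` and `dropna` calls mirror the code above.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file's columns and values may differ.
sample_csv = io.StringIO(
    "Drink,Calories,Fat (g)\n"
    "Iced Coffee,5,0\n"
    "Mystery Tea,-,-\n"
    "Hot Chocolate,320,9\n"
)

# na_values=["-"] makes pandas parse "-" as NaN instead of as text.
sample = pd.read_csv(sample_csv, index_col=0, na_values=["-"])
rows_before = len(sample)

# dropna(axis=0) removes every row that has at least one NaN.
sample = sample.dropna(axis=0)
sample.columns = ["calories", "fat"]

print(rows_before, "rows before,", len(sample), "rows after")
print(sample)
```

    Without `na_values`, a column containing "-" would be read as text rather than numbers, and `dropna` would have nothing to remove.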

    We can see that the dataframe is much easier to work with now.

    drinks

    Let's sort the drinks by calories in descending order. We can see that, at least in this dataset, Starbucks Signature Hot Chocolate has the most calories.

    drinks.sort_values(by="calories", ascending=False)
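
    Note that `sort_values` returns a new DataFrame and leaves the original order untouched. The short sketch below shows this on a toy frame with made-up calorie values, and also shows `idxmax` as one way to get the top drink without sorting.

```python
import pandas as pd

# Hypothetical calorie values, for illustration only.
toy = pd.DataFrame(
    {"calories": [5, 320, 130]},
    index=["Iced Coffee", "Hot Chocolate", "Latte"],
)

# sort_values returns a sorted copy; `toy` itself is not modified.
ranked = toy.sort_values(by="calories", ascending=False)
top_drink = ranked.index[0]

# idxmax gives the index label of the largest value directly.
same_top = toy["calories"].idxmax()

print(top_drink, same_top)
print(list(toy.index))  # original order is preserved
```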

    Let's explore the relationship between calories and carbohydrates. The scatter plot and fitted regression line show a strong positive relationship: drinks with more carbohydrates tend to have more calories.

    sns.regplot(x="carbs", y="calories", data = drinks)
    plt.title("Correlation Between Calories and Carbs")
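
    The strength of a relationship like this can also be put into a single number with the Pearson correlation coefficient. The sketch below uses a toy frame with invented values; on the real data, the same call would be `drinks["carbs"].corr(drinks["calories"])`.

```python
import pandas as pd

# Hypothetical values chosen to rise together, for illustration only.
toy = pd.DataFrame(
    {
        "carbs": [0, 10, 20, 30, 40],
        "calories": [5, 60, 90, 150, 170],
    }
)

# Series.corr computes the Pearson correlation by default.
# Values near +1 indicate a strong positive linear relationship.
r = toy["carbs"].corr(toy["calories"])
print(round(r, 3))
```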

    Next, we fit a multiple linear regression model in which calories is the dependent variable and fat, carbs, fiber, protein, and sodium are the independent variables, and then display a summary of the fit. The R-squared value of 0.997 means the model explains about 99.7% of the variance in calories in this dataset, which indicates a very close fit.

    lm_drinks = smf.ols(formula="calories ~ fat + carbs + fiber + protein + sodium", data=drinks).fit()
    lm_drinks.summary()
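
    To show what the R-squared figure measures, here is a minimal sketch that fits an ordinary least squares model with NumPy and computes R-squared by hand as 1 - SS_res / SS_tot. The data are synthetic: the calorie values are generated from an assumed rule of thumb (about 9 kcal per gram of fat and 4 per gram of carbohydrate or protein) plus noise, so the numbers are not the Starbucks results.

```python
import numpy as np

# Synthetic data, for illustration only (not the Starbucks dataset).
rng = np.random.default_rng(0)
n = 50
fat = rng.uniform(0, 20, n)
carbs = rng.uniform(0, 60, n)
protein = rng.uniform(0, 15, n)

# Assumed generating rule: 9 kcal/g fat, 4 kcal/g carbs and protein, plus noise.
calories = 9 * fat + 4 * carbs + 4 * protein + rng.normal(0, 2, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), fat, carbs, protein])
coef, *_ = np.linalg.lstsq(X, calories, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
fitted = X @ coef
ss_res = np.sum((calories - fitted) ** 2)
ss_tot = np.sum((calories - calories.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("coefficients (intercept, fat, carbs, protein):", np.round(coef, 2))
print("R-squared:", round(r_squared, 4))
```

    On the fitted statsmodels model, the same quantities are available as `lm_drinks.rsquared` and `lm_drinks.params`, so there is no need to read them off the printed summary.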