
Romaine Calm! And check out the data...

This data can't define good nutrition.

Reasonable folks can disagree about what constitutes "good" nutrition. If we can agree that good nutrition is nutrition that encourages good health, then the question becomes: how do we define good health? Body Mass Index (BMI)? Cholesterol levels? Heart rate? Blood pressure? Presence or absence of various diseases (diabetes, cardiovascular disease...)? There are studies that suggest our nutritional needs are as unique as we are (see this Washington Post article, which invokes the concept of bio-individuality). The definition of "good nutrition", if it is defined as a function of the positive health benefits it engenders, depends entirely on an individual's personal health goals and needs, and so defining "good nutrition" is necessarily a human-centric endeavor. Data can empower with information, but individuals must weigh that information and come to their own conclusions based on their needs and priorities.

This data can illustrate nutritional give-and-take & highlight potentially surprising facts.

  • By far, the fruit with the highest Vitamin C concentration is the Acerola Cherry. Other good sources of Vitamin C include various pepper varieties (red peppers in particular), guavas, litchis, and currants. The five categories where Vitamin C content is most dense are: (1) fruits and fruit juices (highest), (2) vegetables and vegetable products, (3) American Indian/Alaska Native foods, (4) breakfast cereals, and (5) nut and seed products.
  • Higher water content in foods is associated with lower calories. This holds true even when we exclude fats and oils, which are very calorie dense and generally do not contain any water.
  • It appears that the drawback of a zero-carb diet is that zero-carb foods are likely to be higher in calories, fat, and cholesterol, and lower in fiber and Vitamin C.
  • It appears that the drawback of a high-protein diet is that high-protein foods are likely to be higher in calories and cholesterol, and lower in fiber, Vitamin C, carbohydrates, and water content.
  • This data confirms the information provided by the Cleveland Clinic: a gram of fat has around 9 kilocalories, and a gram of protein and a gram of carbohydrate contain 4 kilocalories each. Rounded to the nearest whole number, the linear-model coefficients for protein, carbohydrates, and fat in this data set are 4, 4, and 9 respectively, meaning that for every one-gram increase in protein or carbohydrate we expect a 4 kcal increase in calories, and for every one-gram increase in fat we expect a 9 kcal increase in calories.
  • In analysing the errors of the linear model, we notice that alcohol, with a coefficient of 7, is positively associated with calories: for every one-gram increase in alcohol, we expect a 7 kcal increase in calories.
  • Just for fun, we plotted the average calories per food item for each food category. Six categories are home to foods which, on average, clock in at more than 300 calories per 100g serving. From highest to lowest, they are: fats and oils, nut and seed products, snacks, baked products, sweets, and breakfast cereals.
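
To make the 4-4-9 finding concrete, here is a minimal sketch of the kind of linear fit described above. It is not the code used for this report: the six foods are invented, their calories are generated from the 4-4-9 rule itself, and plain NumPy least squares (no intercept) is assumed as the fitting method. The point is only to show how per-gram coefficients fall out of such a fit.

```python
import numpy as np

# Hypothetical per-100 g values in grams: protein, carbohydrate, total fat.
# These rows are invented for illustration; they are not from the data set.
macros = np.array([
    [10.0, 20.0,   5.0],
    [ 2.0, 50.0,   1.0],
    [25.0,  0.0,  15.0],
    [ 5.0, 60.0,  30.0],
    [ 0.0,  0.0, 100.0],
    [ 8.0, 12.0,   3.0],
])

# Build the calories from the 4-4-9 rule so the expected answer is known.
true_kcal_per_gram = np.array([4.0, 4.0, 9.0])
calories = macros @ true_kcal_per_gram

# Ordinary least squares without an intercept: calories ≈ macros @ coefs.
coefs, _, _, _ = np.linalg.lstsq(macros, calories, rcond=None)

for name, value in zip(["Protein", "Carbohydrate", "Total fat"], coefs):
    print(f"{name}: {value:.2f} kcal per gram")
```

On real data the fit will not be exact, and the leftover error is where a term such as alcohol can show up.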

About me.

I'm Darcee Caron.

I've been learning Python for about a month here on DataCamp. It's the first time I have ever coded and I'm really loving it! I've long been a self-described data enthusiast and I decided about 2 months ago to take the plunge to shift my career to focus on data analytics. Working on this competition submission has been fantastic to test out what I've learned this past month and get creative. DataCamp courses are great for learning the basics, but the coding within lessons is contained and directed. This competition submission has allowed me, for the first time ever (!), to think through a Python project all on my own. Thanks for checking it out. Please feel encouraged to share your comments and feedback with me; I am hungry to learn. Connect with me on LinkedIn. 😀

💪 Competition challenge

Create a report that covers the following:

  1. Which fruit has the highest vitamin C content? What are some other sources of vitamin C?
  2. Describe the relationship between the calories and water content of a food item.
  3. What are the possible drawbacks of a zero-carb diet? What could be the drawbacks of a very high-protein diet?
  4. According to the Cleveland Clinic website, a gram of fat has around 9 kilocalories, and a gram of protein and a gram of carbohydrate contain 4 kilocalories each. Fit a linear model to test whether these estimates agree with the data.
  5. Analyze the errors of your linear model to see what could be the hidden sources of calories in food.

💾 The data

You source nutrition data from USDA's FoodData Central website. This data contains the calorie content of 7,793 common foods, as well as their nutritional composition. Each row represents one food item, and nutritional values are based on a 100g serving. Here is a description of the columns:

  • FDC_ID: A unique identifier for each food item in the database.
  • Item: The name or description of the food product.
  • Category: The category or classification of the food item, such as "Baked Products" or "Vegetables and Vegetable Products".
  • Calories: The energy content of the food, presented in kilocalories (kcal).
  • Protein: The protein content of the food, measured in grams.
  • Carbohydrate: The carbohydrate content of the food, measured in grams.
  • Total fat: The total fat content of the food, measured in grams.
  • Cholesterol: The cholesterol content of the food, measured in milligrams.
  • Fiber: The dietary fiber content of the food, measured in grams.
  • Water: The water content of the food, measured in grams.
  • Alcohol: The alcohol content of the food (if any), measured in grams.
  • Vitamin C: The Vitamin C content of the food, measured in milligrams.
import pandas as pd
df_food = pd.read_csv('nutrition.csv')
df_food

Cleanliness is Next to Godliness

As with any new data set, I've first taken a moment to get to know the data, and give it a quick clean. I looked for missing values, duplicate values, erroneous data types, and outliers. After validating the data quality, I computed a variety of summary statistics and accompanying visualizations to get to know the data better.

import pandas as pd
df_food = pd.read_csv('nutrition.csv')

### Which columns have missing data?
df_food.isna().any()

### The Cholesterol, Fiber, Alcohol and Vitamin C columns have missing values.
import matplotlib.pyplot as plt
### How much data is missing?
print(df_food.isna().sum())
### Plot the missing-value counts for the columns that have gaps.
nut_val_missing_val = df_food.loc[:, ["Cholesterol", "Fiber", "Alcohol", "Vitamin C"]]
nut_val_missing_val.isna().sum().plot(kind="bar")
plt.title("Missing Nutritional Value Information")
plt.ylabel("N° Foods Missing Value out of 7,793 Foods")
plt.xlabel("Nutritional Value")
plt.show()
### Only the nutritional values with missing data are plotted. Our primary case study questions do not concern cholesterol, fiber, and alcohol, so we will not worry about these missing values (although we will consider them when hunting for hidden calories). The foods that are missing a value for vitamin C will need to be excluded when we examine which foods are rich sources of vitamin C.
### Checking for duplicate values
### Count the rows whose FDC_ID repeats an earlier one (0 means no duplicates).
df_food["FDC_ID"].duplicated().sum()

### There are no duplicates in the data set.
### Verifying that all data within the same column are of the same type.

df_food.dtypes
### They are all the same type, but all nutritional values are in grams/milligrams/kcal and include a "g"/"mg"/"kcal" in the entry, so instead of being floats, they are strings (objects). We need to fix this before moving on.

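### Strip the unit suffix from each nutritional column and convert it to float.
### regex=False treats the suffix as literal text on every pandas version.
df_food_flts = df_food["Calories"].str.replace(' kcal', '', regex=False).astype(float)
df_food_flts = pd.concat([df_food_flts, df_food[["Protein", "Carbohydrate", "Total fat", "Fiber", "Water", "Alcohol"]].apply(lambda x: x.str.replace(' g', '', regex=False).astype(float))], axis=1)
df_food_flts = pd.concat([df_food_flts, df_food[["Cholesterol", "Vitamin C"]].apply(lambda x: x.str.replace(' mg', '', regex=False).astype(float))], axis=1)
### Re-attach the identifier and label columns, which stay as they are.
df_food_flts = pd.concat([df_food_flts, df_food[["FDC_ID", "Item", "Category"]]], axis=1)
print(df_food_flts.head())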
### Re-checking the data types after the conversion: the nutritional columns should now be floats.

df_food_flts.dtypes
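
As a small self-contained check of the unit-stripping step, the sketch below runs the same idea on a toy frame. The three rows and the `units` mapping are made up for illustration and are not part of the competition data. It shows that missing entries survive as NaN and that every column ends up as a float.

```python
import pandas as pd

# Toy frame mimicking the raw string format; the values are made up.
raw = pd.DataFrame({
    "Calories": ["52.0 kcal", "884.0 kcal"],
    "Protein": ["0.3 g", None],
    "Vitamin C": ["4.6 mg", "0.0 mg"],
})

# Hypothetical mapping from each column to the unit suffix it carries.
units = {"Calories": " kcal", "Protein": " g", "Vitamin C": " mg"}

# Remove the literal suffix, then convert; missing entries become NaN.
clean = pd.DataFrame({
    col: raw[col].str.replace(suffix, "", regex=False).astype(float)
    for col, suffix in units.items()
})

print(clean.dtypes)
print(clean)
```

Driving the conversion from a column-to-unit mapping keeps each suffix next to the column it belongs to, which makes a mismatched unit easier to spot.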
### Get standard summary stats for entire dataframe.
df_food_flts.describe()

### Looks like there may be some outliers in the Calories, Fiber, Alcohol, Cholesterol, and Vitamin C columns because the mean and max values are very different; however, the min values for all nutritional categories are zero, which will pull down the means across all columns.
print(df_food_flts["Cholesterol"].describe())
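
One common way to follow up on a suspected outlier is the interquartile-range (Tukey) rule. The sketch below is an assumption about how such a check could look, not the method used in this report, and the cholesterol values are invented rather than taken from the data set.

```python
import pandas as pd

# Made-up cholesterol values in mg per 100 g; not from the data set.
cholesterol = pd.Series([0, 0, 0, 5, 12, 30, 45, 60, 85, 3100], name="Cholesterol")

q1 = cholesterol.quantile(0.25)
q3 = cholesterol.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = cholesterol[(cholesterol < lower) | (cholesterol > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

A flagged value is not necessarily an error. In nutrition data a very high reading can be a genuine property of an unusual food, so it is worth looking up the item before dropping it.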