Competition - Fact check nutrition data

What is good food?

📖 Background

You and your friend have gotten into a debate about nutrition. Your friend follows a high-protein diet and does not eat any carbohydrates (no grains, no fruits). You claim that a balanced diet should contain all nutrients but should be low in calories. Both of you quickly realize that most of what you know about nutrition comes from mainstream and social media.

Being the data scientist that you are, you offer to look at the data yourself to answer a few key questions.

💾 The data

You source nutrition data from USDA's FoodData Central website. This data contains the calorie content of 7,793 common foods, as well as their nutritional composition. Each row represents one food item, and nutritional values are based on a 100g serving. Here is a description of the columns:

FDC_ID: A unique identifier for each food item in the database.
Item: The name or description of the food product.
Category: The category or classification of the food item, such as "Baked Products" or "Vegetables and Vegetable Products".
Calories: The energy content of the food, presented in kilocalories (kcal).
Protein: The protein content of the food, measured in grams.
Carbohydrate: The carbohydrate content of the food, measured in grams.
Total fat: The total fat content of the food, measured in grams.
Cholesterol: The cholesterol content of the food, measured in milligrams.
Fiber: The dietary fiber content of the food, measured in grams.
Water: The water content of the food, measured in grams.
Alcohol: The alcohol content of the food (if any), measured in grams.
Vitamin C: The Vitamin C content of the food, measured in milligrams.

import pandas as pd
nutrition_data = pd.read_csv('nutrition.csv')
nutrition_data

The data contains information on various food items, including their calorie content, macronutrient composition, and other nutritional details. We'll proceed to answer the questions one by one.

1. Which fruit has the highest vitamin C content? What are some other sources of vitamin C?

Let's find the fruit with the highest vitamin C content and list other significant sources of vitamin C.

2. Describe the relationship between the calories and water content of a food item.

We'll analyze the relationship between calorie and water content to see if there's any notable correlation.

3. What are the possible drawbacks of a zero-carb diet? What could be the drawbacks of a very high-protein diet?

We'll discuss the possible health drawbacks of these diets based on nutritional knowledge.

4. Fit a linear model to test whether the estimates for calorie content (from fat, protein, and carbohydrates) agree with the data.

We'll fit a linear model of calories on fat, protein, and carbohydrate, then compare the fitted coefficients with the given estimates (9 kcal/gram of fat, 4 kcal/gram of protein, and 4 kcal/gram of carbohydrate) to see if these values hold true for our dataset.

5. Analyze the errors of your linear model to see what could be the hidden sources of calories in food.

We'll analyze the residuals of the model to identify potential hidden sources of calories.

Let's start with the first question.

# Strip the unit suffix from each nutrient column and convert it to numeric;
# entries that cannot be parsed become NaN
unit_suffixes = {
    'Calories': ' kcal',
    'Protein': ' g',
    'Carbohydrate': ' g',
    'Total fat': ' g',
    'Cholesterol': ' mg',
    'Fiber': ' g',
    'Water': ' g',
    'Alcohol': ' g',
    'Vitamin C': ' mg',
}
for column, suffix in unit_suffixes.items():
    nutrition_data[column] = pd.to_numeric(nutrition_data[column].str.replace(suffix, '', regex=False), errors='coerce')

# Filter to find fruits
fruits = nutrition_data[nutrition_data['Category'].str.contains('Fruit', case=False, na=False)]

# Find the fruit with the highest Vitamin C content
highest_vitamin_c_fruit = fruits.loc[fruits['Vitamin C'].idxmax()]

# Display the name and Vitamin C content of that fruit
highest_vitamin_c_fruit[['Item', 'Vitamin C']]

The fruit with the highest vitamin C content is Acerola (West Indian Cherry), with 1677.6 mg of vitamin C per 100g serving.

Other notable sources of vitamin C include:

Guava
Kiwi
Strawberries
Oranges
Papaya
Pineapple

Next, let's analyze the relationship between the calories and water content of a food item.

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot to visualize the relationship between calories and water content
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Water', y='Calories', data=nutrition_data)
plt.title('Relationship Between Water Content and Calories')
plt.xlabel('Water Content (g per 100g)')
plt.ylabel('Calories (kcal per 100g)')
plt.grid(True)
plt.show()

The scatter plot shows that calorie content tends to fall as water content rises. This makes sense because water adds mass but no calories: the more of a 100g serving that is water, the less room remains for fat, protein, and carbohydrate.
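
A plot gives the direction of the trend but not its strength. As a rough sketch, we could put a number on it with a correlation coefficient. `water_calorie_correlation` is a hypothetical helper, the demo values are made up, and Spearman's coefficient is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd


def water_calorie_correlation(df):
    """Return Pearson and Spearman correlations between Water and Calories."""
    pair = df[['Water', 'Calories']].dropna()
    # Spearman's coefficient is the Pearson correlation of the ranked values
    ranks = pair.rank()
    return {
        'pearson': pair['Water'].corr(pair['Calories']),
        'spearman': ranks['Water'].corr(ranks['Calories']),
    }


# Made-up demo values: calories fall steadily as water rises
demo = pd.DataFrame({
    'Water': [90.0, 70.0, 50.0, 10.0, 0.0],
    'Calories': [30.0, 120.0, 250.0, 500.0, 880.0],
})

result = water_calorie_correlation(demo)
print(result)
```

A Pearson value near -1 would indicate a nearly linear negative relationship. Spearman only measures whether the relationship is monotonic, so a gap between the two hints at curvature or outliers.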

3. Possible drawbacks of a zero-carb diet and a very high-protein diet

Zero-Carb Diet:

Nutrient Deficiencies: Carbohydrate-rich foods such as grains and fruits are major sources of vitamins, minerals, and fiber. Eliminating them can lead to deficiencies, particularly in fiber, which is crucial for digestive health.
Energy Levels: Carbohydrates are the body's primary energy source. A lack of carbs can result in low energy levels, fatigue, and decreased physical and mental performance.
Digestive Issues: The lack of dietary fiber can lead to constipation and other digestive issues.
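Ketosis and Ketoacidosis: Very low carbohydrate intake pushes the body into ketosis, which is a goal for some low-carb dieters. Nutritional ketosis is not the same as ketoacidosis, but ketoacidosis is a dangerous condition and a particular risk for individuals with diabetes, who should be cautious with such diets.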

High-Protein Diet:

Kidney Strain: Excessive protein intake can strain the kidneys, especially in individuals with pre-existing kidney conditions.
Bone Health: High protein consumption can lead to increased calcium excretion, potentially impacting bone health over time.
Nutrient Imbalance: A very high-protein diet might lead to an imbalance of other nutrients, particularly if it reduces the intake of other food groups.
Digestive Issues: High protein diets can sometimes cause digestive issues such as constipation due to lower fiber intake.

4. Fit a linear model to test calorie estimates

Let's fit a linear model of calories on the three macronutrients and compare its coefficients with the provided estimates:

Calories = 9 * Total fat + 4 * Protein + 4 * Carbohydrate

If the estimates agree with the data, the fitted coefficients should come out close to 9, 4, and 4.

from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare the data for linear regression, keeping only rows where
# all three predictors and the target are present
model_data = nutrition_data[['Total fat', 'Protein', 'Carbohydrate', 'Calories']].dropna()
X = model_data[['Total fat', 'Protein', 'Carbohydrate']]
y = model_data['Calories']

# Define the linear model without an intercept, matching the formula above
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

# Coefficients
coef_fat, coef_protein, coef_carbohydrate = model.coef_

coef_fat, coef_protein, coef_carbohydrate

The fitted linear model coefficients are as follows:

Fat: 8.85 kcal/gram
Protein: 4.14 kcal/gram
Carbohydrate: 3.85 kcal/gram

These values are close to the theoretical estimates of 9 kcal/gram for fat and 4 kcal/gram for protein and carbohydrates. The fat and carbohydrate coefficients are slightly lower than the theoretical values (8.85 vs. 9 and 3.85 vs. 4), while the protein coefficient is slightly higher (4.14 vs. 4).
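
Comparing coefficients is one check. Another is to apply the fixed 9/4/4 rule directly and measure how far its predictions fall from the recorded calories. The sketch below assumes the column names from the data description; `atwater_estimate` and `rmse` are hypothetical helpers, and the demo values are made up.

```python
import numpy as np
import pandas as pd

# Fixed per-gram energy estimates from the question (kcal/gram)
KCAL_PER_GRAM = {'Total fat': 9.0, 'Protein': 4.0, 'Carbohydrate': 4.0}


def atwater_estimate(df):
    """Estimate calories from macronutrients using the fixed 9/4/4 rule."""
    return sum(df[column] * kcal for column, kcal in KCAL_PER_GRAM.items())


def rmse(actual, predicted):
    """Root mean squared error between two aligned series."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up demo values; the last row is 4 kcal above the 9/4/4 estimate
demo = pd.DataFrame({
    'Total fat': [10.0, 0.0, 5.0, 20.0],
    'Protein': [5.0, 20.0, 2.0, 0.0],
    'Carbohydrate': [0.0, 10.0, 30.0, 5.0],
    'Calories': [110.0, 120.0, 173.0, 204.0],
})

estimate = atwater_estimate(demo)
error = rmse(demo['Calories'], estimate)
print(estimate.tolist(), error)
```

If the fixed rule's error on the real data is close to that of the fitted model, the small differences in the coefficients have little practical effect.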

5. Analyze the errors of the linear model

We'll analyze the residuals (errors) of our model to identify potential hidden sources of calories or other factors affecting the calorie content.

Let's compute the residuals and visualize them.

# Predict the calories using the model
y_pred = model.predict(X)

# Compute the residuals
residuals = y - y_pred

# Plot the residuals
plt.figure(figsize=(10, 6))
sns.histplot(residuals, bins=50, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals (Actual - Predicted Calories)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Summary statistics of residuals
residuals.describe()

The residuals are centered close to zero and tightly concentrated, which is expected for a good fit: half of them fall between -2.15 and 5.17 kcal. The tails are much heavier than a normal distribution would produce, however, with extremes of -384.88 and 295.00 kcal, indicating that some foods have actual calorie contents significantly different from what the model predicts.

Summary Statistics of Residuals:

Count: 7793
Mean: 1.02 kcal (residuals are actual minus predicted, so the model slightly underestimates calories on average)
Standard Deviation: 17.17 kcal (showing variability in the prediction errors)
Min: -384.88 kcal
25th Percentile: -2.15 kcal
50th Percentile (Median): 1.50 kcal
75th Percentile: 5.17 kcal
Max: 295.00 kcal
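
The summary statistics say how large the extreme errors are, but not which foods produce them. As a minimal sketch, we could rank foods by absolute residual. `largest_residuals` is a hypothetical helper, and the demo items and values are made up.

```python
import pandas as pd


def largest_residuals(df, predicted, n=10):
    """Return the n foods whose actual calories differ most from the prediction."""
    out = df[['Item', 'Category', 'Calories']].copy()
    # predicted should be a Series sharing df's index so rows align correctly
    out['Predicted'] = predicted
    out['Residual'] = out['Calories'] - out['Predicted']
    order = out['Residual'].abs().sort_values(ascending=False).index
    return out.loc[order].head(n)


# Made-up demo values: Item B has calories the macronutrients do not explain
demo = pd.DataFrame({
    'Item': ['Item A', 'Item B', 'Item C'],
    'Category': ['Baked Products', 'Beverages', 'Vegetables and Vegetable Products'],
    'Calories': [100.0, 231.0, 50.0],
})
predicted = pd.Series([98.0, 0.0, 52.0], index=demo.index)

top = largest_residuals(demo, predicted, n=1)
print(top)
```

With the notebook's variables, the predictions would first need an index, for example `pd.Series(y_pred, index=X.index)`, because `model.predict` returns a plain array. Large positive residuals point to calories the model misses, and large negative ones to foods where it overestimates.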

Hidden Sources of Calories:

Alcohol: Some foods contain alcohol, which has 7 kcal/gram, and this isn't accounted for in our current model.
Fiber: While generally considered non-caloric, some fibers can be partially metabolized.
Sugar Alcohols: Used in some low-carb foods, these have varying caloric values.
Other Additives: Some processed foods contain additional ingredients that contribute to their caloric content.
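
One way to test the alcohol hypothesis is to add Alcohol as a fourth predictor and see whether its coefficient lands near 7 kcal/gram. The sketch below uses NumPy's least squares solver rather than the scikit-learn model above; `fit_energy_model` is a hypothetical helper, and the demo values are made up so that calories follow the 9/4/4/7 rule exactly.

```python
import numpy as np
import pandas as pd


def fit_energy_model(df, predictors):
    """Fit calories on the given columns by least squares, with no intercept."""
    clean = df[predictors + ['Calories']].dropna()
    coefficients, *_ = np.linalg.lstsq(
        clean[predictors].to_numpy(dtype=float),
        clean['Calories'].to_numpy(dtype=float),
        rcond=None,
    )
    return dict(zip(predictors, coefficients))


# Made-up demo values where Calories = 9*fat + 4*protein + 4*carb + 7*alcohol
demo = pd.DataFrame({
    'Total fat': [10.0, 0.0, 5.0, 0.0, 2.0],
    'Protein': [5.0, 20.0, 2.0, 0.0, 1.0],
    'Carbohydrate': [0.0, 10.0, 30.0, 3.0, 8.0],
    'Alcohol': [0.0, 0.0, 0.0, 30.0, 5.0],
    'Calories': [110.0, 120.0, 173.0, 222.0, 89.0],
})

macros = ['Total fat', 'Protein', 'Carbohydrate']
full = fit_energy_model(demo, macros + ['Alcohol'])
print({name: round(value, 2) for name, value in full.items()})
```

The same helper accepts 'Fiber' in the predictor list, which would show whether fiber grams carry less energy than other carbohydrates in this dataset. If the residual spread shrinks after adding a predictor, that predictor explains some of the hidden calories.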

Conclusion:

The linear model generally agrees with the theoretical values for calorie content from macronutrients but reveals some outliers and potential additional sources of calories. A balanced diet, including carbohydrates, proteins, and fats, while mindful of caloric intake, remains essential for overall health.