What is good food?
📖 Background
You and your friend have gotten into a debate about nutrition. Your friend follows a high-protein diet and does not eat any carbohydrates (no grains, no fruits). You claim that a balanced diet should contain all nutrients but should be low in calories. Both of you quickly realize that most of what you know about nutrition comes from mainstream and social media.
Being the data scientist that you are, you offer to look at the data yourself to answer a few key questions.
💾 The data
You source nutrition data from USDA's FoodData Central website. This data contains the calorie content of 7,793 common foods, as well as their nutritional composition. Each row represents one food item, and nutritional values are based on a 100g serving. Here is a description of the columns:
- FDC_ID: A unique identifier for each food item in the database.
- Item: The name or description of the food product.
- Category: The category or classification of the food item, such as "Baked Products" or "Vegetables and Vegetable Products".
- Calories: The energy content of the food, presented in kilocalories (kcal).
- Protein: The protein content of the food, measured in grams.
- Carbohydrate: The carbohydrate content of the food, measured in grams.
- Total fat: The total fat content of the food, measured in grams.
- Cholesterol: The cholesterol content of the food, measured in milligrams.
- Fiber: The dietary fiber content of the food, measured in grams.
- Water: The water content of the food, measured in grams.
- Alcohol: The alcohol content of the food (if any), measured in grams.
- Vitamin C: The Vitamin C content of the food, measured in milligrams.
# Import libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
from IPython.display import display, HTML
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt
# Read the data
nutrition_data = pd.read_csv('nutrition.csv')
nutrition_data.head()
# Columns to be processed
columns_to_process = ['Calories', 'Protein', 'Carbohydrate', 'Total fat', 'Cholesterol', 'Fiber', 'Water', 'Alcohol', 'Vitamin C']
# Removing units (kcal, g, mg) and converting to float
for column in columns_to_process:
    nutrition_data[column] = nutrition_data[column].str.extract(r'(\d+\.?\d*)').astype(float)
# Checking for null or missing values
missing_values = nutrition_data.isnull().sum()
# Replacing all null values in the DataFrame with 0
nutrition_data_filled = nutrition_data.fillna(0)
Executive summary
While high-protein/zero-carb diets can be effective for short-term weight loss, they are often not sustainable in the long term. Balanced diets are more likely to lead to sustainable weight management.
A balanced diet draws on a variety of food groups, including fruits, vegetables, grains, proteins, and fats, and so provides the range of vitamins, minerals, and other nutrients the body needs. The sections below cover Vitamin C-rich foods, the link between water content and calorie density (water-rich foods are typically low in calories and hence helpful for weight loss), the potential downsides of a very high-protein, zero-carb diet, and the effect of macronutrient composition on calorie content.
Vitamin C sources:
- Fruit with the highest vitamin C content: The top fruits for vitamin C content are Acerola, Guavas, Black Currants, Kiwifruit, Orange peel, Lemon peel, Longans, Litchis, and Oranges (with peel). These fruits are exceptionally rich in vitamin C, far surpassing the daily recommended intake for both men and women with relatively small servings.
- Other sources of vitamin C: Spices and Herbs, Beverages, Baby Foods, Vegetables.
The relationship between the calories and water content of a food item:
Foods high in water content have lower calorie density. This is because water adds weight and volume to food without adding calories. For example, fruits and vegetables, which are high in water content, are much less calorie-dense compared to dry, processed foods.
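The dilution effect described above can be shown with a short arithmetic sketch. The function name and the example numbers are hypothetical and are not taken from the USDA data; the only assumption is that added water contributes weight but no calories.

```python
def calories_per_100g_after_adding_water(calories_per_100g, added_water_g):
    """Calorie density (kcal per 100 g) after mixing 100 g of a food with water.

    Assumes water adds weight but no calories, as stated in the text.
    """
    if added_water_g < 0:
        raise ValueError("added_water_g must be non-negative")
    total_weight_g = 100.0 + added_water_g
    # The calories stay the same; only the weight they are spread over grows
    return calories_per_100g * 100.0 / total_weight_g


# Hypothetical dry food with 350 kcal per 100 g, mixed with 250 g of water
dry_density = 350.0
diluted_density = calories_per_100g_after_adding_water(dry_density, 250.0)
print(f"Before: {dry_density:.0f} kcal/100 g, after: {diluted_density:.0f} kcal/100 g")
```

The total energy is unchanged, but it is spread over 350 g instead of 100 g, so the same weight of food carries fewer calories.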
The drawbacks of a very high-protein zero-carb diet:
High protein/low carb diets may cause nutrient imbalance due to reduced carb intake, leading to a lack of essential nutrients found in carb-rich foods. Adherence can be difficult due to the lack of a defined macronutrient range, and the diets might be high in fat, increasing saturated fat intake and potentially impacting heart health. Severe calorie restriction may risk muscle mass loss. These diets require careful monitoring, making them complex and inconvenient. They may increase the risk of heart disease, limit dietary variety, and cause protein overconsumption, stressing the kidneys in sedentary individuals. Balance, moderation, and individualized dietary planning are essential.
Macronutrients food composition:
A linear model was fitted to test the accuracy of the nutritional estimates provided by the Cleveland Clinic, which state that one gram of fat contains about 9 kilocalories, and one gram each of protein and carbohydrates contains 4 kilocalories. The initial model's results showed a Mean Squared Error (MSE) of 277.4625 and an R-squared value of 0.9908. The regression formula derived was: Calories = 4.2851 + (3.9913 * Protein) + (3.8007 * Carbohydrate) + (8.8010 * Total fat).
Upon further examination of outliers in the residual data, it was concluded that other sources of calories, like alcohol, needed to be considered. After incorporating alcohol into the model, the results improved significantly, showing a lower MSE of 161.5293 and a higher R-squared value of 0.9946. The revised regression formula became: Calories = 1.0372 + (4.1036 * Protein) + (3.8329 * Carbohydrate) + (8.8356 * Total fat) + (6.9001 * Alcohol).
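To make the comparison concrete, here is a minimal sketch that applies both the rule-of-thumb factors and the revised regression coefficients reported above to one food. The helper name and the example food are hypothetical. The 7 kcal per gram figure for alcohol is the commonly quoted value and is an assumption added here, not a number from the Cleveland Clinic statement cited in the text.

```python
# Rule-of-thumb factors (kcal per gram). Protein, carbohydrate and fat come from
# the text; the alcohol factor of 7 is an assumption (commonly quoted value).
RULE_OF_THUMB = {"protein": 4.0, "carbohydrate": 4.0, "fat": 9.0, "alcohol": 7.0}

# Coefficients of the revised regression reported in the text
FITTED = {
    "intercept": 1.0372,
    "protein": 4.1036,
    "carbohydrate": 3.8329,
    "fat": 8.8356,
    "alcohol": 6.9001,
}


def estimate_calories(coefs, protein, carbohydrate, fat, alcohol=0.0):
    """Estimate kcal from macronutrient grams using a set of per-gram factors."""
    return (
        coefs.get("intercept", 0.0)  # the rule of thumb has no intercept
        + coefs["protein"] * protein
        + coefs["carbohydrate"] * carbohydrate
        + coefs["fat"] * fat
        + coefs["alcohol"] * alcohol
    )


# Hypothetical food: 10 g protein, 20 g carbohydrate, 5 g fat per 100 g
rule_estimate = estimate_calories(RULE_OF_THUMB, 10, 20, 5)
fitted_estimate = estimate_calories(FITTED, 10, 20, 5)
print(f"Rule of thumb: {rule_estimate:.1f} kcal, fitted model: {fitted_estimate:.1f} kcal")
```

For this example the two estimates differ by only a couple of kilocalories, which is consistent with the fitted coefficients sitting close to the 4-4-9 factors.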
Fruit with the highest vitamin C content
Reference Intake source: https://ods.od.nih.gov/factsheets/VitaminC-Consumer/
# Constants for daily intake
daily_intake_women = 75 # mg
daily_intake_men = 90 # mg
# Filtering the data for raw fruits only
raw_fruit_data = nutrition_data_filled[
    nutrition_data_filled['Category'].str.contains('Fruits', case=False, na=False) &
    nutrition_data_filled['Item'].str.contains('raw', case=False, na=False)
]
# Sorting the data by Vitamin C content
sorted_vitamin_c_raw_fruits = raw_fruit_data.sort_values(by='Vitamin C', ascending=False)
# Selecting the top 10 fruits with unique starting words
# (DataFrame.append was removed in pandas 2.0, so the rows are filtered with a boolean mask instead)
first_words = sorted_vitamin_c_raw_fruits['Item'].str.split().str[0]
# Keep only the first (highest vitamin C) row for each starting word, then take the top 10
top_vitamin_c_raw_fruits = sorted_vitamin_c_raw_fruits[~first_words.duplicated()].head(10).copy()
# Calculate the share of the daily reference intake (%) supplied by a 100 g serving of each fruit
top_vitamin_c_raw_fruits['Reference Intake for Women (%)'] = (top_vitamin_c_raw_fruits['Vitamin C']/daily_intake_women) * 100
top_vitamin_c_raw_fruits['Reference Intake for Men (%)'] = (top_vitamin_c_raw_fruits['Vitamin C']/daily_intake_men) * 100
# Rounding the percentages to 0 decimal places
top_vitamin_c_raw_fruits['Reference Intake for Women (%)'] = top_vitamin_c_raw_fruits['Reference Intake for Women (%)'].round(0)
top_vitamin_c_raw_fruits['Reference Intake for Men (%)'] = top_vitamin_c_raw_fruits['Reference Intake for Men (%)'].round(0)
# Displaying the result
top_vitamin_c_raw_fruits[['Item', 'Vitamin C', 'Reference Intake for Women (%)', 'Reference Intake for Men (%)']]
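The table expresses vitamin C as a percentage of the reference intake per 100 g. The same figures can be turned around to ask how many grams of a fruit would cover the daily intake. The sketch below is illustrative only: the function name is hypothetical and the 150 mg value is a made-up example, not a row from the dataset.

```python
def grams_to_meet_intake(vitamin_c_mg_per_100g, daily_intake_mg):
    """Grams of a food needed to supply the given daily vitamin C intake."""
    if vitamin_c_mg_per_100g <= 0:
        raise ValueError("the food must contain some vitamin C")
    # Vitamin C per gram is vitamin_c_mg_per_100g / 100
    return daily_intake_mg * 100.0 / vitamin_c_mg_per_100g


# Hypothetical fruit with 150 mg of vitamin C per 100 g
grams_women = grams_to_meet_intake(150.0, 75)  # reference intake for women (mg)
grams_men = grams_to_meet_intake(150.0, 90)    # reference intake for men (mg)
print(f"Women: {grams_women:.0f} g, men: {grams_men:.0f} g")
```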
food_items = top_vitamin_c_raw_fruits["Item"]
# Function to fetch the second image URL containing 'unsplash' from a webpage
def fetch_second_image_url(search_term):
    headers = {'User-Agent': 'Mozilla/5.0'}
    url = f"https://unsplash.com/s/photos/{search_term}-fruit"
    try:
        # A timeout keeps the notebook from hanging if the site does not respond
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # This selector depends on the page's current markup and may stop matching
            images = soup.select('div.MorZF img')
            unsplash_images = [img['src'] for img in images if 'src' in img.attrs and 'unsplash' in img['src']]
            if len(unsplash_images) >= 2:
                return unsplash_images[1]  # Return the second image URL
    except Exception as e:
        print(f"Error fetching image for {search_term}: {e}")
    return None
# Fetching and storing the second image URL for each food item
urls = {}
for item in food_items:
    first_word = re.split(',| ', item)[0]
    second_image_url = fetch_second_image_url(first_word)
    if second_image_url:
        urls[item] = second_image_url
# Generating HTML for images
images_html = "<table><tr>"
for count, (item, image_url) in enumerate(urls.items(), 1):
    images_html += f"<td style='text-align: center; padding: 10px;'><img src='{image_url}' style='width: 150px; height: 150px; object-fit: cover;'><br>{item}</td>"
    if count % 4 == 0:
        images_html += "</tr><tr>"
images_html += "</tr></table>"
# Display the images
display(HTML(images_html))
Other sources of vitamin C
# Calculating the mean value of Vitamin C for each category
vitamin_c_mean_by_category = nutrition_data_filled.groupby('Category')['Vitamin C'].mean()
# Sorting the mean values in descending order
vitamin_c_mean_by_category_sorted = vitamin_c_mean_by_category.sort_values(ascending=False)
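# Assigning one color per category; category_colors is not defined in the cells above,
# so it is built here (the 'husl' palette is an assumed choice)
categories = vitamin_c_mean_by_category_sorted.index
category_colors = dict(zip(categories, sns.color_palette('husl', len(categories))))
# Plotting the sorted mean values
plt.figure(figsize=(10, 6))
vitamin_c_mean_by_category_sorted.plot(kind='bar', color=[category_colors[cat] for cat in categories])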
plt.title('Mean Vitamin C content by Category')
plt.xlabel('Category')
plt.ylabel('Mean Vitamin C (mg)')
plt.xticks(rotation=90, fontsize=8)
plt.show()