Greetings, data lovers!
🤓 This is my first competition submission after a few months of learning Python.
Modeling is still quite new to me, so if you spot any mistakes or have ideas on how to approach things differently, give me a shout! You can drop me an email at [email protected] or connect on LinkedIn at https://www.linkedin.com/in/niksa-derek-fmva/.
💪 Competition challenge
Create a report that covers the following:
- Which fruit has the highest vitamin C content? What are some other sources of vitamin C?
- Describe the relationship between the calories and water content of a food item.
- What are the possible drawbacks of a zero-carb diet? What could be the drawbacks of a very high-protein diet?
- According to the Cleveland Clinic website, a gram of fat has around 9 kilocalories, and a gram of protein and a gram of carbohydrate contain 4 kilocalories each. Fit a linear model to test whether these estimates agree with the data.
- Analyze the errors of your linear model to see what could be the hidden sources of calories in food.
Summary of Key Findings from the Nutrition Data Analysis
1. Highest Vitamin C Content in Fruit:
The fruit with the highest vitamin C content was identified as Acerola, containing 1677.6 mg of vitamin C per 100g. Additional sources with high vitamin C content include baby food (GERBER, apple, carrot, and squash), various fruit-flavored drinks, and freeze-dried sweet red peppers.
2. Calories and Water Content Relationship:
For the category 'Baked Products', a scatter plot analysis revealed a negative correlation of -0.9 between calories and water content. This suggests that as calories increase, water content tends to decrease, and vice versa.
3. Zero-Carb vs. High-Protein Diets:
Zero-Carb Diet Drawbacks: Nutrient deficiencies (lack of fiber, vitamins, minerals), digestive issues, low energy levels, and difficulty in sustaining the diet long-term. High-Protein Diet Drawbacks: Potential kidney damage, nutrient imbalances, digestive issues, and increased risk of chronic diseases.
4. Nutritional Composition Comparison:
Zero-carb foods were compared with high-protein foods, revealing that zero-carb foods contain significantly less fiber and vitamin C, indicating potential nutritional imbalances.
5. Calorie Estimation vs. Actual Calories:
A linear model was used to estimate calories based on the macronutrient content (fat, protein, carbohydrate). The estimated total calories were 1,731,393.36 kcal compared to the actual total of 1,716,354.62 kcal. A strong correlation (0.9947) was observed between estimated and actual calories, indicating the model's accuracy. However, there were unaccounted calories (9113.95 kcal), which could be partially explained by alcohol content (6965 kcal from alcohol). The model did not account for alcohol, which has about 7 kcal per gram.
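The fit described in this finding can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual model: the array names, the sample size, the noise level, and the no-intercept design are all assumptions, and `numpy.linalg.lstsq` stands in for whichever fitting routine was really used.

```python
import numpy as np

# Synthetic macronutrient amounts in grams per item (hypothetical, not from nutrition.csv)
rng = np.random.default_rng(0)
n_items = 200
fat = rng.uniform(0, 30, n_items)
protein = rng.uniform(0, 30, n_items)
carbohydrate = rng.uniform(0, 60, n_items)

# Simulate calories from the published estimates (9, 4, 4 kcal per gram) plus small noise
calories = 9 * fat + 4 * protein + 4 * carbohydrate + rng.normal(0, 2, n_items)

# Design matrix without an intercept (assumption: no macronutrients means no calories)
X = np.column_stack([fat, protein, carbohydrate])

# Ordinary least squares: the coefficients are the fitted kcal per gram of each macronutrient
coefficients, *_ = np.linalg.lstsq(X, calories, rcond=None)
kcal_per_g = dict(zip(["fat", "protein", "carbohydrate"], coefficients))

for name, value in kcal_per_g.items():
    print(f"{name}: {value:.2f} kcal per gram")
```

If the fitted coefficients land close to 9, 4, and 4, the data agree with the published estimates; large deviations would point to calories the three macronutrients do not explain.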
Additional insights
Statistical Test (KS Test):
The Kolmogorov-Smirnov test found no significant difference between the distributions of actual and predicted calories (p-value: 0.1297, above the usual 0.05 threshold). This is consistent with the predictions matching the actual values in distribution, although it does not show that each individual prediction is accurate.
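For readers new to the test, the two-sample KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic, with a hypothetical helper name (`ks_statistic`) and made-up calorie values; a library routine such as `scipy.stats.ks_2samp` would also return the p-value.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Hypothetical actual and predicted calorie values (kcal), for illustration only
actual = [52, 89, 160, 250, 402, 717]
predicted = [50, 92, 158, 255, 398, 720]
statistic = ks_statistic(actual, predicted)
print(f"KS statistic: {statistic:.3f}")
```

A statistic near 0 means the two distributions overlap closely, while a value of 1 means they do not overlap at all.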
Residual Analysis:
Residuals between predicted and actual calories were analyzed, revealing discrepancies, particularly at extreme ends of the dataset. This suggests that a different model or distribution might be more accurate.
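A residual here is the actual calorie value minus the value predicted from fat, protein, and carbohydrate. The sketch below uses three made-up items (not rows from `nutrition.csv`) and the published per-gram estimates rather than fitted coefficients, to show how an alcohol-containing item would stand out; the helper name `residual` and the dictionary layout are assumptions.

```python
# Published per-gram estimates (kcal): fat 9, protein 4, carbohydrate 4
KCAL_PER_GRAM = {"fat": 9.0, "protein": 4.0, "carbohydrate": 4.0}
# Alcohol is not part of the model; it has about 7 kcal per gram
ALCOHOL_KCAL_PER_GRAM = 7.0

# Hypothetical items (grams and measured kcal), for illustration only
items = [
    {"item": "cheese", "fat": 30.0, "protein": 25.0, "carbohydrate": 2.0, "alcohol": 0.0, "calories": 378.0},
    {"item": "bread", "fat": 3.0, "protein": 9.0, "carbohydrate": 49.0, "alcohol": 0.0, "calories": 259.0},
    {"item": "spirit", "fat": 0.0, "protein": 0.0, "carbohydrate": 0.0, "alcohol": 33.0, "calories": 231.0},
]

def residual(row):
    """Actual calories minus the calories predicted from the three macronutrients."""
    predicted = sum(KCAL_PER_GRAM[name] * row[name] for name in KCAL_PER_GRAM)
    return row["calories"] - predicted

residuals = {row["item"]: residual(row) for row in items}

# The item with the largest absolute residual is the one the model explains worst
largest = max(residuals, key=lambda name: abs(residuals[name]))

# Calories that alcohol alone would account for, and what is still left unexplained
alcohol_kcal = sum(ALCOHOL_KCAL_PER_GRAM * row["alcohol"] for row in items)
unexplained = sum(residuals.values()) - alcohol_kcal

print(residuals)
print("Largest residual:", largest)
print("Explained by alcohol:", alcohol_kcal, "kcal; still unexplained:", unexplained, "kcal")
```

Sorting real items by absolute residual in the same way is one option for spotting hidden calorie sources.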
# Import libraries
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Read the nutrition data from the CSV file
df_food = pd.read_csv('nutrition.csv')
# Display the dataframe
df_food
# Inspect the unique values in the "Category" column of the df_food dataframe
df_food["Category"].unique()
Item with the highest vitamin C content
- After familiarizing myself with the data, I looked for the fruit with the highest vitamin C content and listed the 10 foods with the highest vitamin C content in the dataset.
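The nutrient columns are stored as text with a unit suffix, so they have to be converted to numbers before any ranking. Here is a minimal sketch of that cleaning step on a small made-up frame; the helper name `to_number`, the sample items, and all values except acerola's 1677.6 mg are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows in the same "<number> <unit>" text format as the dataset
sample = pd.DataFrame({
    "Item": ["Acerola, raw", "Orange juice", "Butter"],
    "Vitamin C": ["1677.6 mg", "50.0 mg", "0.0 mg"],
})

def to_number(column):
    """Extract the leading number from strings such as '1677.6 mg' and return floats."""
    extracted = column.astype(str).str.extract(r"([-+]?\d*\.?\d+)", expand=False)
    # Values without a number become NaN instead of raising an error
    return pd.to_numeric(extracted, errors="coerce")

sample["Vitamin C"] = to_number(sample["Vitamin C"])

# idxmax returns the index label of the row with the largest value
top_item = sample.loc[sample["Vitamin C"].idxmax(), "Item"]
print(top_item, sample["Vitamin C"].max())
```

Extracting the number with a pattern, rather than replacing one specific suffix, keeps the same helper usable for columns in `g`, `mg`, or `kcal`.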
# Remove ' mg' from the 'Vitamin C' column and convert it to float
df_food['Vitamin C'] = df_food['Vitamin C'].astype(str).str.replace(' mg', '')
df_food['Vitamin C'] = df_food['Vitamin C'].astype(float)
# Filter the dataframe to include only rows with category 'Fruits and Fruit Juices' and select columns 'Item' and 'Vitamin C'
fruit_vitamin_c = df_food[df_food["Category"]=="Fruits and Fruit Juices"][["Item", "Vitamin C"]]
# Find the maximum value of 'Vitamin C' in the filtered dataframe
Max = fruit_vitamin_c["Vitamin C"].max()
# Keep only the fruit rows whose 'Vitamin C' equals the maximum value
# (filtering the full dataframe could also match non-fruit items with the same value)
c_max = fruit_vitamin_c[fruit_vitamin_c["Vitamin C"] == Max]
# Show the 'Item' and 'Vitamin C' columns of the top fruit
c_max[["Item","Vitamin C"]]
# Get the name and vitamin C content of the top fruit
Food = c_max["Item"].iloc[0].split(",")[0]
Vitamin_C = c_max["Vitamin C"].iloc[0]
# Print the fruit together with its vitamin C content
print(Food + " has the highest content of vitamin C: " + str(Vitamin_C) + " mg")
# Add space between results
print()
# Get the top 10 foods with the highest content of vitamin C
Top_10 = df_food.nlargest(10, "Vitamin C")[["Item","Vitamin C"]]
# Print other foods with high content of vitamin C
print("Other foods with high content of vitamin C:")
print(Top_10)
Relationship between calories and water content
- Next, I calculated the relationship between calories and water content for one food category ('Baked Products') and visualized it with a scatter plot, to see whether water content tends to rise, fall, or stay flat as calories increase.
- The correlation of -0.9 indicates that calories tend to decrease as water content increases (and vice versa).
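As a reminder of what the coefficient measures: Pearson's r is the covariance of the two variables divided by the product of their standard deviations, so it ranges from -1 to 1. The sketch below computes it two ways on made-up values for five baked goods; the numbers are assumptions for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical baked goods: calories (kcal) and water content (g) per 100 g
calories = np.array([450.0, 380.0, 300.0, 270.0, 210.0])
water = np.array([5.0, 14.0, 30.0, 36.0, 48.0])

# Library route: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(calories, water)[0, 1]

# Manual route: covariance divided by the product of the standard deviations
covariance = np.mean((calories - calories.mean()) * (water - water.mean()))
r_manual = covariance / (calories.std() * water.std())

print(f"Pearson r: {r:.3f} (manual: {r_manual:.3f})")
```

A value close to -1, as in this toy example, means the points fall near a downward-sloping line.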
# Define a function that describes the relationship between calories and water content for a given food category
def calculate(category):
    # Filter the dataframe to only the rows for the specified food category
    category_items = df_food.loc[df_food["Category"] == category]
    # Strip the units and convert both columns to float
    # (astype(str) keeps this safe if the columns are already numeric)
    calories = category_items["Calories"].astype(str).str.replace(" kcal", "").astype(float)
    water = category_items["Water"].astype(str).str.replace(" g", "").astype(float)
    # Calculate the Pearson correlation coefficient between calories and water content across the category
    correlation = calories.corr(water)
    # Determine the overall relationship between calories and water content for the category
    if correlation > 0:
        relationship = "In the category '{}', as the calories increase, the water content tends to increase.".format(category)
    elif correlation < 0:
        relationship = "In the category '{}', as the calories increase, the water content tends to decrease.".format(category)
    else:
        # Covers a correlation of exactly zero or an undefined (NaN) correlation
        relationship = "In the category '{}', there is no significant relationship between calories and water content.".format(category)
    return relationship
# Call the function to calculate the relationship for a specific food category
category_relationship = calculate('Baked Products')
print(category_relationship)
# Filter the dataframe to only the rows for the specified food category
category_items = df_food.loc[df_food["Category"] == 'Baked Products'].copy()
# Strip the units so both columns are numeric; text values would be plotted as categories
category_items['Calories'] = category_items['Calories'].astype(str).str.replace(' kcal', '').astype(float)
category_items['Water'] = category_items['Water'].astype(str).str.replace(' g', '').astype(float)
# Plotting the scatter plot (legend=False hides the legend)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Calories', y='Water', data=category_items, hue='Calories', palette='coolwarm', size='Calories', sizes=(20, 200), legend=False)
# Adding labels and title
plt.title('Relationship Between Calories and Water Content for Baked Products')
plt.xlabel('Calories (kcal)')
plt.ylabel('Water Content (g)')
# Display the plot
plt.show()
# Remove ' kcal' from the 'Calories' column
df_food['Calories'] = df_food['Calories'].str.replace(' kcal', '')
# Remove ' g' from the 'Water' column
df_food['Water'] = df_food['Water'].str.replace(' g', '')
# Convert 'Calories' and 'Water' columns to float
df_food['Calories'] = df_food['Calories'].astype(float)
df_food['Water'] = df_food['Water'].astype(float)
# Find correlation coefficient between calories and water content of food
cor_cal_water = df_food['Calories'].corr(df_food['Water'])
print('Correlation coefficient between water content and calories in food: ', round(cor_cal_water, 2))
# Plot scatterplot with regression line between water and calories
sns.regplot(data=df_food, x='Water', y='Calories', marker='.', line_kws={'color':'red'})
plt.title('Water vs calories')
plt.xlabel('Water in grams')
plt.ylabel('Calories in kcal')
plt.show()
Advantages and disadvantages of a zero-carb and high-protein diet
- Next, I look at the advantages and disadvantages of zero-carb and high-protein diets.
- For high-protein foods, I selected all items with more than 20 g of protein per 100 g.
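The two groups can be selected with boolean masks. The sketch below shows the idea on a small made-up frame with already-numeric columns; the items and values are assumptions for illustration. It compares per-item means rather than totals, which is one option for keeping groups of different sizes comparable.

```python
import pandas as pd

# Hypothetical, already-numeric values per 100 g (not taken from nutrition.csv)
foods = pd.DataFrame({
    "Item": ["Chicken breast", "Olive oil", "Lentils, cooked", "Parmesan", "Apple"],
    "Carbohydrate": [0.0, 0.0, 20.0, 3.0, 14.0],
    "Protein": [31.0, 0.0, 9.0, 36.0, 0.3],
    "Fiber": [0.0, 0.0, 8.0, 0.0, 2.4],
})

# Boolean masks select the two groups; an item can belong to both
zero_carb = foods[foods["Carbohydrate"] == 0]
high_protein = foods[foods["Protein"] > 20]

# Per-item means for each group, side by side
summary = pd.DataFrame({
    "zero_carb": zero_carb[["Fiber", "Protein"]].mean(),
    "high_protein": high_protein[["Fiber", "Protein"]].mean(),
})
print(summary)
```

Note that the groups overlap: in this toy data the chicken breast is both zero-carb and high-protein, so the two columns are not independent samples.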
Possible drawbacks of a zero-carb diet:
- Nutrient deficiencies: Carbohydrates are a major source of essential nutrients like fiber, vitamins, and minerals. A zero-carb diet may lead to deficiencies in these nutrients.
- Digestive issues: Lack of fiber from carbohydrates can cause constipation and other digestive problems.
- Low energy levels: Carbohydrates are the body's preferred source of energy. Without them, energy levels may decrease.
- Sustainability: It can be challenging to sustain a zero-carb diet in the long term due to limited food choices and potential health risks.
Drawbacks of a very high-protein diet:
- Kidney damage: Excessive protein intake can put strain on the kidneys and potentially lead to kidney damage.
- Nutrient imbalances: Focusing too much on protein may result in inadequate intake of other essential nutrients.
- Digestive issues: Consuming excessive protein can cause digestive problems such as constipation and bloating.
- Increased risk of chronic diseases: Some studies suggest that a high-protein diet, particularly from animal sources, may increase the risk of certain chronic diseases like heart disease and cancer.
# Compare the results from zero-carb food and high protein food
food = pd.read_csv("nutrition.csv")
# Select the zero-carb food items (0 g of carbohydrate) and the high-protein food items (more than 20 g of protein)
zero_carb_food = food[food["Carbohydrate"].str.replace(' g', '').astype(float) == 0]
high_protein_food = food[food["Protein"].str.replace(' g', '').astype(float) > 20]
# Calculate the total fat in zero-carb food items
total_fat_zero_carb = round(zero_carb_food["Total fat"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total cholesterol in zero-carb food items
total_cholesterol_zero_carb = round(zero_carb_food["Cholesterol"].str.replace(" mg", "").astype(float).sum(), 2)
# Calculate the total fiber in zero-carb food items
total_fiber_zero_carb = round(zero_carb_food["Fiber"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total protein in zero-carb food items
total_protein_zero_carb = round(zero_carb_food["Protein"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total vitamin C in zero-carb food items
total_vitamin_c_zero_carb = round(zero_carb_food["Vitamin C"].str.replace(" mg", "").astype(float).sum(), 2)
# Calculate the total water in zero-carb food items
total_water_zero_carb = round(zero_carb_food["Water"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total fat in high protein food items
total_fat_high_protein = round(high_protein_food["Total fat"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total cholesterol in high protein food items
total_cholesterol_high_protein = round(high_protein_food["Cholesterol"].str.replace(" mg", "").astype(float).sum(), 2)
# Calculate the total fiber in high protein food items
total_fiber_high_protein = round(high_protein_food["Fiber"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total protein in high protein food items
total_protein_high_protein = round(high_protein_food["Protein"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total vitamin C in high protein food items
total_vitamin_c_high_protein = round(high_protein_food["Vitamin C"].str.replace(" mg", "").astype(float).sum(), 2)
# Calculate the total water in high protein food items
total_water_high_protein = round(high_protein_food["Water"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total carbohydrate in zero-carb food items
total_carbohydrate_zero_carb = round(zero_carb_food["Carbohydrate"].str.replace("g", "").astype(float).sum(), 2)
# Calculate the total carbohydrate in high protein food items
total_carbohydrate_high_protein = round(high_protein_food["Carbohydrate"].str.replace("g", "").astype(float).sum(), 2)
vitamin_c_comparison = round(total_vitamin_c_high_protein/total_vitamin_c_zero_carb,2)
carbohydrate_comparison = round(total_carbohydrate_high_protein, 2)
# Compare the results
print("Total fat in zero-carb food items:", total_fat_zero_carb)
print("Total cholesterol in zero-carb food items:", total_cholesterol_zero_carb)
print("Total fiber in zero-carb food items:", total_fiber_zero_carb)
print("Total protein in zero-carb food items:", total_protein_zero_carb)
print("Total vitamin C in zero-carb food items:", total_vitamin_c_zero_carb)
print("Total water in zero-carb food items:", total_water_zero_carb)
print("Total carbohydrate in zero-carb food items:", total_carbohydrate_zero_carb)
print()
print("Total fat in high protein food items:", total_fat_high_protein)
print("Total cholesterol in high protein food items:", total_cholesterol_high_protein)
print("Total fiber in high protein food items:", total_fiber_high_protein)
print("Total protein in high protein food items:", total_protein_high_protein)
print("Total vitamin C in high protein food items:", total_vitamin_c_high_protein)
print("Total water in high protein food items:", total_water_high_protein)
print("Total carbohydrate in high protein food items:", total_carbohydrate_high_protein)
print()
print("Zero-carb food contains", total_fiber_zero_carb, "g of fiber in total, which is important for a healthy gut, and", vitamin_c_comparison, "times less vitamin C than high-protein food.")
print()
print("Zero-carb food contains", total_carbohydrate_zero_carb, "g of carbohydrate in total, while high-protein food contains", carbohydrate_comparison, "g; carbohydrates are an important source of nutrients and fuel for our body.")