What is good food?
Background
You and your friend have gotten into a debate about nutrition. Your friend follows a high-protein diet and does not eat any carbohydrates (no grains, no fruits). You claim that a balanced diet should contain all nutrients but should be low in calories. Both of you quickly realize that most of what you know about nutrition comes from mainstream and social media.
Being the data scientist that you are, you offer to look at the data yourself to answer a few key questions.
The data
You source nutrition data from USDA's FoodData Central website. This data contains the calorie content of 7,793 common foods, as well as their nutritional composition. Each row represents one food item, and nutritional values are based on a 100g serving. Here is a description of the columns:
- FDC_ID: A unique identifier for each food item in the database.
- Item: The name or description of the food product.
- Category: The category or classification of the food item, such as "Baked Products" or "Vegetables and Vegetable Products".
- Calories: The energy content of the food, presented in kilocalories (kcal).
- Protein: The protein content of the food, measured in grams.
- Carbohydrate: The carbohydrate content of the food, measured in grams.
- Total fat: The total fat content of the food, measured in grams.
- Cholesterol: The cholesterol content of the food, measured in milligrams.
- Fiber: The dietary fiber content of the food, measured in grams.
- Water: The water content of the food, measured in grams.
- Alcohol: The alcohol content of the food (if any), measured in grams.
- Vitamin C: The Vitamin C content of the food, measured in milligrams.
import pandas as pd
df_food = pd.read_csv('nutrition.csv')
Summary:
Create a report that covers the following:
- Which fruit has the highest vitamin C content, and what are some other sources of vitamin C.
- The relationship between calories and water content.
- Possible drawbacks of a zero-carb diet and of a very high-protein diet.
- Fit a linear model to estimate the kilocalories per gram of protein, carbohydrate, and fat.
- Alcohol as a source of calories.
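As a rough illustration of the linear-model item above, the sketch below fits calories against protein, carbohydrate, and fat with ordinary least squares and no intercept. The toy rows are invented for the example and do not come from nutrition.csv. They were generated from the commonly cited factors of 4, 4, and 9 kcal per gram, so the fit simply recovers those values; the real data will be noisier.

```python
import numpy as np

# Hypothetical toy data: grams of protein, carbohydrate, and fat per 100 g.
# These rows are invented for illustration; they are not from nutrition.csv.
macros = np.array([
    [10.0, 20.0, 5.0],
    [25.0, 0.0, 15.0],
    [2.0, 60.0, 1.0],
    [0.0, 10.0, 30.0],
    [8.0, 45.0, 12.0],
])

# Calories built from assumed factors of 4, 4, and 9 kcal per gram.
assumed_factors = np.array([4.0, 4.0, 9.0])
calories = macros @ assumed_factors

# Least-squares fit without an intercept: calories ~ macros @ coef.
coef, residuals, rank, _ = np.linalg.lstsq(macros, calories, rcond=None)

for name, value in zip(["Protein", "Carbohydrate", "Total fat"], coef):
    print(f"{name}: {value:.2f} kcal per gram")
```

Leaving out the intercept encodes the assumption that a food with no protein, carbohydrate, or fat has no calories; alcohol is one reason that assumption may not hold on the real data.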
First of all, we need to explore the data and look at the DataFrame's summary: its columns and their types.
df_food.info()
Drop all rows with missing data and look at the information for the resulting DataFrame.
df_food_Nna = df_food.dropna()
df_food_Nna.info()
The columns are of object type, so we need to convert their values to numbers before we can process the data.
import numpy as np
df_food[['Vitamin C','Cholesterol']] = df_food[['Vitamin C','Cholesterol']].fillna('0.0 mg')
df_food[['Fiber','Alcohol']]= df_food[['Fiber','Alcohol']].fillna('0.0 g')
We need all of the data, so instead of dropping rows we replace missing values with 0 as the starting point of our journey.
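To see what the fill does, here is a small sketch on a made-up frame. The values are invented, and the layout is only assumed to mirror the raw format of nutrition.csv (strings with a unit suffix). It counts missing entries per column, then fills them with a zero that carries the same unit as the other values, so the column can still be parsed later.

```python
import pandas as pd

# Hypothetical sample in the assumed raw format (strings with unit suffixes).
toy = pd.DataFrame({
    "Item": ["Orange", "Butter", "Beer"],
    "Vitamin C": ["53.2 mg", None, None],
    "Alcohol": [None, None, "3.9 g"],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()
print(missing_before)

# Fill each column with a zero that keeps its unit suffix.
toy = toy.fillna({"Vitamin C": "0.0 mg", "Alcohol": "0.0 g"})
missing_after = toy.isna().sum()
print(missing_after)
```

Filling with 0 treats an unreported value as "none present", which is an assumption worth keeping in mind when reading later results.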
def DataframeAddCol(df, dic):
    """
    Convert string columns with units ('mg', 'g', 'kcal', ...) to float and add them as new columns.
    df: DataFrame; dic: dictionary mapping column names to their units, such as mg, g, ...
    """
    for i, x in dic.items():
        # Name the new column after the original column and its unit, e.g. "Protein_g".
        new_col_name = i + "_" + x
        # Keep the text before the unit and parse it as a number; unparseable values become NaN.
        df[new_col_name] = pd.to_numeric(df[i].str.split(x).str[0], errors='coerce')
    return df
Our first step is to convert all object columns to numbers.
dictcol = {"Calories": "kcal",
           "Protein": "g",
           "Carbohydrate": "g",
           "Total fat": "g",
           "Cholesterol": "mg",
           "Fiber": "g",
           "Water": "g",
           "Alcohol": "g",
           "Vitamin C": "mg"}
DataframeAddCol(df_food, dictcol)
Now we have a DataFrame with new numerical columns derived from the originals.
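A quick sanity check of this kind of conversion can be sketched on a made-up frame. The helper name `add_numeric_columns` and the sample values are hypothetical; the helper mirrors the split-then-parse approach used above, with an extra strip of surrounding whitespace added as a precaution.

```python
import pandas as pd

def add_numeric_columns(df, units):
    """Add a '<column>_<unit>' float column for each entry in units (hypothetical helper)."""
    for col, unit in units.items():
        # Take the text before the unit, trim whitespace, and parse it as a number.
        text = df[col].str.split(unit).str[0].str.strip()
        df[col + "_" + unit] = pd.to_numeric(text, errors="coerce")
    return df

# Invented sample rows in the assumed raw format.
sample = pd.DataFrame({
    "Calories": ["47 kcal", "717 kcal"],
    "Vitamin C": ["53.2 mg", "0.0 mg"],
    "Alcohol": ["0.0 g", "not reported"],
})

converted = add_numeric_columns(sample, {"Calories": "kcal", "Vitamin C": "mg", "Alcohol": "g"})

# The new columns should be numeric; text that cannot be parsed becomes NaN.
print(converted.dtypes)
print(converted[["Calories_kcal", "Vitamin C_mg", "Alcohol_g"]])
```

Checking the dtypes and the NaN count after the conversion is a cheap way to catch rows whose format differs from the expected "number unit" pattern.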