Let us find out what good food should be
1. Background
My friend and I got into a debate about nutrition. My friend follows a high-protein diet and does not eat any carbohydrates (no grains, no fruits). I claim that a balanced diet should contain all nutrients but should be low in calories. However, we quickly realized that most of what we know about nutrition comes from mainstream and social media.
As a data scientist, I will dive deeper into real nutrition data to answer a few key questions.
2. The data
The data is sourced from USDA's FoodData Central website. This data contains the calorie content of 7,793 common foods, as well as their nutritional composition. Each row represents one food item, and nutritional values are based on a 100g serving. Here is a description of the columns:
- FDC_ID: A unique identifier for each food item in the database.
- Item: The name or description of the food product.
- Category: The category or classification of the food item, such as "Baked Products" or "Vegetables and Vegetable Products".
- Calories: The energy content of the food, presented in kilocalories (kcal).
- Protein: The protein content of the food, measured in grams.
- Carbohydrate: The carbohydrate content of the food, measured in grams.
- Total fat: The total fat content of the food, measured in grams.
- Cholesterol: The cholesterol content of the food, measured in milligrams.
- Fiber: The dietary fiber content of the food, measured in grams.
- Water: The water content of the food, measured in grams.
- Alcohol: The alcohol content of the food (if any), measured in grams.
- Vitamin C: The Vitamin C content of the food, measured in milligrams.
3. Executive Summary/Findings
- The Acerola (West Indian cherry), raw fruit has the highest Vitamin C content, followed closely by Acerola juice.
- The Spices and Herbs food category leads in Vitamin C content, followed by Fruits and Fruit Juices.
- As the table visual below shows, the ten foods with the highest Vitamin C content come from these categories: beverages, fruits and fruit juices, baby food, vegetables and vegetable products, and spices and herbs.
- There is a strong negative correlation between the calories and the water content of a food item. The correlation coefficient is -0.8955.
- The linear model is highly significant: the Calories values from the dataset have a strong and highly significant relationship with the calories suggested by the Cleveland Clinic model. The model explains a large proportion of the variability in the calculated calories.
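The last two findings can be illustrated with a minimal sketch. It rests on several assumptions: the toy data frame holds made-up values rather than rows from the dataset, the Cleveland Clinic model is taken to be the common rule of 4 kcal per gram of protein, 4 kcal per gram of carbohydrate and 9 kcal per gram of fat, and the calculated calories are treated as the response variable. The names toy, Calc_calories, r_cal_water and fit are hypothetical.

```r
# Toy data with MADE-UP values, for illustration only (not taken from the dataset)
toy <- data.frame(
  Calories     = c(52, 717, 165, 89, 884),
  Water        = c(85.6, 15.9, 65.3, 74.9, 0),
  Protein      = c(0.3, 0.9, 31.0, 1.1, 0),
  Carbohydrate = c(13.8, 0.1, 0, 22.8, 0),
  Total_fat    = c(0.2, 81.1, 3.6, 0.3, 100)
)

# Pearson correlation between calories and water content
r_cal_water <- cor(toy$Calories, toy$Water)
r_cal_water

# ASSUMED formula: 4 kcal/g protein, 4 kcal/g carbohydrate, 9 kcal/g fat
toy$Calc_calories <- 4 * toy$Protein + 4 * toy$Carbohydrate + 9 * toy$Total_fat

# Regress the calculated calories on the reported calories
# (the direction of the regression is an assumption)
fit <- lm(Calc_calories ~ Calories, data = toy)
summary(fit)
```

In the summary output, the p-value of the F-statistic indicates whether the model is significant, and the R-squared value gives the proportion of variability explained.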
4. EDA and Data Cleaning
Let us start by loading the necessary libraries into our workspace. Then we will load the dataset and view it to get a glimpse of its structure before proceeding with our analysis.
library(tidyverse)
library(skimr)
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
# load the dataset into a data frame
nutrition <- read_csv('nutrition.csv', show_col_types = FALSE)
Let us examine the dataset to understand its structure. We start by viewing the first few rows with head(), then use glimpse() to see the column names and types.
head(nutrition)
glimpse(nutrition)
Here, I will rename the FDC_ID, Item, Total fat and Vitamin C columns to Food_ID, Description, Total_fat and Vitamin_C respectively. This makes the names more meaningful, descriptive and easy to use.
df_nutrition <- rename(nutrition, Food_ID = FDC_ID, Description = Item, Total_fat = `Total fat`, Vitamin_C = `Vitamin C`)
df_nutrition
Next, we remove the units from the values and convert the results from character to numeric type, which enables us to run our analysis on the numeric data. The Food_ID column, by contrast, is converted from numeric to character type because it is categorical data.
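The unit-stripping step described above could look roughly like the following sketch. It assumes the raw values are strings with a unit suffix such as "52 kcal" or "0.3 g"; the actual raw format is not shown here, so treat this as a guess. The toy tibble holds made-up values, and the names toy and cleaned are hypothetical.

```r
library(dplyr)
library(readr)

# Toy rows with MADE-UP values; the unit suffixes are an assumption about the raw format
toy <- tibble::tibble(
  Food_ID   = c(1001, 1002),
  Calories  = c("52 kcal", "717 kcal"),
  Protein   = c("0.3 g", "0.9 g"),
  Vitamin_C = c("4.6 mg", "0 mg")
)

cleaned <- toy %>%
  mutate(
    # parse_number() keeps the first number in each string and drops the rest
    across(c(Calories, Protein, Vitamin_C), parse_number),
    # the identifier is a label, not a quantity, so store it as character
    Food_ID = as.character(Food_ID)
  )

glimpse(cleaned)
```

Using parse_number() avoids writing a separate regular expression per unit, but it would silently drop any text around the number, so it is worth checking a few rows against the raw data afterwards.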