Introduction to Statistics in Python

Part I

# Importing numpy, pandas, warnings, and matplotlib
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import matplotlib as mpl 

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")

plt.rcParams['figure.figsize'] = [7, 5]

pd.set_option('display.expand_frame_repr', False)

warnings.filterwarnings("ignore")

mpl.rcParams['axes.grid'] = True
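# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')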

Mean and median

In this chapter, you'll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains the kilograms of food consumed per person per year in each country for each food category (consumption), as well as the carbon footprint of that food category (co2_emission), measured in kilograms of carbon dioxide (CO2) per person per year in each country.

In this exercise, you'll compute measures of center to compare food consumption in the US and Belgium using your pandas and numpy skills.

pandas is imported as pd for you and food_consumption is pre-loaded.

food_consumption = pd.read_csv('./datasets/food_consumption.csv', index_col=0)
display(food_consumption.head())

# Subset for Belgium and USA only
be_and_usa = food_consumption[(food_consumption['country'] == 'Belgium') | (food_consumption['country'] == 'USA')]

# Group by country, select consumption column, and compute mean and median
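# (string aggregation names avoid the deprecation warning that newer pandas raises for np.mean / np.median)
print(be_and_usa.groupby('country')['consumption'].agg(['mean', 'median']))

# Filter for Belgium
be_consumption = food_consumption[food_consumption['country'] == 'Belgium']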

# Filter for USA
usa_consumption = food_consumption[food_consumption['country'] == 'USA'] 

# Calculate mean and median consumption in Belgium
print(np.mean(be_consumption['consumption']))
print(np.median(be_consumption['consumption']))

# Calculate mean and median consumption in USA
print(np.mean(usa_consumption['consumption']))
print(np.median(usa_consumption['consumption']))
  • Subset food_consumption to get the rows where food_category is 'rice'.
  • Create a histogram of co2_emission for rice and show the plot.

# Subset for food_category equals rice
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Histogram of co2_emission for rice and show plot
rice_consumption.co2_emission.hist()
plt.show()
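  • Use .agg() to calculate the mean and median of co2_emission for rice.

# Subset for food_category equals rice
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Calculate mean and median of co2_emission with .agg()
# (select the numeric column first so the string columns are not aggregated)
print(rice_consumption['co2_emission'].agg(['mean', 'median']))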

Measures of spread

Quartiles, quantiles, and quintiles

Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set. For example, you might want to give a discount to the 10% most active users on a website.
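The discount scenario can be sketched in a few lines. This is only an illustration: the users DataFrame, its user_id and sessions columns, and all of its values are made up here and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical activity data: sessions per user (values invented for illustration)
users = pd.DataFrame({
    "user_id": range(1, 21),
    "sessions": [3, 8, 1, 15, 7, 2, 30, 4, 9, 6,
                 12, 5, 22, 1, 10, 3, 18, 2, 45, 11],
})

# The 0.9 quantile is the activity level that 90% of users fall at or below
cutoff = users["sessions"].quantile(0.9)

# Users strictly above the cutoff are the most active 10%
top_users = users[users["sessions"] > cutoff]

print(cutoff)
print(top_users)
```

With 20 users, two of them land above the cutoff. When values tie at the cutoff, the selected share can differ slightly from exactly 10%.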

The quartiles, quintiles, and deciles of co2_emission can be computed with np.quantile() and np.linspace():

# Calculate the quartiles of co2_emission (five quantiles that split the data into four pieces)
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 5)))

# Calculate the quintiles of co2_emission (six quantiles that split the data into five pieces)
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 6)))

# Calculate the deciles of co2_emission (eleven quantiles that split the data into ten pieces)
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))

Those are some high-quality quantiles! While calculating more quantiles gives you a more detailed look at the data, it also produces more numbers, making the summary more difficult to quickly understand.
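To make the link between quantiles, center, and spread concrete, here is a minimal sketch on a small made-up array (not taken from food_consumption). It assumes the interquartile range, Q3 minus Q1, as the quantile-based measure of spread, and it also shows how the number of summary values grows as the data is split into more pieces.

```python
import numpy as np

# Small invented sample, used only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# np.linspace(0, 1, n + 1) gives the n + 1 probabilities that split the data into n pieces
quartiles = np.quantile(values, np.linspace(0, 1, 5))
deciles = np.quantile(values, np.linspace(0, 1, 11))

# Center: the 0.5 quantile is the median
median = quartiles[2]

# Spread: the interquartile range is the third quartile minus the first
iqr = quartiles[3] - quartiles[1]

print(quartiles)
print(median, iqr)

# More pieces means more numbers to read: 5 values for quartiles, 11 for deciles
print(len(quartiles), len(deciles))
```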

Variance and standard deviation