Summary Statistics

# Importing numpy and pandas
import numpy as np
import pandas as pd

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")
happiness.head()

Measures of Center

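  • mean (np.mean): sum of the values divided by the number of values
  • median (np.median): middle value once the data is sorted
  • mode (statistics.mode, or the top entry of value_counts): most frequent value

The mean is sensitive to outliers and skew; the median is more robust.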
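
A minimal sketch of that sensitivity, using a small made-up array rather than the course datasets (the values and variable names are assumptions for illustration only). Adding one extreme value moves the mean far more than the median:

```python
import numpy as np

# Hypothetical toy data, not taken from the course datasets
values = np.array([70.0, 72.0, 74.0, 76.0, 78.0])
with_outlier = np.append(values, 200.0)  # add a single extreme value

mean_before, median_before = np.mean(values), np.median(values)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

# Compare how far each measure of center moved
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```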

Measures of Spread

  • Variance
  • Standard Deviation
  • Mean Absolute Deviation
  • Quantiles
  • IQR
  • Outliers
'''
Variance - take each value's distance from the mean, square the distances, sum them, then divide by the number of data points - 1 (sample variance)
'''
# ddof=1 computes the sample variance; omitting it (default ddof=0) computes the population variance

np.var(happiness['life_exp'], ddof=1)
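
The steps in the variance note can be written out by hand to check them against np.var. This is a sketch on a small made-up array (an assumption, not the happiness data), so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical toy data so each step can be verified by hand
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

dists = x - np.mean(x)          # distance of each point from the mean
sq_dists = dists ** 2           # square the distances
total = np.sum(sq_dists)        # sum them

sample_var = total / (len(x) - 1)   # divide by n - 1 (matches ddof=1)
pop_var = total / len(x)            # divide by n (matches the default ddof=0)

print(sample_var, np.var(x, ddof=1))
print(pop_var, np.var(x))
print(np.sqrt(sample_var), np.std(x, ddof=1))  # standard deviation is the square root
```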
'''
Standard Deviation - sqrt of variance
'''
#ddof = 1, calculates as a sample instead of population
np.std(happiness['life_exp'], ddof = 1)
'''
Mean absolute deviation (MAD) - take the absolute value of each distance to the mean, then take the mean of those distances

Not the same as the standard deviation: STD squares the distances, so longer distances are penalized more. In other words, outliers have a bigger effect on the STD, as the comparison below shows.

MAD penalizes each distance equally
'''
dists = happiness['life_exp'] - np.mean(happiness['life_exp'])
MAD = np.mean(np.abs(dists))
print('MAD:', MAD)
# No ddof here, so this is the population STD (divides by n, like the MAD above)
print('STD:', np.std(happiness['life_exp']))
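
To see the difference in how the two measures react to an outlier, here is a sketch on made-up numbers. The helper name mean_abs_dev and the data are assumptions for illustration, not part of the course material:

```python
import numpy as np

def mean_abs_dev(x):
    """Mean of the absolute distances to the mean (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - np.mean(x)))

# Hypothetical toy data, with and without one extreme value
base = np.array([70.0, 72.0, 74.0, 76.0, 78.0])
with_outlier = np.append(base, 120.0)

mad_base, std_base = mean_abs_dev(base), np.std(base)
mad_out, std_out = mean_abs_dev(with_outlier), np.std(with_outlier)

print("base         MAD:", mad_base, "STD:", std_base)
print("with outlier MAD:", mad_out, "STD:", std_out)
```

Both measures grow when the outlier is added, but the STD grows by more because the large distance is squared before averaging.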
'''
Quantiles - split data into equal parts (Quartiles, deciles etc)
'''
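# Five cut points that split the data into four equal parts (quartiles). Instead of a list, you can pass a single number such as 0.50
# The list of cut points can also be generated with np.linspace(start, stop, num)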
np.quantile(happiness['life_exp'], [0,0.25,0.50,0.75,1])
#An example of quartiles using a boxplot
import matplotlib.pyplot as plt
plt.boxplot(happiness['life_exp'])
plt.show()

array([52.9 , 69.1 , 74.9 , 79.65, 85.1 ])

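In this array

  • 1st Quartile = 69.1

  • 2nd Quartile (median) = 74.9

  • 3rd Quartile = 79.65

52.9 and 85.1 are the minimum and maximum of the data. In the boxplot, the whiskers reach at most 1.5 × IQR beyond the box, and any points farther out are drawn individually as outliers.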
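
The np.linspace alternative mentioned in the quantile comments can be sketched as follows. The data here is a made-up range of integers (an assumption), chosen only so the cut points are easy to read:

```python
import numpy as np

# Hypothetical data: the integers 1..100
x = np.arange(1, 101)

# linspace(0, 1, 5) gives 5 cut points -> 4 equal parts (quartiles)
quartile_points = np.linspace(0, 1, 5)
# linspace(0, 1, 6) gives 6 cut points -> 5 equal parts (quintiles)
quintile_points = np.linspace(0, 1, 6)

quartiles = np.quantile(x, quartile_points)
quintiles = np.quantile(x, quintile_points)

print(quartile_points, quartiles)
print(quintile_points, quintiles)
```

Note that num counts the cut points, so splitting into k equal parts needs num = k + 1.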

'''
Interquartile Range (IQR): distance between the 25th and 75th percentiles, i.e. the height of the box in a boxplot
'''

quart_25 = np.quantile(happiness['life_exp'], 0.25)
quart_75 = np.quantile(happiness['life_exp'], 0.75)
IQR = quart_75 - quart_25
IQR
# scipy.stats.iqr returns the same value in one call
from scipy.stats import iqr
iqr(happiness['life_exp'])
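
Outliers are listed under Measures of Spread but not worked through in these notes. One common rule of thumb, assumed here rather than taken from the notes, flags any point more than 1.5 × IQR below the 1st quartile or above the 3rd quartile. A sketch on made-up data:

```python
import numpy as np

# Hypothetical toy data with one low and one high extreme value
x = np.array([52.0, 69.0, 71.0, 73.0, 75.0, 77.0, 79.0, 81.0, 120.0])

q1, q3 = np.quantile(x, [0.25, 0.75])
iqr_value = q3 - q1

# Fences from the 1.5 * IQR rule of thumb (an assumed convention)
lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value

# Keep only the points that fall outside the fences
outliers = x[(x < lower) | (x > upper)]
print("fences:", lower, upper)
print("outliers:", outliers)
```

The same fences are what a default matplotlib boxplot uses to decide where the whiskers stop.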