Introduction to Statistics in Python
Summary Statistics
# Importing numpy and pandas
import numpy as np
import pandas as pd
# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")
happiness.head()
Measures of Center
- mean (np.mean)
- median (np.median): the middle value of the sorted data
- mode (statistics.mode, or the top entry of value_counts)
The mean is sensitive to outliers and skew; the median is much less affected.
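A small sketch of that sensitivity, using made-up toy data rather than the course datasets (the `centers` helper and both lists are hypothetical). Replacing a single value with an outlier moves the mean a lot while the median and mode stay put:

```python
import statistics
import numpy as np

# Hypothetical toy data (not from the course datasets)
clean = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]  # same data, last value replaced by an outlier

def centers(data):
    """Return the mean, median and mode of a list of numbers."""
    return {
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "mode": statistics.mode(data),
    }

clean_centers = centers(clean)
outlier_centers = centers(with_outlier)

print("clean:       ", clean_centers)
print("with outlier:", outlier_centers)
```

Only the mean changes between the two printouts, which is why the median is often preferred for skewed data.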
Measures of Spread
- Variance
- Standard Deviation
- Mean Absolute Deviation
- Quantiles
- IQR
- Outliers
'''
Variance - take each value's distance from the mean, square the distances, sum them, then divide by (number of data points - 1)
'''
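The steps above can be checked by hand on a small array. This is a minimal sketch on made-up numbers (the `data` array is hypothetical), comparing the manual result with `np.var`:

```python
import numpy as np

# Hypothetical toy data (not from the course datasets)
data = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

dists = data - np.mean(data)      # distance of each value from the mean
squared = dists ** 2              # square the distances
sum_sq = squared.sum()            # sum them

sample_var = sum_sq / (len(data) - 1)   # divide by n - 1 (sample)
population_var = sum_sq / len(data)     # divide by n (population)

print("sample variance:    ", sample_var, np.var(data, ddof=1))
print("population variance:", population_var, np.var(data))
print("sample std:         ", np.sqrt(sample_var), np.std(data, ddof=1))
```

Each line prints the manual value next to the NumPy value so the two can be compared directly.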
# ddof=1 calculates the sample variance; leaving it out (default ddof=0) gives the population variance
np.var(happiness['life_exp'], ddof=1)
'''
Standard Deviation - sqrt of variance
'''
#ddof = 1, calculates as a sample instead of population
np.std(happiness['life_exp'], ddof = 1)
'''
Mean absolute deviation - take the absolute value of each distance to the mean, then take the mean of those distances
Not the same as STD: STD squares the distances, so longer distances are penalized more. In other words, outliers have a bigger effect on the STD, as you see below.
MAD weights each distance linearly, with no extra penalty for large ones
'''
dists = happiness['life_exp'] - np.mean(happiness['life_exp'])
MAD = np.mean(abs(dists))
print('MAD:', MAD)
# np.std defaults to ddof=0 (population), which matches the plain mean used for MAD above
print('STD:', np.std(happiness['life_exp']))
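To make the outlier effect concrete, here is a rough sketch on made-up numbers (the `mad` helper and both lists are hypothetical, not part of the course code). It measures how much each statistic grows when one value is replaced by an outlier:

```python
import numpy as np

def mad(data):
    """Mean absolute deviation: mean of the absolute distances to the mean."""
    data = np.asarray(data, dtype=float)
    return float(np.mean(np.abs(data - data.mean())))

# Hypothetical toy data (not from the course datasets)
clean = [10, 12, 14, 16, 18]
with_outlier = [10, 12, 14, 16, 58]  # last value replaced by an outlier

mad_clean, mad_out = mad(clean), mad(with_outlier)
std_clean, std_out = float(np.std(clean)), float(np.std(with_outlier))  # population STD

mad_growth = mad_out / mad_clean
std_growth = std_out / std_clean

print("MAD:", mad_clean, "->", mad_out, "growth factor", mad_growth)
print("STD:", std_clean, "->", std_out, "growth factor", std_growth)
```

On this data the STD grows by a larger factor than the MAD, because squaring gives the single large distance extra weight.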
'''
Quantiles - split data into equal parts (Quartiles, deciles etc)
'''
# 5 cut points split the data into 4 equal segments (quartiles). Instead of a list, you could pass one number like 0.50
# can also use np.linspace(start, stop, num)
np.quantile(happiness['life_exp'], [0,0.25,0.50,0.75,1])
# An example of quartiles using a boxplot
import matplotlib.pyplot as plt
plt.boxplot(happiness['life_exp'])
plt.show()
array([52.9 , 69.1 , 74.9 , 79.65, 85.1 ])
In this array
- 1st Quartile = 69.1
- 2nd Quartile (median) = 74.9
- 3rd Quartile = 79.65
52.9 and 85.1 are the minimum and maximum, the outer edges of the plot
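The np.linspace option mentioned in the quantile code generates the cut points for you. This is a minimal sketch, assuming a made-up array of the integers 1 to 101 (chosen only so that the quantiles come out as round numbers):

```python
import numpy as np

# Hypothetical toy data: the integers 1..101
data = np.arange(1, 102)

# np.linspace(start, stop, num) returns num evenly spaced points, endpoints included
quartile_points = np.linspace(0, 1, 5)    # 0, 0.25, 0.5, 0.75, 1
decile_points = np.linspace(0, 1, 11)     # 0, 0.1, ..., 1

quartiles = np.quantile(data, quartile_points)
deciles = np.quantile(data, decile_points)

print("quartiles:", quartiles)
print("deciles:  ", deciles)
```

Note that num counts cut points, not segments: 5 points give 4 segments and 11 points give 10.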
'''
Interquartile Range (IQR): the distance between the 25th and 75th percentiles, i.e. the height of the box in a boxplot
'''
quart_25 = np.quantile(happiness['life_exp'], 0.25)
quart_75 = np.quantile(happiness['life_exp'], 0.75)
IQR = quart_75 - quart_25
IQR
from scipy.stats import iqr
iqr(happiness['life_exp'])
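Outliers appear in the list of spread measures but are not worked through in these notes. One common rule of thumb, assumed here rather than taken from the notes, flags any value more than 1.5 × IQR below the first quartile or above the third. A sketch on made-up data:

```python
import numpy as np

# Hypothetical toy data (not from the course datasets)
data = np.array([52.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 95.0])

q25, q75 = np.quantile(data, [0.25, 0.75])
iqr_value = q75 - q25

# Assumed rule of thumb: fences at 1.5 * IQR beyond the quartiles
lower = q25 - 1.5 * iqr_value
upper = q75 + 1.5 * iqr_value

outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr_value)
print("fences:", lower, upper)
print("outliers:", outliers)
```

The 1.5 multiplier is a convention, not a law; a different threshold may suit other data better.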