Introduction to Statistics in Python

Summary Statistics

# Importing numpy and pandas
import numpy as np
import pandas as pd

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")

happiness.head()

Measures of Center

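mean (np.mean): sum of the values divided by the number of values
median (np.median): middle value of the sorted data
mode (statistics.mode, or value_counts in pandas): most frequent value

The mean is sensitive to outliers and skew. The median is much less affected, so it is the better summary for skewed data.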
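To make the outlier sensitivity concrete, here is a minimal sketch on made-up numbers (the `values` array is hypothetical and not part of the course datasets). It compares the three measures of center before and after one extreme value is appended.

```python
import statistics
import numpy as np

# Hypothetical toy data (not from the course datasets)
values = np.array([2, 3, 3, 4, 5, 5, 5])
with_outlier = np.append(values, 100)   # add a single extreme value

mean_before, mean_after = np.mean(values), np.mean(with_outlier)
median_before, median_after = np.median(values), np.median(with_outlier)
mode_value = statistics.mode(with_outlier.tolist())   # most frequent value

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:  ", mode_value)
```

The mean jumps by more than ten units while the median moves by half a unit and the mode does not move at all.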

Measures of Spread

Variance
Standard Deviation
Mean Absolute Deviation
Quantiles
IQR
Outliers

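'''
Variance - take each data point's distance from the mean, square the distances, sum them, then divide by the number of data points - 1
'''
#ddof=1 calculates the sample variance (divide by n - 1); without it (ddof=0, the default) np.var calculates the population variance (divide by n)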

np.var(happiness['life_exp'], ddof=1)

'''
Standard Deviation - square root of the variance
'''
#ddof=1 calculates the sample standard deviation instead of the population one
np.std(happiness['life_exp'], ddof=1)
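The variance and standard deviation recipes above can be checked by hand. This is a small sketch on a hypothetical five-point sample (not the happiness data) that follows the steps literally and compares the result with `np.var` and `np.std`.

```python
import numpy as np

# Hypothetical toy sample (not from the course datasets)
sample = np.array([4.0, 7.0, 9.0, 10.0, 15.0])

dists = sample - np.mean(sample)                    # distance of each point from the mean
sq_dists = dists ** 2                               # square the distances
manual_var = np.sum(sq_dists) / (len(sample) - 1)   # sum, then divide by n - 1
manual_std = np.sqrt(manual_var)                    # standard deviation = sqrt of variance

print(manual_var, np.var(sample, ddof=1))   # the two sample variances agree
print(manual_std, np.std(sample, ddof=1))   # the two sample standard deviations agree
print(np.var(sample))                       # ddof=0 default: population variance, smaller
```

Dividing by n - 1 instead of n always gives the larger value, which is why the last line prints a smaller number than the first.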

'''
Mean absolute deviation (MAD) - take the absolute value of each distance from the mean, then take the mean of those absolute distances

Not the same as the standard deviation: the standard deviation squares the distances, so longer distances are penalized more. In other words, outliers have a bigger effect on the standard deviation, as you can see below.

MAD penalizes each distance equally
'''
dists = happiness['life_exp'] - np.mean(happiness['life_exp'])
MAD = np.mean(np.abs(dists))
print('MAD:', MAD)
#no ddof here: population standard deviation, which divides by n just like MAD
print('STD:', np.std(happiness['life_exp']))
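To see the difference in how the two measures treat outliers, the sketch below uses made-up numbers and a hypothetical helper, `mean_abs_dev`, and reports both measures with and without one extreme value.

```python
import numpy as np

def mean_abs_dev(x):
    """Mean of the absolute distances from the mean (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - np.mean(x)))

# Hypothetical toy data (not from the course datasets)
base = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
with_outlier = np.append(base, 60.0)

ratios = {}
for label, data in [("base", base), ("with outlier", with_outlier)]:
    mad = mean_abs_dev(data)
    std = np.std(data)   # population standard deviation (ddof=0), divides by n like MAD
    ratios[label] = std / mad
    print(f"{label}: MAD={mad:.2f}  STD={std:.2f}  STD/MAD={std / mad:.2f}")
```

Both measures grow when the outlier is added, but the STD/MAD ratio also grows, because squaring gives the one large distance extra weight.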

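'''
Quantiles - split the data into equal parts (quartiles, deciles, etc.)
'''
#Five cut points (0, 0.25, 0.5, 0.75, 1) split the data into four equal parts (quartiles). Instead of a list, you can pass a single number such as 0.50
#can also build the list with np.linspace(start, stop, num), e.g. np.linspace(0, 1, 5)
np.quantile(happiness['life_exp'], [0, 0.25, 0.50, 0.75, 1])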

#An example of quartiles using a boxplot
import matplotlib.pyplot as plt
plt.boxplot(happiness['life_exp'])
plt.show()

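Output of the np.quantile call above:

array([52.9 , 69.1 , 74.9 , 79.65, 85.1 ])

In this array

Minimum = 52.9
1st Quartile = 69.1
2nd Quartile (median) = 74.9
3rd Quartile = 79.65
Maximum = 85.1

52.9 and 85.1 are the minimum and maximum of the data. With matplotlib's default settings the boxplot whiskers reach them only if they lie within 1.5 * IQR of the box; values beyond that are drawn as individual outlier points.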
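The np.linspace shortcut mentioned in the quantile comments can be sketched as follows. The `data` array is a hypothetical stand-in (the integers 1 to 101), chosen so the cut points are easy to read.

```python
import numpy as np

# Hypothetical stand-in data: the integers 1..101
data = np.arange(1, 102)

# 5 evenly spaced cut points from 0 to 1 -> quartiles (4 equal parts)
quartiles = np.quantile(data, np.linspace(0, 1, 5))

# 11 evenly spaced cut points from 0 to 1 -> deciles (10 equal parts)
deciles = np.quantile(data, np.linspace(0, 1, 11))

print(quartiles)
print(deciles)
print(np.quantile(data, 0.5) == np.median(data))   # the 0.5 quantile is the median
```

Note that `num` counts cut points, not parts: splitting into k equal parts needs k + 1 cut points.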

'''
Interquartile Range (IQR): distance between the 25th and 75th percentiles (the 1st and 3rd quartiles), i.e. the height of the box in a boxplot
'''

quart_25 = np.quantile(happiness['life_exp'], 0.25)
quart_75 = np.quantile(happiness['life_exp'], 0.75)
IQR = quart_75 - quart_25
IQR

#scipy computes the same value in one call
from scipy.stats import iqr
iqr(happiness['life_exp'])
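The outline lists Outliers, but the notes stop at the IQR. A common rule of thumb, assumed here rather than taken from these notes, flags any value more than 1.5 * IQR below the 1st quartile or above the 3rd quartile. A minimal sketch on hypothetical data:

```python
import numpy as np

# Hypothetical toy data (not from the course datasets)
data = np.array([52.0, 68.0, 70.0, 72.0, 74.0, 75.0, 77.0, 79.0, 81.0, 110.0])

q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
iqr_value = q3 - q1

# Assumed rule of thumb: outliers lie more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr_value)
print("bounds:", lower, upper)
print("outliers:", outliers)
```

The same boolean mask works on a pandas column, for example `happiness['life_exp']`, once the bounds are computed from that column.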

