Introduction to Statistics in Python
Run the hidden code cell below to import the data used in this course.
# Importing numpy and pandas
import numpy as np
import pandas as pd
# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")
Summary Statistics
Definition
- The field of statistics: the practice and study of collecting and analyzing data
- A summary statistic: a fact about or summary of some data
Types of Statistics
- Descriptive statistics
  - Describe and summarize data
    - 50% of friends drive to work
    - 25% take the bus
    - 25% take the bike
- Inferential statistics
  - Use a sample of data to make inferences about a larger population
    - What percent of people drive to work?
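The descriptive example above can be reproduced with a short sketch. The `commute` data is hypothetical (four friends, invented to match the percentages in the list) and is not part of the course datasets.

```python
import pandas as pd

# Hypothetical data: how four friends get to work (invented for illustration)
commute = pd.Series(["drive", "drive", "bus", "bike"])

# Descriptive statistic: percentage of the group using each commute mode
shares = commute.value_counts(normalize=True) * 100
print(shares)
```

These percentages only describe the four friends. Answering the inferential question (what percent of all people drive to work) would require treating a dataset like this as a sample from the larger population.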
Types of Data
- Numeric
  - Continuous (measured)
    - Airplane speed
    - Time spent waiting in line
  - Discrete (counted)
    - Number of pets
    - Number of packages shipped
- Categorical
  - Nominal (unordered)
    - Married/unmarried
    - Country of residence
  - Ordinal (ordered)
    - Strongly agree to strongly disagree (1-5)
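One way these data types might be represented in pandas is sketched below. The `survey` frame, its column names, and the five agreement levels are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey rows, invented for illustration
survey = pd.DataFrame({
    "wait_minutes": [3.5, 12.0, 7.25],                      # numeric, continuous (measured)
    "n_pets": [0, 2, 1],                                    # numeric, discrete (counted)
    "country": ["Argentina", "Japan", "Kenya"],             # categorical, nominal
    "agreement": ["agree", "strongly agree", "disagree"],   # categorical, ordinal
})

# Nominal: categories with no inherent order
survey["country"] = survey["country"].astype("category")

# Ordinal: categories with an explicit order, lowest to highest
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
survey["agreement"] = pd.Categorical(survey["agreement"], categories=levels, ordered=True)

print(survey.dtypes)
# Because the categories are ordered, min/max and comparisons are meaningful
print(survey["agreement"].max())
```

The data type matters because it determines which summary statistics make sense: a mean is meaningful for `wait_minutes`, while counts are the natural summary for `country`.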
Measures of Center
Where is the center of the data?
- Mean
- Median
- Mode
# Mode
food['country'].value_counts().head(5)
import statistics
statistics.mode(food["country"])
# Median and mean
# Option 1
var = food[food['country']=='Argentina']['consumption']
print("median : ", np.median(var))
print("mean : ", np.mean(var))
# Option 2
# String names are used here because recent pandas versions warn when NumPy functions are passed to agg()
food[food["country"]=="Argentina"]["consumption"].agg(["mean", "median"])
If the data are skewed, the mean is pulled toward the tail, so the best measure of center depends on the shape of the distribution.
- Right-skewed
- Left-skewed
For skewed data, it is recommended to use the median.
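The effect of skew on the mean and median can be seen in a small sketch. The `values` array is hypothetical, chosen so that a single large value creates a right skew.

```python
import numpy as np

# Hypothetical right-skewed data: most values are small, one is very large
values = np.array([1, 2, 2, 3, 3, 4, 50])

mean_value = np.mean(values)      # pulled toward the large value in the right tail
median_value = np.median(values)  # stays near the bulk of the data

print("mean   :", mean_value)
print("median :", median_value)
```

In this sketch the mean lands well above most of the observations, while the median stays among them, which is why the median is usually the safer summary for skewed data.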
Measures of Spread
- Variance
- Standard deviation
- Quartiles
- IQR (Q3-Q1)
- Outliers
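Quartiles, the IQR, and outliers from the list above can be sketched as follows. The `data` array is hypothetical, and the 1.5 × IQR cutoff is a common convention assumed here; the notes do not state which outlier rule they intend.

```python
import numpy as np

# Hypothetical measurements, invented for illustration
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30])

# Quartiles split the sorted data into four equal parts
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])

# IQR is the distance between the third and first quartiles
iqr = q3 - q1

# Assumed convention: a point is an outlier if it lies more than
# 1.5 * IQR below Q1 or above Q3
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("quartiles:", q1, q2, q3)
print("IQR      :", iqr)
print("outliers :", outliers)
```

The same pattern should apply to a column such as `food["consumption"]` in place of `data`.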
# To calculate the variance, use np.var()
np.var(food["consumption"], ddof=1)
# ddof=1 gives the sample variance
# Without ddof=1, np.var() returns the population variance
# To calculate the standard deviation, use np.std()
np.std(food["consumption"], ddof=1)
# Remember that the standard deviation is the square root of the variance
np.sqrt(np.var(food["consumption"], ddof=1))
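To make the role of `ddof` concrete, the variance can also be computed by hand and compared with NumPy's result. This is a minimal sketch on a hypothetical array `x`, not on the course data.

```python
import numpy as np

# Hypothetical sample, invented for illustration
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
n = len(x)

# Sum of squared distances from the mean
squared_dists = np.sum((x - np.mean(x)) ** 2)

# Sample variance divides by n - 1 (ddof=1)
sample_var = squared_dists / (n - 1)
# Population variance divides by n (NumPy's default, ddof=0)
population_var = squared_dists / n

print(np.isclose(sample_var, np.var(x, ddof=1)))
print(np.isclose(population_var, np.var(x)))
print(np.isclose(np.sqrt(sample_var), np.std(x, ddof=1)))
```

Note that the pandas methods `Series.var()` and `Series.std()` default to `ddof=1`, while `np.var()` and `np.std()` default to `ddof=0`, so it is worth passing `ddof` explicitly when mixing the two.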