Introduction to Statistics in Python

Run the hidden code cell below to import the data used in this course.

# Importing numpy and pandas
import numpy as np
import pandas as pd

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")

Summary Statistics

Definition

  • The field of statistics: the practice and study of collecting and analyzing data
  • A summary statistic: a fact about or summary of some data

Types of Statistics

  1. Descriptive statistics
  • Describe and summarize the data at hand
    • 50% of friends drive to work
    • 25% take the bus
    • 25% ride a bike
  2. Inferential statistics
  • Use a sample of data to make inferences about a larger population
    • What percent of all people drive to work?
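
As a rough illustration of the descriptive example above, the shares can be computed with pandas. The `commute` values are made-up data for four friends, not part of the course datasets.

```python
import pandas as pd

# Hypothetical answers from four friends (illustrative data only)
commute = pd.Series(["drive", "drive", "bus", "bike"])

# Descriptive statistics: summarize the data we actually have
# normalize=True returns proportions instead of raw counts
shares = commute.value_counts(normalize=True)
print(shares)
```

Inferential statistics would go one step further and use these four answers to estimate the share of all people who drive to work, which needs a larger and more representative sample.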

Types of data:

  1. Numeric
  • Continuous (measured)
    • Airplane speed
    • Time spent waiting in line
  • Discrete (counted)
    • Number of pets
    • Number of packages shipped
  2. Categorical
  • Nominal (unordered)
    • Married/unmarried
    • Country of residence
  • Ordinal (ordered)
    • Strongly agree to strongly disagree (1-5)
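
One possible way to represent ordinal data in pandas is an ordered categorical. The sketch below uses made-up survey answers and assumes a five-level scale named as in the example above.

```python
import pandas as pd

# Assumed five-level scale, listed from lowest to highest
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey answers (illustrative data only)
answers = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "disagree"],
        categories=levels,
        ordered=True,
    )
)

# Because the categories are ordered, min/max and comparisons are meaningful
print(answers.min())
print(answers.cat.codes.tolist())  # position of each answer on the scale, starting at 0
```

A nominal variable such as country of residence would use `ordered=False` (the default), since its categories have no natural ranking.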

Measures of Center

Where is the center of the data?

  1. Mean
  2. Median
  3. Mode
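# Mode: the most frequent value

food['country'].value_counts().head(5)
import statistics
# If several values tie, statistics.mode returns the first one encountered (Python 3.8+)
statistics.mode(food["country"])
# Median and mean
# Option 1: NumPy functions on the filtered column
var = food[food['country']=='Argentina']['consumption']
print("median : ", np.median(var))
print("mean : ", np.mean(var))
# Option 2: pandas .agg() with function names as strings
food[food["country"]=="Argentina"]["consumption"].agg(["mean", "median"])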

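If the data are skewed, the mean is pulled toward the long tail, so the choice of measure depends on the shape of the distribution.

  1. Right-skewed: the tail extends to the right, and the mean is usually larger than the median
  2. Left-skewed: the tail extends to the left, and the mean is usually smaller than the median

For skewed data, the median is the recommended measure of center.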
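
A small sketch of why the median is preferred for skewed data. The `values` array is made up for illustration: most values are small and one is very large, which gives a right-skewed shape.

```python
import numpy as np

# Hypothetical right-skewed values (illustrative data only)
values = np.array([1, 2, 2, 3, 3, 3, 4, 50])

mean_value = np.mean(values)      # pulled up by the single large value
median_value = np.median(values)  # stays near the bulk of the data

print("mean : ", mean_value)
print("median : ", median_value)
```

Here the mean lands well above most of the observations, while the median still describes a typical value.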

Measures of Spread

  1. Variance
  2. Standard deviation
  3. Quartiles
  4. IQR (Q3-Q1)
  5. Outliers
# To calculate the variance, use np.var()
np.var(food["consumption"], ddof=1)

# ddof=1 gives the sample variance (divide by n - 1)
# Without ddof=1 (the default is ddof=0), np.var() gives the population variance (divide by n)
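
To make the effect of `ddof` concrete, the variance can also be computed by hand. This is a minimal sketch on a made-up array, not on the course data.

```python
import numpy as np

# Hypothetical small data set (illustrative only)
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Squared distance of each value from the mean
squared_dev = (data - data.mean()) ** 2

pop_var = squared_dev.sum() / len(data)            # population variance: divide by n
sample_var = squared_dev.sum() / (len(data) - 1)   # sample variance: divide by n - 1

# The manual results should match np.var() with the corresponding ddof
print(pop_var, np.var(data))
print(sample_var, np.var(data, ddof=1))
```
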
# To calculate the standard deviation, use np.std()

np.std(food["consumption"], ddof=1)
# Remember that the standard deviation is the square root of the variance
np.sqrt(np.var(food["consumption"], ddof=1))
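
The remaining measures in the list (quartiles, IQR, outliers) can be sketched with `np.quantile()`. The example uses a made-up array, and the 1.5 × IQR cutoff for outliers is a common rule of thumb assumed here; the notes above do not specify one.

```python
import numpy as np

# Hypothetical data with one extreme value (illustrative only)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles: the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])

# IQR: the spread of the middle 50% of the data
iqr = q3 - q1

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("quartiles : ", q1, q2, q3)
print("IQR : ", iqr)
print("outliers : ", outliers)
```

The same steps should carry over to a column such as `food["consumption"]` by replacing `data` with the column.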