
Introduction to Statistics in Python

Run the hidden code cell below to import the data used in this course.

# Importing numpy and pandas
import numpy as np
import pandas as pd

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")

Measures of Data Spread

Spread describes how far apart or close together the data points are, i.e. how far from the center the data tend to range. Just like measures of center, there are a few different measures of spread.

Quantiles for data spread

Quantiles are, in general, cut points that divide the data into equally sized groups. Percentiles are quantiles that divide the data into 100 groups.

  • Quantiles: Range from any value to any other value.

  • Percentiles: Range from 0 to 100.

  • Quartiles: Range from 0 to 4.

Note that percentiles and quartiles are simply types of quantiles.

Some types of quantiles even have specific names, including:

  • 4-quantiles are called quartiles.
  • 5-quantiles are called quintiles.
  • 8-quantiles are called octiles.
  • 10-quantiles are called deciles.
  • 100-quantiles are called percentiles.

Note that percentiles and quartiles share the following relationship:

  • 0 percentile = 0 quartile (also called the minimum)
  • 25th percentile = 1st quartile
  • 50th percentile = 2nd quartile (also called the median)
  • 75th percentile = 3rd quartile
  • 100th percentile = 4th quartile (also called the maximum)
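The relationship above can be checked numerically. The following is a minimal sketch that assumes a small made-up array (not one of the course datasets) and compares np.quantile, which takes fractions between 0 and 1, with np.percentile, which takes values between 0 and 100.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15])

# Quartile cut points expressed as fractions (for np.quantile)
quartile_fractions = [0, 0.25, 0.5, 0.75, 1]
# The same cut points expressed as percentages (for np.percentile)
quartile_percents = [0, 25, 50, 75, 100]

by_quantile = np.quantile(values, quartile_fractions)
by_percentile = np.percentile(values, quartile_percents)

print(by_quantile)    # minimum, Q1, median, Q3, maximum
print(by_percentile)  # same values, requested as percentiles
```

Both calls should return the same five numbers, with the first equal to the minimum, the middle one equal to the median, and the last equal to the maximum.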

To get an array of quantile values to pass to the np.quantile function, we can use np.linspace as follows:

# np.linspace(start, stop, num) returns num evenly spaced points.
# Splitting the data into n groups needs n + 1 cut points (0 and 1 included).
quartiles = np.linspace(0, 1, 5)
print('quartiles - 4:', quartiles)

quintiles = np.linspace(0, 1, 6)
print('quintiles - 5:', quintiles)

octiles = np.linspace(0, 1, 9)
print('octiles - 8:', octiles)

deciles = np.linspace(0, 1, 11)
print('deciles - 10:', deciles)

Ref: Statology (web), StatQuest (video)

# Summarize the columns, dtypes, and non-null counts of the food data
food.info()

Median:

The median is a measure of center: it is the midpoint of the data. To calculate the median, first order the data from smallest to largest. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the two data points nearest the middle.

food['co2_emission'].median()

The median CO2 emission is 16.53, so 50% of the data values are higher than 16.53 and 50% are lower.
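The odd/even rule for the median can be written out by hand. The sketch below is illustrative only: median_by_hand is a hypothetical helper name and the two sample lists are made up. It compares the result with np.median.

```python
import numpy as np

def median_by_hand(data):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle data point
        return ordered[mid]
    # Even count: average of the two data points nearest the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical samples, chosen only for illustration
odd_sample = [7, 1, 5, 3, 9]
even_sample = [7, 1, 5, 3]

print(median_by_hand(odd_sample), np.median(odd_sample))
print(median_by_hand(even_sample), np.median(even_sample))
```

In practice np.median or the pandas .median() method does the same job; the hand-written version is only meant to make the rule explicit.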

Quantile

The median is also a quantile, because it splits the data into groups that contain the same number of data points. It is labelled the 0.5 quantile or the 50th percentile (as it splits the data into a lower 50% and an upper 50%).

Here the 0.5 quantile is 16.53, which is the median.

# np.linspace(0, 1, 5) gives [0, 0.25, 0.5, 0.75, 1], i.e. the quartile cut points
quartiles_co2_emission = np.quantile(food['co2_emission'], np.linspace(0, 1, 5))
print(quartiles_co2_emission)

Interquartile range:

The interquartile range (IQR) is the range of the middle 50% of data values. To find the IQR for a given dataset, we can calculate 3rd quartile – 1st quartile.

IQR = Q3 - Q1
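As a minimal sketch of this formula, the snippet below computes the IQR with np.quantile on a small made-up array (an assumption for illustration, not the course data).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15])

# 1st quartile (0.25 quantile) and 3rd quartile (0.75 quantile)
q1, q3 = np.quantile(values, [0.25, 0.75])

# Interquartile range: the width of the middle 50% of the data
iqr = q3 - q1
print(q1, q3, iqr)
```

The same two-step calculation should carry over to a column such as food['co2_emission'] by passing it to np.quantile in place of values.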