Sampling in Python
Run the hidden code cell below to import the data used in this course.
Chapter 1
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
# Importing the course datasets
attrition = pd.read_feather("datasets/attrition.feather")
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")
coffee = pd.read_feather("datasets/coffee_ratings_full.feather")
Sampling
Sampling is a way to work with a small portion of a whole and use statistics to generalize to the whole.
- Population: the entire dataset. We may not know its size.
- Sample: a subset of the population.
Sample:
.sample() returns n randomly selected rows from the dataset.
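As a minimal sketch of how .sample() behaves, the snippet below uses a small made-up DataFrame (the name `toy` and its values are hypothetical, not the course data). It also passes pandas' `random_state` argument, which is not used in these notes, only so that the draw is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the course data
toy = pd.DataFrame({
    "country": ["Brazil", "Kenya", "Peru", "India", "Mexico", "Ethiopia"],
    "total_cup_points": [82.5, 86.1, 83.0, 80.4, 84.7, 88.2],
})

# Draw 3 rows at random, without replacement (the default)
toy_samp = toy.sample(n=3, random_state=42)

# Sampling a single column returns a Series instead of a DataFrame
toy_samp_col = toy["total_cup_points"].sample(n=3, random_state=42)

print(toy_samp)
print(toy_samp_col)
```

Without `random_state`, each run would most likely return a different set of rows.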
### Example ###
coffee_samp = coffee.sample(n = 7)
#print(coffee_samp)
### Example ###
# It also works on a single column; the result is a Series, so it prints without a column header.
coffee_samp_col = coffee['total_cup_points'].sample(n = 7)
#print(coffee_samp_col)
Population Parameter
A calculation made on the entire dataset.
### Example ###
#print(coffee['total_cup_points'].describe())
mean_coffee_pop = np.mean(coffee['total_cup_points'])
print(mean_coffee_pop)
Point Estimate
A calculation made on the sample.
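To make the difference between the two calculations concrete, here is a small sketch on synthetic data. The `population` Series is made up for illustration (it is not one of the course datasets), and the seeded generator is only there so the numbers are repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1,000 scores (synthetic, not the course data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=82, scale=3, size=1000), name="score")

# Calculation on the entire dataset
pop_mean = population.mean()

# The same calculation on a random sample of 50 values
samp = population.sample(n=50, random_state=1)
samp_mean = samp.mean()

print(pop_mean, samp_mean)
```

The two means should be close but not identical; the sample mean changes from one sample to the next, while the population mean stays fixed.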
### Example ###
mean_coffee_point = np.mean(coffee_samp_col)
print(mean_coffee_point)
Sample Bias
Sample bias occurs when the selected sample is not representative of the population.
Convenience sampling = collecting samples by the easiest method. It can cause sample bias.
A histogram lets you see the impact of a convenience sample:
We define the ranges of the histogram with the bins argument.
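The sketch below illustrates the idea with synthetic scores, assuming the "easiest method" is simply taking the first rows with .head(). The data are sorted on purpose so that the first rows are unrepresentative; `scores` and the bin edges are hypothetical choices. It uses np.histogram to count values per bin, which takes bin edges in the same way as the bins argument of .hist().

```python
import numpy as np
import pandas as pd

# Hypothetical scores, sorted so the first rows hold the lowest values
rng = np.random.default_rng(0)
scores = pd.Series(np.sort(rng.normal(loc=82, scale=3, size=1000)))

# Convenience sample: just take the first 50 rows
convenience = scores.head(50)
# Random sample: every row has the same chance of being picked
random_samp = scores.sample(n=50, random_state=1)

# Bin edges from 70 to 94 in steps of 2 (arange excludes the stop value)
edges = np.arange(70, 96, 2)
conv_counts, _ = np.histogram(convenience, bins=edges)
rand_counts, _ = np.histogram(random_samp, bins=edges)

print("population mean: ", scores.mean())
print("convenience mean:", convenience.mean())
print("random mean:     ", random_samp.mean())
print(conv_counts)
print(rand_counts)
```

The convenience sample piles up in the lowest bins and its mean sits well below the population mean, while the random sample spreads across the bins much like the population does.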
### Example ###
#histogram to check the sampling
coffee['total_cup_points'].hist(bins = np.arange(80, 100, 5))
Pseudo Random Number Generation
Pseudo-random number generation is a cheap, deterministic calculation.
The numbers appear random, but each value is calculated from the previous one.
The first random number is calculated from a "seed" value.
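To show what "each value is calculated from the previous one" can look like, here is a toy linear congruential generator. This is only an illustration of the idea: the function name `lcg` is made up, the constants are one classic textbook choice, and NumPy itself uses a more sophisticated generator.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Toy linear congruential generator: returns n floats in [0, 1)."""
    values = []
    state = seed
    for _ in range(n):
        # Each new state is computed only from the previous state
        state = (a * state + c) % m
        values.append(state / m)
    return values

first_run = lcg(seed=123, n=5)
second_run = lcg(seed=123, n=5)
other_seed = lcg(seed=124, n=5)

print(first_run)
print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

Because the whole sequence is determined by the seed, rerunning with the same seed reproduces it exactly.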
To generate the numbers we use the functions in the np.random module.
The first arguments specify the distribution parameters; size is how many numbers to generate:
- For Beta: a and b
- For Normal: loc is the mean and scale is the standard deviation
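A short sketch of both calls side by side. The parameter values are arbitrary choices for illustration, and the seed is fixed only so the output is repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed only so the output is repeatable

# Normal: loc is the mean, scale is the standard deviation
normal_nums = np.random.normal(loc=2, scale=1.5, size=1000)

# Beta: a and b are the two shape parameters
beta_nums = np.random.beta(a=3, b=2, size=1000)

print(normal_nums.shape, normal_nums.mean(), normal_nums.std())
print(beta_nums.shape, beta_nums.min(), beta_nums.max())
```

The sample mean and standard deviation of the normal draws should land near loc and scale, and the beta draws always fall between 0 and 1.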
### Example ###
random_num = np.random.beta(a = 3, b = 2, size = 150)
plt.hist(random_num, bins = np.arange(0,1,0.2))
Generating Seed
np.random.seed(number)
Different seeds generate different random numbers. If we use the same seed and repeat the same steps, we get the same numbers; if we use another seed, the numbers change.
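A minimal check of this behaviour, using arbitrary seed values chosen for illustration:

```python
import numpy as np

np.random.seed(123)
first = np.random.normal(size=3)

np.random.seed(123)   # reset to the same seed
second = np.random.normal(size=3)

np.random.seed(456)   # a different seed
third = np.random.normal(size=3)

print(np.array_equal(first, second))
print(np.array_equal(first, third))
```

The first comparison should report that the arrays match, since the seed and the steps are identical; the second should report a mismatch, since the seed changed.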