Sampling in Python

Run the hidden code cell below to import the data used in this course.

Chapter 1

# Importing pandas, NumPy, matplotlib and random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course datasets
attrition = pd.read_feather("datasets/attrition.feather")
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")
coffee = pd.read_feather("datasets/coffee_ratings_full.feather")

Sampling

Sampling is a way to work with a small portion of a whole and use statistics to generalize to the whole.

Population: the entire dataset. We may not know its size.
Sample: a subset of the population.

Sample:

.sample returns n randomly selected rows from the dataset

### Example ###
coffee_samp = coffee.sample(n = 7)
#print(coffee_samp)

### Example ###
# It works on a single column as well, but the result is a Series, so it has no column header.
coffee_samp_col = coffee['total_cup_points'].sample(n = 7)
#print(coffee_samp_col)
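
A self-contained sketch of the same two calls. The `toy` DataFrame and its values are made up to stand in for the course data. `random_state` is an optional argument of `.sample` that makes the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the coffee dataset (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "total_cup_points": [82.1, 84.5, 79.9, 88.0, 83.3, 85.7, 81.2, 86.4, 80.5, 87.9],
})

# Sampling rows from the whole DataFrame returns a DataFrame
toy_samp = toy.sample(n=3, random_state=42)

# Sampling a single column returns a Series (values plus their row index)
toy_samp_col = toy["total_cup_points"].sample(n=3, random_state=42)

print(type(toy_samp).__name__)      # DataFrame
print(type(toy_samp_col).__name__)  # Series
```

By default .sample draws without replacement, so no row appears twice in the result.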

Population Parameter

A calculation made on the entire dataset.

### Example ###
#print(coffee['total_cup_points'].describe())
mean_coffee_pop = np.mean(coffee['total_cup_points'])
print(mean_coffee_pop)

Point Estimate

A calculation made on the sample. Also called a sample statistic.

### Example ###
mean_coffee_point = np.mean(coffee_samp_col)
print(mean_coffee_point)
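
To compare the two without the course files, here is a minimal sketch on a synthetic population. The `population` Series and its mean of 82 and standard deviation of 3 are assumptions chosen only for illustration. It runs the same calculation on the whole dataset and on a sample.

```python
import numpy as np
import pandas as pd

# Synthetic population (assumed values, not the course data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=82, scale=3, size=1000),
                       name="total_cup_points")

# Calculation on the entire dataset
pop_mean = population.mean()

# Same calculation on a random sample of 50 rows
samp_mean = population.sample(n=50, random_state=1).mean()

print(pop_mean, samp_mean)
```

The two means should be close but not identical. The gap usually shrinks as the sample size grows.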

Sample Bias

Sample bias occurs when the selected sample is not representative of the population.

Convenience sampling: collecting the sample by the easiest method. It can cause sample bias.

A histogram lets you see the impact of convenience sampling:

We define the ranges of the histogram with the bins argument.

### Example ###
#histogram to check the sampling
coffee['total_cup_points'].hist(bins = np.arange(80, 100, 5))
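
A sketch of how convenience sampling can bias a result, on made-up data. The `scores` Series is synthetic and sorted from high to low, so taking the first rows with .head() is the "easiest method". `np.histogram` is used here in place of a plot so the bin counts can be printed.

```python
import numpy as np
import pandas as pd

# Synthetic scores sorted from high to low (assumed values)
rng = np.random.default_rng(0)
scores = pd.Series(np.sort(rng.normal(loc=82, scale=3, size=500))[::-1])

# Convenience sample: just take the first 20 rows
convenience = scores.head(20)

# Random sample of the same size
random_samp = scores.sample(n=20, random_state=1)

print(scores.mean(), convenience.mean(), random_samp.mean())

# Same bins for both samples so the counts are comparable
bins = np.arange(70, 100, 5)  # edges 70, 75, ..., 95
conv_counts, _ = np.histogram(convenience, bins=bins)
rand_counts, _ = np.histogram(random_samp, bins=bins)
print(conv_counts)
print(rand_counts)
```

The convenience sample only contains the highest scores, so its mean sits well above the population mean. The random sample lands much closer.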

Pseudo Random Number Generation

Pseudo-random number generation is a deterministic calculation that is cheap and fast.

The numbers appear to be random, but each value is calculated from the previous one.

The first random number calculated is from a "seed" value.
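
To make the idea concrete, here is a toy linear congruential generator. It is only an illustration of "next value from previous value": NumPy's np.random uses a different algorithm. The function name `lcg` is hypothetical and the constants are one commonly cited choice.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudo-random floats in [0, 1) from a toy generator."""
    values = []
    state = seed  # the seed is the starting state
    for _ in range(n):
        # The next state depends only on the previous state
        state = (a * state + c) % m
        values.append(state / m)
    return values

first = lcg(seed=123, n=5)
second = lcg(seed=123, n=5)
other = lcg(seed=456, n=5)

print(first == second)  # True: same seed, same sequence
print(first == other)   # False: different seed, different sequence
```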

To generate the numbers we use the functions in np.random, one per distribution.

The first arguments specify the distribution parameters. size is how many numbers to generate:

For beta, the parameters are a and b.
For normal, loc is the mean and scale is the standard deviation.

### Example ###
random_num = np.random.beta(a = 3, b = 2, size = 150)
# Bin edges at 0, 0.2, ..., 1.0 cover the full range of the beta distribution
plt.hist(random_num, bins = np.linspace(0, 1, 6))
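
A short sketch of the normal case next to the beta case, checking that the generated numbers match the parameters passed in. The parameter values and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# Normal: loc is the mean, scale is the standard deviation
normal_nums = np.random.normal(loc=2, scale=1.5, size=5000)
print(normal_nums.mean(), normal_nums.std())

# Beta: a and b are the shape parameters; the theoretical mean is a / (a + b)
beta_nums = np.random.beta(a=3, b=2, size=5000)
print(beta_nums.mean())

# Five equal-width bins from 0 to 1 capture every beta value
counts, edges = np.histogram(beta_nums, bins=np.linspace(0, 1, 6))
print(counts)
```

With 5000 draws, the sample mean and standard deviation should be close to loc and scale, and the beta mean close to 0.6.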

Generating Seed

np.random.seed(number)

The seed sets the starting point of the generator. If we use the same seed and run the same steps, we get the same numbers. If we use a different seed, the numbers change.
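
A minimal check of that behaviour. The seed values 2024 and 7 are arbitrary.

```python
import numpy as np

np.random.seed(2024)
first = np.random.normal(loc=0, scale=1, size=3)

# Resetting the same seed and repeating the same step gives the same numbers
np.random.seed(2024)
second = np.random.normal(loc=0, scale=1, size=3)

# A different seed gives different numbers
np.random.seed(7)
third = np.random.normal(loc=0, scale=1, size=3)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```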

