Course Notes: SciPy and Stats

SciPy and Stats, Certification Prep.

Here you'll find a collection of terms and items to review and know inside out before you take the Data Scientist certification exam. As we used to say in the USAF: Read and Heed! You won't be disappointed.

# Simulate a confounding variable: temperature drives both ice cream sales and drownings
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Generate data
n = 100
temperature = np.random.uniform(20, 35, n)  # Confounding variable
ice_cream_sales = 50 + 2 * temperature + np.random.normal(0, 5, n)  # Variable A
drownings = 10 + 0.5 * temperature + np.random.normal(0, 2, n)  # Variable B

# Create a DataFrame
data = pd.DataFrame({
    'Temperature': temperature,
    'Ice_Cream_Sales': ice_cream_sales,
    'Drownings': drownings
})

# Plot the relationships
plt.figure(figsize=(10, 5))

# Scatter plot of Ice Cream Sales vs. Drownings
plt.subplot(1, 2, 1)
sns.scatterplot(x='Ice_Cream_Sales', y='Drownings', data=data)
plt.title('Ice Cream Sales vs. Drownings')
plt.xlabel('Ice Cream Sales')
plt.ylabel('Drownings')

# Scatter plot of Temperature vs. both variables
plt.subplot(1, 2, 2)
sns.scatterplot(x='Temperature', y='Ice_Cream_Sales', data=data, label='Ice Cream Sales')
sns.scatterplot(x='Temperature', y='Drownings', data=data, label='Drownings')
plt.title('Temperature vs. Ice Cream Sales and Drownings')
plt.xlabel('Temperature')
plt.ylabel('Values')
plt.legend()

plt.tight_layout()
plt.show()
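
In the simulation, ice cream sales and drownings are both generated from temperature, so they correlate without one causing the other. A rough way to check this numerically is to compare the raw correlation with the correlation that remains after the linear effect of temperature is removed from both variables. The sketch below regenerates data with the same structure using its own random generator, so its values will not match the plot exactly; the helper name residuals_after is an illustrative assumption, not part of any library.

```python
import numpy as np
from scipy.stats import pearsonr

# Regenerate data with the same structure as the simulation
# (assumption: a fresh Generator is used, so exact values differ)
rng = np.random.default_rng(42)
n = 100
temperature = rng.uniform(20, 35, n)                          # confounder
ice_cream_sales = 50 + 2 * temperature + rng.normal(0, 5, n)
drownings = 10 + 0.5 * temperature + rng.normal(0, 2, n)

def residuals_after(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation between the two outcome variables, ignoring temperature
raw_r, _ = pearsonr(ice_cream_sales, drownings)

# Correlation once the linear effect of temperature is removed from both
partial_r, _ = pearsonr(
    residuals_after(ice_cream_sales, temperature),
    residuals_after(drownings, temperature),
)

print(f"Raw correlation:                  {raw_r:.3f}")
print(f"Correlation given temperature:    {partial_r:.3f}")
```

If temperature really is the only link, the second number should sit much closer to zero than the first.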

Independent Samples t-test

If you have two independent groups and you want to compare their means, you can use ttest_ind.

import numpy as np
from scipy.stats import ttest_ind

# Example data
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([11, 13, 15, 17, 19])

# Perform the t-test
t_stat, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_stat)
print("p-value:", p_value)
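
By default, ttest_ind assumes both groups share the same variance and uses the pooled-variance formula. As a sketch, the statistic can be reproduced by hand and compared with Welch's version (equal_var=False), which drops the equal-variance assumption. The variable names below are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([11, 13, 15, 17, 19])
n1, n2 = len(group1), len(group2)

# Pooled sample variance (ddof=1 gives the unbiased sample variance)
pooled_var = (
    (n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)
) / (n1 + n2 - 2)

# t = difference in means / standard error of that difference
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Default: equal variances assumed (Student's t-test)
t_pooled, p_pooled = ttest_ind(group1, group2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group1, group2, equal_var=False)

print(f"manual t: {t_manual:.4f}")   # manual t: -0.5000
print(f"pooled t: {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch  t: {t_welch:.4f}, p = {p_welch:.4f}")
```

Here the two groups have identical sizes and variances, so the pooled and Welch results coincide; with unequal variances they generally differ.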

Paired Samples t-test

If you have paired data (e.g., measurements before and after a treatment on the same subjects), you can use ttest_rel.

import numpy as np
from scipy.stats import ttest_rel

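# Example data (the differences vary between subjects; if every difference
# were identical, their standard deviation would be zero and the
# t-statistic would be degenerate)
before_treatment = np.array([10, 12, 14, 16, 18])
after_treatment = np.array([11, 14, 15, 18, 19])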

# Perform the paired t-test
t_stat, p_value = ttest_rel(before_treatment, after_treatment)

print("t-statistic:", t_stat)
print("p-value:", p_value)
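
A paired t-test is equivalent to a one-sample t-test on the per-subject differences, tested against a mean of zero. The sketch below checks that equivalence on hypothetical measurements chosen so that the differences are not all identical.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical measurements; the differences vary from subject to subject
before = np.array([10, 12, 14, 16, 18])
after = np.array([11, 14, 15, 18, 19])

# Paired test on the two related samples
t_rel, p_rel = ttest_rel(before, after)

# One-sample test on the differences against a mean of 0
differences = before - after
t_one, p_one = ttest_1samp(differences, popmean=0)

print(f"paired:     t = {t_rel:.4f}, p = {p_rel:.4f}")
print(f"one-sample: t = {t_one:.4f}, p = {p_one:.4f}")
```

Both lines should report the same statistic and p-value.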

One-Sample t-test

If you have a single sample and you want to test whether its mean differs from a known value, you can use ttest_1samp.

import numpy as np
from scipy.stats import ttest_1samp

# Example data
data = np.array([10, 12, 14, 16, 18])

# Perform the one-sample t-test
t_stat, p_value = ttest_1samp(data, popmean=15)

print("t-statistic:", t_stat)
print("p-value:", p_value)
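
To see where the p-value comes from, the statistic can be computed by hand and looked up in a t distribution with n - 1 degrees of freedom. This is a minimal sketch of that calculation; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

data = np.array([10, 12, 14, 16, 18])
popmean = 15
n = len(data)

# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = data.std(ddof=1) / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (data.mean() - popmean) / standard_error

# Two-sided p-value: probability of a statistic at least this extreme
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = ttest_1samp(data, popmean=popmean)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```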

To compute the binomial probability mass function (PMF) in Python, you can use the binom distribution object from scipy.stats. The PMF gives the probability of obtaining exactly k successes in n independent Bernoulli trials, each with success probability p.

Here's how you can calculate and plot the binomial PMF:

Calculating Binomial PMF:

from scipy.stats import binom

# Parameters
n = 10    # Number of trials
p = 0.5   # Probability of success
k = 5     # Number of successes

# Calculate the PMF
pmf = binom.pmf(k, n, p)

print(f"The probability of getting exactly {k} successes in {n} trials is {pmf:.4f}")
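
The binomial PMF has the closed form P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), where C(n, k) counts the ways to choose which k of the n trials succeed. As a quick check, the sketch below evaluates that formula with the standard library and compares it with binom.pmf.

```python
from math import comb, isclose
from scipy.stats import binom

n, p, k = 10, 0.5, 5

# Closed-form PMF: C(n, k) * p^k * (1 - p)^(n - k)
pmf_manual = comb(n, k) * p**k * (1 - p) ** (n - k)
pmf_scipy = binom.pmf(k, n, p)

print(f"manual: {pmf_manual:.4f}")   # manual: 0.2461
print(f"scipy:  {pmf_scipy:.4f}")
print("match:", isclose(pmf_manual, pmf_scipy))
```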

Plotting Binomial PMF:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

# Parameters
n = 10    # Number of trials
p = 0.5   # Probability of success

# Values of k
k_values = np.arange(0, n+1)

# Calculate PMF for each k
pmf_values = binom.pmf(k_values, n, p)

# Plot the PMF (the use_line_collection argument is deprecated and has been
# removed in recent Matplotlib releases, so it is omitted here)
plt.stem(k_values, pmf_values)
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF')
plt.show()

Example with Different Parameters

Here's another example with different parameters:

# Parameters
n = 20    # Number of trials
p = 0.3   # Probability of success
k = 7     # Number of successes

# Calculate the PMF
pmf = binom.pmf(k, n, p)

print(f"The probability of getting exactly {k} successes in {n} trials is {pmf:.4f}")

# Plotting the PMF for a range of k values
k_values = np.arange(0, n+1)
pmf_values = binom.pmf(k_values, n, p)

plt.stem(k_values, pmf_values)
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF with n=20 and p=0.3')
plt.show()
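
A few sanity checks help confirm that an array of PMF values behaves like a probability distribution: the values should sum to 1, the probability-weighted average of k should equal n * p, and the running sum should match the CDF. The sketch below runs these checks for n = 20 and p = 0.3; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3
k_values = np.arange(0, n + 1)
pmf_values = binom.pmf(k_values, n, p)

# Probabilities over all possible outcomes should sum to 1
total = pmf_values.sum()

# Expected value computed from the PMF; theory says n * p
mean_from_pmf = (k_values * pmf_values).sum()

# Cumulative sum of the PMF gives P(X <= k), i.e. the CDF
cdf_from_pmf = np.cumsum(pmf_values)

print(f"sum of PMF:       {total:.6f}")
print(f"mean from PMF:    {mean_from_pmf:.4f} (n * p = {n * p:.4f})")
print(f"P(X <= 7):        {cdf_from_pmf[7]:.4f}")
print("matches binom.cdf:", np.allclose(cdf_from_pmf, binom.cdf(k_values, n, p)))
```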

To compute and visualize the Poisson probability mass function (PMF) in Python, you can use the poisson distribution object from scipy.stats. Because the Poisson distribution is discrete, it has a PMF rather than a probability density function (PDF). The PMF gives the probability of a given number of events occurring in a fixed interval of time or space.

Calculating Poisson PMF:

Here's how you can calculate the Poisson PMF for a specific value: