SciPy and Stats, Certification Prep.
Here you'll find a collection of terms and items you'll want to look over and know inside out before you take the Data Scientist certification exam. As we used to say in the USAF: Read and Heed! You won't be disappointed.
# Simulated confounding example: temperature drives both ice cream sales and drownings
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Set random seed for reproducibility
np.random.seed(42)
# Generate data
n = 100
temperature = np.random.uniform(20, 35, n) # Confounding variable
ice_cream_sales = 50 + 2 * temperature + np.random.normal(0, 5, n) # Variable A
drownings = 10 + 0.5 * temperature + np.random.normal(0, 2, n) # Variable B
# Create a DataFrame
data = pd.DataFrame({
'Temperature': temperature,
'Ice_Cream_Sales': ice_cream_sales,
'Drownings': drownings
})
# Plot the relationships
plt.figure(figsize=(10, 5))
# Scatter plot of Ice Cream Sales vs. Drownings
plt.subplot(1, 2, 1)
sns.scatterplot(x='Ice_Cream_Sales', y='Drownings', data=data)
plt.title('Ice Cream Sales vs. Drownings')
plt.xlabel('Ice Cream Sales')
plt.ylabel('Drownings')
# Scatter plot of Temperature vs. both variables
plt.subplot(1, 2, 2)
sns.scatterplot(x='Temperature', y='Ice_Cream_Sales', data=data, label='Ice Cream Sales')
sns.scatterplot(x='Temperature', y='Drownings', data=data, label='Drownings')
plt.title('Temperature vs. Ice Cream Sales and Drownings')
plt.xlabel('Temperature')
plt.ylabel('Values')
plt.legend()
plt.tight_layout()
plt.show()
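One way to check that temperature accounts for the association is to compare the raw correlation between the two variables with the correlation that remains after temperature's linear effect has been removed from each. The following is a minimal sketch under those assumptions: the helper name residualize is hypothetical, and the data is regenerated with the same seed and formulas as the cell above.

```python
import numpy as np
from scipy.stats import pearsonr

# Regenerate the simulated data exactly as above
np.random.seed(42)
n = 100
temperature = np.random.uniform(20, 35, n)
ice_cream_sales = 50 + 2 * temperature + np.random.normal(0, 5, n)
drownings = 10 + 0.5 * temperature + np.random.normal(0, 2, n)

def residualize(y, x):
    """Return y minus its least-squares straight-line fit on x (hypothetical helper)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation between the two outcome variables
r_raw, _ = pearsonr(ice_cream_sales, drownings)

# Correlation after removing the linear effect of temperature from both
r_partial, _ = pearsonr(
    residualize(ice_cream_sales, temperature),
    residualize(drownings, temperature),
)

print(f"Raw correlation:                  {r_raw:.3f}")
print(f"Correlation controlling for temp: {r_partial:.3f}")
```

The raw correlation should be clearly positive, while the residual correlation should sit near zero, which is what you would expect when a third variable drives both.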
Independent Samples t-test
If you have two independent groups and you want to compare their means, you can use ttest_ind.
import numpy as np
from scipy.stats import ttest_ind
# Example data
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([11, 13, 15, 17, 19])
# Perform the t-test
t_stat, p_value = ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
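To see where the statistic comes from, it may help to compute the pooled-variance t by hand and compare it with ttest_ind. This sketch reuses the example data above; the variable names are illustrative. It also runs Welch's version (equal_var=False), which does not assume equal variances in the two groups.

```python
import numpy as np
from scipy.stats import ttest_ind

group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([11, 13, 15, 17, 19])

# Pooled-variance t-statistic computed by hand
n1, n2 = len(group1), len(group2)
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# SciPy: the default assumes equal variances; equal_var=False gives Welch's t-test
t_pooled, p_pooled = ttest_ind(group1, group2)
t_welch, p_welch = ttest_ind(group1, group2, equal_var=False)

print("manual t:", t_manual)
print("pooled t:", t_pooled, "p:", p_pooled)
print("Welch t: ", t_welch, "p:", p_welch)
```

Both groups here have the same size and the same sample variance (10), so the pooled and Welch statistics coincide at about -0.5. With unequal variances or group sizes the two generally differ.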
Paired Samples t-test
If you have paired data (e.g., measurements before and after a treatment on the same subjects), you can use ttest_rel.
import numpy as np
from scipy.stats import ttest_rel
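# Example data: the before/after differences vary from subject to subject.
# If every difference were identical, their standard deviation would be zero
# and the t-statistic would be infinite.
before_treatment = np.array([10, 12, 14, 16, 18])
after_treatment = np.array([11, 14, 15, 19, 19])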
# Perform the paired t-test
t_stat, p_value = ttest_rel(before_treatment, after_treatment)
print("t-statistic:", t_stat)
print("p-value:", p_value)
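A paired t-test is equivalent to a one-sample t-test on the per-subject differences against a mean of zero. The sketch below illustrates that equivalence; the after values are made up for illustration and chosen so that the differences are not all identical.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

before = np.array([10, 12, 14, 16, 18])
after = np.array([11, 14, 15, 19, 19])  # hypothetical values with varying differences

# Paired test on the two measurement series
t_paired, p_paired = ttest_rel(before, after)

# One-sample test on the differences against a mean of 0
t_diff, p_diff = ttest_1samp(before - after, popmean=0)

print("paired:      t =", t_paired, "p =", p_paired)
print("differences: t =", t_diff, "p =", p_diff)
```

The two calls should report the same statistic and p-value, which is a useful way to remember what "paired" means: only the within-subject differences matter.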
One-Sample t-test
If you have a single sample and you want to test whether its mean differs from a known value, you can use ttest_1samp.
import numpy as np
from scipy.stats import ttest_1samp
# Example data
data = np.array([10, 12, 14, 16, 18])
# Perform the one-sample t-test
t_stat, p_value = ttest_1samp(data, popmean=15)
print("t-statistic:", t_stat)
print("p-value:", p_value)
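The one-sample result can be cross-checked with a hand-computed statistic and a 95% confidence interval for the mean. This is a minimal sketch using the same example data; the interval is built from the t distribution's critical value rather than from any convenience method.

```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 14, 16, 18])
popmean = 15

n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# t-statistic by hand
t_manual = (mean - popmean) / sem

# 95% confidence interval for the mean, using n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

t_scipy, p_value = stats.ttest_1samp(data, popmean=popmean)

print("manual t:", t_manual, "scipy t:", t_scipy)
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Because the hypothesized mean of 15 falls inside the interval, the two-sided test does not reject at the 5% level, which matches a p-value above 0.05.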
To compute the binomial probability mass function (PMF) in Python, you can use the binom distribution from the scipy.stats library. The PMF gives the probability of obtaining exactly k successes in n independent Bernoulli trials, each with success probability p.
Here's how you can calculate and plot the binomial PMF:
Calculating Binomial PMF:
from scipy.stats import binom
# Parameters
n = 10 # Number of trials
p = 0.5 # Probability of success
k = 5 # Number of successes
# Calculate the PMF
pmf = binom.pmf(k, n, p)
print(f"The probability of getting exactly {k} successes in {n} trials is {pmf:.4f}")
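The value returned by binom.pmf can be checked against the closed-form expression P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). A short sketch using the same parameters and the standard library's math.comb:

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.5, 5

# Closed-form binomial probability: C(n, k) * p^k * (1 - p)^(n - k)
pmf_manual = comb(n, k) * p**k * (1 - p) ** (n - k)
pmf_scipy = binom.pmf(k, n, p)

print(f"manual: {pmf_manual:.4f}")
print(f"scipy:  {pmf_scipy:.4f}")
```

Here C(10, 5) = 252 and 0.5^10 = 1/1024, so both values come out to 252/1024, or about 0.2461.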
Plotting Binomial PMF:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom
# Parameters
n = 10 # Number of trials
p = 0.5 # Probability of success
# Values of k
k_values = np.arange(0, n+1)
# Calculate PMF for each k
pmf_values = binom.pmf(k_values, n, p)
# Plot the PMF
# use_line_collection was removed in Matplotlib 3.8 and is no longer needed
plt.stem(k_values, pmf_values)
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF')
plt.show()
Example with Different Parameters
Here's another example with different parameters:
# Parameters
n = 20 # Number of trials
p = 0.3 # Probability of success
k = 7 # Number of successes
# Calculate the PMF
pmf = binom.pmf(k, n, p)
print(f"The probability of getting exactly {k} successes in {n} trials is {pmf:.4f}")
# Plotting the PMF for a range of k values
k_values = np.arange(0, n+1)
pmf_values = binom.pmf(k_values, n, p)
plt.stem(k_values, pmf_values)
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF with n=20 and p=0.3')
plt.show()
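A few sanity checks on the PMF are worth knowing for the exam: the probabilities sum to 1, the mean equals n * p, and cumulative probabilities come from binom.cdf (at most k) and binom.sf (more than k). A minimal sketch with the same n = 20 and p = 0.3:

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3
k_values = np.arange(0, n + 1)
pmf_values = binom.pmf(k_values, n, p)

# The PMF over all possible k sums to 1
total = pmf_values.sum()

# The mean computed from the PMF equals n * p
mean_from_pmf = (k_values * pmf_values).sum()

# Cumulative probabilities
at_most_7 = binom.cdf(7, n, p)    # P(X <= 7)
more_than_7 = binom.sf(7, n, p)   # P(X > 7) = 1 - P(X <= 7)

print(f"Sum of PMF: {total:.4f}")
print(f"Mean: {mean_from_pmf:.4f} (n * p = {n * p})")
print(f"P(X <= 7) = {at_most_7:.4f}, P(X > 7) = {more_than_7:.4f}")
```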
To compute and visualize the Poisson probability mass function (PMF) in Python, you can use the poisson distribution from the scipy.stats library. The Poisson distribution is discrete, so it has a PMF rather than a probability density function (PDF). The PMF gives the probability of a given number of events occurring in a fixed interval of time or space.
Calculating Poisson PMF:
Here's how you can calculate the Poisson PMF for a specific value: