Introduction to Statistics in R

Run the hidden code cell below to import the data used in this course.

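# Quantiles at custom cut points: pass a vector of probabilities to probs
# (data$variable is a placeholder for any numeric column)
quantile(data$variable, probs = seq(0, 1, 0.2))
seq(0, 1, 0.2) # From 0 to 1 in steps of 0.2: 0.0 0.2 0.4 0.6 0.8 1.0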

# Calculate variance and sd of co2_emission for each food_category
library(dplyr) # provides %>%, group_by() and summarize()

food_consumption %>%
  group_by(food_category) %>%
  summarize(var_co2 = var(co2_emission),
            sd_co2 = sd(co2_emission))

# Create subgraphs for each food_category: histogram of co2_emission
library(ggplot2)

ggplot(food_consumption, aes(co2_emission)) +
  # Create a histogram
  geom_histogram() +
  # Create a separate sub-graph for each food_category
  facet_wrap(~ food_category)

A probability distribution describes the probability of each possible outcome in a scenario. The expected value is the mean of the probability distribution.
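As a small illustration of the expected value, the sketch below uses a fair six-sided die. The die and the names `die`, `expected_value` and `rolls` are hypothetical and not part of the course data. It compares the theoretical mean with the mean of simulated rolls.

```r
# Hypothetical example: a fair six-sided die (not from the course data)
die <- data.frame(
  outcome = 1:6,
  prob = rep(1 / 6, 6)
)

# Expected value = sum of each outcome weighted by its probability
expected_value <- sum(die$outcome * die$prob)
expected_value

# The mean of many simulated rolls should land close to the expected value
set.seed(1)
rolls <- sample(die$outcome, size = 10000, replace = TRUE, prob = die$prob)
mean(rolls)
```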

# Cumulative distribution functions: each returns P(X <= q)
# punif()  uniform
# pnorm()  normal
# pbinom() binomial
# ppois()  Poisson
# pexp()   exponential

# Min and max wait times for a back-up that happens every 30 min
min <- 0
max <- 30

# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 <- punif(20, min, max) - punif(10, min, max)
prob_between_10_and_20
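
The same probability can be checked by hand, because a continuous uniform distribution spreads probability evenly over its range. The sketch below assumes the same 0 to 30 minute interval. The names `lower`, `upper`, `analytic` and `from_cdf` are illustrative only.

```r
# Assumed bounds, matching the 0-30 minute back-up interval
lower <- 0
upper <- 30

# For a continuous uniform distribution:
# P(a <= X <= b) = (b - a) / (upper - lower)
analytic <- (20 - 10) / (upper - lower)

# Same probability as a difference of two cumulative probabilities
from_cdf <- punif(20, min = lower, max = upper) -
  punif(10, min = lower, max = upper)

# Both approaches should agree (one third of the interval)
all.equal(analytic, from_cdf)
```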

# Set random seed to 334
set.seed(334)

# Generate 1000 wait times between 0 and 30 mins, save in time column
wait_times %>%
  mutate(time = runif(1000, min = 0, max = 30)) %>%
  # Create a histogram of simulated times
  ggplot(aes(time)) +
  geom_histogram(bins = 30)

# Probability of deal < 7500, for deal amounts with mean 5000 and sd 2000
pnorm(7500, mean = 5000, sd = 2000, lower.tail = TRUE)

# Calculate new average amount
new_mean <- 5000 * 1.2

# Calculate new standard deviation
new_sd <- 2000 * 1.3

# Simulate 36 sales
new_sales <- new_sales %>% 
  mutate(amount = rnorm(36, new_mean, new_sd))
 
# Create histogram with 10 bins
ggplot(new_sales, aes(amount)) +
  geom_histogram(bins = 10) # 10 bars

# Take 30 samples of 20 values of num_users, take mean of each sample
sample_means <- replicate(30, sample(all_deals$num_users, 20) %>% mean())

The central limit theorem states that the sampling distribution of a sample statistic such as the mean approaches a normal distribution as the size of each sample grows, no matter the shape of the original distribution being sampled from, provided the samples are random and independent.
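
A short simulation can make the theorem concrete. The sketch below does not use the course data: it assumes a hypothetical, strongly skewed population drawn from an exponential distribution, and the helper `sample_means_of_size()` is made up for illustration. It compares the distribution of sample means for small and larger samples.

```r
set.seed(42)

# Hypothetical skewed population: exponential with mean 10
population <- rexp(100000, rate = 1 / 10)

# Draw `reps` samples of size n and return the mean of each one
sample_means_of_size <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means_n5 <- sample_means_of_size(5)
means_n50 <- sample_means_of_size(50)

# Both sets of means centre on the population mean
c(mean(population), mean(means_n5), mean(means_n50))

# The larger sample size gives a narrower spread of sample means
c(sd(means_n5), sd(means_n50))

# Histograms (base graphics) to compare the two shapes
hist(means_n5, breaks = 30, main = "Sample means, n = 5")
hist(means_n50, breaks = 30, main = "Sample means, n = 50")
```

With samples of 5 the histogram of means should still show some of the population's right skew, while with samples of 50 it should look much closer to a bell curve.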