## Introduction to Statistics in R


```
# Load dplyr (pipes, group_by, summarize) and ggplot2 (plots)
library(dplyr)
library(ggplot2)

# Custom quantiles: pass a vector of probabilities to probs, e.g.
# quantile(data$variable, probs = seq(from, to, by))
seq(0, 1, 0.2) # From 0 to 1 in steps of 0.2
quantile(food_consumption$co2_emission, probs = seq(0, 1, 0.2)) # quintiles

# Calculate variance and sd of co2_emission for each food_category
food_consumption %>%
  group_by(food_category) %>%
  summarize(var_co2 = var(co2_emission),
            sd_co2 = sd(co2_emission))

# Create subgraphs for each food_category: histogram of co2_emission
ggplot(food_consumption, aes(co2_emission)) +
  # Create a histogram
  geom_histogram() +
  # Create a separate sub-graph for each food_category
  facet_wrap(~ food_category)
```

A probability distribution describes the probability of each possible outcome in a scenario. The expected value is the mean of the probability distribution, that is, the sum of each outcome weighted by its probability.
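
As a small illustration of the expected value, the sketch below uses a fair six-sided die. The die is an assumed example and is not part of the course data; the variable names are made up for this note.

```r
# Hypothetical example: a fair six-sided die (not from the course data)
outcomes <- 1:6
probs <- rep(1 / 6, 6)  # each face is equally likely

# Probabilities of all outcomes must sum to 1
total_prob <- sum(probs)

# Expected value = sum of each outcome times its probability
expected_value <- sum(outcomes * probs)
expected_value
```

For a distribution with unequal probabilities, only the `probs` vector changes; the weighted sum stays the same.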

```
# Cumulative distribution functions: each returns P(X <= q) by default
punif  # uniform
pnorm  # normal
pbinom # binomial
ppois  # Poisson
pexp   # exponential
```
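
A short sketch of how these functions are typically called, using base R only. The parameter values are arbitrary choices for illustration. By default each call returns the lower tail, P(X <= q); `lower.tail = FALSE` returns P(X > q) instead.

```r
# P(X <= 0) for a standard normal: half the mass lies below the mean
p_norm <- pnorm(0, mean = 0, sd = 1)

# P(X <= 1) heads in 2 fair coin flips: P(0 heads) + P(1 head)
p_binom <- pbinom(1, size = 2, prob = 0.5)

# lower.tail = FALSE gives the upper tail, P(X > 1), for Exponential(rate = 1)
p_exp_upper <- pexp(1, rate = 1, lower.tail = FALSE)

# P(X > 20) for a uniform distribution on [0, 30]
p_unif_upper <- punif(20, min = 0, max = 30, lower.tail = FALSE)

c(p_norm, p_binom, p_exp_upper, p_unif_upper)
```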

```
# Min and max wait times (in minutes) for a backup that runs every 30 minutes
min <- 0
max <- 30
# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 <- punif(20, min, max) - punif(10, min, max)
prob_between_10_and_20
```

```
# Set random seed to 334 so the simulation is reproducible
set.seed(334)
# Generate 1000 wait times between 0 and 30 mins, save in time column
# (assumes wait_times has 1000 rows, one per simulated wait)
wait_times %>%
  mutate(time = runif(1000, min = 0, max = 30)) %>%
  # Create a histogram of simulated times
  ggplot(aes(time)) +
  geom_histogram(bins = 30)
```
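
One way to sanity-check a simulation like this is to compare the simulated share of waits between 10 and 20 minutes with the theoretical probability from `punif`. The sketch below uses base R only and does not need the course data; the sample size of 10000 is an arbitrary choice made here to reduce sampling noise.

```r
set.seed(334)

# Simulate wait times from the same uniform model, 0 to 30 minutes
simulated <- runif(10000, min = 0, max = 30)

# Share of simulated waits that fall between 10 and 20 minutes
simulated_prob <- mean(simulated >= 10 & simulated <= 20)

# Theoretical probability from the uniform CDF
theoretical_prob <- punif(20, min = 0, max = 30) - punif(10, min = 0, max = 30)

c(simulated = simulated_prob, theoretical = theoretical_prob)
```

The two values should be close but not identical, because the simulated share is itself a random quantity.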

```
# Probability of deal < 7500 (mean 5000, sd 2000)
pnorm(7500, mean = 5000, sd = 2000, lower.tail = TRUE)

# Calculate new average amount (20% increase)
new_mean <- 5000 * 1.2
# Calculate new standard deviation (30% increase)
new_sd <- 2000 * 1.3

# Simulate 36 sales (assumes new_sales has 36 rows)
new_sales <- new_sales %>%
  mutate(amount = rnorm(36, mean = new_mean, sd = new_sd))

# Create histogram with 10 bins
ggplot(new_sales, aes(amount)) +
  geom_histogram(bins = 10)

# Take 30 samples of 20 values of num_users, take mean of each sample
sample_means <- replicate(30, sample(all_deals$num_users, 20) %>% mean())
```

The central limit theorem states that the sampling distribution of a sample mean approaches the normal distribution as the sample size grows, no matter the shape of the original distribution being sampled from, provided the samples are random and independent.
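
The theorem can be illustrated with a small simulation. The sketch below is a minimal example under stated assumptions: it samples from an exponential distribution with rate 1 (a skewed distribution with mean 1 and standard deviation 1), and the seed, sample size, and number of samples are arbitrary choices made for this note rather than values from the course.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

sample_size <- 50    # values in each sample
n_samples <- 2000    # number of sample means to collect

# Each replicate draws one sample from Exponential(rate = 1) and keeps its mean
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# Centre and spread of the sampling distribution of the mean
mean_of_means <- mean(sample_means)
sd_of_means <- sd(sample_means)

# Theory: the spread should be close to sd / sqrt(sample_size)
theoretical_se <- 1 / sqrt(sample_size)

c(mean_of_means = mean_of_means,
  sd_of_means = sd_of_means,
  theoretical_se = theoretical_se)

# Uncomment to view the roughly bell-shaped histogram:
# hist(sample_means, breaks = 30)
```

Even though the exponential distribution is strongly right-skewed, the sample means cluster symmetrically around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.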