Central Limit Theorem
The Central Limit Theorem says that if $X_1, X_2, \dots, X_n$ are independent samples from a distribution with mean $\mu$ and variance $\sigma^2$, then for large $n$ the sample mean $\bar{X}_n$ is approximately normal with mean $\mu$ and variance $\sigma^2/n$.
More precisely, the CLT says that the limit as $n \to \infty$ of the distribution of the standardized sample mean is the standard normal distribution; informally, for large $n$ the distribution of $\bar{X}_n$ is normal with that mean and variance.
- In practice, 'big $n$' means '$n \geq 30$' (the usual rule of thumb).
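In symbols, the precise statement is that the standardized sample mean converges in distribution to a standard normal:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{\,d\,}\; N(0,1) \qquad \text{as } n \to \infty.$$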
Let's try this out for a few different sampling distributions!
Initial setup
First I'll make some functions so that we can easily experiment with the central limit theorem under many different settings without rewriting the same code over and over.
sample_means() simulates running a series of experiments, calculating a sample mean at the end of each run:
- each run of the experiment collects n_samples data points and computes their sample mean
- this is repeated n_trials times, generating a total of n_trials sample means
density_plots() will compute sets of sample means for multiple values of n_samples and then plot nonparametric density estimates of the distributions for each n_samples value.
- distributions should all have roughly the same mean (the mean of the underlying distribution)
- higher values of n_samples should result in more 'Normal-looking' distributions
- higher values of n_samples should result in lower variance
quintile_plots() will compute sets of sample means for each of several n_samples values and then make a normal probability (quantile) plot for each
- higher values of n_samples should result in a more linear plot
- plots should look 'fairly linear' for n_samples $\geq 30$
########################################################################
#' Perform multiple trials sampling a random variable
#'
#' `sample_means()` simulates doing multiple experiment trials,
#' sampling a random variable multiple times
#' and computing the sample mean for each trial.
#'
#' @param n_samples number of samples for each trial
#' @param n_trials number of trials to perform
#' @param rdist random point distribution (default = rnorm)
#' @param ... parameters to pass to rdist
#'
#' @return vector of sample means of trials (length = n_trials)
#'
sample_means <- function( n_samples, n_trials, rdist = rnorm, ...) {
samples <- matrix(
rdist( n_samples * n_trials, ...), # n_samples * n_trials random numbers
nrow = n_trials # each trial of n_samples is a row
)
# return the sample means for each row (set of trial runs)
return( rowMeans(samples) )
}
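As a quick sanity check (a hypothetical example, not part of the original code): 1000 trials of 25 standard normal samples should produce means near 0 with standard deviation near $1/\sqrt{25} = 0.2$.

x_bar <- sample_means(n_samples = 25, n_trials = 1000) # rdist defaults to rnorm
c(mean = mean(x_bar), sd = sd(x_bar))                  # expect roughly (0, 0.2)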
########################################################################
#' Plot density estimates for multiple trial sizes
#'
#' @param samples vector of n_samples values to compare
#' @param ... other inputs are the same as for sample_means()
#'
density_plots <- function( samples, n_trials, rdist = rnorm, ...) {
samples <- rev(sort(samples))
# plot the trial with the most samples first (since it should have tallest density)
plot(
density(sample_means( samples[1], n_trials, rdist, ...)),
main=paste("Density Estimate.",n_trials, # plot title
"trials with",as.character(substitute(rdist))),
xlab=paste("#Samples =",paste(rev(samples),collapse=",")) # label x-axis
)
# now add the other trials onto the same plot
for (n in seq_along(samples)[-1] ) { # safe even if only one sample size is given
lines(
density(sample_means( samples[n], n_trials, rdist, ...)),
col=n # use different color for each sample size set
)
}
}
########################################################################
#' Check normal quantile plots for multiple trial sizes
#'
#' @param samples vector of n_samples values to compare
#' @param ... other inputs are the same as for sample_means()
#'
quintile_plots <- function( samples, n_trials, rdist = rnorm, ...) {
for (n_samples in samples) {
x_bar <- sample_means( n_samples, n_trials, rdist, ...) # run experiment
qqnorm( # make probability plot
x_bar,
main=paste(n_trials,"trials of size",n_samples), # plot title
xlab=paste("µ =",signif(mean(x_bar),2), # label x-axis
", σ =",signif(sd(x_bar),2)),
ylab="" # label y-axis
)
abline(
mean(x_bar), # line with y-intercept = mean
sd(x_bar) # and slope = std. dev
) # if the sample means are normal, the points should follow this line
}
}
CLT for the Normal Distribution
Let's try this for $X \sim \text{Normal}(5, 10)$ (random normal, with mean $\mu = 5$ and standard deviation $\sigma = 10$), with:
- sample means gathered from 5 samples
- sample means gathered from 10 samples
- sample means gathered from 15 samples
- sample means gathered from 20 samples
- sample means gathered from 25 samples
- sample means gathered from 30 samples
First I'll simulate a run in each setup and plot the resulting density estimates against each other.
Then I'll simulate a series of runs and draw their normal (quantile) probability plots.
# Try out the normal distribution
density_plots(
samples=c(5,10,15,20,25,30),
n_trials=100,
rnorm, 5, 10 # passed to rnorm: mean = 5, sd = 10
)
par(mfrow=c(2,3)) # split screen into 6 plots: 2 rows, 3 cols
quintile_plots(
samples=c(5,10,15,20,25,30),
n_trials=100,
rnorm, 5, 10
)
The quantile plots above should all look more or less like straight lines. The y-intercept of each line is the mean of the approximating normal distribution and the slope is approximately the standard deviation. The slopes should be about $10/\sqrt{n}$, shrinking as the number of samples per trial grows.
It shouldn't be a surprise that these look normal: a sum of independent normal random variables is exactly normal, so the sample mean of normal data is normal for every $n$, not just in the limit.
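We can confirm this numerically (a quick sketch using the functions above, not part of the original code): with 25 samples of Normal(5, 10) per trial, the sample means should have mean about 5 and standard deviation about $10/\sqrt{25} = 2$.

x_bar <- sample_means(25, 10000, rnorm, 5, 10)
c(mean = mean(x_bar), sd = sd(x_bar), theory_sd = 10/sqrt(25))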
CLT for the Lognormal Distribution
This is more interesting when we sample from a non-normal distribution. Here I'll use the standard lognormal distribution (rlnorm with its defaults, meanlog = 0 and sdlog = 1) and look at:
- n_samples = 1 (in this case the sample means are just lognormal data)
- n_samples = 10
- n_samples = 20
- n_samples = 30
- n_samples = 40
- n_samples = 50
# Try out the lognormal distribution
density_plots(
samples=c(1,10,20,30,40,50),
n_trials=100,
rlnorm
)
par(mfrow=c(2,3))
quintile_plots(
samples=c(1,10,20,30,40,50),
n_trials=100,
rlnorm
)
The lognormal distribution above should have mean $e^{1/2} \approx 1.65$ (the mean of a lognormal is $e^{\mu + \sigma^2/2}$), so the sample-mean distributions should all center there.
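As a rough check (hypothetical code, not in the original), the simulated sample means should cluster around the exact mean:

x_bar <- sample_means(50, 10000, rlnorm) # defaults: meanlog = 0, sdlog = 1
c(simulated = mean(x_bar), exact = exp(1/2))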
CLT for the Binomial Distribution
The central limit theorem applies to all distributions. Even discrete ones! Let's try this out with the discrete binomial distribution $\text{Binomial}(30, 1/10)$, which has mean $np = 3$.
We'll use a similar setup as before.
# Try out the binomial distribution
density_plots( c(1,10,20,30,40,50), 100,
rbinom, 30, 1/10 # passed to rbinom: size = 30, prob = 1/10
)
par(mfrow=c(2,3))
quintile_plots( c(1,10,20,30,40,50), 100,
rbinom, 30, 1/10
)
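Even though the underlying data is discrete, the trial means should approach a normal distribution (a quick numerical sketch, assuming the sample_means() helper above): with 50 binomial samples per trial, the standard deviation of the sample mean should be about $\sqrt{np(1-p)/50} = \sqrt{2.7/50} \approx 0.23$.

x_bar <- sample_means(50, 10000, rbinom, 30, 1/10)
c(mean = mean(x_bar), sd = sd(x_bar),
  theory_mean = 30 * (1/10), theory_sd = sqrt(30 * (1/10) * (9/10) / 50))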
Point Estimators
Point Estimators for Mean of Normal
There are many possible point estimators for the mean of a normal distribution. The following estimators are all unbiased (see the one-line examples after this list):
- sample mean
- sample median (assuming the distribution is symmetric, which holds for the normal)
- average of the sample max and min (again assuming symmetry)
- sample mean after "trimming" the largest and smallest values
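As a quick illustration (hypothetical code, not part of the original analysis), all four estimators can be computed for a single sample of 100 points; note that R's built-in mean(x, trim = 0.05) drops 5% of the points from each end, matching the trimming used later.

x <- rnorm(100, mean = 2, sd = 5) # one sample of size 100
mean(x)                           # sample mean
median(x)                         # sample median
(max(x) + min(x)) / 2             # midpoint of max and min
mean(x, trim = 0.05)              # trimmed mean (drops 5 from each end)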
Question: How can we decide which of these is best to use?
Use the point estimator with the smallest standard error!
For the sample mean, it is pretty easy to compute that the standard error is $\sigma/\sqrt{n}$.
For the sample median, it is possible to show that the standard error is approximately $1.2533\,\sigma/\sqrt{n}$ (the constant is $\sqrt{\pi/2}$).
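We can check the $1.2533\,\sigma/\sqrt{n}$ figure by brute force (a quick sketch, not part of the original code):

sigma <- 5; n <- 100
medians <- replicate(10000, median(rnorm(n, 2, sigma)))
c(simulated = sd(medians), theory = 1.2533 * sigma / sqrt(n))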
In general, though, computing the standard error can be difficult... in that case we can fall back on "bootstrapping" methods: compute many values of the point estimate and then find their sample standard deviation. Using this method below, we see that the sample mean is very good. The sample median is not bad (and may work better if there are extreme outliers). Averaging the max and min is terrible. The trimmed sample mean is about as good as the sample mean, and is better when the data has outliers. [Computing a mean is also faster than computing a median.]
mu<-2; sigma<-5; n<-100; num_groups<-5 # n samples of normal(mu,sigma)
library(matrixStats) # needed for rowSds (sample deviations of rows)
data <- matrix(rnorm(100*num_groups*n,mu,sigma), # 100*num_groups rows of n random samples each
nrow=100*num_groups)
###################################################
#' Helper function, used to make a table of bootstrap values
#'
#' @param A matrix of data to bootstrap, comparing elements across rows
#'
#' output is a table:
#' one column for each row of the input data
#' two rows: "mean" and "error"
bootstrap_table <- function(A) {
output <- t(data.frame(rowMeans(A),rowSds(A)))
rownames(output) <- c("mean","error")
return(output)
}
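For example (a hypothetical usage sketch), feeding bootstrap_table() a 5-row matrix of standard normal noise should report means near 0 and errors near 1:

A <- matrix(rnorm(5 * 100), nrow = 5)
bootstrap_table(A) # 5 columns; rows "mean" and "error"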
###################################################
## sample mean
# expected value = mu (unbiased)
# standard error = sigma/sqrt(n)
cat("-----------------------\nSAMPLE MEAN")
mean_estimates <- matrix(rowMeans(data), # renamed so we don't overwrite the sample_means() function
nrow=num_groups)
bootstrap_table(mean_estimates)
###################################################
## sample median
# expected value = mu (unbiased)
# standard error = 1.2533 sigma/sqrt(n)
cat("-----------------------\nSAMPLE MEDIAN")
sample_medians <- matrix(rowMedians(data),
nrow=num_groups)
bootstrap_table(sample_medians)
###################################################
## sample midpoint of max and min
# expected value = mu (unbiased)
# standard error = ... something messy
cat("-----------------------\nSAMPLE RANGE MIDPOINTS")
sample_midpoints <- matrix((rowMaxs(data) + rowMins(data))/2,
nrow=num_groups)
bootstrap_table(sample_midpoints)
###################################################
## trimmed sample mean
# expected value = mu (unbiased)
# standard error = ... about the same as the sample mean???
cat("-----------------------\nTRIMMED MEANS")
trim <- function(v) { # function to remove smallest and largest 5 elements
return( v[-order(v)[c( # sort and remove elements
1:5 , # smallest 5
(length(v)-4):length(v) # largest 5
)]] )
}
trimmed_means <- matrix(rowMeans(t(apply(data,1,trim))), # trim then compute mean
nrow=num_groups)
bootstrap_table(trimmed_means)
rm(data) # remove this variable to save memory.....
Mean vs Median Distributions
Let's look at two of these distributions a bit more closely. I'll increase the number of point estimates to $100 \times 10000 = 1{,}000{,}000$.
On the graph below, I'll plot density estimates for the sample mean and sample median for normal random data, using $n = 100$ samples for each estimate (with $\mu = 2$, $\sigma = 5$).
These should be graphs of roughly the following distributions:
- sample mean: approximately $N(2, 0.5^2)$, since $\sigma/\sqrt{n} = 5/\sqrt{100} = 0.5$
- sample median: approximately $N(2, 0.63^2)$, since $1.2533 \times 0.5 \approx 0.63$
mu<-2; sigma<-5; n<-100; num_groups<-10000 # note: this will take ~6 seconds to run
data <- matrix(rnorm(100*num_groups*n,mu,sigma), # make some random data
nrow=100*num_groups)
plot(density(rowMeans(data)), # density of sample mean estimator
main="Estimators for mean",
xlab=paste(format(100*num_groups,scientific=FALSE),'estimates,',n,'samples each'),
lty=1,col='blue',lwd=2 ) # solid, blue, thick line
lines(density(rowMedians(data)), # density of sample median estimator
lty=2,col='red',lwd=2) # dashed, red, thick line
legend(-.5,.8, # add a legend to upper-left
legend=c("Sample Mean","Sample Median"),
lty=c(1,2),col=c('blue','red'),lwd=2)
rm(data) # remove this variable to save memory...
Point Estimators for Variance of Normal
Recall that the "observed variance" $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is a biased estimator of $\sigma^2$: its expected value is $\frac{n-1}{n}\sigma^2$, a bit too small.
To remove this bias, we instead use the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
We can verify this bias issue and its correction using bootstrap methods. The code below computes the mean and error of
- the "observed variance" $\frac{1}{n}\sum_i (X_i - \bar{X})^2$ (via its square root, the observed standard deviation)
- the "sample variance" $s^2$ (via its square root, the sample standard deviation)
The "observed variance" is expected to yield results a bit below the correct variance, while the "sample variance" should give values on either side of the correct answer.
In order to see this difference, we will only use $n = 20$ samples per run; as $n$ grows, the difference between these two measures becomes negligible. (When $n$ is big, "asymptotically", these two measures are identical.)
Note that the sample variance is an unbiased estimator of the variance, but since square roots don't pass through expectations, the sample standard deviation is not an unbiased estimator of the standard deviation. It is still easy to compute, and often the variance is what we really want anyway...
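A quick illustration of that last point (hypothetical code, not in the original): for $n = 20$ normal samples with $\sigma = 5$, the average sample standard deviation comes out slightly below 5, even though the average sample variance is 25.

s <- replicate(10000, sd(rnorm(20, 2, 5)))
c(mean_sd = mean(s), sigma = 5) # mean_sd is a bit below 5 (Jensen's inequality)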
mu<-2; sigma<-5; n<-20; num_groups<-5 # n samples of normal(mu,sigma)
data <- matrix(rnorm(100*num_groups*n,mu,sigma), # make some random data
nrow=100*num_groups)
#########################################
## population sd
# expected value of square = ((n-1)/n) sigma^2 (biased!)
cat("-----------------------\nObserved Standard Dev")
pop_sd <- matrix(rowSds(data)*sqrt((n-1)/n), # convert sample to pop sd
nrow=num_groups)
bootstrap_table(pop_sd)
#########################################
## sample sd
# expected value of square = sigma^2 (square is unbiased!)
cat("-----------------------\nSample Standard Dev")
sample_sd <- matrix(rowSds(data), # compute sample sd
nrow=num_groups)
bootstrap_table(sample_sd)
rm(data) # remove this variable to save memory....