
Central Limit Theorem

The Central Limit Theorem says that if $X_1, X_2, \ldots, X_n$ is an IID random sample from a population with mean $\mu$ and variance $\sigma^2$, then for big $n$, the sample mean $\bar{X}$ will be approximately distributed as normal with mean $\mu$ and variance $\sigma^2/n$.

More precisely, the CLT says that as $n \to \infty$, the distribution of the standardized sample mean $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges to the standard normal; informally, $\bar{X}_n$ is approximately $N(\mu, \sigma^2/n)$.

  • In practice, 'big $n$' means '$n \geq 30$'

Let's try this out for a few different sampling distributions!

Initial setup

First I'll define some functions so that we can easily experiment with the central limit theorem under many different settings without rewriting the same code over and over.

sample_means() will be used to simulate running a series of experiments, calculating a sample mean at the completion of each run

  • each run of the experiment will collect n_samples data points and calculate their sample mean
  • this will be done n_trials times, generating a total of n_trials sample means

density_plots() will compute sets of sample means for multiple values of n_samples and then plot nonparametric density estimates of the distributions for each n_samples value.

  • distributions should all have roughly the same mean (the mean of the underlying distribution)
  • higher values of n_samples should result in more 'Normal-looking' distributions
  • higher values of n_samples should result in lower variance

quintile_plots() will compute a set of sample means for each n_samples value and then make a normal probability (quantile) plot for each

  • higher values of n_samples should result in a more linear plot
  • plots should look 'fairly linear' for n_samples ≥ 30
########################################################################
#' Perform multiple trials sampling a random variable 
#'
#' `sample_means()` simulates doing multiple experiment trials, 
#'   sampling a random variable multiple times 
#'   and computing the sample mean for each trial. 
#'
#' @param n_samples number of samples for each trial
#' @param n_trials  number of trials to perform
#' @param rdist     random sampling function (default = rnorm)
#' @param ...       parameters to pass to rdist
#'
#' @return    vector of sample means of the trials (length = n_trials)
#'
sample_means <- function( n_samples, n_trials, rdist = rnorm, ...) {
	samples <- matrix(
					  rdist( n_samples * n_trials, ...),  # n_samples * n_trials random numbers
					  nrow = n_trials                     # each trial of n_samples is a row
					 )
	
	# return the sample means for each row (set of trial runs)
	return( rowMeans(samples) )
}
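
For example, a quick sanity check of the helper (the numbers here are just illustrative):

# 100 trials of 50 uniform samples each; the means should cluster near 0.5
x_bar <- sample_means(n_samples = 50, n_trials = 100, rdist = runif)
c(mean(x_bar), sd(x_bar))   # sd should be near sqrt(1/12)/sqrt(50), about 0.041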

########################################################################
#' Plot density estimates for multiple trial sizes
#'
#' @param samples   vector of sample sizes, one batch of trials per value
#' @param ...       other inputs are the same as for sample_means()
#' 
density_plots <- function( samples, n_trials, rdist = rnorm, ...) {
	samples <- rev(sort(samples))
	
	# plot the trial with the most samples first (since it should have tallest density)
	plot(
		 density(sample_means( samples[1], n_trials, rdist, ...)), 
		
		 main=paste("Density Estimate.",n_trials,                   # plot title
					"trials with",as.character(substitute(rdist))), 
		
		 xlab=paste("#Samples =",paste(rev(samples),collapse=","))  # label x-axis
		) 
	
	# now add the other trials onto the same plot
	for (n in seq_along(samples)[-1]) {   # skips cleanly if only one sample size was given
		lines(
			  density(sample_means( samples[n], n_trials, rdist, ...)), 
			
			  col=n               # use different color for each sample size set
		     ) 
	}
}

########################################################################
#' Draw normal quantile plots for multiple trial sizes
#'
#' @param samples   vector of sample sizes, one batch of trials per value
#' @param ...       other inputs are the same as for sample_means()
#'
quintile_plots <- function( samples, n_trials, rdist = rnorm, ...) {
	
	for (n_samples in samples) {
		x_bar <- sample_means( n_samples, n_trials, rdist, ...)  # run experiment 
		
		qqnorm(                                                  # make probability plot
			   x_bar, 
			   main=paste(n_trials,"trials of size",n_samples),  # plot title
			   xlab=paste("µ =",signif(mean(x_bar),2),           # label x-axis
						  ", σ =",signif(sd(x_bar),2)),
			   ylab=""                                           # label y-axis
			  )
		
		abline(
			   mean(x_bar),      # line with y-intercept = mean
			   sd(x_bar)         #            and slope = std. dev
		      )  # if the sample means are normal, the points fall along this line
	}
}

CLT for the Normal Distribution

Let's try this for normally distributed $X_i$. We'll generate sample means in six different setups
(random normal $X_i$, with $\mu = 5$ and $\sigma = 10$.)

  • 100 sample means gathered from 5 samples
  • 100 sample means gathered from 10 samples
  • 100 sample means gathered from 15 samples
  • 100 sample means gathered from 20 samples
  • 100 sample means gathered from 25 samples
  • 100 sample means gathered from 30 samples

First I'll simulate a run in each setup and plot the resulting density estimates against each other.

Then I'll simulate a series of runs and plot their normal (quantile) probability plots.

# Try out the normal distribution

density_plots(
			  samples=c(5,10,15,20,25,30), 
			  n_trials=100, 
			  rnorm, 5, 10
			 )

par(mfrow=c(2,3))  # split screen into 6 plots: 2 rows, 3 cols
quintile_plots(
			   samples=c(5,10,15,20,25,30), 
			   n_trials=100, 
			   rnorm, 5, 10
			  )

The quantile plots above should all look more or less like straight lines. The y-intercept of each line is the mean of the approximating normal distribution and the slope is approximately the standard deviation. The slopes should be approximately $\sigma/\sqrt{n_{\text{samples}}} = 10/\sqrt{n_{\text{samples}}}$.
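
As a quick numerical check on the largest setup (just a sketch reusing sample_means() from above):

# the slope of the n_samples = 30 plot should be near 10/sqrt(30), about 1.83
sd( sample_means(30, 100, rnorm, 5, 10) )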

It shouldn't be a surprise that the $\bar{X}_n$ are normally distributed when the $X_i$ themselves already were...
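
In fact, for normal $X_i$ this holds exactly, not just in the limit: linear combinations of independent normals are themselves normal, so

$$\bar{X}_n \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for every } n.$$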

CLT for LogNormal Distribution

This is more interesting when we sample from non-normal $X_i$... We'll do a similar setup, this time with 100 trials each in six setups of random lognormal points.

  • n_samples = 1     (in this case the sample means are just lognormal data)
  • n_samples = 10
  • n_samples = 20
  • n_samples = 30
  • n_samples = 40
  • n_samples = 50

# Try out the lognormal distribution

density_plots(
			  samples=c(1,10,20,30,40,50), 
			  n_trials=100, 
			  rlnorm
			 )

par(mfrow=c(2,3))
quintile_plots(
			   samples=c(1,10,20,30,40,50), 
			   n_trials=100, 
			   rlnorm
			  )

The lognormal distribution above (with the default parameters meanlog = 0, sdlog = 1) should have mean $e^{1/2} \approx 1.65$ and variance $(e-1)e \approx 4.67$. The distribution with only one sample in each trial should just be lognormal. As the number of samples in each trial increases, the distribution should become more normal, which means the quantile plots should become more linear.
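
We can sanity-check those moments directly (a quick sketch using only base R):

# theoretical moments of rlnorm() with meanlog = 0, sdlog = 1
exp(1/2)               # mean: about 1.65
(exp(1) - 1) * exp(1)  # variance: about 4.67

# compare against a large simulated sample
x <- rlnorm(100000)
c(mean(x), var(x))     # should land close to the values above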

CLT for Binomial Distribution

The central limit theorem applies to all distributions. Even discrete ones! Let's try this out with the discrete binomial distribution. To make this more extreme, we'll use a binomial with 30 trials and $p = 1/10$: a heavily right-skewed distribution that looks very far from normal.

We'll use a similar setup as before.

# Try out the binomial distribution

density_plots( c(1,10,20,30,40,50), 100, 
			  rbinom, 30, 1/10 
			 )

par(mfrow=c(2,3))
quintile_plots( c(1,10,20,30,40,50), 100, 
			   rbinom, 30, 1/10 
			  )
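
Here the underlying binomial has mean $np = 3$ and variance $np(1-p) = 2.7$, so the sample means should pile up near 3, and the quantile plots should straighten out as n_samples grows. A quick check of those moments (base R only):

# moments of binomial(size = 30, prob = 1/10)
30 * (1/10)             # mean: np = 3
30 * (1/10) * (9/10)    # variance: np(1-p) = 2.7
x <- rbinom(100000, 30, 1/10)
c(mean(x), var(x))      # simulated values should be close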

Point Estimators

Point Estimators for Mean of Normal

There are many possible point estimators for the mean of a normal distribution. The following estimators are all unbiased:

  • $\bar{X}$, the sample mean
  • $\tilde{X}$, the sample median   (assuming a symmetric distribution)
  • the average of the sample max and min   (assuming a symmetric distribution)
  • the sample mean after "trimming" the largest and smallest observations (5 from each end below)

Question: How can we decide which of these is best to use?

Use the point estimator with the smallest standard error!

For the sample mean, it is pretty easy to compute that the standard error is $\sigma/\sqrt{n}$.
For the sample median, it is possible to show that the standard error is approximately $1.2533\,\sigma/\sqrt{n}$ (the constant is $\sqrt{\pi/2}$).

In general, though, computing the standard error can be difficult... in that case we can fall back on "bootstrapping" methods: compute many values of the point estimate, then find their sample standard deviation. Using this method below, we see that the sample mean is very good. The sample median is not bad (and may work better if there are extreme outliers). Averaging the max and min is terrible. The trimmed sample mean is about as good as the sample mean, and is better when the data contain outliers. [Computing a mean is also faster than computing a median.]
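
Here is the idea in miniature (a sketch independent of the helper functions below): simulate the estimator many times and take the standard deviation of the results.

# simulated standard error of the sample mean, n = 100, sigma = 5
ests <- replicate(1000, mean(rnorm(100, mean = 2, sd = 5)))
sd(ests)   # should be close to the theoretical 5/sqrt(100) = 0.5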

mu<-2; sigma<-5; n<-100; num_groups<-5  # n samples of normal(mu,sigma)

library(matrixStats)       # needed for rowSds, rowMedians, rowMaxs, rowMins

data <- matrix(rnorm(100*num_groups*n,mu,sigma), # make some random data
			   nrow=100*num_groups)

###################################################
#' Helper function, used to make a table of bootstrap values
#'
#' @param A  matrix of data to bootstrap, comparing elements across rows
#'
#'  output is a data.frame with:
#'    one column for each row of the input data
#'    two rows: "mean" and "error"
bootstrap_table <- function(A) {
	output <- t(data.frame(rowMeans(A),rowSds(A)))
	rownames(output) <- c("mean","error")
	return(output)
}
###################################################

###################################################
## sample mean
#  expected value = mu   (unbiased)
#  sample error   = sigma/sqrt(n)
cat("-----------------------\nSAMPLE MEAN")
sample_mean_ests <- matrix(rowMeans(data),  # renamed: don't overwrite the sample_means() function above
						   nrow=num_groups)
bootstrap_table(sample_mean_ests)

###################################################
## sample median
#  expected value = mu   (unbiased)
#  sample error   = 1.2533 sigma/sqrt(n)
cat("-----------------------\nSAMPLE MEDIAN")
sample_medians <- matrix(rowMedians(data),
					     nrow=num_groups)
bootstrap_table(sample_medians)


###################################################
## sample midpoint of max and min
#  expected value = mu    (unbiased)
#  sample error   = ... something messy
cat("-----------------------\nSAMPLE RANGE MIDPOINTS")
sample_midpoints <- matrix((rowMaxs(data) + rowMins(data))/2,
						   nrow=num_groups)
bootstrap_table(sample_midpoints)

###################################################
## trimmed sample mean
#  expected value = mu    (unbiased)
#  sample error   = close to the sample mean's (slightly larger for exactly normal data)
cat("-----------------------\nTRIMMED MEANS")
trim <- function(v) {  # remove the 5 smallest and 5 largest elements
	return( v[ -order(v)[c(
						   1:5,                       # indices of the 5 smallest
						   (length(v)-4):length(v)   # indices of the 5 largest
						  )] ] )
}
trimmed_means <- matrix(rowMeans(t(apply(data,1,trim))), # trim then compute mean
					   nrow=num_groups)
bootstrap_table(trimmed_means)

rm(data)      # remove this variable to save memory.....
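
As an aside, base R's mean() can do this kind of trimming directly via its trim argument (the fraction of observations dropped from each end); with 100 samples, trim = 0.05 removes the same 5 points from each end as the trim() helper above.

mean( rnorm(100, 2, 5), trim = 0.05 )   # 5% trimmed mean, built in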

Mean vs Median Distributions

Let's look at two of these distributions a bit more closely. I'll increase the number of point estimates to $10^6$ so that we can use density() to make smooth non-parametric density estimates for the point estimators.

On the graph below, I'll plot density estimates for the sample mean and sample median for normal random data, using $n = 100$ samples for each calculation. (This takes a few seconds to run, mostly due to the data creation step at the start, which is generating $10^8$ random normal values.)

These should be graphs for the following distributions:

$$\bar{X} \approx N\!\left(\mu, \frac{\sigma^2}{n}\right) = N(2, 0.25), \qquad \tilde{X} \approx N\!\left(\mu, \left(1.2533\,\frac{\sigma}{\sqrt{n}}\right)^{\!2}\right) \approx N(2, 0.39).$$

mu<-2; sigma<-5; n<-100; num_groups<-10000 # note: this will take ~6 seconds to run

data <- matrix(rnorm(100*num_groups*n,mu,sigma), # make some random data
			   nrow=100*num_groups)

plot(density(rowMeans(data)),        # density of sample mean estimator
	 main="Estimators for mean",
	 xlab=paste(100*num_groups,'estimates,',n,'samples each'),
	 lty=1,col='blue',lwd=2 )        # solid, blue, thick line

lines(density(rowMedians(data)),     # density of sample median estimator            
	  lty=2,col='red',lwd=2)         # dashed, red, thick line

legend(-.5,.8,                       # add a legend to upper-left
	   legend=c("Sample Mean","Sample Median"),
	   lty=c(1,2),col=c('blue','red'),lwd=2)

rm(data)    # remove this variable to save memory... 

Point Estimators for Variance of Normal

Recall that

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2\right]=\frac{n-1}{n}\,\sigma^2\neq\sigma^2$$

To remove this bias, we instead use the sample variance

$$S^2=\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2,\qquad E[S^2]=\sigma^2$$

We can verify this bias issue and its correction using bootstrap methods. The code below computes the mean and error of

  • the "observed variance" $\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i-\bar{X})^2$
  • the "sample variance" $S^2 = \frac{1}{n-1}\sum_i (X_i-\bar{X})^2$

(strictly speaking, the code works with their square roots, the corresponding standard deviations).

The "observed variance" is expected to yield results a bit below the correct variance, while the "sample variance" should give values on either side of the correct answer.

In order to see this difference, we will only use $n = 20$ samples per run -- as $n$ grows, the difference between the two measures shrinks like $1/n$ and quickly becomes negligible. (When $n$ is big -- "asymptotically" -- the two measures are essentially identical.)

Note that the sample variance is an unbiased estimator of the variance, but since square roots don't pass through expectations, the sample standard deviation is not an unbiased estimator of the standard deviation. It's easy to compute, though, and often what we really want is the variance anyway...
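
To see that bias directly (a quick base R sketch; sd() computes the sample standard deviation):

# sample sd is biased low as an estimator of sigma, especially for small n
sds <- replicate(10000, sd(rnorm(5)))   # true sigma = 1
mean(sds)                               # noticeably below 1 (about 0.94)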

mu<-2; sigma<-5; n<-20; num_groups<-5  # n samples of normal(mu,sigma)

data <- matrix(rnorm(100*num_groups*n,mu,sigma), # make some random data
			   nrow=100*num_groups)

#########################################
## population sd
#  expected value of square = ((n-1)/n) sigma^2  (biased!)
cat("-----------------------\nObserved Standard Dev")
pop_sd <- matrix(rowSds(data)*sqrt((n-1)/n),  # convert sample to pop sd
				 nrow=num_groups)

bootstrap_table(pop_sd)

#########################################
## sample sd
#  expected value of square = sigma^2   (square is unbiased!)
cat("-----------------------\nSample Standard Dev")
sample_sd <- matrix(rowSds(data),             # compute sample sd
				    nrow=num_groups)

bootstrap_table(sample_sd)

rm(data)   # remove this variable to save memory....