
Analysis of Variance

Review of Regression Model

In Regression Analysis, we had an output random variable y which depended on a continuous input variable x.

  • y ~ x, where x is a continuous variable
  • Model: y = β₀ + β₁x + ε, where ε ~ N(0, σ²). Note that the regression line β₀ + β₁x is the mean value of y.
  • Rewritten model: y ~ N(β₀ + β₁x, σ²)
x <- runif(100, 0, 10)
y <- 3 + 2*x + rnorm(100, 0, 4)

plot(x,y)
abline(lm(y~x), col='blue', lwd=3)

Analysis of Variance Model

Analysis of Variance is similar, except that now the input variable will be discrete. We will use slightly different language reflecting this change.

  • y ~ x, where x is a discrete "factor"
  • the values of x will be called "levels" or "treatments"
    • levels could be either numerical or categorical

Since we aren't making any assumptions about the relationship between different levels of the factor, we will not draw a regression line... instead we must consider each level separately.
(In theory there isn't any natural ordering or spacing between levels, since they could even be categorical data (like colors or types) -- so we can't do numerical things like computing a slope β₁x.)

x <- sample(0:10, 100, replace=TRUE)
y <- 3 + 2*x + rnorm(100,0,4)

plot(x,y)

(In the case where there are only two treatments, this should be equivalent to the pooled variance two-sample t-test from before.)
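We can check this equivalence numerically (a quick sketch with simulated data; with two levels the ANOVA F-statistic is the square of the pooled t-statistic, and the p-values agree):

```r
# With exactly two treatments, ANOVA reduces to the pooled two-sample t-test
set.seed(1)
g  <- factor(rep(c('a','b'), each = 20))
y2 <- c(rnorm(20, 10, 3), rnorm(20, 12, 3))

t_res <- t.test(y2 ~ g, var.equal = TRUE)   # pooled-variance t-test
a_tab <- summary(aov(y2 ~ g))[[1]]          # ANOVA table

a_tab$`F value`[1]         # F-statistic
unname(t_res$statistic^2)  # t^2 -- equals F
a_tab$`Pr(>F)`[1]          # same p-value as t_res$p.value
```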

  • Instead of inputs being pairs of numbers (x, y), we will now have a group of observed values for each level of x.
    • Input is groups of observations for each treatment
      (k treatments and n observations for each treatment)
  • The textbook writes these as y_ij for treatment i, observation j and views them as a rectangular matrix of observations, with rows corresponding to treatments.
  • For clarity, I will use superscript for treatment and subscript for observation:
    y^i_j for treatment i, observation j.
    • The superscript i corresponds to x in regression -- it gives the value of the discrete independent variable.
    • I will view this as a 'list of lists' of observations, not necessarily rectangular.

In this case, the regression line from before will be replaced by the means of the various treatments.

x <- rep(1:10, 15)
y <- 3 + 2*x + rnorm(length(x),0,4)

plot(x,y)
points(1:10,
	   sapply( 1:10, function(.) mean(y[x==.]) ), 
	   col='blue', pch=19, cex=2)

Maybe it is more useful to use box plots...

boxplot(y~x)

To be fair, in Analysis of Variance, we aren't assuming that y has a linear relationship with x... or even that x is numeric at all. So a more realistic example is one like the following:

x  <- c('a','b','c','d','e','f','g')  # levels
mu <- runif(length(x),5,20)           # mean of each level

n  <- 50            # observations per level

y  <- rep(mu,n) + rnorm(n*length(x),0,5) # observations
x  <- rep(x, n)                          # level of observations

boxplot(y~x)
Notation.
  • μ: mean of all treatments -- the grand mean
  • ȳ: the sample mean of all data (point estimate for μ)

  • μ^i: mean of treatment i -- the treatment mean
  • ȳ^i: the sample mean of treatment i data (point estimate for μ^i)
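These sample quantities can be computed in R with tapply (a small sketch with hypothetical data; the names ybar_i and ybar are mine):

```r
# Grand mean and treatment means for grouped data
y0 <- c(5.1, 4.9, 6.2,  8.0, 7.5, 8.3)    # observations y^i_j
x0 <- factor(c('a','a','a','b','b','b'))  # treatment of each observation

ybar_i <- tapply(y0, x0, mean)  # treatment means (estimates of mu^i)
ybar   <- mean(y0)              # grand mean (estimate of mu)
```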
Model.

The analysis of variance model has the same form as the (rewritten) regression model

  • Model: y = μ^i + ε, where ε ~ N(0, σ²) -- equivalently, y ~ N(μ^i, σ²)
Hypothesis Test.

We want to test the hypothesis:

  • H0: μ^1 = μ^2 = ... = μ^k (all means are the same)
    HA: At least one μ^i is different.

(This is equivalent to the regression test H0: β₁ = 0, which would imply constant mean y = β₀.)

We will test this by using the F-test from before. We will consider two sources of variance: within levels and between levels.

The "Analysis of Variance" table now looks as follows

  Source | df  | Sum of Squares | Mean Square
  -------|-----|----------------|------------------
  Level  | k-1 | SSL            | MSL = SSL / (k-1)
  Error  | N-k | SSE            | MSE = SSE / (N-k)
  Total  | N-1 | SST            |

where k is the number of levels, n is the number of observations per level, and N = nk is the total number of observations (so (k-1) + (N-k) = N-1, and SSL + SSE = SST).

  • "Regression" variance becomes "Level" variance (due to the independent variable, which is now a factor)
  • "Residual" variance becomes "Error" variance (this is the pooled sample variance of the k sets of observations)

The "sum of squares" values are

  • SSL = n · Σ_i (ȳ^i - ȳ)²
    This is the variance of the means between levels.

  • SSE = Σ_i Σ_j (y^i_j - ȳ^i)²
    This is the variance of the data within levels.
    (This is hard to compute, so we'll get it by SSE = SST - SSL.)

  • SST = Σ_i Σ_j (y^i_j - ȳ)²
    This is the total variance of observed y.

As before, the F-statistic is F = MSL / MSE, with k-1 and N-k degrees of freedom.

# convert x to a "factor" array
x  <- factor(x)

# total number of observations
N  <- length(y)

# number of levels in factor  
k  <- length(levels(x))

# number of observations per level (assuming constant)
n  <- N / k

# get sample means for each level
mu <- sapply(levels(x), function(.) mean(y[x==.]) )
			 
# compute sum of squares values
#SSL <- n * sum( ( mu - mean(y) )^2 )
 SSL <- n *   (k-1) * sd( mu )^2
#SST <-     sum( (  y - mean(y) )^2 )
 SST <-       (N-1) * sd( y )^2
			 
SSE <- SST - SSL      # Easier than direct computation
			 
# compute mean squared values (variances)
MSL <- SSL / (k-1)    # variance between levels
MSE <- SSE / (N-k)    # pooled variance within levels
			
# F statistic (named Fstat to avoid masking R's built-in F, i.e. FALSE)
Fstat <- MSL / MSE

# p-value
p     <- 1 - pf(Fstat, k-1, N-k)

# print results
cat(' Source |   df  |    SS   |   MS  \n')
cat('----------------------------------\n')
cat(' Factor |   ', k-1,  ' |  ', signif(SSL,4), ' | ', signif(MSL,3), '\n')
cat(' Error  | ', k*(n-1),' |  ', signif(SSE,4), ' |  ', signif(MSE,3), '\n')
cat(' Total  | ', k*n-1,  ' | ',  signif(SST,3), ' | \n\n')

cat('F-statistic: ', signif(Fstat,4), '\n' )
cat('p-value:     ', p, '\n')

We can do all of this in R with the aov(..) ("Analysis of Variance") function.

summary( aov(y~x) )
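As a sanity check, the hand computation agrees with aov's table (a sketch on fresh simulated data; variable names here are mine):

```r
# Verify that the hand-computed F-statistic matches aov's
set.seed(42)
xx <- factor(rep(letters[1:4], each = 10))
yy <- rnorm(40, mean = as.integer(xx))

k <- nlevels(xx); N <- length(yy)

mu  <- tapply(yy, xx, mean)
SSL <- sum(table(xx) * (mu - mean(yy))^2)   # between-level sum of squares
SST <- sum((yy - mean(yy))^2)               # total sum of squares
SSE <- SST - SSL
Fstat <- (SSL / (k-1)) / (SSE / (N-k))

tab <- summary(aov(yy ~ xx))[[1]]
all.equal(unname(tab$`F value`[1]), Fstat)  # TRUE
```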

Confidence Intervals on Treatment Means and Least Significant Differences

As with regression, MSE is the "pooled sample variance" estimate for the variance of ε. Thus it can be used to make confidence intervals for the treatment means.

  • ANOVA model: y = μ^i + ε, where ε ~ N(0, σ²)

  • Point estimate: ȳ^i, the sample mean of treatment i

  • Standard error of y is s = √MSE
  • Standard error of ȳ^i is s/√n, since it is the mean of n samples. Thus the confidence interval for μ^i is:

    ȳ^i ± t_{α/2, N-k} · √(MSE / n)
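This interval can be computed directly (a sketch with simulated data; names like half_width are mine):

```r
# 95% confidence intervals for each treatment mean, using the pooled MSE
set.seed(7)
x1 <- factor(rep(c('a','b','c'), each = 15))
y1 <- rnorm(45, mean = c(5, 8, 6)[as.integer(x1)], sd = 2)

N <- length(y1); k <- nlevels(x1); n <- N / k

fit <- aov(y1 ~ x1)
MSE <- sum(residuals(fit)^2) / (N - k)      # pooled variance estimate

ybar_i     <- tapply(y1, x1, mean)
half_width <- qt(0.975, N - k) * sqrt(MSE / n)

cbind(lower = ybar_i - half_width, upper = ybar_i + half_width)
```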

Combining two such confidence intervals, we can get confidence intervals for the difference of treatment means:

    (ȳ^i - ȳ^j) ± t_{α/2, N-k} · √(2 · MSE / n)

Thus, if two treatment sample means differ by more than

    LSD = t_{α/2, N-k} · √(2 · MSE / n)

then the probability of observing such a difference when μ^i = μ^j is less than α. We call this the "Fisher Least Significant Difference". If sample means are farther apart than this, then we can safely conclude that they come from different distributions.
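The LSD itself is a one-liner once MSE is known (a sketch with simulated data; the variable names are mine):

```r
# Fisher Least Significant Difference at level alpha = 0.05
set.seed(7)
x1 <- factor(rep(c('a','b','c'), each = 15))
y1 <- rnorm(45, mean = c(5, 8, 6)[as.integer(x1)], sd = 2)

N <- length(y1); k <- nlevels(x1); n <- N / k
fit <- aov(y1 ~ x1)
MSE <- sum(residuals(fit)^2) / (N - k)

LSD <- qt(0.975, N - k) * sqrt(2 * MSE / n)

# compare every pair of treatment means against the LSD
ybar_i <- tapply(y1, x1, mean)
abs(outer(ybar_i, ybar_i, `-`)) > LSD   # TRUE = significantly different
```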

We can perform this test using the LSD.test(..) function from the "agricolae" library.
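A sketch of the call (LSD.test takes a fitted model and the factor name; the agricolae package must be installed, so the call is guarded here):

```r
# Pairwise LSD comparisons via agricolae (if installed)
set.seed(3)
x2 <- factor(rep(c('a','b','c','d'), each = 12))
y2 <- rnorm(48, mean = c(5, 5, 7, 9)[as.integer(x2)], sd = 2)

fit <- aov(y2 ~ x2)

if (requireNamespace("agricolae", quietly = TRUE)) {
  out <- agricolae::LSD.test(fit, "x2")
  print(out$groups)  # treatments sharing a letter are not significantly different
} else {
  message("install.packages('agricolae') to run LSD.test")
}
```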