Analysis of Variance
Review of Regression Model
In Regression Analysis, we had an output random variable $y$ depending on an input variable $x$:
- y ~ x, where x is a continuous variable
- Model: $y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$. Note that the regression line $\beta_0 + \beta_1 x$ is the mean value of $y$ given $x$.
- Rewritten model: $y \sim N(\beta_0 + \beta_1 x,\ \sigma^2)$
x <- runif(100, 0, 10)
y <- 3 + 2*x + rnorm(100, 0, 4)
plot(x,y)
abline(lm(y~x), col='blue', lwd=3)
Analysis of Variance Model
Analysis of Variance is similar, except that now the input variable is discrete:
- y ~ x, where x is a discrete "factor"
- the values of x will be called "levels" or "treatments"
- levels could be either numerical or categorical
Since we aren't making any assumptions about the relationship between different levels of the factor, we will not draw a regression line... instead we must consider each level separately.
(In theory there isn't any natural ordering or spacing between levels, since they could even be categorical data (like colors or types) -- so we can't do numerical things like fit a line through them.)
x <- sample(0:10, 100, replace=TRUE)
y <- 3 + 2*x + rnorm(100,0,4)
plot(x,y)
(In the case where there are only two treatments, this should be equivalent to the pooled-variance two-sample t-test.)
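For instance, here is a quick sanity check of the two-treatment case (the vectors g and v below are made up for illustration): the squared pooled-variance t statistic equals the one-way ANOVA F statistic, and the p-values agree.
g <- rep(c('a','b'), each=20)              # two treatments
v <- rnorm(40, ifelse(g=='a', 5, 7), 2)    # observations
tt <- t.test(v ~ g, var.equal=TRUE)        # pooled-variance two-sample t-test
tt$p.value
tt$statistic^2    # equals the F statistic from the one-way ANOVA of v ~ g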
- Instead of inputs being pairs of numbers $(x_i, y_i)$, we will now have a group of observed values for each level.
- Input is $k$ groups of observations, one for each treatment ($k$ treatments and $n_i$ observations for treatment $i$).
- The textbook writes these as $y_{ij}$ for treatment $i$, observation $j$, and views them as a rectangular matrix of observations, with rows corresponding to treatments.
- For clarity, I will use a superscript for the treatment and a subscript for the observation: $y^{(i)}_j$ for treatment $i$, observation $j$.
- The superscript $(i)$ corresponds to $x$ in regression -- it gives the value of the discrete independent variable.
- I will view this as a 'list of lists' of observations, not necessarily rectangular.
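One way to get this 'list of lists' view in R is split(); for example, with the simulated data above:
obs <- split(y, x)   # one vector of observations per level
str(obs)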
In this case, the regression line from before will be replaced by the means of the individual treatments.
x <- rep(1:10, 15)
y <- 3 + 2*x + rnorm(length(x),0,4)
plot(x,y)
points(1:10,
sapply( 1:10, function(.) mean(y[x==.]) ),
col='blue', pch=19, cex=2)
Maybe it is more useful to use box plots...
boxplot(y~x)
To be fair, in Analysis of Variance we aren't assuming that the treatment means follow any pattern (like a straight line), and the levels need not even be numerical. Here is an example with categorical levels:
x <- c('a','b','c','d','e','f','g') # levels
mu <- runif(length(x),5,20) # mean of each level
n <- 50 # observations per level
y <- rep(mu,n) + rnorm(n*length(x),0,5) # observations
x <- rep(x, n) # level of observations
boxplot(y~x)
Notation.
- $\mu$ = mean of all treatments; the grand mean $\bar{y}$ is the sample mean of all data (point estimate for $\mu$)
- $\mu^{(i)}$ = mean of treatment $i$; the treatment mean $\bar{y}^{(i)}$ is the sample mean of the treatment $i$ data (point estimate for $\mu^{(i)}$)
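As a quick sketch, these point estimates can be computed directly from the simulated data above:
mean(y)             # grand mean: sample mean of all data
tapply(y, x, mean)  # treatment means: sample mean within each level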
Model.
The analysis of variance model has the same form as the (rewritten) regression model
- Model: $y^{(i)}_j = \mu^{(i)} + \epsilon^{(i)}_j$, where $\epsilon^{(i)}_j \sim N(0, \sigma^2)$, i.e. $y^{(i)}_j \sim N(\mu^{(i)}, \sigma^2)$
Hypothesis Test.
We want to test the hypothesis:
- H0: $\mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(k)}$ (all treatment means are the same)
- HA: at least one $\mu^{(i)}$ is different.
(This is equivalent to the regression test H0: $\beta_1 = 0$.)
We will test this by comparing the variance between levels to the variance within levels.
The "Analysis of Variance" table now looks as follows
Source | df | Sum of Squares | Mean Square |
---|---|---|---|
Level | $k-1$ | $SSL$ | $MSL = SSL/(k-1)$ |
Error | $N-k$ | $SSE$ | $MSE = SSE/(N-k)$ |
Total | $N-1$ | $SST$ | |
where
- "Regression" variance becomes "Level" variance (due to the independent variable, which is now a factor)
- "Residual" variance becomes "Error" variance (this is the pooled sample variance of the $k$ sets of observations)
The "sum of square" values are
This is the variance of the means between levels.
This is the variance of the data within levels.
(This is hard to compute, so we'll get it by .)
This is total variance of observed .
As before, the F statistic is the ratio $F = MSL / MSE$; under H0 it follows an F distribution with $k-1$ and $N-k$ degrees of freedom, which gives the p-value.
# convert x to a "factor" array
x <- factor(x)
# total number of observations
N <- length(y)
# number of levels in factor
k <- length(levels(x))
# number of observations per level (assuming constant)
n <- N / k
# get sample means for each level
mu <- sapply(levels(x), function(.) mean(y[x==.]) )
# compute sum of squares values
#SSL <- n * sum( ( mu - mean(y) )^2 )
SSL <- n * (k-1) * sd( mu )^2
#SST <- sum( ( y - mean(y) )^2 )
SST <- (N-1) * sd( y )^2
SSE <- SST - SSL # Easier than direct computation
# compute mean squared values (variances)
MSL <- SSL / (k-1) # variance between levels
MSE <- SSE / (N-k) # pooled variance within levels
# F statistic
F <- MSL / MSE
# p-value
p <- 1 - pf(F, k-1, N-k)
# print results
cat(' Source | df | SS | MS \n')
cat('----------------------------------\n')
cat(' Factor | ', k-1, ' | ', signif(SSL,4), ' | ', signif(MSL,3), '\n')
cat(' Error | ', k*(n-1),' | ', signif(SSE,4), ' | ', signif(MSE,3), '\n')
cat(' Total | ', k*n-1, ' | ', signif(SST,3), ' | \n\n')
cat('F-statistic: ', signif(F,4), '\n' )
cat('p-value: ', p)
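As a sanity check, SSE can also be computed directly from its definition (reusing mu, x, y from above); it should match SST - SSL:
SSE.direct <- sum( ( y - mu[as.character(x)] )^2 )  # within-level squared deviations
all.equal(SSE.direct, SSE)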
We can do all of this in R with the aov(..) ("Analysis of Variance") function.
summary( aov(y~x) )
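Since one-way ANOVA is a special case of linear regression on a factor, the same table can also be obtained from lm():
anova( lm(y~x) )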
Confidence Intervals on Treatment Means and Least Significant Differences
As with regression, we can give confidence intervals for the model parameters.
- ANOVA model: $y^{(i)}_j = \mu^{(i)} + \epsilon^{(i)}_j$, where $\epsilon^{(i)}_j \sim N(0, \sigma^2)$
- Point estimate: $\hat{\mu}^{(i)} = \bar{y}^{(i)}$, the sample mean of treatment $i$
- Standard error of $y^{(i)}_j$ is $s = \sqrt{MSE}$
- Standard error of $\bar{y}^{(i)}$ is $\sqrt{MSE / n_i}$, since it is the mean of $n_i$ samples

Thus the confidence interval for $\mu^{(i)}$ is:
$$\bar{y}^{(i)} \pm t_{\alpha/2,\, N-k} \sqrt{MSE / n_i}$$
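A minimal sketch in R, reusing the quantities mu, MSE, n, N, k computed above, for a 95% interval on the first treatment mean:
alpha <- 0.05
t.crit <- qt(1 - alpha/2, df = N - k)    # t critical value with N-k degrees of freedom
mu[1] + c(-1,1) * t.crit * sqrt(MSE/n)   # confidence interval for treatment 1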
Combining two such confidence intervals, we can get confidence intervals for the difference of treatment means:
$$(\bar{y}^{(i)} - \bar{y}^{(j)}) \pm t_{\alpha/2,\, N-k} \sqrt{MSE \left( \tfrac{1}{n_i} + \tfrac{1}{n_j} \right)}$$
Thus, if two treatment sample means differ by more than the margin of error above (the "least significant difference"),
then the probability of seeing such a difference when $\mu^{(i)} = \mu^{(j)}$ is less than $\alpha$, and we conclude that the two treatment means differ.
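A sketch of this check in R for the balanced design above (n observations per level), again reusing MSE, n, N, k, mu:
alpha <- 0.05
LSD <- qt(1 - alpha/2, df = N - k) * sqrt( MSE * (1/n + 1/n) )
LSD
abs( outer(mu, mu, '-') ) > LSD   # which pairs of treatment means differ significantly?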
We can perform this test using the LSD.test(..) function from the "agricolae" library.
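For example, something along these lines (assuming the agricolae package is installed):
# install.packages('agricolae')   # if needed
library(agricolae)
LSD.test( aov(y~x), 'x', console=TRUE )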