Summary so far:
- Given $n$ samples of paired data $(x_1, y_1), \dots, (x_n, y_n)$
- Consider $y$ as a function of $x$
- Assume data follows regression model: $Y = \beta_0 + \beta_1 x + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2)$
- The regression line is: $\hat{y} = b_0 + b_1 x$
- We call $\hat{y}(x^*) = b_0 + b_1 x^*$ the predicted value of $y$ at $x = x^*$. (This is actually the "expected value" or "mean value" of $Y$ at $x = x^*$)
- The "errors" of $y_i$ away from $\hat{y}(x_i)$ are called residuals: $e_i = y_i - \hat{y}(x_i)$
- Point estimates for the regression coefficients are $b_1 = S_{xy} / S_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$
- The variance of $b_1$ and $b_0$ are $\operatorname{Var}(b_1) = \sigma^2 / S_{xx}$ and $\operatorname{Var}(b_0) = \sigma^2 \left( \tfrac{1}{n} + \tfrac{\bar{x}^2}{S_{xx}} \right)$, where $S_{xx} = \sum_i (x_i - \bar{x})^2$, $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, and $\sigma^2$ is estimated by $MSE = \tfrac{1}{n-2} \left( S_{yy} - S_{xy}^2 / S_{xx} \right)$
Confidence intervals for $\beta_0$, $\beta_1$, and the regression line
Using this we could make confidence intervals for the coefficients: $b_1 \pm t_{\alpha/2,\,n-2} \sqrt{MSE / S_{xx}}$ and $b_0 \pm t_{\alpha/2,\,n-2} \sqrt{MSE \left( \tfrac{1}{n} + \tfrac{\bar{x}^2}{S_{xx}} \right)}$.
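These by-hand intervals can be checked against R's built-in confint(..). Here is a minimal sketch (the simulated data matches the example further below; confint(..) is the standard tool for coefficient intervals):
x <- runif(50, 5, 25)
y <- (5 + x/2) + rnorm(50, 0, 3)
n <- length(x)
Sxx <- (n-1) * sd(x)^2
MSE <- (n-1)/(n-2) * ( sd(y)^2 - ( cov(x,y) / sd(x) )^2 )
b1 <- cov(x,y) / sd(x)^2
b0 <- mean(y) - b1*mean(x)
t <- qt(1-.05/2, n-2)
# 95% CI for slope and intercept, by hand
c(b1 - t*sqrt(MSE/Sxx), b1 + t*sqrt(MSE/Sxx))
se.b0 <- sqrt( MSE * (1/n + mean(x)^2/Sxx) )
c(b0 - t*se.b0, b0 + t*se.b0)
# built-in comparison (should agree up to rounding)
confint(lm(y~x))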
Regression Line:
We would like a ("vertical") confidence interval for $\hat{y}(x^*)$ at each value of $x^*$. In this case, the variance has two parts:
$\operatorname{Var}(\hat{y}(x^*)) = \operatorname{Var}(\bar{y} + b_1 (x^* - \bar{x})) = \tfrac{\sigma^2}{n} + \tfrac{\sigma^2 (x^* - \bar{x})^2}{S_{xx}}$
(since we are choosing $x^*$, in this computation $(x^* - \bar{x})$ is treated as a constant)
Somewhat roughly, the confidence interval should be similar to $\hat{y}(x^*) \pm t_{\alpha/2,\,n-2} \left( \sqrt{MSE/n} + \sqrt{MSE/S_{xx}}\, \lvert x^* - \bar{x} \rvert \right)$.
Carefully combining standard errors correctly yields a better confidence interval for the regression line:
Confidence Interval for Regression Line: $\hat{y}(x^*) \pm t_{\alpha/2,\,n-2} \sqrt{MSE \left( \tfrac{1}{n} + \tfrac{(x^* - \bar{x})^2}{S_{xx}} \right)}$
Note: The magic of this confidence interval is that it uses the pooled variance of the residuals ($MSE$) to make confidence intervals at every value of $x^*$.
x <- runif(50, 5,25) # independent variable ("predictor" or "regressor")
e <- rnorm(50, 0, 3) # random error ~ Normal(mean=0,sd=3)
y <- (5 + x/2) + e # response variable
n <- length(x)
x.mu <- mean(x)
y.mu <- mean(y)
# regression coefficients
b1 <- cov(x,y) / sd(x)^2   # = S_xy / S_xx
b0 <- y.mu - b1*x.mu       # line passes through the central point (x.mu, y.mu)
# errors used in confidence interval calculation
Sxx <- (n-1) * sd(x)^2
# = sum (x - x.mu)^2
MSE <- (n-1)/(n-2) * ( sd(y)^2 - ( cov(x,y) / sd(x) )^2 )
# = 1 / (n-2) ( S_yy - S_xy² / S_xx )
# (estimates the error variance σ²)
# upper and lower curves for regression line confidence intervals
ci.upper <- function(x, alpha=.05) {
return( b0 + b1 * x + sqrt( MSE * (1/n + (x-x.mu)^2 / Sxx) ) * qt(1-alpha/2,n-2) )
}
ci.lower <- function(x, alpha=.05) {
return( b0 + b1 * x - sqrt( MSE * (1/n + (x-x.mu)^2 / Sxx) ) * qt(1-alpha/2,n-2) )
}
# upper and lower lines for linear approximation of confidence curves
approx.upper <- function(x, alpha=.05) {
t <- qt(1-alpha/2,n-2)
return( (y.mu + sqrt(MSE/n)*t) + (b1 + sign(x-x.mu)*sqrt(MSE/Sxx)*t) * (x-x.mu) )
}
approx.lower <- function(x, alpha=.05) {
t <- qt(1-alpha/2,n-2)
return( (y.mu - sqrt(MSE/n)*t) + (b1 - sign(x-x.mu)*sqrt(MSE/Sxx)*t) * (x-x.mu) )
}
## Time to draw some plots!!!
plot(x,y, main='Regression Line') # scatter plot of data
points(x.mu,y.mu, pch=19, cex=2, col='blue') # mark the central point
abline(b0,b1, lwd=3, col='blue') # draw the regression line
curve(ci.upper, add=TRUE, lwd=3, lty=2, col='blue') # upper ci curve
curve(ci.lower, add=TRUE, lwd=3, lty=2, col='blue') # lower ci curve
curve(approx.upper, add=TRUE, lwd=2, lty=3, col='blue4') # linear approx to upper curve
curve(approx.lower, add=TRUE, lwd=2, lty=3, col='blue4') # linear approx to lower curve
Let's compare this output to performing everything in R using the lm(..) command.
We can get the confidence intervals by running the model through predict(..) with the parameter interval set to 'confidence'. The output of predict(..) is a matrix with three columns -- the first column (fit) is the regression line value, the second column (lwr) is the lower endpoint of the interval, and the last column (upr) is the upper endpoint.
# make a mesh of x values for computing the interval curves
x_range <- range(x)
x_mesh <- seq(x_range[1], x_range[2], length.out = 100)
# create the linear model object
model <- lm(y~x)
# create intervals for the linear model
ci <- predict(model, newdata = data.frame(x=x_mesh),
interval ='confidence',
level = .95)
# print out the first three rows of ci
# column 1: regression line value (predicted y = y_hat)
# column 2: lower confidence curve
# column 3: upper confidence curve
cat('ci matrix:\n')
head(ci,3)
## Now let's plot some data!!!
plot(x, y, main='Regression Line') # data points
abline(model, lwd=3, col='blue') # regression line
points(x.mu,y.mu, pch=19, cex=2, col='blue') # center of graph
lines(x_mesh, ci[,2], lwd=3, lty=2, col='blue') # lower curve
lines(x_mesh, ci[,3], lwd=3, lty=2, col='blue') # upper curve
polygon( c(x_mesh, rev(x_mesh)), # fill in the confidence band!
         c(ci[,2], rev(ci[,3])),
         col=rgb(0,0,1,.2), border=NA)
Prediction intervals
The confidence interval that we made previously was for the location of the regression line, which gives the mean value of $Y$ at $x = x^*$.
Rather than a confidence interval for the mean, prediction intervals are confidence intervals for new observations $Y^* = \hat{y}(x^*) + \varepsilon$. The variance now has an extra $\sigma^2$ term for the new observation's own error, on top of the uncertainty in the regression line.
- Confidence interval for regression line (mean at $x = x^*$): $\hat{y}(x^*) \pm t_{\alpha/2,\,n-2} \sqrt{MSE \left( \tfrac{1}{n} + \tfrac{(x^* - \bar{x})^2}{S_{xx}} \right)}$
- Prediction interval for new observation (at $x = x^*$): $\hat{y}(x^*) \pm t_{\alpha/2,\,n-2} \sqrt{MSE \left( 1 + \tfrac{1}{n} + \tfrac{(x^* - \bar{x})^2}{S_{xx}} \right)}$
Note: When $n$ is large, $\tfrac{1}{n} + \tfrac{(x^* - \bar{x})^2}{S_{xx}}$ is small near $\bar{x}$, so the prediction interval is approximately $\hat{y}(x^*) \pm t_{\alpha/2,\,n-2} \sqrt{MSE}$ (the straight coral lines in the plot below).
# create the linear model object
model <- lm(y~x)
# create confidence intervals for the linear model
ci <- predict(model, newdata = data.frame(x=x_mesh),
interval ='confidence',
level = .95)
# create prediction intervals for the linear model
pi <- predict(model, newdata = data.frame(x=x_mesh),
interval ='prediction',
level = .95)
## Now let's plot some data!!!
plot(x, y, main='Regression Line') # data points
abline(model, lwd=3, col='blue') # regression line
points(x.mu,y.mu, pch=19, cex=2, col='blue') # center of graph
lines(x_mesh, ci[,2], lwd=3, lty=2, col='blue') # lower confidence curve
lines(x_mesh, ci[,3], lwd=3, lty=2, col='blue') # upper confidence curve
lines(x_mesh, pi[,2], lwd=3, lty=2, col='red')  # lower prediction curve
lines(x_mesh, pi[,3], lwd=3, lty=2, col='red')  # upper prediction curve
# compare to result using only MSE
abline(b0+sqrt(MSE)*qt(1-.05/2,n-2),b1, col='coral')
abline(b0-sqrt(MSE)*qt(1-.05/2,n-2),b1, col='coral')
Note to self: Do a bootstrap analysis on the regression line confidence interval as well as the prediction intervals.
Also do a bootstrap analysis on the prediction interval approximation using only MSE -- these intervals should be too small, but just how much too small are they? (They don't look bad in this example...)
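As a first step toward that, here is a minimal sketch of a coverage check for the MSE-only interval. It uses Monte Carlo simulation from the known model (which we have here, since the data is simulated) rather than a true bootstrap; the choices of x.star and the replication count are arbitrary:
# how often does a fresh observation at x.star land inside ŷ ± t·√MSE ?
x.star <- 15
cover <- replicate(2000, {
  x <- runif(50, 5, 25)
  y <- (5 + x/2) + rnorm(50, 0, 3)
  m <- lm(y~x)
  s <- sqrt( sum(resid(m)^2) / (length(x)-2) )   # √MSE
  y.hat <- predict(m, newdata = data.frame(x=x.star))
  y.new <- (5 + x.star/2) + rnorm(1, 0, 3)       # fresh observation
  abs(y.new - y.hat) <= qt(1-.05/2, length(x)-2) * s
})
mean(cover)  # empirical coverage -- expect a bit below the nominal 0.95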
Adequacy of Regression Model
Given some data, it is easy to blindly make regression lines and confidence intervals. However, we should have some basic reality checks about whether the regression model assumption is reasonable to make, given the observed data.
- Regression model: $Y = \beta_0 + \beta_1 x + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2)$
We have already discussed considering the residuals $e_i = y_i - \hat{y}(x_i)$.
These residuals are our main tool for checking whether the model assumptions hold.
Residual analysis
According to the regression model, the residuals should be (approximately) independent samples from $N(0, \sigma^2)$ -- centered at zero, with constant variance, and with no pattern as a function of $x$.
## Example of good residual plot
x <- runif(50, 5, 25)
y <- (5 + x/2) + rnorm(50, 0, 3)
b <- lm(y~x)$coefficients
y.hat <- b[1] + b[2]*x
res <- y - y.hat
plot(x, res,
main='Good residual plot')
abline(h=0)
## Bad residual plot -- should have been quadratic
x <- runif(50, 5, 25)
y <- (5 + x/2 + x^2/6) + rnorm(50, 0, 3)
b <- lm(y~x)$coefficients
y.hat <- b[1] + b[2]*x
res <- y - y.hat
plot(x, res,
main='Bad residual plot')
abline(h=0)
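The plots above look for patterns in $x$; the normality part of the assumption can be checked as well. Here is a minimal sketch using a normal Q-Q plot (qqnorm/qqline from base R) on the same kind of simulated data:
x <- runif(50, 5, 25)
y <- (5 + x/2) + rnorm(50, 0, 3)
res <- resid(lm(y~x))
qqnorm(res, main='Normal Q-Q plot of residuals')
qqline(res)  # points should hug this line if the residuals are normal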
We can check for a missing $x^2$ term by adding $x^2$ as a second regressor and looking at the coefficient $t$-tests:
- If there shouldn't be an $x^2$ term, then the coefficients will all lose significance (the regressors $x$ and $x^2$ are highly collinear, which inflates the standard errors).
- If there should be an $x^2$ term, then the $t$-test for that coefficient should be significant.
## check for x^2 coefficient in linear
cat(' y = 5 + x/2 ')
x <- runif(50, 5, 25)
y <- (5 + x/2) + rnorm(50, 0, 3)
xx <- x^2
summary(lm(y~x))
summary(lm(y~x+xx))
## check for x^2 coefficient in quadratic
cat('\n-----------------------\n y = 5 + x/2 + x^2/6 ')
x <- runif(50, 5, 25)
y <- (5 + x/2 + x^2/6) + rnorm(50, 0, 3)
xx <- x^2
summary(lm(y~x))
summary(lm(y~x+xx))