Summary so far:

  • Given samples of paired data $(x_1, y_1), \ldots, (x_n, y_n)$
  • Consider $y$ as a function of $x$.
  • Assume data follows regression model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$,
    where $\epsilon_i \sim N(0, \sigma_e^2)$ are independent
  • The regression line is: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$
  • We call $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ the predicted value of $y$ at $x_i$.
      (This is actually the "expected value" or "mean value" of $Y$ at $x_i$)
  • The "errors" of $y_i$ away from $\hat{y}_i$ are called residuals: $e_i = y_i - \hat{y}_i$
  • Point estimates for the regression coefficients are $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
  • The variance of $\hat\beta_0$ and $\hat\beta_1$ are $\mathrm{Var}(\hat\beta_0) = \sigma_e^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$ and $\mathrm{Var}(\hat\beta_1) = \sigma_e^2 / S_{xx}$ (checked numerically in the short R snippet below)

    • where $S_{xx} = \sum_i (x_i - \bar{x})^2$, $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, $S_{yy} = \sum_i (y_i - \bar{y})^2$, and $\sigma_e^2$ is estimated by $MSE = \frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2}$

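The point estimates and variance formulas above can be checked directly in R. Here is a minimal sketch on simulated data (mirroring the example further down; the helper names Sxy, Syy, se.b0, se.b1 are just for this snippet): the by-hand values should match the "Estimate" and "Std. Error" columns of summary(lm(..)), and confint(..) turns them into confidence intervals for the coefficients.

x <- runif(50, 5, 25)
y <- (5 + x/2) + rnorm(50, 0, 3)
n <- length(x)

Sxx <- sum( (x - mean(x))^2 )
Sxy <- sum( (x - mean(x)) * (y - mean(y)) )
Syy <- sum( (y - mean(y))^2 )

b1  <- Sxy / Sxx                       # slope estimate
b0  <- mean(y) - b1 * mean(x)          # intercept estimate
MSE <- (Syy - Sxy^2 / Sxx) / (n - 2)   # estimate of the error variance

se.b0 <- sqrt( MSE * (1/n + mean(x)^2 / Sxx) )   # std. error of intercept
se.b1 <- sqrt( MSE / Sxx )                       # std. error of slope

cbind( estimate = c(b0, b1), std.error = c(se.b0, se.b1) )   # by hand
summary(lm(y ~ x))$coefficients[, 1:2]                       # from lm(..)
confint(lm(y ~ x), level = .95)   # confidence intervals for the coefficients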
Confidence intervals for $\beta_0$, $\beta_1$, and the regression line

Using this we could make confidence intervals for the coefficients $\beta_0$ and $\beta_1$.... though what is more interesting is to have confidence intervals for the regression line itself. For this, it is cleaner if we think of the regression line as "the line through $(\bar{x}, \bar{y})$ with slope $\hat\beta_1$" rather than as "the line with intercept $\hat\beta_0$ and slope $\hat\beta_1$":

$$ \hat{y} = \bar{y} + \hat\beta_1 (x - \bar{x}) $$

Regression Line: We would like a ("vertical") confidence interval for $\hat{y}$ at each value of $x$. In this case, the variance of $\hat{y} = \bar{y} + \hat\beta_1 (x - \bar{x})$ has two parts:

  • $\mathrm{Var}(\bar{y}) = \sigma_e^2 / n$
  • $\mathrm{Var}\big(\hat\beta_1 (x - \bar{x})\big) = (x - \bar{x})^2 \, \sigma_e^2 / S_{xx}$ (since we are choosing $x$, in this computation $(x - \bar{x})$ is treated as a constant)

Somewhat roughly, the confidence interval should be similar to

$$ \hat{y} \pm t_{\alpha/2,\, n-2} \left( \sqrt{MSE / n} + |x - \bar{x}| \sqrt{MSE / S_{xx}} \right) $$
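The reason a single square root appears in the more careful interval below is that, under the regression model, $\bar{y}$ and $\hat\beta_1$ are uncorrelated, so it is their variances (not their standard errors) that add:

$$ \mathrm{Var}(\hat{y}) = \mathrm{Var}(\bar{y}) + \mathrm{Var}\!\big(\hat\beta_1 (x - \bar{x})\big) = \frac{\sigma_e^2}{n} + \frac{(x - \bar{x})^2 \sigma_e^2}{S_{xx}} = \sigma_e^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right) $$

Estimating $\sigma_e^2$ by $MSE$ and taking the square root of this combined variance gives the interval below.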

Carefully combining standard errors correctly yields a better confidence interval for the regression line:

Confidence Interval for Regression Line:

$$ \hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{ MSE \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right) } $$

Note: The magic of this confidence interval is that it uses the pooled variance of residuals to make confidence intervals at every value of $x$, not just at the sampled values $x_i$. This allows us to make predictions about $y$ even at $x$ values where we have no data!

x <- runif(50, 5,25)  # independent variable ("predictor" or "regressor")
e <- rnorm(50, 0, 3)  # random error ~ Normal(mean=0,sd=3)
y <- (5 + x/2) + e    # response variable
n    <- length(x)

x.mu <- mean(x)
y.mu <- mean(y)

# regression coefficients
b1   <- cov(x,y) / sd(x)^2
b0   <- y.mu - b1*x.mu

# errors used in confidence interval calculation
Sxx  <- (n-1) * sd(x)^2
#     = sum (x - x.mu)^2
MSE  <- (n-1)/(n-2) * ( sd(y)^2 - ( cov(x,y) / sd(x) )^2 )
#     =   1 / (n-2)   (  S_yy   -      S_xy² / S_xx      )
#     ≈   σₑ²   (this is our estimate of the error variance)

# upper and lower curves for regression line confidence intervals
ci.upper <- function(x, alpha=.05) {
	return( b0 + b1 * x + sqrt( MSE * (1/n + (x-x.mu)^2 / Sxx) ) * qt(1-alpha/2,n-2) )
}
ci.lower <- function(x, alpha=.05) {
	return( b0 + b1 * x - sqrt( MSE * (1/n + (x-x.mu)^2 / Sxx) ) * qt(1-alpha/2,n-2) )
}

# upper and lower lines for linear approximation of confidence curves
approx.upper <- function(x, alpha=.05) {
	t <- qt(1-alpha/2,n-2)
	return( (y.mu + sqrt(MSE/n)*t) + (b1 + sign(x-x.mu)*sqrt(MSE/Sxx)*t) * (x-x.mu) ) 
}
approx.lower <- function(x, alpha=.05) {
	t <- qt(1-alpha/2,n-2)
	return( (y.mu - sqrt(MSE/n)*t) + (b1 - sign(x-x.mu)*sqrt(MSE/Sxx)*t) * (x-x.mu) ) 
}

## Time to draw some plots!!!
plot(x,y, main='Regression Line')                    # scatter plot of data
points(x.mu,y.mu, pch=19, cex=2, col='blue')         # mark the central point
abline(b0,b1, lwd=3, col='blue')                     # draw the regression line
curve(ci.upper, add=TRUE, lwd=3, lty=2, col='blue')  # upper ci curve
curve(ci.lower, add=TRUE, lwd=3, lty=2, col='blue')  # lower ci curve

curve(approx.upper, add=TRUE, lwd=2, lty=3, col='blue4') # linear approx to upper curve
curve(approx.lower, add=TRUE, lwd=2, lty=3, col='blue4') # linear approx to lower curve

Let's compare this output to what we get by doing everything in R with the lm(..) command.

We can get the confidence intervals by running the model through predict(..) with the parameter interval set to 'confidence'. The output of predict(..) has three columns -- the first column is the regression line value, the second column is the lower endpoint of the interval, and the last column is the upper endpoint.

# make a mesh of values for computing
x_range <- range(x)
x_mesh  <- seq(x_range[1], x_range[2], length.out = 100)

# create the linear model object
model    <- lm(y~x)

# create intervals for the linear model
ci <- predict(model, data.frame(x=x_mesh), 
			  interval ='confidence',
              level    = .95)

# print out the first three rows of ci
#   column 1: regression line value (predicted y = y_hat)
#   column 2: lower confidence curve
#   column 3: upper confidence curve
cat('ci matrix:\n')
head(ci,3)

## Now let's plot some data!!!
plot(x, y, main='Regression Line')           # data points
abline(model, lwd=3, col='blue')             # regression line
points(x.mu,y.mu, pch=19, cex=2, col='blue') # center of graph

lines(x_mesh, ci[,2], lwd=3, lty=2, col='blue') # lower curve
lines(x_mesh, ci[,3], lwd=3, lty=2, col='blue') # upper curve

polygon( c(x_mesh, rev(x_mesh)),     # fill in the confidence intervals!
		 c(ci[,2], rev(ci[,3])), 
		 col=rgb(0,0,1,.2), lty=0, border=2)
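As a quick consistency check (assuming the hand-rolled block above and this predict(..) block were run in the same session), the by-hand curves and the lwr/upr columns from predict(..) should agree up to rounding:

# by-hand confidence curves vs predict(..)'s lwr/upr columns
max( abs( ci.lower(x_mesh) - ci[,2] ) )   # should be essentially 0
max( abs( ci.upper(x_mesh) - ci[,3] ) )   # should be essentially 0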

Prediction intervals

The confidence interval that we made previously was for the location of the regression line, which gives the mean value of $Y$ at each value of $x$:

$$ \hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{ MSE \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right) } $$

Rather than a confidence interval for the mean, prediction intervals are confidence intervals for new observations $Y_{\mathrm{new}}$ at a chosen $x$. Just as before the interval is centered at $\hat{y}$, but the new observation also carries its own error, which increases the size of the confidence interval by adding an extra $\sigma_e^2$ (estimated by $MSE$) to the variance.

  • Confidence interval for regression line (mean value of $Y$ at $x$): $\hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{ MSE \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right) }$
  • Prediction interval for new observation (at $x$): $\hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{ MSE \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right) }$

Note: When $x$ is close to $\bar{x}$, the term $\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}$ above is much smaller than $1$. So the prediction interval we get will be very close to

$$ \hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{MSE} $$

# create the linear model object
model    <- lm(y~x)

# create confidence intervals for the linear model
ci <- predict(model, newdata = data.frame(x=x_mesh), 
			  interval ='confidence',
              level    = .95)
# create prediction intervals for the linear model
pi <- predict(model, newdata = data.frame(x=x_mesh), 
			   interval ='prediction',
               level    = .95)


## Now let's plot some data!!!
plot(x, y, main='Regression Line')           # data points
abline(model, lwd=3, col='blue')             # regression line
points(x.mu,y.mu, pch=19, cex=2, col='blue') # center of graph

lines(x_mesh, ci[,2], lwd=3, lty=2, col='blue') # lower curve
lines(x_mesh, ci[,3], lwd=3, lty=2, col='blue') # upper curve

lines(x_mesh, pi[,2], lwd=3, lty=2, col='red') # lower prediction curve
lines(x_mesh, pi[,3], lwd=3, lty=2, col='red') # upper prediction curve

# compare to result using only MSE
abline(b0+sqrt(MSE)*qt(1-.05/2,n-2),b1, col='coral')
abline(b0-sqrt(MSE)*qt(1-.05/2,n-2),b1, col='coral')

Note to self: Do bootstrap analysis on regression line confidence interval as well as prediction intervals.

Also do bootstrap analysis on prediction interval approximation using only MSE -- these intervals should be too small, but just how much too small are they? (They don't look bad in this example...)
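Not a bootstrap, but here is a quick Monte Carlo sketch of that last question. It cheats by reusing the true simulation model y = 5 + x/2 + Normal(0,3) from above (which we would not know for real data): draw fresh observations at the same x values and count how often each 95% band contains them.

# rough coverage check of the two prediction bands, reusing x, n, b0, b1,
# MSE, Sxx, x.mu from the code above and the known simulation model
t95      <- qt(1-.05/2, n-2)
y.hat    <- b0 + b1*x
half.mse <- sqrt(MSE) * t95                                  # MSE-only half-width
half.pi  <- sqrt(MSE * (1 + 1/n + (x-x.mu)^2/Sxx)) * t95     # full prediction half-width

reps <- 2000
covered.mse <- covered.pi <- 0
for (r in 1:reps) {
	y.new <- (5 + x/2) + rnorm(n, 0, 3)   # fresh observations from the true model
	covered.mse <- covered.mse + mean( abs(y.new - y.hat) <= half.mse )
	covered.pi  <- covered.pi  + mean( abs(y.new - y.hat) <= half.pi  )
}
covered.mse / reps   # coverage of the MSE-only band
covered.pi  / reps   # coverage of the full prediction band

This conditions on the one fitted line above, so it is only a first pass, not a replacement for the bootstrap check.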

Adequacy of Regression Model

Given some data, it is easy to blindly make regression lines and confidence intervals. However, we should perform some basic reality checks on whether the regression model assumption is reasonable for the observed data.

  • Regression model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma_e^2)$ are independent

We have already discussed considering the $F$ statistic associated to the sources of variance, or equivalently the $t$ statistic testing the hypothesis $H_0: \beta_1 = 0$.

These $p$-values are useful, but sometimes it is worth digging a little deeper. Sometimes $p$-values may fail to reach significance, yet we can still repair our model to get good $p$-values. Also, sometimes $p$-values may reach significance even though the relationship between $x$ and $y$ that the model captures is not itself very strong.
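For reference, a short sketch of where these statistics live in R (refitting the model from the data above): summary(..) reports the $t$ statistic for each coefficient along with the overall $F$ statistic, and anova(..) lays out the same $F$ test by source of variance; for simple linear regression, the slope's $t$ statistic squared equals that $F$ statistic.

model <- lm(y ~ x)

summary(model)$coefficients    # t statistics and p-values for the coefficients
summary(model)$fstatistic      # overall F statistic and its degrees of freedom
anova(model)                   # the same F test, organized by source of variance

summary(model)$coefficients["x", "t value"]^2   # equals the F statistic above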

Residual analysis

According to the regression model, the residuals should be independent $N(0, \sigma_e^2)$. By plotting the residuals, you can check whether this seems reasonable. In particular, it is useful to plot residuals vs. $x$ to check for effects beyond linear regression.

## Example of good residual plot
x <- runif(50, 5, 25)  
y <- (5 + x/2) + rnorm(50, 0, 3) 

b <- lm(y~x)$coefficients
y.hat <- b[1] + b[2]*x

res   <- y - y.hat

plot(x, res,
	main='Good residual plot')
abline(h=0)

## Bad residual plot -- should have been quadratic
x <- runif(50, 5, 25)  
y <- (5 + x/2 + x^2/6) + rnorm(50, 0, 3) 

b <- lm(y~x)$coefficients
y.hat <- b[1] + b[2]*x

res   <- y - y.hat

plot(x, res,
	main='Bad residual plot')
abline(h=0)

We can check for an $x^2$ coefficient in regression models by doing lm(..) with an included $x^2$ variable.

  • If there shouldn't be an $x^2$ term, then the coefficients will all lose significance.
  • If there should be an $x^2$ term, then the $t$-test for that coefficient should be significant.

## check for x^2 coefficient in linear
cat(' y = 5 + x/2 ')
x <- runif(50, 5, 25)  
y <- (5 + x/2) + rnorm(50, 0, 3) 

xx <- x^2
summary(lm(y~x))
summary(lm(y~x+xx))

## check for x^2 coefficient in quadratic
cat('\n-----------------------\n y = 5 + x/2 + x^2/6 ')
x <- runif(50, 5, 25)  
y <- (5 + x/2 + x^2/6) + rnorm(50, 0, 3) 

xx <- x^2
summary(lm(y~x))
summary(lm(y~x+xx))
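
Side note: the same quadratic check can be written without the helper variable xx by wrapping the square in I(..) inside the formula, which should give an equivalent fit:

summary(lm(y ~ x + I(x^2)))   # equivalent to lm(y ~ x + xx) with xx <- x^2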