
Simple Linear Regression

Linear regression is a method for analyzing paired data $(x_i, y_i)$. It is similar to the paired-sample $t$-test in that we assume the pairs are not independent and want to investigate the relationship between $x$ and $y$.

Paired Sample $t$-test
  • Assume the difference $d_i = y_i - x_i$ is
            $d_i \sim N(\mu_d, \sigma^2)$
            where $\mu_d$ is the mean difference
Linear Regression
  • Assume $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$   where $\epsilon_i \sim N(0, \sigma^2)$   (compared side by side in the sketch below)
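
To see the contrast concretely, here is a small sketch (the sample size and parameter values are made up for illustration) that runs both analyses on the same paired data:

set.seed(1)
x <- runif(30, 5, 25)            # first member of each pair
y <- 5 + x/2 + rnorm(30, 0, 3)   # second member of each pair

t.test(y, x, paired = TRUE)      # paired t-test: is the mean difference zero?
summary( lm(y ~ x) )             # regression: how does the mean of y change with x?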

Frequently linear regression is used when there is a causal ("cause-and-effect") relationship between $x$ and $y$. We say $x$ is the "explanatory" or "regressor" variable -- and think of it as the independent variable in a function. We call $y$ the "response" variable -- and think of it as the dependent variable in a functional relationship. For example:

  • $x$ = age of a child       and   $y$ = height of child
  • $x$ = thickness of bar   and   $y$ = strength of bar
  • $x$ = time of year         and   $y$ = avg power output of solar panel
Graphing

Linear regression relationships are usually graphed as a scatterplot of data with the "best fit" or "regression" line drawn through the middle of the data.

x <- runif(50, 5,25)  # independent variable ("predictor" or "regressor")
e <- rnorm(50, 0, 3)  # random error ~ Normal(mean=0,sd=3)
y <- (5 + x/2) + e    # response variable

plot(x,y)             # scatter plot of data
abline(5, 1/2,        # y = 5 + x/2 ("center-line" of data) 
	   lwd=3, col='blue')

In regression analysis, we imagine that each data point is drawn from a normal distribution whose mean shifts as $x$ changes, following the "regression line".

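One way to picture this (a sketch; the x positions and the scaling of the density curves are arbitrary choices made here for illustration) is to overlay a few normal density curves, each centered on the regression line:

plot(x, y, main='Shifting Normal Distributions')
abline(5, 1/2, lwd=3, col='blue')              # center-line y = 5 + x/2

for (x0 in c(8, 14, 20)) {                     # a few illustrative x positions
  dev  <- seq(-9, 9, length.out=100)           # deviations from the mean at x0
  dens <- dnorm(dev, mean=0, sd=3)             # Normal(mean=0, sd=3) density
  lines(x0 + 10*dens, (5 + x0/2) + dev,        # density curve drawn sideways (scaled x10)
        col='red')
}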

We can perform linear regression in R with the command lm(..) ("linear model")

For fancier versions, we can use glm(..) ("generalized linear model")

plot(x,y, main='Regression Line')
abline( lm(y~x), 
	   lwd = 3, col='blue')
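
As a quick sanity check (not part of the original example), glm() with its default gaussian family produces the same fitted coefficients as lm():

coef( lm(y ~ x) )                       # ordinary least-squares coefficients
coef( glm(y ~ x, family = gaussian) )   # same line via the generalized linear model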

Notation:

    $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$   is the "simple linear regression model"
                                        (the assumed relationship between $x$ and $y$)

  • $y_i$ is the "response variable"
  • $x_i$ is the "predictor variable"

  • $\beta_0$ and $\beta_1$ are the "regression coefficients"
    • $\beta_0$ is the "intercept"
    • $\beta_1$ is the "slope" (in the R table this is 'x')

  • $\epsilon_i$ are the "residuals"
    • $\epsilon_i \sim N(0, \sigma^2)$
  • $b_0$ and $b_1$ are point estimates for $\beta_0$ and $\beta_1$
  • $\hat{y}_i = b_0 + b_1 x_i$ are "predicted values"
    • $\hat{y}_i$ is a point estimate for $E[y_i] = \beta_0 + \beta_1 x_i$
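
As a sketch of how this notation maps onto an lm() fit in R (the object name fit is just illustrative):

fit <- lm(y ~ x)

coef(fit)         # b0 (named "(Intercept)") and b1 (named "x")
fitted(fit)       # predicted values yhat_i = b0 + b1*x_i
residuals(fit)    # observed residuals e_i = y_i - yhat_i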

Regression Analysis:

Given data $(x_1, y_1), \ldots, (x_n, y_n)$, we can make point estimates $b_0$ and $b_1$ for the regression coefficients. An unbiased regression line will always go through $(\bar{x}, \bar{y})$, so we only need to carefully solve for one coefficient. The "best fit" coefficients can be found either using calculus (minimize squared error) or linear algebra (project $y$ onto the linear equation using a QR decomposition).

  • $b_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
  • $b_0 = \bar{y} - b_1 \bar{x}$

It is easy to remember the formula for $b_1$ because it should be a slope, and if you look at $S_{xy}/S_{xx}$ you can imagine the $x$'s "cancelling" on the top and bottom, which leaves $y/x$ = slope. (Note: this is only a mnemonic -- not proper math.)

Connection to Variance and Covariance:

  • $S_{xx} = \sum (x_i - \bar{x})^2$ is the numerator for the sample variance $s_x^2$
  • $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ is the numerator for the sample covariance $\mathrm{cov}(x,y)$

So $b_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\mathrm{cov}(x,y)}{s_x^2}$, which is what the code below computes.
## use covariance / variance to compute b1
b1 <- cov(x,y) / sd(x)^2
cat('cov/var = ', b1)
cat('\n---------------\n')

## use S_xy / S_xx to compute b1
n  <- length(x)
b1 <- (sum(x*y) - n*mean(x)*mean(y)) / (sum(x^2) - n*mean(x)^2)
cat('s_xy / s_xx = ', b1)
cat('\n-----------------\n')

## Get b0
b0 <- mean(y) - b1*mean(x)
cat('y - b1 x = ', b0)
cat('\n-----------------\n')

## compare to lm
summary( lm(y~x) )

Connection to Squared Error:

In order to analyse the precision of our regression estimate, we need to look at the standard errors of $b_0$ and $b_1$. These are usually computed in terms of "sums of squared errors" (similar to $S_{xx}$, $S_{yy}$).

  • Sum of Squared Error of Regression
      $SSR = \sum (\hat{y}_i - \bar{y})^2$   where $\hat{y}_i = b_0 + b_1 x_i$
        (This is the part of $y$'s variance coming from $x$'s variance)
  • Sum of Squared Error of Residuals
      $SSE = \sum e_i^2$   where $e_i = y_i - \hat{y}_i$
        (This is the part of $y$'s variance coming from the residuals)
  • Total Sum of Squared Error
      $SST = \sum (y_i - \bar{y})^2 = SSR + SSE$
        (This gives the total variance of $y$)

As well as "mean squared errors" (similar to variance) -- computed directly from these definitions in the sketch after this list:

  • Mean Squared Regression:   $MSR = SSR / 1$
  • Mean Squared Error:   $MSE = SSE / (n-2)$
  • Mean Squared Total (not used):   $MST = SST / (n-1)$
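
Here is a sketch that computes these quantities straight from their definitions using the fitted values from lm(); the shortcut formulas used further below avoid computing each $\hat{y}_i$:

fit  <- lm(y ~ x)
yhat <- fitted(fit)
n    <- length(y)

SSR <- sum( (yhat - mean(y))^2 )    # sum of squares due to regression
SSE <- sum( (y - yhat)^2 )          # sum of squares of the residuals
SST <- sum( (y - mean(y))^2 )       # total sum of squares (= SSR + SSE)

MSR <- SSR / 1                      # mean squared regression
MSE <- SSE / (n-2)                  # mean squared error

c(SSR=SSR, SSE=SSE, SST=SST, MSR=MSR, MSE=MSE)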

Often this is summarized in the form of an "analysis of variance" table:

Source     | df    | Sum of Squares | Mean Square
-----------|-------|----------------|-------------------
Regression | 1     | $SSR$          | $MSR = SSR/1$
Residual   | n - 2 | $SSE$          | $MSE = SSE/(n-2)$
Total      | n - 1 | $SST$          | $MST = SST/(n-1)$

To analyse the quality of our regression we want to compare $SSR$ and $SSE$. Unfortunately $SSE$ is difficult to compute directly, so we usually get it by computing $SST$ and $SSR$ first.

  • $SST = S_{yy} = \sum (y_i - \bar{y})^2 = (n-1)\, s_y^2$

  • $SSR = b_1^2\, S_{xx} = \dfrac{S_{xy}^2}{S_{xx}} = (n-1)\left(\dfrac{\mathrm{cov}(x,y)}{s_x}\right)^2$,   so   $SSE = SST - SSR$

If our regression were meaningless (i.e. $\beta_1 = 0$), then $MSR \approx MSE$. These would both be estimates of $\sigma^2$, so we could test the hypothesis

  • H0: $y$ doesn't depend on $x$,   using the statistic   $F = \dfrac{MSR}{MSE} \sim F(1,\, n-2)$

This is the 'F-statistic' and 'p-value' given at the bottom of R's summary(lm(..))

## compute sum of squared errors
SST <- (n-1)*sd(y)^2
SSR <- (n-1)*( cov(x,y) / sd(x) )^2
SSE <- SST - SSR

## compute mean squared errors
MSR <- SSR / 1
MSE <- SSE / (n-2)
MST <- SST / (n-1)

cat('Source  |  df  |   SS  |  MS  \n')
cat('-------------------------------\n')
cat(' Regres |   1  | ',      signif(SSR,3), ' | ', signif(MSR,3), '\n')
cat(' Resid  | ', n-2, ' | ', signif(SSE,3), ' | ', signif(MSE,3), '\n')
cat(' Total  | ', n-1, ' | ', signif(SST,3), ' | ', signif(MST,3), '\n')

## F-test on Y~X
F   <- MSR / MSE

cat('F-statistic: ', signif(F,4), '\n' )
cat('p-value:     ', signif(1-pf(F,1,n-2),4))

## compare to lm
cat ('\n---------------------------\n output from anova(lm)')
anova( lm(y~x) )

Variance, CI, and H0 for $\beta_0$ and $\beta_1$

We can also write the variance of the regression coefficients in terms of squared error:

  • $\widehat{\mathrm{Var}}(b_1) = \dfrac{MSE}{S_{xx}}$
  • $\widehat{\mathrm{Var}}(b_0) = MSE\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)$

Using these we can make confidence intervals and $t$-tests on $\beta_0$ and $\beta_1$:

  • H0: $\beta_0 = 0$     (not very useful test)
  • Statistic:   $t = \dfrac{b_0}{\sqrt{\widehat{\mathrm{Var}}(b_0)}} \sim t_{n-2}$
  • Conf. Int:   $b_0 \pm t_{\alpha/2,\,n-2}\,\sqrt{\widehat{\mathrm{Var}}(b_0)}$
  • H0: $\beta_1 = 0$     (very useful test!!!)
  • Statistic:   $t = \dfrac{b_1}{\sqrt{\widehat{\mathrm{Var}}(b_1)}} \sim t_{n-2}$
  • Conf. Int:   $b_1 \pm t_{\alpha/2,\,n-2}\,\sqrt{\widehat{\mathrm{Var}}(b_1)}$

Note that if $\beta_1 = 0$ then $y$ doesn't really depend on $x$. So testing H0: $\beta_1 = 0$ is another way to check whether there is much meaning in the regression analysis. Actually... it isn't hard to show that the $t$-test on H0: $\beta_1 = 0$ is equivalent to the $F$-test mentioned previously (the statistics satisfy $t^2 = F$).

b1     <- cov(x,y) / sd(x)^2
b1.err <- sqrt( MSE / sum((x-mean(x))^2) )
b1.t   <- b1 / b1.err
b1.p   <- 2*pt( - abs(b1.t), n-2 )

cat('slope  | std err  | t-value | p-value\n')
cat(signif(b1,    3), ' | ', 
	signif(b1.err,3), ' |  ', 
	signif(b1.t,  3), ' | ',
	signif(b1.p,  3), '\n')

cat('\n-------------------\n output of summary(lm):')
## compare to lm
summary( lm(y~x) )
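
The same approach works for the intercept, and the confidence intervals can be checked against R's built-in confint(); a sketch using the variance formulas above (reusing b1, b1.err, MSE, and n from the code above):

## standard error, t-value, and p-value for b0 (intercept)
b0     <- mean(y) - b1*mean(x)
b0.err <- sqrt( MSE * (1/n + mean(x)^2 / sum((x-mean(x))^2)) )
b0.t   <- b0 / b0.err
b0.p   <- 2*pt( -abs(b0.t), n-2 )

cat('intercept = ', signif(b0,3),   ' | std err = ', signif(b0.err,3),
    ' | t = ',      signif(b0.t,3), ' | p = ',       signif(b0.p,3), '\n')

## 95% confidence intervals for the slope and intercept
cat('b1 CI: ', signif(b1 + c(-1,1)*qt(0.975, n-2)*b1.err, 3), '\n')
cat('b0 CI: ', signif(b0 + c(-1,1)*qt(0.975, n-2)*b0.err, 3), '\n')

## compare to R's built-in intervals
confint( lm(y~x), level=0.95 )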