Simple Linear Regression
Linear regression is a method for analyzing paired data $(x_1, y_1), \ldots, (x_n, y_n)$.
Paired Sample $t$-test
- Assume the difference $Y_i - X_i$ is $\text{Normal}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown.

Linear Regression
- Assume $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$ and $\beta_0$, $\beta_1$, $\sigma^2$ are unknown.

(A short R sketch contrasting the two fits appears after this list.)
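Here is that sketch, using a small made-up paired dataset (the numbers are placeholders, not real measurements): the paired $t$-test asks whether the mean difference is zero, while the regression asks how one variable changes with the other.
before <- c(12.1, 14.3, 11.8, 13.5, 12.9, 15.0)   # hypothetical 'before' measurements
after  <- c(13.0, 15.1, 12.2, 14.4, 13.6, 15.8)   # hypothetical 'after' measurements
t.test(after, before, paired=TRUE)   # paired t-test: is the mean difference 0?
summary( lm(after ~ before) )        # regression: how does 'after' change with 'before'?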
Frequently linear regression is used when there is a causal ("cause-and-effect") relationship between $X$ and $Y$:
- $X$ = age of a child and $Y$ = height of the child
- $X$ = thickness of a bar and $Y$ = strength of the bar
- $X$ = time of year and $Y$ = avg power output of a solar panel
Graphing
Linear regression relationships are usually graphed as a scatterplot of the data with the "best fit" or "regression" line drawn through the middle.
x <- runif(50, 5,25) # independent variable ("predictor" or "regressor")
e <- rnorm(50, 0, 3) # random error ~ Normal(mean=0,sd=3)
y <- (5 + x/2) + e # response variable
plot(x,y) # scatter plot of data
abline(5, 1/2, # y = 5 + x/2 ("center-line" of data)
lwd=3, col='blue')
In regression analysis, we imagine that each observed $y_i$ was generated in exactly this way: a point $\beta_0 + \beta_1 x_i$ on an underlying line plus a random Normal error $\varepsilon_i$.
We can perform linear regression in R with the command lm(..) ("linear model")
For fancier versions, we can use glm(..) ("generalized linear model")
plot(x,y, main='Regression Line')
abline( lm(y~x),
lwd = 3, col='blue')
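As a side note, glm(..) with its default Gaussian family fits the same model as lm(..); a minimal check (using the simulated x and y from above):
coef( lm(y ~ x) )    # least-squares fit
coef( glm(y ~ x) )   # same coefficients; family = gaussian() is the default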
Notation:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (the assumed relationship between $X$ and $Y$)

- $Y$ is the "response variable"
- $X$ is the "predictor variable"
- $\beta_0$ and $\beta_1$ are the "regression coefficients": $\beta_0$ is the "intercept", $\beta_1$ is the "slope" (in the R table this is 'x')
- $\varepsilon_i$ are the "residuals"
- $b_0$ and $b_1$ are point estimates for $\beta_0$ and $\beta_1$
- $\hat{y}_i = b_0 + b_1 x_i$ are "predicted values"; $\hat{y}_i$ is a point estimate for the mean response $\beta_0 + \beta_1 x_i$
Regression Analysis:

Given data $(x_1, y_1), \ldots, (x_n, y_n)$, the least-squares estimates of the slope and intercept are

$$ b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$

where $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - n\bar{x}\bar{y}$ and $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_i x_i^2 - n\bar{x}^2$.

It is easy to remember the formula for $b_1$ through its connection to variance and covariance.

Connection to Variance and Covariance:
- $S_{xx}$ is the numerator for the sample variance of $x$: $\operatorname{var}(x) = S_{xx}/(n-1)$
- $S_{xy}$ is the numerator for the sample covariance of $x$ and $y$: $\operatorname{cov}(x,y) = S_{xy}/(n-1)$

so $b_1 = S_{xy}/S_{xx} = \operatorname{cov}(x,y)/\operatorname{var}(x)$, since the $(n-1)$ factors cancel.
## use covariance / variance to compute b1
b1 <- cov(x,y) / sd(x)^2
cat('cov/var = ', b1)
cat('\n---------------\n')
## use S_xy / S_xx to compute b1
n <- length(x)
b1 <- (sum(x*y) - n*mean(x)*mean(y)) / (sum(x^2) - n*mean(x)^2)
cat('s_xy / s_xx = ', b1)
cat('\n-----------------\n')
## Get b0
b0 <- mean(y) - b1*mean(x)
cat('ybar - b1*xbar = ', b0)
cat('\n-----------------\n')
## compare to lm
summary( lm(y~x) )
Connection to Squared Error:

In order to analyse the precision of our regression estimate, we need to look at the standard error of our estimates $b_0$ and $b_1$; these are built from several sums of squared errors.

Sum of Squared Error of Regression
$$ SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 $$
where $\hat{y}_i = b_0 + b_1 x_i$ are the predicted values.
(This is the part of $y$'s variance coming from $x$'s variance.)

Sum of Squared Error of Residuals
$$ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$
where $y_i - \hat{y}_i$ are the estimated residuals.
(This is the part of $y$'s variance coming from the residuals.)

Total Sum of Squared Error
$$ SST = \sum_{i=1}^n (y_i - \bar{y})^2 = SSR + SSE $$
(This gives the total variance of $y$.)

As well as "mean squared errors" (similar to variance):
- Mean Squared Regression: $MSR = SSR / 1$
- Mean Squared Error: $MSE = SSE / (n-2)$
- Mean Squared Total: $MST = SST / (n-1)$ (not used)
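As a sanity check on the definitions, here is a minimal sketch computing the sums of squares directly from the fitted values (using the x and y from above); the shortcut formulas via variance and covariance are used further below.
fit  <- lm(y ~ x)
yhat <- fitted(fit)                 # predicted values
SSR  <- sum( (yhat - mean(y))^2 )   # regression sum of squares
SSE  <- sum( (y - yhat)^2 )         # residual sum of squares
SST  <- sum( (y - mean(y))^2 )      # total sum of squares
cat('SSR + SSE =', SSR + SSE, '   SST =', SST, '\n')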
Often this is summarized in the form of an "analysis of variance" table:

Source | df | Sum of Squares | Mean Square |
---|---|---|---|
Regression | $1$ | $SSR$ | $MSR = SSR/1$ |
Residual | $n-2$ | $SSE$ | $MSE = SSE/(n-2)$ |
Total | $n-1$ | $SST$ | |
To analyse the quality of our regression we want $SSR$ to be large compared to $SSE$, i.e. most of the variance of $y$ should be explained by the regression. If our regression was meaningless, then $\beta_1 = 0$ and the regression would explain nothing beyond random noise. We can test

- H0: $\beta_1 = 0$ ($Y$ doesn't depend on $X$) using the statistic

$$ F = \frac{MSR}{MSE} \sim F(1,\, n-2) \quad \text{under H0} $$

This is the 'F-statistic' and 'p-value' given at the bottom of R's summary(lm(..))
## compute sum of squared errors
SST <- (n-1)*sd(y)^2
SSR <- (n-1)*( cov(x,y) / sd(x) )^2
SSE <- SST - SSR
## compute mean squared errors
MSR <- SSR / 1
MSE <- SSE / (n-2)
MST <- SST / (n-1)
cat('Source | df | SS | MS \n')
cat('-------------------------------\n')
cat(' Regres | 1 | ', signif(SSR,3), ' | ', signif(MSR,3), '\n')
cat(' Resid | ', n-2, ' | ', signif(SSE,3), ' | ', signif(MSE,3), '\n')
cat(' Total | ', n-1, ' | ', signif(SST,3), ' | ', signif(MST,3), '\n')
## F-test on Y~X
F <- MSR / MSE
cat('F-statistic: ', signif(F,4), '\n' )
cat('p-value: ', signif(1-pf(F,1,n-2),4))
## compare to lm
cat('\n---------------------------\noutput from anova(lm):\n')
anova( lm(y~x) )
Variance, CI, and H0 for $\beta_0$ and $\beta_1$

We can also write the variance of the regression coefficients in terms of Squared Error:

$$ \operatorname{se}(b_0)^2 = MSE \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right), \qquad \operatorname{se}(b_1)^2 = \frac{MSE}{S_{xx}} $$

where $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$ as before.

Using these we can make confidence intervals and hypothesis tests:

- H0: $\beta_0 = 0$ (not very useful test)
  - Statistic: $t = b_0 / \operatorname{se}(b_0) \sim t(n-2)$ under H0
  - Conf. Int: $b_0 \pm t_{\alpha/2,\, n-2} \cdot \operatorname{se}(b_0)$
- H0: $\beta_1 = 0$ (very useful test!!!)
  - Statistic: $t = b_1 / \operatorname{se}(b_1) \sim t(n-2)$ under H0
  - Conf. Int: $b_1 \pm t_{\alpha/2,\, n-2} \cdot \operatorname{se}(b_1)$
Note that if H0: $\beta_1 = 0$ is true, then $Y$ does not depend on $X$ at all, which is why the test for $\beta_1$ is the useful one.
b1 <- cov(x,y) / sd(x)^2
b1.err <- sqrt( MSE / sum((x-mean(x))^2) )
b1.t <- b1 / b1.err
b1.p <- 2*pt( - abs(b1.t), n-2 )
cat('slope | std err | t-value | p-value\n')
cat(signif(b1, 3), ' | ',
signif(b1.err,3), ' | ',
signif(b1.t, 3), ' | ',
signif(b1.p, 3), '\n')
cat('\n-------------------\n output of summary(lm):')
## compare to lm
summary( lm(y~x) )
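To finish the comparison with the formulas above, here is a short sketch computing the standard error of $b_0$ and 95% confidence intervals for both coefficients, checked against R's built-in confint(..); it reuses b1, b1.err, MSE, and n from the code above.
## standard error of b0 and 95% confidence intervals
b0     <- mean(y) - b1*mean(x)
b0.err <- sqrt( MSE * (1/n + mean(x)^2 / sum((x-mean(x))^2)) )
t.crit <- qt(0.975, n-2)           # critical value for a 95% interval

cat('b0: ', signif(b0,3), ' +/- ', signif(t.crit*b0.err,3), '\n')
cat('b1: ', signif(b1,3), ' +/- ', signif(t.crit*b1.err,3), '\n')

## compare to R's built-in confidence intervals
confint( lm(y~x), level=0.95 )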