Linear Regression in R
In this tutorial, you will learn the basics behind a very popular statistical model: linear regression.
- What is a linear regression?
- Creating a linear regression in R.
- Learn the concepts of coefficients and residuals.
- How to test if your linear model has a good fit?
- Detecting influential points.
What is a linear regression?
A linear regression is a statistical model that analyzes the relationship between a response variable (often called y) and one or more explanatory variables and their interactions (often called x). You make these kinds of relationships in your head all the time; for example, when you estimate the age of a child based on her height, you are assuming that the older she is, the taller she will be. Linear regression is one of the most basic statistical models out there: its results can be interpreted by almost everyone, and it has been around since the 19th century. This is precisely what makes linear regression so popular. It’s simple, and it has survived hundreds of years. Even though it is not as sophisticated as other algorithms like artificial neural networks or random forests, according to a survey by KD Nuggets, regression was the algorithm most used by data scientists in 2016 and 2017. It’s even been predicted that it will still be in use in the year 2118!
Creating a Linear Regression in R
Not every problem can be solved with the same algorithm. In this case, linear regression assumes that there exists a linear relationship between the response variable and the explanatory variables. This means that you can fit a line between the two (or more) variables. In the previous example, it is clear that there is a relationship between the age of children and their height.
In this particular example, you can calculate the height of a child if you know her age:
Height = a + b * Age
In this case, “a” and “b” are called the intercept and the slope, respectively. With the same example, “a”, the intercept, is the value from where you start measuring. Newborn babies with zero months are not necessarily zero centimeters tall; this is the role of the intercept. The slope measures the change of height with respect to age in months. In general, for every month older the child is, her height will increase by “b”.
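For instance, with made-up numbers: if the intercept “a” were 65 cm and the slope “b” were 0.6 cm per month, the model would predict a height of 65 + 0.6 * 10 = 71 cm for a 10-month-old child.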
A linear regression can be calculated in R with the command lm. In the next example, use this command to calculate the height based on the age of the child.
First, import the library readxl to read Microsoft Excel files. The data can be in any kind of format, as long as R can read it. To know more about importing data to R, you can take this DataCamp course.
The data to use for this tutorial can be downloaded here. Read the data into an object called ageandheight and then create the linear regression in the third line of code. The lm command takes the variables in the format:
lm([target variable] ~ [predictor variables], data = [data source])
After this, it's useful to look at a summary with summary([name of the linear regression])
library(readxl)
# Upload the data
ageandheight <- read_excel("data/ageandheight.xls", sheet = "Hoja2")
# Create the linear regression
lmHeight = lm(height~age, data = ageandheight)
# Review the results
summary(lmHeight)
With the command summary(lmHeight), you can see detailed information on the model’s performance and coefficients.
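If you only need the intercept and the slope, you can also extract them from the fitted model directly with the coef function. A minimal sketch, reusing the lmHeight object created above:
# Extract the estimated coefficients as a named vector
coef(lmHeight)
# Store the intercept ("a") and the slope for age ("b") for later use
a <- coef(lmHeight)["(Intercept)"]
b <- coef(lmHeight)["age"]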
Coefficients
In the red square, you can see the values of the intercept (the “a” value) and the slope (the “b” value) for age. These “a” and “b” values define the line that runs through the data points. So in this case, if there is a child that is 20.5 months old, a is 64.92 and b is 0.635, and the model predicts (on average) that her height in centimeters is around 64.92 + (0.635 * 20.5) = 77.93 cm.
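Rather than doing this arithmetic by hand, you can ask the fitted model for a prediction with R’s built-in predict function. A quick sketch, reusing the lmHeight model from above:
# Predict the height (in cm) of a 20.5-month-old child
predict(lmHeight, newdata = data.frame(age = 20.5))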
When a regression takes into account two or more predictors, it’s called a multiple linear regression. By the same logic you used in the simple example before, the height of the child is going to be modeled by:
Height = a + b1 * Age + b2 * (Number of Siblings)
You are now looking at the height as a function of the age in months and the number of siblings the child has. In the image above, the red rectangle indicates the coefficients (b1 and b2). You can interpret these coefficients in the following way:
When comparing children with the same number of siblings, the average predicted height increases by 0.63 cm for every additional month of age. In the same way, when comparing children of the same age, the height decreases (because the coefficient is negative) by 0.01 cm for each additional sibling.
In R, to include another variable in the model, add the symbol "+" followed by the name of each additional predictor you want in the formula.
# Create a linear regression with two variables
lmHeight2 = lm(height~age + no_siblings, data = ageandheight)
# Review the results
summary(lmHeight2)
As you might have noticed already, looking at the number of siblings is a silly way to predict the height of a child. Another aspect to pay attention to in your linear models is the p-value of the coefficients. In the previous example, the blue rectangle indicates the p-values for the coefficients age and number of siblings. In simple terms, a p-value indicates whether or not you can reject a hypothesis. The hypothesis, in this case, is that the predictor is not meaningful for your model.
- The p-value for age is 4.34e-10 or 0.000000000434. A very small value means that age is probably an excellent addition to your model.
- The p-value for the number of siblings is 0.85. Such a large value gives you no evidence that this predictor is meaningful for the regression.
A standard way to test whether a predictor is meaningful is to check if its p-value is smaller than 0.05.
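If you prefer to check these p-values programmatically instead of reading them off the printed summary, they are stored in the coefficient table of the summary object. A minimal sketch with the lmHeight2 model:
# The coefficient table holds estimates, standard errors, t-values and p-values
coefs <- summary(lmHeight2)$coefficients
# The p-values live in the column named "Pr(>|t|)"
coefs[, "Pr(>|t|)"]
# Flag which predictors fall below the usual 0.05 threshold
coefs[, "Pr(>|t|)"] < 0.05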
Residuals
A good way to test the quality of the fit of the model is to look at the residuals, that is, the differences between the real values and the predicted values. The straight line in the image above represents the predicted values, and the red vertical line from the straight line to the observed data value is the residual.
The idea here is that the sum of the residuals is approximately zero, or at least as low as possible. In real life, most cases will not follow a perfectly straight line, so residuals are expected. In the R summary of the lm function, you can see descriptive statistics about the residuals of the model; following the same example, the red square shows how the residuals are approximately zero.
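You can also inspect the residuals directly; for a least squares fit with an intercept, their sum is zero up to rounding error. A short sketch with the lmHeight model:
# Residuals: observed heights minus the heights predicted by the model
resid(lmHeight)
# This sum should be essentially zero
sum(resid(lmHeight))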
How to test if your linear model has a good fit?
One measure used to test how good your model is, is the coefficient of determination, or R². This measure is defined as the proportion of the total variability that is explained by the regression model.
This can seem a little bit complicated, but in general, for models that fit the data well, R² is near 1. Models that poorly fit the data have R² near 0. In the examples above, the one on the left has an R² of 0.02; this means that the model explains only 2% of the data variability. The one on the right has an R² of 0.99, and the model can explain 99% of the total variability.
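To make the definition concrete, you can reproduce R² by hand: it is one minus the ratio of the residual sum of squares to the total sum of squares. A sketch using the lmHeight model and the ageandheight data from before:
# Residual sum of squares: the variability the model fails to explain
ss_res <- sum(resid(lmHeight)^2)
# Total sum of squares: the variability of the heights around their mean
ss_tot <- sum((ageandheight$height - mean(ageandheight$height))^2)
# R-squared: the proportion of the total variability explained by the model
1 - ss_res / ss_tot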
However, it’s essential to keep in mind that a high R² is not always good (see the residual plots below) and a low R² is not always bad. In real life, events don’t fit a perfectly straight line all the time. For example, you can have taller or shorter children with the same age in your data. In some fields, an R² of 0.5 is considered good.
With the same example as above, look at the summary of the linear model to see its R².
In the blue rectangle, notice that there are two different R² values, one multiple and one adjusted. The multiple R² is the one you saw previously. One problem with this R² is that it cannot decrease as you add more independent variables to your model; it will keep increasing as you make the model more complex, even if the new variables don’t add anything to your predictions (like the number of siblings in the example). For this reason, the adjusted R² is probably better to look at if you are adding more than one variable to the model, since it only increases if the new variable reduces the overall error of the predictions.
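Both values are stored in the summary object, so you can compare them side by side. A short sketch with the two models fitted earlier:
# Multiple and adjusted R-squared for the one-predictor model
summary(lmHeight)$r.squared
summary(lmHeight)$adj.r.squared
# Adding the meaningless number of siblings should barely move the adjusted value
summary(lmHeight2)$r.squared
summary(lmHeight2)$adj.r.squared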
Don’t forget to look at the residuals!
You can have a pretty good R² in your model, but let’s not rush to conclusions here. Let’s see an example. You are going to predict the pressure of a material in a laboratory based on its temperature.
Let’s plot the data (in a simple scatterplot) and add the line you built with your linear model. In this example, first let R read the data, again with the read_excel command, to create a dataframe with the data, and then create a linear regression with your new data. The command plot takes a data frame and plots the variables in it. In this case, it plots the pressure against the temperature of the material. Then, add the line fitted by the linear regression with the command abline.
# Upload the data
pressure <- read_excel("data/pressure.xlsx")
# Create the linear regression
lmTemp = lm(Pressure~Temperature, data = pressure)
# Plot the results
plot(pressure, pch = 16, col = "blue")
# Add a regression line
abline(lmTemp)
If you look at the summary of your new model, you can see that it reports pretty good results (look at the R² and the adjusted R²):
summary(lmTemp)
Ideally, when you plot the residuals, they should look random. Otherwise, it may mean that there is a hidden pattern that the linear model is not capturing.
To plot the residuals, use the command plot(lmTemp$residuals).
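For example, a minimal sketch that also draws a horizontal reference line at zero, which makes any leftover pattern easier to spot:
# Plot the residuals of the temperature model
plot(lmTemp$residuals, pch = 16, col = "blue")
# Add a horizontal reference line at zero
abline(h = 0, lty = 2)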