What is a Linear Regression?
A linear regression is a statistical model that analyzes the relationship between a response variable (often called y) and one or more variables and their interactions (often called x or explanatory variables). You make this kind of relationship in your head all the time, for example, when you calculate the age of a child based on their height, you are assuming the older they are, the taller they will be.
Linear regression is one of the most basic statistical models out there, its results can be interpreted by almost everyone, and it has been around since the 19th century. This is precisely what makes linear regression so popular. It’s simple, and it has survived for hundreds of years. Even though it is not as sophisticated as other algorithms like artificial neural networks or random forests, according to a survey made by KD Nuggets, regression was the algorithm most used by data scientists in 2016 and 2017. It’s even predicted it will still be used in the year 2118!
In this linear regression tutorial, we will explore how to create a linear regression in R, looking at the steps you'll need to take with an example you can work through.
Creating a Linear Regression in R
Not every problem can be solved with the same algorithm. In this case, linear regression assumes that there exists a linear relationship between the response variable and the explanatory variables. This means that you can fit a line between the two (or more variables). In the previous example of the child's age, it is clear that there is a relationship between the age of children and their height.
In this particular example, you can calculate the height of a child if you know her age:
In this case, “a” and “b” are called the intercept and the slope, respectively. With the same example, “a” or the intercept, is the value from where you start measuring. Newborn babies with zero months are not zero centimeters necessarily; this is the function of the intercept. The slope measures the change in height with respect to the age in months. In general, for every month older the child is, their height will increase with “b”.
lm in R
A linear regression can be calculated in R with the command
lm. In the next example, use this command to calculate the height based on the age of the child.
First, import the library
readxl to read Microsoft Excel files, it can be any kind of format, as long R can read it. To know more about importing data to R, you can take this DataCamp course.
You can download the data to use for this tutorial before you get started. Download the data to an object called
ageandheight and then create the linear regression in the third line.
lm function in R:
lm command takes the variables in the format:
lm([target] ~ [predictor / features], data = [data source])
With the command
summary(lmHeight) you can see detailed information on the model’s performance and coefficients.
library(readxl) ageandheight <- read_excel("ageandheight.xls", sheet = "Hoja2") #Upload the data lmHeight = lm(height~age, data = ageandheight) #Create the linear regression summary(lmHeight) #Review the results
Start Learning R For Free
In the red square, you can see the values of the intercept (“a” value) and the slope (“b” value) for the age. These “a” and “b” values plot a line between all the points of the data. So, in this case, if there is a child that is 20.5 months old, a is 64.92, and b is 0.635, the model predicts (on average) that its height in centimeters is around 64.92 + (0.635 * 20.5) = 77.93 cm.
When a regression takes into account two or more predictors to create the linear regression, it’s called multiple linear regression. By the same logic you used in the simple example before, the height of the child is going to be measured by:
Height = a + Age × b1 + (Number of Siblings} × b2
You are now looking at the height as a function of the age in months and the number of siblings the child has. In the image above, the red rectangle indicates the coefficients (b1 and b2). You can interpret these coefficients in the following way:
When comparing children with the same number of siblings, the average predicted height increases in 0.63 cm for every month the child has. The same way, when comparing children with the same age, the height decreases (because the coefficient is negative) in -0.01 cm for each increase in the number of siblings.
In R, to add another coefficient, add the symbol "+" for every additional variable you want to add to the model.
lmHeight2 = lm(height~age + no_siblings, data = ageandheight) #Create a linear regression with two variables summary(lmHeight2) #Review the results
As you might notice already, looking at the number of siblings is a silly way to predict the height of a child. Another aspect to pay attention to in your linear models is the p-value of the coefficients. In the previous example, the blue rectangle indicates the p-values for the coefficients age and number of siblings. In simple terms, a p-value indicates whether or not you can reject or accept a hypothesis. The hypothesis, in this case, is that the predictor is not meaningful for your model.
- The p-value for age is 4.34*e-10 or 0.000000000434. A very small value means that age is probably an excellent addition to your model.
- The p-value for the number of siblings is 0.85. In other words, there’s an 85% chance that this predictor is not meaningful for the regression.
A standard way to test if the predictors are not meaningful is by looking if the p-values smaller than 0.05.
A good way to test the quality of the fit of the model is to look at the residuals or the differences between the real values and the predicted values. The straight line in the image above represents the predicted values. The red vertical line from the straight line to the observed data value is the residual.
The idea here is that the sum of the residuals is approximately zero or as low as possible. In real life, most cases will not follow a perfectly straight line, so residuals are expected. In the R summary of the lm function, you can see descriptive statistics about the residuals of the model, following the same example, the red square shows how the residuals are approximately zero.
How to Test if your Linear Model has a Good Fit
One measure very used to test how good your model is is the coefficient of determination or R². This measure is defined by the proportion of the total variability explained by the regression model.
This can seem a little bit complicated, but in general, for models that fit the data well, R² is near 1. Models that poorly fit the data have R² near 0. In the examples below, the first one has an R² of 0.02; this means that the model explains only 2% of the data variability. The second one has an R² of 0.99, and the model can explain 99% of the total variability.**
However, it’s essential to keep in mind that sometimes a high R² is not necessarily good every single time (see below residual plots) and a low R² is not necessarily always bad. In real life, events don’t fit in a perfectly straight line all the time. For example, you can have in your data taller or smaller children with the same age. In some fields, an R² of 0.5 is considered good.
With the same example as above, look at the summary of the linear model to see its R².
In the blue rectangle, notice that there are two different R², one multiple and one adjusted. The multiple is the R² that you saw previously. One problem with this R² is that it cannot decrease as you add more independent variables to your model, it will continue increasing as you make the model more complex, even if these variables don’t add anything to your predictions (like the example of the number of siblings). For this reason, the adjusted R² is probably better to look at if you are adding more than one variable to the model since it only increases if it reduces the overall error of the predictions.
Don’t Forget to Look at the Residuals!
You can have a pretty good R² in your model, but let’s not rush to conclusions here. Let’s see an example. You are going to predict the pressure of a material in a laboratory based on its temperature.
Let’s plot the data (in a simple scatterplot) and add the line you built with your linear model. In this example, let R read the data first, again with the
read_excel command, to create a dataframe with the data, then create a linear regression with your new data. The command
plot takes a data frame and plots the variables on it. In this case, it plots the pressure against the temperature of the material. Then, add the line made by the linear regression with the command
pressure <- read_excel("pressure.xlsx") #Upload the data lmTemp = lm(Pressure~Temperature, data = pressure) #Create the linear regression plot(pressure, pch = 16, col = "blue") #Plot the results abline(lmTemp) #Add a regression line
If you see the summary of your new model, you can see that it has pretty good results (look at the R²and the adjusted R²)
Call: lm(formula = Pressure ~ Temperature, data = pressure) Residuals: Min 1Q Median 3Q Max -41.85 -34.72 -10.90 24.69 63.51 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -81.5000 29.1395 -2.797 0.0233 * Temperature 4.0309 0.4696 8.583 2.62e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 42.66 on 8 degrees of freedom Multiple R-squared: 0.902, Adjusted R-squared: 0.8898 F-statistic: 73.67 on 1 and 8 DF, p-value: 2.622e-05
Ideally, when you plot the residuals, they should look random. Otherwise means that maybe there is a hidden pattern that the linear model is not considering. To plot the residuals, use the command
plot(lmTemp$residuals, pch = 16, col = "red")
This can be a problem. If you have more data, your simple linear model will not be able to generalize well. In the previous picture, notice that there is a pattern (like a curve on the residuals). This is not random at all.
What you can do is a transformation of the variable. Many possible transformations can be performed on your data, such as adding a quadratic term (x2), a cubic (x3), or even more complex such as ln(X), ln(X+1), sqrt(X), 1/x, Exp(X). The choice of the correct transformation will come with some knowledge of algebraic functions, practice, trial, and error.
Let’s try with a quadratic term. For this, add the term “I” (capital "I") before your transformation, for example, this will be the normal linear regression formula:
lmTemp2 = lm(Pressure~Temperature + I(Temperature^2), data = pressure) #Create a linear regression with a quadratic coefficient summary(lmTemp2) #Review the results
Call: lm(formula = Pressure ~ Temperature + I(Temperature^2), data = pressure) Residuals: Min 1Q Median 3Q Max -4.6045 -1.6330 0.5545 1.1795 4.8273 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 33.750000 3.615591 9.335 3.36e-05 *** Temperature -1.731591 0.151002 -11.467 8.62e-06 *** I(Temperature^2) 0.052386 0.001338 39.158 1.84e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.074 on 7 degrees of freedom Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994 F-statistic: 7859 on 2 and 7 DF, p-value: 1.861e-12
Notice that the model improved significantly. If you plot the residuals of the new model, they will look like this:
plot(lmTemp2$residuals, pch = 16, col = "red")
Now you don’t see any clear patterns on your residuals, which is good!
Detect Influential Points
In your data, you may have influential points that might skew your model, sometimes unnecessarily. Think of a mistake in the data entry, and instead of writing “2.3” the value was “23”. The most common kind of influential point are the outliers, which are data points where the observed response does not appear to follow the pattern established by the rest of the data.
You can detect influential points by looking at the object containing the linear model, using the function
cooks.distance and then plot these distances. Change a value on purpose to see how it looks on the Cooks Distance plot. To change a specific value, you can directly point at it with
ageandheight[row number, column number] = [new value]. In this case, the height is changed to 7.7 of the second example:
ageandheight[2, 2] = 7.7 head(ageandheight)
You create the model again and see how the summary is giving a bad fit, and then plot the Cooks Distances. For this, after creating the linear regression, use the command
cooks.distance([linear model] , and then, if you want, you can plot these distances with the command
lmHeight3 = lm(height~age, data = ageandheight)#Create the linear regression summary(lmHeight3)#Review the results plot(cooks.distance(lmHeight3), pch = 16, col = "blue") #Plot the Cooks Distances.
Call: lm(formula = height ~ age, data = ageandheight) Residuals: Min 1Q Median 3Q Max -53.704 -2.584 3.609 9.503 17.512 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.905 38.319 0.206 0.841 age 2.816 1.613 1.745 0.112 Residual standard error: 19.29 on 10 degrees of freedom Multiple R-squared: 0.2335, Adjusted R-squared: 0.1568 F-statistic: 3.046 on 1 and 10 DF, p-value: 0.1115
Notice that there is a point that does not follow the pattern, and it might be affecting the model. Here you can make decisions on this point, in general, there are three reasons why a point is so influential:
- Someone made a recording error
- Someone made a fundamental mistake collecting the observation
- The data point is perfectly valid, in which case the model cannot account for the behavior.
If the case is 1 or 2, then you can remove the point (or correct it). If it's 3, it's not worth deleting a valid point; maybe you can try on a non-linear model rather than a linear model like linear regression.
Beware that an influential point can be a valid point, be sure to check the data and its source before deleting it. It’s common to see in statistics books this quote: “Sometimes we throw out perfectly good data when we should be throwing out questionable models.”
You made it to the end! Linear regression is a big topic, and it's here to stay. Here, we've presented a few tricks that can help to tune and take the most advantage of such a powerful yet simple algorithm. You also learned how to understand what's behind this simple statistical model and how you can modify it according to your needs. You can also explore other options by typing
?lm on the R console and looking at the different parameters not covered here. Check out our Regularization Tutorial: Ridge, Lasso and Elastic Net. If you are interested in diving into statistical models, go ahead and check the course on Statistical Modeling in R.
Run and edit the code from this tutorial onlineOpen Workspace
How to Choose The Right Data Science Bootcamp in 2023 (With Examples)
DataCamp Portfolio Challenge: Win $500 Publishing Your Best Work
What is Microsoft Fabric?
Chroma DB Tutorial: A Step-By-Step Guide
Performance and Scalability Unleashed: Mastering Single Table Database Design with DynamoDB
Introduction to Non-Linear Model and Insights Using R