
Linear Regression in R Tutorial

In this tutorial, you will learn the basics behind a very popular statistical model: linear regression.
Updated Dec 2022  · 15 min read

 


What is a Linear Regression?

A linear regression is a statistical model that analyzes the relationship between a response variable (often called y) and one or more explanatory variables and their interactions (often called x). You make this kind of relationship in your head all the time: for example, when you estimate the age of a child from their height, you are assuming that the older they are, the taller they will be.

Linear regression is one of the most basic statistical models out there, its results can be interpreted by almost everyone, and it has been around since the 19th century. This is precisely what makes linear regression so popular. It’s simple, and it has survived for well over a century. Even though it is not as sophisticated as other algorithms like artificial neural networks or random forests, according to a survey by KDnuggets, regression was the algorithm most used by data scientists in 2016 and 2017. It’s even predicted it will still be used in the year 2118!

In this linear regression tutorial, we will explore how to create a linear regression in R, looking at the steps you'll need to take with an example you can work through. 

Practice Linear Regression in R with this hands-on exercise.

Creating a Linear Regression in R

Not every problem can be solved with the same algorithm. In this case, linear regression assumes that there exists a linear relationship between the response variable and the explanatory variables. This means that you can fit a line between the two (or more) variables. In the previous example of the child's age, it is clear that there is a relationship between the age of children and their height.


In this particular example, you can calculate the height of a child if you know her age:

Height = a + Age × b

In this case, “a” and “b” are called the intercept and the slope, respectively. In the same example, “a”, the intercept, is the value where you start measuring: newborn babies at zero months are not necessarily zero centimeters tall, and that is what the intercept accounts for. The slope measures the change in height with respect to the age in months. In general, for every month older the child is, their height will increase by “b”.

lm in R

A linear regression can be calculated in R with the command lm. In the next example, use this command to calculate the height based on the age of the child.

First, load the readxl library to read Microsoft Excel files. The data can be in any format, as long as R can read it. To learn more about importing data into R, you can take this DataCamp course.

You can download the data to use for this tutorial before you get started. Read the data into an object called ageandheight and then create the linear regression in the third line.

The lm function in R takes the variables in the following format:

lm([target] ~ [predictor / features], data = [data source])

With the command summary(lmHeight) you can see detailed information on the model’s performance and coefficients.

library(readxl)
ageandheight <- read_excel("ageandheight.xls", sheet = "Hoja2") #Upload the data
lmHeight = lm(height~age, data = ageandheight) #Create the linear regression
summary(lmHeight) #Review the results
(Image: the summary(lmHeight) output, with the intercept and the age coefficient highlighted in red.)


Coefficients

In the red square, you can see the values of the intercept (“a” value) and the slope (“b” value) for age. These “a” and “b” values define a line through the points of the data. So, in this case, if there is a child that is 20.5 months old, a is 64.92 and b is 0.635, the model predicts (on average) that their height in centimeters is around 64.92 + (0.635 × 20.5) ≈ 77.94 cm.
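You can let R do this arithmetic for you with the predict function. This is just a quick sketch, assuming the lmHeight model fitted above; the data frame of new ages is made up for illustration:

new_child <- data.frame(age = 20.5) #A hypothetical new observation: a 20.5-month-old child
predict(lmHeight, newdata = new_child) #Returns roughly 77.94, matching the manual calculation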

When a regression model takes into account two or more predictors, it is called multiple linear regression. By the same logic you used in the simple example before, the height of the child is now modeled as:

Height = a + Age × b1 + Number of Siblings × b2

You are now looking at the height as a function of the age in months and the number of siblings the child has. In the summary output below, the red rectangle indicates the coefficients (b1 and b2). You can interpret these coefficients in the following way:

When comparing children with the same number of siblings, the predicted height increases, on average, by 0.63 cm for every additional month of age. In the same way, when comparing children of the same age, the predicted height decreases (because the coefficient is negative) by 0.01 cm for each additional sibling.

In R, to add another predictor, add the symbol "+" for every additional variable you want to include in the model.

lmHeight2 = lm(height~age + no_siblings, data = ageandheight) #Create a linear regression with two variables
summary(lmHeight2) #Review the results

(Image: the summary(lmHeight2) output, with the coefficients highlighted in red and their p-values in blue.)

As you might notice already, looking at the number of siblings is a silly way to predict the height of a child. Another aspect to pay attention to in your linear models is the p-value of the coefficients. In the previous example, the blue rectangle indicates the p-values for the coefficients age and number of siblings. In simple terms, the p-value tells you whether you can reject the hypothesis being tested. The hypothesis, in this case, is that the predictor is not meaningful for your model.

  • The p-value for age is 4.34e-10, or 0.000000000434. A very small value means that age is probably an excellent addition to your model.
  • The p-value for the number of siblings is 0.85. Such a large value means there is essentially no evidence that this predictor is meaningful for the regression.

A standard rule of thumb is to consider a predictor meaningful when its p-value is smaller than 0.05.
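If you prefer to check these values in code rather than reading them off the printed summary, the coefficient table (estimates, standard errors, t values, and p-values) can be pulled out of the summary object. A small sketch, assuming the lmHeight2 model fitted above:

coefs <- summary(lmHeight2)$coefficients #Matrix with Estimate, Std. Error, t value and Pr(>|t|)
coefs[, "Pr(>|t|)"] #The p-values for the intercept, age and no_siblings
coefs[, "Pr(>|t|)"] < 0.05 #TRUE for the terms that pass the usual 0.05 threshold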

Residuals

A good way to test the quality of the fit of the model is to look at the residuals, i.e., the differences between the observed values and the predicted values. The straight line in the image below represents the predicted values, and the red vertical line from the straight line to the observed data value is the residual.

(Image: a scatterplot with the fitted regression line; a red vertical segment from the line to an observed point marks a residual.)

The idea here is that the residuals should be small and centered around zero. In real life, most cases will not follow a perfectly straight line, so residuals are expected. In the R summary of the lm function, you can see descriptive statistics about the residuals of the model; following the same example, the red square shows that the residuals are approximately centered at zero.

(Image: the model summary with the residual statistics highlighted in red.)
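You can also look at the residuals directly instead of only through the summary. A minimal sketch, again assuming the lmHeight model from before:

res <- residuals(lmHeight) #Observed heights minus the heights predicted by the model
summary(res) #The same descriptive statistics reported by summary(lmHeight)
sum(res) #For a model with an intercept this sum is essentially zero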

How to Test if your Linear Model has a Good Fit

A widely used measure of how well your model fits the data is the coefficient of determination, or R². It is defined as the proportion of the total variability that is explained by the regression model.

R² = explained variability / total variability = 1 − (residual sum of squares / total sum of squares)

This can seem a little complicated, but in general, models that fit the data well have an R² near 1, and models that fit the data poorly have an R² near 0. In the examples below, the first one has an R² of 0.02; this means that the model explains only 2% of the data variability. The second one has an R² of 0.99, so the model explains 99% of the total variability.

(Image: two example scatterplots with fitted lines, one with R² = 0.02 and one with R² = 0.99.)

However, it’s essential to keep in mind that a high R² is not necessarily good every single time (see the residual plots below), and a low R² is not necessarily always bad. In real life, events don’t fit a perfectly straight line all the time. For example, you can have taller or shorter children of the same age in your data. In some fields, an R² of 0.5 is considered good.

With the same example as above, look at the summary of the linear model to see its R².

(Image: the model summary with the multiple and adjusted R² highlighted in blue.)

In the blue rectangle, notice that there are two different R² values, one multiple and one adjusted. The multiple R² is the one you saw previously. One problem with this R² is that it cannot decrease as you add more independent variables to your model; it will keep increasing as you make the model more complex, even if the new variables add nothing to your predictions (like the number of siblings in the example). For this reason, the adjusted R² is probably better to look at if you are adding more than one variable to the model, since it only increases when an added variable actually reduces the overall error of the predictions.
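Both values can also be extracted from the summary object, and the multiple R² can be checked against its definition as the proportion of explained variability. A short sketch, assuming lmHeight2 and the ageandheight data from above:

s <- summary(lmHeight2)
s$r.squared #Multiple R-squared
s$adj.r.squared #Adjusted R-squared
rss <- sum(residuals(lmHeight2)^2) #Residual sum of squares
tss <- sum((ageandheight$height - mean(ageandheight$height))^2) #Total sum of squares
1 - rss / tss #Should match the multiple R-squared above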

Don’t Forget to Look at the Residuals!

You can have a pretty good R² in your model, but let’s not rush to conclusions here. Let’s see an example. You are going to predict the pressure of a material in a laboratory based on its temperature.

Let’s plot the data (in a simple scatterplot) and add the line you built with your linear model. In this example, let R read the data first, again with the read_excel command, to create a data frame, and then create a linear regression with the new data. The plot command takes a data frame and plots the variables in it; in this case, it plots the pressure against the temperature of the material. Then, add the line produced by the linear regression with the abline command.

pressure <- read_excel("pressure.xlsx") #Upload the data
lmTemp = lm(Pressure~Temperature, data = pressure) #Create the linear regression
plot(pressure, pch = 16, col = "blue") #Plot the results
abline(lmTemp) #Add a regression line

(Image: scatterplot of Pressure against Temperature with the fitted regression line.)

If you look at the summary of your new model, you can see that it has pretty good results (look at the R² and the adjusted R²).

summary(lmTemp)
Call:
lm(formula = Pressure ~ Temperature, data = pressure)

Residuals:
   Min     1Q Median     3Q    Max
-41.85 -34.72 -10.90  24.69  63.51

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -81.5000    29.1395  -2.797   0.0233 *  
Temperature   4.0309     0.4696   8.583 2.62e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.66 on 8 degrees of freedom
Multiple R-squared:  0.902,    Adjusted R-squared:  0.8898
F-statistic: 73.67 on 1 and 8 DF,  p-value: 2.622e-05

Ideally, when you plot the residuals, they should look random. Otherwise, it may mean that there is a hidden pattern that the linear model is not capturing. To plot the residuals, use the command plot(lmTemp$residuals).

plot(lmTemp$residuals, pch = 16, col = "red")

(Image: residual plot of lmTemp; the residuals follow a clear curved pattern.)

This can be a problem: if you get more data, your simple linear model will not be able to generalize well. In the previous picture, notice that there is a pattern (like a curve) in the residuals; this is not random at all.

What you can do is transform the variable. Many possible transformations can be applied to your data, such as adding a quadratic term (x²) or a cubic term (x³), or something more complex such as ln(x), ln(x + 1), sqrt(x), 1/x, or exp(x). Choosing the right transformation comes with some knowledge of algebraic functions, practice, and trial and error.
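In an lm formula, transformations of a predictor can be written directly, but arithmetic expressions such as powers or reciprocals need to be wrapped in I() so they are not interpreted as formula operators. The lines below are only illustrative sketches of what such models could look like for this data; whether they make sense depends on the values of Temperature:

lm(Pressure ~ log(Temperature), data = pressure) #ln(X), assuming all temperatures are strictly positive
lm(Pressure ~ sqrt(Temperature), data = pressure) #sqrt(X), assuming no negative temperatures
lm(Pressure ~ I(1 / Temperature), data = pressure) #1/X, assuming no temperature is exactly zero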

Let’s try with a quadratic term. For this, wrap the transformation in I() (a capital “i”) inside the formula:

lmTemp2 = lm(Pressure~Temperature + I(Temperature^2), data = pressure) #Create a linear regression with a quadratic coefficient
summary(lmTemp2) #Review the results
Call:
lm(formula = Pressure ~ Temperature + I(Temperature^2), data = pressure)

Residuals:
    Min      1Q  Median      3Q     Max
-4.6045 -1.6330  0.5545  1.1795  4.8273

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      33.750000   3.615591   9.335 3.36e-05 ***
Temperature      -1.731591   0.151002 -11.467 8.62e-06 ***
I(Temperature^2)  0.052386   0.001338  39.158 1.84e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.074 on 7 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9994
F-statistic:  7859 on 2 and 7 DF,  p-value: 1.861e-12

Notice that the model improved significantly. If you plot the residuals of the new model, they will look like this:

plot(lmTemp2$residuals, pch = 16, col = "red")

(Image: residual plot of lmTemp2; no visible pattern remains.)

Now you don’t see any clear patterns on your residuals, which is good!

Detect Influential Points

In your data, you may have influential points that might skew your model, sometimes unnecessarily. Think of a mistake in data entry: instead of writing “2.3”, the value was entered as “23”. The most common kind of influential points are outliers, which are data points where the observed response does not appear to follow the pattern established by the rest of the data.

You can detect influential points by passing the object containing the linear model to the function cooks.distance and then plotting those distances. Change a value on purpose to see how it looks on the Cook’s distance plot. To change a specific value, you can point at it directly with ageandheight[row number, column number] = [new value]. In this case, the height of the second observation is changed to 7.7:

ageandheight[2, 2] = 7.7
head(ageandheight)
age height no_siblings
18 76.1 0
19 7.7 2
20 78.1 0
21 78.2 3
22 78.8 4
23 79.7 1

Create the model again and see how the summary now shows a bad fit, and then plot the Cook’s distances. For this, after creating the linear regression, use the command cooks.distance([linear model]), and then, if you want, you can plot those distances with the command plot.

lmHeight3 = lm(height~age, data = ageandheight) #Create the linear regression
summary(lmHeight3) #Review the results
plot(cooks.distance(lmHeight3), pch = 16, col = "blue") #Plot the Cook's distances
Call:
lm(formula = height ~ age, data = ageandheight)

Residuals:
    Min      1Q  Median      3Q     Max
-53.704  -2.584   3.609   9.503  17.512

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    7.905     38.319   0.206    0.841
age            2.816      1.613   1.745    0.112

Residual standard error: 19.29 on 10 degrees of freedom
Multiple R-squared:  0.2335,    Adjusted R-squared:  0.1568
F-statistic: 3.046 on 1 and 10 DF,  p-value: 0.1115

(Image: Cook's distance plot for lmHeight3; one point stands far above the rest.)

Notice that there is a point that does not follow the pattern, and it might be affecting the model. At this point, you have to decide what to do with it. In general, there are three reasons why a point is so influential:

  1. Someone made a recording error
  2. Someone made a fundamental mistake collecting the observation
  3. The data point is perfectly valid, in which case the model cannot account for the behavior.

If the case is 1 or 2, then you can remove the point (or correct it). If it's 3, it's not worth deleting a valid point; instead, you might try a non-linear model rather than a linear one.
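If you do conclude that the point is a recording or collection error (cases 1 or 2), one way to handle it could look like the sketch below. The 4/n cutoff is only a common rule of thumb, not part of the method, and correcting the value is usually better than dropping the row:

cooksd <- cooks.distance(lmHeight3) #Cook's distance for every observation
influential <- cooksd > 4 / nrow(ageandheight) #Rule-of-thumb cutoff; flags the suspicious point
ageandheight_clean <- ageandheight[!influential, ] #Keep only the rows that were not flagged
lmHeight4 <- lm(height ~ age, data = ageandheight_clean) #Refit the model without the influential point
summary(lmHeight4) #The fit should improve noticeably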

Beware that an influential point can be a valid point; be sure to check the data and its source before deleting it. A quote commonly seen in statistics books puts it well: “Sometimes we throw out perfectly good data when we should be throwing out questionable models.”

Conclusion

You made it to the end! Linear regression is a big topic, and it's here to stay. Here, we've presented a few tricks that can help you tune and get the most out of such a powerful yet simple algorithm. You also learned how to understand what's behind this simple statistical model and how you can modify it according to your needs. You can also explore other options by typing ?lm in the R console and looking at the parameters not covered here. Check out our Regularization Tutorial: Ridge, Lasso and Elastic Net. If you are interested in diving deeper into statistical models, go ahead and check the course on Statistical Modeling in R.
