How to Use na.rm to Handle Missing Values in R

We set na.rm = TRUE in common R functions to exclude missing (NA) values. This helps us compute accurate statistics and enhances the reliability of our results.

Jul 3, 2024 · 8 min read

Data is being generated and consumed at an industrial pace to drive key business decisions. However, one common challenge with data analysis is the presence of missing values. These can skew results, so every data analyst needs to handle them effectively. In R, one of the fundamental tools for managing missing values is the na.rm parameter.

This post explains why handling missing data matters, shows how na.rm works in R, demonstrates its use in various functions, and concludes with some alternative techniques for identifying and handling missing data.

The Short Answer: What is na.rm?

The parameter na.rm in R stands for "NA remove" and tells a function to ignore NA (missing) values during calculations. By setting na.rm = TRUE, functions like mean(), sum(), min(), max(), median(), and sd() compute results without being affected by missing values.

The na.rm parameter, written in lowercase, takes a logical value: TRUE or FALSE. When we set na.rm = TRUE, R excludes NA values from the calculation. When na.rm is left at its default of FALSE, these functions return NA if any missing values are present in the data. Take a look.

vector_with_na <- c(1, 2, NA)

sum(vector_with_na, na.rm = TRUE) # removes NA values — returns 3
sum(vector_with_na) # includes NA values — returns NA
sum(vector_with_na, na.rm = FALSE) # includes NA values — returns NA

Check out the Introduction to R course for more on basic R programming.

Why Use na.rm?

Handling missing values is crucial in data analysis because they can significantly distort statistical calculations and lead to incorrect conclusions. Setting na.rm = TRUE is a convenient way to skip over missing values so that summary statistics are computed from the observed data only.

The Data Cleaning in R course is valuable in furthering our data analysis skills.

Common Functions Using na.rm: Some Practical Examples

Now, let’s understand how to use na.rm in R. Several R functions incorporate the na.rm parameter (set to FALSE by default). Here are some common examples:

sum()

As the name suggests, sum() computes the total of the values in a vector. Let’s create a vector called vector_with_na, which we will use throughout this tutorial.

vector_with_na <- c(1, 2, NA, 4)

sum(vector_with_na) 
# Returns NA

sum(vector_with_na, na.rm = TRUE) 
# Returns 7

mean()

Using the same example, the mean() function calculates the mean, i.e., the average of the values in our vector.

mean(vector_with_na) 
# Returns NA

mean(vector_with_na, na.rm = TRUE) 
# Returns 2.33

sd()

Continuing with vector_with_na, sd() computes the standard deviation of the vector.

sd(vector_with_na) 
# Returns NA

sd(vector_with_na, na.rm = TRUE) 
# Returns 1.53

min()

min() finds the minimum value.

min(vector_with_na) 
# Returns NA

min(vector_with_na, na.rm = TRUE) 
# Returns 1

max()

max() finds the maximum value.

max(vector_with_na) 
# Returns NA

max(vector_with_na, na.rm = TRUE) 
# Returns 4

median()

median() finds the middle value when arranged in order.

median(vector_with_na) 
# Returns NA

median(vector_with_na, na.rm = TRUE) 
# Returns 2

Note that in the absence of na.rm = TRUE all these aggregate functions return NA.

By setting na.rm = TRUE, these functions exclude NA values, leading to accurate and meaningful computations. The R Programming Fundamentals skill track can be a good resource for better understanding the usage and syntax of these functions.
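
To see the pattern across all six functions at once, the following sketch applies each of them to vector_with_na with and without na.rm = TRUE. The names summary_functions, without_na_rm, and with_na_rm are chosen here for illustration and are not part of the original examples; sapply() itself is covered later in this tutorial.

```r
vector_with_na <- c(1, 2, NA, 4)

# Hypothetical helper list: the six functions discussed in this section
summary_functions <- list(sum = sum, mean = mean, sd = sd,
                          min = min, max = max, median = median)

# Default behaviour (na.rm = FALSE): the single NA makes every result NA
without_na_rm <- sapply(summary_functions, function(f) f(vector_with_na))

# na.rm = TRUE: the NA is dropped before each calculation
with_na_rm <- sapply(summary_functions,
                     function(f) f(vector_with_na, na.rm = TRUE))

print(without_na_rm)
print(round(with_na_rm, 2))
```

The second result holds the same values as the individual calls above: 7, 2.33, 1.53, 1, 4, and 2.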

Handling Missing Values in Different Data Structures

Let's take a look at how to handle missing values in different data structures. Before we continue, check out our Mastering Data Structures in the R Programming Language tutorial if you want to learn more about R data structures.

Vectors

In vectors, na.rm can be used directly within functions to exclude missing values. The examples shared in the previous section are all examples of na.rm in action on vectors.

Data frames

For data frames, na.rm can be applied within functions used on specific columns, or across rows and columns using apply(), as shown below:

apply(X, MARGIN, FUN, ...)

Here:

X: an array, data frame, or matrix
MARGIN: argument to identify where to apply the function:
- MARGIN=1 for row manipulation
- MARGIN=2 for column manipulation
- MARGIN=c(1,2) for both row and column manipulation (FUN is applied to each individual cell)
FUN: tells which function to apply. Built-in functions like mean(), median(), sum(), min(), max(), and user-defined functions can be applied.
...: additional arguments passed on to FUN, such as na.rm = TRUE.

Let’s create a data frame for our reference throughout this tutorial.

dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

print(dataframe_with_na)
# Returns
#   col1 col2
# 1    1    4
# 2   NA    5
# 3    3   NA
# 4    6    7

apply(dataframe_with_na, 1, mean)
# Returns
# [1] 2.5  NA  NA 6.5

apply(dataframe_with_na, 1, mean, na.rm = TRUE)
# Returns
# [1] 2.5 5.0 3.0 6.5

apply(dataframe_with_na, 2, mean)
# Returns
# col1 col2 
#  NA   NA 

apply(dataframe_with_na, 2, mean, na.rm = TRUE)
# Returns   
#   col1     col2 
# 3.333333 5.333333
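
For row and column sums or means specifically, base R also provides rowMeans(), colMeans(), rowSums(), and colSums(), which accept na.rm directly. The following is a minimal sketch on the same data frame; the result names row_means, col_means, and col_sums are illustrative.

```r
dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# Mean of each row, ignoring NA values
row_means <- rowMeans(dataframe_with_na, na.rm = TRUE)

# Mean and sum of each column, ignoring NA values
col_means <- colMeans(dataframe_with_na, na.rm = TRUE)
col_sums <- colSums(dataframe_with_na, na.rm = TRUE)

print(row_means)
print(col_means)
print(col_sums)
```

These calls give the same row and column means as the apply() examples above, without having to specify MARGIN or FUN.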

Lists

The lapply() function in R applies a specified function to each element of a list, vector, or data frame (column by column), and it returns a list of the same length. This function does not require a MARGIN parameter, as it automatically applies the operation to all elements.

Syntax:

lapply(X, FUN, ...)

Here:

X: The input list, vector, or data frame.
FUN: The function to be applied to each element of X.
...: Additional arguments passed on to FUN, such as na.rm = TRUE.

Let’s understand lapply() using a list example. Here we have a list of collections (item_1 and item_2) with elements similar to the data frame (dataframe_with_na). Our goal is to find the mean of each collection.

list_with_na <- list(item_1=c(1, NA, 3, 6), item_2=c(4, 5, NA, 7))

lapply(list_with_na, mean)
# Returns
# $item_1
# [1] NA
# 
# $item_2
# [1] NA

lapply(list_with_na, mean, na.rm = TRUE)
# Returns
# $item_1
# [1] 3.333333
# 
# $item_2
# [1] 5.333333

The sapply() function is similar to lapply() but simplifies the result to a vector or matrix where possible instead of returning a list. Let’s use the list (list_with_na) and sapply() to compute the mean of the values of each collection inside it.

sapply(list_with_na, mean)
# Returns
# item_1 item_2 
#     NA     NA 

sapply(list_with_na, mean, na.rm = TRUE)
# Returns
#   item_1   item_2 
# 3.333333 5.333333

Now, let’s use the data frame (dataframe_with_na) and sapply() to compute the sum of the values of each column.

sapply(dataframe_with_na, sum)
# Returns
# col1 col2 
#   NA   NA 

sapply(dataframe_with_na, sum, na.rm = TRUE)
# Returns
# col1 col2 
#   10   16

For a broader understanding of data manipulation in R, the Data Manipulation with dplyr course is highly recommended. If you feel shaky on the apply() family of functions specifically, read through our Tutorial on the R Apply Family.

Comparison with na.omit() and complete.cases()

na.omit(): This function removes every row that contains at least one NA value in a data frame. Let’s understand this with an example:

na.omit(dataframe_with_na)
# Returns
#   col1 col2
# 1    1    4
# 4    6    7

complete.cases(): On the other hand, complete.cases() returns a logical vector with one value per row: TRUE for rows without any NA values and FALSE for rows that contain an NA. This can be used to keep only the complete rows of the data frame, as shown below.

complete.cases(dataframe_with_na)
# Returns
# [1]  TRUE FALSE FALSE  TRUE

dataframe_with_na[complete.cases(dataframe_with_na), ]
# Returns
#   col1 col2
# 1    1    4
# 4    6    7
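
The practical difference between the two approaches is easiest to see side by side. In this sketch (result names are illustrative), na.rm = TRUE drops missing values one column at a time, while na.omit() discards every incomplete row before anything is computed, so the two column means differ.

```r
dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# na.rm = TRUE: each column mean uses its own 3 observed values
means_na_rm <- sapply(dataframe_with_na, mean, na.rm = TRUE)

# na.omit(): rows 2 and 3 are dropped first, so each mean uses only 2 values
means_na_omit <- sapply(na.omit(dataframe_with_na), mean)

print(means_na_rm)
print(means_na_omit)
```

Here the na.rm means are about 3.33 and 5.33, while the na.omit() means are 3.5 and 5.5, because row-wise deletion also throws away the observed 3 in col1 and the observed 5 in col2.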

Advanced Techniques for Handling Missing Values

Beyond na.rm, R offers advanced methods for handling missing data:

Using is.na() to identify missing values

is.na() can be applied directly to a vector or data frame to identify missing data. It returns TRUE for each missing value and FALSE for every other value.

is.na(vector_with_na)
# Returns
# [1] FALSE FALSE  TRUE FALSE

is.na(dataframe_with_na)
# Returns
#       col1  col2
# [1,] FALSE FALSE
# [2,]  TRUE FALSE
# [3,] FALSE  TRUE
# [4,] FALSE FALSE
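
The logical output of is.na() can also be turned into positions. As a small sketch using base R's anyNA() and which() (the result names are illustrative), we can ask whether anything is missing at all and where the missing entries sit:

```r
vector_with_na <- c(1, 2, NA, 4)
dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# Quick yes/no check for any missing value
has_missing <- anyNA(vector_with_na)

# Index of each missing element in the vector
na_index <- which(is.na(vector_with_na))

# Row and column position of each missing cell in the data frame
na_positions <- which(is.na(dataframe_with_na), arr.ind = TRUE)

print(has_missing)
print(na_index)
print(na_positions)
```

For this data, the vector has its missing value at index 3, and the data frame has missing cells at row 2 of the first column and row 3 of the second column.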

Applying imputation methods

Imputation replaces missing values with estimated values, such as the mean or median, as shown below:

vector_with_na[is.na(vector_with_na)] <- mean(vector_with_na, na.rm = TRUE)

print(vector_with_na)
# Returns
# [1] 1.000000 2.000000 2.333333 4.000000

# Impute each column separately with its own mean
dataframe_with_na[] <- lapply(dataframe_with_na, function(column) {
  column[is.na(column)] <- mean(column, na.rm = TRUE)
  column
})

print(dataframe_with_na)
# Returns
#       col1     col2
# 1 1.000000 4.000000
# 2 3.333333 5.000000
# 3 3.000000 5.333333
# 4 6.000000 7.000000

Using the summary() function

Interestingly, the summary() function handles missing values differently from na.rm. By default, summary() automatically excludes missing values when computing summary statistics and reports the number of NAs present in the data.

Let's run summary() on the vector vector_with_na as originally defined, before any imputation:

summary(vector_with_na)
# Returns
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#  1.000   1.500   2.000   2.333   3.000   4.000       1

And on the data frame dataframe_with_na, again as originally defined, before any imputation:

summary(dataframe_with_na)
# Returns
#       col1            col2      
#  Min.   :1.000   Min.   :4.000  
#  1st Qu.:2.000   1st Qu.:4.500  
#  Median :3.000   Median :5.000  
#  Mean   :3.333   Mean   :5.333  
#  3rd Qu.:4.500   3rd Qu.:6.000  
#  Max.   :6.000   Max.   :7.000  
#  NA's   :1       NA's   :1

The Intermediate R course offers in-depth insights for those interested in more advanced data handling techniques.

Best Practices for Handling Missing Values

Handling missing values effectively involves several best practices:

Always check for missing values

Identify and understand the extent of missing data before performing calculations. The example below shows that the original data frame has a total of 2 missing values.

print(dataframe_with_na)
# Returns
#   col1 col2
# 1    1    4
# 2   NA    5
# 3    3   NA
# 4    6    7

sum(is.na(dataframe_with_na))
# Returns
# [1] 2
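
A single total does not show where the gaps are. As a sketch, combining is.na() with colSums() and colMeans() gives the count and the share of missing values per column (na_counts and na_share are illustrative names):

```r
dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# Number of missing values in each column
na_counts <- colSums(is.na(dataframe_with_na))

# Proportion of missing values in each column
na_share <- colMeans(is.na(dataframe_with_na))

print(na_counts)
print(na_share)
```

Each column here has 1 missing value out of 4, a share of 0.25.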

Use conditional checks

Verify that missing data is actually present before applying an imputation technique, and make sure edge cases are handled.

# Only impute when the object is a data frame that contains missing values
if (is.data.frame(dataframe_with_na) && anyNA(dataframe_with_na)) {
  # Replace the NAs in each column with that column's own median
  dataframe_with_na[] <- lapply(dataframe_with_na, function(column) {
    column[is.na(column)] <- median(column, na.rm = TRUE)
    column
  })
}

print(dataframe_with_na)
# Returns
#   col1 col2
# 1    1    4
# 2    3    5
# 3    3    5
# 4    6    7
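
Two edge cases are worth guarding against: non-numeric columns, where a median is not meaningful, and columns that are entirely NA, where there is nothing to estimate from. The helper below is a hypothetical sketch, not part of base R; impute_median and example_df are names invented here for illustration.

```r
# Hypothetical helper: median-impute a column only when that makes sense
impute_median <- function(column) {
  # Leave non-numeric columns and all-NA columns untouched
  if (!is.numeric(column) || all(is.na(column))) {
    return(column)
  }
  column[is.na(column)] <- median(column, na.rm = TRUE)
  column
}

# Illustrative data: one ordinary column, one all-NA column, one text column
example_df <- data.frame(
  score = c(1, NA, 3),
  empty = c(NA_real_, NA_real_, NA_real_),
  label = c("x", NA, "z")
)

example_df[] <- lapply(example_df, impute_median)
print(example_df)
```

After this runs, the missing score is replaced by 2 (the median of 1 and 3), while the empty and label columns keep their NA values for us to handle deliberately.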

Consider the context

Understand why data is missing and choose an appropriate method to handle it: removal or imputation.

For a structured approach to data analysis and handling missing values, the Data Cleaning in R course can be a great resource.

Conclusion

Understanding and using na.rm is key to ensuring reliable and accurate data analysis. Practice using na.rm in R projects and explore advanced techniques to handle missing data comprehensively.

For further learning, explore these additional resources:

Data Cleaning: Enhance your data analysis skills with our Data Cleaning in R course.
Introduction to R: Get started with R programming with our Introduction to R course.
Data Visualization: Master creating compelling visuals with our Introduction to Data Visualization with ggplot2 course.
Data Manipulation: Improve your data transformation techniques with our Data Manipulation with dplyr course.
Regression Analysis: Dive into regression techniques and learn statistics with our Introduction to Regression in R course.

Author

Vidhi Chugh

