
# How to Use na.rm to Handle Missing Values in R

We set na.rm = TRUE in common R functions to exclude missing (NA) values. This helps us compute accurate statistics and enhances the reliability of our results.
Jul 2024  · 8 min read

Data is being generated and consumed at an industrial pace to drive key business decisions. However, one common challenge in data analysis is the presence of missing values. They can skew results, so every data analyst needs to handle them effectively. In R, one of the fundamental tools for managing missing values is the `na.rm` parameter.

This post explains why handling missing data matters, how `na.rm` works in R, and how to use it in various functions. It concludes with some alternative techniques for identifying and handling missing data.

## The Short Answer: What is na.rm?

The parameter `na.rm` in R stands for "NA remove" and ignores `NA` (missing) values during calculations. By setting `na.rm = TRUE`, functions like `mean()`, `sum()`, `min()`, `max()`, `median()`, and `sd()` compute results without being affected by missing values.

The `na.rm` parameter, written in lowercase, takes a logical value, `TRUE` or `FALSE`. When we set `na.rm = TRUE`, R excludes `NA` values from the calculation. With the default `na.rm = FALSE`, these functions return `NA` if missing values are present in the data. Take a look.

``````vector_with_na <- c(1, 2, NA)

sum(vector_with_na, na.rm = TRUE) # removes NA values — returns 3
sum(vector_with_na) # includes NA values — returns NA
sum(vector_with_na, na.rm = FALSE) # includes NA values — returns NA``````

Check out the Introduction to R course for more on basic R programming.

## Why Use na.rm?

Handling missing values is crucial in data analysis because they can distort statistical calculations and lead to incorrect conclusions. Setting `na.rm = TRUE` is a convenient way to skip missing values, so that a calculation returns a result based on the observed data rather than `NA`.
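To make this concrete, here is a minimal sketch on a made-up vector (`scores` is an illustrative name, not part of the tutorial data). It shows that a single `NA` makes the whole result `NA`, and that with `na.rm = TRUE` the mean is taken over the observed values only.

```r
# Hypothetical example data: one missing value among four observations
scores <- c(10, 20, NA, 40)

# A single NA makes the total unknown, so R returns NA
total_with_na <- sum(scores)

# na.rm = TRUE drops the NA before summing
total_without_na <- sum(scores, na.rm = TRUE)

# The mean is taken over the 3 observed values, not over all 4 positions
mean_observed <- mean(scores, na.rm = TRUE)
mean_by_hand  <- sum(scores, na.rm = TRUE) / sum(!is.na(scores))

print(total_with_na)                            # NA
print(total_without_na)                         # 70
print(all.equal(mean_observed, mean_by_hand))   # TRUE
```

Note that the denominator changes along with the numerator: the result describes only the values that were actually observed.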

The Data Cleaning in R course is valuable in furthering our data analysis skills.

## Common Functions Using na.rm: Some Practical Examples

Now, let’s understand how to use `na.rm` in R. Several R functions incorporate the `na.rm` parameter (set to `FALSE` by default). Here are some common examples:

### sum()

As the name suggests, `sum()` computes the total of the values in a vector. Let’s redefine `vector_with_na` with four elements; we will use this version throughout the rest of this tutorial.

``````vector_with_na <- c(1, 2, NA, 4)

sum(vector_with_na)
# Returns NA

sum(vector_with_na, na.rm = TRUE)
# Returns 7``````

### mean()

Using the same example, the `mean()` function calculates the mean, i.e., the average of the values in our vector.

``````mean(vector_with_na)
# Returns NA

mean(vector_with_na, na.rm = TRUE)
# Returns 2.33``````

### sd()

Continuing with `vector_with_na`, `sd()` computes the standard deviation of the vector.

``````sd(vector_with_na)
# Returns NA

sd(vector_with_na, na.rm = TRUE)
# Returns 1.53``````

### min()

`min()` finds the minimum value.

``````min(vector_with_na)
# Returns NA

min(vector_with_na, na.rm = TRUE)
# Returns 1``````

### max()

`max()` finds the maximum value.

``````max(vector_with_na)
# Returns NA

max(vector_with_na, na.rm = TRUE)
# Returns 4``````

### median()

`median()` finds the middle value when arranged in order.

``````median(vector_with_na)
# Returns NA

median(vector_with_na, na.rm = TRUE)
# Returns 2``````

Note that in the absence of `na.rm = TRUE` all these aggregate functions return `NA`.

By setting `na.rm = TRUE`, these functions exclude `NA` values, leading to accurate and meaningful computations. The R Programming Fundamentals skill track can be a good resource for better understanding the usage and syntax of these functions.
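A few edge cases are worth knowing about. The sketch below uses only base R and two made-up vectors (`all_missing` and `values` are illustrative names). It shows what some of these functions return when nothing is left after removing `NA` values, and that `quantile()`, another base function with an `na.rm` argument, stops with an error instead of returning `NA` when missing values are present.

```r
# Made-up vector in which every value is missing
all_missing <- c(NA_real_, NA_real_)

empty_sum  <- sum(all_missing, na.rm = TRUE)    # 0, the sum of no values
empty_mean <- mean(all_missing, na.rm = TRUE)   # NaN, because 0 / 0 is undefined
# max() of no values is -Inf and comes with a warning, silenced here
empty_max  <- suppressWarnings(max(all_missing, na.rm = TRUE))

values <- c(1, 2, NA, 4)

# With the default na.rm = FALSE, quantile() raises an error on NA input
quantile_default <- tryCatch(
  quantile(values),
  error = function(e) "error"
)
quantile_clean <- quantile(values, na.rm = TRUE)

# range() and var() follow the same na.rm convention
value_range <- range(values, na.rm = TRUE)      # 1 4
value_var   <- var(values, na.rm = TRUE)

print(list(empty_sum, empty_mean, empty_max, quantile_default, value_range))
```

In practice, this means a result of `0`, `NaN`, or `-Inf` after using `na.rm = TRUE` can be a sign that a vector contained no observed values at all.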

## Handling Missing Values in Different Data Structures

Let's take a look at how to handle missing values in different data structures. Before we continue, check out our Mastering Data Structures in the R Programming Language tutorial if you want to learn more about R data structures.

### Vectors

In vectors, `na.rm` can be used directly within functions to exclude missing values. The examples shared in the previous section are all examples of `na.rm` in action on vectors.

### Data frames

For data frames, `na.rm` can be passed to functions applied to specific columns, or across rows and columns using `apply()`, whose general form is:

``apply(X, MARGIN, FUN)``

Here:

• `X`: an array, data frame, or matrix

• `MARGIN`: argument to identify where to apply the function:

• `MARGIN=1` for row manipulation

• `MARGIN=2` for column manipulation

• `MARGIN=c(1,2)` to apply the function to each individual element (rows and columns together)

• `FUN`: tells which function to apply. Built-in functions like `mean()`, `median()`, `sum()`, `min()`, `max()`, and user-defined functions can be applied.

Let’s create a data frame for our reference throughout this tutorial.

``````dataframe_with_na <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

print(dataframe_with_na)
# Returns
#   col1 col2
# 1    1    4
# 2   NA    5
# 3    3   NA
# 4    6    7

apply(dataframe_with_na, 1, mean)
# Returns
# [1] 2.5  NA  NA 6.5

apply(dataframe_with_na, 1, mean, na.rm = TRUE)
# Returns
# [1] 2.5 5.0 3.0 6.5

apply(dataframe_with_na, 2, mean)
# Returns
# col1 col2
#  NA   NA

apply(dataframe_with_na, 2, mean, na.rm = TRUE)
# Returns
#   col1     col2
# 3.333333 5.333333``````
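For the common case of row or column means and sums, base R also provides `rowMeans()`, `colMeans()`, `rowSums()`, and `colSums()`, which accept `na.rm` directly. The sketch below recreates the same data under the illustrative name `df_demo`, so that it does not interfere with `dataframe_with_na`, and checks that the results agree with the `apply()` calls.

```r
# Same values as dataframe_with_na, under a separate illustrative name
df_demo <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

row_means <- rowMeans(df_demo, na.rm = TRUE)   # one mean per row
col_means <- colMeans(df_demo, na.rm = TRUE)   # one mean per column
col_sums  <- colSums(df_demo, na.rm = TRUE)    # one sum per column

# The dedicated functions agree with the equivalent apply() calls
same_rows <- all.equal(unname(row_means),
                       unname(apply(df_demo, 1, mean, na.rm = TRUE)))
same_cols <- all.equal(col_means, apply(df_demo, 2, mean, na.rm = TRUE))

print(col_means)
print(c(same_rows, same_cols))
```

These functions are typically shorter to write than the equivalent `apply()` call, and they make the intent of the code explicit.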

### Lists

The `lapply()` function in R is used to apply a specified function to each element of a list, vector, or data frame, and it returns a list of the same length. This function does not require a `MARGIN` parameter, as it automatically applies the operation to all elements.

Syntax:

``lapply(X, FUN, ...)``

Here:

• `X`: The input list, vector, or data frame.
• `FUN`: The function to be applied to each element of `X`.
• `...`: Optional arguments passed on to `FUN`, such as `na.rm = TRUE`.

Let’s understand `lapply()` using a list example. Here we have a list of collections (`item_1` and `item_2`) with elements similar to the data frame (`dataframe_with_na`). Our goal is to find the mean of each collection.

``````list_with_na <- list(item_1=c(1, NA, 3, 6), item_2=c(4, 5, NA, 7))

lapply(list_with_na, mean)
# Returns
# $item_1
# [1] NA
#
# $item_2
# [1] NA

lapply(list_with_na, mean, na.rm = TRUE)
# Returns
# $item_1
# [1] 3.333333
#
# $item_2
# [1] 5.333333``````

The `sapply()` function is similar to `lapply()` but simplifies the result to a vector or matrix where possible, instead of always returning a list. Let’s use the list (`list_with_na`) and `sapply()` to compute the mean of the values of each collection inside it.

``````sapply(list_with_na, mean)
# Returns
# item_1 item_2
#     NA     NA

sapply(list_with_na, mean, na.rm = TRUE)
# Returns
#   item_1   item_2
# 3.333333 5.333333``````

Now, let’s use the data frame (`dataframe_with_na`) and `sapply()` to compute the sum of the values of each column.

``````sapply(dataframe_with_na, sum)

# Returns
# col1 col2
# NA   NA

sapply(dataframe_with_na, sum, na.rm=TRUE)
# Returns
# col1 col2
# 10   16``````
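When the shape of the result matters, `vapply()` is a stricter alternative to `sapply()`: it takes a template for the expected return value and raises an error if any element does not match it. A minimal sketch, using a made-up list named `readings` with the same values as `list_with_na`:

```r
# Illustrative list with the same values as list_with_na
readings <- list(item_1 = c(1, NA, 3, 6), item_2 = c(4, 5, NA, 7))

# FUN.VALUE = numeric(1) declares that each call must return a single number;
# na.rm = TRUE is passed through to mean() just as with lapply() and sapply()
means <- vapply(readings, mean, FUN.VALUE = numeric(1), na.rm = TRUE)

print(means)
```

The result is a named numeric vector, one value per list element, and the call fails loudly if `mean()` ever returned something other than a single number.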

For a broader understanding of data manipulation in R, the Data Manipulation with dplyr course is highly recommended. If you feel shaky on the apply() family of functions specifically, read through our Tutorial on the R Apply Family.

### Comparison with na.omit() and complete.cases()

`na.omit()`: This function removes all the rows containing `NA` values in a data frame. Let’s understand this by an example:

``````na.omit(dataframe_with_na)
# Returns
# col1 col2
# 1    1    4
# 4    6    7``````

`complete.cases()`: On the other hand, `complete.cases()` identifies rows without any `NA` values and returns a logical value for each row (`TRUE` for complete rows and `FALSE` for rows containing `NA`). This can be used to keep only the complete rows of the data frame, as shown below.

``````complete.cases(dataframe_with_na)
# Returns
# [1]  TRUE FALSE FALSE  TRUE

dataframe_with_na[complete.cases(dataframe_with_na), ]
# Returns
# col1 col2
# 1    1    4
# 4    6    7``````
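The practical difference from `na.rm` is that `na.omit()` drops a whole row if any column in it is missing, while `na.rm = TRUE` drops only the missing entries of the vector being summarized. The sketch below, on a copy of the same data (named `df_demo` for illustration), shows that the two approaches can give different column means.

```r
# Same values as dataframe_with_na, under a separate illustrative name
df_demo <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# na.rm = TRUE: each column is averaged over its own 3 observed values
per_column <- colMeans(df_demo, na.rm = TRUE)

# na.omit(): rows 2 and 3 are removed entirely, leaving only rows 1 and 4
complete_rows <- colMeans(na.omit(df_demo))

print(per_column)      # col1 = 3.333333, col2 = 5.333333
print(complete_rows)   # col1 = 3.5,      col2 = 5.5
```

Row-wise removal discards the observed `3` in `col1` and the observed `5` in `col2`, which is why the means differ.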

## Advanced Techniques for Handling Missing Values

Beyond `na.rm`, R offers advanced methods for handling missing data:

### Using is.na() to identify missing values

`is.na()` can be directly applied over a vector or data frame to identify missing data. It returns a `TRUE` corresponding to each missing value.

``````is.na(vector_with_na)
# Returns
# [1] FALSE FALSE  TRUE FALSE

is.na(dataframe_with_na)
# Returns
# col1  col2
# [1,] FALSE FALSE
# [2,]  TRUE FALSE
# [3,] FALSE  TRUE
# [4,] FALSE FALSE``````
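`is.na()` is often combined with other base R functions to answer more specific questions. The short sketch below, on a made-up vector `x`, checks whether anything is missing, finds the positions of the missing values, counts them, and drops them by hand, which has the same effect on the result as `na.rm = TRUE`.

```r
# Illustrative vector with one missing value
x <- c(1, 2, NA, 4)

has_missing  <- anyNA(x)          # TRUE: at least one value is missing
missing_at   <- which(is.na(x))   # 3: position of the missing value
missing_n    <- sum(is.na(x))     # 1: TRUE counts as 1 when summed
observed     <- x[!is.na(x)]      # 1 2 4: keep only the observed values

# Summing the manually filtered vector matches na.rm = TRUE
same_total <- sum(observed) == sum(x, na.rm = TRUE)

print(list(has_missing, missing_at, missing_n, observed, same_total))
```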

### Applying imputation methods

Imputation replaces missing values with estimated values, such as the mean or median, as shown below:

``````vector_with_na[is.na(vector_with_na)] <- mean(vector_with_na, na.rm = TRUE)

print(vector_with_na)
# Returns
# [1] 1.000000 2.000000 2.333333 4.000000

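# Impute column by column so that each NA receives the mean of its own column
for (col in names(dataframe_with_na)) {
  missing <- is.na(dataframe_with_na[[col]])
  dataframe_with_na[[col]][missing] <- mean(dataframe_with_na[[col]], na.rm = TRUE)
}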
print(dataframe_with_na)
# Returns
# col1     col2
# 1 1.000000 4.000000
# 2 3.333333 5.000000
# 3 3.000000 5.333333
# 4 6.000000 7.000000``````
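To reuse the same logic across columns and statistics, the imputation step can be wrapped in a small function. In the sketch below, `impute_na()` is a hypothetical helper written for illustration, not a base R function, and `df_demo` is made-up data with two missing values in one column.

```r
# Hypothetical helper: replace NA values with a summary of the observed values
impute_na <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Made-up data: col1 has two missing values, col2 has one
df_demo <- data.frame(col1 = c(1, NA, NA, 6), col2 = c(4, 5, NA, 7))

# Assigning into df_mean[] keeps the data frame structure
# while replacing every column with its imputed version
df_mean <- df_demo
df_mean[] <- lapply(df_demo, impute_na)

# The same helper with the median instead of the mean
df_median <- df_demo
df_median[] <- lapply(df_demo, impute_na, fun = median)

print(df_mean)
print(df_median)
```

Because each column is processed on its own, every missing value is filled from its own column regardless of how many values are missing or where they occur.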

### Using the summary() function

Interestingly, the `summary()` function handles missing values differently and does not need `na.rm`. By default, `summary()` excludes missing values when computing summary statistics and reports the number of `NA`s present in the data.

Let's run `summary()` on the vector `vector_with_na` in its original form, `c(1, 2, NA, 4)`, before any imputation:

``````summary(vector_with_na)
#Returns
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#  1.000   1.500   2.000   2.333   3.000   4.000       1``````

And on the data frame `dataframe_with_na`, again in its original form before any imputation:

``````summary(dataframe_with_na)
# Returns
#       col1            col2
#  Min.   :1.000   Min.   :4.000
#  1st Qu.:2.000   1st Qu.:4.500
#  Median :3.000   Median :5.000
#  Mean   :3.333   Mean   :5.333
#  3rd Qu.:4.500   3rd Qu.:6.000
#  Max.   :6.000   Max.   :7.000
#  NA's   :1       NA's   :1``````

The Intermediate R course offers in-depth insights for those interested in more advanced data handling techniques.

## Best Practices for Handling Missing Values

Handling missing values effectively involves several best practices:

### Always check for missing values

Identify and understand the extent of missing data before performing calculations. The example below, which uses the original data frame before any imputation, shows a total of 2 missing values.

``````print(dataframe_with_na)
# Returns
# col1 col2
# 1    1    4
# 2   NA    5
# 3    3   NA
# 4    6    7

sum(is.na(dataframe_with_na))
# Returns
# [1] 2``````
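A single total can hide where the gaps are, so it often helps to break the count down. The sketch below, on a copy of the data named `df_demo` for illustration, reports the number and share of missing values per column and the rows that contain at least one `NA`.

```r
# Same values as the original dataframe_with_na, under an illustrative name
df_demo <- data.frame(col1 = c(1, NA, 3, 6), col2 = c(4, 5, NA, 7))

# is.na() returns a logical matrix; TRUE counts as 1 in sums and means
na_count <- colSums(is.na(df_demo))    # missing values per column
na_share <- colMeans(is.na(df_demo))   # proportion missing per column

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(df_demo))

print(na_count)       # col1 = 1, col2 = 1
print(na_share)       # col1 = 0.25, col2 = 0.25
print(rows_with_na)   # 2 3
```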

### Use conditional checks

Always verify the presence of missing data and accordingly apply any imputation techniques, ensuring all edge cases are well-handled.

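``````if (is.data.frame(dataframe_with_na) && any(is.na(dataframe_with_na))) {
  # Impute each column with its own median so values never cross columns
  for (col in names(dataframe_with_na)) {
    missing <- is.na(dataframe_with_na[[col]])
    dataframe_with_na[[col]][missing] <- median(dataframe_with_na[[col]], na.rm = TRUE)
  }
}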

print(dataframe_with_na)
# Returns
# col1 col2
# 1    1    4
# 2    3    5
# 3    3    5
# 4    6    7``````

### Consider the context

Understand why data is missing and choose an appropriate method to handle it: removal or imputation.

For a structured approach to data analysis and handling missing values, the course can be a great resource.

## Conclusion

Understanding and using `na.rm` is key to ensuring reliable and accurate data analysis. Practice using `na.rm` in R projects and explore advanced techniques to handle missing data comprehensively.


Author
Vidhi Chugh

I am an AI Strategist and Ethicist working at the intersection of data science, product, and engineering to build scalable machine learning systems. Listed as one of the "Top 200 Business and Technology Innovators" in the world, I am on a mission to democratize machine learning and break the jargon for everyone to be a part of this transformation.

### What is the na.rm parameter in R?

The `na.rm` parameter in R stands for "NA remove." When set to `TRUE`, it instructs functions to ignore `NA` (missing) values during calculations, ensuring accurate results.

### How do you use na.rm in common R functions?

You can use `na.rm` in functions like `sum()`, `mean()`, `sd()`, `min()`, `max()`, and `median()` by setting `na.rm = TRUE` to exclude missing values from calculations.

### Why is handling missing values important in data analysis?

Missing values can skew results and lead to inaccurate conclusions. Handling them effectively ensures reliable computations and maintains data integrity.

### Can na.rm be used with data frames and lists in R?

Yes, `na.rm` can be used with data frames using functions like `apply()`, and with lists using `lapply()` and `sapply()`, to exclude missing values during operations on these data structures.

### What are some advanced techniques for handling missing values in R?

Beyond `na.rm`, advanced techniques include using `is.na()` to identify missing values, applying imputation methods to replace them, and leveraging functions like `na.omit()` and `complete.cases()` for more complex data cleaning.
