Course Notes: Dealing With Missing Data in R

Chapter 1: Why Care About Missing Data?

How do I check if I have missing values?

install.packages("naniar")
# None of the functions used in this course work without the 'naniar' package.
library('naniar')
library('dplyr')
x <- c(1, NA, 3, NA, NA, 5) # sample vector
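any_na(x) # TRUE if any value is missing, otherwise FALSE
are_na(x) # logical vector, TRUE where a value is missing
n_miss(x) # number of missing values
prop_miss(x) # proportion of missing values
n_complete(x) # number of non-missing values
prop_complete(x) # proportion of non-missing values
# operations with NA values
1 + NA # NA
NA + NA # NA
NA | TRUE # TRUE, because the unknown value cannot change the result
NA | FALSE # NA, because the result depends on the unknown value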
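
For comparison, the same checks can be written in base R. This is only a sketch of what the helpers compute, not naniar's own source code; the variable names are made up for illustration.

```r
# Base R counterparts of the naniar helpers (illustrative sketch only)
x <- c(1, NA, 3, NA, NA, 5)

any_missing  <- anyNA(x)         # like any_na(x)
is_missing   <- is.na(x)         # like are_na(x)
n_missing    <- sum(is.na(x))    # like n_miss(x)
prop_missing <- mean(is.na(x))   # like prop_miss(x)
n_present    <- sum(!is.na(x))   # like n_complete(x)
prop_present <- mean(!is.na(x))  # like prop_complete(x)

# NA propagates through arithmetic, but a logical operator can still
# give a definite answer when the unknown value cannot change it.
na_plus_one <- 1 + NA       # NA
na_or_true  <- NA | TRUE    # TRUE
na_or_false <- NA | FALSE   # NA
```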

How to Summarize Missing Values

Basic summaries of missingness:

  • n_miss() number missing
  • n_complete() number complete

Dataframe summaries of missingness:

  • miss_var_summary() summarize number of missings in each variable
  • miss_case_summary() summarize number of missings in each case, where a case is a row of the dataset
  • miss_var_table() returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected.
  • miss_case_table() returns the same information, but for cases
  • These functions work with group_by
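
To make the summaries concrete, here is a base R sketch of roughly what they tabulate for airquality. It is not how naniar implements them, and the column names only loosely mirror naniar's output.

```r
# airquality ships with base R (datasets package)
na_flags <- is.na(airquality)  # logical matrix, TRUE where a value is missing

# Per-variable counts and percentages, most-missing first
var_summary <- data.frame(
  variable  = colnames(na_flags),
  n_miss    = colSums(na_flags),
  pct_miss  = 100 * colMeans(na_flags),
  row.names = NULL
)
var_summary <- var_summary[order(-var_summary$n_miss), ]

# Per-case (row) counts
case_summary <- data.frame(
  case   = seq_len(nrow(airquality)),
  n_miss = rowSums(na_flags)
)

# How many cases have 0, 1, 2, ... missing values
case_table <- table(n_miss_in_case = case_summary$n_miss)
```

Printing var_summary, case_summary, and case_table gives a rough feel for what miss_var_summary(), miss_case_summary(), and miss_case_table() report.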

Spans of missing data

  • miss_var_span(df, var=, span_every=) calculates the number of missings in a variable for a repeating span
  • miss_var_run(df, var=) returns the length of each "run" or "streak" of missing and complete values in a variable
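
The idea behind runs and spans can be sketched in base R on a small made-up vector. The data and the names counts, run_summary, and span_summary are hypothetical, and this is not naniar's implementation.

```r
# Toy series (hypothetical data, not the pedestrian dataset)
counts <- c(5, NA, NA, 7, 8, NA, 2, 3, NA, NA, NA, 4)

# Runs: consecutive stretches of missing or complete values
runs <- rle(is.na(counts))
run_summary <- data.frame(
  run_length = runs$lengths,
  is_na      = ifelse(runs$values, "missing", "complete")
)

# Spans: number of missings in each repeating block of `span_every` rows
span_every <- 4
span_id <- ceiling(seq_along(counts) / span_every)
span_summary <- data.frame(
  span_counter = sort(unique(span_id)),
  n_miss       = as.vector(tapply(is.na(counts), span_id, sum))
)
```

A run summary answers "how long do the gaps last?", while a span summary answers "how many values are missing in each block of rows?".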

Using summaries with group_by

airquality %>% group_by(Month) %>% miss_var_summary()
# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
# Calculate the summaries for each span of missingness, 
# for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# For each `month` variable, calculate the run of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
# For each `month` variable, calculate the span of missingness 
# of a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)

How to Visualize Missing Values

  • naniar provides a friendly family of missing data visualization functions.
  • Each visualization corresponds to a data summary.
  • Visualizations help you operate closer to the speed of thought.
  • Get a bird's eye view of the missing data
    • vis_miss(airquality) produces heatmap of missingness
  • Can also cluster data:
    • vis_miss(airquality, cluster = TRUE) orders rows by missingness to identify common co-occurrences
  • Look at missings in variables and cases
    • gg_miss_var(airquality) each point represents the amount of missingness in that variable
    • gg_miss_case(airquality) each line represents the amount of missingness in that case. The ordering in gg_miss_case can be turned off with the option order_cases = FALSE.
    • gg_miss_var(airquality, facet = Month)
    • gg_miss_upset(airquality) visualize the common combinations of missingness - which variables and cases go missing together
    • gg_miss_fct explores how missingness in each variable changes across a factor. It displays a heatmap with the levels of the factor on the x axis, each other variable on the y axis, and the amount of missingness colored from dark purple to yellow. gg_miss_fct does not support faceting.
    • gg_miss_span is the visual analogue of miss_var_span. It calculates the number of missings in each span of a given size, for example every 3000 rows, and displays the amount of missing values in each span as a filled barplot. gg_miss_span supports faceting.
# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000 
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
# Using the pedestrian dataset, explore the impact of month by faceting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts , span_every = 1000, facet = month)
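
Since each visualization corresponds to a data summary, it can help to look at the numbers behind a plot. The base R sketch below computes the kind of table that a per-factor heatmap such as gg_miss_fct would color: the percentage of missing values for each variable within each Month of airquality. The names vars and pct_miss_by_month are illustrative, and this is not naniar's code.

```r
# Percentage of missing values per variable within each level of Month
vars <- setdiff(names(airquality), "Month")
pct_miss_by_month <- sapply(
  split(airquality[vars], airquality$Month),
  function(d) 100 * colMeans(is.na(d))
)

# Rows are variables, columns are months
round(pct_miss_by_month, 1)

# With naniar loaded, the matching plot would be:
# gg_miss_fct(x = airquality, fct = Month)
```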

Chapter 2: Wrangling and Tidying Missing Values

Searching for and replacing missing values

  • Ideal = NA
  • Missing values can be coded incorrectly: e.g. "missing", "Not Available", "N/A"
  • Assuming that missing values are always coded as NA is a mistake.
  • miss_scan_count() searches each variable for the given values and counts them
chaos %>% miss_scan_count(search = list("N/A"))
# miss_scan_count() can also take multiple arguments in the search
chaos %>% miss_scan_count(search = list("N/A", "N/a"))
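
The chaos dataset comes from the course and is not reproduced here, so the sketch below uses a small made-up data frame (toy) to show the idea of scanning every column for suspicious codes. It uses exact matching in base R and is only an approximation of what miss_scan_count() reports.

```r
# Hypothetical toy data standing in for `chaos`
toy <- data.frame(
  score = c("3", "N/A", "-99", "N/a"),
  grade = c("A", "N/A", "missing", "B"),
  stringsAsFactors = FALSE
)

# Values that we suspect are disguised missing values
search <- c("N/A", "N/a")

# Count exact matches of the search values in each column
scan_counts <- data.frame(
  Variable  = names(toy),
  n         = vapply(toy, function(col) sum(col %in% search), integer(1)),
  row.names = NULL
)
```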

Replacing Missing Values

# Use chaos, then for the variable "grade", replace the values "N/A" and "N/a" with NA
chaos %>% replace_with_na(replace = list(grade = c("N/A", "N/a")))
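
As a rough base R counterpart, the replacement for a single variable might look like the following. The data frame toy is made up for illustration, and this is a sketch of the effect rather than naniar's implementation.

```r
# Hypothetical toy data standing in for `chaos`
toy <- data.frame(
  grade = c("A", "N/A", "B", "N/a"),
  score = c(3, -99, 5, 2),
  stringsAsFactors = FALSE
)

# Replace the disguised missing values in `grade` only; `score` is untouched
cleaned <- toy
cleaned$grade[cleaned$grade %in% c("N/A", "N/a")] <- NA
```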

"scoped variants" of replace_with_na

  • replace_with_na can be repetitive:
    • Use it across many different variables and values
    • Complex cases, such as replacing values less than -1 or only affecting character columns
  • replace_with_na_all() all variables
  • replace_with_na_at() a subset of selected variables
  • replace_with_na_if() a subset of variables that fulfill some condition (numeric, character)
# "chaos THEN replace_with_na_all, where the variable is equal to -99."
chaos %>% replace_with_na_all(condition = ~.x == -99)
# replace "N/A", "missing", "na" values
chaos %>% replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
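
The difference between the three scoped variants can be sketched in base R on a made-up data frame. The names toy, bad_strings, and the *_cleaned objects are hypothetical, and the code only mimics the "all", "at", and "if" scopes rather than reproducing naniar's behavior exactly.

```r
# Hypothetical toy data standing in for `chaos`
toy <- data.frame(
  grade = c("A", "N/A", "missing", "na"),
  score = c(3, -99, 5, -99),
  stringsAsFactors = FALSE
)
bad_strings <- c("N/A", "missing", "na")

# "all": apply one condition to every column
all_cleaned <- toy
all_cleaned[] <- lapply(toy, function(col) replace(col, col %in% bad_strings, NA))

# "at": apply a condition to explicitly named columns only
at_cleaned <- toy
at_cleaned$score <- replace(toy$score, toy$score == -99, NA)

# "if": apply a condition to columns that satisfy a predicate (here: numeric)
if_cleaned <- toy
is_num <- vapply(toy, is.numeric, logical(1))
if_cleaned[is_num] <- lapply(toy[is_num], function(col) replace(col, col < -1, NA))
```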