Course Notes: Dealing With Missing Data in R
Chapter 1: Why Care About Missing Data?
How do I check if I have missing values?
install.packages("naniar") # run once to install the package
# The missing-data functions used in this course come from the 'naniar' package.
library(naniar)
library(dplyr) # provides %>% and group_by()
x <- c(1, NA, 3, NA, NA, 5) # sample vector
any_na(x) # returns TRUE if any value is missing, otherwise FALSE
are_na(x) # returns a logical vector, TRUE where a value is missing
n_miss(x) # returns the number of missing values
prop_miss(x) # returns the proportion of missing values
n_complete(x) # returns the number of complete (non-missing) values
prop_complete(x) # returns the proportion of complete values
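# operations with NA values
1 + NA # NA: arithmetic with a missing value is missing
NA + NA # NA
NA | TRUE # TRUE: the result is TRUE whatever the missing value is
NA | FALSE # NA: the result depends on the missing value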
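The helpers and NA rules above can be cross-checked in base R. The sketch below is only an illustration of what the naniar helpers count, not how naniar implements them; the variable names are made up for this example.

```r
# Sample vector from the notes: six values, three of them missing
x <- c(1, NA, 3, NA, NA, 5)

# Base R expressions that compute the same quantities as the naniar helpers
base_any_na        <- anyNA(x)         # compare with any_na(x)
base_n_miss        <- sum(is.na(x))    # compare with n_miss(x)
base_prop_miss     <- mean(is.na(x))   # compare with prop_miss(x)
base_n_complete    <- sum(!is.na(x))   # compare with n_complete(x)
base_prop_complete <- mean(!is.na(x))  # compare with prop_complete(x)

# NA propagates through arithmetic unless it is removed explicitly
sum_with_na    <- sum(x)                # NA
sum_without_na <- sum(x, na.rm = TRUE)  # 9

# Logical operators return a value only when the answer
# does not depend on what the missing value might be
or_true   <- NA | TRUE   # TRUE
or_false  <- NA | FALSE  # NA
and_false <- NA & FALSE  # FALSE
```

The logical results follow from three-valued logic: `NA | TRUE` is TRUE because the answer would be TRUE for either possible value of the NA, while `NA | FALSE` stays NA because the answer depends on it.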
How to Summarize Missing Values
Basic summaries of missingness:
- n_miss(): number missing
- n_complete(): number complete
Dataframe summaries of missingness:
- miss_var_summary(): summarizes the number of missings in each variable
- miss_case_summary(): summarizes the number of missings in each case; each case represents a dataset row number
- miss_var_table(): returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected
- miss_case_table(): returns the same information, but for cases
- These functions work with group_by
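As a minimal sketch of the dataframe summaries listed above, the following runs two of them on the built-in airquality dataset. The object names var_summary and case_table are chosen only for this example.

```r
library(naniar)

# One row per variable, with the number and percentage of missing values
var_summary <- miss_var_summary(airquality)

# One row per "number of missings in a case", with how many cases have that many
case_table <- miss_case_table(airquality)

print(var_summary)
print(case_table)
```

In airquality, only Ozone (37 missing) and Solar.R (7 missing) contain NA values, so the remaining variables should show zero missings in the variable summary.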
Spans of missing data
- miss_var_span(df, var=, span_every=): calculates the number of missings in a variable for a repeating span
- miss_var_run(var=): returns the "runs" or "streaks" of missingness
Using summaries with group_by
airquality %>% group_by(Month) %>% miss_var_summary()
# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
# Calculate the summaries for each span of missingness,
# for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# For each month, calculate the runs of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
# For each month, calculate the summaries for each span of missingness,
# for a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
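To make "runs" and "spans" concrete, here is a base R sketch on a small made-up vector. It only illustrates the idea behind miss_var_run and miss_var_span; it is not naniar's implementation, and the vector `counts` and the column names in the result tables are assumptions for this example.

```r
# Hypothetical toy series: 12 observations with three streaks of missing values
counts <- c(5, NA, NA, 7, 8, NA, NA, NA, 2, 4, NA, 6)

# Runs: consecutive stretches of missing or complete values
runs <- rle(is.na(counts))
run_table <- data.frame(
  run_length = runs$lengths,
  is_na      = ifelse(runs$values, "missing", "complete")
)

# Spans: cut the series into blocks of `span_every` rows and count NAs per block
span_every <- 4
span_id <- ceiling(seq_along(counts) / span_every)
span_table <- data.frame(
  span_counter = sort(unique(span_id)),
  n_miss       = as.vector(tapply(is.na(counts), span_id, sum))
)
span_table$prop_miss <- span_table$n_miss / as.vector(table(span_id))

print(run_table)
print(span_table)
```

A run table answers "how long do the gaps last?", while a span table answers "how much is missing in each fixed-size block?", which is why the two are useful together for time-ordered data such as pedestrian.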
How to Visualize Missing Values
naniar provides a friendly family of missing data visualization functions.
- Each visualization corresponds to a data summary.
- Visualizations help you operate closer to the speed of thought.
- Get a bird's eye view of the missing data
vis_miss(airquality)
produces a heatmap of missingness
- Can also cluster data:
vis_miss(airquality, cluster = TRUE)
orders rows by missingness to identify common co-occurrences
- Look at missings in variables and cases
gg_miss_var(airquality)
each point represents the amount of missingness in that variable
gg_miss_case(airquality)
each line represents the amount of missingness in that case. The ordering in gg_miss_case can be turned off with the option order_cases = FALSE.
gg_miss_var(airquality, facet = Month)
gg_miss_upset(airquality)
visualizes the common combinations of missingness: which variables and cases go missing together
gg_miss_fct
To explore how missingness in each variable changes across a factor, use gg_miss_fct. This displays a heatmap visualization showing the factor levels on the x axis, each other variable on the y axis, and the amount of missingness colored from dark purple to yellow. gg_miss_fct does not support faceting.
gg_miss_span is the visual analogue of miss_var_span. It calculates the number of missings in a given span (for example, the number of missings in every 3000 rows) and displays the amount of missing values in each span as a filled barplot. gg_miss_span supports faceting.
# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
# Using the pedestrian dataset, explore the impact of month by faceting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts , span_every = 1000, facet = month)
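The notes describe gg_miss_fct without showing a call, so here is a short sketch using airquality with Month as the factor. The object name p_fct and the plot title are made up for this example.

```r
library(naniar)
library(ggplot2)

# Heatmap of missingness in each airquality variable, split by Month
p_fct <- gg_miss_fct(x = airquality, fct = Month)

# The result is a ggplot object, so it can be extended like any other plot
p_fct + labs(title = "Missingness in airquality by month")
```

Because the factor already occupies the x axis, this plot plays the role that faceting plays for the other gg_miss_* functions.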
Chapter 2: Wrangling and Tidying Missing Values
Searching for and replacing missing values
- Ideal = NA
- Missing values can be coded incorrectly: e.g. "missing", "Not Available", "N/A"
- Assuming that missing values are coded as NA is a mistake.
# Search for suspected missing-value codes with miss_scan_count()
chaos %>% miss_scan_count(search = list("N/A"))
# miss_scan_count() can also take multiple values in the search list
chaos %>% miss_scan_count(search = list("N/A", "N/a"))
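The chaos dataset is not included in these notes, so the sketch below builds a small stand-in to show what miss_scan_count returns. The data frame chaos_demo and its values are hypothetical and exist only for this illustration.

```r
library(naniar)

# Hypothetical stand-in for `chaos`: oddly coded missing values in two columns
chaos_demo <- data.frame(
  score = c(3, -99, 8, -99, 5),
  grade = c("A", "N/A", "B", "N/a", "N/A"),
  stringsAsFactors = FALSE
)

# Count occurrences of one suspected missing-value code in every variable
scan_one <- miss_scan_count(chaos_demo, search = list("N/A"))

# Count occurrences of either of two codes
scan_two <- miss_scan_count(chaos_demo, search = list("N/A", "N/a"))

print(scan_one)
print(scan_two)
```

Each result has one row per variable, which makes it easy to see which columns need cleaning before the values are replaced with NA.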
Replacing Missing Values
# Use chaos, then replace the values "N/A" and "N/a" in the variable "grade" with NA
chaos %>% replace_with_na(replace = list(grade = c("N/A", "N/a")))
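A minimal runnable version of this replacement, assuming a small made-up data frame in place of chaos (chaos_demo and its values are hypothetical):

```r
library(naniar)

# Hypothetical stand-in for `chaos`
chaos_demo <- data.frame(
  score = c(3, -99, 8, -99, 5),
  grade = c("A", "N/A", "B", "N/a", "N/A"),
  stringsAsFactors = FALSE
)

# Replace "N/A" and "N/a" with NA, but only in the grade column
cleaned <- replace_with_na(chaos_demo, replace = list(grade = c("N/A", "N/a")))

# Check the effect: grade now has missing values, score is untouched
n_miss(cleaned$grade)
n_miss(cleaned$score)
```

The -99 values in score are left alone here because the replace list names only grade; each variable needs its own entry.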
"scoped variants" of replace_with_na
replace_with_na can be repetitive:
- Using it across many different variables and values
- Complex cases, such as replacing values less than -1, or only affecting character columns
The scoped variants cover these cases:
- replace_with_na_all(): all variables
- replace_with_na_at(): a subset of selected variables
- replace_with_na_if(): a subset of variables that fulfill some condition (numeric, character)
# "chaos THEN replace_with_na_all, where the variable is equal to -99."
chaos %>% replace_with_na_all(condition = ~.x == -99)
# replace "N/A", "missing", "na" values
chaos %>% replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
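The notes show only replace_with_na_all, so here is a sketch of the other two scoped variants on a made-up data frame. The data frame chaos_demo, its columns, and the result names are assumptions for this example.

```r
library(naniar)

# Hypothetical stand-in for `chaos`
chaos_demo <- data.frame(
  score  = c(3, -99, 8, -99, 5),
  weight = c(-99, 60, 71, 65, 80),
  grade  = c("A", "N/A", "B", "N/a", "N/A"),
  stringsAsFactors = FALSE
)

# _at: apply the condition only to the named variables
at_fixed <- replace_with_na_at(
  chaos_demo,
  .vars = c("score"),
  condition = ~.x == -99
)

# _if: apply the condition only to variables that satisfy a predicate
if_fixed <- replace_with_na_if(
  chaos_demo,
  .predicate = is.character,
  condition = ~.x %in% c("N/A", "N/a")
)
```

In at_fixed the -99 in weight is kept because weight was not selected, and in if_fixed the numeric columns are never tested because they fail the is.character predicate.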