Course Notes: Dealing With Missing Data in R
Dealing with Missing Data in R
Chapter 1: Why Care About Missing Data?
How do I check if I have missing values?
# None of the functions used in this course work without the 'naniar' package.
install.packages("naniar")
library('naniar')
library('dplyr')
x <- c(1, NA, 3, NA, NA, 5) # sample vector
any_na(x) # returns T/F
are_na(x) # returns vector of T/F values
n_miss(x) # returns number of missing values
prop_miss(x) # proportion of missing values
n_complete(x) # number of complete records
prop_complete(x) # proportion of complete records
# operations with NA values
1 + NA     # NA
NA + NA    # NA
NA | TRUE  # TRUE
NA | FALSE # NA
How to Summarize Missing Values
Basic summaries of missingness:
- n_miss(): number missing
- n_complete(): number complete
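As a rough illustration of what these counting helpers compute, the same numbers can be obtained in base R with is.na(). This is a sketch only, not naniar's implementation; the variable names are made up for the example, and the vector is the sample vector from earlier in these notes.

```r
# Sample vector with three missing values out of six
x <- c(1, NA, 3, NA, NA, 5)

n_missing    <- sum(is.na(x))    # count of NA entries
n_observed   <- sum(!is.na(x))   # count of non-NA entries
prop_missing <- mean(is.na(x))   # share of entries that are NA
has_missing  <- anyNA(x)         # TRUE if at least one NA is present

# NA propagates through arithmetic unless it is removed explicitly
total_with_na    <- sum(x)                # NA
total_without_na <- sum(x, na.rm = TRUE)  # 9
```

Because is.na() returns a logical vector, sum() counts the TRUE values and mean() gives their proportion.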
Dataframe summaries of missingness:
- miss_var_summary(): summarizes the number of missings in each variable
- miss_case_summary(): summarizes the number of missings in each case, where each case is a dataset row
- miss_var_table(): returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected
- miss_case_table(): returns the same information, but for cases
- These functions work with group_by
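To make the idea of per-variable, per-case, and per-group summaries concrete, here is a base R sketch on the built-in airquality dataset. It only approximates the counts that the naniar summaries report; the output format of the naniar functions is different, and the variable names below are invented for the example.

```r
# Logical matrix: TRUE where a value is missing
missing_flags <- is.na(airquality)

# Missings per variable (roughly what miss_var_summary() counts)
n_miss_by_var <- colSums(missing_flags)

# Missings per case, i.e. per row (roughly what miss_case_summary() counts)
n_miss_by_case <- rowSums(missing_flags)

# How many cases have 0, 1, 2, ... missings (similar in spirit to miss_case_table())
case_table <- table(n_miss_by_case)

# Grouped summary: missing Ozone values within each Month
ozone_miss_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)

print(n_miss_by_var)
print(case_table)
print(ozone_miss_by_month)
```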
Spans of missing data
- miss_var_span(df, var=, span_every=): calculates the number of missings in a variable for a repeating span
- miss_var_run(df, var=): returns the "runs" or "streaks" of missingness
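The notions of a "run" and a "span" can be sketched in base R with rle() and tapply(). This is an illustration of the two concepts under simple assumptions (a short vector and a span of 3), not a reimplementation of miss_var_run() or miss_var_span().

```r
x <- c(1, NA, 3, NA, NA, 5)

# Runs: consecutive stretches of missing or complete values
runs <- rle(is.na(x))
run_table <- data.frame(
  run_length = runs$lengths,
  is_na      = ifelse(runs$values, "missing", "complete")
)

# Spans: cut the vector into fixed-size blocks and count missings per block
span_every <- 3
span_id <- ceiling(seq_along(x) / span_every)
n_miss_per_span <- tapply(is.na(x), span_id, sum)

print(run_table)
print(n_miss_per_span)
```

Runs follow the data (their lengths vary), while spans have a fixed size chosen in advance.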
Using summaries with group_by
airquality %>% group_by(Month) %>% miss_var_summary()
# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
# Calculate the summaries for each span of missingness,
# for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# For each value of month, calculate the runs of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
# For each value of month, calculate the spans of missingness
# for a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
How to Visualize Missing Values
- naniar provides a friendly family of missing data visualization functions.
- Each visualization corresponds to a data summary.
- Visualizations help you operate closer to the speed of thought.
- Get a bird's eye view of the missing data
  - vis_miss(airquality): produces a heatmap of missingness
- Can also cluster data:
  - vis_miss(airquality, cluster = TRUE): orders rows by missingness to identify common co-occurrences
- Look at missings in variables and cases
  - gg_miss_var(airquality): each point represents the amount of missingness in that variable
  - gg_miss_case(airquality): each line represents the amount of missingness in that case. The ordering in gg_miss_case can be turned off with the option order_cases = FALSE.
  - gg_miss_var(airquality, facet = Month): the same plot, faceted by Month
  - gg_miss_upset(airquality): visualizes the common combinations of missingness, that is, which variables and cases go missing together
  - gg_miss_fct: to explore how missingness in each variable changes across a factor, use gg_miss_fct. This displays a heatmap with the factor levels on the x axis, each other variable on the y axis, and the amount of missingness colored from dark purple to yellow. gg_miss_fct does not support faceting.
  - gg_miss_span: the visual analogue of miss_var_span. It calculates the number of missings in each span, for example for every 3000 rows, and displays the amount of missing values in each span in a filled barplot. gg_miss_span supports faceting.
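Since each visualization corresponds to a data summary, it can help to look at the kind of table that sits behind a plot of missingness combinations. The base R sketch below counts which variables go missing together in airquality. It is an assumption-laden illustration of the idea behind gg_miss_upset, not the computation naniar performs, and the "none missing" label is made up for the example.

```r
# Logical matrix: TRUE where a value is missing
missing_flags <- is.na(airquality)

# Label every row with the set of variables that are missing in it
pattern <- apply(missing_flags, 1, function(row) {
  vars <- colnames(missing_flags)[row]
  if (length(vars) == 0) "none missing" else paste(vars, collapse = " + ")
})

# Count how often each combination occurs, most frequent first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```

Rows with the same label share the same missingness pattern, which is what an upset-style plot draws as one bar.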
# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
# Using the pedestrian dataset, explore the impact of month by faceting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)
Chapter 2: Wrangling and Tidying Missing Values
Searching for and replacing missing values
- Ideal: missing values are coded as NA
- Missing values can be coded incorrectly, e.g. "missing", "Not Available", "N/A"
- Assuming that missing values are always coded as NA is a mistake.
miss_scan_count() searches for the given values and counts how often they occur in each variable:
chaos %>% miss_scan_count(search = list("N/A"))
# miss_scan_count() can also take multiple values in the search list
chaos %>% miss_scan_count(search = list("N/A", "N/a"))
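The idea of scanning for stray missing-value codes can be sketched in base R. The chaos dataset from the course is not reproduced here, so the data frame chaos_demo below is a small hypothetical stand-in, and the code shows only the counting idea, not how miss_scan_count() is implemented.

```r
# Hypothetical stand-in for the course's chaos dataset
chaos_demo <- data.frame(
  score = c("3", "N/A", "5", "N/a"),
  grade = c("A", "N/A", "missing", "B"),
  stringsAsFactors = FALSE
)

# Values that should be treated as missing-value codes
search_values <- c("N/A", "N/a")

# For each column, count how many entries match one of the codes
scan_counts <- vapply(
  chaos_demo,
  function(column) sum(column %in% search_values),
  integer(1)
)
print(scan_counts)
```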
Replacing Missing Values
# Use chaos, then replace the values "N/A" and "N/a" in the variable grade with NA
chaos %>% replace_with_na(replace = list(grade = c("N/A", "N/a")))
"scoped variants" of replace_with_na
replace_with_na can be repetitive:
- Use it across many different variables and values
- Complex cases, e.g. replacing values less than -1, or only affecting character columns
- replace_with_na_all(): all variables
- replace_with_na_at(): a subset of selected variables
- replace_with_na_if(): a subset of variables that fulfill some condition (numeric, character)
# "chaos THEN replace_with_na_all, where the variable is equal to -99."
chaos %>% replace_with_na_all(condition = ~.x == -99)
# replace "N/A", "missing", "na" values
chaos %>% replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
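For readers who want to see the replacement logic without naniar, here is a base R sketch that replaces placeholder codes with NA, handling numeric and character columns separately in the spirit of the "if" variant. The data frame chaos_demo is hypothetical (the course's chaos dataset is not reproduced here), and the code is an illustration rather than the package's implementation.

```r
# Hypothetical stand-in for the course's chaos dataset
chaos_demo <- data.frame(
  score = c(3, -99, 5),
  grade = c("A", "N/A", "missing"),
  stringsAsFactors = FALSE
)

cleaned <- chaos_demo

# Numeric columns: treat -99 as a missing-value code
is_num <- vapply(cleaned, is.numeric, logical(1))
cleaned[is_num] <- lapply(cleaned[is_num], function(column) {
  replace(column, column %in% -99, NA)
})

# Character columns: treat these strings as missing-value codes
is_chr <- vapply(cleaned, is.character, logical(1))
cleaned[is_chr] <- lapply(cleaned[is_chr], function(column) {
  replace(column, column %in% c("N/A", "missing", "na"), NA)
})

print(cleaned)
```

Using %in% rather than == keeps the comparison from returning NA for entries that are already missing.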