Course Notes: Exploratory Data Analysis in R
  • 4 Things to Consider during Exploratory Analysis

    • Centre
    • Variability
    • Shape
    • Outliers

    Contingency table review

    In this chapter you'll continue working with the comics dataset introduced in the video. It is a collection of characteristics of all the superheroes created by Marvel and DC Comics in the last 80 years.

    Let's start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.

    Dropping levels

    The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

    In R, this requires two steps: first filter out any rows with the levels that have very low counts, then remove those levels from the factor variable with droplevels(). Both steps are needed because droplevels() only drops levels that no longer occur in the data; a level with even 1 or 2 observations is kept.
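A minimal sketch of why both steps are needed. It assumes a hypothetical toy factor in place of comics$align and uses base subsetting in place of dplyr's filter(), so that it runs without extra packages:

```r
# Hypothetical toy factor standing in for comics$align
align <- factor(c("Good", "Bad", "Bad", "Neutral", "Reformed Criminals", "Good"))

# Step 1: filter out the rare level; the rows go, but the level label stays
filtered <- align[align != "Reformed Criminals"]
levels(filtered)   # still lists "Reformed Criminals"
table(filtered)    # ... now with a count of 0

# Step 2: drop levels that no longer occur in the data
dropped <- droplevels(filtered)
levels(dropped)    # "Reformed Criminals" is gone
```

Without step 2, the empty level would still show up as a zero row in tables and as an empty slot on plot axes.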

    # Print the comics data
    comics
    
    # Check levels of align
    levels(comics$align)
    
    # Check the levels of gender
    levels(comics$gender)
    
    # Create a 2-way contingency table and store it so it can be printed later
    tab <- table(comics$align, comics$gender)
    
    # Load dplyr
    library(dplyr)
    
    # Print tab
    tab
    
    # Remove align level
    comics_filtered <- comics %>%
      filter(align != "Reformed Criminals") %>%
      droplevels()
    
    # See the result
    comics_filtered

    Counts vs. proportions (2)

    Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you'll construct two bar charts in this exercise: one of counts and one of proportions.
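What a proportion is conditioned on can be made concrete with prop.table(). The sketch below uses hypothetical counts (not the real comics data) and normalizes the same table three ways:

```r
# Hypothetical counts standing in for table(comics$align, comics$gender)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(align = c("Bad", "Good"),
                              gender = c("Female", "Male")))

# Joint proportions: every cell divided by the grand total
joint <- prop.table(tab)

# Conditional on align: each row sums to 1
# (this is what geom_bar(position = "fill") shows when x = align)
by_align <- prop.table(tab, 1)

# Conditional on gender: each column sums to 1
by_gender <- prop.table(tab, 2)

by_align
by_gender
```

The same cell can look large under one conditioning and small under the other, which is why the choice changes the story a bar chart tells.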

    Marginal bar chart

    If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a bar chart for just that single variable.

    You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Good" and "Bad", so it makes sense to put that bar in the middle.

    Conditional bar chart

    Now, if you want to break down the distribution of alignment based on gender, you're looking for conditional distributions.

    You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.

    Improve pie chart

    Pie charts are a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than bar charts.

    This is a pie chart of a dataset called pies that contains the favorite pie flavors of 98 people. Improve the representation of these data by constructing a bar chart that is ordered in descending order of count.
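The descending order does not have to be typed by hand; it can be computed from the counts. A small sketch, assuming a hypothetical vector of responses in place of pies$flavor:

```r
# Hypothetical responses standing in for pies$flavor
flavor <- c("cherry", "apple", "apple", "key lime", "apple", "key lime")

# Count each flavor, sort the counts, and reuse the names as the level order
lev <- names(sort(table(flavor), decreasing = TRUE))
flavor <- factor(flavor, levels = lev)

levels(flavor)
```

Because ggplot2 draws bars in factor-level order, a factor releveled this way should plot from the most to the least common flavor.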

    # Load ggplot2 (needed for all plots in these notes)
    library(ggplot2)
    
    # Plot of gender by align
    ggplot(comics, aes(x = align, fill = gender)) +
      geom_bar()
      
    # Plot proportion of gender, conditional on align
    ggplot(comics, aes(x = align, fill = gender)) + 
      geom_bar(position = "fill") +
      ylab("proportion")
    
    
    # Change the order of the levels in align
    comics$align <- factor(comics$align, 
                           levels = c("Bad", "Neutral", "Good"))
    
    # Create plot of align
    ggplot(comics, aes(x = align)) + 
      geom_bar()
    
    # Plot of alignment broken down by gender
    ggplot(comics, aes(x = align)) + 
      geom_bar() +
      facet_wrap(~ gender)
    
    
    # Put levels of flavor in descending order
    lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
    pies$flavor <- factor(pies$flavor, levels = lev)
    
    # Create bar chart of flavor
    ggplot(pies, aes(x = flavor)) + 
      geom_bar(fill = "chartreuse") + 
      theme(axis.text.x = element_text(angle = 90))

    Boxplots and density plots

    The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you'll try your hand at two alternatives: the box plot and the density plot.

    # Filter cars with 4, 6, 8 cylinders
    common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))
    
    # Create box plots of city mpg by ncyl
    ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
      geom_boxplot()
    
    # Create overlaid density plots for same data
    ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
      geom_density(alpha = .3)

    Marginal and conditional histograms

    Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

    Three binwidths

    Before you take these plots for granted, it's a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It's good practice to consider several binwidths in order to detect different types of structure in your data.
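To see the effect of binwidth without drawing anything, the sketch below bins one variable at the three widths used in these notes. It assumes the built-in mtcars$hp column as a stand-in for cars$horsepwr, and count_bins() is a hypothetical helper; ggplot2 may place its bin edges differently.

```r
# mtcars ships with base R; its hp column stands in for cars$horsepwr
hp <- mtcars$hp

# Hypothetical helper: count observations in left-closed bins of a given width
count_bins <- function(x, binwidth) {
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- (floor(max(x) / binwidth) + 1) * binwidth
  breaks <- seq(lower, upper, by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE))
}

bins_3  <- count_bins(hp, 3)
bins_30 <- count_bins(hp, 30)
bins_60 <- count_bins(hp, 60)

# Narrow bins: many bins, most of them empty or nearly so (jagged)
length(bins_3)
# Wide bins: few bins, each pooling many cars (smooth, but detail is lost)
length(bins_60)
```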

    # Create hist of horsepwr
    cars %>%
      ggplot(aes(horsepwr)) +
      geom_histogram() +
      ggtitle("Distribution of Horsepower")
    
    # Create hist of horsepwr for affordable cars
    cars %>% 
      filter(msrp < 25000) %>%
      ggplot(aes(horsepwr)) +
      geom_histogram() +
      xlim(c(90, 550)) +
      ggtitle("Distribution of Horsepower for Affordable Cars")
    
    # Create hist of horsepwr with binwidth of 3
    cars %>%
      ggplot(aes(horsepwr)) +
      geom_histogram(binwidth = 3) +
      ggtitle("Hist of Horsepwr with binwidth of 3")
    
    # Create hist of horsepwr with binwidth of 30
    cars %>%
      ggplot(aes(horsepwr)) +
      geom_histogram(binwidth = 30) +
      ggtitle("Hist of Horsepwr with binwidth of 30")
    
    # Create hist of horsepwr with binwidth of 60
    cars %>%
      ggplot(aes(horsepwr)) +
      geom_histogram(binwidth = 60) +
      ggtitle("Hist of Horsepwr with binwidth of 60")
    

    Box plots for outliers

    In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer's suggested retail price) to detect if there are unusually expensive or cheap cars.
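The points a box plot draws individually can also be listed numerically. A minimal sketch, assuming hypothetical prices in place of cars$msrp and base R's boxplot.stats(); geom_boxplot() computes its hinges slightly differently, so the flagged points may not match exactly on real data.

```r
# Hypothetical prices in thousands of dollars, standing in for cars$msrp
msrp_k <- c(10, 12, 13, 15, 16, 18, 20, 95)

stats <- boxplot.stats(msrp_k)   # default coef = 1.5
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond the whiskers, i.e. flagged as outliers

# Drop the flagged points before re-plotting
msrp_no_out <- msrp_k[!msrp_k %in% stats$out]
```

The notes use a fixed cutoff (msrp < 100000) instead; a rule-based cutoff like this one is just another way to pick the threshold.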

    Plot selection

    Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

    # Construct box plot of msrp
    cars %>%
      ggplot(aes(x = 1, y = msrp)) +
      geom_boxplot()
    
    # Exclude outliers from data
    cars_no_out <- cars %>%
      filter(msrp < 100000)
    
    # Construct box plot of msrp using the reduced dataset
    cars_no_out %>%
      ggplot(aes(x = 1, y = msrp)) +
      geom_boxplot()
    
    # Create plot of city_mpg
    cars %>%
      ggplot(aes(1, city_mpg)) +
      geom_boxplot()
    
    # Create plot of width
    cars %>% 
      ggplot(aes(width)) +
      geom_density()

    3 variable plot

    Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

    Calculate spread measures

    Let's extend the powerful group_by() and summarize() syntax to measures of spread. If you're unsure whether you're working with symmetric or skewed distributions, it's a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
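Why IQR counts as robust can be seen by adding a single extreme value to a small sample. The numbers below are hypothetical, chosen only for illustration:

```r
# Hypothetical, roughly symmetric life expectancies
life_exp <- c(60, 62, 64, 66, 68)

# The same sample with one extreme value added
life_exp_out <- c(life_exp, 120)

# Standard deviation reacts strongly to the extreme value
sd(life_exp)
sd(life_exp_out)

# IQR depends only on the middle half of the data, so it barely moves
IQR(life_exp)
IQR(life_exp_out)
```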

    Choose measures for center and spread

    Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you'll select the measures and then calculate them.

    # Facet hists using hwy mileage and ncyl
    common_cyl %>%
      ggplot(aes(x = hwy_mpg)) +
      geom_histogram() +
      facet_grid(ncyl ~ suv) +
      ggtitle("Facet hists using hwy mileage and ncyl")
    
    
    # Compute groupwise measures of spread
    gap2007 %>%
      group_by(continent) %>%
      summarize(sd(lifeExp),
                IQR(lifeExp),
                n())
    
    # Generate overlaid density plots
    gap2007 %>%
      ggplot(aes(x = lifeExp, fill = continent)) +
      geom_density(alpha = 0.3)

    Transformations

    Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.

    Here you'll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
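A rough numeric check of what the log transform does to right skew, using hypothetical population values (not the gap2007 data). In a right-skewed variable the mean sits far above the median; after taking logs the two are much closer.

```r
# Hypothetical populations spanning several orders of magnitude
pop <- c(1e5, 5e5, 1e6, 5e6, 1e7, 1.3e9)

# Raw scale: one huge value drags the mean far above the median
mean(pop) / median(pop)

# Log scale: the same values are spread far more evenly
log_pop <- log(pop)
mean(log_pop) - median(log_pop)

# The transformation loses no information: exp() undoes it
all.equal(exp(log_pop), pop)
```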

    Identify outliers

    Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.
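One way to test a guess in the console, sketched with a hypothetical data frame whose placeholder countries and values are invented for illustration. It uses base subsetting rather than dplyr's filter() so it runs on its own:

```r
# Hypothetical stand-in for the Asian rows of gap2007
asia <- data.frame(
  country = c("A", "B", "C", "D"),
  lifeExp = c(72.4, 43.8, 65.5, 78.6)
)

# Lowest life expectancy, and the row it belongs to
min(asia$lifeExp)
lowest <- asia[which.min(asia$lifeExp), ]
lowest$country

# Flag rows below a chosen cutoff, then keep the rest
asia$is_outlier <- asia$lifeExp < 50
asia_no_out <- asia[!asia$is_outlier, ]
```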

    # Create density plot of old variable
    gap2007 %>%
      ggplot(aes(x = pop)) +
      geom_density()
    
    # Transform the skewed pop variable
    gap2007 <- gap2007 %>%
      mutate(log_pop = log(pop))
    
    # Create density plot of new variable
    gap2007 %>%
      ggplot(aes(x = log_pop)) +
      geom_density()
    
    # Filter for Asia, add column indicating outliers
    gap_asia <- gap2007 %>%
      filter(continent == "Asia") %>%
      mutate(is_outlier = lifeExp < 50)
    
    # Remove outliers, create box plot of lifeExp
    gap_asia %>%
      filter(!is_outlier) %>%
      ggplot(aes(x = 1, y = lifeExp)) +
      geom_boxplot()