Visualizing distributions

  • one continuous variable - histogram, box plot
  • two continuous variables - scatter plot
  • two continuous variables where consecutive values are related - line plot
  • counts of a categorical variable - bar plot
  • percentages of a categorical variable - stacked bar plot
  • counts/percentages of a categorical variable that are best shown on a logarithmic scale - dot plot (similar to a bar plot)

A plot tells a thousand words

Three ways of getting insights

  • Calculating summary statistics - mean, median, standard deviation
  • Running models - linear and logistic regression
  • Drawing plots - scatter, bar, histogram

Continuous and categorical variables

  • Continuous - usually numbers
    • Things you can do arithmetic on
    • Heights, temperatures, revenues
  • Categorical - usually text
    • Things that can be classified
    • Eye colors, countries, industries
  • Can be either
    • Age is continuous, but age group is categorical
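
The age example can be sketched in a few lines of Python. The function name `age_group` and the cut points (18 and 65) are illustrative assumptions, not a standard grouping.

```python
def age_group(age):
    """Map a continuous age (in years) to a categorical age group.

    The cut points are illustrative assumptions, not a standard.
    """
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"


ages = [4, 17, 18, 42, 64, 65, 90]          # continuous: arithmetic makes sense
groups = [age_group(a) for a in ages]       # categorical: only labels remain
print(groups)
```

After the conversion you can count how many people fall in each group, but you can no longer compute a meaningful mean.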

Histograms

When should you use a histogram?

  1. If you have a single continuous variable.
  2. You want to ask questions about the shape of its distribution.

Choosing binwidth

  • The appearance of a histogram is strongly influenced by the choice of binwidth.
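
A minimal sketch of why binwidth matters, using only the standard library. The helper `histogram_counts` and the sample values are made up for illustration; bins are labeled by their left edge and empty bins are omitted.

```python
from collections import Counter


def histogram_counts(values, binwidth):
    """Count values per bin [k*binwidth, (k+1)*binwidth), keyed by left edge."""
    bins = Counter((v // binwidth) * binwidth for v in values)
    return dict(sorted(bins.items()))


# Hypothetical data with two clusters: one around 2 and one around 8.
values = [1, 2, 2, 3, 7, 8, 8, 9]

for binwidth in (1, 5, 10):
    print(binwidth, histogram_counts(values, binwidth))
```

With a binwidth of 1 the two clusters and the gap between them are visible, with 5 the histogram looks flat, and with 10 everything collapses into a single bar.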

Modality: how many peaks?

  • Unimodal - distribution with one peak
  • Bimodal - distribution with two peaks
  • Trimodal - distribution with three peaks

Skewness: is it symmetric?

  • Left-skewed - most of the data are on the right, with a few smaller values (outliers) on the left side of the histogram
  • Symmetric - the distribution has about the same shape on either side of the middle
  • Right-skewed - most of the data are on the left, with a few larger values (outliers) on the right side of the histogram
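
Skewness can also be put into a number. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); this is one common definition and an assumption here, since these notes only describe skewness visually. The sample data are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 1, 2, 2, 3, 10]            # a few large values on the right
left_skewed = [-v for v in right_skewed]      # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
print(skewness(symmetric))      # zero
```

A positive result corresponds to a right-skewed histogram, a negative one to a left-skewed histogram.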

Kurtosis: how many extreme values?

  • Leptokurtic - has a narrow peak and lots of extreme values
  • Mesokurtic - bell curve from a normal distribution
  • Platykurtic - broad peak and few extreme values
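
As a rough numeric counterpart, excess kurtosis (fourth central moment divided by the squared second moment, minus 3) is positive for leptokurtic data, near zero for mesokurtic data, and negative for platykurtic data. The formula choice and the sample data below are assumptions made for illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Narrow peak at 0 plus two extreme values: leptokurtic.
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
# Evenly spread values, broad peak, no extremes: platykurtic.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(excess_kurtosis(peaked))  # positive
print(excess_kurtosis(flat))    # negative
```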

Box plots

  • Inter-quartile range - difference between the upper quartile and the lower quartile (the length of the box)
  • Points - extreme values that fall outside the range of the whiskers
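
The numbers behind a box plot can be sketched with the standard library. The rule that whiskers reach at most 1.5 times the inter-quartile range beyond the box is a common convention and an assumption here; these notes do not state which rule is used. The function name and data are illustrative.

```python
import statistics


def box_plot_stats(values, whisker_factor=1.5):
    """Return quartiles, IQR, and points outside the whisker fences.

    whisker_factor=1.5 is a common convention, assumed here.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
stats = box_plot_stats(values)
print(stats)
```

Here 50 lies beyond the upper fence, so it would be drawn as an individual point rather than covered by a whisker.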

Visualizing two variables

Scatter plots

Correlation

  • Negative correlation - y decreases as x increases
  • Positive correlation - y increases as x increases
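
The direction of a relationship can be checked numerically with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the sample data are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [10, 8, 7, 5, 1]   # tends to fall as x rises

print(pearson(x, y_up))     # positive
print(pearson(x, y_down))   # negative
```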

Line plots

  • Consecutive data points are connected

Bar plots

  • Plots counts or percentages of a categorical variable
  • Stacked bar plot - plots percentages of a categorical variable
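
The values behind a bar plot are simply counts (or percentages) per category. A minimal sketch with made-up eye-color data:

```python
from collections import Counter

# Hypothetical observations of a categorical variable.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)                 # bar heights for a count bar plot
total = sum(counts.values())
percentages = {color: 100 * n / total for color, n in counts.items()}

print(counts)
print(percentages)                           # segments of a stacked bar
```

The percentages sum to 100, which is why they can be stacked into a single bar.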

Dot plots

  • Similar to a bar plot: it shows the counts/percentages of a categorical variable
  • The difference is that the values can be shown on a logarithmic scale (bars must start at zero, which a log scale does not have)
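
A small sketch of what a logarithmic axis does to dot positions. The category names and counts are made up for illustration.

```python
from math import log10

# Hypothetical counts spanning several orders of magnitude.
counts = {"A": 10, "B": 1_000, "C": 100_000}

# Position of each dot on a log10 axis.
log_positions = {name: log10(n) for name, n in counts.items()}
print(log_positions)
```

On a linear axis A and B would both sit almost at zero next to C; on the log axis the three dots are evenly spaced.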

The color and the shape

Using color

Colorspaces: Hue-Chroma-Luminance

  • Hue-chroma-luminance (HCL) - colorspace designed for data visualization
  • Hue - color of the rainbow
  • Chroma - intensity of a color (gray to bright)
  • Luminance - brightness of a color (black to white)
  • Viridis - a color scale designed to remain readable for people with color blindness

Three types of color scale: qualitative

  • Type: qualitative
  • Purpose: distinguish unordered categories
  • What to vary: hue

Three types of color scale: sequential

  • Type: sequential
  • Purpose: show ordering
  • What to vary: chroma or luminance

Three types of color scale: diverging

  • Type: diverging
  • Purpose: show above or below a midpoint
  • What to vary: chroma or luminance, with 2 hues
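
The three scale types can be sketched with Python's built-in `colorsys` module. Note the assumption: `colorsys` offers HLS (hue, lightness, saturation), not HCL, so lightness stands in for luminance here and the result is not perceptually uniform. The function names and default hues are illustrative.

```python
import colorsys


def to_hex(hue, lightness, saturation):
    """Convert an HLS color (all components in 0..1) to a hex string."""
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def qualitative(n, lightness=0.5, saturation=0.6):
    """Unordered categories: vary hue only."""
    return [to_hex(i / n, lightness, saturation) for i in range(n)]


def sequential(n, hue=0.6, saturation=0.6):
    """Ordered values: keep one hue, go from light to dark (n >= 2)."""
    return [to_hex(hue, 0.9 - 0.7 * i / (n - 1), saturation) for i in range(n)]


def diverging(n_per_side, hue_low=0.6, hue_high=0.0, saturation=0.6):
    """Two hues that both fade toward a neutral midpoint."""
    low = sequential(n_per_side, hue_low, saturation)[::-1]   # dark to light
    high = sequential(n_per_side, hue_high, saturation)       # light to dark
    return low + ["#ffffff"] + high


print(qualitative(4))
print(sequential(5))
print(diverging(3))
```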

Plotting many variables at once

Pair plot

  • You have up to ten variables (either continuous, categorical, or a mix).
  • You want to see the distribution for each variable.
  • You want to see the relationship between each pair of variables.

Correlation heatmap

  • You have lots of continuous variables.
  • You want a simple overview of how each pair of variables is related.
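
The grid of numbers behind a correlation heatmap is a correlation matrix: one Pearson coefficient per pair of variables. A minimal sketch follows; the variable names and data are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical continuous variables measured on the same five observations.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 66, 80, 90],
    "speed": [9, 8, 7, 5, 4],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each cell of the heatmap would be colored by its value, typically with a diverging scale centered on zero.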

Parallel coordinates plot

  • You have lots of continuous variables.
  • You want to find patterns across these variables.
  • You want to visualize clusters of observations.
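
Variables in a parallel coordinates plot usually have different units, so a common preparation step is to rescale each one to the range 0 to 1 before drawing. The sketch below assumes min-max scaling, which is one option among several; the names and data are illustrative.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]   # constant column: place it mid-axis
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical continuous variables measured on the same five observations.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 66, 80, 90],
    "speed": [9, 8, 7, 5, 4],
}

scaled = {name: min_max_scale(col) for name, col in columns.items()}

# Each observation becomes one line crossing every vertical axis.
lines = list(zip(*scaled.values()))
for line in lines:
    print(line)
```

Observations whose lines follow similar paths across the axes belong to the same cluster.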

99 problems but a plot ain't one of them

Sensory overload

Chartjunk

  • Any element of the plot that distracts the reader from getting insight
  • Pictures
  • Skeuomorphism - adding things that happen in the real world to virtual objects (reflections, shadows, etc.)
  • Extra dimensions
  • Ostentatious color or lines