[Course] Understanding Data Visualization
Visualizing distributions
- one value - histogram, box plot
- two values - scatter plot
- two values where consecutive values are related - line plot
- counts of a categorical variable - bar plot
- percentages of a categorical variable - stacked bar plot
- counts/percentages of a categorical variable, optionally shown on a logarithmic scale - dot plot (similar to a bar plot)
A plot tells a thousand words
Three ways of getting insights
- Calculating summary statistics - mean, median, standard deviation
- Running models - linear and logistic regression
- Drawing plots - scatter, bar, histogram
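As a minimal sketch of the first approach, the summary statistics listed above can be computed with Python's standard `statistics` module. The `heights` sample is made up purely for illustration.

```python
import statistics

# Hypothetical sample of heights in centimetres (invented for illustration).
heights = [150, 160, 165, 170, 180]

mean_height = statistics.mean(heights)      # arithmetic average
median_height = statistics.median(heights)  # middle value of the sorted data
std_height = statistics.stdev(heights)      # sample standard deviation

print(mean_height, median_height, round(std_height, 2))
```

Three numbers like these summarize the data compactly, but they say nothing about its shape, which is where plots come in.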
Continuous and categorical variables
- Continuous - usually numbers
- Things you can do arithmetic on
- Heights, temperatures, revenues
- Categorical - usually text
- Things that can be classified
- Eye colors, countries, industries
- Can be either
- Age is continuous, but age group is categorical
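The age example can be sketched in a few lines: the same measurement is continuous as a number of years and categorical once it is binned. The function name `age_group` and the cut points 18 and 65 are assumptions chosen only for illustration.

```python
def age_group(age):
    """Map a continuous age in years to a categorical age group.

    The cut points (18 and 65) are arbitrary and only illustrative.
    """
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"


# Hypothetical ages: a continuous variable you can do arithmetic on.
ages = [4, 17, 18, 42, 64, 65, 90]
mean_age = sum(ages) / len(ages)

# The same data as a categorical variable: values can only be classified.
groups = [age_group(a) for a in ages]
print(groups)
```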
Histograms
When should you use a histogram?
- If you have a single continuous variable.
- You want to ask questions about the shape of its distribution.
Choosing binwidth
- The appearance of a histogram is strongly influenced by the choice of binwidth.
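To make the effect of binwidth concrete, here is a rough sketch that bins the same made-up values with three different widths. The helper `histogram_counts` is hypothetical and only mimics what a plotting library does before it draws the bars.

```python
import math
from collections import Counter


def histogram_counts(values, binwidth):
    """Count values per bin, where each bin covers [start, start + binwidth)."""
    counts = Counter(math.floor(v / binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data with two clusters, around 2 and around 8.
values = [1, 2, 2, 3, 7, 8, 8, 9]

narrow = histogram_counts(values, 1)   # many small bars, noisy shape
medium = histogram_counts(values, 5)   # two bars, one per cluster
wide = histogram_counts(values, 20)    # a single bar, clusters merged

print(narrow)
print(medium)
print(wide)
```

With the widest bins both clusters fall into one bar, so the two peaks disappear entirely; with the narrowest bins every distinct value gets its own bar.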
Modality: how many peaks?
- Unimodal - distribution with one peak
- Bimodal - distribution with two peaks
- Trimodal - distribution with three peaks
Skewness: is it symmetric?
- Left-skewed - if most of the data are on the right, with a few smaller values showing up on the left side of the histogram
- Symmetric - they have about the same shape on either side of the middle
- Right-skewed - if most of the data are on the left side of the histogram, with a few larger values (outliers) showing up on the right
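Skewness can also be checked numerically. The sketch below uses the moment-based skewness coefficient, which is one common definition and not something the course notes specify; the sample lists are invented. A positive result indicates right skew, a negative result left skew, and a value near zero a roughly symmetric distribution.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples.
right_skewed = [1, 1, 2, 2, 3, 10]           # one large value on the right
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-v for v in right_skewed]     # mirror image of right_skewed

print(skewness(right_skewed), skewness(symmetric), skewness(left_skewed))
```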
Kurtosis: how many extreme values?
- Leptokurtic - has a narrow peak and lots of extreme values
- Mesokurtic - bell curve from a normal distribution
- Platykurtic - broad peak and few extreme values
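The three kurtosis categories can be told apart with excess kurtosis, sketched below under the usual moment-based definition (an assumption, since the notes do not give a formula). Values above zero suggest a leptokurtic shape, values near zero a mesokurtic one, and values below zero a platykurtic one. Both samples are made up.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (the normal value)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical samples.
heavy_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # narrow peak, extreme values
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]           # broad, no extreme values

print(excess_kurtosis(heavy_tails))  # positive: leptokurtic
print(excess_kurtosis(flat))         # negative: platykurtic
```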
Box plots
- Inter-quartile range - difference between lower quartile and upper quartile
- Points - extreme values, values that are outside the range of the whiskers
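The quantities behind a box plot can be sketched as follows. The notes do not say how far the whiskers extend, so this example assumes the common convention of 1.5 times the inter-quartile range beyond each quartile; the function name `box_plot_summary` and the data are hypothetical.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Return the IQR and the points that fall outside the whisker range.

    Assumes whiskers reach at most whisker_factor * IQR beyond the quartiles.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return iqr, outliers


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
iqr, outliers = box_plot_summary(data)
print(iqr, outliers)
```

Here the value 50 would be drawn as an individual point, while the rest of the data sits inside the box and whiskers.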
Visualizing two variables
Scatter plots
Correlation
- Negative correlation - y values decrease as x values increase
- Positive correlation - y values increase as x values increase
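The direction of a correlation can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The notes do not name a specific coefficient, so treat this as one possible choice; the data points are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]      # tends to increase with x
falling = [10, 8, 6, 4, 2]    # decreases exactly linearly with x

print(pearson_r(x, rising))   # positive correlation
print(pearson_r(x, falling))  # negative correlation
```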
Line plots
- Consecutive data points are connected
Bar plots
- Plots counts or percentages of a categorical variable
- Stacked bar plot - plots percentages of a categorical variable
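Before drawing either kind of bar plot, the categorical variable has to be reduced to counts or percentages. A minimal sketch with made-up eye color data:

```python
from collections import Counter

# Hypothetical observations of a categorical variable.
eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "brown"]

# Bar plot input: one count per category.
counts = Counter(eye_colors)

# Stacked bar plot input: each category's share of the total, in percent.
total = sum(counts.values())
percentages = {color: 100 * n / total for color, n in counts.items()}

print(dict(counts))
print(percentages)
```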
Dot plots
- Similar to a bar plot: it plots the counts/percentages of a categorical variable
- The difference is that the values can be shown on a logarithmic scale
The color and the shape
Using color
Colorspaces: Hue-Chroma-Luminance
- Hue-chroma-luminance (HCL) - colorspace designed for data visualization
- Hue - color of the rainbow
- Chroma - intensity of a color (gray to bright)
- Luminance - brightness of a color (black to white)
- Viridis - a color scale designed to be easily readable by people with color blindness
Three types of color scale: qualitative
- Type: qualitative
- Purpose: distinguish unordered categories
- What to vary: hue
Three types of color scale: sequential
- Type: sequential
- Purpose: show ordering
- What to vary: chroma or luminance
Three types of color scale: diverging
- Type: diverging
- Purpose: show above or below a midpoint
- What to vary: chroma or luminance, with 2 hues
Plotting many variables at once
Pair plot
- You have up to ten variables (either continuous, categorical, or a mix).
- You want to see the distribution for each variable.
- You want to see the relationship between each pair of variables.
Correlation heatmap
- You have lots of continuous variables.
- You want a simple overview of how each pair of variables is related.
Parallel coordinates plot
- You have lots of continuous variables.
- You want to find patterns across these variables.
- You want to visualize clusters of observations.
99 problems but a plot ain't one of them
Sensory overload
Chartjunk
- Any element of the plot that distracts from the reader getting insight
- Pictures
- Skeuomorphism - adding things that happen in the real world to virtual objects (reflections, shadows, etc.)
- Extra dimensions
- Ostentatious color or lines