Introduction to the Tidyverse

Run the hidden code cell below to import the data used in this course.


Tidyverse

Link to the Tidyverse for Beginners Cheat Sheet

Data Visualization Notes:

Create a subset of dataframe (gapminder_1952)

gapminder_1952 <- gapminder %>% filter(year == 1952)

Creating scatterplots

Useful for comparing two variables

ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) + geom_point()

This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis)) + geom_point()

Geom: Meaning that you're placing a geometric shape
Point: Meaning that each datapoint coincides with a point on the graph (aka a scatterplot)
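
A minimal runnable sketch of the scatterplot pattern above. It assumes the gapminder, dplyr, and ggplot2 packages are installed; the library() calls and the name p_scatter are my additions, not part of the course code.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

# Keep only the 1952 observations (one row per country)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Scatterplot: GDP per capita on x, life expectancy on y
p_scatter <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

print(p_scatter)
```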

Log Scales

Used when a variable spans several orders of magnitude (e.g., GDP per capita). On a logarithmic scale, each fixed distance along the axis represents a multiplication of the value (e.g., x10) rather than an addition.

This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis)) + geom_point() + scale_x_log10()

Additional Aesthetics

Color (good for categorical variables) -- R will auto add a legend
Size (good for showing varying sizes, e.g., population)

This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis, color = Var to dictate color, size = Var to dictate size)) + geom_point() + scale_x_log10()
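
A sketch of the template above filled in with gapminder columns. Mapping color to continent and size to pop is my choice for illustration, and the library() calls assume the packages are installed.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# color and size go inside aes() because they are mapped to variables
p_aes <- ggplot(gapminder_1952,
                aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10()  # x-axis spaced by multiplication instead of addition

print(p_aes)  # ggplot2 adds the color and size legends automatically
```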

Faceting

Used to further explore data by dividing plots into subplots based on a variable

This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis, color = Var to dictate color, size = Var to dictate size)) + geom_point() + scale_x_log10() + facet_wrap(~ Var Name)

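Note: scale_x_log10() is not required for faceting; it is kept here only because the x variable from the earlier examples (GDP per capita) reads better on a log scale
Note: ~ usually means "by" in R, so facet_wrap(~ continent) reads as "facet by continent"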
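
A sketch of faceting with gapminder columns, assuming the packages below are installed. Faceting by continent is my choice for illustration; the log scale is carried over from the earlier examples for readability.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# One subplot per continent; the ~ reads as "by"
p_facet <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)

print(p_facet)
```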

Grouping and Summarizing

Summarize verb

Used to create summary statistics; R has many of the standard built in functions, for example:

mean
median
sum
min
max

General code is: dataframe %>% summarize(NewVar = median(Var to be summarized))

Can also combine summarize with filters, for example: gapminder %>% filter(year == 1957) %>% summarize(medianLifeExp= median(lifeExp))
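
A runnable sketch of filter + summarize, assuming gapminder and dplyr are installed. The second summary column (maxGdpPercap) is added here only to show that one summarize() call can produce several statistics.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)

# Filter to one year, then collapse all remaining rows into a single summary row
summary_1957 <- gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

print(summary_1957)
```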

Group_by verb

Instead of filtering your data and then summarizing repeatedly by each new filter, you can group your data and run summaries on all the groups at once.

Standard code is: dataframe %>% group_by(Var Name) %>% summarize(NewVar = median(VarA to be summarized), NewVar2 = max(VarB to be summarized))

Can add grouping & filtering together, for example: dataframe %>% filter(year == 2007) %>% group_by(Var Name) %>% summarize(NewVar = median(VarA to be summarized), NewVar2 = max(VarB to be summarized))

Can add numerous variables into your grouping command, for example: gapminder %>% group_by(year, continent) %>% summarize(totalPop= sum(pop), meanLifeExp= mean(lifeExp))

The code above will group all output first by year, then by continent
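
A sketch of grouping by two variables, assuming gapminder and dplyr are installed. One assumption to flag: in the CRAN gapminder package pop is stored as an integer, and summing it can overflow, so this sketch converts it with as.numeric() first. If your copy of the data already stores pop as a double, the conversion is harmless.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)

# One output row per year/continent combination
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)),  # as.numeric() guards against integer overflow
            meanLifeExp = mean(lifeExp))

print(by_year_continent)
```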

Visualizing Summarized Data

This is essentially layering the concepts from Chapters 1 + 2

In general, the process consists of saving summarized data as an object (i.e., new Variable), and then passing that object into a graph to visualize it.

Example code:

# First summarize into a new variable (object)
by_year <- gapminder %>% group_by(year) %>% summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
# Then plot; add expand_limits(y = 0) if you need the plot's y-axis to include zero
ggplot(by_year, aes(x = year, y = medianLifeExp)) + geom_point() + expand_limits(y = 0)
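
A complete, runnable version of this summarize-then-plot workflow, assuming the packages below are installed. Putting year on x and medianLifeExp on y is my choice for illustration; any summarized column could go on the y-axis.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

# Step 1: save the summarized data as its own object
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Step 2: pass that object to ggplot(); expand_limits() forces y to include zero
p_by_year <- ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

print(p_by_year)
```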

Additional Types of Plots

Line Plots (change over time)
Bar plots (comparing stats across categories)
Histograms (distribution of one numeric variable)
Box plots (compare the distribution of a numeric variable across several categories)

The code for all of these is very similar to the scatterplot code: you map variables to axes and swap in a different geom

Line Plot

Visualizing change over time
Easier to spot trends over time - Only need to change geom_point() to geom_line()
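
A sketch of a line plot over time, assuming the packages below are installed. The summary column medianGdpPercap and the choice of one colored line per continent are mine, for illustration.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

# Median GDP per capita for each continent in each year
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Same structure as a scatterplot, with geom_line() instead of geom_point()
p_line <- ggplot(by_year_continent,
                 aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_line() +
  expand_limits(y = 0)

print(p_line)
```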

Bar Plot

Comparing values across discrete categories (e.g., continents)
X = categorical var; y = var that determines the height of the bars
Bar plots ALWAYS start at zero, so you don't need expand_limits(y = 0)
Only need to change geom_point() to geom_col()
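
A sketch of a bar plot built from summarized data, assuming the packages below are installed. The year (2007) and the statistic (median life expectancy) are chosen for illustration.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

# One row per continent for a single year
by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp))

# x is the category, y sets the bar height; bars start at zero on their own
p_bar <- ggplot(by_continent, aes(x = continent, y = medianLifeExp)) +
  geom_col()

print(p_bar)
```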

Histograms

Investigating 1 dimension (var) of data at a time (i.e., look at a distribution)
E.g., Every bar represents a bin of life expectancies, and the height represents how many countries fall into that bin. This lets you get a sense of the distribution based on the histogram's shape.
Sample code: ggplot(dataframe, aes(x = Var)) + geom_histogram()
The bin width is chosen automatically (ggplot2 defaults to 30 bins). To change it, set either the bin width or the number of bins: ggplot(dataframe, aes(x = Var)) + geom_histogram(binwidth = Width of each bin) or geom_histogram(bins = Number of bins). Wider bins focus on the general shape rather than the smaller details.
Sometimes X needs to be on a log scale; just add scale_x_log10()
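
A sketch of a histogram on a log scale, assuming the packages below are installed. Plotting country populations in 1952 with 20 bins is my choice for illustration; the bin count is something to experiment with.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Only x is mapped; the bar heights are counts computed by geom_histogram()
p_hist <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(bins = 20) +  # fewer bins = smoother, more general shape
  scale_x_log10()              # populations span several orders of magnitude

print(p_hist)
```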

Box plots

Lets you compare the distribution of a numeric variable across multiple categories
Sample code: ggplot(dataframe, aes(x = CATEGORICAL Var, y = DISTRIBUTION Var)) + geom_boxplot()
Components of a box plot:
1. Dark line in middle is median of the distribution
2. Top of box = 75th percentile
3. Bottom of box = 25th percentile
4. Therefore, 50% of distribution lies within the box
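5. The lines (whiskers) extend to the most extreme values that are not outliers (by default, within 1.5 times the interquartile range, i.e., the height of the box, from the box edge)
6. Dots beyond the whiskers (above or below) represent outliers - countries with unusual values relative to the rest of the distribution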
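
A sketch of a box plot with gapminder columns, assuming the packages below are installed. Comparing GDP per capita across continents, and putting y on a log scale, are my choices for illustration.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# x is the category, y is the numeric variable whose distribution is compared
p_box <- ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()

print(p_box)
```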

GGPlot instructions

As a final exercise in this course, you'll practice looking up ggplot2 instructions by completing a task we haven't shown you how to do.

Add a title to the graph

I Googled it :) and found this (labs() is added to a ggplot object, here saved as plot, not to a dataframe):
plot + labs(title = "Plot of length \n by dose", x = "Dose (mg)", y = "Teeth length")
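
A sketch of adding a title to a gapminder plot, assuming the packages below are installed. The plot and the label text are made up for illustration; the key point is that labs() is added to a ggplot object with +.

```r
library(gapminder)  # assumed source of the gapminder data frame
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Build the plot first and save it as an object
p_box <- ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()

# labs() sets the title and axis labels (illustrative text)
p_titled <- p_box +
  labs(title = "GDP per capita across continents, 1952",
       x = "Continent",
       y = "GDP per capita (log scale)")

print(p_titled)
```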
