Introduction to the Tidyverse
Run the hidden code cell below to import the data used in this course.
Tidyverse
Data Visualization Notes:
Create a subset of dataframe (gapminder_1952)
gapminder_1952 <- gapminder %>% filter(year == 1952)
Creating scatterplots
- Useful for comparing two variables
ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) + geom_point()
This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis)) + geom_point()
- Geom: Meaning that you're placing a geometric shape
- Point: Meaning that each datapoint coincides with a point on the graph (aka a scatterplot)
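Putting the subset and the scatterplot together, a minimal runnable sketch might look like the following. It assumes the data comes from the gapminder R package (the course loads the data in a hidden cell, so the exact import is an assumption) and that dplyr and ggplot2 are installed.

```r
# Assumption: the gapminder data frame comes from the gapminder package
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the rows for 1952 (one row per country)
gapminder_1952 <- gapminder %>% filter(year == 1952)

# Scatterplot: GDP per capita on the x-axis, life expectancy on the y-axis
p <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
print(p)
```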
Log Scales
Used when the values on one of your axes span several orders of magnitude. On a logarithmic scale, each fixed distance represents a multiplication of the value (e.g., 100, 1,000, and 10,000 are evenly spaced).
This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis)) + geom_point() + scale_x_log10()
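A small sketch of the log-scale idea, assuming the data is loaded from the gapminder package (an assumption, since the course imports it in a hidden cell). The `steps` vector is only an illustration that each multiplication by 10 covers the same distance on a log10 axis.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# Each multiplication by 10 is one equal step on a log10 scale
steps <- diff(log10(c(100, 1000, 10000)))

# Same scatterplot as before, but with the x-axis on a log10 scale
p_log <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
print(p_log)
```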
Additional Aesthetics
- Color (good for categorical variables) -- R will auto add a legend
- Size (good for showing varying sizes, e.g., population)
This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis, color = Var to dictate color, size = Var to dictate size)) + geom_point() + scale_x_log10()
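As an illustration of the template above, here is a sketch that maps continent to color and population to point size. The choice of variables is mine, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# color = categorical variable (a legend is added automatically)
# size  = numeric variable (bigger points for bigger populations)
p_aes <- ggplot(gapminder_1952,
                aes(x = gdpPercap, y = lifeExp,
                    color = continent, size = pop)) +
  geom_point() +
  scale_x_log10()
print(p_aes)
```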
Faceting
Used to further explore data by dividing plots into subplots based on a variable
This is the standard code: ggplot(dataframe, aes(x = Var on X axis, y = Var on Y axis, color = Var to dictate color, size = Var to dictate size)) + geom_point() + scale_x_log10() + facet_wrap(~ Var Name)
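- Note: scale_x_log10() is not required for faceting; it only carries over from the earlier log-scale example
- Note: ~ usually means "by" in R, so facet_wrap(~ Var Name) reads as "facet by Var Name"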
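A sketch of faceting on the full dataset, with one subplot per continent. The variable choices are illustrative, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(ggplot2)

# One panel per continent; read facet_wrap(~ continent) as "facet by continent"
p_facet <- ggplot(gapminder,
                  aes(x = gdpPercap, y = lifeExp,
                      color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)
print(p_facet)
```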
Grouping and Summarizing
Summarize verb
Used to create summary statistics; R has many of the standard built in functions, for example:
- mean
- median
- sum
- min
- max
General code is: dataframe %>% summarize(NewVar = median(Var to be summarized))
Can also combine summarize with filters, for example: gapminder %>% filter(year == 1957) %>% summarize(medianLifeExp = median(lifeExp))
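A runnable sketch of filter plus summarize, computing two summary statistics at once. It assumes the data comes from the gapminder package; the object name summary_1957 is made up for this example.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)

# Filter to one year, then collapse all rows into a single summary row
summary_1957 <- gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap  = max(gdpPercap))
print(summary_1957)
```

The result is a data frame with one row and one column per summary.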
Group_by verb
Instead of filtering your data and then summarizing repeatedly by each new filter, you can group your data and run summaries on all the groups at once.
Standard code is: dataframe %>% group_by(Var Name) %>% summarize(NewVar = median(VarA to be summarized), NewVar2 = max(VarB to be summarized))
Can add grouping & filtering together, for example: dataframe %>% filter(year == 2007) %>% group_by(Var Name) %>% summarize(NewVar = median(VarA to be summarized), NewVar2 = max(VarB to be summarized))
Can add numerous variables into your grouping command, for example: gapminder %>% group_by(year, continent) %>% summarize(totalPop = sum(pop), meanLifeExp = mean(lifeExp))
- The code above will group all output first by year, then by continent
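A runnable sketch of grouping by two variables, assuming the data comes from the gapminder package. The object name by_year_continent is made up for this example. dplyr may print a message about how the result is grouped; that is informational only.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)

# One output row per (year, continent) combination
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop    = sum(as.numeric(pop)),  # as.numeric avoids integer overflow
            meanLifeExp = mean(lifeExp))
print(by_year_continent)
```

With 12 survey years and 5 continents, the result has 60 rows.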
Visualizing Summarized Data
This is essentially layering the concepts from Chapters 1 + 2
In general, the process consists of saving summarized data as an object (i.e., new Variable), and then passing that object into a graph to visualize it.
Example code:
- first summarize into a new variable (object)
- by_year <- gapminder %>% group_by(year) %>% summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
- then plot; be sure to add expand_limits(y = 0) if you need the plot's y-axis to include zero.
- ggplot(by_year, aes(x = year, y = medianLifeExp)) + geom_point() + expand_limits(y = 0)
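The two steps above, written out as a runnable sketch. Plotting year against medianLifeExp is one possible choice of axes, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

# Step 1: summarize into a new object, one row per year
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap  = max(gdpPercap))

# Step 2: plot the summarized object; expand_limits() forces zero onto the y-axis
p_year <- ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)
print(p_year)
```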
Additional Types of Plots
- Line Plots (change over time)
- Bar plots (comparing stats across categories)
- Histograms (distribution of one numeric variable)
- Box plots (compare the distribution of a numeric variable across several categories)
The code for all of these is very similar to the scatterplot code: you still map variables to the axes of your plot, and only the geom changes.
Line Plot
- Visualizing change over time
- Easier to spot trends over time
- Only need to change geom_point() to geom_line()
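A sketch of a line plot with one line per continent. The summary used here (median GDP per capita) is an illustrative choice, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

# Median GDP per capita for each year within each continent
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# geom_line() connects the points over time; color gives one line per continent
p_line <- ggplot(by_year_continent,
                 aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_line() +
  expand_limits(y = 0)
print(p_line)
```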
Bar Plot
- Comparing values across discrete categories (e.g., continents)
- X = categorical var; y = var that determines length of bars
- Bar plots ALWAYS start at zero, so you don't need expand_limits(y = 0)
- Only need to change geom_point() to geom_col()
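A sketch of a bar plot comparing one statistic across continents. The statistic (median GDP per capita in 1952) is an illustrative choice, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

# One row per continent for a single year
by_continent <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# x = categorical variable, y = bar height; bars start at zero
p_bar <- ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
  geom_col()
print(p_bar)
```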
Histograms
- Investigating 1 dimension (var) of data at a time (i.e., look at a distribution)
- E.g., Every bar represents a bin of life expectancies, and the height represents how many countries fall into that bin. This lets you get a sense of the distribution based on the histogram's shape.
- Sample code: ggplot(dataframe, aes(x = Var)) + geom_histogram()
- The width of each bin is chosen automatically. If you need to change it, the code becomes: ggplot(dataframe, aes(x = Var)) + geom_histogram(binwidth = width of each bin), or use bins = number of bins instead. Wider bins focus on the general shape rather than the smaller details
- Sometimes X needs to be on a log scale; you just add scale_x_log10()
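A sketch of two histograms: one with an explicit bin width (binwidth sets the width of each bin; bins would set the number of bins instead) and one on a log-scaled x-axis. The variable choices and the bin width of 5 are illustrative, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# Each bar covers 5 years of life expectancy; height = number of countries
p_hist <- ggplot(gapminder_1952, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)
print(p_hist)

# Population is heavily skewed, so a log10 x-axis shows the shape better
p_hist_log <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
print(p_hist_log)
```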
Box plots
- Let you compare the distribution of a numeric variable across multiple categories
- Sample code: ggplot(dataframe, aes(x = CATEGORICAL Var, y = DISTRIBUTION Var)) + geom_boxplot()
- Components of a box plot:
- Dark line in middle is median of the distribution
- Top of box = 75th percentile
- Bottom of box = 25th percentile
- Therefore, 50% of distribution lies within the box
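- The lines (whiskers) extend to the most extreme countries that are not outliers (by default, values within 1.5 x the box height of the box edges)
- Dots beyond the whiskers (above or below) represent outliers - countries with unusual values relative to the rest of the distribution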
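A sketch of a box plot of life expectancy by continent, plus a small summary table with the quartiles that the box edges and the middle line correspond to. The object name box_stats is made up for this example, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# x = categorical variable, y = numeric variable whose distribution is compared
p_box <- ggplot(gapminder_1952, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
print(p_box)

# Bottom of box, middle line, and top of box for each continent
box_stats <- gapminder_1952 %>%
  group_by(continent) %>%
  summarize(q25 = quantile(lifeExp, 0.25),
            med = median(lifeExp),
            q75 = quantile(lifeExp, 0.75))
print(box_stats)
```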
GGPlot instructions
As a final exercise in this course, you'll practice looking up ggplot2 instructions by completing a task we haven't shown you how to do.
Add a title to the graph
- I Googled it :) and found this:
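- your_plot + labs(title = "Plot of length \n by dose", x = "Dose (mg)", y = "Teeth length"), where your_plot is the ggplot(...) + geom_...() code (labs() is added to the plot, not to the dataframe)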
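Applied to the gapminder data, adding a title might look like the sketch below. The title and axis label text are made up for illustration, and loading the data from the gapminder package is an assumption.

```r
library(gapminder)  # assumption: data source is the gapminder package
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

p <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# labs() is added to the plot like any other layer
p_titled <- p +
  labs(title = "Life expectancy vs. GDP per capita (1952)",
       x = "GDP per capita (log scale)",
       y = "Life expectancy")
print(p_titled)
```

If only a title is needed, ggtitle("...") is a shorter alternative to labs(title = "...").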