In a previous blog post, you learned how to make histograms with the hist() function. You can also make histograms by using ggplot2, “a plotting system for R, based on the grammar of graphics” that was created by Hadley Wickham. This post will focus on making a histogram with ggplot2.
- Check that you have ggplot2 installed
- The Data
- Making your Histogram with ggplot2
- Taking it one Step Further
- Feeling Like Going Far and Beyond?
(Want to learn how to do more plots with ggplot2? Try this interactive course on data visualization with ggplot2.)
Step One. Check that you have ggplot2 Installed
First, go to the “Packages” tab in RStudio, an IDE to work with R efficiently, search for ggplot2 and mark the checkbox. Alternatively, it could be that you need to install the package. In this case, you stay in the same tab, and you click on “Install”. Enter ggplot2, press ENTER and wait one or two minutes for the package to install.
You can also install ggplot2 from the console with the install.packages() function.
To load the ggplot2 package so that you can use it, call library(ggplot2).
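As a minimal sketch, the install-and-load step from the console could look like this. The requireNamespace() check is an addition that is not part of the original steps; it only skips the installation when the package is already there.

```r
# Install ggplot2 only if it is not available yet (assumed convenience check),
# then attach it to the current session.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```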
Step Two. The Data
Let’s leave the ggplot2 library for what it is for a bit and make sure that you have some dataset to work with: import the necessary file or use one that is built into R. This tutorial will again be working with the chol dataset.
If you’re just tuning into this tutorial series, you can download the dataset from here.
You can load in the chol data set by using the url() function embedded into the read.table() function. Next, you can inspect whether the import was successful with functions such as head(), summary() and str().
Note that you use the head() function to retrieve the first parts of the data frame, while you use summary() to return a summary of the chol object. Lastly, you can use str() to display the structure of the chol data frame.
Tip: if you want to double check the class of the chol data frame, use the class() function, just like this: class(chol).
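The inspection functions above can be tried out as follows. Because the download link for the dataset is not reproduced here, this sketch builds a hypothetical stand-in for chol with random ages in a single AGE column; the real file has its own columns and values, and the header = TRUE setting in the commented-out import line is an assumption as well.

```r
# The real import would look roughly like this (URL left as a placeholder):
# chol <- read.table(url("<dataset URL>"), header = TRUE)  # header = TRUE is assumed

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

head(chol)     # first six rows
summary(chol)  # summary statistics for every column
str(chol)      # structure: dimensions, column names and column types
class(chol)    # "data.frame"
```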
Step Three. Making your Histogram with ggplot2
You have two options to create your histograms with the ggplot2 package. On the one hand, you can use the qplot() function, which looks very much like the hist() function.
Plotting with the qplot() function is easy: you pass in the data that you want to have on the x-axis, in this case, chol$AGE, and by adding the geom argument, you can specify the type of graph you want. In this case, by specifying "histogram", you indicate that you want to plot the distribution of chol$AGE.
On the other hand, you can also use the ggplot() function to make the same histogram. In this case, you take the dataset chol and pass it to the data argument. Next, you pass the AGE column from the dataset as values on the x-axis inside aes() and add geom_histogram() to compute a histogram of it.
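Side by side, the two options could be sketched like this. The chol data frame below is a hypothetical stand-in with random ages, not the real dataset. Note also that qplot() is deprecated as of ggplot2 3.4.0, so recent versions may show a deprecation warning next to the usual message about the number of bins.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

# Option 1: qplot(), with the x values first and the geom as a string.
p_quick <- qplot(chol$AGE, geom = "histogram")

# Option 2: ggplot(), with the data, an aesthetic mapping and a histogram layer.
# Since the data frame is passed in, the column can be referred to as AGE.
p_full <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram()

print(p_quick)
print(p_full)
```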
As you saw before, ggplot2 is an implementation of the grammar of graphics, which means that there is a basic grammar to producing graphics: you need data and graphical elements to make your plots, just like you need a personal pronoun and a conjugated verb to make sentences. This means that you feed data to a plot as x and y elements and you need to manipulate some details, such as colors, markers, etc. as graphical elements, which are added as layers.
This is precisely what happens in this plot: besides the data argument that you specify, you also add aes to describe how variables in the data (such as chol$AGE) are mapped to visual properties of geoms (geom_histogram() in this case, which is added as a layer).
But what is the difference between these two options? The qplot() function is supposed to make the same graph as ggplot(), but with a simpler syntax. This might seem entirely random, but it really isn’t if you understand where the name qplot() comes from: it’s short for “quick plot”, and it’s a shortcut designed to be familiar if you’re used to base plot(). While ggplot() allows for maximum features and flexibility, qplot() is a more straightforward but less customizable wrapper around ggplot().
Note: in practice, ggplot() is used more often.
Step Four. Taking it one Step Further
Now that you know how to make a basic histogram with this R package that is based on the grammar of graphics, it’s time to take things up a notch, and adjust the qplot() and the ggplot() that you have just made to customize it to your needs.
The options to adjust your histogram through qplot() are not too extensive, but this function does allow you to change the basics to improve the visualization and hence the understanding of the histograms. All you need to do is add some more arguments, just like you did with the hist() function.
You might have already seen the following message pop up in the previous histograms: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. It refers to the binwidth argument that you can pass to qplot() or geom_histogram() to change the width of the histogram bins.
In any case, you could adjust the original plot by adding a few more arguments to the qplot() call.
Tip: compare the arguments to the ones that are used in the hist() function in the first part of this tutorial series to get some more insight!
You’ll have a histogram for the AGE column in the chol dataset, with the title Histogram for Age and the x-axis label Age, with bins of a width of 5 that range from values 20 to 50 on the x-axis and that have a transparent blue filling and red borders.
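A qplot() call that matches this description could look like the sketch below. The data is again a hypothetical stand-in for chol, and the alpha value of 0.2 is an assumed choice for "transparent". Rows that fall outside the x-axis limits are dropped with a warning.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- qplot(
  chol$AGE,
  geom = "histogram",
  binwidth = 5,                # bins that are 5 years wide
  main = "Histogram for Age",  # plot title
  xlab = "Age",                # x-axis label
  fill = I("blue"),            # bin filling, wrapped in I()
  col = I("red"),              # bin borders, wrapped in I()
  alpha = I(0.2),              # transparency of the filling (assumed value)
  xlim = c(20, 50)             # limits of the x-axis
)
print(p)
```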
Since the R commands are only getting longer and longer, you might need some help to understand what each part of the code does to the histogram’s appearance.
Let’s just break it down to smaller pieces:
You can change the bin width by specifying a binwidth argument in your qplot() function. Play around with the value to see how the bins change.
As with the hist() function, you can use the argument main to change the title of the histogram.
To change the labels that refer to the x- and y-axes, use xlab and ylab, just like you do when you use the hist() function.
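These three adjustments can be sketched in one call. The bin width of 2 and the label texts are illustrative choices, and chol is once more a hypothetical stand-in data frame.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- qplot(
  chol$AGE,
  geom = "histogram",
  binwidth = 2,                # try other values, such as 0.5 or 5
  main = "Histogram for Age",  # title
  xlab = "Age",                # label for the x-axis
  ylab = "Count"               # label for the y-axis
)
print(p)
```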
However, if you want to adjust the colors of your histogram, you have to take a slightly different approach than with the hist() function: you add the fill argument, with the I() function in which you nest a color.
This different approach also applies if you want to change the border of the bins: you add the col argument, with the I() function in which you can nest a color.
The I() function inhibits the interpretation of its arguments. In this case, the col argument is affected. Without it, the qplot() function would print a legend, saying that “col = "red"”, which is definitely not what you want in this case (Muenchen et al. 2010).
Tip: try removing the I() function and see for yourself what happens!
If you want to set the transparency of the bins’ filling, just add the argument alpha, together with a value that is between 0 (fully transparent) and 1 (opaque).
Note that the I() function is used here also! Again, try to leave this function out and see what effect this has on the histogram.
X- and Y-Axes
The qplot() function also allows you to set limits on the values that appear on the x- and y-axes. Just use xlim and ylim, in the same way as it was described for the hist() function in the first part of this tutorial on histograms. After adding the xlim argument and some reasonable parameters, you end up with the histogram from the start of this section.
Tip: do not forget to use the c() function to specify xlim and ylim!
Just like the two other options that have been discussed so far, adjusting your histogram through the ggplot() function is also very easy. The general message stays the same: just add more code to the original code that plots your (basic) histogram! This way, you can customize your basic ggplot!
In this example, you’ll use the chol data again to make a histogram. More specifically, you’ll plot the chol$AGE data along the x-axis. After that, you’ll use the geom_histogram() function to tell ggplot2 that you’re actually interested in plotting the distribution of chol$AGE with the help of a histogram. Lastly, you customize your ggplot by adding labs(), to which you’ll pass the title, x and y arguments to add labels, and xlim() and ylim() to set the limits of the x- and y-axes.
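Put together, the customized ggplot could look like the sketch below. The colors, breakpoints, labels and axis limits are illustrative assumptions, and chol is a hypothetical stand-in with random ages, so pick values that suit your own data.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- ggplot(data = chol, aes(x = AGE)) +
  # Histogram layer: bins of width 2 between 20 and 50, with assumed colors.
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "red", fill = "green", alpha = 0.2) +
  # Title and axis labels.
  labs(title = "Histogram for Age", x = "Age", y = "Count") +
  # Axis limits; values outside the x limits are dropped with a warning.
  xlim(c(18, 52)) +
  ylim(c(0, 30))
print(p)
```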
Again, let’s break this massive chunk of code into pieces to see exactly what each part contributes to the visualization of your histogram:
To adjust the bin width and the breakpoints, you can basically follow the general guidelines that were provided in the first part of the tutorial on histograms, since the arguments work alike. This means that you can add breaks to change the bin width, for example by generating the breakpoints with seq().
Note that in the seq() function you can explicitly name the by argument as the last argument. This can be more informative, but it doesn’t change the resulting histogram!
Remember that you could also express the same constraints on the bins with the c() function, but that this can make your code messy.
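Here is a short sketch of the breaks argument, assuming breakpoints every 2 years between 20 and 50 and a hypothetical stand-in for the chol data. The first lines also check that naming the by argument gives exactly the same breakpoints.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

# Naming the by argument is optional: both calls give the same vector.
breaks_unnamed <- seq(20, 50, 2)
breaks_named <- seq(20, 50, by = 2)
identical(breaks_unnamed, breaks_named)  # TRUE

# Values outside the range of the breakpoints are not counted in any bin.
p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram(breaks = breaks_named)
print(p)
```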
To adjust the colors of your histogram, just add the arguments col and fill, together with the desired color. The alpha argument controls the fill transparency. Remember to pass a value between 0 (transparent) and 1 (opaque).
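As a sketch, with red borders, a green filling and an alpha of 0.2 as assumed example values (and chol as a hypothetical stand-in data frame):

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "red",     # border color of the bins
                 fill = "green",  # fill color of the bins
                 alpha = 0.2)     # fill transparency, between 0 and 1
print(p)
```

Unlike in qplot(), the colors are set directly as arguments of geom_histogram(), outside aes(), so no I() is needed.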
You can also fill the bins with colors according to the count numbers that are presented on the y-axis by passing ..count.. to the fill aesthetic, something that is not possible in the qplot() function. Setting the fill argument to ..count.. results in a variety of blue colors, which is actually the default color scheme. If you want to change this, you should add something more to your code: scale_fill_gradient(), which allows you to specify, for example:
- that you’re taking the count values from the y-axis,
- that the low values should be in green and
- that the higher values should appear in red:
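A sketch of such a gradient fill follows. It assumes ggplot2 3.3.0 or later and therefore writes after_stat(count), the newer spelling of ..count.. (the dotted notation is deprecated as of ggplot2 3.4.0). The legend title "Count" and the stand-in chol data are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- ggplot(data = chol, aes(x = AGE)) +
  # Map the fill color of each bin to its computed count.
  geom_histogram(aes(fill = after_stat(count)),
                 breaks = seq(20, 50, by = 2)) +
  # Low counts in green, high counts in red; "Count" is the legend title.
  scale_fill_gradient("Count", low = "green", high = "red")
print(p)
```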
Remember that the ultimate purpose of adjusting your histogram should always be improving the understanding of it. Even though the histograms above look very fancy, they might not be exactly what you need, so always keep in mind what you’re trying to achieve!
Note that there are several more options to adjust the color of your histograms. If you want to experiment some more, you can find other arguments in the “Scales” section of the ggplot documentation page.
To adjust the title of your histogram, add the argument title to labs().
To adjust the labels on the x- and y-axes of your histogram, add the arguments x and y, followed by a string of your choice.
X- and Y-Axes
Similar to the arguments that the hist() function uses to adjust the x- and y-axes, you can use the functions xlim() and ylim(). If you add these two functions, you end up with the histogram from the start of this section.
Tip: do not forget to use the c() function when you use xlim() and ylim()! And you should probably watch out for those parentheses, too :)
You can easily add a trendline to your histogram by adding geom_density() to your code.
Remember: just like with the hist() function, your histograms with ggplot2 also need to plot the density for this to work. Remember also that the hist() function required you to make a trendline by entering two separate commands while ggplot2 allows you to do it all in one single command.
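The single command could be sketched as follows, with the histogram put on the density scale so that the bars and the curve share the same y-axis. As before, chol is a hypothetical stand-in, the colors are assumed choices, and after_stat(density) is the ggplot2 3.3.0+ spelling of ..density...

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random ages, not the real chol values.
set.seed(42)
chol <- data.frame(AGE = sample(20:60, 200, replace = TRUE))

p <- ggplot(data = chol, aes(x = AGE)) +
  # Plot densities instead of counts on the y-axis.
  geom_histogram(aes(y = after_stat(density)),
                 breaks = seq(20, 50, by = 2),
                 col = "red", fill = "green", alpha = 0.2) +
  # Add the density curve as a second layer.
  geom_density(col = "darkblue") +
  labs(title = "Histogram for Age", x = "Age", y = "Density")
print(p)
```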
Step Five. Feeling Like Going Far and Beyond?
If you’re intrigued by the histograms that you can make with ggplot2, and if you want to discover what more you can do with this package, you can read about it on the RDocumentation page. It is a great starting point for anybody that is interested in taking ggplot2 to the next level.
If you already have some understanding of SAS, SPSS and STATA, and you want to discover more about ggplot2 but also other useful R packages, you might want to check out DataCamp’s course “R for SAS, SPSS and STATA Users”. The course is taught by Bob Muenchen, who is considered one of the prominent figures in the R community and whose book has briefly been mentioned in this tutorial.
This is the second of 3 posts on creating histograms with R. The next post will cover the creation of histograms using ggvis.