In a previous blog post you learned how to make histograms with the hist() function. You can also make histograms by using ggplot2, “a plotting system for R, based on the grammar of graphics” that was created by Hadley Wickham. This post will focus on making a histogram with ggplot2.
First, go to the “Packages” tab in RStudio (an IDE for working with R), search for ggplot2, and mark the checkbox. If the package is not listed, you need to install it first: stay in the same tab, click “Install”, enter ggplot2, press ENTER, and wait a minute or two for the installation to finish.
You can also install ggplot2 from the console by calling install.packages("ggplot2").
To load the ggplot2 package into your session, call library(ggplot2).
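Both steps can be combined in a short script. The sketch below assumes an internet connection and a configured CRAN mirror; it installs ggplot2 only when the package is missing and then loads it:

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so that ggplot(), qplot(), etc. can be called directly.
library(ggplot2)
```

Guarding the install with requireNamespace() avoids re-downloading the package every time the script runs.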
Step Two. The Data
Let’s set ggplot2 aside for a moment and make sure that you have a dataset to work with: import the necessary file or use one that is built into R. This tutorial will again work with the chol dataset.
If you’re just tuning in to this tutorial series, you can download this dataset from here.
You can load in the chol data set by using the url() function embedded into the read.table() function. Next, you can inspect whether the import was successful with functions such as head(), summary() and str():
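The download location is not reproduced here, so the following minimal sketch reads a few made-up rows from an inline string instead of calling url(). With the real file you would pass url() with the dataset address as the first argument of read.table(). Only the AGE column name comes from this tutorial; the CHOL column and all values are placeholders.

```r
# Hypothetical stand-in for the real file: whitespace-separated rows
# with a header line. These values are invented for illustration.
raw_text <- "AGE CHOL
21 180
27 195
33 210
36 205
41 230
48 250"

# read.table() parses the text into a data frame; header = TRUE takes
# the column names from the first line.
chol <- read.table(text = raw_text, header = TRUE)

head(chol)     # first rows of the data frame
summary(chol)  # per-column summary statistics
str(chol)      # structure: column names, types, and a preview
```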
Note that you use the head() function to retrieve the first rows of the chol data frame, while you use summary() to return a summary of the chol object. Lastly, you can use str() to display the structure of the chol data frame.
Tip: if you want to double-check the class of the chol data frame, use the class() function, like this: class(chol).
Step Three. Making Your Histogram With ggplot2
You have two options to make your histograms with the ggplot2 package. On the one hand, you can use the qplot() function, which looks very much like the hist() function:
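A minimal sketch of the qplot() call follows. It assumes a small made-up chol data frame with an AGE column in place of the real dataset. Note also that qplot() has been deprecated since ggplot2 3.4.0, so recent versions print a deprecation warning when you call it.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# Pass the x values and ask for a histogram geom.
p <- qplot(chol$AGE, geom = "histogram")
print(p)
```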
You see that it’s easy to plot with the qplot() function: you pass in the data that you want to have on the x-axis, in this case chol$AGE, and by adding the geom argument you specify the type of graph you want. By specifying "histogram", you indicate that you want to plot the distribution of chol$AGE.
On the other hand, you can also use the ggplot() function to make the same histogram. In this case, you take the dataset chol and pass it to the data argument. Next, pass the AGE column from the dataset as values on the x-axis and compute a histogram of this:
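The equivalent ggplot() call might look like the sketch below, again using a made-up chol data frame as an assumption. It refers to the column by its bare name inside aes(), which is the usual ggplot2 idiom; writing chol$AGE there plots the same values in this simple case.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# data supplies the data frame, aes() maps AGE to the x-axis,
# and geom_histogram() is added as a layer that bins and draws the values.
p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram()
print(p)
```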
As you saw before, ggplot2 is an implementation of the grammar of graphics, which means that there is a basic grammar to producing graphics: you need data and graphical elements to make your plots, just like you need a personal pronoun and a conjugated verb to make a sentence. This means that you feed data to a plot as x and y elements, and that you control details such as colors and markers through graphical elements, which are added as layers.
This is exactly what happens in this plot: besides the data argument that you specify, you also add aes to describe how variables in the data (such as chol$AGE) are mapped to visual properties of geoms (geom_histogram() in this case, which is added as a layer).
But what is the difference between these two options?
The qplot() function is supposed to make the same graph as ggplot(), but with a simpler syntax. This might seem quite random, but it really isn’t if you understand where the name qplot() comes from: it’s short for “quick plot”, and it’s a shortcut designed to be familiar if you’re used to base plot(). While ggplot() allows for maximum features and flexibility, qplot() is a simpler but less customizable wrapper around ggplot().
Note: in practice, ggplot() is used more often.
Step Four. Taking It One Step Further
Now that you know how to make a basic histogram with ggplot2, it’s time to take things up a notch and adjust the qplot() and ggplot() histograms that you have just made to your needs.
The options to adjust your histogram through qplot() are not too extensive, but this function does allow you to adjust the basics to improve the visualization and hence the understanding of the histogram. All you need to do is add some more arguments, just like you did with the hist() function.
You might have already seen the following message pop up for the previous histograms: “`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.” The message refers to the binwidth argument, which you can pass to qplot() or to geom_histogram() to change the width of the histogram bins.
In any case, you could adjust the original plot to look like this:
You’ll have a histogram for the AGE column in the chol dataset, with title Histogram for Age and label for the x-axis (Age), with bins of a width of 5 that range from values 20 to 50 on the x-axis and that have transparent blue filling and red borders.
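A call matching that description could look like the following sketch. The chol data frame is a made-up stand-in; the title, label, bin width, limits, and colors are taken from the description above.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

p <- qplot(chol$AGE,
           geom = "histogram",
           binwidth = 5,                # bins that are 5 years wide
           main = "Histogram for Age",  # plot title
           xlab = "Age",                # x-axis label
           fill = I("blue"),            # bin filling
           col = I("red"),              # bin borders
           alpha = I(0.2),              # mostly transparent filling
           xlim = c(20, 50))            # range shown on the x-axis
print(p)
```

Bins that extend past the x-axis limits are dropped with a warning, so the outermost bars may not appear.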
Since the R commands are only getting longer and longer, you might need some help to understand what each part of the code does to the histogram’s appearance.
Let’s just break it down to smaller pieces:
You can change the bin width by specifying a binwidth argument in your qplot() function. Try a few different values to see how the shape of the histogram changes.
The I() function inhibits the interpretation of its arguments. In this case, the col argument is affected. Without it, the qplot() function would print a legend saying col = "red", which is definitely not what you want in this case (Muenchen et al. 2010).
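As a small illustration, the sketch below draws the histogram with bins that are 2 years wide; both the bin width and the made-up chol data frame are assumptions chosen for the example.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# binwidth is given in the units of the x variable (years here).
p <- qplot(chol$AGE, geom = "histogram", binwidth = 2)
print(p)
```

Smaller values give more, narrower bars; larger values smooth the shape but hide detail.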
Tip: try removing the I() function and see for yourself what happens!
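A minimal sketch of setting fixed colors through I(), assuming a made-up chol data frame and illustrative color choices:

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# I() marks the values as literal colors rather than data to be mapped,
# so no legend is created for them.
p <- qplot(chol$AGE,
           geom = "histogram",
           binwidth = 5,
           fill = I("blue"),  # filling of the bins
           col = I("red"))    # border of the bins
print(p)
```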
If you want to set the transparency of the bins’ filling, just add the argument alpha, together with a value between 0 (fully transparent) and 1 (opaque), for example 0.2.
Note that the I() function is used here as well! Again, try leaving this function out to see what effect this has on the histogram.
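The transparency setting can be sketched as follows, again on a made-up chol data frame; the blue filling is an illustrative choice.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# alpha = I(0.2) applies a constant 20% opacity to every bin.
p <- qplot(chol$AGE,
           geom = "histogram",
           binwidth = 5,
           fill = I("blue"),
           alpha = I(0.2))
print(p)
```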
X- and Y-Axes
The qplot() function also allows you to set limits on the values that appear on the x- and y-axes. Just use xlim and ylim, in the same way as was described for the hist() function in the first part of this tutorial on histograms. After adding the xlim argument with some reasonable parameters, you end up with the histogram from the start of this section.
Tip: do not forget to use the c() function to specify xlim and ylim!
Just like the two other options that have been discussed so far, adjusting your histogram through the ggplot() function is also very easy. The general message stays the same: just add more code to the original code that plots your (basic) histogram!
This way, you can customize your basic ggplot!
To build the customized histogram, you use the chol data again. More specifically, you plot the AGE column along the x-axis and then add geom_histogram() to tell ggplot2 that you want to show the distribution of chol$AGE as a histogram. Lastly, you customize the plot by adding labs(), to which you pass the title, x and y arguments to add labels, and xlim() and ylim() to set the limits of the x- and y-axes.
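One way to write such a call is sketched below. The original exercise code is not reproduced here, so the break points, colors, and axis limits are illustrative assumptions, as is the made-up chol data frame.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 2),  # bin edges every 2 years
                 col = "red",                   # bin borders
                 fill = "green",                # bin filling
                 alpha = 0.2) +                 # transparency of the filling
  labs(title = "Histogram for Age", x = "Age", y = "Count") +
  xlim(c(18, 52)) +                             # x-axis limits
  ylim(c(0, 30))                                # y-axis limits
print(p)
```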
Again, let’s break this huge chunk of code into pieces to see exactly what each part contributes to the visualization of your histogram:
To adjust the bin width and the breakpoints, you can basically follow the general guidelines that were provided in the first part of the tutorial on histograms, since the arguments work alike. This means that you can add breaks to change the bin width:
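A minimal sketch of the breaks argument, assuming a made-up chol data frame and bin edges every 2 years between 20 and 50:

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

# seq() generates the bin edges 20, 22, ..., 50; each pair of
# neighbouring edges becomes one bin.
p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 2))
print(p)
```

Values that fall outside the first and last break are not counted, so choose a range that covers your data.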
Setting the fill argument of aes() within geom_histogram() to ..count.. results in a range of blue shades, which is the default color scheme. If you want to change this, add scale_fill_gradient() to your code, which allows you to specify, for example, that you’re taking the count values from the y-axis.
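The sketch below shows one way to combine a count-based fill with scale_fill_gradient(). It assumes ggplot2 3.3.0 or later, where after_stat(count) is the current spelling of ..count..; the legend name, the green-to-red gradient, and the made-up chol data frame are illustrative choices.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

p <- ggplot(data = chol, aes(x = AGE)) +
  # Map the fill color of each bin to the number of observations in it.
  geom_histogram(aes(fill = after_stat(count)),
                 breaks = seq(20, 50, by = 5)) +
  # "Count" is the legend title; low and high set the gradient end points.
  scale_fill_gradient("Count", low = "green", high = "red")
print(p)
```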
Remember that the ultimate purpose of adjusting your histogram should always be to make it easier to understand. Even though the histograms above look very fancy, they might not be exactly what you need, so always keep in mind what you’re trying to achieve!
Note that there are several more options to adjust the color of your histograms. If you want to experiment some more, you can find other arguments in the “Scales” section of the ggplot documentation page.
To adjust the title of your histogram, pass the title argument to labs().
Similar to the arguments that the hist() function uses to adjust the x- and y-axes, you can use the xlim() and ylim() functions. If you add these two functions, you end up with the histogram from the start of this section.
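A short sketch of setting the title and axis labels through labs(), assuming a made-up chol data frame; the label texts are examples.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

p <- ggplot(data = chol, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 5)) +
  # title sets the plot title; x and y set the axis labels.
  labs(title = "Histogram for Age", x = "Age", y = "Count")
print(p)
```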
Remember: just like with the hist() function, your ggplot2 histogram needs to plot the density rather than the count if you want to add a trend line. Remember also that hist() required two separate commands to draw a trend line, while ggplot2 lets you do it all in a single command.
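As a sketch of that single command, the code below puts the density on the y-axis and overlays a smoothed density curve. It assumes ggplot2 3.3.0 or later for after_stat(), plus a made-up chol data frame and illustrative colors.

```r
library(ggplot2)

# Hypothetical stand-in for the chol dataset; the ages are invented.
chol <- data.frame(
  AGE = c(21, 24, 27, 29, 31, 33, 34, 35, 36, 36, 38, 39, 41, 43, 45, 48)
)

p <- ggplot(data = chol, aes(x = AGE)) +
  # Plot densities instead of counts so the bars share a scale with the curve.
  geom_histogram(aes(y = after_stat(density)),
                 breaks = seq(20, 50, by = 5),
                 fill = "blue", alpha = 0.2) +
  # Kernel density estimate drawn on top of the bars as the trend line.
  geom_density(colour = "red")
print(p)
```

Because the bars show densities, their areas sum to 1, which is what lets the density curve line up with them.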
Step Five. Feeling Like Going Far And Beyond?
If you’re intrigued by the histograms that you can make with ggplot2, and if you want to discover what more you can do with this package, you can read about it on the RDocumentation page. It is a great starting point for anybody who is interested in taking ggplot2 to the next level.
If you already have some understanding of SAS, SPSS and STATA and you want to discover more about ggplot2 but also other useful R packages, you might want to check out DataCamp’s course “R for SAS, SPSS and STATA Users”. The course is taught by Bob Muenchen, who is considered one of the prominent figures in the R community and whose book has briefly been mentioned in this tutorial.
This is the second of 3 posts on creating histograms with R. The next post will cover the creation of histograms using ggvis. Spotted a mistake? Send us a tweet!