Tutorial on the R Apply Family
February 2nd, 2016 in R Programming
The Apply Functions As Alternatives To Loops
This post will show you how you can use the R apply() function, its variants such as mapply(), and a few of apply()’s relatives, applied to different data structures. Of course, not all the variants can be discussed, but where possible, you will be introduced to the use of these functions in combination, via a couple of slightly more beefy examples.
Also, you might find it useful to look at this introduction to R tutorial to better understand lists, vectors, arrays and dataframes, though you don’t necessarily need to have completed the tutorial to follow this post!
The apply() Family
The apply() family belongs to the R base package and consists of functions that manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions let you traverse the data in a number of ways without writing explicit loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.
The called function could be:
 An aggregating function, such as the mean or the sum (which return a single number, or scalar);
 Other transforming or subsetting functions; and
 Other vectorized functions, which return more complex structures like lists, vectors, matrices and arrays.
The apply() functions form the basis of more complex combinations and help you to perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.
But how and when should we use these?
Well, this depends on the structure of the data that you want to operate on and the format of the output that you need.
How To Use apply() in R
Let’s start with the godfather of the family, apply(), which operates on arrays. For simplicity, the tutorial limits itself to 2D arrays, which are also known as matrices.
The R base manual tells you that it’s called as follows: apply(X, MARGIN, FUN, ...), where:
 X is an array, or a matrix if the array has two dimensions;
 MARGIN defines how the function is applied: with MARGIN=1 it is applied over rows, whereas with MARGIN=2 it is applied over columns. With MARGIN=c(1,2), the function is applied to each individual element, indexed by both row and column; and
 FUN is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
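The three MARGIN settings can be compared on a small matrix. This is a minimal sketch: the matrix M and the anonymous doubling function are assumptions chosen only for illustration.

```r
# Small 2 x 3 matrix used only for illustration
M <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1: apply the function to each row
row_max <- apply(M, 1, max)

# MARGIN = 2: apply the function to each column
col_max <- apply(M, 2, max)

# MARGIN = c(1, 2): apply the function to every single element;
# the anonymous function plays the role of a user-defined function
doubled <- apply(M, c(1, 2), function(v) v * 2)

row_max
col_max
doubled
```

With MARGIN = c(1, 2), the result keeps the shape of the input matrix, because the function is called once per cell.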
Now, beginners may have difficulties in visualizing what is actually happening, so a picture and some code will come in handy to help you to figure this out.
Let’s construct a 5 x 6 matrix and imagine you want to sum the values of each column.
You can write something like this:
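A minimal sketch of such a call, assuming a matrix named X filled with random normal values (the values and the seed are assumptions made only so that the example is reproducible):

```r
# Build a 5 x 6 matrix; the random values are illustrative only
set.seed(42)
X <- matrix(rnorm(30), nrow = 5, ncol = 6)

# Sum the values of each column (MARGIN = 2)
col_sums <- apply(X, 2, sum)
col_sums
```

The result has one entry per column, so six values for this matrix.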
Remember that in R, a matrix can be seen as a collection of row vectors when you traverse the matrix from top to bottom (along the vertical direction, which is dimension or margin 1), or as a collection of column vectors when you traverse the matrix from left to right (along dimension or margin 2).
That means that the instruction you have just entered, depicted in figure 1, translates into: “apply the function ‘sum’ to the matrix X along margin 2 (by column), summing up the values of each column”.
Note that, to avoid cluttering the picture, just one of the columns is highlighted.
You end up with a vector containing the sums of the values of each column.
R prints this result on a single line, like a row vector. The output would have looked the same if you had summed along the rows of the matrix: this is just how R displays vectors.
A note for what follows: R prints the value returned by a function call even when you do not assign it to a variable, and a function returns the last expression it evaluated unless told otherwise. In practice, however, when you want to check the return value or perform further operations on it, it is best to assign the result of the function to a variable explicitly.
The lapply() Function
Suppose you want to apply a given function to every element of a list and obtain a list as the result. When you execute ?lapply, you see that the syntax looks like that of the apply() function.
The difference is that:
 It can be used for other objects like dataframes, lists or vectors; and
 The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.
To see how this works, create a few matrices and extract from each a given column.
This is a quite common operation performed on real data when making comparisons or aggregations from different dataframes.
Our toy example, depicted in figure 2, can be coded as:
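A minimal sketch of such an extraction. The contents and sizes of the matrices A, B and C are assumptions for illustration; only the names MyList and B, and the use of "[" as the function, come from the text.

```r
# Three small matrices of different sizes (contents are illustrative)
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(8:10, nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# Second column of every matrix: the empty argument between the
# commas means "all rows"
second_cols <- lapply(MyList, "[", , 2)

# First row of every matrix: the empty trailing argument means
# "all columns"
first_rows <- lapply(MyList, "[", 1, )

second_cols
first_rows
```

Each call returns a list with three elements, one per matrix in MyList.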
The operation is shown in the left part of figure 2.
Again, you start by specifying the object of interest, the list MyList. You then pass the standard R selection operator [ as the function and leave its first parameter (the row index) empty, which translates into “all rows”; that’s why you see the two consecutive commas.
Next, you specify the second parameter, which is 2: the column index. So you extract the second column from all the matrices within the list.
A few notes to the code above:
 The [ notation is the select operator. Remember, for example, that extracting all the elements of the third row of B requires B[3,];
 The [[ ]] notation expresses the fact that we are dealing with lists: [[2]] means the second element of the list. This is also shown in the output given by R;
 The output is a list with as many elements as there are elements in the input; and
 Note that you could also have extracted a single element from each matrix, like this:
lapply(MyList,"[", 1, 2)
On the right-hand side of figure 2, you can see an alternative extraction: this time you specify the first parameter as 1 and omit the second one, so you get the first row from each of the matrices.
The sapply() Function
The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. Indeed, sapply() is a ‘wrapper’ function for lapply().
An example may help to understand this: let’s say that you want to repeat the extraction operation of a single element as in the last example, but now take the first element of the second row (indexes 2 and 1) for each matrix.
Applying the lapply() function would give you a list. The sapply() function returns a vector instead, unless you pass simplify=FALSE as a parameter to sapply(), in which case a list is returned. See how it works in the code chunk below:
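A minimal sketch of the comparison, assuming the same illustrative list of matrices as before (the matrix contents are an assumption, not data from the text):

```r
# Illustrative list of three matrices
MyList <- list(matrix(1:9, 3, 3), matrix(4:15, 4, 3), matrix(8:10, 3, 2))

# Element in row 2, column 1 of every matrix, returned as a list
as_list <- lapply(MyList, "[", 2, 1)

# Same extraction, simplified to a vector by sapply()
as_vector <- sapply(MyList, "[", 2, 1)

# With simplify = FALSE, sapply() hands back a list again
as_list_again <- sapply(MyList, "[", 2, 1, simplify = FALSE)

as_list
as_vector
as_list_again
```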
Conversely, wrapping the lapply() call in a function like unlist() gives you a vector!
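As a short sketch of that conversion, again on an illustrative list of matrices whose contents are assumed:

```r
# Illustrative list of three matrices
MyList <- list(matrix(1:9, 3, 3), matrix(4:15, 4, 3), matrix(8:10, 3, 2))

# lapply() returns a list; unlist() flattens it into a vector
flattened <- unlist(lapply(MyList, "[", 2, 1))
flattened
```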
Anyway, to avoid confusion, it is best to use these functions in their ‘native format’ and avoid conversions unless strictly necessary.
The rep() Function
Something that is often used together with the apply() functions is rep(). When you apply it to a vector or a factor x, the function replicates its values a specified number of times.
Let’s reuse the list MyList that you worked on above with lapply().
This time, however, you only select the element in the first row and first column of each element of the list MyList (and you use sapply() to get a vector):
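A minimal sketch of this step. The matrices inside MyList are assumptions for illustration; the names Z and MyList and the counts c(3,1,2) follow the text.

```r
# Illustrative list of three matrices
MyList <- list(matrix(1:9, 3, 3), matrix(4:15, 4, 3), matrix(8:10, 3, 2))

# Element in row 1, column 1 of every matrix, as a vector
Z <- sapply(MyList, "[", 1, 1)

# Repeat the first value 3 times, the second once, the third twice
replicated <- rep(Z, c(3, 1, 2))

Z
replicated
```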
You see that the code above replicates the values of Z a number of times as established by c(3,1,2): three times the first, one time the second and two times the third.
Handy, no?
The mapply() Function
The mapply() function stands for ‘multivariate’ apply. Its purpose is to vectorize the arguments of a function that does not usually accept vectors as arguments.
In short, mapply() applies a function to multiple list or vector arguments.
Let’s look at an mapply() example where you create a 4 x 4 matrix by calling the rep() function repeatedly:
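A minimal sketch that builds the matrix both ways, first with repeated rep() calls bound by c(), then with mapply(). The names Q1 and Q2 are hypothetical.

```r
# Approach 1: call rep() four times and bind the results with c()
Q1 <- matrix(c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
             nrow = 4, ncol = 4)

# Approach 2: mapply() calls rep(i, 4) for i = 1, 2, 3, 4; the four
# equal-length results are simplified into a 4 x 4 matrix
Q2 <- mapply(rep, 1:4, 4)

Q1
Q2

# Both matrices hold the same values
all(Q1 == Q2)
```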
But you see that there is a more efficient way to bind the results of the rep() function than with c(): when you call mapply(), you vectorize the action of the function rep().
The aggregate() Function
This function is contained in the stats package and you use it like this: aggregate(x, by, FUN, ..., simplify = TRUE).
In other words, it works similarly to the apply() function: you specify the object and the function, and you say whether you want to simplify, just like with the sapply() function. The key difference is the use of the by clause, which sets the variable or dataframe field by which you want to perform the aggregation.
The next section will show you how this works.
An Example of aggregate()
Consider the toy dataset called Mydf, which contains data about the sales of a product and in which some of the values of the DepPC column repeat.
This variable classifies the data by geographical location, like the portion of a post code (here the numbers correspond to the departments of the Île de France, the region comprising Paris).
You want to do some stats on the sales columns. These are DProgr, a progressive number in increasing time order, and Qty, the quantity of the product sold, plus a logical variable, Delivered, telling us whether the product has been delivered (T) or not (F).
First, you can do a number of very simple things to get acquainted with the data set, other than showing it all, by just typing its name (here we only have 120 records, but imagine doing this for a real file with thousands of lines!).
Let’s explore the data:
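The original dataset is not available here, so the sketch below builds a synthetic stand-in for Mydf with the same four columns and 120 rows. All values are made up for illustration; only the column names come from the text.

```r
# Synthetic stand-in for Mydf (values are invented)
Mydf <- data.frame(
  DepPC     = rep(c("75", "92", "93", "94"), times = 30),
  DProgr    = 1:120,
  Qty       = rep(c(10, 20, 30, 40, 50, 60), times = 20),
  Delivered = rep(c(TRUE, TRUE, FALSE), times = 40)
)

dim(Mydf)           # number of rows and columns
head(Mydf)          # first six records
str(Mydf)           # column types
table(Mydf$DepPC)   # number of records per department
```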
Note that if you want to see the number of rows and columns that the dataframe contains, you could have also called nrow(Mydf) and ncol(Mydf).
Many other enquiries on the data are possible.
Here, you are interested in knowing in which department the product sells best, for example. That’s why you should regroup the data by department, summing up the sales, Qty, for each department DepPC with the help of the aggregate() function:
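A minimal sketch of the aggregation, run on a synthetic stand-in for Mydf (the values are invented; the column names follow the text, and sales_by_dep is a hypothetical name):

```r
# Synthetic stand-in for Mydf (values are invented)
Mydf <- data.frame(
  DepPC     = rep(c("75", "92", "93", "94"), times = 30),
  DProgr    = 1:120,
  Qty       = rep(c(10, 20, 30, 40, 50, 60), times = 20),
  Delivered = rep(c(TRUE, TRUE, FALSE), times = 40)
)

# Sum Qty within each department; passing the bare vector Mydf$Qty
# makes aggregate() name the result column "x"
sales_by_dep <- aggregate(Mydf$Qty, by = Mydf["DepPC"], FUN = sum)
sales_by_dep
```

Passing by = Mydf["DepPC"] (a one-column dataframe) rather than a bare vector keeps the grouping column labelled DepPC in the output.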
So, aggregate() tells R that you wish to sum over all the Qty values that belong to the same department.
Note that R assigned the sum to a variable ‘x’, because you didn’t say otherwise.
The output is quite readable as is, but for a higher number of departments, it might become harder to read. In these cases, you can resort to some graphical output: you plot the results by using one of R’s graphical output systems together with the aggregate() function:
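One way to sketch such a plot uses base graphics, chosen here only because it needs no extra packages; the plotting system is an assumption. The data is again a synthetic stand-in for Mydf.

```r
# Synthetic stand-in for Mydf (values are invented)
Mydf <- data.frame(
  DepPC     = rep(c("75", "92", "93", "94"), times = 30),
  DProgr    = 1:120,
  Qty       = rep(c(10, 20, 30, 40, 50, 60), times = 20),
  Delivered = rep(c(TRUE, TRUE, FALSE), times = 40)
)

# Total quantity sold per department
sales_by_dep <- aggregate(Mydf$Qty, by = Mydf["DepPC"], FUN = sum)

# One bar per department, labelled with the department code
barplot(sales_by_dep$x,
        names.arg = sales_by_dep$DepPC,
        xlab = "Department (DepPC)",
        ylab = "Total quantity sold",
        main = "Sales by department")
```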
This gives us the sales for each department.
You might ask the same question, but only for the goods that were delivered. To do this, you first subset the data for which Delivered is true (T) using the now familiar subsetting operator "[".
Note that here you assign the result to a new variable Y, which is a new dataframe that inherits the same column names from the parent dataframe Mydf. You do this for readability, so that the subsetting does not have to be repeated inside the aggregate and plotting calls:
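A minimal sketch of the subsetting step followed by the same aggregation and plot, on a synthetic stand-in for Mydf (invented values; base graphics are an assumption, and delivered_by_dep is a hypothetical name):

```r
# Synthetic stand-in for Mydf (values are invented)
Mydf <- data.frame(
  DepPC     = rep(c("75", "92", "93", "94"), times = 30),
  DProgr    = 1:120,
  Qty       = rep(c(10, 20, 30, 40, 50, 60), times = 20),
  Delivered = rep(c(TRUE, TRUE, FALSE), times = 40)
)

# Keep only the rows where the product was delivered
Y <- Mydf[Mydf$Delivered == TRUE, ]

# Sum the delivered quantities per department
delivered_by_dep <- aggregate(Y$Qty, by = Y["DepPC"], FUN = sum)
delivered_by_dep

barplot(delivered_by_dep$x,
        names.arg = delivered_by_dep$DepPC,
        xlab = "Department (DepPC)",
        ylab = "Delivered quantity",
        main = "Delivered sales by department")
```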
So you can pose different questions to the data in a vectorized way with aggregate(), and you often do this in conjunction with a handy plotting system like ggplot2; you get the gist.
Note that to get this, you only needed very few lines of code.
Vectorization As An Alternative To Loops And Apply Functions?
You have seen some variations on the same theme, which is “act on a structured set of data in a repetitive way”. In this sense, these functions can be seen not only as an alternative to loops, but also as a vectorized form of doing things.
“Vectorized” is used here in the loose sense; we won’t enter the debate about whether, and which of, the apply() functions are truly vectorized (see for example the discussion here).
In practice, in order to choose which apply() function to use, you need to consider the following:
 The data type of the input: this is the object you will act upon (vector, matrix, array, list, data frame or perhaps a combination of those);
 What you intend to do: the FUN function you want to pass;
 The subsets of that data: rows, columns, or perhaps all; and
 The type of data that you want to get from the function, because you might want to perform further operations on it (and do you want a new object, or do you want to transform the input object directly?).
These are quite general questions that you may also ask for the related functions, such as aggregate(), by(), sweep(), etc.
But there are many more! Don’t stop exploring now!
As a followup to this tutorial, consider taking DataCamp’s introduction to R tutorial or Intermediate R course.