Tutorial on the R Apply Family
The Apply Functions As Alternatives To Loops
This post will show you how you can use the R apply() function, its variants such as mapply(), and a few of apply()'s relatives, applied to different data structures. Of course, not all the variants can be discussed, but where possible, you will be introduced to the use of these functions in cooperation, via a couple of slightly more beefy examples.
Also, you might find it useful to look at this introduction to R tutorial to better understand lists, vectors, arrays and dataframes, though you don’t necessarily need to have completed the tutorial to follow this post!
The apply() family belongs to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions let you traverse the data in a number of ways and avoid the explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.
The called function could be:
- An aggregating function, such as the mean or the sum (which return a single number, or scalar);
- Other transforming or subsetting functions; and
- Other vectorized functions, which return more complex structures like lists, vectors, matrices and arrays.
The apply() functions form the basis of more complex combinations and help to perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply() and tapply() functions.
But how and when should we use these?
Well, this depends on the structure of the data that you want to operate on and the format of the output that you need.
How To Use apply() in R
Let’s start with the godfather of the family, apply(), which operates on arrays. For simplicity, the tutorial limits itself to 2D arrays, which are also known as matrices.
The R base manual tells you that it’s called as follows:
apply(X, MARGIN, FUN, ...)
- X is an array, or a matrix if the dimension of the array is 2;
- MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. Note that when you use the construct MARGIN=c(1,2), it applies to both rows and columns; and
- FUN is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
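The three MARGIN settings can be sketched on a tiny matrix. The matrix m and the user-defined function spread() below are made up for this illustration and are not part of the original examples:

```r
# Illustrative 2 x 3 matrix (values are arbitrary)
m <- matrix(1:6, nrow = 2, ncol = 3)

# A hypothetical user-defined function: the width of the range of a vector
spread <- function(v) max(v) - min(v)

by_row  <- apply(m, 1, spread)                     # one value per row
by_col  <- apply(m, 2, spread)                     # one value per column
by_cell <- apply(m, c(1, 2), function(v) v * 10)   # applied to every single cell

by_row    # 4 4
by_col    # 1 1 1
by_cell   # a 2 x 3 matrix, each cell multiplied by 10
```

With MARGIN=c(1,2) the function receives one cell at a time, so the result keeps the shape of the input matrix.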
Now, beginners may have difficulties in visualizing what is actually happening, so a picture and some code will come in handy to help you to figure this out.
Let’s construct a 5 x 6 matrix and imagine you want to sum the values of each column.
You can write something like this:
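A minimal sketch of such a call, assuming the matrix is named X and filled with random normal values (the seed and the values are illustrative only):

```r
set.seed(1)  # arbitrary seed, only so that the run is repeatable

# A 5 x 6 matrix filled with random normal values (illustrative data)
X <- matrix(rnorm(30), nrow = 5, ncol = 6)

# Sum the values of each column: MARGIN = 2
col_sums <- apply(X, 2, sum)
col_sums
```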
Remember that in R, a matrix can be seen as a collection of row vectors when you traverse it from top to bottom (along dimension, or margin, 1), or as a collection of column vectors when you traverse it from left to right (along dimension, or margin, 2).
That means that the instruction you have just entered, depicted in figure 1, translates into: “apply the function ‘sum’ to the matrix X along margin 2 (by column), summing up the values of each column”.
Note that, to avoid cluttering the picture, just one of the columns is highlighted.
You end up with a vector, printed on a single line, containing the sums of the values of each column.
The output would have been printed in the same way if you had summed along the rows of the matrix. This is just how R displays a vector; the result is neither a row nor a column.
A note for what follows: in most cases, R prints the result of a call even if the return value of the function has not been assigned to a variable. R simply returns the last object evaluated. In practice, however, when you want to check the return value or do further operations on it, it is best to assign the result of a given function to a variable explicitly.
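Both remarks can be checked with a short sketch. The matrix values and the variable names row_totals, col_totals and row_share are assumptions made for this example:

```r
X <- matrix(1:30, nrow = 5, ncol = 6)  # illustrative values

row_totals <- apply(X, 1, sum)  # sum along the rows
col_totals <- apply(X, 2, sum)  # sum along the columns

is.null(dim(row_totals))  # TRUE: a plain vector without dimensions
length(row_totals)        # 5, one sum per row
length(col_totals)        # 6, one sum per column

# Because the result was stored in a variable, it can feed further operations
row_share <- row_totals / sum(row_totals)
```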
You want to apply a given function to every element of a list and obtain a list as a result. When you execute ?lapply, you see that the syntax looks like that of the apply() function.
The difference is that:
- It can be used for other objects like dataframes, lists or vectors; and
- The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.
To see how this works, create a few matrices and extract from each a given column.
This is a quite common operation performed on real data when making comparisons or aggregations from different dataframes.
Our toy example, depicted in figure 2, can be coded as:
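A sketch of the toy example, assuming three small matrices named A, B and C collected in a list MyList (the values are illustrative):

```r
# Three small matrices of different shapes (illustrative values)
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(rep(8:10, 2), nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# Extract the second column from every matrix in the list.
# The empty argument between the two commas means "all rows".
second_cols <- lapply(MyList, "[", , 2)
second_cols
# [[1]] 4 5 6
# [[2]] 8 9 10 11
# [[3]] 8 9 10
```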
The operation is shown in the left part of figure 2.
Again, you start by specifying the object of interest, the list MyList. You use the standard R selection operator [ and then omit the first parameter (which therefore translates into “any”; that’s why you see the two commas). Next, you specify the second parameter, which is 2: our margin is ‘column’. So you extract the second column from all the matrices within the list.
A few notes to the code above:
- The [ notation is the select operator. Remember, for example, that extracting all the elements of the third row of B requires B[3, ];
- The [[ ]] notation expresses the fact that we are dealing with lists: [[2]] means the second element of the list. This is shown also in the output given by R;
- The output is a list with as many elements as the input list; and
- Note that you could also have extracted a single element from each matrix, like this:
lapply(MyList,"[", 1, 2)
In the right-hand side of figure 2, you can see an alternative extraction: this time you give 1 as the first parameter and omit the second one, so you get the first row from each of the matrices.
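A sketch of both alternative extractions, assuming the same illustrative matrices A, B and C as before:

```r
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(rep(8:10, 2), nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# Row 1, all columns: the column index is left empty
first_rows <- lapply(MyList, "[", 1, )

# A single element: row 1, column 2 of each matrix
single_elements <- lapply(MyList, "[", 1, 2)

first_rows       # [[1]] 1 4 7   [[2]] 4 8 12   [[3]] 8 8
single_elements  # [[1]] 4       [[2]] 8        [[3]] 8
```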
The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().
An example may help to understand this: let’s say that you want to repeat the extraction operation of a single element as in the last example, but now take the first element of the second row (indexes 2 and 1) for each matrix. The lapply() function would give you a list, whereas sapply() returns a vector, unless you pass simplify=FALSE as a parameter to sapply(); then a list is returned. See how it works in the code chunk below:
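A sketch of the comparison, again assuming the illustrative matrices A, B and C in MyList:

```r
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(rep(8:10, 2), nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# lapply() always returns a list
as_list <- lapply(MyList, "[", 2, 1)

# sapply() simplifies the same result to a vector
as_vector <- sapply(MyList, "[", 2, 1)

# With simplify = FALSE, sapply() returns a list again
still_list <- sapply(MyList, "[", 2, 1, simplify = FALSE)

as_vector  # 2 5 9
```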
Conversely, wrapping the call in a function like unlist() turns the list returned by lapply() into a vector!
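For instance, assuming the same illustrative MyList, the conversion could look like this:

```r
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(rep(8:10, 2), nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# unlist() flattens the list returned by lapply() into a vector
flattened <- unlist(lapply(MyList, "[", 2, 1))
flattened  # 2 5 9
```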
Anyway, to avoid confusion, it is best to use these functions in their ‘native format’ and avoid conversions unless strictly necessary.
Something that is often used together with the apply() functions is rep(). When you apply it to a vector or a factor x, the function replicates its values a specified number of times.
Let’s use one of the vectors that you generated above with sapply(). This time, however, you select only the element in the first row and first column of each element of the list MyList (and you use sapply() to get a vector):
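A sketch of this step, assuming the illustrative matrices A, B and C from the earlier examples:

```r
A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(4:15, nrow = 4, ncol = 3)
C <- matrix(rep(8:10, 2), nrow = 3, ncol = 2)
MyList <- list(A, B, C)

# First row, first column of each matrix, simplified to a vector
Z <- sapply(MyList, "[", 1, 1)
Z  # 1 4 8

# Replicate each value of Z the number of times given by c(3, 1, 2)
Z <- rep(Z, c(3, 1, 2))
Z  # 1 1 1 4 8 8
```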
You see that the code above replicates the values of Z a number of times as established by c(3,1,2): three times the first, one time the second and two times the third.
The mapply() function stands for ‘multivariate’ apply. Its purpose is to vectorize the arguments of a function that does not usually accept vectors as arguments. In short, mapply() applies a function to multiple list or multiple vector arguments.
Let’s look at an mapply() example where you create a 4 x 4 matrix with repeated calls to the rep() function:
But there is a more efficient way to bind the results of the rep() function than with c(): when you call mapply(), you vectorize the action of the function rep().
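Both approaches can be sketched side by side. The names Q1 and Q2 are assumptions chosen for this illustration:

```r
# Repeated calls to rep(), glued together with c()
Q1 <- matrix(c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)), nrow = 4, ncol = 4)

# The same matrix with mapply(): rep() is called once for each of 1, 2, 3, 4,
# each time with times = 4, and the results are bound column by column
Q2 <- mapply(rep, 1:4, 4)

Q2
all(Q1 == Q2)  # TRUE: both matrices hold the same values
```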
The aggregate() function is contained in the stats package and you use it like this: aggregate(x, by, FUN, ..., simplify = TRUE).
In other words, it works similarly to the apply() function: you specify the object and the function, and you say whether you want to simplify, just like with the sapply() function. The key difference is the use of the by clause, which sets the variable or dataframe field by which you want to perform the aggregation.
The next section will show you how this works.
An Example of aggregate()
Consider the toy dataset called Mydf, which contains data about the sales of a product and in which some of the values of the DepPC column repeat.
This variable classifies the data by geographical location, like the portion of a post code (here the numbers correspond to the departments of the Île de France, the region comprising Paris).
You want to do some stats on the sales columns. These are DProgr, a progressive number in increasing time order, and the sales of the product (the quantity Qty), plus a logical variable, Delivered, telling us whether the product has been delivered (T) or not (F).
First, you can do a number of very simple things to get acquainted with the data set, other than showing it all, by just typing its name (here we only have 120 records, but imagine doing this for a real file with thousands of lines!).
Let’s explore the data:
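The original data file is not reproduced here, so the sketch below builds a synthetic stand-in for Mydf with the same column names. All values are randomly generated for illustration and are not real sales figures:

```r
set.seed(42)  # arbitrary seed, only so that the run is repeatable
n <- 120

# Synthetic stand-in for Mydf (randomly generated, not the original data)
Mydf <- data.frame(
  DepPC     = sample(c("75", "77", "78", "91", "92", "93", "94", "95"),
                     n, replace = TRUE),
  DProgr    = seq_len(n),
  Qty       = sample(1:50, n, replace = TRUE),
  Delivered = sample(c(TRUE, FALSE), n, replace = TRUE)
)

head(Mydf)         # first six records
str(Mydf)          # column types, number of observations and variables
summary(Mydf$Qty)  # basic statistics on the quantities sold
```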
Note that if you want to see the number of rows and columns that the dataframe contains, you could have also called dim().
Many other enquiries on the data are possible.
Here, you are interested in knowing in which department the product sells best, for example. That’s why you should regroup the data by department, summing up the sales, Qty, for each department DepPC with the help of the aggregate() function. Passing sum as the FUN argument of aggregate() tells R that you wish to sum over all the Qty that belong to the same department.
Note that R assigned the sum to a variable ‘x’, because you didn’t say otherwise.
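A sketch of the aggregation, run on a synthetic stand-in for Mydf (randomly generated values, not the original data; the name sales_by_dep is also an assumption):

```r
set.seed(42)
n <- 120
# Synthetic stand-in for Mydf (not the original data)
Mydf <- data.frame(
  DepPC     = sample(c("75", "77", "78", "91", "92"), n, replace = TRUE),
  DProgr    = seq_len(n),
  Qty       = sample(1:50, n, replace = TRUE),
  Delivered = sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Sum Qty within each department; 'by' must be a list (a one-column
# dataframe qualifies), and its name becomes the grouping column name
sales_by_dep <- aggregate(Mydf$Qty, by = Mydf["DepPC"], FUN = sum)

names(sales_by_dep)  # "DepPC" "x": the sums land in a column called x
sales_by_dep
```

If you prefer a more telling column name than x, one option is to rename it afterwards, for example with names(sales_by_dep)[2] <- "TotalQty".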
The output is quite readable as is, but for a higher number of departments, this might be less so. In these cases, you can resort to some graphical output: you plot the aggregated results by using one of R’s graphical output systems.
This gives us the sales for each department.
You might ask the same question, but only for the goods that were delivered. To do this, you first subset the data for which Delivered is true (T) using the now familiar subsetting operator [.
Note that here you assign the result to a new variable Y, which is a new dataframe that inherits the same column names from the parent dataframe Mydf. You do this for readability, to avoid repeating the aggregate instruction within the plotting call:
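A sketch of the subsetting and aggregation steps, again on a synthetic stand-in for Mydf (random values, not the original data). The plotting call is left as a comment and uses base R's barplot() as a dependency-free substitute for ggplot2:

```r
set.seed(42)
n <- 120
# Synthetic stand-in for Mydf (not the original data)
Mydf <- data.frame(
  DepPC     = sample(c("75", "77", "78", "91", "92"), n, replace = TRUE),
  DProgr    = seq_len(n),
  Qty       = sample(1:50, n, replace = TRUE),
  Delivered = sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Keep only the delivered records; all columns are retained
Y <- Mydf[Mydf$Delivered == TRUE, ]

# Aggregate once, store the result, then reuse it for plotting
delivered_by_dep <- aggregate(Y$Qty, by = Y["DepPC"], FUN = sum)
delivered_by_dep

# Base R alternative to a ggplot2 bar chart:
# barplot(delivered_by_dep$x, names.arg = delivered_by_dep$DepPC)
```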
So you can pose different questions to the data in a vectorized way with aggregate(), and you often do this in conjunction with a handy plotting system like ggplot2. You get the gist.
Note that to get this, you only needed very few lines of code.
Vectorization As An Alternative To Loops And Apply Functions?
You have seen some variations on the same theme, which is “act on a structured set of data in a repetitive way”. In this sense, these functions can be seen not only as an alternative to loops, but also as a vectorized form of doing things.
“Vectorized” is meant here in the loose sense; we won’t enter the debate about whether (and which of) the apply() functions are truly vectorized (see for example the discussion here).
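To make the comparison concrete, here is a sketch of the same column sums computed three ways: with an explicit loop, with apply(), and with the vectorized base function colSums(). The matrix and the variable names are illustrative assumptions:

```r
X <- matrix(1:30, nrow = 5, ncol = 6)  # illustrative values

# 1. Explicit loop over the columns
loop_sums <- numeric(ncol(X))
for (j in seq_len(ncol(X))) {
  loop_sums[j] <- sum(X[, j])
}

# 2. apply() along margin 2
apply_sums <- apply(X, 2, sum)

# 3. Dedicated vectorized function
vec_sums <- colSums(X)

all(loop_sums == apply_sums)  # TRUE
all(apply_sums == vec_sums)   # TRUE
```

All three give the same numbers; they differ in how much bookkeeping you write yourself.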
In practice, in order to choose which apply() function to use, you need to consider the following:
- The data type of the input: this is the object you will act upon (vector, matrix, array, list, data frame or perhaps a combination of those)
- What you intend to do: the FUN function you want to pass
- The subsets of that data: rows, columns, or perhaps all?
- What type of data you want to get from the function, because you might want to perform further operations on it (and do you want a new object, or do you want to transform the input object directly?)
These are quite general questions that you may ask for the related functions, of which we have considered only a handful here.
But there are many more! Don’t stop exploring now!