Skip to main content

Tutorial on the R Apply Family

In this tutorial, you'll learn about the use of the apply functions in R, its variants, and a few of its relatives applied to different data structures.
Jan 2019  · 15 min read

learn r banner

The Apply Functions as Alternatives to Loops

This post will show you how you can use the R apply() function, its variants such as mapply() and a few of apply()'s relatives, applied to different data structures. Of course, not all the variants can be discussed, but when possible, you will be introduced to the use of these functions in cooperation, via a couple of slightly more beefy examples.

Also, you might find it useful to look at this introduction to R course to better understand lists, vectors, arrays, and dataframes, though you don’t necessarily need to have completed the tutorial to follow this post!

The apply() Family

The apply() family pertains to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.

The called function could be:

  • An aggregating function, like for example the mean, or the sum (that return a number or scalar);
  • Other transforming or subsetting functions; and
  • Other vectorized functions, which yield more complex structures like lists, vectors, matrices, and arrays.

The apply() functions form the basis of more complex combinations and helps to perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply() , sapply(), vapply(), mapply(), rapply(), and tapply() functions.

But how and when should we use these?

Well, this depends on the structure of the data that you want to operate on and the format of the output that you need.

How to Use apply() in R

Let’s start with the godfather of the family, apply(), which operates on arrays. For simplicity, the tutorial limits itself to 2D arrays, which are also known as matrices.

The R base manual tells you that it’s called as follows: apply(X, MARGIN, FUN, ...)


  • X is an array or a matrix if the dimension of the array is 2;
  • MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. Note that when you use the construct MARGIN=c(1,2), it applies to both rows and columns; and
  • FUN, which is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).

Now, beginners may have difficulties in visualizing what is happening, so a picture and some code will come in handy to help you to figure this out.

Let’s construct a 5 x 6 matrix and imagine you want to sum the values of each column.

You can write something like this:


R apply functions

Remember that in R, a matrix can be seen as a collection of line vectors when you cross the matrix from top to bottom (along the vertical line 1, which specifies the dimension or margin 1), or as a list of columns vectors, spanning the matrix left to right along the dimension or margin 2.

That means that the instruction you have just entered, depicted in figure 1, translates into: “apply the function ‘sum’ to the matrix X along margin 2 (by column), summing up the values of each column.

Note that, to avoid cluttering the picture, just one of the columns is highlighted.

You end up with a line vector containing the sums of the values of each column.

The output of the above code, a line vector, would have also been given if you summed along the lines of the matrix. This is how R displays the result.

A note for the following: in most cases, R can return a value even if the latter has not been specified, or more precisely the return value of the function has not been assigned to a variable. R simply returns the last object evaluated. In practice, however, when you want to check the return value and when you need to do further operations on those return values, it is best to assign the results of a given function to a variable explicitly.

The lapply() Function

You want to apply a given function to every element of a list and obtain a list as a result. When you execute ?lapply, you see that the syntax looks like the apply() function.

The difference is that:

  1. It can be used for other objects like dataframes, lists or vectors; and
  2. The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.

To see how this works, create a few matrices and extract from each a given column.

This is a quite common operation performed on real data when making comparisons or aggregations from different dataframes.

Our toy example, depicted in figure 2 can be coded as:


lapply vs apply functions

The operation is shown in the left part of figure 2.

Again, you start by specifying the object of interest, the list Mylist. You use the standard R selection operator [ and then omit the first parameter (which therefore translates into “any”, that’s why you see the two commas).

Next, you specify the second parameter, which is 2: our margin is ‘column’. So you extract the second column from all the matrices within the list.

A few notes to the code above:

  • The [ notation is the select operator. Remember, for example, that to extract all the elements of the third line of B requires: B[3,];
  • The [[ ]] notation expresses the fact that we are dealing with lists: [[2]] means the second element of the list. This is also shown in the output given by R;
  • The output is a list with as many elements as the element in the input; and
  • Note that you could also have extracted a single element for each matrice, like this: lapply(MyList,"[", 1, 2)

In the right-hand side of figure 2, you can see an alternative extraction: this time you omit the first parameter, and you get the first row from each of the matrices.

Try it yourself! Select the second column from each matrix in the list:


The sapply() Function

The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().

An example may help to understand this: let’s say that you want to repeat the extraction operation of a single element as in the last example, but now take the first element of the second row (indexes 2 and 1) for each matrix.

Applying the lapply() function would give us a list unless you pass simplify=FALSE as a parameter to sapply(). Then, a list will be returned. See how it works in the code chunk below:


Conversely, a function like unlist(), can tell lappy() to give you a vector!

Anyway, to avoid confusion, it is best to use these functions in their ‘native format’ and avoid conversions unless strictly necessary.

The rep() Function

Something that is often used together with apply() functions is rep(). When you apply it to a vector or a factor x, the function replicates its values a specified number of times.

Let’s use one of the vectors that you generated above with lapply() into MyList.

This time, however, you only select the elements of the first line and first column from each elements of the list MyList (and you use sapply() to get a vector):


You see that the code above replicates the values of Z a number of times as established by c(3,1,2): three times the first, one time the second and two times the third:

Handy, no?

The mapply() Function

The mapply() function stands for ‘multivariate’ apply. Its purpose is to be able to vectorize arguments to a function that is not usually accepting vectors as arguments.

In short, mapply() applies a Function to Multiple List or multiple Vector Arguments.

Let’s look at a mapply() example where you create a 4 x 4 matrix with a call to the rep() function repeatedly:


But you see that there is a more efficient way to bind the results of the rep() function instead of with c(): when you call mapply(), you vectorize the action of the function rep().

The aggregate() Function

This function is contained in the stats package, and you use it like this: aggregate(x, by, FUN, ..., simplify = TRUE).

In other words, it works similarly to the apply() function: you specify the object, the function and you say whether you want to simplify, just like with the sapply() function. The critical difference is the use of the by clause, which sets the variable or dataframe field by which we want to perform the aggregation.

The next section will show you how this works.

An Example of aggregate()

Consider the toy dataset called Mydf, which contains data about the sales of a product and where some of the values of the variable DepPC column repeat.

This variable classifies the data on a geographical location, like the portion of a post code (here the numbers correspond to the departments of the Île de France, the region comprising Paris).

You want to do some stats on the sales columns. These are DProgr, a progressive number in increasing time order, and the sales of the product (the quantity Qty), plus a logical variable, Delivered, which is logical, telling us whether the product has been delivered (T) or not (F).

First, you can do a number of very simple things to get acquainted with the data set, other than showing it all, by just typing its name (here we only have 120 records, but imagine doing this for a real file with thousands of lines!).

Let’s explore the data:


Note that if you want to see the number of rows and columns that the dataframe contains, you could have also called nrow(Mydf) and ncol(Mydf).

Many other inquiries on the data are possible.

Here, you are interested in knowing where the product sells best in which department, for example. That’s why you should regroup the data by department, summing up the sales, Qty, for each department DepPC with the help of the aggregate() function:


So, aggregate() tells R that you wish to sum over all the Qty that belong to the same department.

Note that R assigned the sum to a variable ‘x’ because you didn’t say otherwise.

The output is quite readable as is, but for a higher number of departments, this might be less readable. In these cases, you can resort to some graphical output: you plot the results by using one of R’s graphical output systems together with the aggregate() function:


This gives us the sales for each department.

You might ask the same question, but only for the goods that were delivered. To do this, you first subset the data for which delivered is true (T) using the now familiar subsetting operator "[".

Note that here you assign the result to a new variable Y, which is a new dataframe that inherits the same columns names from the parent dataframe Mydf. You do this to avoid repeating the aggregate instruction within the call to the plotting for readability:


So you could have posed different questions to the data in a vectorized way like with aggregate(), and this you often do in conjunction with a convenient plotting system like ggplot2, so you get the jist.

Note that to get this, you only needed very few lines of code.

Vectorization as an Alternative to Loops and Apply Functions?

You have seen some variations on the same theme, which is “act on a structured set of data in a repetitive way”. In this sense, these functions can be seen not only as an alternative to loops but also as a vectorized form of doing things.

“Vectorized” here in the loose sense, we won’t enter the debate that asks whether – and which of the – apply() functions are indeed vectorized or not (see for example the discussion here).

In practice, in order to choose which apply() function to use, you need to consider the following:

  • The data type of the input: this is the object you will act upon (vector, matrix, array…, list, data frame or perhaps a combination of those)
  • What you intend to do: the FUN function you want to pass
  • The subsets of that data: rows, columns, or perhaps all?
  • What type of data do you want to get from the function? Because you might want to perform further operations on it (and do you want a new object, or do you want to transform the input object directly?)

These are quite general questions that you may ask for the related functions, of which we have considered aggregate(), by(), sweep(), etc.

But there are many more! Don’t stop exploring now! Check out DataCamp's Arrays in R tutorial.

As a follow-up to this tutorial, consider taking DataCamp’s Introduction to R or Intermediate R courses.

Learn more about R

Introduction to R

4 hr
Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.
See DetailsRight Arrow
Start Course
See MoreRight Arrow

Inside Our Favorite DataFramed Episodes of 2022

An inside look at our favorite episodes of the DataFramed podcast of 2022
Adel Nehme's photo

Adel Nehme

2 min

[Infographic] Data Science Project Checklist

Use this checklist when planning your next data science project.
Adel Nehme's photo

Adel Nehme

Introduction to Probability Rules Cheat Sheet

Learn the basics of probability with our Introduction to Probability Rules Cheat Sheet. Quickly reference key concepts and formulas for finding probability, conditional probability, and more.
Richie Cotton's photo

Richie Cotton

1 min

Data Governance Fundamentals Cheat Sheet

Master the fundamentals of data governance with our Data Governance Fundamentals Cheat Sheet. Quickly reference key concepts, best practices, and key components of a data governance program.
Richie Cotton's photo

Richie Cotton

1 min

ChatGPT Cheat Sheet for Data Science

In this cheat sheet, gain access to 60+ ChatGPT prompts for data science tasks.
Travis Tang's photo

Travis Tang

10 min

T-tests in R Tutorial: Learn How to Conduct T-Tests

Determine if there is a significant difference between the means of the two groups using t.test() in R.
Abid Ali Awan's photo

Abid Ali Awan

10 min

See MoreSee More