Tutorial on the R Apply Family

In this tutorial, you'll learn how to use the apply() function in R, its variants, and a few of its relatives on different data structures.

Updated Mar 31, 2023 · 14 min read

A note on the R Apply Family

The R Apply Family is a set of functions in R that allow users to apply a function to elements of a vector, list, or matrix. These functions are part of base R and remain fully supported, but many R users now prefer the purrr package for this kind of iteration. The purrr package provides a consistent syntax for working with functions that take multiple inputs and outputs, which can make code easier to write and read. If you are starting a new R project, it is worth considering purrr alongside the Apply Family; the functions covered in this tutorial remain useful for reading existing code and for working without extra dependencies.

The Apply Functions as Alternatives to Loops

This post will show you how you can use the R apply() function, its variants such as mapply() and a few of apply()'s relatives, applied to different data structures. Of course, not all the variants can be discussed, but when possible, you will be introduced to the use of these functions in cooperation, via a couple of slightly more beefy examples.

Also, you might find it useful to look at this introduction to R course to better understand lists, vectors, arrays, and dataframes, though you don’t necessarily need to have completed the tutorial to follow this post!

The `apply()` Family

The apply() family belongs to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions let you traverse the data in a number of ways and avoid the explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.

The called function could be:

An aggregating function, such as mean() or sum(), which returns a single number (a scalar);
Other transforming or subsetting functions; and
Other vectorized functions, which yield more complex structures like lists, vectors, matrices, and arrays.

The apply() functions form the basis of more complex combinations and help you perform operations with very few lines of code. More specifically, the family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.

But how and when should we use these?

Well, this depends on the structure of the data that you want to operate on and the format of the output that you need.

How to Use `apply()` in R

Let’s start with the godfather of the family, apply(), which operates on arrays. For simplicity, the tutorial limits itself to 2D arrays, which are also known as matrices.

The R base manual tells you that it’s called as follows: apply(X, MARGIN, FUN, ...)

where:

X is an array, or a matrix if the array has 2 dimensions;
MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. Note that when you use the construct MARGIN=c(1,2), the function is applied to every individual element, that is, to each combination of row and column; and
FUN is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
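The three arguments described above can be tried out on a tiny matrix. The following is only a minimal sketch: the matrix m and the helper value_range() are hypothetical names introduced for illustration and do not appear elsewhere in this tutorial.

```r
# A small 2 x 3 matrix to experiment with (illustrative data)
m <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1: the function receives one row at a time
row_max <- apply(m, 1, max)

# MARGIN = 2: the function receives one column at a time
col_max <- apply(m, 2, max)

# A user-defined function is passed in exactly the same way
# (value_range is a hypothetical helper, not a base R function)
value_range <- function(v) max(v) - min(v)
col_range <- apply(m, 2, value_range)

# MARGIN = c(1, 2): the function receives each single element in turn,
# so the result keeps the 2 x 3 shape of the input
squared <- apply(m, c(1, 2), function(v) v^2)

print(row_max)
print(col_max)
print(col_range)
print(squared)
```

With MARGIN = 1 you get one value per row, with MARGIN = 2 one value per column, and with MARGIN = c(1, 2) one value per cell.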

Now, beginners may have difficulties in visualizing what is happening, so a picture and some code will come in handy to help you to figure this out.

Let’s construct a 5 x 6 matrix and imagine you want to sum the values of each column.

You can write something like this:

# Construct a 5 x 6 matrix
my_matrix <- matrix(1:30, nrow = 5, ncol = 6)

# Calculate the sum of each column
col_sums <- apply(my_matrix, 2, sum)

# Print the result
print(col_sums)

The matrix() function creates a 5 x 6 matrix with values ranging from 1 to 30. We then use the apply() function to apply the sum() function to each column of the matrix (2 specifies that we want to apply the function to columns). The resulting vector of column sums is stored in the col_sums variable and printed to the console.

Remember that in R, a matrix can be seen as a collection of row vectors when you traverse it from top to bottom (along dimension or margin 1), or as a collection of column vectors when you traverse it from left to right (along dimension or margin 2).

That means that the instruction you have just entered, depicted in figure 1, translates into: “apply the function ‘sum’ to the matrix my_matrix along margin 2 (by column), summing up the values of each column”.

Note that, to avoid cluttering the picture, just one of the columns is highlighted.

You end up with a vector containing the sums of the values of each column.

You would also get a plain vector if you summed along the rows of the matrix instead, only with five values rather than six. In both cases apply() returns an ordinary vector, and this is simply how R displays the result.
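To make the difference between the two margins concrete, here is a short sketch that sums the same 5 x 6 matrix along both. The comparison with colSums() and rowSums() is an addition for illustration; these base R functions are not discussed elsewhere in this tutorial.

```r
# The same 5 x 6 matrix as in the example above
my_matrix <- matrix(1:30, nrow = 5, ncol = 6)

# Sum along margin 2 (columns): one value per column, six in total
col_sums <- apply(my_matrix, 2, sum)

# Sum along margin 1 (rows): one value per row, five in total
row_sums <- apply(my_matrix, 1, sum)

print(col_sums)  # 15 40 65 90 115 140
print(row_sums)  # 81 87 93 99 105

# For plain sums, base R also offers dedicated functions
# that give the same values
print(all(col_sums == colSums(my_matrix)))
print(all(row_sums == rowSums(my_matrix)))
```

Both results are ordinary vectors; only their length differs, because the matrix has six columns and five rows.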

A note for the following: in most cases, R can return a value even if the latter has not been specified, or more precisely the return value of the function has not been assigned to a variable. R simply returns the last object evaluated. In practice, however, when you want to check the return value and when you need to do further operations on those return values, it is best to assign the results of a given function to a variable explicitly.

The `lapply()` Function

Use lapply() when you want to apply a given function to every element of a list and obtain a list as a result. When you execute ?lapply, you see that the syntax looks like that of the apply() function.

The difference is that:

It can be used for other objects like dataframes, lists or vectors; and
The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.

To see how this works, create a few matrices and extract from each a given column.

This is a quite common operation performed on real data when making comparisons or aggregations from different dataframes.

Our toy example, depicted in figure 2, can be coded as:

# Create three matrices
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)

# Create a list of matrices
MyList <- list(A, B, C)

# Extract the 2nd column from each matrix in `MyList` with the selection operator `[` and `lapply()`
lapply(MyList, "[", , 2)

# Extract the 1st row from each matrix in `MyList`
lapply(MyList, "[", 1, )

The operation is shown in the left part of figure 2.

Again, you start by specifying the object of interest, the list MyList. You use the standard R selection operator [ and then omit the first parameter, the row index, which therefore translates into “any row”; that’s why you see the two commas.

Next, you specify the second parameter, which is 2: our margin is ‘column’. So you extract the second column from all the matrices within the list.

A few notes to the code above:

The [ notation is the select operator. Remember, for example, that extracting all the elements of the third row of B requires: B[3,];
The [[ ]] notation expresses the fact that we are dealing with lists: [[2]] means the second element of the list. This is also shown in the output given by R;
The output is a list with as many elements as the input list; and
Note that you could also have extracted a single element from each matrix, like this: lapply(MyList,"[", 1, 2)

In the right-hand side of figure 2, you can see an alternative extraction: this time you specify 1 as the first parameter and omit the second, and you get the first row from each of the matrices.
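As noted earlier, lapply() is not limited to lists of matrices: it also accepts vectors and dataframes. The sketch below illustrates this with made-up data; the names squares, df and col_means are hypothetical and exist only for this example.

```r
# A plain vector: each element is passed to the function in turn
squares <- lapply(1:3, function(x) x^2)

# A dataframe is a list of columns, so the function is applied per column
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
col_means <- lapply(df, mean)

# In both cases the result is a list with one element per input element
str(squares)
str(col_means)
```

The list returned for the dataframe keeps the column names, so you can access a single result with col_means$a or col_means[["b"]].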

Try it yourself! Select the second column from each matrix in the list:

# Use `lapply()` to select the 2nd column from each matrix in `MyList`
lapply(MyList, "[", , 2)

The `sapply()` Function

The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().

An example may help to understand this: let’s say that you want to repeat the extraction operation of a single element as in the last example, but now take the first element of the second row (indexes 2 and 1) for each matrix.

Applying the lapply() function gives you a list, whereas sapply() returns a vector, unless you pass simplify=FALSE as a parameter to sapply(); in that case, a list is returned as well. See how it works in the code chunk below:

# Return a list with `lapply()`
lapply(MyList, "[", 2, 1)

# Return a vector with `sapply()`
sapply(MyList, "[", 2, 1)

# Return a list with `sapply()`
sapply(MyList, "[", 2, 1, simplify = FALSE)

# Return a vector with `unlist()`
unlist(lapply(MyList, "[", 2, 1))

Conversely, a function like unlist() can turn the list returned by lapply() into a vector!

Anyway, to avoid confusion, it is best to use these functions in their ‘native format’ and avoid conversions unless strictly necessary.
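What “the most elementary data structure that is possible” means depends on what the applied function returns for each element. The following sketch, which rebuilds the same three matrices, shows three typical outcomes; the result names single_values, dims and first_rows are hypothetical.

```r
# Rebuild the list of matrices used in this tutorial
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)
MyList <- list(A, B, C)

# One value per element: the result is simplified to a vector
single_values <- sapply(MyList, "[", 2, 1)

# Two values per element (number of rows and columns):
# the result is simplified to a matrix with one column per list element
dims <- sapply(MyList, dim)

# Results of different lengths (C has only two columns)
# cannot be simplified, so a list is returned
first_rows <- sapply(MyList, function(m) m[1, ])

print(single_values)
print(dims)
print(first_rows)
```

Because the shape of the result depends on the data, it is a good idea to check what sapply() returned before passing it on to further computations.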

The `rep()` Function

Something that is often used together with apply() functions is rep(). When you apply it to a vector or a factor x, the function replicates its values a specified number of times.

Let’s reuse the list MyList that you created above for lapply().

This time, however, you only select the element in the first row and first column of each element of the list MyList (and you use sapply() to get a vector):

# Initialize `Z`
Z <- sapply(MyList, "[", 1, 1)

# Return `Z`
Z

# Replicate the values of `Z`
Z <- rep(Z, c(3, 1, 2))

# Return `Z`
Z

You see that the code above replicates the values of Z a number of times as established by c(3,1,2): three times the first, one time the second and two times the third, which gives 1 1 1 4 8 8.

Handy, no?
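The second argument of rep() can be given in more than one way. As a small sketch, the vector Z below is typed in by hand with the values 1, 4 and 8 (an assumption made so that the example runs on its own), and the times and each arguments of rep() are compared.

```r
# The starting vector, typed in by hand for this sketch
Z <- c(1, 4, 8)

# A single number for `times` repeats the whole vector
whole_twice <- rep(Z, times = 2)          # 1 4 8 1 4 8

# `each` repeats every element before moving on to the next one
each_twice <- rep(Z, each = 2)            # 1 1 4 4 8 8

# A vector for `times` sets the number of repeats per element
per_element <- rep(Z, times = c(3, 1, 2)) # 1 1 1 4 8 8

print(whole_twice)
print(each_twice)
print(per_element)
```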

The `mapply()` Function

The mapply() function stands for ‘multivariate’ apply. Its purpose is to vectorize the arguments of a function that does not usually accept vectors as arguments.

In short, mapply() applies a Function to Multiple List or multiple Vector Arguments.

Let’s look at a mapply() example where you create a 4 x 4 matrix, first with repeated calls to the rep() function and then with a single call to mapply():

# Create a 4x4 matrix
Q1 <- matrix(c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)), 4, 4)

# Print `Q1`
print(Q1)

# Or use `mapply()`
Q2 <- mapply(rep, 1:4, 4)

# Print `Q2`
print(Q2)

As you can see, mapply() offers a more concise way to bind the results of the rep() function than combining them by hand with c(): when you call mapply(), you vectorize the action of the function rep().
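In the example above only the first argument of rep() varies. The sketch below shows what happens when several arguments vary at once; the names Q2, varying and sums, as well as the anonymous function, are illustrative assumptions.

```r
# Only the first argument varies: rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)
# All results have length 4, so they are bound into a 4 x 4 matrix
Q2 <- mapply(rep, 1:4, 4)

# Both arguments vary in parallel: rep(1, 3), rep(2, 2), rep(3, 1)
# The results have different lengths, so a list is returned
varying <- mapply(rep, 1:3, 3:1)

# Any function of several arguments can be used, including your own:
# here the i-th elements of both vectors are added together
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

print(Q2)
print(varying)
print(sums)  # 5 7 9
```

Like sapply(), mapply() simplifies its result when it can, so the output is a matrix, a vector or a list depending on what each call returns.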

Functions Related to `apply()`

Similarly structured functions are often used in conjunction with the members of the apply() family: the tutorial will only give an overview of a few of these.

The `sweep()` Function

The sweep() function is probably the closest to the apply() family. You use it when you want to replicate different actions on the MARGIN elements that you have chosen (limiting here to the matrix case).

A typical scenario occurs in clustering, where you may need to repetitively produce normalized and centered or “standardized” data.

What does this mean?

Assume you have a number of data points in a group of data. You first find the center of the data (“center of mass”) and look at how dispersed this data is with respect to this center. Two basic quantities will give you this information: the mean and the standard deviation.

Say your data points are the column vectors in a matrix of your data and let’s use the matrix B created at the start of this post, but now you have called it dataPoints.

You first find the mean and the dispersion, or standard deviation, of each column with the help of one of the apply() functions. Then you shift all the points with respect to their center by subtracting the column mean, and finally you scale them by dividing by the column standard deviation:

# Create the data points (the same values as matrix B)
dataPoints <- matrix(4:15, 4, 3)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means, "-")

# Return the result
dataPoints_Trans1

# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2

You produced the centered points with one call to sweep(). This function expects the following elements:

an input array, which in this case is a matrix;
a MARGIN, 2 to indicate the columns;
a summary statistic (here the vector of column means); and
a function to be applied. You use the arithmetic operator “-” for subtraction.

This means: “take the elements of the columns of the dataset dataPoints, and subtract the mean, dataPoints_means, from each of them”.

Now, you call sweep() again to divide all the values that you have just found by their own standard deviation. This step is called “normalization”. Again, you select MARGIN=2, and then you provide the vector of the standard deviations, dataPoints_sdev as an operand. Next, you pass the ‘divide by’ operator, “/”.

What you are asking of R is the following: “take the elements of the columns of the new object you just created, dataPoints_Trans1, and divide these (“/”) by their standard deviation dataPoints_sdev”.

Of course, you could have obtained the same result more rapidly and concisely (as often is the case in R!) and without using different names, with a single line of code that contains a nested call to sweep():

# Normalize the data with a nested call
dataPoints_Trans <- sweep(sweep(dataPoints, 2, dataPoints_means, "-"), 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans

Statistically, you have just standardized the data: each column now has mean 0 and standard deviation 1. Standardized data are at the base of several more advanced procedures on data (like dimensionality reductions via PCA, signal analysis and others).
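One way to convince yourself that the nested sweep() calls do what is described is to inspect the column means and standard deviations of the result. The sketch below does this and also compares the result with base R's scale() function, which is not covered in this tutorial and is used here only as a cross-check; the variable names are illustrative.

```r
# The same data points as above
dataPoints <- matrix(4:15, 4, 3)

# Column means and standard deviations
col_means <- apply(dataPoints, 2, mean)
col_sdev <- apply(dataPoints, 2, sd)

# Center, then divide by the standard deviation, with nested sweep() calls
standardized <- sweep(sweep(dataPoints, 2, col_means, "-"), 2, col_sdev, "/")

# After the transformation every column should have mean 0 and sd 1
print(apply(standardized, 2, mean))
print(apply(standardized, 2, sd))

# scale() centers and scales the columns in one call;
# attributes are ignored because scale() stores the means and sds it used
print(all.equal(standardized, scale(dataPoints), check.attributes = FALSE))
```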

The `aggregate()` Function

This function is contained in the stats package, and you use it like this: aggregate(x, by, FUN, ..., simplify = TRUE).

In other words, it works similarly to the apply() function: you specify the object, the function and you say whether you want to simplify, just like with the sapply() function. The critical difference is the use of the by clause, which sets the variable or dataframe field by which we want to perform the aggregation.

The next section will show you how this works.

An Example of `aggregate()`

Consider the toy dataset called Mydf, which contains data about the sales of a product and where the values of the DepPC column repeat.

This variable classifies the data on a geographical location, like the portion of a post code (here the numbers correspond to the departments of the Île de France, the region comprising Paris).

You want to do some stats on the sales columns. These are DProgr, a progressive number in increasing time order, and the sales of the product (the quantity Qty), plus a logical variable, Delivered, telling us whether the product has been delivered (TRUE) or not (FALSE).

First, you can do a number of very simple things to get acquainted with the data set, other than showing it all, by just typing its name (here we only have 120 records, but imagine doing this for a real file with thousands of lines!).

Let’s explore the data:

# Create the toy dataset
Mydf <- data.frame(DepPC = c("90", "91", "92", "93", "94", "75"),
                   DProgr = c(1:120),
                   Qty = c(7:31, 9:23, 99:124, 2:28, 14:19, 21:29, 4, 3, 1:9, 66),
                   Delivered = ifelse(rnorm(120) > 0, TRUE, FALSE))

# Explore the first 15 records with `head()`
head(Mydf, 15)

# See the last 5 records with `tail()`
tail(Mydf, 5)

# Show data types for each column
sapply(Mydf, class)

# Return number of rows and columns with `dim()`
dim(Mydf)

# How many departments?
unique(Mydf$DepPC)

Note that if you want to see the number of rows and columns that the dataframe contains, you could have also called nrow(Mydf) and ncol(Mydf).

Many other inquiries on the data are possible.

Here, you are interested in knowing in which department the product sells best, for example. That’s why you should regroup the data by department, summing up the sales, Qty, for each department DepPC with the help of the aggregate() function.

So, aggregate() tells R that you wish to sum over all the Qty that belong to the same department.

Note that R assigned the sum to a variable ‘x’ because you didn’t say otherwise.
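A call that matches this description might look like the sketch below. It rebuilds the toy dataset so that it runs on its own; the set.seed() call and the name sales_by_dep are assumptions added for reproducibility and readability, not part of the original example.

```r
# Rebuild the toy dataset (set.seed is an assumption, added so that
# the random Delivered column is reproducible)
set.seed(1)
Mydf <- data.frame(DepPC = c("90", "91", "92", "93", "94", "75"),
                   DProgr = 1:120,
                   Qty = c(7:31, 9:23, 99:124, 2:28, 14:19, 21:29, 4, 3, 1:9, 66),
                   Delivered = rnorm(120) > 0)

# Sum the quantities sold, grouped by department
sales_by_dep <- aggregate(Mydf$Qty, by = Mydf["DepPC"], FUN = sum)

# One row per department; the summed column is named `x` by default
print(sales_by_dep)
```

Passing Mydf["DepPC"] (a one-column dataframe) rather than Mydf$DepPC lets aggregate() keep the column name DepPC in its output.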

The output is quite readable as is, but for a higher number of departments, this might be less readable. In these cases, you can resort to some graphical output: you plot the results by using one of R’s graphical output systems together with the aggregate() function.

This gives us the sales for each department.
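As an illustration of such a plot, the sketch below uses barplot() from base R graphics, which is one possible choice of graphical system and not necessarily the one used for the original figure. The dataset is rebuilt so that the block runs on its own; set.seed() and the names sales_by_dep and bar_mid are assumptions.

```r
# Rebuild the toy dataset (set.seed is an assumption for reproducibility)
set.seed(1)
Mydf <- data.frame(DepPC = c("90", "91", "92", "93", "94", "75"),
                   DProgr = 1:120,
                   Qty = c(7:31, 9:23, 99:124, 2:28, 14:19, 21:29, 4, 3, 1:9, 66),
                   Delivered = rnorm(120) > 0)

# Total quantity per department
sales_by_dep <- aggregate(Mydf$Qty, by = Mydf["DepPC"], FUN = sum)

# One bar per department; barplot() returns the bar midpoints
bar_mid <- barplot(sales_by_dep$x,
                   names.arg = sales_by_dep$DepPC,
                   main = "Sales per department - All",
                   xlab = "Department (DepPC)",
                   ylab = "Total quantity sold")
```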

You might ask the same question, but only for the goods that were delivered. To do this, you first subset the data for which Delivered is TRUE using the now familiar subsetting operator "[".

Note that here you assign the result to a new variable Y, which is a new dataframe that inherits the same column names from the parent dataframe Mydf. You do this for readability, to avoid repeating the subsetting instruction within the call to the plotting function.
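The subsetting step and the follow-up aggregation could be sketched as follows. As before, the dataset is rebuilt so that the block is self-contained; set.seed(), the name delivered_by_dep and the use of base R's barplot() are assumptions made for this illustration.

```r
# Rebuild the toy dataset (set.seed is an assumption for reproducibility)
set.seed(1)
Mydf <- data.frame(DepPC = c("90", "91", "92", "93", "94", "75"),
                   DProgr = 1:120,
                   Qty = c(7:31, 9:23, 99:124, 2:28, 14:19, 21:29, 4, 3, 1:9, 66),
                   Delivered = rnorm(120) > 0)

# Keep only the rows for which Delivered is TRUE
Y <- Mydf[Mydf$Delivered, ]

# Total delivered quantity per department
delivered_by_dep <- aggregate(Y$Qty, by = Y["DepPC"], FUN = sum)
print(delivered_by_dep)

# Plot the delivered sales per department
barplot(delivered_by_dep$x,
        names.arg = delivered_by_dep$DepPC,
        main = "Sales per department - Delivered",
        xlab = "Department (DepPC)",
        ylab = "Total quantity delivered")
```

Because Delivered is random, the totals change if you remove or alter the seed.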

So you can pose different questions to the data in a vectorized way with aggregate(), and you often do this in conjunction with a convenient plotting system like ggplot2. You get the gist.

Note that to get this, you only needed very few lines of code.

Vectorization as an Alternative to Loops and Apply Functions?

You have seen some variations on the same theme, which is “act on a structured set of data in a repetitive way”. In this sense, these functions can be seen not only as an alternative to loops but also as a vectorized form of doing things.

“Vectorized” is meant here in the loose sense: we won’t enter the debate about whether, and which of, the apply() functions are indeed vectorized (see for example the discussion here).
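To see the three styles side by side, the sketch below squares the numbers 1 to 5 with an explicit loop, with sapply(), and with R's built-in vectorized arithmetic. The variable names are hypothetical and the task is deliberately trivial.

```r
x <- 1:5

# 1. Explicit loop: pre-allocate the result, then fill it element by element
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- x[i]^2
}

# 2. apply-style: the loop is hidden inside sapply()
apply_result <- sapply(x, function(v) v^2)

# 3. Truly vectorized: the operator works on the whole vector at once
vector_result <- x^2

print(loop_result)    # 1 4 9 16 25
print(apply_result)   # 1 4 9 16 25
print(vector_result)  # 1 4 9 16 25
```

All three give the same values. When a vectorized operation exists, as for simple arithmetic, it is usually the shortest way to write the computation, and the apply() functions are most useful when the work to be done on each element has no such built-in form.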

In practice, in order to choose which apply() function to use, you need to consider the following:

The data type of the input: this is the object you will act upon (vector, matrix, array…, list, data frame or perhaps a combination of those)
What you intend to do: the FUN function you want to pass
The subsets of that data: rows, columns, or perhaps all?
What type of data do you want to get from the function? Because you might want to perform further operations on it (and do you want a new object, or do you want to transform the input object directly?)

These are quite general questions that you may also ask for the related functions, of which we have considered sweep() and aggregate().

But there are many more! Don’t stop exploring now! Check out DataCamp's Arrays in R tutorial.

As a follow-up to this tutorial, consider taking DataCamp’s Introduction to R or Intermediate R courses.
