A Tutorial on Loops in R - Usage and Alternatives

A tutorial on loops in R that looks at the constructs available in R for looping. Discover alternatives using R's vectorization feature.

This easy-to-follow R tutorial on loops will examine the constructs available in R for looping, when you should use these constructs, and how to make use of alternatives, such as R’s vectorization feature, to perform your looping tasks more efficiently.

The post presents a few looping examples, then criticizes them and sets them aside in favor of the most popular vectorized alternatives among the many available in R's rich set of libraries.

In general, the advice of this R tutorial on loops would be: learn about loops. They give you a detailed view of what is supposed to happen at the elementary level, and they help you understand the data that you’re manipulating.

And after you have gotten a clear understanding of loops, get rid of them.

Put your effort into learning about vectorized alternatives. It pays off in terms of efficiency.

(To practice interactively, try the chapter on loops in Datacamp's intermediate R course.)

What Are Loops?

“Looping”, “cycling”, “iterating” or just replicating instructions is an old practice that originated well before the invention of computers. It is nothing more than automating a multi-step process by organizing sequences of actions or ‘batch’ processes and by grouping the parts that need to be repeated.

All modern programming languages provide special constructs that allow for the repetition of instructions or blocks of instructions.

Broadly speaking, there are two types of these special constructs or loops in modern programming languages. Some loops execute for a prescribed number of times, as controlled by a counter or an index, incremented at each iteration cycle. These are part of the for loop family.

On the other hand, some loops are based on the onset and verification of a logical condition. The condition is tested at the start or the end of the loop construct. These variants belong to the while or repeat family of loops, respectively.

An Introduction To Loops in R

According to the R base manual, among the control flow commands, the loop constructs are for, while and repeat, with the additional clauses break and next.
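As a quick orientation, the three constructs and the two clauses can be sketched as follows. The variable names (`total`, `count`, `steps`) are illustrative only and are not taken from the manual.

```r
# for: runs once per element of the sequence
total <- 0
for (i in 1:5) {
  if (i == 3) next      # skip the rest of this iteration
  total <- total + i    # adds 1, 2, 4 and 5
}

# while: tests the condition before each pass
count <- 0
while (count < 3) {
  count <- count + 1
}

# repeat: has no condition of its own, so it needs an explicit break
steps <- 0
repeat {
  steps <- steps + 1
  if (steps >= 4) break
}

print(c(total = total, count = count, steps = steps))
```

Each construct is discussed in its own section below.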

Remember that control flow commands are the commands that enable a program to branch between alternatives, or to “take decisions”, so to speak.

You can always see these control flow commands by invoking ?Control at the RStudio command line.

Try it out for yourself:

?Control


This flow chart shows the R loop structures:



For Loops in R

The next sections will take a closer look at each of these structures that are shown in the figure above. We will start our discussion with the structure on the left, and we will continue the next sections by gradually moving to the structures on the right.

(For a video introduction to for loops and a follow up exercise, try this part of Datacamp's intermediate R course.)

For Loops Explained

This loop structure, made of the rectangular box ‘init’ (or initialization), the diamond or rhombus decision, and the rectangular box i1 is executed a known number of times.

In flowchart terms, rectangular boxes mean something like “do something which does not imply decisions”. Rhombi or diamonds, on the other hand, are called “decision symbols” and therefore translate into questions which only have two possible logical answers, namely, True (T) or False (F).

Note that, to keep things simple, other possible symbols have been omitted from the figure.

One or more instructions within the initialization rectangle are followed by the evaluation of the condition on a variable which can assume values within a specified sequence. In the figure, this is represented by the diamond: the symbols mean “does the variable v’s current value belong to the sequence seq?”.

In other words, you are testing whether v’s current value is within a specified range. You typically define this range in the initialization, with something like 1:100 to ensure that the loop starts.

If the condition is not met and the resulting outcome is False, the loop is never executed. This is indicated by the loose arrow on the right of the for loop structure. The program will then execute the first instruction found after the loop block.
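A minimal sketch of this "never executed" case is shown here. It also illustrates a common pitfall, under the assumption that the upper bound `n` is computed at run time and may be zero: `1:n` then counts down instead of being empty, whereas `seq_len(n)` is empty.

```r
# A for loop over an empty sequence never executes its body
runs <- 0
for (v in integer(0)) {
  runs <- runs + 1
}
print(runs)

# Caution: 1:0 is NOT empty, it counts down and yields c(1, 0)
n <- 0
colon_runs <- 0
for (v in 1:n) {
  colon_runs <- colon_runs + 1
}

# seq_len(n) is empty when n is 0, so the body is skipped
safe_runs <- 0
for (v in seq_len(n)) {
  safe_runs <- safe_runs + 1
}
print(c(colon_runs = colon_runs, safe_runs = safe_runs))
```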

If the condition is verified, an instruction -or block of instructions- i1 is executed. And perhaps this block of instructions is another loop. In such cases, you speak of a nested loop.

Once this is done, the condition is evaluated again. This is indicated by the lines going from i1 back to the top, immediately after the initialization box. In R, as in Python, it is possible to express this in plain English, by asking whether our variable belongs to a range of values or not.

Note that in other languages, for example in C, the condition is made more explicit with the use of a logical operator, such as greater or less than, equal to, …

Here is an example of a simple for loop:

# Create a vector filled with random normal values
u1 <- rnorm(30)
print("This loop calculates the square of the first 10 elements of vector u1")

# Initialize `usq`
usq <- 0

for(i in 1:10) {
  # i-th element of `u1` squared into `i`-th position of `usq`
  usq[i] <- u1[i]*u1[i]
  print(usq[i])
}

print(i)


The for block is contained within curly braces. The opening brace may be placed either immediately after the test condition or beneath it, and the block is preferably indented. For a block made of a single statement the braces are not compulsory, but they definitely enhance the readability of your code and allow you to spot the loop block and potential errors within it easily.

Note that the vector of the squares, usq, is initialized before the loop. R needs the object to exist before it can assign to usq[i]: if usq is not already defined in your workspace, the assignment stops with an “object not found” error, and for the same reason knitr would fail to compile the markup version of this code. For more information on knitr, go to this page.

Nesting For Loops

Now that you know that for loops can also be nested, you’re probably wondering when and why you would be using this in your code.

Well, suppose you wish to manipulate a bi-dimensional array by setting its elements to specific values.

Then you might do something like this:

# Create a 30 x 30 matrix (of 30 rows and 30 columns)
mymat <- matrix(nrow=30, ncol=30)

# For each row and for each column, assign values based on position: product of two indexes
for(i in 1:dim(mymat)[1]) {
  for(j in 1:dim(mymat)[2]) {
    mymat[i,j] = i*j
  }
}

# Just show the upper left 10x10 chunk
mymat[1:10, 1:10]


Tip: for more information on the matrix() function, visit this page.

You have two nested for loops in the code chunk above and thus two sets of curly braces, each with its own block and governed by its own index. That is, i runs over the rows and j runs over the columns.

What did you produce?

Well, you made the all too familiar multiplication table that you should know by heart.

But note that here you are limited to the first 30 integers.
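As a preview of the vectorized alternatives discussed later, the same table can be sketched without explicit loops using base R's `outer()`. This is one possible alternative, not the approach used in the example above; the names `looped` and `vectorized` are illustrative.

```r
# Multiplication table via nested loops
n <- 30
looped <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    looped[i, j] <- i * j
  }
}

# Same table in one call: outer() applies "*" to every pair (i, j)
vectorized <- outer(seq_len(n), seq_len(n))

print(all(looped == vectorized))
vectorized[1:5, 1:5]
```

Changing `n` is all it takes to lift the limit of 30 in either version.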

Tip: you can also produce a sequence of multiplication tables with the help of the array() function. Create a three-dimensional array whose ‘data’ attribute is a vector that goes from 1 to 20 and whose ‘dim’ attribute is a vector that gives a maximal index of 20 in each of the three dimensions. Try this as an exercise; one possible solution is:

# Create your three-dimensional array
my_array <- array(1:20, dim=c(20, 20, 20))

for (i in 1:dim(my_array)[1]) {
  for (j in 1:dim(my_array)[2]) {
    for (k in 1:dim(my_array)[3]) {
      my_array[i,j,k] = i*j*k
    }
  }
}

# Show a 10x10x15 chunk of your array
my_array[1:10, 1:10, 1:15]


You can also choose an integer and then produce a table according to your choice: assign the integer to one variable if the table is square, or to two variables if it is rectangular. These variables then serve as upper bounds for the indexes i and j.

The square case is shown here; this code precedes the previous loop:

# Insert your own integer here
my_int <- 4

nr <- as.integer(my_int)


Note that to prevent the user from deluging the screen with huge tables, you put a condition at the end to print the first 10 x 10 chunk, only if the user asked for an integer greater than 10. Else, an n x n chunk will be printed.

The complete code looks like this:

# Insert your own integer here
my_int <- 4

nr <- as.integer(my_int)

# Create a `n` x `n` matrix with zeroes
mymat <- matrix(0, nr, nr)

# For each row and for each column, assign values based on position
# These values are the product of two indexes
for(i in 1:dim(mymat)[1]) {
  for(j in 1:dim(mymat)[2]) {
    mymat[i,j] = i*j
  }
}

# Show the first 10x10 chunk or the first `nr` x `nr` chunk
if (nr > 10) {
  mymat[1:10, 1:10]
} else mymat[1:nr, 1:nr]


While Loops

The while loop, set in the middle of the figure above, is made of an initialization block as before, followed by a logical condition. This condition is typically expressed by the comparison between a control variable and a value, by using greater than, less than or equal to, but any expression that evaluates to a logical value, True or False, is legitimate.

If the result is False (F), the loop is never executed as indicated by the loose arrow on the right of the figure. The program will then execute the first instruction it finds after the loop block.

If it is True (T), the instruction or block of instructions i1 is executed next.

Note that an additional instruction or block of instructions i2 was added: it serves as an update for the control variable, which may alter the result of the condition at the start of the loop. A separate block is not compulsory, but something in the loop body has to change the outcome of the condition eventually (or call break), otherwise the loop never ends. You may also want to increment a counter here to keep track of the number of iterations executed. The iterations cease once the condition evaluates to false.

The format is while(cond) expr, where cond is the condition to test and expr is an expression.
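A minimal, non-interactive sketch of this format: the loop halves a value until it drops below 1 and counts the passes. The names `x` and `iterations` are illustrative.

```r
# Halve a value until it drops below 1, counting the iterations
x <- 100
iterations <- 0
while (x >= 1) {
  x <- x / 2                    # update of the control variable (the "i2" block)
  iterations <- iterations + 1  # counter to keep track of the passes
}
print(iterations)
print(x)
```

Had `x` started below 1, the condition would have been False on the first test and the body would not have run at all.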

For example, the following loop asks the user with a User Defined Function or UDF to enter the correct answer to the universe and everything question. It will then continue to do so until the user gets the answer right:

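# Your User Defined Function: prompt the user and return the typed text
# (readline() only waits for input in an interactive session)
readinteger <- function() {
  readline(prompt = "Please, enter your ANSWER: ")
}

# as.integer() returns NA (with a warning) for non-numeric input
response <- as.integer(readinteger())

# is.na() is checked first because `NA != 42` is NA, which while() rejects
while (is.na(response) || response != 42) {
  print("Sorry, the answer to whatever the question MUST be 42")
  response <- as.integer(readinteger())
}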

As a start, a user defined function collects the user’s input before the loop is entered. The loop then continues as long as the answer is not the expected 42.

The input has to be read before the loop because the while condition needs a value for ‘response’ in order to evaluate to True or False; otherwise R would complain that it does not know ‘response’ when it first tests the condition. It also means that, if you answer correctly at the first attempt, the loop is not executed at all.

Repeat Loops

The repeat loop is located at the far right of the flow chart that you find above. This loop is similar to the while loop, but it is made so that the blocks of instructions i1 and i2 are executed at least once, no matter what the result of the condition.

Borrowing from other languages, one could call this loop “repeat until” to emphasize the fact that the instructions i1 and i2 are executed as long as the condition remains False (F) or, equivalently, until it becomes True (T), at which point the loop exits; in any case, they are executed at least once.

As a variation of the previous example, you may write:

readinteger <- function() {
  readline(prompt = "Please, enter your ANSWER: ")
}

repeat {
  response <- as.integer(readinteger())
  # isTRUE() treats NA (non-numeric input) as a wrong answer instead of an error
  if (isTRUE(response == 42)) {
    print("Well done!")
    break
  } else {
    print("Sorry, the answer to whatever the question MUST be 42")
  }
}

After the now familiar input function, you have the repeat loop whose block is executed at least once and that will terminate whenever the if condition is verified.

Note that you had to set a condition within the loop upon which to exit with the clause break. This clause introduces us to the notion of exiting or interrupting cycles within loops.
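Because readline() needs an interactive session, the same repeat/break pattern can be sketched with simulated input. The `answers` vector below is hypothetical data standing in for what a user might type.

```r
# Simulated answers standing in for user input (hypothetical data)
answers <- c(7, 13, 42, 99)
attempts <- 0

repeat {
  attempts <- attempts + 1
  response <- answers[attempts]
  if (response == 42) {
    break          # leave the loop as soon as the condition is met
  }
}
print(attempts)
```

The body runs at least once, and the value 99 is never reached because break hands control to the first instruction after the loop.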

Interrupting and Exiting Loops in R

So how do you exit from a loop?

In other terms, aside from the “natural” end of the loop, which occurs either because you reached the prescribed number of iterations (for) or because you met a condition (while, repeat), can you stop or interrupt the loop?

And if yes, how?

The break statement responds to the first question: you have seen this in the last example.

Break Your Loops With break

When the R interpreter encounters a break, it will pass control to the instruction immediately after the end of the loop (if any). In the case of nested loops, break exits only the innermost loop.

Here’s an example.

This chunk of code defines an m x n matrix of zeros and then enters a nested for loop to fill the locations of the matrix, but only while the column index is smaller than the row index. The purpose is to create a (strictly) lower triangular matrix, that is, a matrix whose only non-zero elements lie below the main diagonal. The others keep their initial value of zero.

When the indexes are equal, the condition in the inner loop (which runs over the column index j) is fulfilled: a break is executed and the innermost loop is interrupted with a direct jump to the instruction following the inner loop. This instruction is a print() instruction. Then, control returns to the outer for condition (over the rows, index i), which is evaluated again.

As long as the column index is smaller than the row index, the assignment is performed and the counter is incremented by 1. At the end, the program prints the counter ctr, which contains the number of elements that were assigned.

# Make a lower triangular matrix (zeroes in upper right corner)
m=10
n=10

# A counter to count the assignment
ctr=0

# Create a 10 x 10 matrix with zeroes
mymat = matrix(0,m,n)

for(i in 1:m) {
  for(j in 1:n) {
    if(i==j) {
      break;
    } else {
      # you assign the values only while j < i
      mymat[i,j] = i*j
      ctr=ctr+1
    }
  }
  print(i*j)
}

# Print how many matrix cells were assigned
print(ctr)


Note that it might be a bit over-cautious to put curly brackets even when they’re not strictly necessary. You usually do it to make sure that whatever is opened with a {, is also closed with a }. So if you notice unmatched numbers of { or }, you know there is an error, although the opposite is not necessarily true!

The Use of next in Loops

next discontinues a particular iteration and jumps to the next cycle. In fact, it jumps to the evaluation of the condition holding the current loop.

In other languages you may find the (slightly confusing) equivalent called “continue”, which means the same: wherever you are, upon the verification of the condition, jump to the evaluation of the loop.

A simpler example of keeping the loop ongoing while discarding a particular cycle upon the occurrence of a condition is:

m=20

for (k in 1:m){
  # !k %% 2 is TRUE when k is even (remainder 0), so even numbers are skipped
  if (!k %% 2)
    next
  print(k)
}


This piece of code prints all odd numbers within the interval 1:m (here m=20). In other words, all integers except the ones with a zero remainder when divided by 2 (thus the use of the modulo operator %%), as specified by the if test, will be printed.

Numbers whose remainder is zero will not be printed, as the program jumps to the evaluation of the k in 1:m condition and ignores any instruction that might follow. In this case, print(k) is ignored.
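The test `!k %% 2` relies on operator precedence and on the coercion of numbers to logical values, which is easy to misread. The sketch below spells this out on a whole vector at once; the names `is_even` and `odd` are illustrative.

```r
k <- 1:20

# `!` binds less tightly than `%%`, so `!k %% 2` means `!(k %% 2)`:
# a remainder of 0 is coerced to FALSE and negated to TRUE for even numbers
is_even <- !k %% 2
print(k[is_even])

# A more explicit test, used here to select the odd numbers without a loop
odd <- k[k %% 2 != 0]
print(odd)
```

Writing the comparison out in full (`k %% 2 == 0`) is arguably easier to read than the negation shortcut.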

Wrapping Up The Use of Loops in R

  1. Try to put as little code as possible within the loop by taking out as many instructions as possible. Remember, anything inside the loop will be repeated several times and perhaps it is not needed.
  2. Be careful when you use repeat: make sure that a termination is explicitly set by testing a condition or you can end up in an infinite loop.
  3. It is better to use one or more function calls within the loop if a loop is getting (too) big. The function calls will make it easier for other users to follow the code. But the use of a nested for loop to perform matrix or array operations is probably a sign that you didn’t implement things in the best way for a matrix-based language like R.
  4. It is not recommended to “grow” a variable or dataset by using an assignment on every iteration. In some languages like Matlab, a warning is issued: you may continue, but you are invited to consider alternatives. A typical example will be shown in the next section.
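The last point can be sketched as follows: the same vector of squares is built by growing it on every iteration, by preallocating it once, and with no explicit loop at all. The names `grown`, `filled` and `direct` are illustrative.

```r
n <- 10000

# Growing: the vector is extended on every pass
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector is created once at full length and filled in place
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Vectorized: no explicit loop at all
direct <- seq_len(n)^2

print(isTRUE(all.equal(grown, filled)) && isTRUE(all.equal(filled, direct)))
```

All three produce the same values; wrapping each variant in system.time() is a simple way to compare their cost on your own machine.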

When To Use R Loops

All this is good and well, but when do you want to use loops in R and when not?

Every time some operation(s) has to be repeated, a loop may come in handy.

You only need to specify how many times or upon which conditions those operations need execution: you assign initial values to a control loop variable, perform the loop and then, once the loop has finished, you typically do something with the results.

But when are you supposed to use loops?

Couldn’t you just replicate the desired instruction for the sufficient number of times?

Well, a rule of thumb could be that if you need to perform an action (say) three times or more, then a loop would serve you better. It makes the code more compact, readable and maintainable, and it may save you some typing. Say you discover that a certain instruction needs to be repeated once more than initially foreseen: instead of rewriting the full instruction, you just alter the value of a variable in the test condition.

Yet, the peculiar nature of R suggests not to use loops at all(!) whenever alternatives exist.

Luckily, there are some alternatives!

R enjoys a feature that few programming languages do, which is called vectorization…

The Alternatives To Loops in R

What is Vectorization?

As the word suggests, vectorization is the operation of converting repeated operations on simple numbers (“scalars”) into single operations on vectors or matrices. You have seen several examples of this in the subsections above.

Now, a vector is the elementary data structure in R and is “a single entity consisting of a collection of things”, according to the R base manual.

So, a collection of numbers is a numeric vector.

If you combine vectors (of the same length), you obtain a matrix. You can do this vertically or horizontally, using different R instructions. Thus in R, a matrix is seen as a collection of horizontal or vertical vectors. By extension, you can vectorize repeated operations on vectors.
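A short sketch of the two ways of combining vectors, using base R's `rbind()` (vertical stacking, one vector per row) and `cbind()` (horizontal binding, one vector per column). The vectors are made-up examples.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

by_rows <- rbind(v1, v2)   # each vector becomes a row: 2 x 3 matrix
by_cols <- cbind(v1, v2)   # each vector becomes a column: 3 x 2 matrix

print(dim(by_rows))
print(dim(by_cols))

# An operation on the whole matrix replaces a loop over its elements
print(by_rows * 2)
```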

Many of the above loop constructs can be made implicit by using vectorization.

I say “implicit”, because they do not really disappear. At a lower level, the vectorized form translates into code that contains one or more loops in the lower-level language in which it was implemented and compiled (Fortran, C, or C++).

These are hidden to the user and are usually faster than the equivalent explicit R code, but unless you’re planning to implement your own R functions using one of those languages, this is totally transparent to you.

The most elementary case you can think of is the addition of two vectors v1 and v2 into a vector v3, which can be done either element-by-element with a for loop:

v1 <- c(2, 3, 5)
v2 <- c(4, 5, 6)
v3 <- c(0, 0, 0)
n <- 3

for (i in 1:n) {
  v3[i] <- v1[i] + v2[i]
}
v3


Or you can also use the “native” vectorized form:

v1 <- c(2, 3, 5)
v2 <- c(4, 5, 6)

v3 = v1 + v2
v3


Note that you say “native” because R can recognize all the arithmetic operators as acting on vectors and matrices.

Similarly, for two matrices A and B, instead of adding the elements of A[i,j] and B[i,j] in corresponding positions, for which you need to take care of two indexes i and j, you tell R to do the following:

A = matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)

C = A + B
C


And this is very simple indeed!

Vectorization in R Explained

Why would vectorization run faster, given that the number of elementary operations is seemingly the same?

This is best explained by looking at the internal nuts and bolts of R, which would demand a separate post, but succinctly: in the first place, R is an interpreted language and as such, all the details about variable definition are taken care of by the interpreter. You do not need to specify that a number is a floating point or allocate memory using a pointer, for example.

The R interpreter “understands” these issues from the context as you enter your commands, but it does so on a command-by-command basis. It will therefore need to deal with such definitions every time you issue a given command, even if you just repeat it.

A compiler, instead, resolves all the definitions and declarations at compilation time over the entire code; the code is translated into compact and optimized binary code before you try to execute anything. Since many of R’s built-in functions are written in one of these compiled, lower-level languages, they are more efficient.

In practice, if one looked at the low level code, one would discover calls to C or C++, usually implemented within what is called a wrapper code.

Secondly, in languages supporting vectorization (like R or Matlab), every instruction making use of a numeric datum, acts on an object which is natively defined as a vector, even if only made of one element. This is the default when you define, for example, a single numeric variable: its inner representation in R will always be a vector, despite it being made of one number only.
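This is easy to check at the console. In the sketch below, `x` and `y` are arbitrary example values.

```r
x <- 3.14

print(is.vector(x))   # a single number is still a vector
print(length(x))      # of length one

# Because x is a vector, the same operation works unchanged on longer input
y <- c(1, 2, 3)
print(sqrt(x))
print(sqrt(y))
```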

The loops continue to exist under the hood, but at the lower and much faster C/C++ compiled level. The advantage of having a vector is that the definitions are resolved by the interpreter only once, on the entire vector, irrespective of its size, in contrast to a loop performed on a scalar, where definitions, allocations, …, need to be done on an element-by-element basis, and this is slower.

Finally, dealing with the native vector format makes it possible to use very efficient linear algebra routines (like BLAS, the Basic Linear Algebra Subprograms), so that when executing vectorized instructions, R leverages these efficient numerical routines. So the message would be: if possible, process whole data structures within a single function call to avoid all the copying operations that would otherwise be executed.

But enough with digressions, let’s make a general example of vectorization, and then in the next subsection, you’ll dive into more specific and popular R vectorized functions to substitute loops.

An Example of Vectorization in R

Let’s go back to the notion of “growing data”, a typically inefficient way of “updating” a data frame.

First you create an m x n matrix with replicate(m, rnorm(n)) with m=10 column vectors of n=10 elements each, constructed with rnorm(n), which creates random normal numbers.

Then you transform it into a data frame (thus 10 observations of 10 variables) and perform an algebraic operation on each element using a nested for loop: at each iteration, a sinusoidal function increments every element that is referred by the two indexes.

The following example is a bit artificial, but it could represent the addition of a signal to some random noise:

# This is a bad loop with 'growing' data
set.seed(42)
m=10
n=10

# Create matrix of normal random numbers
mymat <- replicate(m, rnorm(n))

# Transform into data frame
mydframe <- data.frame(mymat)

for (i in 1:m) {
  for (j in 1:n) {
    mydframe[i,j]<-mydframe[i,j] + 10*sin(0.75*pi)
    print(mydframe)
  }
}


Here, most of the execution time consists of copying and managing the loop.

Let’s see what a vectorized solution looks like:

set.seed(42)
m=10
n=10
mymat <- replicate(m, rnorm(n))
mydframe <- data.frame(mymat)
mydframe <- mydframe + 10*sin(0.75*pi)
mydframe


This looks simpler: the last line takes the place of the nested for loop. Note the use of set.seed() to ensure that the two implementations give exactly the same result.

Let’s now quantify the execution time for the two solutions.

You can do this by using R system commands, like system.time() to which a chunk of code can be passed like this:

Tip: just put the code you want to evaluate in between the parentheses of the system.time() function.

set.seed(42)
m=10
n=10
mymat <- replicate(m, rnorm(n))
mydframe <- data.frame(mymat)

# Use `system.time()` to measure loop execution
system.time(
  for (i in 1:m) {
    for (j in 1:n) {
      mydframe[i,j] <- mydframe[i,j] + 10*sin(0.75*pi)
    }
  }
)

# Use `system.time()` to measure vectorized execution
system.time(
  mydframe <- mydframe + 10*sin(0.75*pi)
)


In the code chunk above, you choose m and n, create the matrix and transform it into a data frame only once at the start, and then evaluate the for chunk against the “one-liner” of the vectorized version with two separate calls to system.time().

You see that already with a minimal setting of m=n=10 the vectorized version is 7 times faster, although for such low values, it is barely important for the user.

Differences become noticeable (at the human scale) if you put m=n=100, whereas increasing to 1000 makes the for loop appear to hang for several tens of seconds, while the vectorized form still runs in the blink of an eye.

For m=n=10000 the for loop hangs for more than a minute while the vectorized version requires 2.54 sec. Of course, these measures should be taken lightly: they depend on the hardware and software configuration, and you should avoid overloading your laptop with a few dozen open browser tabs and several applications running in the background while timing. Still, these measures are illustrative of the differences.

In fairness, the increase of m and n severely affects also the matrix generation as you can easily see by placing another system.time() call around the replication function.

You are invited to play around with m and n to see how the execution time changes, by plotting the execution time as a function of the product m x n. This is the relevant indicator, because it expresses the dimension of the matrices created. It thus also quantifies the number of iterations necessary to complete the task via the nested for loop.
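One way to set up such an experiment is sketched here, for the square case m = n. The helper `time_both()` and the chosen `sizes` are hypothetical, the sizes are kept small so the sketch finishes quickly, and the timings you obtain will depend on your machine.

```r
# Hypothetical helper: time the nested loop and the vectorized form
# for an n x n data frame of random normal numbers
time_both <- function(n) {
  set.seed(42)
  df <- data.frame(replicate(n, rnorm(n)))
  shift <- 10 * sin(0.75 * pi)

  loop_time <- system.time({
    looped <- df
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        looped[i, j] <- looped[i, j] + shift
      }
    }
  })["elapsed"]

  vec_time <- system.time(vectorized <- df + shift)["elapsed"]

  c(cells = n * n, loop = unname(loop_time), vectorized = unname(vec_time))
}

sizes <- c(10, 20, 40)          # kept small so the sketch runs quickly
timings <- t(sapply(sizes, time_both))
print(timings)

# Uncomment to plot loop time against the number of cells
# plot(timings[, "cells"], timings[, "loop"], type = "b",
#      xlab = "m x n", ylab = "elapsed seconds")
```

For very small sizes the elapsed times may print as zero; larger values of `sizes` give more meaningful numbers at the cost of a longer run.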

So this is an example of vectorization. But there are many others. In R News, a newsletter for the R project at page 46, there are very efficient functions for calculating sums and means for certain dimensions in arrays or matrices, like: rowSums(), colSums(), rowMeans(), and colMeans().

Furthermore, the newsletter also mentions that “…the functions in the ‘apply’ family, named [s,l,m,t]apply, are provided to apply another function to the elements/dimensions of objects. These ‘apply’ functions provide a compact syntax for sometimes rather complex tasks that is more readable and faster than poorly written loops.”.
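As a small illustration of these functions, the sketch below computes row totals three ways (an explicit loop, `rowSums()`, and `apply()`) and checks that they agree. The matrix is random example data.

```r
set.seed(1)
m <- matrix(rnorm(12), nrow = 3, ncol = 4)

# Explicit loop over the rows
loop_sums <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_sums[i] <- sum(m[i, ])
}

# Dedicated vectorized functions
row_totals <- rowSums(m)
col_means  <- colMeans(m)

# apply() gives the same row totals with a more general syntax
apply_sums <- apply(m, 1, sum)

print(all.equal(loop_sums, row_totals))
print(all.equal(row_totals, apply_sums))
```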

The apply family: just hidden loops?

The very rich and powerful family of apply functions is made of intrinsically vectorized functions. If at first sight these do not appear to contain any loop, this feature does become manifest when you carefully look under the hood.

The apply command, or rather family of commands, belongs to the R base package. It comprises a number of functions (the [s,l,m,r,t,v]apply) that manipulate slices of data in the form of matrices or arrays in a repetitive way, letting you traverse the data without explicit loop constructs. The functions act on an input matrix or array and apply a chosen named function with one or several optional arguments.

That is why they belong to the so-called ‘functionals’ (as described in Hadley Wickham’s Advanced R).

The called function could be an aggregating function, like a simple mean, or another transforming or sub-setting function.

Note that using these functions does not necessarily lead to faster execution; the differences are not huge. Rather, they spare you from coding cumbersome loops and so reduce the chance of errors.
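To make the "hidden loop" idea concrete, here is a deliberately simplified, hypothetical stand-in for apply(M, 1, fun). It is not the actual source of apply(); the name `row_apply()` is invented, and the sketch only handles functions that return a single value per row.

```r
# Hypothetical, simplified stand-in for apply(M, 1, fun):
# an explicit loop over the rows that stores one result per row.
row_apply <- function(M, fun, ...) {
  out <- vector("list", nrow(M))   # preallocate instead of growing
  for (i in seq_len(nrow(M))) {
    out[[i]] <- fun(M[i, ], ...)
  }
  unlist(out)
}

M <- matrix(1:12, nrow = 3)

looped  <- row_apply(M, sum)   # 22 26 30
applied <- apply(M, 1, sum)    # 22 26 30

stopifnot(identical(looped, applied))
```

The call to apply() is shorter and harder to get wrong, which is the main benefit; it is not doing anything fundamentally different from the loop.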

The functions within the family are: apply(), sapply(), lapply(), mapply(), rapply(), tapply(), vapply().

But when and how should you use these?

Well, it is worth noting that a package like plyr covers the functionality of the family. Although remembering them all without resorting to the official documentation might feel difficult, these functions still form the basis of more complex and useful combinations.

The first three are the most frequently used:

  1. apply()

You want to apply a given function to the rows (index “1”) or columns (index “2”) of a matrix. See this page for more information.

  2. lapply()

You want to apply a given function to every element of a list and obtain a list as a result (which explains the “l” in the function name). Read up on this function here.

  3. sapply()

You want to apply a given function to every element of a list but you wish to obtain a vector rather than a list. Do you need to know more? Visit this page.
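The difference between lapply() and sapply() is easiest to see side by side. The sketch below uses a made-up list called `scores`; it also shows vapply() from the family list above, which works like sapply() but makes you declare the type and length of each result.

```r
# A made-up list whose elements have different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply(): one result per element, returned as a list
means_list <- lapply(scores, mean)

# sapply(): same computation, simplified to a named numeric vector
means_vec <- sapply(scores, mean)

# vapply(): like sapply(), but each result must be a single number here
means_checked <- vapply(scores, mean, numeric(1))

str(means_list)
print(means_vec)   # a = 2, b = 15, c = 5
```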

Related functions such as sweep(), by() and aggregate() are occasionally used in conjunction with the members of the apply() family.
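Here is a brief, hedged sketch of what these three related functions do. The matrix `M` and the data frame `df` are invented for illustration.

```r
# sweep(): combine a matrix with a summary statistic along a margin.
# Here the column means are subtracted from each column ("-" is the default).
M <- matrix(1:6, nrow = 2)
centered <- sweep(M, 2, colMeans(M))

# A made-up data frame with a grouping column
df <- data.frame(group = c("a", "a", "b"),
                 value = c(1, 3, 10),
                 stringsAsFactors = FALSE)

# aggregate(): group-wise summary, returned as a data frame
agg <- aggregate(value ~ group, data = df, FUN = mean)

# by(): apply a function to the subset of rows belonging to each group
by_means <- by(df, df$group, function(d) mean(d$value))

print(centered)
print(agg)
print(by_means)
```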

I limit the discussion of this blog post to the apply() function (a more extended discussion on this topic might be the focus of a future post).

Given a matrix M, the call apply(M, 1, fun) applies the specified function fun to the rows of M, and the call apply(M, 2, fun) applies it to the columns of M. This numeric argument is called the “margin”, and here it is limited to the values 1 and 2 because the function operates on a matrix. However, you could have an array with up to 8 dimensions instead.
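To see how the margin extends beyond 1 and 2, consider this small sketch on a made-up three-dimensional array `A`. The margin can then be 3, or a vector of several dimensions to keep.

```r
# A 2 x 3 x 4 array filled with the numbers 1 to 24
A <- array(1:24, dim = c(2, 3, 4))

# Margin 3: one sum per 2 x 3 slice along the third dimension
slice_sums <- apply(A, 3, sum)        # 21 57 93 129

# Margins 1 and 2 together: keep rows and columns, sum over the third dimension
cell_sums <- apply(A, c(1, 2), sum)   # a 2 x 3 matrix

print(slice_sums)
print(cell_sums)
```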

The function may be any valid R function, including a User Defined Function (UDF); you can even define it inline within the apply() call, which is handy.

apply(): an example

    # Define matrix `mymat` by replicating the sequence `1:5` 4 times and reshaping it into a matrix with 5 columns
    mymat <- matrix(rep(seq(5), 4), ncol = 5)

    # Sum of `mymat` over the rows
    apply(mymat, 1, sum)

    # Sum of `mymat` over the columns
    apply(mymat, 2, sum)

    # User-defined function inside apply() that adds a number `y` to the sum of each row
    # `y` is set to `4.5`
    apply(mymat, 1, function(x, y) sum(x) + y, y = 4.5)

    # Or produce a summary for each column
    apply(mymat, 2, function(x) summary(x))


You will often work with data frames. In this case, you must ensure that all the data have the same type; otherwise, forced data type conversions may occur, which is probably not what you want. For example, in a data frame that mixes text and numbers, the numeric data will be converted to character strings.
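The following sketch shows that conversion on a made-up data frame `df`. It relies on the fact that apply() first turns a data frame into a matrix, and a matrix can hold only one data type.

```r
# A made-up data frame mixing one text column with two numeric columns
df <- data.frame(name = c("a", "b"),
                 x = c(1, 2),
                 y = c(3.5, 4.5),
                 stringsAsFactors = FALSE)

# The matrix that apply() works on is entirely character
coerced <- as.matrix(df)
is.character(coerced)

# So every row reaches the function as a character vector
row_classes <- apply(df, 1, function(row) class(row))

# Selecting only the numeric columns avoids the conversion
row_sums <- apply(df[, c("x", "y")], 1, sum)

# Column-wise work on a data frame is safer with sapply(),
# which visits each column separately and keeps its type
col_classes <- sapply(df, class)

print(row_classes)
print(row_sums)
print(col_classes)
```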

Final Considerations on the Use of Loops in R and Their Alternatives

So now, this journey brought us from the fundamental loop constructs used in programming to the (basic) notion of vectorization and to an example of the use of one of the apply() family of functions, which come up frequently in R.

In terms of code flow control, you dealt only with loops: for, while, repeat and the way to interrupt and exit these.

As the last subsections hint that loops in R should be avoided, you may ask why on earth you should learn about them.

Now, in my opinion, you should learn these programming structures because:

  1. It is likely that R will not be your only language in data science or elsewhere, and grasping general constructs like loops is a useful thing to put in your own skills bag. The loop syntax may vary depending on the language, but once you master those in one, you’ll readily apply them to any other language you come across.
  2. The R universe is huge and it is very difficult, if not impossible, to be aware of all existing R functions. There are many ways to do things, some more efficient or elegant than others, and your learning curve will be incremental. When you use loops, you will start asking yourself about more efficient ways to do things and you will eventually land on functions that you have never heard of before. By the time you read this, a few new libraries will be developed, others released, and so the chance of knowing them all is slim, unless you spend your entire time as an R specialist.
  3. Finally, at least when not dealing with particularly highly dimensional datasets, a loop solution would be easy to code and read. And perhaps, as a data scientist, you may be asked to prototype a one-off job that just works. Perhaps, you are not interested in sophisticated or elegant ways of getting a result. Perhaps, in the data analysis workflow, you just need to show domain expertise and concentrate on the content. Somebody else, most likely a back-end specialist, will take care of the production phase (and perhaps he or she might be coding it in a different language!)

If for loops in R prove no challenge to you anymore after reading this tutorial, you might consider taking our Intermediate R - Practice course. This course will strengthen your knowledge of the topics in Intermediate R with a bunch of new and fun exercises. If, however, loops hold no secrets for you any longer, our Writing Functions in R course, taught by Hadley and Charlotte Wickham could interest you.