A Tutorial on Using Functions in R!

August 20th, 2015 in R Programming

In a previous post, we covered part of the R language control flow: the cycles, or loop structures. In a subsequent one, we showed how to avoid 'looping' by means of functions that act on compound data in repetitive ways (the apply family of functions). Here, we introduce the notion of a function from the R programmer's point of view and illustrate the range of action that functions have within R code (their 'scope').

(To practice, try DataCamp's Writing Functions in R course.)

What is a function?

In programming, we use functions to package sets of instructions that we want to use repeatedly or that, because of their complexity, are better kept self-contained in a subprogram and called when needed. A function is a piece of code written to carry out a specified task; it may or may not accept arguments (parameters), and it may or may not return one or more values. How generic is that! In fact, there are several possible formal definitions of 'function', spanning from mathematics to computer science. Generically, a function's arguments constitute its input and its return values its output. Here we'll use a simple definition that drops the mathematical restriction that each input is related to exactly one output. In fact, we will see that there are functions that operate on some (i.e. not all) of the input values, perhaps giving multiple results, depending on how they are internally constructed.

Functions in R

There exist a number of terms for these constructs (functions, subroutines, procedures, methods, etc.), but for the purposes of this post we will ignore the distinctions, which are often semantic and reminiscent of older programming languages. We'll refer to each of those constructs generically as 'functions', especially because in R we just have…functions! (For the horrified reader, here's a link: semantics.) In R, according to the base docs, you define a function with the construct:

function ( arglist )  {body}

where the code in between the curly braces is the body of the function. Note that when using built-in functions, the only things you need to worry about are how to communicate the correct input arguments (arglist) and how to manage the return value(s), if any.
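As a small illustration of this construct, the sketch below builds a function and inspects its two parts with the base functions formals() and body(). The name add_two is hypothetical and only serves this example.

```r
# a hypothetical function, used only to illustrate arglist and body
add_two <- function(x, y = 2) {
  x + y
}

arg_names <- names(formals(add_two))  # the names listed in the arglist
fun_body  <- body(add_two)            # the code between the curly braces
result    <- add_two(1)               # y falls back to its default value

print(arg_names)
print(fun_body)
print(result)
```

Calling add_two(1) works with a single argument because the arglist gives y a default value.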

What are the most popular functions in R?

Now, given the enormous number of functions and libraries in R, how do we orient ourselves to decide which ones to learn and master? And because many functions appear in distinct packages (libraries), shouldn't we also know which libraries to use? Resorting to data science, we see that somebody has already considered this: by ranking functions based on download tracking (although not very up to date); by listing functions per category; by creating and listing cheat sheets; and, finally (and let's stop here!), by devising Google-like algorithms that use the dependencies among packages to find out which are the most important ones. So up to this point, we've only learned that there are a lot of R functions organized in a multitude of packages, and that the hardest job is to correctly determine which parameters to pass (the arguments, or args) and how to handle their return values. So, the best way to learn more about the inner workings of functions is to write our own.

User Defined Functions (UDF)

Whether because we need to accomplish a particular task and are not aware that a dedicated function or library already exists, or because in the time spent googling for an existing solution we could have come up with our own (if not too complicated), at some point we will find ourselves typing something like:

function.name <- function(arguments)
{
  # computations on the arguments
  # some other code
}

So, a function has a name (with exceptions, see anonymous functions); some arguments used as input, listed within the () following the keyword 'function'; a body, which is the code within the curly braces {}, where we carry out the computation; and it may have one or more return values (the output). We define the function much like a variable, by “assigning” the directive function(arguments), followed by the body, to the variable function.name. As a little note, a decent way to make sure that the name we choose does not clash with something R already knows (for example the name of an existing function!) is to use the help system. If entering ?OurFunctionName returns some information, then it is better not to use that name. [Still, it is possible, although generally not recommended, to reuse a name (homonymy), provided we know how to hide one function from the other.] Once the function is defined, we call it (that is, we use it) somewhere else in the code. The following code defines a function that computes the square of its argument and then calls it after assigning a value for the argument:

# define a simple function
myFirstFun<-function(n)
{
  n*n  #  compute the square of integer n
}
# define a value
k<-10
# call the function with that value
m<-myFirstFun(k)
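As an aside to the naming advice above, the check can also be done programmatically. The following is a minimal sketch using the base function exists() with mode = "function"; the second candidate name is a made-up one, assumed not to be defined anywhere.

```r
# 'mean' is already a function in base R, so this name is better avoided
mean_taken <- exists("mean", mode = "function")

# a deliberately unlikely (hypothetical) name, assumed to be free
odd_name_taken <- exists("myVeryUnlikelyFunName123", mode = "function")

print(mean_taken)
print(odd_name_taken)

# the function from the text, defined and called as before
myFirstFun <- function(n) {
  n * n
}
m <- myFirstFun(10)
print(m)
```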

Points to note

A few comments are necessary to illustrate its working:

  • We first define the function as a variable, myFirstFun, using the keyword function, which also receives n as an argument (with no type specification). The latter exists only within the function. We used a single number, but 'n' could also be a numeric vector or a matrix: R handles these nicely for us, squaring element by element
  • In our snippet, when we call the function, we assign its result to a variable m. This is not necessary per se, because at the console R prints the value of the last evaluation, but we do this for clarity and because we may want to re-use the result later [if we don't, R will have forgotten this evaluation by the time the next command is run]
  • When we call the function, we may use an arbitrary variable, here k, to which we assign a numeric value. We do this to illustrate that the variable does not need to have the same name as the argument (nor the same type, as we'll see later), because it is a different object
  • We could have used the same name, n; note, however, that this n is not the same n used within the function body. In fact, if we do:

n<-12
m<-myFirstFun(n)

then printing the three variables:

print(k)
print(m)
print(n)

shows that k and n keep their initially defined values. In fact, had we not defined the variable n before the last call, R would have thrown an error, like this (before trying, remember to clean the workspace: if you work in RStudio, click the brush in the environment window, or uncomment the first line in the following snippet, or else R will remember the previous values):

# rm(list=ls())  # clean the workspace, uncomment to show the error
myFirstFun<-function(n)
{
  n*n  #  compute the square of integer n
}
# call the function with argument n
u<-myFirstFun(n)

[Alternatively, to remove specific elements from the workspace, you may use the function rm(x,y,z…) to remove the objects x,y,z from the environment. These may be variables, datasets and also functions.] So you get “Error in myFirstFun(n) : object 'n' not found”. For that matter, any other previously undefined variable would cause the same error.

R does not check argument types when a function is defined or called, and it evaluates arguments lazily, that is, only when they are needed at execution. So if we passed a character, k='a', we would get the error: Error in n * n : non-numeric argument to binary operator

It also means that, had we defined a second argument without passing a value for it, R would complain only when necessary, i.e. where a reference to it is made without a value being provided (more on this in the section about arguments).
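Both behaviors can be observed without stopping the script by catching the errors with the base functions tryCatch() and conditionMessage(). This is only a sketch: the helper functions ignoreSecond and useSecond are hypothetical names introduced for the illustration.

```r
myFirstFun <- function(n) {
  n * n
}

# passing a character: the error is raised only when n * n is evaluated
msg <- tryCatch(
  myFirstFun("a"),
  error = function(e) conditionMessage(e)
)
print(msg)

# hypothetical function whose second argument is never used in the body
ignoreSecond <- function(n, unused) {
  n * n
}
ok <- ignoreSecond(3)  # no complaint: 'unused' is never evaluated
print(ok)

# hypothetical function that does use its second argument
useSecond <- function(n, needed) {
  n * needed
}
msg2 <- tryCatch(
  useSecond(3),
  error = function(e) conditionMessage(e)
)
print(msg2)
```

The second call succeeds even though a value is missing, while the third fails at the point where the missing argument is actually needed.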

So we have seen a first example of scoping, that is, the visibility of variables.

[Figure: Functions in R - Scoping]

As shown in the figure above, a very important feature of functions is that the variables used within them are local, i.e. their scope lies within, and is limited to, the function itself, and they are therefore invisible outside the function body. Clearly, functions need a way to communicate with the external world, typically the piece of code that calls them, by means of one or more arguments (the 'input') and one or more values that the function returns to the caller (the 'output'). In our example, the function return value is stored in the variable m. Note that because all the objects within the function are local, they will not show up in your workspace. To make a value available outside the function body, the function has to return it, most explicitly with return() (see other examples below).

Thus environments in R are nested. They are organized as a tree structure, which reflects the way R operates when it encounters a symbol. It works bottom up: when a symbol is not found in the current function environment, R looks in the next level up, and so on up to the global environment. Eventually, if the symbol is not found, R gives an error. This matters when trying to inspect a variable defined within a function, for example when debugging: if a symbol with the same name exists in the script environment, that one is displayed; however, it is NOT the variable within the function, which remains invisible to the RStudio environment. So in order to inspect a variable within a function, a print statement may help (see the section on arguments for an example).
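The lookup rule can be sketched with a few lines of code. The names scale_factor, scopeDemo and local_sq are hypothetical, and the last check assumes that no object called local_sq already exists in your workspace.

```r
scale_factor <- 10   # defined in the calling (global) environment

scopeDemo <- function(n) {
  local_sq <- n * n          # local: exists only while the function runs
  local_sq * scale_factor    # not found locally, so R looks one level up
}

res <- scopeDemo(3)
print(res)

# the local variable is not visible from outside the function
visible_outside <- exists("local_sq")
print(visible_outside)
```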

How can you see your R function in RStudio?

While you are developing your function, you can see it in the RStudio environment. An easy way to display its code is to type its name without the parentheses (). When you exit RStudio without closing the function script file, and you saved your environment upon exit, you'll find it again in your workspace, among the script files that were open when you exited. However, during the development of a slightly larger project, it is very likely that you wrote your function as an R script and saved it somewhere.

Calling R functions defined in other scripts

Perhaps you planned a library of utility functions and wish to call one or more of these from another script you are developing. How does this work? First, note the simple way a function is loaded and executed in R. This might not be visible in the RStudio console, but it is in any R console. If the function code snippet myFirstFun seen above was saved into an R script file, say myIndepFun.R, you can load the function with the command source():

source("myIndepFun.R")

This command also works from within a script. However, you may want a specific function (say our myFirstFun) that lives in a script file called (say) MyUtils.R, which contains other utility functions as well. In this case, you can first ask whether the function is already defined, with a call to the function exists() (explicitly asking for a function), and source the file only if it is not:

if(!exists("myFirstFun", mode = "function"))
    source("MyUtils.R")

If you are unsuccessful (perhaps you misspelled or forgot the name you gave your file), you can use sapply (see the apply family of functions) together with list.files() to retrieve the filenames with extension .R, with their full path, from your directory, say “R/MyFiles/”, and of course load them:

sapply(list.files(pattern="[.]R$", path="R/MyFiles/", full.names=TRUE), source);
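To try the sourcing mechanism without creating files by hand, here is a self-contained sketch that writes a tiny utility script to a temporary directory and sources it only when the function is not yet defined. The file name MyUtilsDemo.R and the function demoSquare are hypothetical, made up for this example.

```r
# write a small utility file to a temporary location
util_file <- file.path(tempdir(), "MyUtilsDemo.R")
writeLines(c(
  "demoSquare <- function(n) {",
  "  n * n",
  "}"
), util_file)

# source the file only if the function is not already defined
if (!exists("demoSquare", mode = "function")) {
  source(util_file)
}

sq <- demoSquare(4)
print(sq)
```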

More examples: return and nested function calls in R

A call to return() is not required in a function, since R returns the value of the last evaluated expression; however, it is advisable to use it when the function performs several computations or when we want to make explicit which value (not the object that contains it!) is passed back outside of the function body. Note that, as the name says, it has the effect of ending the function execution and returning control to the code that called it. Now consider the arguments: these can be of any type and can have default values (so the function provides an output even when explicit values are not passed to it). Finally, we can call another function within a function. Let's see these points in detail through the following examples. First we define a vector v and a matrix M that we will use in the following:

# define a numeric vector v of 5 elements
v<-c(1, 3, 0.2, 1.5, 1.7)
# define a matrix M
M<-cbind( c(0.2, 0.9, 1), c(1.0, 5.1, 1), c(6, 0.2, 1), c(2.0, 9, 1))

Then we show an example of a function calling the first function we made above. Note that we may pass only one argument in the call, even though the function was defined with two arguments: the second one, M, is never used in the body, so R never asks for it. This time we also use return():

### passing only 1 argument, nested call and return
mySecFun<-function(v,M)
{
  # compute the square of each element of v into u
  u<-numeric(length(v))  # result vector, same length as v
  for(i in seq_along(v))
    {
      u[i]<-myFirstFun(v[i])  # call our first function
    }
  return(u)
}

Sqv<-mySecFun(v)
Sqv

If we forget the latter, like this:

### passing only 1 argument, nested call and no return: output inaccessible
mySecFun<-function(v,M)
{
  # compute the square of each element of v into u
  u<-numeric(length(v))  # result vector, same length as v
  for(i in seq_along(v))
  {
    u[i]<-myFirstFun(v[i]) # call our first function
  }
}

Sqv<-mySecFun(v)
Sqv

We will never be able to access the output. In fact, as shown by the last command, the output is NULL: the last expression evaluated in the body is the for loop, whose value is NULL. So even if the values returned by the inner function fill the vector u, the latter remains confined within the second function, which does not return it!
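The difference comes down to what the last evaluated expression of the body is. A minimal sketch, with the hypothetical names noValue and implicitValue, compares a body that ends with the loop to one that ends with the vector itself (no return() call needed):

```r
myFirstFun <- function(n) {
  n * n
}
v <- c(1, 3, 0.2, 1.5, 1.7)

# the for loop is the last expression, so the function value is NULL
noValue <- function(v) {
  u <- numeric(length(v))
  for (i in seq_along(v)) {
    u[i] <- myFirstFun(v[i])
  }
}

# u is the last expression, so it is returned implicitly
implicitValue <- function(v) {
  u <- numeric(length(v))
  for (i in seq_along(v)) {
    u[i] <- myFirstFun(v[i])
  }
  u
}

r1 <- noValue(v)
r2 <- implicitValue(v)
print(is.null(r1))
print(r2)
```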

Arguments and their default

We have seen that function arguments are specified within the (). Let's see a sequence of examples that compute some power of a value n passed as an argument, with a few variations on argument management:

# we define the function and specify the exponent, second argument directly
MyFourthFun <- function(n, y = 2) # sets default of exponent to 2 (we just square)
{
  n^y  #  compute the power of n to the y
}
MyFourthFun(2,3) # specify both args
MyFourthFun(2)   # or just first
# MyFourthFun()    # or none: error!

In this case, we see that if we specify both arguments, the function just computes 2^3 = 8. When we pass only the first, our n, the function uses the default y=2 to carry out the computation. If we omit the arguments, R throws an error (to see this, uncomment the line).

Here we specify the second argument, our exponent, as a vector of values, so as to compute the powers of the given n with exponents less than or equal to 1:

#  with variable exponent from 0.05 to 1 in steps of 0.01
MyFourthFun <- function(n, y = seq(0.05, 1, by = 0.01))
{
  n^y  #  compute the power of n to the y
}
MyFourthFun(2,3) # as before
MyFourthFun(2)   # computes ALL possible according to given default
# MyFourthFun()    # or none: error!

Here, specifying just n (2 in the snippet) causes the function to compute ALL the powers according to the vector of exponents specified. The following is equivalent: here we did not set a default in the argument list as above, but check whether the argument was supplied with an if test via the function missing():

# equivalent alternative:
MyFourthFun <- function(n, y)
{
  if(missing(y))
  {
    y <- seq(0.05, 1, by = 0.01)
  }
  return(n^y)
}
MyFourthFun(2,3)
MyFourthFun(2)   # computes ALL possible according to given default
# MyFourthFun()    # or none: error!

Ok, but we can do better and use the default list as a checker for the user input, that is to validate the input:

# we wish to check that y is within the list: if yes, compute the power, else print a message
MyFourthFun <- function(n, y)
{
  # print(n); uncomment to check passed values
  # print(y);  "            "     "
  if(missing(n)) n=2;
  if(missing(y)) y=0.05;
  if(!y %in% seq(0.05, 1, by = 0.02)) print("value must be in the allowed list (and <= 1)")
  else return(n^y)
}
MyFourthFun(2,0.07); # will carry out the calculation
MyFourthFun(2,3);   # will print our error, because y is not in the allowed list
MyFourthFun(2);    # passes n and not y, uses y default
MyFourthFun();     # none: both n and y defaults are used

We just added two prints to check the input values passed to the function (because we won't be able to see these from our workspace, you can uncomment them to do your checks). The first call does what is expected; the second does not, and complains that the exponent is not in the list; the third uses the default exponent; and the fourth uses both defaults. There are many possible variations on this theme, but you have got the spirit of it!
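One such variation, sketched here under our own assumptions rather than taken from the examples above, signals invalid input with stop() instead of print(), so that the caller can catch the problem with tryCatch(). Note that this sketch swaps the list membership test for a simple range check, and that the name powChecked is hypothetical.

```r
# hypothetical variant: defaults in the arglist, range check, error via stop()
powChecked <- function(n = 2, y = 0.05) {
  if (!is.numeric(y) || length(y) != 1 || y <= 0 || y > 1) {
    stop("y must be a single number in (0, 1]")
  }
  n^y
}

good <- powChecked(2, 0.5)   # valid exponent: computes 2^0.5
bad  <- tryCatch(
  powChecked(2, 3),          # invalid exponent: the error is caught
  error = function(e) conditionMessage(e)
)
both_defaults <- powChecked()  # uses n = 2 and y = 0.05

print(good)
print(bad)
print(both_defaults)
```

Unlike print(), stop() interrupts the function, so an invalid exponent can never silently produce a value.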

Anonymous functions in R

When you don't give a name to a function, you are creating an anonymous function. How is this possible? In R a function (any object, in fact) can be evaluated without assigning it or its result to a named variable (we already noted this above, see the second note after the first example), and this applies to any standard R function as well. The syntax is slightly different from the ordinary UDF seen above, because now you have a different use of parentheses:

  • First, you employ () as usual immediately after the keyword 'function', to specify the argument, in the example 'x'
  • Second, a pair of ( ) encloses the function(x) declaration and body
  • Third, after the previous construct you specify, within another pair of (), the argument passed in the call

It works like this:

# Anonymous function syntax
(function(x) x * 10)(10)

Its normal equivalent looks like this:

# equivalent (normal) way
fun<-function(x) x * 10
fun(10)

Why or when would you use an anonymous function? As the syntax above indicates, you are doing everything in one shot: the declaration and the call in a one-line statement. So, despite not being transparent to read, it is self-contained, and you use it because you don't want to define yet another function somewhere else in your current script (or in an external script): you are dealing with a simple calculation at the moment the need arises, and you probably will not use it anywhere else in your code, so it is not worth naming and remembering.
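A typical place where this pays off is inside a call to one of the apply family of functions. The following minimal sketch passes the same anonymous function to sapply(); the vector v is the one defined earlier, repeated here so the snippet runs on its own.

```r
v <- c(1, 3, 0.2, 1.5, 1.7)

# the anonymous function is defined right where it is needed
scaled <- sapply(v, function(x) x * 10)
print(scaled)
```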

Functions and functional programming in R

How could we end this post without mentioning the important fact that R is a functional programming language? Yes, you read that right, though people usually associate the 'functional' attribute with trendy languages like Scala. Here is a link to Hadley Wickham's authoritative post on R, and in his words: “You can do anything with functions that you can do with vectors: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.” A very interesting bit of this reading is the concept of closures. These are functions written by functions, and their main use lies in the accessibility of the environment. As we have seen above, a potentially tricky matter is whether variables are still visible once a function has finished its job. A closure is made of a function and its environment (thus the data), and makes it possible to access the environment of the function that created it as well; more on this in a more advanced post.
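To give a first taste of closures, here is a minimal sketch. The names makePower, square and cube are hypothetical: makePower is a function that writes and returns another function, which keeps access to the value of y it was created with.

```r
# a function that returns a function: the result is a closure
makePower <- function(y) {
  function(n) n^y   # y is looked up in the environment of makePower
}

square <- makePower(2)
cube   <- makePower(3)

sq_val   <- square(4)
cube_val <- cube(2)
print(sq_val)
print(cube_val)

# the exponent is stored in the environment attached to each closure
stored_y <- get("y", envir = environment(square))
print(stored_y)
```

Even though makePower has finished its job, each returned function still sees its own y.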

Conclusion

We have seen that functions constitute the most important programming construct in R, which is in fact a functional language. We may develop functions on our own, that we called User Defined Functions (UDF) and the first example introduced us to the notion of functions and variable visibility ('scoping') across environments. In practice, when developing our own functions, here are a few hints on how to avoid scoping problems and maintain clean code:

  • Similarly to loading functions from libraries, you can load and execute a function defined in a script using the source() function
  • The function environment (variables, other nested functions) is only accessible via the arguments passed to the function and the return values obtained from it
  • Whenever possible, do name functions (“assign” them to a name, which behaves like a variable); the return statement may be omitted, although its presence makes clear where the exit point of the function is located
  • Anonymous functions may be useful, but if you think you will carry out more than a simple calculation, and you plan to use the function again, just make a new named function
  • In the same spirit, if a function is used repeatedly and has a general usage, perhaps it is worth putting it into a dedicated script (R file) together with its similar sister functions.

And perhaps after playing a bit with this, you may decide that it is worth developing your own library of functions!

 

About the author: Carlo Fanara

Carlo Fanara – After a career in IT, Carlo spent 20 years in physics, first gaining an Msc in Nuclear Physics in Turin (Italy), and then a PhD in plasma physics in the United Kingdom. After several years in academia, in 2008 Carlo moved to France to work on R&D projects in private companies and more recently as a freelance. His interests include technological innovation and programming for Data Mining and Data Science.
