Utilities in R Tutorial
In this tutorial, you will learn about several functions and utilities that are easy to use and commonly needed in the R programming language. First, we'll discuss some mathematical functions. Then you will look at functions that relate more closely to R's data structures, such as manipulating lists, followed by regular expressions and dates and times.
Mathematical Functions
Let's look at the following cells of code, which use several mathematical functions.
You will first define two vectors, namely x and y, each comprising both positive and negative values.
x <- c(1.1, -2.3, -4.5)
y <- c(2.4,-44, -2.2)
print(x)
[1] 1.1 -2.3 -4.5
print(y)
[1] 2.4 -44.0 -2.2
Let's take these two vectors, x and y, and:
- Take their absolute values,
- Round them to zero decimal places,
- Sum each of them, and
- Finally, take the average of the two sums.
mean(c(sum(round(abs(x))),sum(round(abs(y)))))
27.5
Let's break the above line of code into small pieces and take a closer look at each function.
- The abs() function returns the absolute (non-negative) value of each element of the vectors x and y.
For example, on applying the abs() function to x and y, you would expect all positive values, as shown below.
abs(x)
- 1.1
- 2.3
- 4.5
abs(y)
- 2.4
- 44
- 2.2
- The next function in the list is round(), which rounds its input. It takes an optional second argument, digits, which specifies how many decimal places to round to. By default, round() rounds to zero decimal places.
For example, after you apply the round() function to the vectors x and y, the output looks like the one shown below. Note that -4.5 becomes -4 rather than -5, because R rounds halves to the nearest even digit:
round(x)
- 1
- -2
- -4
round(y)
- 2
- -44
- -2
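As a small illustration of the optional digits argument and of how halves are rounded, the sketch below reuses the vector x from above. The extra values 3.14159 and c(0.5, 1.5, 2.5) are made up for this example only.

```r
x <- c(1.1, -2.3, -4.5)

# Default: zero decimal places
rounded_default <- round(x)

# Keep one or two decimal places via the digits argument
rounded_one <- round(x, digits = 1)
rounded_pi  <- round(3.14159, digits = 2)

# Halves go to the nearest even digit, not always away from zero
rounded_halves <- round(c(0.5, 1.5, 2.5))

print(rounded_default)  # 1 -2 -4
print(rounded_one)      # 1.1 -2.3 -4.5
print(rounded_pi)       # 3.14
print(rounded_halves)   # 0 2 2
```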
- The sum() function computes the sum of all the elements of a vector or matrix and returns it as a single number.
In this case, the vectors x and y are passed to sum(), so R adds up the elements of each vector and returns a single value. (Here abs() and round() are applied in the opposite order from the original expression, which gives the same result for these values.)
sum(abs(round(x)))
7
sum(abs(round(y)))
48
- Finally, the mean() function calculates the arithmetic mean, or average: it adds a set of numerical values together and divides the total by the number of values.
In this case, the input to mean() is a vector of length two containing the numbers 7 and 48. So, the mean of these two values is their sum divided by the number of elements, i.e., 2.
mean(c(7,48))
27.5
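The same calculation can also be written with named intermediate steps, which some readers find easier to follow than the nested one-liner. This is only a sketch: the helper name rounded_abs_sum is hypothetical and not part of R.

```r
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)

# Hypothetical helper: absolute value, then round, then sum
rounded_abs_sum <- function(v) {
  sum(round(abs(v)))
}

sum_x <- rounded_abs_sum(x)      # 1 + 2 + 4
sum_y <- rounded_abs_sum(y)      # 2 + 44 + 2
result <- mean(c(sum_x, sum_y))

print(sum_x)   # 7
print(sum_y)   # 48
print(result)  # 27.5
```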
This was still pretty easy, wasn't it?
Let's now move on to the next and most interesting segment of this tutorial!
Functions for Data Structures
Now let's look at some data structures, namely lists and vectors: different ways in which you can operate on a list, how to reverse a list, and how to convert a list to a vector and vice versa.
The list() function creates a list, whose elements can be of different types, such as logical values, numerical values, and strings. Because a list can also hold other lists, it can be used to build nested lists.
In the example below, the list has three named elements, namely log, ch, and int_vec. They hold a logical value, a character string, and a numeric vector, respectively.
list_define <- list(log = TRUE, ch = "hello_datacamp", int_vec = sort(rep(seq(8,2, by = -2), times = 2)))
list_define
- $log
- TRUE
- $ch
- 'hello_datacamp'
- $int_vec
-
- 2
- 2
- 4
- 4
- 6
- 6
- 8
- 8
- log is simply a logical value, TRUE or FALSE.
- ch is the character string hello_datacamp.
- int_vec is a sequence of numerical values.
Since log and ch are pretty straightforward, let's have a closer look at int_vec.
int_vec = sort(rep(seq(8,2, by = -2), times = 2))
int_vec
- 2
- 2
- 4
- 4
- 6
- 6
- 8
- 8
Let's understand the above expression step by step.
The seq() function produces a sequence of numbers, here in descending order from 8 to 2.
The syntax of the seq() function is: seq(x1, x2, by = y)
The first two arguments, x1 and x2, tell R the range of the sequence, i.e., where to start and end. The by argument specifies the increment or decrement between consecutive values.
For example, the below line of code will generate a sequence starting from 100 to 200 with an increment step of 20.
seq(100,200, by = 20)
- 100
- 120
- 140
- 160
- 180
- 200
In our example, the seq() function outputs a sequence from 8 down to 2 (inclusive) with a decrement step of 2, which returns a vector of length 4.
a = seq(8,2, by = -2)
a
- 8
- 6
- 4
- 2
Let's now understand the rep() function.
It repeats its input, which is usually a vector or a list. The rep() function takes two main arguments: the input, and times, an integer that specifies how many times the whole input should be repeated.
Applying rep() to our example, a vector of length 4, yields an output vector of length 8.
b = rep(a, times = 2)
If you want each element of the vector or list to be repeated instead of the complete vector, you can use the each argument instead of times.
The difference between times and each is the order in which the repeated elements appear: times repeats the whole vector end to end, while each repeats every element in place.
rep(a, each = 2)
- 8
- 8
- 6
- 6
- 4
- 4
- 2
- 2
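To make the contrast between times and each concrete, here is a short side-by-side sketch using the same vector a as above.

```r
a <- seq(8, 2, by = -2)

# times: the whole vector is repeated end to end
by_times <- rep(a, times = 2)

# each: every element is repeated in place
by_each <- rep(a, each = 2)

print(by_times)  # 8 6 4 2 8 6 4 2
print(by_each)   # 8 8 6 6 4 4 2 2
```

Both results have length 8; only the ordering of the elements differs.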
Last but not least, the sort() function. It is a self-explanatory, generic function used for sorting vectors. It is not limited to numerical values but can also be used on logical values and characters. By default, it sorts the elements in ascending order.
Let's pass the output of the rep() function to the sort() function to arrive at the final output.
sort(b)
- 2
- 2
- 4
- 4
- 6
- 6
- 8
- 8
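The default ascending order can be reversed with the decreasing argument, and sort() also works on character and logical vectors. The sketch below uses a few made-up inputs to show this.

```r
b <- c(8, 6, 4, 2, 8, 6, 4, 2)

# Descending order instead of the default ascending order
desc <- sort(b, decreasing = TRUE)

# Characters are sorted alphabetically, logicals with FALSE before TRUE
sorted_chr <- sort(c("lion", "cat", "dog"))
sorted_lgl <- sort(c(TRUE, FALSE, TRUE))

print(desc)        # 8 8 6 6 4 4 2 2
print(sorted_chr)  # "cat" "dog" "lion"
print(sorted_lgl)  # FALSE TRUE TRUE
```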
Great! So you have worked through the lengthy expression for int_vec, which was inside the list list_define, one step at a time.
Next, let's inspect the contents of list_define using the str() function. The str() function in R displays the structure of R objects.
str(list_define)
List of 3
$ log : logi TRUE
$ ch : chr "hello_datacamp"
$ int_vec: num [1:8] 2 2 4 4 6 6 8 8
Let's look at some other useful R expressions:
- The is.*() family of functions, such as is.list(), can be used to check the type of your data structure. Each returns a logical value, which comes in handy in conditional statements.
is.list(list_define) #returns true if the argument passed is a list.
TRUE
In contrast, it returns FALSE if an atomic vector, which is not a list, is passed, as shown in the cell below.
is.list(c(1,2,3)) #returns false since you passed a vector.
FALSE
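Because these checks return a single logical value, they slot directly into if statements. The sketch below shows one way to branch on the input type; the function name describe_input and its messages are hypothetical, chosen only for illustration.

```r
# Hypothetical helper that branches on the type of its argument
describe_input <- function(obj) {
  if (is.list(obj)) {
    "a list"
  } else if (is.numeric(obj)) {
    "a numeric vector"
  } else {
    "something else"
  }
}

from_list   <- describe_input(list(1, "a"))
from_vector <- describe_input(c(1, 2, 3))
from_string <- describe_input("hello")

print(from_list)    # "a list"
print(from_vector)  # "a numeric vector"
print(from_string)  # "something else"
```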
- Converting a vector to a list is simple in R. All you need to do is call the as.list() function and pass the vector as an argument.
vec_to_list <- as.list(c(1,2,3))
is.list(vec_to_list) #verify it with is.list()
TRUE
- On the other hand, a list can be unrolled into a vector by using the unlist() function. R does this conversion by flattening the entire list structure and returning a single vector.
list_to_vec <- unlist(vec_to_list)
list_to_vec
- 1
- 2
- 3
Similar to is.list(), you can use is.vector() to find out whether the given argument is a vector or not.
is.vector(list_to_vec)
TRUE
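One caveat worth knowing: in R a list is itself a kind of vector, so is.vector() also returns TRUE for a list. If the goal is to tell a plain (atomic) vector apart from a list, is.atomic() is one option, as this small sketch suggests.

```r
plain_vec <- c(1, 2, 3)
as_a_list <- as.list(plain_vec)

# is.vector() is TRUE for both, because lists are vectors too
vec_check_plain <- is.vector(plain_vec)
vec_check_list  <- is.vector(as_a_list)

# is.atomic() separates atomic vectors from lists
atomic_check_plain <- is.atomic(plain_vec)
atomic_check_list  <- is.atomic(as_a_list)

print(c(vec_check_plain, vec_check_list))        # TRUE TRUE
print(c(atomic_check_plain, atomic_check_list))  # TRUE FALSE
```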
Let's convert the whole list list_define to a vector. The point to note here is that an atomic vector can only contain a single data type. Hence, both the logical and the numerical values will be converted to strings.
unlist(list_define)
- log
- 'TRUE'
- ch
- 'hello_datacamp'
- int_vec1
- '2'
- int_vec2
- '2'
- int_vec3
- '4'
- int_vec4
- '4'
- int_vec5
- '6'
- int_vec6
- '6'
- int_vec7
- '8'
- int_vec8
- '8'
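The coercion can be checked directly with class(). The sketch below rebuilds the same list (with int_vec written out literally) and also shows that, when no strings are present, logical values are coerced to numbers instead; the small list mixed is made up for this example.

```r
list_define <- list(log = TRUE, ch = "hello_datacamp",
                    int_vec = c(2, 2, 4, 4, 6, 6, 8, 8))

flat <- unlist(list_define)
flat_class <- class(flat)        # everything became character
flat_names <- names(flat)        # names are extended with an index

# The numeric values are still intact inside the original list
original_numbers <- list_define$int_vec

# Without any strings, logicals are coerced to numeric instead
mixed <- unlist(list(a = TRUE, b = 2))

print(flat_class)        # "character"
print(flat_names[1:3])   # "log" "ch" "int_vec1"
print(original_numbers)  # 2 2 4 4 6 6 8 8
print(mixed)             # a = 1, b = 2
```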
Before moving on to the next topic, let's look at the append() and rev() functions.
- As the name suggests, the append() function adds the elements of one vector or list to another vector or list.
Let's try out the append() function on the list_define list. Notice that the result consists of 6 elements instead of 3, since you appended the list to itself.
str(append(list_define, list_define))
List of 6
$ log : logi TRUE
$ ch : chr "hello_datacamp"
$ int_vec: num [1:8] 2 2 4 4 6 6 8 8
$ log : logi TRUE
$ ch : chr "hello_datacamp"
$ int_vec: num [1:8] 2 2 4 4 6 6 8 8
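By default append() adds the new values at the end, but its after argument lets you choose the insertion position. A minimal sketch with a made-up numeric vector v:

```r
v <- c(1, 2, 5)

# Insert 3 and 4 after the second element
middle <- append(v, c(3, 4), after = 2)

# after = 0 places the new value at the front
front <- append(v, 0, after = 0)

print(middle)  # 1 2 3 4 5
print(front)   # 0 1 2 5
```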
- Finally, let's reverse the order of the list_define list with the help of the rev() function in R.
str(rev(list_define))
List of 3
$ int_vec: num [1:8] 2 2 4 4 6 6 8 8
$ ch : chr "hello_datacamp"
$ log : logi TRUE
Regular Expressions
A lot of people find regular expressions a complex topic to learn. However, they are an essential topic not only in R but across programming languages such as Python. Many programming languages, including R, provide built-in regular expression capability.
Regular expressions are used in many applications and come in handy especially when you want to preprocess text data. They are used in various Natural Language Processing (NLP) problems.
They are also used in search engines and text editors.
Regular expressions, also known as regex, regexp, or rational expressions, are sequences of characters that define a search pattern. Generally, these search patterns are used by string-searching algorithms for finding a pattern, finding and replacing it, or filtering the matches.
Let's start this topic by understanding the grep() and grepl() functions.
For simplicity, let's define a character vector animals_regex of length 5 on which you will learn to apply regex patterns.
animals_regex <- c('cat','dog','cheetah','lion','mice')
First, let's understand the grepl() function. The grepl() function returns a logical output: TRUE for each string that matches the pattern, and FALSE otherwise.
Below is the syntax of the grepl() function, where the first argument is the pattern you want to match and the second argument is the input vector of strings to search. You can ignore the remaining arguments for now.
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Let's find out which of the above five animals have a c in them with the help of the grepl() function.
In this case, you are looking for the animals that have c in them, which means that the pattern here is simply c itself.
grepl(pattern = 'c', x = animals_regex)
- TRUE
- FALSE
- TRUE
- FALSE
- TRUE
From the above output, you can observe that since there is a c in cat, cheetah, and mice, TRUE is returned for those positions in the vector animals_regex, while FALSE is returned for the ones that did not match the pattern c.
Now let's find the elements that start with c, rather than just containing the character c.
To achieve this, all you need to do is put a ^ (caret) sign at the beginning of the pattern you would like to find.
grepl(pattern = '^c', x = animals_regex) #only cat and cheetah start with `c`.
- TRUE
- FALSE
- TRUE
- FALSE
- FALSE
So from the above output, you can see that since only cat and cheetah start with a c, only those positions are returned as TRUE.
Similar to the ^ sign, the $ sign can be used at the end of a pattern to match the elements that end with it. To find an animal that ends with an n, you can simply use n followed by the $ sign.
grepl(pattern = 'n$', x = animals_regex) #only lion ends with an `n`.
- FALSE
- FALSE
- FALSE
- TRUE
- FALSE
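The two anchors can be combined in one pattern, and the logical result of grepl() can be used to subset the vector directly. The sketch below also tries the ignore.case argument from the signature above; the specific patterns are just illustrative choices.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# Matching is case-sensitive by default
upper_default <- grepl('C', animals_regex)
upper_ignore  <- grepl('C', animals_regex, ignore.case = TRUE)

# Starts with c AND ends with h ('.*' matches anything in between)
c_to_h <- grepl('^c.*h$', animals_regex)

# Use the logical vector to keep only the matching elements
starts_with_c <- animals_regex[grepl('^c', animals_regex)]

print(upper_default)  # FALSE FALSE FALSE FALSE FALSE
print(upper_ignore)   # TRUE FALSE TRUE FALSE TRUE
print(c_to_h)         # FALSE FALSE TRUE FALSE FALSE
print(starts_with_c)  # "cat" "cheetah"
```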
Note: To learn more about regular expressions, simply type ?regex in the R console or a Jupyter notebook code cell, and the documentation on regex will appear. Also, if you want to learn more about regular expressions, you could go through this source, which provides a tool to design your search patterns and test them on your input strings.
- Similar to the grepl() function, there is a grep() function, which, instead of a logical output, returns the indices of the elements that match the given pattern.
The syntax of the grep() function is almost the same as that of grepl(), with the additional arguments value and invert:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Let's take the same example as above but this time apply the grep() function to it!
grep(pattern = 'c', x = animals_regex)
- 1
- 3
- 5
As you would expect, the above output returns the indices of the elements cat, cheetah, and mice, and not TRUE/FALSE. And that's pretty much it!
Let's use the which() function to compare grep() and grepl(). The which() function returns the positions of the TRUE elements of a logical vector.
If you connect the dots, you will see that since grepl() returns a logical vector, passing it to which() produces the same output that you would get from grep().
which(grepl(pattern = 'c', x = animals_regex))
- 1
- 3
- 5
Like grepl(), the grep() function can handle different types of regular expression patterns.
If you apply the grep() function to find the elements in the animals_regex vector that end with n, you would expect an output of 4, since only lion ends with n, as shown below.
grep(pattern = 'n$', x = animals_regex)
4
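The value and invert arguments listed in the grep() signature are worth a quick look: value = TRUE returns the matching strings instead of their positions, and invert = TRUE selects the elements that do not match. A short sketch on the same vector:

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# Return the matching elements themselves rather than their indices
with_c <- grep('c', animals_regex, value = TRUE)

# Return the elements that do NOT contain a c
without_c <- grep('c', animals_regex, value = TRUE, invert = TRUE)

print(with_c)     # "cat" "cheetah" "mice"
print(without_c)  # "dog" "lion"
```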
Well done!
You have learned some basics of regular expressions, such as how to filter the elements of a vector that match a given pattern. However, R is not limited to just pattern matching. It has several other regex functions, and sub() is one of them.
The sub() function, instead of filtering the matched pattern, replaces the match with another string. Let's understand it more deeply!
- The sub() function primarily takes three arguments as input:
  - pattern, the regular expression you would like to match,
  - replacement, the value that replaces the matched text, and
  - x, the input character vector on which you will apply the regex.
The syntax is given as:
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
This time, too, let's take the same example as above to understand its functionality!
sub(pattern = 'c', replacement = 'a', x = animals_regex)
- 'aat'
- 'dog'
- 'aheetah'
- 'lion'
- 'miae'
From the above output, you can observe that cat becomes aat, cheetah becomes aheetah, and mice becomes miae. For these elements, the pattern was successfully matched.
Also, note that sub() replaces only the first match in each string, which means that if there are two c characters in a string, only the first occurrence of c will be replaced with a, while the second one will remain unchanged.
If you want to replace every match of a pattern in each string, use the gsub() function instead. It takes the same arguments as sub().
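None of the animal names above contains two c characters, so the first-match behaviour is not visible in that output. The sketch below uses a made-up word, cricket, to show the difference between sub() and gsub().

```r
word <- 'cricket'

# sub() replaces only the first c
first_only <- sub(pattern = 'c', replacement = '-', x = word)

# gsub() replaces every c
all_matches <- gsub(pattern = 'c', replacement = '-', x = word)

print(first_only)   # "-ricket"
print(all_matches)  # "-ri-ket"
```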
Before moving on to the next topic, let's try one more interesting expression.
This time you will make use of the | (or) operator, which matches any of the listed patterns and, where one matches, replaces it with -. Remember, since you are using the gsub() function, it will replace every match of the pattern in a string.
gsub(pattern = 'c|d|l', replacement = '-', x = animals_regex) #every `c`, `d`, and `l` gets replaced with `-`.
- '-at'
- '-og'
- '-heetah'
- '-ion'
- 'mi-e'
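When the alternatives are all single characters, a character class in square brackets is a common shorthand for the same thing. The sketch below checks that '[cdl]' gives the same result as 'c|d|l' on this vector.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# Alternation: c OR d OR l
with_alternation <- gsub('c|d|l', '-', animals_regex)

# Character class: any one of c, d, l
with_char_class <- gsub('[cdl]', '-', animals_regex)

print(with_char_class)                               # "-at" "-og" "-heetah" "-ion" "mi-e"
print(identical(with_alternation, with_char_class))  # TRUE
```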
Let's move on to the final topic of today's tutorial, i.e., Time and Dates!
Time and Dates
Time and date information can be quite useful in various scenarios. For example, let's say you are working on a computer-vision problem and you would like to find out the FPS (frames per second) at which your algorithm is running. In such a use case, you could use time objects to measure the processing speed of your algorithm. R's date and time support is equally useful for other problems, like time-series forecasting and seasonality studies.
For starters, let's quickly print today's date using R with a simple command, Sys.Date(). Here Sys refers to the system, which means it returns the current date according to your system's clock.
Sys.Date()
Simple, isn't it?
The value returned by Sys.Date() is a Date object, or you can say that its class is Date. It can be verified using the class() function that you learned in the Data Types in R tutorial.
Similar to Sys.Date(), there is a Sys.time() function, which returns the system's current time; in fact, it returns both the date and the time as output.
Sys.time()
[1] "2020-02-04 02:05:04 IST"
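The classes of the two return values can be checked with class(), and two Sys.time() readings can be subtracted to time a piece of code, as in the FPS scenario mentioned earlier. This is only a sketch: the loop with Sys.sleep() is a hypothetical stand-in for real per-frame processing, and n_frames is an arbitrary value.

```r
date_class <- class(Sys.Date())  # "Date"
time_class <- class(Sys.time())  # "POSIXct" "POSIXt"

n_frames <- 5                    # arbitrary number of frames for the demo
start <- Sys.time()
for (i in seq_len(n_frames)) {
  Sys.sleep(0.01)                # hypothetical stand-in for processing one frame
}
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
fps <- n_frames / elapsed        # frames processed per second

print(date_class)
print(time_class)
print(fps)
```

The measured fps value will vary from run to run and from machine to machine.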
Creating Date Objects
You learned how to get the current date and time. Let's now find out how you can create dates for other days by passing a mere string as an argument.
To create a date object for 10th May 1993, you will use the following syntax:
date_may <- as.Date('1993-05-10') #converts character string to a date object
date_may
class(date_may)
'Date'
One important point to note here is that R's as.Date() function by default expects the date in YYYY-MM-DD format. If you interchange the year with the month or day, it can result in an error. Let's try it out!
date_may <- as.Date('05-1993-10') #R follows the ISO date format by default
Error in charToDate(x): character string is not in a standard unambiguous format
Traceback:
1. as.Date("05-1993-10")
2. as.Date.character("05-1993-10")
3. charToDate(x)
4. stop("character string is not in a standard unambiguous format")
But the good thing is that you can specify the format explicitly by passing the format argument and customize it accordingly.
date_may <- as.Date('05-1993-10', format = '%m-%Y-%d')
The as.Date() function accepts different input formats, but it always stores and prints the date in the ISO format, as you can see by printing the date_may variable.
date_may
date_may <- as.Date('05-10-1993', format = '%m-%d-%Y')
date_may
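Going the other way, the format() function turns a Date back into a string using the same % codes. The sketch below sticks to numeric codes (%d, %m, %Y) so the results do not depend on the system locale; the dotted input string is made up for the example.

```r
date_may <- as.Date('1993-05-10')

# Date -> string in a custom layout
day_first <- format(date_may, '%d/%m/%Y')
year_only <- format(date_may, '%Y')

# String in another layout -> Date, then printed in ISO format
parsed <- as.Date('10.05.1993', format = '%d.%m.%Y')

print(day_first)       # "10/05/1993"
print(year_only)       # "1993"
print(format(parsed))  # "1993-05-10"
```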
Date Arithmetic
Wouldn't it be awesome if you could apply mathematical operations like addition and subtraction to Date objects in R?
Let's add 1 to the date_may variable, and you will see that the result is the date one day later.
date_may + 1
Great! Adding one changed the date from 10th May 1993 to 11th May 1993. Similarly, you could subtract one from the date.
Let's say you want to find out the time difference between your date of birth and your elder sibling's.
elder_sib <- as.Date('1989-03-21')
date_may - elder_sib
Time difference of 1511 days
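The result of subtracting two dates is a difftime object. As a sketch of what can be done with it, the code below extracts the plain number of days, expresses the same gap in weeks with difftime(), and uses seq() to generate a few monthly dates starting from date_may.

```r
date_may  <- as.Date('1993-05-10')
elder_sib <- as.Date('1989-03-21')

gap <- date_may - elder_sib
gap_days  <- as.numeric(gap)  # plain number of days
gap_weeks <- as.numeric(difftime(date_may, elder_sib, units = 'weeks'))

# Three dates, one month apart, starting at date_may
monthly <- seq(date_may, by = 'month', length.out = 3)

print(gap_days)         # 1511
print(gap_weeks)        # about 215.86
print(format(monthly))  # "1993-05-10" "1993-06-10" "1993-07-10"
```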
Conclusion
Congratulations on finishing the tutorial.
This tutorial was a good starting point for beginners who are eager to learn about various utility functions in R. As a good exercise, you might want to learn more about Regular Expressions as they are used in a variety of applications and indeed are a very powerful tool when it comes to cleaning or preprocessing the text data.
Please feel free to ask any questions related to this tutorial in the comments section below.
If you would like to learn more about R, take DataCamp's Intermediate R course and check out our Using Functions in R tutorial.