Utilities in R
In this tutorial, you will learn about several utility functions that are simple to use and come up often in the R programming language. First, we'll discuss some math-related functions. Then you will look at functions that relate more closely to R's data structures, such as manipulating nested lists, followed by regular expressions and times and dates.
Let's look at the following cells of code, which use several mathematical functions.
You will first define two vectors, namely x and y, each comprising both positive and negative values.
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)
 1.1 -2.3 -4.5
 2.4 -44.0 -2.2
Let's take these two vectors and:
- Take their absolute values,
- Round them to zero decimal places,
- Sum each of them, and
- Finally, take the average of the two sums.
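The four steps above can be written as one nested expression. The original code cell is not reproduced in this text, so the following is only a minimal sketch of what it presumably looked like; the variable name result is an assumption added for illustration.

```r
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)

# abs -> round -> sum for each vector, then average the two sums
result <- mean(c(sum(round(abs(x))), sum(round(abs(y)))))
print(result)  # 27.5
```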
Let's break the above line of code into small pieces and take a closer look at each function.
- The abs() function returns the absolute value of each element of a vector.
For example, applying the abs() function to x and y yields only positive values.
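As a quick check, here is a minimal sketch that applies abs() to both vectors. The names abs_x and abs_y are introduced only for this example.

```r
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)

abs_x <- abs(x)  # 1.1 2.3 4.5
abs_y <- abs(y)  # 2.4 44.0 2.2

print(abs_x)
print(abs_y)
```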
- The next function in the list is the round() function, which rounds its input. It takes an optional second argument, digits, that specifies how many decimal places to round to. By default, the round() function rounds the input to zero decimal places.
For example, after you apply the round() function to the absolute values of x and y, every element becomes a whole number.
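The sketch below shows the rounding step on the absolute values, plus one call that uses the digits argument. The variable names and the value 3.14159 are illustrative assumptions. Note that R rounds a trailing .5 to the even digit, so 4.5 becomes 4 rather than 5.

```r
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)

round_x <- round(abs(x))  # 1 2 4  (4.5 goes to the even digit, 4)
round_y <- round(abs(y))  # 2 44 2

# The optional digits argument keeps that many decimal places
pi_2dp <- round(3.14159, digits = 2)  # 3.14

print(round_x)
print(round_y)
print(pi_2dp)
```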
- The sum() function computes the sum of the elements of a vector or matrix. For example, if you pass a vector as an argument to sum(), the sum of all its elements is returned as a single value.
In this case, the vectors x and y are each passed to the sum() function. Hence, R calculates the sum of each vector's elements and returns a single number for each.
- Finally, the mean() function calculates the arithmetic mean: it adds a set of numerical values together and divides the total by the number of values in the set.
In this case, the input to the mean() function is a vector of length two containing the numbers 7 and 48. So, the mean of these two values is their sum divided by the number of elements, i.e., 2, which gives 27.5.
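To make the last two steps concrete, the following sketch computes both sums and then their mean, and checks the mean against the manual formula. All variable names here are assumptions for illustration.

```r
x <- c(1.1, -2.3, -4.5)
y <- c(2.4, -44, -2.2)

sum_x <- sum(round(abs(x)))  # 1 + 2 + 4  = 7
sum_y <- sum(round(abs(y)))  # 2 + 44 + 2 = 48

avg <- mean(c(sum_x, sum_y))       # 27.5
manual_avg <- (sum_x + sum_y) / 2  # same value, computed by hand

print(c(sum_x, sum_y))
print(avg)
```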
That was pretty easy, wasn't it?
Let's now move on to the next and most interesting segment of this tutorial!
Functions for Data Structures
Now let's look at some data structures, namely lists and vectors. You will see different ways to operate on a list, how to reverse a list, and how to convert a list to a vector and vice versa.
The list() function below creates a list whose elements can be logical values, numerical values, strings, or even other lists (a list of lists is also known as a nested list).
In the example below, there are three elements inside the list, namely log, ch, and int_vec. They hold a logical value, a character string, and a numeric vector, respectively.
list_define <- list(log = TRUE, ch = "hello_datacamp", int_vec = sort(rep(seq(8,2, by = -2), times = 2)))
- log is simply a logical value, TRUE or FALSE
- ch is a character string
- int_vec is a vector of numerical values.
While log and ch are pretty straightforward, let's have a closer look at
int_vec = sort(rep(seq(8,2, by = -2), times = 2))
Let's understand the above expression step-by-step.
- First comes the seq() function, which produces a sequence of numbers in descending order ranging from 8 to 2.
The syntax of the seq() function is given as: seq(x1, x2, by = y)
The first two arguments, x1 and x2, tell R the range of the sequence, i.e., where to start and end it. The by argument specifies the increment or decrement of the sequence at each step.
For example, the below line of code will generate a sequence starting from 100 to 200 with an increment step of 20.
seq(100,200, by = 20)
In our example, the sequence function will output a sequence from 8 till 2 (inclusive) with a decrement step of 2, which returns a vector of length 4.
a = seq(8,2, by = -2)
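Running both seq() calls shows the vectors they produce. This is a small sketch; the name up is introduced only for illustration.

```r
# Ascending sequence: start at 100, stop at 200, step by 20
up <- seq(100, 200, by = 20)  # 100 120 140 160 180 200

# Descending sequence: a negative step counts down from 8 to 2
a <- seq(8, 2, by = -2)       # 8 6 4 2

print(up)
print(a)
print(length(a))              # 4
```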
- Let's now understand the rep() function. It repeats its input, which is usually a vector or a list, according to the times argument, which takes an integer and repeats the whole input that many times.
The rep() function here takes two arguments: the input and the number of times you want the input to be repeated.
Applying the rep() function to our example, which is a vector of length 4, yields an output vector of length 8.
b = rep(a, times = 2)
If you want each element of the vector or list to be repeated instead of the complete vector, there is an alternative to times: the each argument.
The apparent difference when using each is the pattern of the output: every element is repeated in place before moving on to the next one, rather than the whole vector being repeated end to end.
rep(a, each = 2)
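Placing the two variants side by side makes the difference in ordering visible. In this sketch, b_each is a name added for illustration.

```r
a <- seq(8, 2, by = -2)     # 8 6 4 2

# times repeats the whole vector end to end
b <- rep(a, times = 2)      # 8 6 4 2 8 6 4 2

# each repeats every element in place
b_each <- rep(a, each = 2)  # 8 8 6 6 4 4 2 2

print(b)
print(b_each)
```

Both results have length 8; only the order of the elements differs.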
- Last but not least is the sort() function. It is a self-explanatory, generic function used for sorting vectors. It is not limited to numerical values but can also be used on logical values and characters. By default, it sorts the elements in ascending order.
Let's pass the output of the rep() function to the sort() function to arrive at the final output.
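A minimal sketch of this last step follows. The decreasing = TRUE call is an extra illustration that is not part of the original expression.

```r
b <- rep(seq(8, 2, by = -2), times = 2)     # 8 6 4 2 8 6 4 2

# Default: ascending order
int_vec <- sort(b)                          # 2 2 4 4 6 6 8 8

# Optional: descending order
int_vec_desc <- sort(b, decreasing = TRUE)  # 8 8 6 6 4 4 2 2

print(int_vec)
print(int_vec_desc)
```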
Great! You have worked through the lengthy expression for int_vec, which was inside the list list_define, one step at a time.
Further, let's inspect the contents of the list list_define with the str() function, which displays the structure of R objects. Calling str(list_define) prints:
List of 3
 $ log    : logi TRUE
 $ ch     : chr "hello_datacamp"
 $ int_vec: num [1:8] 2 2 4 4 6 6 8 8
Let's look at some handy R expressions:
- The is.*() family of functions, such as is.list(), can be used to check the type of your data structure. Each returns a logical value, which comes in handy when writing conditional statements.
is.list(list_define) #returns true if the argument passed is a list.
It returns FALSE, however, if you pass a vector that is not a list, as shown in the cell below.
is.list(c(1,2,3)) #returns false since you passed a vector.
- Converting a vector to a list is simple in R. All you need to do is call the as.list() function and pass the vector as an argument.
vec_to_list <- as.list(c(1,2,3))
is.list(vec_to_list) #verify it with is.list()
- On the other hand, a list can be unrolled into a vector by using the unlist() function. R does this conversion by flattening the entire list structure and returning a single vector.
list_to_vec <- unlist(vec_to_list)
Similar to is.list(), you can use is.vector() to find out whether the given argument is a vector or not. Keep in mind that is.vector() also returns TRUE for a list, because in R a list is a vector of mode "list"; use is.atomic() to tell the two apart.
Let's convert the big list list_define to a vector. The point to note here is that an atomic vector can only contain a single data type. Hence, both the logical and the numerical values will be converted to strings.
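The conversion can be sketched as follows. The code cell is missing from this text, so the call and the name flat are assumptions. unlist() also builds element names from the list names, numbering the entries that come from int_vec.

```r
list_define <- list(log = TRUE, ch = "hello_datacamp",
                    int_vec = sort(rep(seq(8, 2, by = -2), times = 2)))

# Flatten the list; mixed types are coerced to character
flat <- unlist(list_define)

print(flat)
print(class(flat))   # "character"
print(length(flat))  # 10: one logical, one string, eight numbers
```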
Before moving on to the next topic, let's look at the append() function.
- As the name suggests, the append() function allows you to add the elements of one vector or list to an existing vector or list.
Let's try out the append() function on the list_define list. You will notice that the resulting list consists of 6 elements instead of 3, since you appended the list to itself.
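The call itself is not shown in this text. A minimal sketch, assuming the list is simply appended to itself and inspected with str() (the name list_appended is hypothetical):

```r
list_define <- list(log = TRUE, ch = "hello_datacamp",
                    int_vec = sort(rep(seq(8, 2, by = -2), times = 2)))

# Append the list to itself: 3 + 3 = 6 elements
list_appended <- append(list_define, list_define)

str(list_appended)
```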
List of 6
 $ log    : logi TRUE
 $ ch     : chr "hello_datacamp"
 $ int_vec: num [1:8] 2 2 4 4 6 6 8 8
 $ log    : logi TRUE
 $ ch     : chr "hello_datacamp"
 $ int_vec: num [1:8] 2 2 4 4 6 6 8 8
- Finally, let's reverse the order of the list_define list with the help of the rev() function in R.
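Again the code cell is missing here. A minimal sketch of the reversal might look like this, with list_reversed being a name assumed for illustration:

```r
list_define <- list(log = TRUE, ch = "hello_datacamp",
                    int_vec = sort(rep(seq(8, 2, by = -2), times = 2)))

# rev() reverses the order of the elements; the elements themselves are unchanged
list_reversed <- rev(list_define)

str(list_reversed)
```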
List of 3
 $ int_vec: num [1:8] 2 2 4 4 6 6 8 8
 $ ch     : chr "hello_datacamp"
 $ log    : logi TRUE
Regular Expressions
A lot of people find regular expressions a complex topic to learn. However, they are an essential topic not only in R but across programming languages such as Python. Many programming languages, including R, provide built-in regular expression capability.
Regular expressions are used in many applications and come in handy especially when you want to preprocess text data, for example in Natural Language Processing (NLP) problems. They are also used in search engines and text editors.
Regular expressions, also known as rational expressions, are sequences of characters that define a search pattern. Generally, these search patterns are used by string-searching algorithms for finding a pattern, finding and replacing a pattern, or filtering on a matched pattern.
Let's start this topic by understanding the use of the grepl() and grep() functions.
For simplicity, let's define a character vector animals_regex of length 5 on which you will learn to apply regex patterns.
animals_regex <- c('cat','dog','cheetah','lion','mice')
First, let's understand the grepl() function. It returns a logical output: TRUE for each string that matches the pattern and FALSE otherwise.
Below is the syntax of the grepl() function, where the first argument is the pattern you want to match and the second argument is the string or character vector in which to look for the pattern. You can ignore the remaining arguments for now.
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Let's find out which of the above five animals have a c in them with the help of the grepl() function.
In this case, you are looking for the animals that have a c in their name, which means that the pattern here is simply c.
grepl(pattern = 'c', x = animals_regex)
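The sketch below stores the result and also uses it to subset the vector, which is a common way to filter with grepl(). The name has_c is introduced for illustration.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# One logical value per element: does it contain a 'c'?
has_c <- grepl(pattern = 'c', x = animals_regex)
print(has_c)                 # TRUE FALSE TRUE FALSE TRUE

# A logical vector can be used directly for subsetting
print(animals_regex[has_c])  # "cat" "cheetah" "mice"
```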
From the above output, you can observe that, since there is a c in cat, cheetah, and mice, TRUE is returned for those positions in the vector animals_regex, while FALSE is returned for the ones that did not match the pattern c.
Now let's find the elements that start with c rather than merely contain the character c in their name.
To achieve this, all you need to do is use a ^ (caret) sign at the beginning of the pattern you would like to find.
grepl(pattern = '^c', x = animals_regex) #only cat and cheetah start with `c`.
From the above output, you can see that since only cat and cheetah start with a c, only those positions are returned as TRUE.
Similar to the ^ sign, the $ sign can be used at the end of the pattern to match the elements that end with the specified pattern. To find an animal whose name ends with an n, you can simply use n followed by the $ sign.
grepl(pattern = 'n$', x = animals_regex) #only lion ends with an `n`.
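Both anchors can be compared in one short sketch. The third call, which combines ^ and $ to match a whole string, is an extra illustration not taken from the tutorial.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

starts_c <- grepl('^c', animals_regex)    # TRUE FALSE TRUE FALSE FALSE
ends_n   <- grepl('n$', animals_regex)    # FALSE FALSE FALSE TRUE FALSE

# Using both anchors requires the entire string to equal "cat"
is_cat   <- grepl('^cat$', animals_regex) # TRUE FALSE FALSE FALSE FALSE

print(starts_c)
print(ends_n)
print(is_cat)
```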
Note: To learn more about regular expressions, simply type ?regex in the R console or a Jupyter notebook code cell, and the documentation on regex will pop up.
Also, if you want to learn more about regular expressions, you could go through this source, which provides a tool to design your search patterns and test them on your input strings.
- Similar to the grepl() function, there is a grep() function which, instead of a logical output, returns the indices of the elements that match the given pattern.
The syntax of the grep() function is almost the same as that of the grepl() function, with two additional arguments, value and invert, and is given as:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Let's take the same example as above but this time apply the
grep() function on it!
grep(pattern = 'c', x = animals_regex)
As you would expect, the above output returns the index of the elements cat, cheetah, and mice, and not TRUE/FALSE. And that's pretty much about it!
Let's use the which() function to compare the grep() function with the grepl() function. The which() function simply returns the positions of the TRUE values in a logical object.
Connecting the dots: since the grepl() function returns a logical vector, you can pass its result to the which() function, which then produces the same output you would expect from the grep() function.
which(grepl(pattern = 'c', x = animals_regex))
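The equivalence can be checked directly. In this sketch, idx_grep and idx_which are names added for illustration.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

idx_grep  <- grep(pattern = 'c', x = animals_regex)          # 1 3 5
idx_which <- which(grepl(pattern = 'c', x = animals_regex))  # 1 3 5

print(idx_grep)
print(idx_which)
print(identical(idx_grep, idx_which))  # TRUE
```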
Similar to the grepl() function, the grep() function also knows how to handle different types of regular expression patterns.
If you apply the grep() function to find the elements in the animals_regex vector that end with n, you would expect an output of 4, since only lion, the fourth element, ends with n, as shown below.
grep(pattern = 'n$', x = animals_regex)
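The value and invert arguments listed in the grep() syntax can be tried on the same pattern. This is a small sketch with illustrative variable names: value = TRUE returns the matching strings instead of their positions, and invert = TRUE returns the non-matching positions.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

idx_n     <- grep('n$', animals_regex)                 # 4
val_n     <- grep('n$', animals_regex, value = TRUE)   # "lion"
idx_not_n <- grep('n$', animals_regex, invert = TRUE)  # 1 2 3 5

print(idx_n)
print(val_n)
print(idx_not_n)
```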
You have learned some basics of regular expressions, such as how to filter the elements of a vector that match a given pattern. However, R is not limited to just pattern matching. It has a handful of other regex functions, and the sub() function is one of them.
The sub() function, instead of filtering on the matched pattern, replaces the matches with other strings. Let's understand it more deeply!
The sub() function primarily takes three arguments as input:
- pattern, the regular expression you would like to match,
- replacement, the value that will be substituted for the matched text, and
- x, the input character vector on which you will apply the regex.
The syntax is given as:
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
This time, too, let's take the same example as above to understand its functionality!
sub(pattern = 'c', replacement = 'a', x = animals_regex)
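Storing the result makes it easy to see which elements changed. The name replaced is introduced only for this sketch.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# Replace the first 'c' in each element with 'a'
replaced <- sub(pattern = 'c', replacement = 'a', x = animals_regex)
print(replaced)  # "aat" "dog" "aheetah" "lion" "miae"

# Elements without a match are returned unchanged
print(replaced == animals_regex)  # FALSE TRUE FALSE TRUE FALSE
```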
From the above output, you can observe that the cat string gets replaced with aat, the cheetah string gets replaced with aheetah, and the mice string gets converted to miae. For these elements, the pattern was successfully matched.
Also, note that sub() replaces only the first match in each string, which means that if there are two occurrences of c in a string, only the first c will be replaced with a while the second one will remain unchanged.
If you want to replace every match of a pattern in each string, use the gsub() function instead, which this tutorial covers only briefly.
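None of the animal names contains two c characters, so the difference is easier to see on another word. The input "cocoa" below is a hypothetical example that does not come from the tutorial.

```r
word <- 'cocoa'  # hypothetical input with two 'c' characters

first_only <- sub(pattern = 'c', replacement = 'a', x = word)   # "aocoa"
all_hits   <- gsub(pattern = 'c', replacement = 'a', x = word)  # "aoaoa"

print(first_only)
print(all_hits)
```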
Before moving on to the next topic, let's try one more interesting expression.
This time you will make use of the | (or) operator, which matches any of the listed patterns and, wherever one matches, replaces it with -. Remember, since you use the gsub() function, every single match of a pattern in a string will be replaced.
gsub(pattern = 'c|d|l', replacement = '-', x = animals_regex) #animals with `c`,`d`,`l` gets replaced with `-`.
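The result of this call can be sketched as follows. The character class '[cdl]' is shown as an equivalent way to write the same alternation for single characters; that variant and the variable names are additions for illustration.

```r
animals_regex <- c('cat', 'dog', 'cheetah', 'lion', 'mice')

# Replace every 'c', 'd', or 'l' with '-'
dashed <- gsub(pattern = 'c|d|l', replacement = '-', x = animals_regex)
print(dashed)  # "-at" "-og" "-heetah" "-ion" "mi-e"

# For single characters, a character class matches the same set
dashed_class <- gsub(pattern = '[cdl]', replacement = '-', x = animals_regex)
print(identical(dashed, dashed_class))  # TRUE
```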
Let's move on to the final topic of today's tutorial, i.e., Time and Dates!
Time and Dates
Time and date information can be quite useful in various scenarios. For example, let's say you are working on a Computer Vision related problem and you would like to find out the FPS (frames per second) at which your algorithm is running. In such a use-case, you could use the Time object to find out the processing speed of your computer vision algorithm. For other specific problems like time-series forecasting and seasonality studies, R's potential can be used to the full extent.
For starters, let's quickly print today's date using R with a simple command, Sys.Date(). Here Sys refers to the system, which means the function returns the current date according to your system's clock.
Simple, isn't it?
Dates in R belong to the Date class, or you can say that the data type is Date. This can be verified using the class() function that you learned in the Data Types in R tutorial.
Similar to the Sys.Date() function, there is a Sys.time() function, which returns the system's current time; in fact, it returns both the date and the time as output.
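A minimal sketch of both calls and their classes follows; the names today and now are assumptions for illustration. The printed values depend on when and where the code runs, so only the classes are noted in the comments.

```r
today <- Sys.Date()
print(today)
print(class(today))  # "Date"

now <- Sys.time()
print(now)
print(class(now))    # "POSIXct" "POSIXt"
```

A sample Sys.time() result from the original run looks like this: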
 "2020-02-04 02:05:04 IST"
Creating Date Objects
You learned how to get the current date and time. Let's now find out how you can create dates for other days by passing a mere string as an argument.
To create a date object for 10th May 1993, you will use the following syntax:
date_may <- as.Date('1993-05-10') #converts character string to a date object
One important point to note here is that R's as.Date() function by default expects you to enter the date in YYYY-MM-DD format. If you try to interchange the year with the month or day, it results in an error. Let's try it out!
date_may <- as.Date('05-1993-10') #R follows the ISO date format by default
Error in charToDate(x): character string is not in a standard unambiguous format
Traceback:
1. as.Date("05-1993-10")
2. as.Date.character("05-1993-10")
3. charToDate(x)
4. stop("character string is not in a standard unambiguous format")
But the good thing is that you can specify the format explicitly by passing the format argument and customizing it accordingly.
date_may <- as.Date('05-1993-10', format = '%m-%Y-%d')
The as.Date() function accepts different date formats, but in the end it converts them to the ISO date format, as you can see by printing the date_may variable.
date_may <- as.Date('05-10-1993', format = '%m-%d-%Y')
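Printing the object confirms the ISO representation. The format() calls below are an additional sketch showing how to turn the date back into a string in another layout; the names iso_string and dmy_string are hypothetical.

```r
date_may <- as.Date('05-10-1993', format = '%m-%d-%Y')

print(date_may)                             # "1993-05-10"

iso_string <- format(date_may)              # "1993-05-10"
dmy_string <- format(date_may, '%d/%m/%Y')  # "10/05/1993"

print(iso_string)
print(dmy_string)
```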
Wouldn't it be awesome if you could apply mathematical operations like addition and subtraction to Date objects in R?
Let's add 1 to the date_may variable, and you will observe that the result is the date one day later.
date_may + 1
Great, so as you can see from the above output, adding one changed the date to 11th May 1993 from 10th May 1993. Similarly, you could subtract one from the date.
Let's say you want to find out the time difference between your date of birth and your elder sibling's.
elder_sib <- as.Date('1989-03-21')
date_may - elder_sib
Time difference of 1511 days
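The date arithmetic from this section can be collected into one sketch. Subtracting two Date objects yields a difftime object, which as.numeric() converts to a plain number. The difftime() call with units = 'weeks' is an extra illustration, and all variable names other than date_may and elder_sib are assumptions.

```r
date_may  <- as.Date('1993-05-10')
elder_sib <- as.Date('1989-03-21')

next_day <- date_may + 1       # "1993-05-11"
prev_day <- date_may - 1       # "1993-05-09"

gap      <- date_may - elder_sib  # Time difference of 1511 days
gap_days <- as.numeric(gap)       # 1511

# The same gap expressed in weeks
gap_weeks <- difftime(date_may, elder_sib, units = 'weeks')

print(next_day)
print(gap)
print(gap_weeks)
```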
Congratulations on finishing the tutorial.
This tutorial was a good starting point for beginners who are eager to learn about various utility functions in R. As a good exercise, you might want to learn more about Regular Expressions as they are used in a variety of applications and indeed are a very powerful tool when it comes to cleaning or preprocessing the text data.
If you would like to learn more about R, take DataCamp's Intermediate R course.