40 R Programming Interview Questions & Answers For All Levels
Being well prepared for an R programming interview is crucial to succeeding in it. Success has two sides here: for a job hunter, it means getting hired by the company, while for the company, it means finding the right fit for the position.
To increase your chances of succeeding in an R interview, it helps to know in advance which questions you may be asked if you're a job hunter, or which questions to ask a candidate if you're a hiring manager or a recruiter.
This article discusses 40 fundamental R programming interview questions and answers to them for all levels of seniority, as well as some general interview questions. For convenience, all technical questions are divided into three levels: entry-level, intermediate, and advanced questions.
As additional resources for your R programming interview preparation, consider the following:
- Practicing Statistics Interview Questions in R
- Data Science Interview Preparation
- 21 Top Data Scientist Interview Questions
General R Programming Interview Questions
At the beginning of an R interview, an interviewer may ask a candidate some general, non-technical questions about their overall experience with R. For example:
- How long have you been working in R?
- What kind of tasks do you perform in R?
- How do you estimate your level of proficiency in R?
If you're a job hunter, you should think in advance about these and similar questions and prepare your answers. Don't worry if you haven't had any real working experience in R yet: describing your internship in R programming or your individual or group R projects that you completed during your studies works just fine.
Besides, if you're interviewing for an entry-level position, your interviewer doesn't necessarily expect you to have extensive (or even any) work experience in R. Remember that since you were invited to this interview, the company already found your resume attractive.
Entry-Level R Programming Interview Questions
Let’s start with some of the basic technical R interview questions that you might face from your potential employer. These require you to have mastered the basics and have some practical experience of using R.
1. What is R, and what are its main characteristics?
R is a programming language and environment widely used for solving data science problems and particularly designed for statistical computing and data visualization. Its main characteristics include:
- Open source
- Interpreted and multi-paradigm (i.e., it supports both functional and object-oriented programming)
- Highly extensible due to its large collection of data science packages
- Functional and flexible (users can define their own functions, as well as tune various parameters of existing functions)
- Compatible with many operating systems
- Can be easily integrated with other programming languages and frameworks
- Allows powerful statistical computing
- Offers a variety of data visualization tools for creating publication-quality charts
- Equipped with the command-line interface
- Supported by a strong online community
2. What are some disadvantages of using R?
- Non-intuitive syntax and hence a steep learning curve, especially for beginners in programming
- Relatively slow
- Inefficient memory usage
- Inconsistent and often hard-to-read documentation of packages
- Some packages are of low quality or poorly-maintained
- Potential security concerns due to its open-source nature
3. List and define some basic data types in R.
- Numeric: decimal numbers.
- Integer: whole numbers.
- Character: a letter, number, or symbol, or any combination of them, enclosed in double or single quotation marks.
- Factor: categories from a predefined set of possible values, often with an intrinsic order.
- Logical: the Boolean values TRUE and FALSE, represented under the hood as 1 and 0, respectively.
4. List and define some basic data structures in R.
- Vector: a one-dimensional data structure used for storing values of the same data type.
- List: a one-dimensional data structure used for storing values of any data type and/or other data structures, including nested lists.
- Matrix: a two-dimensional data structure used for storing values of the same data type.
- Data frame: a two-dimensional data structure used for storing values of any data type, but each column must store values of the same data type.
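To make these definitions concrete, here is a minimal base R sketch that builds one object of each kind. The variable names are purely illustrative, and stringsAsFactors = FALSE is set explicitly so that the result doesn't depend on the R version.

```r
# A vector holds values of a single data type
v <- c(1.5, 2, 3.25)

# A list can mix data types and contain other structures
l <- list(name = "a", values = v, nested = list(TRUE, 1L))

# A matrix is two-dimensional and holds a single data type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame is two-dimensional; each column has its own data type
df <- data.frame(id = 1:3, label = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

print(class(v))           # "numeric"
print(class(l))           # "list"
print(dim(m))             # 2 3
print(sapply(df, class))  # id: "integer", label: "character"
```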
5. How to import data in R?
The base R provides essential functions for importing data:
- read.table(): the most general function of the base R for importing data, takes in tabular data with any kind of field separators, including specific ones, such as |.
- read.csv(): comma-separated values (CSV) files with . as the decimal separator.
- read.csv2(): semicolon-separated values files with , as the decimal separator.
- read.delim(): tab-separated values (TSV) files with . as the decimal separator.
- read.delim2(): tab-separated values (TSV) files with , as the decimal separator.
In practice, any of these functions can be used to import tabular data with any kind of field and decimal separators: using them for the specified file formats is only a matter of convention and default settings. For example, here is the syntax of the first function: read.table(file, header = FALSE, sep = "", dec = "."). The other functions have the same parameters with different default settings that can always be explicitly overridden.
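The point about defaults can be checked with a small sketch. It reads inline text through the text argument instead of a file, so the data shown here is made up purely for illustration.

```r
csv_text <- "id,value\n1,2.5\n2,3.5"

# read.csv() uses header = TRUE, sep = "," and dec = "." by default
df_csv <- read.csv(text = csv_text)

# read.table() gives the same result once those settings are passed explicitly
df_table <- read.table(text = csv_text, header = TRUE, sep = ",", dec = ".")

# read.csv2() expects ";" as the field separator and "," as the decimal separator
df_csv2 <- read.csv2(text = "id;value\n1;2,5\n2;3,5")

print(isTRUE(all.equal(df_csv, df_table)))
print(isTRUE(all.equal(df_csv, df_csv2)))
```

With real files, the first argument would be a file path rather than text.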
The tidyverse packages readr and readxl provide some other functions for importing specific file formats. Each of those functions can be further fine-tuned by setting various optional parameters.
readr
- read_tsv(): tab-separated values (TSV) files.
- read_fwf(): fixed-width files.
- read_log(): web log files.
- read_table(), read_csv(), read_csv2(), and read_delim(): counterparts of the base R functions.
readxl
- read_excel(): Excel files (both .xls and .xlsx).
- read_xls() and read_xlsx(): the same, for cases when the Excel format is known in advance.
To dive deeper into data loading in R, you can go through the tutorial on How to Import Data Into R.
6. What is a package in R, and how do you install and load packages?
An R package is a collection of functions, code, data, and documentation, representing an extension of the R programming language and designed for solving specific kinds of tasks. R comes with a bunch of preinstalled packages, and other packages can be installed by users from repositories. The most popular centralized repository storing thousands of various R packages is called Comprehensive R Archive Network (CRAN).
To install an R package directly from CRAN, we need to pass the package name enclosed in quotation marks to the install.packages() function, as follows: install.packages("package_name"). To install more than one package from CRAN in one go, we need to use a character vector containing the package names enclosed in quotation marks, as follows: install.packages(c("package_name_1", "package_name_2")). To install an R package manually, we first need to download the package as a compressed archive to our computer (a .tar.gz file for a source package) and then run the install.packages() function:
install.packages("path_to_the_locally_stored_zipped_package_file", repos=NULL, type="source")
To load an installed R package in the working R environment, we can use either the library() or the require() function. Each of them takes in the package name, usually without quotation marks, and loads the package, e.g., library(caret). However, the behavior of these functions is different when they can't find the necessary package: library() throws an error and stops the program execution, while require() outputs a warning, returns FALSE, and continues the program execution.
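The difference between the two functions can be seen in a short sketch. The package name below is hypothetical and chosen only because it should not be installed anywhere.

```r
pkg <- "aPackageThatShouldNotExist"  # hypothetical package name

# require() returns FALSE (with a warning) when the package is missing
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() raises an error instead, which tryCatch() can intercept
result <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "not installed")
print(result)
```

Because require() returns a logical value, it is sometimes used inside if statements, while library() is usually preferred in scripts that should fail early.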
7. How to create a data frame in R?
1. From one or more vectors of the same length, by using the data.frame() function:
df <- data.frame(vector_1, vector_2)
2. From a matrix, by using the data.frame() function:
df <- data.frame(my_matrix)
3. From a list of vectors of the same length, by using the data.frame() function:
df <- data.frame(list_of_vectors)
4. From other data frames:
- To combine the data frames horizontally (only if the data frames have the same number of rows, and the records are the same and in the same order), by using the cbind() function:
df <- cbind(df1, df2)
- To combine the data frames vertically (only if they have an equal number of identically named columns, ideally of the same data types; rbind() matches the columns by name), by using the rbind() function:
df <- rbind(df1, df2)
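The following sketch runs through these options with small made-up inputs. All object names are illustrative.

```r
ids <- 1:3
labels <- c("a", "b", "c")

# 1. From vectors of the same length
df_vectors <- data.frame(id = ids, label = labels)

# 2. From a matrix (column names are taken from the matrix)
my_matrix <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))
df_matrix <- data.frame(my_matrix)

# 3. From a list of vectors of the same length
df_list <- data.frame(list(id = ids, label = labels))

# 4. From other data frames: side by side and stacked
df_wide <- cbind(df_vectors, df_matrix)
df_long <- rbind(df_vectors, df_list)

print(dim(df_wide))  # 3 rows, 4 columns
print(dim(df_long))  # 6 rows, 2 columns
```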
8. How do you add a new column to a data frame in R?
- Using the $ symbol:
df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
print(df)
df$col_3 <- c(5, 1, 18, 16)
print(df)
Output:
col_1 col_2
1 10 a
2 11 b
3 12 c
4 13 d
col_1 col_2 col_3
1 10 a 5
2 11 b 1
3 12 c 18
4 13 d 16
- Using square brackets:
df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
print(df)
df["col_3"] <- c(5, 1, 18, 16)
print(df)
Output:
col_1 col_2
1 10 a
2 11 b
3 12 c
4 13 d
col_1 col_2 col_3
1 10 a 5
2 11 b 1
3 12 c 18
4 13 d 16
- Using the cbind() function:
df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
print(df)
df <- cbind(df, col_3=c(5, 1, 18, 16))
print(df)
Output:
col_1 col_2
1 10 a
2 11 b
3 12 c
4 13 d
col_1 col_2 col_3
1 10 a 5
2 11 b 1
3 12 c 18
4 13 d 16
In each of the three cases, we can assign a single value or a vector or calculate the new column based on the existing columns of that data frame or other data frames.
9. How to remove columns from a data frame in R?
1. By using the select() function of the dplyr package of the tidyverse collection. The name of each column to delete is passed in with a minus sign before it:
df <- select(df, -col_1, -col_3)
If, instead, there are too many columns to delete, it makes more sense to list the columns to keep rather than the columns to delete. In this case, the syntax is similar, but the names of the columns to keep aren't preceded by a minus sign:
df <- select(df, col_2, col_4)
2. By using the built-in subset() function of the base R. If we need to delete only one column, we assign to the select parameter of the function the column name preceded with a minus sign. To delete more than one column, we assign to this parameter a vector containing the necessary column names preceded with a minus sign:
df <- subset(df, select=-col_1)
df <- subset(df, select=-c(col_1, col_3))
If, instead, there are too many columns to delete, it makes more sense to list the columns to keep rather than the columns to delete. In this case, the syntax is similar, but no minus sign is added:
df <- subset(df, select=col_2)
df <- subset(df, select=c(col_2, col_4))
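Here is a runnable base R sketch of the subset() approach on a small made-up data frame. The last two idioms (bracket indexing and assigning NULL) are not covered above; they are shown only as common base R alternatives.

```r
df <- data.frame(col_1 = 1:3,
                 col_2 = c("a", "b", "c"),
                 col_3 = c(TRUE, FALSE, TRUE),
                 col_4 = 4:6)

# Drop columns with subset() and a negated selection
dropped <- subset(df, select = -c(col_1, col_3))

# Keep columns by name with bracket indexing (alternative idiom)
kept <- df[, c("col_2", "col_4")]

# Remove a single column in place by assigning NULL (alternative idiom)
df$col_4 <- NULL

print(names(dropped))  # "col_2" "col_4"
print(names(df))       # "col_1" "col_2" "col_3"
```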
10. What is a factor in R?
A factor in R is a specific data type that accepts categories (aka levels) from a predefined set of possible values. These categories look like characters, but under the hood, they are stored as integers. Often, such categories have an intrinsic order. For example, a column in a data frame that contains the options of the Likert scale for assessing views ("strongly agree," "agree," "somewhat agree," "neither agree nor disagree," "somewhat disagree," "disagree," "strongly disagree") should be of factor type to capture this intrinsic order and adequately reflect it on the categorical types of plots.
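As a brief illustration, the sketch below creates an ordered factor from a shortened, made-up version of such a scale and shows the integer codes stored under the hood.

```r
answers <- c("agree", "disagree", "agree", "strongly agree")

# The levels argument fixes the set of categories and their order
likert <- factor(answers,
                 levels = c("disagree", "agree", "strongly agree"),
                 ordered = TRUE)

print(levels(likert))         # "disagree" "agree" "strongly agree"
print(as.integer(likert))     # 2 1 2 3 (positions within levels)
print(likert[1] < likert[4])  # TRUE: "agree" ranks below "strongly agree"
print(table(likert))          # counts per level, in level order
```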
11. What is RStudio?
RStudio is an open-source IDE (integrated development environment) that is widely used as a graphical front-end for working with the R programming language. It works with R versions 3.0.1 and later. It has many helpful features that make it very popular among R users:
- User-friendly
- Flexible
- Multifunctional
- Allows creating reusable scripts
- Tracks operational history
- Autocompletes the code
- Offers detailed and comprehensive help on any object
- Provides easy access to all imported data and built objects
- Makes it easy to switch between terminal and console
- Allows plot previewing
- Supports efficient project creation and sharing
- Can be used with other programming languages (Python, SQL, etc.)
To learn more about what RStudio is and how to install it and begin using it, you can follow the RStudio Tutorial.
12. What is R Markdown?
R Markdown is a free and open-source R package that provides an authoring framework for building data science projects. Using it, we can write a single .Rmd file that combines narrative, code, and data plots, and then render this file in a selected output format. The main characteristics of R Markdown are:
- The resultant documents are shareable, fully reproducible, and of publication quality.
- A wide range of static and dynamic outputs and formats, such as HTML, PDF, Microsoft Word, interactive documents, dashboards, reports, articles, books, presentations, applications, websites, reusable templates, etc.
- Easy version control tracking.
- Multiple programming languages are supported, including R, Python, and SQL.
13. How to create a user-defined function in R?
To create a user-defined function in R, we use the keyword function and the following syntax:
function_name <- function(parameters){
function body
}
- Function name—the name of the function object that will be used for calling the function after its definition.
- Function parameters—the variables separated with a comma and placed inside the parentheses that will be set to actual argument values each time we call the function.
- Function body: a chunk of code in the curly brackets containing the operations to be performed in a predefined order on the input arguments each time we call the function. Usually, the function body contains the return() statement (or statements) that returns the function output, or the print() statement (or statements) to print the output.
An example of a simple user-defined function in R:
my_function <- function(x, y){
return(x + y)
}
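A slightly extended sketch shows how such a function is called. The name add_numbers and the default value for y are illustrative additions, not part of the example above.

```r
# y has a default value, so it can be omitted in a call
add_numbers <- function(x, y = 1) {
  # Without an explicit return(), the last evaluated expression is returned
  x + y
}

print(add_numbers(2, 3))           # 5: arguments matched by position
print(add_numbers(2))              # 3: y falls back to its default
print(add_numbers(y = 10, x = 5))  # 15: arguments matched by name
```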
14. List some popular data visualization packages in R.
- ggplot2: the most popular R data visualization package allowing the creation of a wide variety of plots.
- lattice: for displaying multivariate data as a tiled panel (trellis) of several plots.
- plotly: for creating interactive, publication-quality charts.
- highcharter: for easy dynamic plotting, offers many flexible features, plugins, and themes; allows charting different R objects with one function.
- leaflet: for creating interactive maps.
- ggvis: for creating interactive and highly customizable plots that can be accessed in any browser by using Shiny's infrastructure.
- patchwork: for combining several plots, usually of various types, on the same graphic.
Intermediate R Programming Interview Questions
For more experienced practitioners, it’s likely that the interviewer will ask some questions that require more detailed knowledge of R. Here are some to prepare for:
15. How to assign a value to a variable in R?
- Using the assignment operator <-, e.g., my_var <- 1: the most common way of assigning a value to a variable in R.
- Using the equal operator =, e.g., my_var = 1: typically used for assigning values to arguments in function definitions and calls.
- Using the rightward assignment operator ->, e.g., 1 -> my_var: can be used at the end of a pipe.
- Using the global assignment operators, either leftward (<<-) or rightward (->>), e.g., my_var <<- 1: for creating or modifying a global variable from inside a function definition.
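The operators can be compared side by side in a short sketch. All variable and function names are illustrative. Note that with rightward assignment the value comes first and the variable name last.

```r
a <- 1        # leftward assignment
b = 2         # equal sign, also valid at the top level
3 -> c_val    # rightward assignment: the value is on the left

counter <- 0
increment <- function() {
  # <<- modifies counter in the enclosing environment
  # instead of creating a local copy inside the function
  counter <<- counter + 1
  invisible(counter)
}
increment()
increment()

print(c(a, b, c_val))  # 1 2 3
print(counter)         # 2
```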
16. What are the requirements for naming variables in R?
- A variable name can be a combination of letters, digits, dots, and underscores. It can't contain any other symbols, including white spaces.
- A variable name must start with a letter or a dot.
- If a variable name starts with a dot, this dot can't be followed by a digit.
- Reserved words in R (TRUE, for, NULL, etc.) can't be used as variable names.
- Variable names are case-sensitive.
In the course Writing Efficient R Code, you'll find further best practices for writing code in R.
17. What types of loops exist in R, and what is the syntax of each type?
1. For loop: iterates over a sequence the number of times equal to its length (unless the statements break and/or next are used) and performs the same set of operations on each item of that sequence. This is the most common type of loop. The syntax of a for loop in R is the following:
for (variable in sequence) {
operations
}
2. While loop: performs the same set of operations as long as a predefined logical condition (or several logical conditions) holds, unless the statements break and/or next are used. Unlike with for loops, we don't know in advance the number of iterations a while loop is going to execute. Before running a while loop, we need to assign a variable (or several variables) and then update its value inside the loop body at each iteration. The syntax of a while loop in R is the following:
variable assignment
while (logical condition) {
operations
variable update
}
3. Repeat loop: repeatedly performs the same set of operations until a predefined break condition (or several break conditions) is met. To introduce such a condition, a repeat loop has to contain an if-statement code block, which, in turn, has to include the break statement in its body. As with while loops, we don't know in advance the number of iterations a repeat loop is going to execute. The syntax of a repeat loop in R is the following:
repeat {
operations
if(break condition) {
break
}
}
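The three templates can be filled in with concrete operations, as in this small sketch. The tasks (summing, doubling, squaring) are arbitrary and chosen only so that each loop terminates quickly.

```r
# for: sum the numbers from 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: keep doubling as long as the value stays below 100
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat: count up until the square exceeds 50, then break
k <- 0
repeat {
  k <- k + 1
  if (k^2 > 50) {
    break
  }
}

print(total)  # 15
print(n)      # 128
print(k)      # 8
```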
You can read more about Loops in R with our separate tutorial.
18. How to aggregate data in R?
To aggregate data in R, we use the aggregate() function. This function has the following essential parameters, in this order:
- x: the data frame to aggregate.
- by: a list of the factors to group by.
- FUN: an aggregate function to compute the summary statistics for each group (e.g., mean, max, min, sum, or length for counting).
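A minimal sketch of aggregate() on a made-up sales table follows. The second call uses the formula interface of the same function, which is an alternative way to express the grouping and is not described above.

```r
sales <- data.frame(
  region = c("north", "south", "north", "south", "north"),
  amount = c(10, 20, 30, 40, 20)
)

# x, by, FUN: mean amount per region
by_region <- aggregate(x = sales["amount"],
                       by = list(region = sales$region),
                       FUN = mean)
print(by_region)

# Formula interface: number of records per region
counts <- aggregate(amount ~ region, data = sales, FUN = length)
print(counts)
```

Here the mean amount is 20 for north and 30 for south, and the record counts are 3 and 2.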
19. How to merge data in R?
1. Using the cbind() function, only if the data frames have the same number of rows, and the records are the same and in the same order:
df <- cbind(df1, df2)
2. Using the rbind() function to combine the data frames vertically, only if they have an equal number of identically named columns, ideally of the same data types (rbind() matches the columns by name):
df <- rbind(df1, df2)
3. Using the merge() function to merge data frames by a column in common, usually an ID column:
- Inner join:
df <- merge(df1, df2, by="ID")
- Left join:
df <- merge(df1, df2, by="ID", all.x=TRUE)
- Right join:
df <- merge(df1, df2, by="ID", all.y=TRUE)
- Outer join:
df <- merge(df1, df2, by="ID", all=TRUE)
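4. Using the join() function of the plyr package to merge data frames by a column in common, usually an ID column:
df <- join(df1, df2, by="ID", type="type_of_join")
The type parameter takes in one of the following values: "inner", "left", "right", or "full". The dplyr package offers a separate function for each of these join types instead: inner_join(), left_join(), right_join(), and full_join().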
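The merge() variants can be compared on two small made-up data frames that share only some IDs. This sketch uses base R only.

```r
df1 <- data.frame(ID = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
df2 <- data.frame(ID = c(2, 3, 4), score = c(80, 90, 70))

inner <- merge(df1, df2, by = "ID")                # IDs 2 and 3
left  <- merge(df1, df2, by = "ID", all.x = TRUE)  # IDs 1 to 3, score is NA for ID 1
right <- merge(df1, df2, by = "ID", all.y = TRUE)  # IDs 2 to 4, name is NA for ID 4
outer <- merge(df1, df2, by = "ID", all = TRUE)    # IDs 1 to 4

print(nrow(inner))  # 2
print(nrow(left))   # 3
print(nrow(right))  # 3
print(nrow(outer))  # 4
```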
20. How to concatenate strings in R?
We can concatenate two or more strings in R by using the paste() or cat() functions. The first approach is more popular, since paste() returns the resulting string, while cat() only prints it. Both functions take in any number of strings to be concatenated and can also take in an optional parameter sep (along with some other optional parameters): a character or a sequence of characters that will separate the attached strings in the result (a white space by default).
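A few calls illustrate the behavior. The paste0() function and the collapse parameter are not mentioned above; they are standard base R companions of paste() and are included here for completeness.

```r
s1 <- paste("R", "is", "fun")               # default separator is a space
s2 <- paste("a", "b", sep = "-")            # custom separator
s3 <- paste0("file", 1:3, ".csv")           # paste0() uses no separator; vectorized
s4 <- paste(c("x", "y", "z"), collapse = ", ")  # collapse a vector into one string

print(s1)  # "R is fun"
print(s2)  # "a-b"
print(s3)  # "file1.csv" "file2.csv" "file3.csv"
print(s4)  # "x, y, z"

# cat() prints the concatenated text and returns NULL
cat("R", "is", "fun\n")
```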
21. How to transpose two-dimensional data in R?
We can transpose a data frame or a matrix in R so that the columns become the rows and vice versa. For this purpose, we need to use the t() function of the base R, which returns a matrix even for a data frame input. For example:
df <- data.frame(col_1=c(10, 20, 30), col_2=c(11, 22, 33))
print(df)
transposed_df <- t(df)
print(transposed_df)
Output:
col_1 col_2
1 10 11
2 20 22
3 30 33
[,1] [,2] [,3]
col_1 10 20 30
col_2 11 22 33
22. How to chain several operations together in R?
We can chain several operations in R by using the pipe operator (%>%) provided by the magrittr package and made available by dplyr and other packages of the tidyverse collection. Using this operator allows creating a pipeline of functions where the output of the first function is passed as the input into the second function and so on, until the pipeline ends. This eliminates the need for creating additional variables and significantly enhances the overall code readability.
An example of using the pipe operator on a data frame:
library(dplyr)  # provides %>%, select(), and filter()
df <- data.frame(a=1:4, b=11:14, c=21:24)
print(df)
df_new <- df %>% select(a, b) %>% filter(a > 2)
print(df_new)
Output:
a b c
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
a b
1 3 13
2 4 14
23. What types of data plots can be created in R?
Since data visualization is one of the strengths of the R programming language, we can create all types of data plots in R:
- Common types of data plots:
- Bar plot—shows the numerical values of categorical data.
- Line plot—shows a progression of a variable, usually over time.
- Scatter plot—shows the relationships between two variables.
- Area plot—based on a line plot, with the area below the line colored or filled with a pattern.
- Pie chart—shows the proportion of each category of categorical data as a part of the whole.
- Box plot—shows a set of descriptive statistics of the data.
- Advanced types of data plots:
- Violin plot—shows both a set of descriptive statistics of the data and the distribution shape for that data.
- Heatmap—shows the magnitude of each numeric data point within the dataset.
- Treemap—shows the numerical values of categorical data, often as a part of the whole.
- Dendrogram—shows an inner hierarchy and clustering of the data.
- Bubble plot—shows the relationships between three variables.
- Hexbin plot—shows the relationships of two numerical variables in a relatively large dataset.
- Word cloud—shows the frequency of words in an input text.
- Choropleth map—shows aggregate thematic statistics of geodata.
- Circular packing chart—shows an inner hierarchy of the data and the values of the data points.
- etc.
The skill track Data Visualization with R will help you broaden your horizons in the field of R graphics. If you prefer to learn data visualization in R in a broader context, explore a thorough and beginner-friendly career track Data Scientist with R.
24. What is vector recycling in R?
If we try to perform some operation on two R vectors with different lengths, the R interpreter detects under the hood the shorter one, recycles its items in the same order until the lengths of the two vectors match, and only then performs the necessary operation on these vectors. If the length of the longer vector isn't a multiple of the length of the shorter one, the R interpreter also throws a warning message about the mismatch of the vectors' lengths.
For example, if we try to run the following addition:
c(1, 2, 3, 4, 5) + c(1, 2, 3)
The second vector, due to the vector recycling, will actually be converted into c(1, 2, 3, 1, 2). Hence, the final result of this operation will be c(2, 4, 6, 5, 7).
While sometimes vector recycling can be beneficial (e.g., when we expect the cyclicity of values in the vectors), more often, it's inappropriate and misleading. Hence, we should be careful and mind the vectors' lengths before performing operations on them.
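The sketch below contrasts two cases: recycling that fits evenly, which is silent, and recycling that doesn't, which triggers a warning. The handler merely records that a warning occurred; the flag name warned is illustrative.

```r
# Lengths 4 and 2: the shorter vector fits exactly twice, no warning
even <- c(1, 2, 3, 4) + c(10, 20)

# Lengths 5 and 3: 5 is not a multiple of 3, so R warns
warned <- FALSE
uneven <- withCallingHandlers(
  c(1, 2, 3, 4, 5) + c(1, 2, 3),
  warning = function(w) {
    warned <<- TRUE                  # remember that a warning was raised
    invokeRestart("muffleWarning")   # and continue with the computation
  }
)

print(even)    # 11 22 13 24
print(uneven)  # 2 4 6 5 7
print(warned)  # TRUE
```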
25. What is the use of the next and break statements in R?
The next statement is used to skip a particular iteration and jump to the next one if a certain condition is met. The break statement is used to stop and exit the loop at a particular iteration if a certain condition is met. When used in one of the inner loops of a nested loop, this statement exits only that inner loop.
Both next and break statements can be used in any type of loops in R: for loops, while loops, and repeat loops. They can also be used in the same loop, e.g.:
for (i in 1:10) {
  if (i < 5)
    next
  if (i == 8)
    break
  print(i)
}
Output:
[1] 5
[1] 6
[1] 7
26. What is the difference between the str() and summary() functions in R?
The str() function returns the structure of an R object and the overall information about it, the exact contents of which depend on the data structure of that object. For example, for a vector, it returns the data type of its items, the range of item indices, and the item values (or several first values, if the vector is too long). For a data frame, it returns its class (data.frame), the number of observations and variables, the column names, the data type of each column, and several first values of each column.
The summary() function returns the summary statistics for an R object. It's mostly applied to data frames and matrices, for which it returns the minimum, maximum, mean, and median values, and the 1st and 3rd quartiles for each numeric column, while for the factor columns, it returns the count of each level.
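To see the two functions next to each other, here is a sketch on a tiny made-up data frame with one numeric column and one factor column.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 g = factor(c("a", "b", "a", "a")))

# Structure: class, dimensions, column types, and first values
str(df)

# Summary statistics: quartiles and mean for x, level counts for g
print(summary(df))

# summary() on a single numeric vector returns a named set of statistics
x_stats <- summary(df$x)
print(x_stats[["Median"]])  # 2.5
print(x_stats[["Mean"]])    # 2.5
```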
27. What is the difference between the subset() and sample() functions in R?
The subset() function in R is used for extracting rows and columns from a data frame or a matrix, or elements from a vector, based on certain conditions, e.g.: subset(my_vector, my_vector > 10).
In contrast, the sample() function in R can be applied only to vectors. It extracts a random sample of the predefined size from the elements of a vector, with or without replacement. For example: sample(my_vector, size=5, replace=TRUE).
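A short sketch makes the contrast visible: subset() is deterministic, while sample() depends on the random number generator. The vector values are made up, and set.seed() is called only to make a run repeatable.

```r
my_vector <- c(4, 8, 15, 16, 23, 42)

# Condition-based extraction: always the same result
big <- subset(my_vector, my_vector > 10)
print(big)  # 15 16 23 42

set.seed(42)  # fix the random number generator state

# Random sample of 5 elements, with replacement (repeats are possible)
with_repl <- sample(my_vector, size = 5, replace = TRUE)
print(with_repl)

# Without a size argument, sample() returns a random permutation
shuffled <- sample(my_vector)
print(shuffled)
```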
Advanced R Programming Interview Questions
28. How to create a new column in a data frame in R based on other columns?
1. Using the transform() and ifelse() functions of the base R:
df <- data.frame(col_1 = c(1, 3, 5, 7), col_2 = c(8, 6, 4, 2))
print(df)
# Adding the column col_3 to the data frame df
df <- transform(df, col_3 = ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
print(df)
Output:
col_1 col_2
1 1 8
2 3 6
3 5 4
4 7 2
col_1 col_2 col_3
1 1 8 9
2 3 6 9
3 5 4 20
4 7 2 14
2. Using the with() and ifelse() functions of the base R:
df <- data.frame(col_1 = c(1, 3, 5, 7), col_2 = c(8, 6, 4, 2))
print(df)
# Adding the column col_3 to the data frame df
df["col_3"] <- with(df, ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
print(df)
Output:
col_1 col_2
1 1 8
2 3 6
3 5 4
4 7 2
col_1 col_2 col_3
1 1 8 9
2 3 6 9
3 5 4 20
4 7 2 14
3. Using the apply() function of the base R:
df <- data.frame(col_1 = c(1, 3, 5, 7), col_2 = c(8, 6, 4, 2))
print(df)
# Adding the column col_3 to the data frame df
df["col_3"] <- apply(df, 1, FUN = function(x) if(x[1] < x[2]) x[1] + x[2] else x[1] * x[2])
print(df)
Output:
col_1 col_2
1 1 8
2 3 6
3 5 4
4 7 2
col_1 col_2 col_3
1 1 8 9
2 3 6 9
3 5 4 20
4 7 2 14
4. Using the mutate() function of the dplyr package and the ifelse() function of the base R:
library(dplyr)  # provides mutate()
df <- data.frame(col_1 = c(1, 3, 5, 7), col_2 = c(8, 6, 4, 2))
print(df)
# Adding the column col_3 to the data frame df
df <- mutate(df, col_3 = ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
print(df)
Output:
col_1 col_2
1 1 8
2 3 6
3 5 4
4 7 2
col_1 col_2 col_3
1 1 8 9
2 3 6 9
3 5 4 20
4 7 2 14
29. How to parse a date from its string representation in R?
To parse a date from its string representation in R, we should use the lubridate package of the tidyverse collection. This package offers various functions for parsing a string and extracting the standard date from it based on the initial date pattern in that string. These functions are ymd(), ymd_hm(), ymd_hms(), dmy(), dmy_hm(), dmy_hms(), mdy(), mdy_hm(), mdy_hms(), etc., where y, m, d, h, m, and s correspond to year, month, day, hours, minutes, and seconds, respectively.
For example, if we run the dmy() function passing to it any of the strings "05-11-2023", "05/11/2023" or "05.11.2023", representing the same date, we'll receive the same result: 2023-11-05. This is because in all three cases, despite having different dividing symbols, we actually have the same pattern: the day followed by the month followed by the year.
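For comparison, the same three strings can be parsed without lubridate. This sketch swaps in base R's as.Date(), which, unlike dmy(), needs the exact format (including the dividing symbol) spelled out for each string.

```r
# %d = day, %m = month, %Y = four-digit year
d1 <- as.Date("05-11-2023", format = "%d-%m-%Y")
d2 <- as.Date("05/11/2023", format = "%d/%m/%Y")
d3 <- as.Date("05.11.2023", format = "%d.%m.%Y")

print(format(d1))          # "2023-11-05"
print(identical(d1, d2))   # TRUE
print(identical(d2, d3))   # TRUE
```

The need to write a separate format string for every variant is the main convenience that the lubridate functions remove.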
30. What is the use of the switch() function in R?
The switch() function in R is a multiway branch control statement that evaluates an expression against items of a list. It has the following syntax:
switch(expression, case_1, case_2, case_3....)
The expression passed to the switch() function can evaluate to either a number or a character string, and depending on this, the function behavior is different.
1. If the expression evaluates to a number, the switch() function returns the item from the list based on positional matching (i.e., its index is equal to the number the expression evaluates to). If the number is greater than the number of items in the list, the switch() function returns NULL. For example:
switch(2, "circle", "triangle", "square")
Output:
[1] "triangle"
2. If the expression evaluates to a character string, the switch() function returns the value based on its name:
switch("red", "green"="apple", "orange"="carrot", "red"="tomato", "yellow"="lemon")
Output:
[1] "tomato"
If there are multiple matches, the first matched value is returned. It's also possible to add an unnamed item as the last argument of the switch() function that will be a default fallback option in the case of no matches. If this default option isn't set, and if there are no matches, the function returns NULL.
The switch() function is an efficient alternative to long if-else statements since it makes the code less repetitive and more readable. Typically, it's used for evaluating a single expression. We can still write more complex nested switch constructs for evaluating multiple expressions. However, in this form, the switch() function quickly becomes hard to read and hence loses its main advantage over if-else constructs.
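The default fallback and the NULL result can be sketched as follows. The function name shape_sides is hypothetical. The example also uses fall-through (a name with an empty value takes the next non-missing value), a documented feature of switch() that is not described above.

```r
shape_sides <- function(shape) {
  switch(shape,
         "triangle" = 3,
         "square" = ,      # empty value: falls through to the next one
         "rectangle" = 4,
         NA)               # unnamed last argument: default fallback
}

print(shape_sides("triangle"))  # 3
print(shape_sides("square"))    # 4
print(shape_sides("circle"))    # NA (no match, default is used)

# Numeric selector beyond the number of items: the result is NULL
out_of_range <- switch(4, "circle", "triangle", "square")
print(is.null(out_of_range))    # TRUE
```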
31. What is the difference between the functions apply(), lapply(), sapply(), and tapply()?
While all these functions allow iterating over a data structure without using loops and perform the same operation on each element of it, they are different in terms of the type of input and output and the function they perform.
- apply(): takes in a data frame, a matrix, or an array and returns a vector, a list, a matrix, or an array. This function can be applied row-wise, column-wise, or both.
- lapply(): takes in a vector, a list, or a data frame and always returns a list. In the case of a data frame as an input, this function is applied only column-wise.
- sapply(): takes in a vector, a list, or a data frame and returns the most simplified data structure possible: a vector if each result has length one, a matrix if all the results have the same length greater than one, and a list otherwise.
- tapply(): applies a function (typically one that calculates summary statistics) to groups of values defined by one or more factors (i.e., categorical data).
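A compact sketch shows all four functions on small made-up inputs and makes the difference in return types visible.

```r
m <- matrix(1:6, nrow = 2)  # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)  # MARGIN = 1: row-wise
col_sums <- apply(m, 2, sum)  # MARGIN = 2: column-wise

values <- list(a = 1:3, b = 4:6)
as_list   <- lapply(values, mean)  # always a list
as_vector <- sapply(values, mean)  # simplified to a named numeric vector

# tapply(): sum of amounts per group
amounts <- c(10, 20, 30, 40)
groups  <- c("x", "y", "x", "y")
per_group <- tapply(amounts, groups, sum)

print(row_sums)   # 9 12
print(col_sums)   # 3 7 11
print(as_list)    # list with a = 2, b = 5
print(as_vector)  # a = 2, b = 5
print(per_group)  # x = 40, y = 60
```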
32. List and define the control statements in R.
There are three groups of control statements in R: conditional statements, loop statements, and jump statements.
Conditional statements:
- if: tests whether a given condition is true and provides operations to perform if it's so.
- if-else: tests whether a given condition is true, provides operations to perform if it's so and another set of operations to perform in the opposite case.
- if... else if... else: tests a series of conditions one by one, provides operations to perform for each condition if it's true, and a fallback set of operations to perform if none of those conditions is true.
- switch: evaluates an expression against the items of a list and returns a value from the list based on the results of this evaluation.
Loop statements:
- for: in for loops, iterates over a sequence.
- while: in while loops, checks if a predefined logical condition (or several logical conditions) is met at the current iteration.
- repeat: in repeat loops, continues performing the same set of operations until a predefined break condition (or several break conditions) is met.
Jump statements:
- next: skips a particular iteration of a loop and jumps to the next one if a certain condition is met.
- break: stops and exits the loop at a particular iteration if a certain condition is met.
- return: exits a function and returns the result.
33. What are regular expressions, and how do you work with them in R?
A regular expression, or regex, in R or other programming languages, is a character or a sequence of characters that describes a certain text pattern and is used for mining text data. In R, there are two main ways of working with regular expressions:
- Using the base R and its functions (such as grep(), regexpr(), gsub(), regmatches(), etc.) to locate, match, extract, and replace regex.
- Using a specialized stringr package of the tidyverse collection. This is a more convenient way to work with R regex since the functions of stringr have much more intuitive names and syntax and offer more extensive functionality.
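The base R route can be sketched with a few calls. The patterns and input strings are made up for illustration, and the email pattern is deliberately simplistic. The grepl() and gregexpr() functions are close relatives of the functions listed above.

```r
emails <- c("ann@example.com", "not an email", "bob@test.org")

# Locate: which elements contain something that looks like an address?
pattern <- "[[:alnum:]._]+@[[:alnum:].]+"
has_email <- grepl(pattern, emails)

# Replace: mask every run of digits
masked <- gsub("[0-9]+", "#", "room 12, floor 3")

# Extract: pull all matches out of a string
text <- "call 555-1234 or 555-9876"
phones <- regmatches(text, gregexpr("[0-9]{3}-[0-9]{4}", text))[[1]]

print(has_email)  # TRUE FALSE TRUE
print(masked)     # "room #, floor #"
print(phones)     # "555-1234" "555-9876"
```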
A Guide to R Regular Expressions provides more detail about how to work with regex in R.
34. What packages are used for machine learning in R?
- caret—for various classification and regression algorithms.
- e1071—for support vector machines (SVM), naive Bayes classifier, bagged clustering, fuzzy clustering, and k-nearest neighbors (KNN).
- kernlab—provides kernel-based methods for classification, regression, and clustering algorithms.
- randomForest—for random forest classification and regression algorithms.
- xgboost—for gradient boosting, linear regression, and decision tree algorithms.
- rpart—for recursive partitioning in classification, regression, and survival trees.
- glmnet—for lasso and elastic-net regularization methods applied to linear regression, logistic regression, and multinomial regression algorithms.
- nnet—for neural networks and multinomial log-linear algorithms.
- tensorflow—the R interface to TensorFlow, for deep neural networks and numerical computation using data flow graphs.
- Keras—the R interface to Keras, for deep neural networks.
35. How to select features for machine learning in R?
Let's consider three different approaches and how to implement them in the caret package.
- By detecting and removing highly correlated features from the dataset.
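We need to create a correlation matrix of all the features and then identify the highly correlated ones, usually those with an absolute pairwise correlation greater than 0.75. The findCorrelation() function returns the column indices of the features suggested for removal:
library(caret)
# Correlation matrix of the (numeric) features
corr_matrix <- cor(features)
# Indices of the columns recommended for removal
highly_correlated <- findCorrelation(corr_matrix, cutoff = 0.75)
print(highly_correlated)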
- By ranking the data frame features by their importance.
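We need to create a training scheme to control the parameters of train(), use it to build a selected model, and then estimate the variable importance for that model:
# 10-fold cross-validation repeated 5 times
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Learning vector quantization, a classification model (the response must be a factor)
model <- train(response_variable ~ ., data = df, method = "lvq", preProcess = "scale", trControl = control)
# Rank the features by their estimated importance
importance <- varImp(model)
print(importance)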
- By automatically selecting the optimal features.
One of the most popular methods provided by caret for automatically selecting the optimal features is a backward selection algorithm called Recursive Feature Elimination (RFE).
We need to define the control using a selected resampling method and a predefined list of functions, apply the RFE algorithm, passing to it the features, the target variable, the candidate feature subset sizes to evaluate, and the control, and then extract the selected predictors:
# 10-fold cross-validation with caret's generic helper functions
control <- rfeControl(functions = caretFuncs, method = "cv", number = 10)
# Evaluate feature subsets of sizes 1 to 8
results <- rfe(features, target_variable, sizes = 1:8, rfeControl = control)
# Names of the features in the best-performing subset
print(predictors(results))
If you need to strengthen your machine learning skills in R, here is a solid and comprehensive resource: Machine Learning Scientist with R.
36. What are correlation and covariance, and how do you calculate them in R?
Correlation is a measure of the strength and direction of the linear relationship between two variables. It takes values from -1 (a perfect negative correlation) to 1 (a perfect positive correlation). Covariance measures how two variables vary together and indicates the direction of the linear relationship between them. Unlike correlation, covariance has no fixed range, and its value depends on the units of the variables.
In R, we calculate the correlation with the cor() function and the covariance with the cov() function. The syntax of both functions is identical: we pass in either the two variables (vectors) for which we want to calculate the measure (e.g., cor(vector_1, vector_2) or cov(vector_1, vector_2)), or a whole data frame of numeric columns if we want the correlation or covariance between all of its variables (e.g., cor(df) or cov(df)). For two vectors, the result is a single value; for a data frame, the result is a correlation (or covariance) matrix.
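The short sketch below shows both call styles and illustrates why covariance, unlike correlation, has no fixed range. The height and weight values are invented for the example.

```r
# Made-up measurements for five people.
height_m  <- c(1.60, 1.70, 1.75, 1.80, 1.90)
weight_kg <- c(55, 65, 72, 80, 90)

# Two vectors in, a single value out.
cor_m <- cor(height_m, weight_kg)
cov_m <- cov(height_m, weight_kg)

# Changing the unit of one variable rescales the covariance
# but leaves the correlation untouched.
height_cm <- height_m * 100
cor_cm <- cor(height_cm, weight_kg)   # same as cor_m
cov_cm <- cov(height_cm, weight_kg)   # 100 times cov_m

# A data frame in, a matrix out (one row and one column per variable).
measurements <- data.frame(height_m, weight_kg)
cor_matrix <- cor(measurements)
cov_matrix <- cov(measurements)
```

The diagonal of cor_matrix is always 1, while the diagonal of cov_matrix holds the variance of each variable.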
37. List and define the various approaches to estimating model accuracy in R.
Below are several approaches and how to implement them in the caret package of R.
- Data splitting: the entire dataset is split into a training dataset and a test dataset. The first one is used to fit the model, and the second one is used to test its performance on unseen data. This approach works particularly well on big data. To implement data splitting in R, we need to use the createDataPartition() function and set the p parameter to the proportion of data that goes to training.
- Bootstrap resampling: random samples are drawn from the dataset with replacement, and the model is estimated on them. Such resampling iterations are run many times. To implement bootstrap resampling in R, we need to set the method parameter of the trainControl() function to "boot" when defining the training control of the model.
- Cross-validation methods:
  - k-fold cross-validation: the dataset is split into k subsets. The model is trained on k-1 subsets and tested on the remaining one. The same process is repeated so that each subset serves as the test set once, and then the final model accuracy is estimated as the average across the folds.
  - Repeated k-fold cross-validation: the principle is the same as for k-fold cross-validation, except that the dataset is split into k subsets more than once. For each repetition, the model accuracy is estimated, and then the final model accuracy is calculated as the average of the model accuracy values for all repetitions.
  - Leave-one-out cross-validation (LOOCV): one data observation is put aside, and the model is trained on all the other data observations. The same process is repeated for all data observations.
To implement these cross-validation methods in R, we need to set the method parameter of the trainControl() function to "cv", "repeatedcv", or "LOOCV", respectively, when defining the training control of the model.
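To show what k-fold cross-validation does under the hood, here is a package-free sketch in base R. It is not how caret implements it internally. The choice of the built-in mtcars dataset, the linear model mpg ~ wt + hp, and RMSE as the error metric are assumptions made for the example.

```r
set.seed(42)                                  # reproducible fold assignment
k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds at random, keeping fold sizes balanced.
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (fold in seq_len(k)) {
  train_rows <- mtcars[fold_id != fold, ]     # k-1 folds for fitting
  test_rows  <- mtcars[fold_id == fold, ]     # the held-out fold for testing

  fit <- lm(mpg ~ wt + hp, data = train_rows)
  predictions <- predict(fit, newdata = test_rows)

  # Root mean squared error on the held-out fold.
  fold_rmse[fold] <- sqrt(mean((test_rows$mpg - predictions)^2))
}

# Final estimate: the average error across all folds.
cv_rmse <- mean(fold_rmse)
```

With caret, the same resampling scheme is requested declaratively by passing trainControl(method = "cv", number = 5) to train() through its trControl argument.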
38. What is the chi-squared test, and how do you perform it in R?
The chi-squared statistical hypothesis test is a technique used to determine whether two categorical variables are independent or whether there is an association between them. To perform the chi-squared test in R, we need to use the chisq.test() function of the stats package. The steps are as follows:
1. Create a contingency table with the categorical variables of interest using the base R table() function:
contingency_table <- table(df$var_1, df$var_2)
2. Pass the contingency table to the chisq.test() function:
chisq.test(contingency_table)
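The two steps can be tried end to end on a small invented dataset. The trial data frame below, with its treatment and outcome columns and their counts, is made up purely for illustration.

```r
# Invented data: 50 patients per group, with different improvement rates.
trial <- data.frame(
  treatment = rep(c("drug", "placebo"), each = 50),
  outcome   = c(rep(c("improved", "not improved"), times = c(35, 15)),
                rep(c("improved", "not improved"), times = c(20, 30)))
)

# Step 1: contingency table of the two categorical variables.
contingency_table <- table(trial$treatment, trial$outcome)

# Step 2: chi-squared test of independence.
# For 2 x 2 tables, chisq.test() applies Yates' continuity correction by default.
result <- chisq.test(contingency_table)

result$statistic   # the chi-squared statistic
result$parameter   # degrees of freedom
result$p.value     # a small p-value is evidence against independence
```

For this invented data the p-value falls below 0.05, so the hypothesis that treatment and outcome are independent would be rejected at the 5% level.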
You can refresh your knowledge of chi-squared tests and other hypothesis tests in our Hypothesis Testing in R course.
39. What is Shiny in R?
Shiny is an open-source R package that allows the easy and fast building of fully interactive web applications and webpages for data science using only R, without any knowledge of HTML, CSS, or JavaScript. Shiny in R offers numerous basic and advanced features, widgets, layouts, web app examples, and their underlying code to build upon and customize, as well as user showcases from various fields (technology, sports, banking, education, etc.) gathered and categorized by the Shiny app developer community.
40. What is the difference between the with() and within() functions?
The with() function evaluates an R expression on one or more variables of a data frame and outputs the result without modifying the data frame. The within() function evaluates an R expression on one or more variables of a data frame and returns a modified copy of the data frame, leaving the original unchanged. Below we can see how these functions work using a sample data frame as an example:
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
print(df)
with(df, a * b)
print(within(df, c <- a * b))
Output:
  a  b
1 1 10
2 2 20
3 3 30
[1] 10 40 90
  a  b  c
1 1 10 10
2 2 20 40
3 3 30 90
When using the within() function, to save the modifications, we need to assign the output of the function to a variable.
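A brief sketch of that last point, using a hypothetical data frame named scores:

```r
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# The result is computed but not stored, so scores is unchanged.
within(scores, c <- a * b)
columns_before <- names(scores)   # still only "a" and "b"

# Assigning the result back keeps the new column.
scores <- within(scores, c <- a * b)
columns_after <- names(scores)    # now "a", "b", and "c"
```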
Conclusion
To conclude, in this article, we considered the 40 most common R programming interview questions and what answers are expected for each of them. Hopefully, with this information in hand, you feel more confident and ready for a successful R interview, whether you're looking for a job in R or the right candidate for an open position in your company.
To get some hands-on experience answering questions, check out our Practicing Statistics Interview Questions in R course.
IBM Certified Data Scientist (2020), previously Petroleum Geologist/Geomodeler of oil and gas fields worldwide with 12+ years of international work experience. Proficient in Python, R, and SQL. Areas of expertise: data cleaning, data manipulation, data visualization, data analysis, data modeling, statistics, storytelling, machine learning. Extensive experience in managing data science communities and writing/reviewing articles and tutorials on data science and career topics.