Data Types in R

Learn about data types and their importance in a programming language. More specifically, learn how to use various data types like vector, matrices, lists, and dataframes in the R programming language.
Jan 22, 2020  · 12 min read

r data structures diagram

Before we start with the introduction and learn about various data types in R, let's quickly set up the R environment both on the Terminal and Jupyter Notebook.

The following command installs R on macOS using Homebrew.

brew install r --build-from-source

To verify the installation has been successful, just type R (upper-case) in the terminal, and you will enter into an R session, as shown below.

r session

For installation on other operating systems, feel free to check this tutorial.

Now let's add the R programming language as a kernel in Jupyter Notebook. Make sure you already have Jupyter Notebook installed on your system.

Go to your terminal, open an R session, and enter the two commands below, which will add the R kernel to Jupyter Notebook.

install.packages('IRkernel')
IRkernel::installspec()

Once the above two commands are successful, run jupyter from the terminal and open a notebook with R kernel as shown below:

r kernel

Now you are all set to write your first R code on jupyter notebook.

Introduction

To make use of R to the fullest, it is very important to know and understand various data types and data structures that exist in R and how they function. They play a key role in almost all problems and especially when you are working on machine learning problems, which are very data-centric.

In a programming language, we usually need variables to store information, which can be an integer, a character, a floating-point number, a boolean, and so on. The type of a variable depends on the kind of information it holds: if it is assigned an integer, the variable has an integer data type. Variables are names for reserved memory locations where values are stored. As soon as you create a variable, some memory is reserved for it.

The amount of memory allocated for a variable depends on its data type. For example, R stores each element of an integer vector in 4 bytes and each element of a numeric (double) vector in 8 bytes, in addition to a fixed overhead for the object itself.
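
As a rough illustration of how storage depends on the data type, the sketch below compares two vectors of the same length using object.size() from the utils package. The variable names are made up for this example, and the exact byte counts depend on the platform and R version, so no specific numbers are claimed.

```r
# Compare the memory footprint of an integer vector and a double vector
# of the same length. Exact sizes vary by platform and R version.
int_vec <- integer(1000)   # 1000 integers
dbl_vec <- numeric(1000)   # 1000 doubles

int_size <- as.numeric(utils::object.size(int_vec))
dbl_size <- as.numeric(utils::object.size(dbl_vec))

print(int_size)
print(dbl_size)
print(dbl_size > int_size)  # the double vector needs more space
```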

In programming languages like C, C++, and Java, variables are declared with a data type; in Python and R, however, variables are objects. An object is a data structure that has attributes and methods that operate on those attributes.

This tutorial discusses several kinds of R objects, or data structures: lists, vectors, matrices, and data frames.

Let's first understand some of the basic datatypes on which the R-objects are built like Numeric, Integer, Character, Factor, and Logical.

  • Numeric: Numbers that have a decimal part have the data type numeric. This is also R's default type for numbers, so a value typed without a decimal point, such as 5, is stored as numeric as well.
num <- 1.2
print(num)
[1] 1.2

You can check the data type of a variable using the class() function.

class(num)
'numeric'
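  • Integer: Numbers that do not contain decimal values have the data type integer. However, R treats plain numbers as numeric by default, so to create an integer you explicitly call as.integer() and pass the value as an argument. Note that as.integer() drops the fractional part rather than rounding.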
int <- as.integer(2.2)
print(int)
[1] 2
class(int)
'integer'
  • Character: As the name suggests, a letter or a combination of letters enclosed in quotes is treated as the character data type by R. The quoted text can contain letters or digits.
char <- "datacamp"
print(char)
[1] "datacamp"
class(char)
'character'
char <- "12345"
print(char)
[1] "12345"
class(char)
'character'
  • Logical: A variable that can hold the value TRUE or FALSE, like a boolean, is called a logical variable.
log_true <- TRUE
print(log_true)
[1] TRUE
class(log_true)
'logical'
log_false <- FALSE
print(log_false)
[1] FALSE
class(log_false)
'logical'
  • Factor: Factors are a data type used to represent categorical (qualitative) values such as colors, good and bad, or course and movie ratings. They are useful in statistical modeling.

To create a factor, you will make use of the c() function, which combines its arguments into a one-dimensional vector, and pass the result to factor().

fac <- factor(c("good", "bad", "ugly","good", "bad", "ugly"))
print(fac)
[1] good bad  ugly good bad  ugly
Levels: bad good ugly
class(fac)
'factor'

The fac factor has three levels (bad, good, and ugly, sorted alphabetically by default), which can be checked using the levels() function. The levels themselves are of type character.

levels(fac)
  1. 'bad'
  2. 'good'
  3. 'ugly'
nlevels(fac)
3
class(levels(fac))
'character'
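
To tie the basic types together, here is a small sketch that is not part of the original notebook. It shows the L suffix as an alternative way to create an integer, and uses typeof() to look at the underlying storage type. The variable names x, y, z, and f are chosen only for illustration.

```r
# Whole numbers are stored as doubles unless marked as integers.
x <- 2                 # class "numeric"
y <- 2L                # the L suffix creates an integer directly
z <- as.integer(2.9)   # conversion drops the fractional part

print(class(x))        # "numeric"
print(class(y))        # "integer"
print(z)               # 2
print(is.numeric(y))   # TRUE: integers also count as numeric

# typeof() reports the internal storage type.
print(typeof(x))       # "double"
f <- factor(c("good", "bad"))
print(typeof(f))       # "integer": factors are stored as integer codes
```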

Before moving forward, let us understand a couple of important tips that can come in handy!

  • Always remember that the R programming language is case-sensitive. Objects defined above must be referenced with exactly the same capitalization, as the example below shows: num exists, but Num does not.
Num
Error in eval(expr, envir, enclos): object 'Num' not found
  • In R, you can list all the variables or objects that you have defined in the working environment using the ls() function, as shown below.
ls()
  1. 'char'
  2. 'int'
  3. 'num'

Lists

Unlike vectors, a list can contain elements of various data types and is often known as an ordered collection of values. It can contain vectors, functions, matrices, and even another list inside it (nested-list).

Lists in R are one-indexed, i.e., the index starts with one.

Let's understand the concept of lists with a quick example that will have three different types of data types stored in one list.

lis1 <- 1:5  # Integer Vector
lis1
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
lis2 <- factor(1:5)  # Factor Vector
lis2
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
Levels:
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
lis3 <- letters[1:5]  # Character Vector
lis3
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
combined_list <- list(lis1, lis2, lis3)
combined_list
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5

[[3]]
[1] "a" "b" "c" "d" "e"

Let's access each vector in the list separately. To achieve this, you will use double square brackets since the three vectors are placed on one level inside the list.

combined_list[[1]]

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5

combined_list[[2]]

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
Levels:
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
combined_list[[3]]
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'

Now, let us try to access the fifth element from the third vector, which gives the letter e.

combined_list[[3]][5]
'e'
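
The difference between single and double square brackets is easy to miss, so here is a minimal sketch using a hypothetical named list called student (not part of the original example). Single brackets return a smaller list, while double brackets and the $ operator return the element itself.

```r
# A named list; the names and values are made up for illustration.
student <- list(name = "Asha", marks = c(88, 65, 90), passed = TRUE)

sub_list  <- student["marks"]     # single brackets return a list
marks_vec <- student[["marks"]]   # double brackets return the element itself

print(class(sub_list))    # "list"
print(class(marks_vec))   # "numeric"
print(student$marks[2])   # 65; $ is shorthand for [[ ]] with a name
print(names(student))     # "name" "marks" "passed"
```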

Finally, let's try to flatten the list. One important thing to remember is that since combined_list contains both character and numeric data, the character data type takes precedence, and the flattened result becomes a character vector.

flat_list <- unlist(combined_list)
class(flat_list)
'character'
flat_list
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '1'
  7. '2'
  8. '3'
  9. '4'
  10. '5'
  11. 'a'
  12. 'b'
  13. 'c'
  14. 'd'
  15. 'e'
length(flat_list)
15
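
For comparison, the following sketch flattens two small hypothetical lists, nums and mixed. It suggests that unlist() only produces a character vector when at least one element is character; a list of purely numeric vectors stays numeric.

```r
# A list holding only numbers flattens to a numeric vector.
nums <- list(1:3, c(4.5, 5.5))
flat_nums <- unlist(nums)
print(class(flat_nums))    # "numeric"
print(length(flat_nums))   # 5

# Adding character data forces the whole result to character.
mixed <- list(1:3, c("a", "b"))
flat_mixed <- unlist(mixed)
print(class(flat_mixed))   # "character"

# lengths() reports the length of each list element.
print(lengths(mixed))      # 3 2
```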

Vectors

A vector is an object used to store multiple values of the same data type. A vector cannot hold a combination of integers and characters. For example, if you want to store 100 students' total marks, instead of creating 100 different variables, one for each student, you would create a vector of length 100 that stores all the students' marks.

A vector can be created with a function c(), which will combine all the elements and return a one-dimensional array.

Let's create a numeric vector, marks, holding the marks of five students.

marks <- c(88,65,90,40,65)
class(marks)
'numeric'

Let us check the length of the vector, which should return the number of elements contained in it.

length(marks)
5

Now, let's try to access a specific element by its index.

marks[4]
40
marks[5]
65
marks[6] #returns NA since there is no sixth element in the vector

<NA>

  • Slicing: Similar to Python, slicing can be applied in R as well. Unlike Python, R indices start at one and both ends of the range are included.

    Let's try to access the second through fifth elements using slicing.

marks[2:5]
  1. 65
  2. 90
  3. 40
  4. 65
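
Besides positions and ranges, R vectors can also be indexed with negative numbers and logical conditions. The sketch below reuses the marks values from above; the threshold of 60 is an arbitrary choice for illustration.

```r
marks <- c(88, 65, 90, 40, 65)

print(marks[-1])            # everything except the first element: 65 90 40 65
print(marks[c(1, 3)])       # first and third elements: 88 90
print(marks[marks > 60])    # elements that satisfy a condition: 88 65 90 65
print(which(marks == 65))   # positions of matching elements: 2 5
```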

Let's now create a character vector, which works the same way as creating a numeric vector.

char_vector <- c("a", "b", "c")
print(char_vector)
[1] "a" "b" "c"
class(char_vector)

'character'

length(char_vector)
3
char_vector[1:3]
  1. 'a'
  2. 'b'
  3. 'c'

If we create a vector that has both numeric and character values, the numeric values will get converted to a character data type.

char_num_vec <- c(1,2, "a")
char_num_vec
  1. '1'
  2. '2'
  3. 'a'
class(char_num_vec)
'character'
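
This conversion follows a general rule: when types are mixed inside c(), R converts everything to the most flexible type present (logical, then integer, then numeric, then character). A short sketch of that rule follows; the variable name converted is hypothetical.

```r
print(class(c(TRUE, 2L)))    # "integer": TRUE becomes 1
print(class(c(2L, 3.5)))     # "numeric"
print(class(c(3.5, "a")))    # "character"

# Converting back to numbers turns non-numeric strings into NA.
# suppressWarnings() hides the coercion warning R would otherwise print.
converted <- suppressWarnings(as.numeric(c("1", "2", "a")))
print(converted)             # 1 2 NA
```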

Let's create a vector of 1024 values with the help of the colon operator, which generates a sequence of integers.

vec <- c(1:1024)

Now, try to access the last and the middle elements. To do that, you will use the length() function.

vec[length(vec)]
1024
vec[length(vec)/2]
512
  • How do you create a vector of odd numbers?

To create a vector of odd numbers, you can use the seq() function, which takes three parameters: the start (from), the end (to), and the step size (by).

seq(1,10, by = 2)
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
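
As a small variation on the same idea, seq() can also produce even numbers or a fixed number of evenly spaced points, and the modulo operator %% offers another way to pick out odd values. The names below are illustrative only.

```r
odds  <- seq(1, 10, by = 2)                # 1 3 5 7 9
evens <- seq(2, 10, by = 2)                # 2 4 6 8 10
five_points <- seq(0, 1, length.out = 5)   # 0 0.25 0.5 0.75 1

# Alternative: filter a range, keeping values with remainder 1 when divided by 2.
all_nums <- 1:10
odds_by_filter <- all_nums[all_nums %% 2 == 1]

print(odds)
print(evens)
print(five_points)
print(odds_by_filter)                      # 1 3 5 7 9
```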

Matrix

Similar to a vector, a matrix stores values of a single data type. However, unlike vectors, matrices hold two-dimensional data.

The syntax of defining a matrix is:

M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE signifies that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default).

Let's quickly define a matrix M of shape $2\times3$.

M = matrix( c('AI','ML','DL','Tensorflow','Pytorch','Keras'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
     [,1]         [,2]      [,3]   
[1,] "AI"         "ML"      "DL"   
[2,] "Tensorflow" "Pytorch" "Keras"
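
To see the effect of byrow and dimnames more directly, here is a minimal sketch that fills two matrices from the same values 1 to 6. The row and column names (r1, r2, a, b, c) are invented for this example.

```r
values <- 1:6

by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)  # fill row by row
by_col <- matrix(values, nrow = 2, ncol = 3)                # default: fill column by column

print(by_row)   # first row is 1 2 3, second row is 4 5 6
print(by_col)   # first column is 1 2, second is 3 4, third is 5 6

# dimnames lets you refer to rows and columns by name.
named <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE,
                dimnames = list(c("r1", "r2"), c("a", "b", "c")))
print(named["r2", "b"])   # 5
print(dim(named))         # 2 3
```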

Let's use the slicing concept and fetch elements from a row and column.

M[1:2,1:2] #the first dimension selects both rows while the second dimension will select
#elements from 1st and 2nd column
     [,1]         [,2]     
[1,] "AI"         "ML"     
[2,] "Tensorflow" "Pytorch"
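
Leaving one of the two indices empty selects a whole row or column. The sketch below rebuilds the same matrix M and shows a few such selections; note that selecting a single row returns a plain vector unless drop = FALSE is given.

```r
M <- matrix(c("AI", "ML", "DL", "Tensorflow", "Pytorch", "Keras"),
            nrow = 2, ncol = 3, byrow = TRUE)

first_row <- M[1, ]              # whole first row, as a character vector
third_col <- M[, 3]              # whole third column
one_cell  <- M[2, 1]             # single element
kept      <- M[1, , drop = FALSE]  # first row, kept as a 1 x 3 matrix

print(first_row)   # "AI" "ML" "DL"
print(third_col)   # "DL" "Keras"
print(one_cell)    # "Tensorflow"
print(dim(kept))   # 1 3
```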

DataFrame

A data frame is a more generalized form of a matrix. It contains data in a tabular fashion. Unlike a matrix, a data frame can hold columns of different data types: the first column can be character, the second integer, and the third logical.

The variables or features are arranged in columns, whose names form the header, while the observations are arranged in rows. Each data row starts with the row name, followed by the actual data.

A data frame can be created using the data.frame() function.

Data frames are widely used for reading comma-separated (CSV) and other text files. Their use is not limited to reading data: you can also use them for machine learning problems, especially when dealing with numerical data. Data frames are useful for understanding the data, data wrangling, plotting, and visualizing.

Let's create a dummy dataset and learn some data frame specific functions.

dataset <- data.frame(
   Person = c("Aditya", "Ayush","Akshay"),
   Age = c(26, 26, 27),
   Weight = c(81,85, 90),
   Height = c(6,5.8,6.2),
   Salary = c(50000, 80000, 100000)
)
print(dataset)
  Person Age Weight Height Salary
1 Aditya  26     81    6.0  5e+04
2  Ayush  26     85    5.8  8e+04
3 Akshay  27     90    6.2  1e+05
class(dataset)
'data.frame'
nrow(dataset) # this will give you the number of rows that are there in the dataset dataframe
3
ncol(dataset) # this will give you the number of columns that are there in the dataset dataframe
5
df1 = rbind(dataset, dataset) # a row bind which will append the arguments in row fashion.
df1
A data.frame: 6 × 5
Person  Age    Weight  Height  Salary
<fct>   <dbl>  <dbl>   <dbl>   <dbl>
Aditya  26     81      6.0     5e+04
Ayush   26     85      5.8     8e+04
Akshay  27     90      6.2     1e+05
Aditya  26     81      6.0     5e+04
Ayush   26     85      5.8     8e+04
Akshay  27     90      6.2     1e+05
df2 = cbind(dataset, dataset) # a column bind which will append the arguments in column fashion.
df2
A data.frame: 3 × 10
Person  Age    Weight  Height  Salary  Person  Age    Weight  Height  Salary
<fct>   <dbl>  <dbl>   <dbl>   <dbl>   <fct>   <dbl>  <dbl>   <dbl>   <dbl>
Aditya  26     81      6.0     5e+04   Aditya  26     81      6.0     5e+04
Ayush   26     85      5.8     8e+04   Ayush   26     85      5.8     8e+04
Akshay  27     90      6.2     1e+05   Akshay  27     90      6.2     1e+05
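
Individual columns, cells, and subsets of rows can be pulled out of a data frame with the $ operator and square brackets. The following sketch uses a reduced copy of the dummy dataset; the AgeNextYear column is a made-up addition to show how a new column is created.

```r
dataset <- data.frame(
  Person = c("Aditya", "Ayush", "Akshay"),
  Age    = c(26, 26, 27),
  Weight = c(81, 85, 90)
)

print(dataset$Age)             # one column as a vector: 26 26 27
print(dataset[2, "Weight"])    # row 2, column "Weight": 85

# Keep only the rows where Age is greater than 26.
older <- dataset[dataset$Age > 26, ]
print(older)

# Assigning to a new name with $ adds a column.
dataset$AgeNextYear <- dataset$Age + 1
print(ncol(dataset))           # 4
```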

Let's look at the head function which is very useful when you have millions of records and you want to look at only the first few rows of your data. Similarly, the tail function will output the last few rows of your data.

head(df1,3) # here only three rows will be printed
A data.frame: 3 × 5
  Person  Age    Weight  Height  Salary
  <fct>   <dbl>  <dbl>   <dbl>   <dbl>
1 Aditya  26     81      6.0     5e+04
2 Ayush   26     85      5.8     8e+04
3 Akshay  27     90      6.2     1e+05
str(dataset) #this returns the individual class or data type information for each column.
'data.frame':    3 obs. of  5 variables:
 $ Person: Factor w/ 3 levels "Aditya","Akshay",..: 1 3 2
 $ Age   : num  26 26 27
 $ Weight: num  81 85 90
 $ Height: num  6 5.8 6.2
 $ Salary: num  5e+04 8e+04 1e+05

Note that the Person column appears as a factor because this output was produced with a version of R earlier than 4.0.0, where data.frame() converted strings to factors by default. In R 4.0.0 and later, the column stays character unless you pass stringsAsFactors = TRUE.

Now let's look at the summary() function, which comes in handy when you want to understand the statistics of your dataset. As shown below, for each numeric column it reports the minimum, first quartile, median, mean, third quartile, and maximum, from which you can get some intuition about the distribution of your data. For factor columns, it counts the observations per level. It also reports the number of missing values, if there are any.

summary(dataset)
    Person       Age            Weight          Height        Salary      
 Aditya:1   Min.   :26.00   Min.   :81.00   Min.   :5.8   Min.   : 50000  
 Akshay:1   1st Qu.:26.00   1st Qu.:83.00   1st Qu.:5.9   1st Qu.: 65000  
 Ayush :1   Median :26.00   Median :85.00   Median :6.0   Median : 80000  
            Mean   :26.33   Mean   :85.33   Mean   :6.0   Mean   : 76667  
            3rd Qu.:26.50   3rd Qu.:87.50   3rd Qu.:6.1   3rd Qu.: 90000  
            Max.   :27.00   Max.   :90.00   Max.   :6.2   Max.   :100000  
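
The same statistics can be computed for a single column with quantile() and mean(). The sketch below uses the Salary values from the dummy dataset and then appends an NA (a hypothetical missing value, not present in the original data) to show how missing values are counted and skipped.

```r
salary <- c(50000, 80000, 100000)

print(quantile(salary))   # 0%, 25%, 50%, 75%, 100%: 50000 65000 80000 90000 100000
print(mean(salary))

# Add a missing value to see how it is handled.
with_missing <- c(salary, NA)
print(summary(with_missing))              # includes a count of NA's
print(sum(is.na(with_missing)))           # 1
print(mean(with_missing, na.rm = TRUE))   # NA is skipped, so the mean is unchanged
```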

Conclusion

Congratulations on finishing the tutorial.

This tutorial is a good starting point for beginners who are curious to learn the R programming language. As a good exercise, feel free to check out more helper functions related to each data type.

There is a lot more to R that remains to be explored, such as Conditionals and Control Flow in R, Utilities in R, and, most exciting of all, Machine Learning using R, which will be covered in future tutorials, so stay tuned!

Please feel free to ask any questions related to this tutorial in the comments section below.

If you would like to learn more about R, take DataCamp's Intermediate R course and check out the Introduction to Data frames in R tutorial.
