R data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with this particular data structure and it doesn’t always seem to be straightforward. But does it really need to be so?
Well, not necessarily.
With today’s post, DataCamp wants to show you that these R data structures don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts.
The Root: What’s an R Data Frame Exactly?
With the data frame, R offers you a great first step by allowing you to store your data in overviewable, rectangular grids. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable.
This means that a data frame’s rows do not need to contain, but can contain, the same type of values: they can be numeric, character, logical, etc.
In the example data set, each instance, listed in the first unnamed column with a number, has certain characteristics that are spread out over the remaining three columns. Each column needs to consist of values of the same type, since they are data vectors: as such, the breaks column only contains numerical values, while the wool and tension columns have characters as values that are stored as factors.
In case you’re wondering, this data is about the number of breaks in yarn during weaving :).
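The description above matches the built-in warpbreaks data set; assuming that this is indeed the example meant here, a minimal sketch to inspect it looks like this:

```r
# warpbreaks ships with base R (package "datasets"), so nothing needs to be installed.
# Assumption: this is the example data set that the text describes.
head(warpbreaks)            # first six rows: breaks, wool, tension

# Each column is a vector with a single type
column_classes <- sapply(warpbreaks, class)
column_classes              # breaks is numeric, wool and tension are factors
```

Because every column is a vector of one type, sapply() can report one class per column.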
Remember that factors are variables that can only contain a limited number of different values. As such, they are often called categorical variables.
Maybe you will have already noticed that this data structure resembles that of matrices, except that a data frame’s values don’t need to be of the same type, while a matrix’s values do.
Data frames also have similarities with lists, which are basically collections of components. However, a data frame is a list whose components are vectors of the same length. As such, data frames can actually be seen as special types of lists and can be accessed as either a matrix or a list.
If you want more information or if you just want to review and take a look at a comparison of the five general data structures in R, watch the small video below:
As you can see, there are different data structures that impose different requirements on how the data is stored. Data frames are particularly handy to store multiple data vectors, which makes it easier to organize your data, to apply functions to it and to save your work.
It’s a bit like having a single spreadsheet whose columns all have equal lengths!
The Basics: Questions and Solutions
How to Create a Simple Data Frame in R
Even though looking at built-in examples of this data structure, such as esoph, is interesting, it can easily get more exciting by practising with your own examples! You can do this very easily by making some vectors first:
Died.At <- c(22, 40, 72, 41)
Writer.At <- c(16, 18, 36, 36)
First.Name <- c("John", "Edgar", "Walt", "Jane")
Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")
Next, you just combine the vectors that you made with the data.frame() function.
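As a minimal sketch, the vectors from above (repeated here so that the example runs on its own) can be combined like this:

```r
# The six vectors, each of length 4
Died.At <- c(22, 40, 72, 41)
Writer.At <- c(16, 18, 36, 36)
First.Name <- c("John", "Edgar", "Walt", "Jane")
Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")

# Combine them; the vector names become the column names
writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death)
writers_df
```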
Remember that this type of data structure requires variables of the same length. Check if you have put an equal number of arguments in all c() functions that you assign to the vectors and that you have indicated strings of words with quotation marks ("").
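Also, note that in R versions before 4.0.0, the data.frame() function imports character variables as factors or categorical variables by default. Since R 4.0.0, the stringsAsFactors argument defaults to FALSE, so you need to pass stringsAsFactors = TRUE to get this behavior. Use the str() function to get to know more about writers_df.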
However, if you’re more interested in inspecting the first and the last lines of writers_df, you can use the head() and tail() functions, respectively.
When strings are converted to factors, you see that the First.Name, Second.Name, Sex and Date.Of.Death variables of writers_df have all been read in as factors.
But do you really want this?
For the variables First.Name and Second.Name, you don’t want this. You can use the I() function to insulate them. This function inhibits the interpretation of its arguments. In other words, by just slightly changing the definitions of the vectors First.Name and Second.Name with the addition of the I() function, you can make sure that the proper names are not interpreted as factors.
You can keep the Sex vector as a factor, because there is only a limited number of possible values that this variable can have.
For the variable Date.Of.Death, you don’t want to have a factor either. It would be better if the values are registered as dates. You can add the as.Date() function to this variable to make sure this happens.
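One way to put these three choices together is sketched below. The vectors are wrapped directly inside the data.frame() call, which is a layout choice made for this example and not necessarily how the original code chunk did it:

```r
writers_df <- data.frame(
  Died.At = c(22, 40, 72, 41),
  Writer.At = c(16, 18, 36, 36),
  # I() protects the names from being converted to factors
  First.Name = I(c("John", "Edgar", "Walt", "Jane")),
  Second.Name = I(c("Doe", "Poe", "Whitman", "Austen")),
  # Sex has only a few possible values, so a factor is fine
  Sex = factor(c("MALE", "MALE", "MALE", "FEMALE")),
  # as.Date() stores the values as real dates
  Date.Of.Death = as.Date(c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18"))
)

str(writers_df)
```

The output of str() should now list the names with class AsIs, Sex as a factor and Date.Of.Death as a Date.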
If you use other functions such as read.table() or other functions that are used to input data, such as read.delim(), you’ll get back a data frame as the result. This way, files that look like this one below or files that have other delimiters, will be converted once they are read into R with these functions.
22, 16, John, Doe, MALE, 2015-05-10
40, 18, Edgar, Poe, MALE, 1849-10-07
72, 36, Walt, Whitman, MALE, 1892-03-26
41, 36, Jane, Austen, FEMALE, 1817-07-18
How to Change a Data Frame’s Row and Column Names
Data frames can also have a names attribute, by which you can see the names of the variables that you have included. In other words, you can also set a header.
You already did this before when making writers_df: the names of the variables Died.At, Writer.At, First.Name, Second.Name, Sex and Date.Of.Death became its column names.
Now that you see the names of writers_df, you’re not so sure if these are efficient or correct. To change the names that appear, you can easily continue using the names() function.
Make sure, though, that you have a number of arguments in the c() function that is equal to the number of variables that you have included into writers_df. In this case, since there are six variables, the last of which becomes Death, you want six arguments in the c() function. Otherwise, the remaining variable names will be set to “NA”.
Note also how the arguments of the c() function are inputted as strings!
Tip: try to leave out the two last arguments from the c() function and see what happens!
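A minimal sketch of the renaming step is shown here. The names Age.At.Death, Age.As.Writer, Name, Gender and Death are the ones used in the rest of this tutorial; the name Surname for the fourth column is an assumption made for this example:

```r
writers_df <- data.frame(
  Died.At = c(22, 40, 72, 41),
  Writer.At = c(16, 18, 36, 36),
  First.Name = c("John", "Edgar", "Walt", "Jane"),
  Second.Name = c("Doe", "Poe", "Whitman", "Austen"),
  Sex = c("MALE", "MALE", "MALE", "FEMALE"),
  Date.Of.Death = c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")
)

names(writers_df)   # the original header

# Six columns, so six new names ("Surname" is an assumed name)
names(writers_df) <- c("Age.At.Death", "Age.As.Writer", "Name",
                       "Surname", "Gender", "Death")
names(writers_df)
```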
Note that you can also access and change the column and row names with the functions colnames() and rownames().
As you already know, this data structure has similarities to matrices. This means that the size is determined by how many rows and columns you have combined into it.
To check how many rows and columns you have in writers_df, you can use the dim() function.
The result of this function is represented as [1] 4 6. Just like a matrix, the dimensions are defined by the number of rows, followed by the number of columns.
Note that you can also just retrieve the number of rows or columns by adding [ ] with an index to your dim() call: index 1 gives the number of rows, index 2 the number of columns.
You can also retrieve the number of rows and columns of writers_df by using the functions nrow() and ncol(), to retrieve the number of rows or columns, respectively.
Note that, since the data structure is also similar to a list, you could also use the length() function to retrieve the number of columns.
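The functions from this part can be tried out on a small version of writers_df (rebuilt here so that the sketch is self-contained; the column name Surname is an assumption):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Age.As.Writer = c(16, 18, 36, 36),
  Name = c("John", "Edgar", "Walt", "Jane"),
  Surname = c("Doe", "Poe", "Whitman", "Austen"),
  Gender = c("MALE", "MALE", "MALE", "FEMALE"),
  Death = c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")
)

dim(writers_df)       # rows first, then columns
dim(writers_df)[1]    # only the number of rows
dim(writers_df)[2]    # only the number of columns
nrow(writers_df)
ncol(writers_df)
length(writers_df)    # a data frame is a list of columns, so this is the column count
```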
How to Access and Change a Data Frame’s Values
There are two main ways in which you can access and change these values. In this section, you’ll see and practice both of them!
…Through the Variable Names
Now that you have retrieved and set the names of writers_df, you want to take a closer look at the values that are actually stored in it.
There are two straightforward ways that you can access these values.
First, you can try to access them by just entering the data frame’s name in combination with the variable name:
Note that if you change one of the values in the vector Age, this change will not be incorporated into writers_df.
In the end, with this method of accessing the values, you just create a copy of a certain variable!
That’s why any changes to the variables do not change the data frame’s variables.
…Through the [,] and $ Notations
You can also access writers_df’s values by using the [,] notation, where the row index comes before the comma and the column index after it.
An alternative to the [,] notation is a notation with $, which selects a column by its name.
Note that you can also change the values by simply using these notations to perform mathematical operations.
If you really want to get your hands dirty some more and change some of the values of writers_df, you can use the [,] notation to actually change the values one by one.
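Both notations are sketched below on a reduced writers_df with three columns. The specific cells that are changed are chosen for illustration only:

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Age.As.Writer = c(16, 18, 36, 36),
  Name = c("John", "Edgar", "Walt", "Jane")
)

# Access values
writers_df[1, 3]              # row 1, column 3
writers_df[, "Age.At.Death"]  # a whole column by name
writers_df$Age.As.Writer      # the same idea with $

# Change a whole column with a mathematical operation
writers_df$Age.At.Death <- writers_df$Age.At.Death - 1

# Change one single value (illustrative)
writers_df[1, "Age.As.Writer"] <- 17
writers_df
```

Unlike working on a copy of a vector, these assignments change the data frame itself.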
Why and how to Attach Data Frames
The $ notation is pretty handy, but it can become very annoying when you have to type it each time that you want to work with your data.
The attach() function offers a solution to this: it takes a data frame as an argument and places it in the search path at position 2.
So unless there are variables in position 1 that have exactly the same names as the ones from the data frame that you have inputted, the data frame’s variables can be called on immediately.
Note that the search path is in fact the order in which R looks through environments and attached packages to find an object. You can look this up by entering the search() function.
Note that you can alternatively use the with() function to work with the variables of writers_df without attaching it, but this requires you to specify the expression to evaluate as an extra argument.
When you attach writers_df, you may get a message that tells you that “The following objects are masked by .GlobalEnv:”.
This is because you have objects in your global environment that have the same name as your data frame. Those objects could be the vectors that you created above, if you didn’t change their names.
You have two solutions to this:
- You just don’t create any objects with those names in your global environment. This is more a solution for those of you who imported their data through read.delim(), but not really appropriate for this case.
- You rename the objects in the data frame so that there’s no conflict. This is the solution that was applied in this tutorial. So, rename your columns with the names() function.
Note that if all else fails, you can just remember to always refer to your column names with the $ notation!
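Now that you have unmasked the object, you can safely execute the following commands and access the values of all of writers_df’s variables directly. Keep in mind that the assignment creates a new object in the global environment; writers_df itself stays unchanged.
Age.At.Death <- Age.At.Death-1
Age.At.Death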
How to Apply Functions to Data Frames
Now that you have successfully made and modified
writers_df by putting a header in place, you can start applying functions to it!
In some cases where you want to calculate stuff, you might want to put the numeric data in a separate data frame.
Only then can you start to get, for example, the mean and the median of your numeric data.
You can do this with the apply() function. The first argument of this function should be your smaller data frame, in this case, Ages. The second argument designates what data you want to consider for the calculations of the mean or median: columns (2) or rows (1).
In this case, you want to calculate the median and mean of the variables Age.At.Death and Age.As.Writer, which designate columns in Ages.
The last argument then specifies the exact calculations that you want to do on your data.
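A minimal sketch of these steps, using the ages from the vectors that were defined at the start (the way Ages is built here is one possible choice):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Age.As.Writer = c(16, 18, 36, 36),
  Name = c("John", "Edgar", "Walt", "Jane")
)

# Keep only the two numeric columns
Ages <- writers_df[, 1:2]

# The second argument 2 means "apply over columns"
medians <- apply(Ages, 2, median)
means <- apply(Ages, 2, mean)
medians
means
```

For data that is already in a data frame, colMeans(Ages) is a shorter way to get the column means.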
Do you want to know more about the apply() function and how to use it?
Check out DataCamp’s Intermediate R course, which teaches you, amongst other things, how to make your R code more efficient and readable using this function, along with the rest of the apply() family of functions.
Surpassing the Basics: More Questions, More Answers
Now that you have been introduced to the basic pitfalls, it’s time to look at some problems, questions or difficulties that you might have already had while working with these data structures more intensively. If you’re new to this topic, the following section will allow you to step up your data frame game.
All the more reason to get started now!
How to Create an Empty Data Frame
The easiest way to create an empty data frame is probably by just assigning the result of a data.frame() call without any arguments to a variable, for example ab.
You can then start filling ab up by using the [,] notation.
Be careful, however, because it’s easy to make errors while doing this!
Note how you don’t see any column names in this empty data set. If you do want to have those, you can just initialize empty vectors in ab.
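Both variants can be sketched as follows. The column names Age and Name, the object name ab_named and the use of rbind() to add a first row are assumptions made for this example:

```r
# A completely empty data frame: 0 rows, 0 columns
ab <- data.frame()
dim(ab)

# An empty data frame that already has typed, named columns
ab_named <- data.frame(Age = numeric(), Name = character(),
                       stringsAsFactors = FALSE)
str(ab_named)

# One safe way to start filling it: bind a one-row data frame to it
ab_named <- rbind(ab_named,
                  data.frame(Age = 22, Name = "John", stringsAsFactors = FALSE))
ab_named
```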
How to Extract Rows and Columns, Subsetting your Data Frame
Subsetting or extracting specific rows and columns is an important skill in order to surpass the basics that have been introduced in step two, because it allows you to easily manipulate smaller sets of your original data.
You basically extract those values from the rows and columns that you need in order to optimize the data analyses you make.
It’s easy to start subsetting with the [,] notation that was described in step two:
Note that you can also define this subset with the variable names.
Tip: be careful when you are subsetting just one column!
R has the tendency to simplify your results, which means that it will return your subset as a vector, which is normally not what you want.
To make sure that this doesn’t happen, you can add the argument drop=FALSE.
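The difference that drop=FALSE makes can be seen in this small sketch (the object names are chosen for illustration):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Age.As.Writer = c(16, 18, 36, 36),
  Name = c("John", "Edgar", "Walt", "Jane")
)

# Rows 1 to 3, all columns
writers_df[1:3, ]

# All rows, two columns selected by name
writers_df[, c("Age.At.Death", "Name")]

# One column: simplified to a vector by default
one_col_vector <- writers_df[, "Age.At.Death"]

# One column, kept as a data frame
one_col_df <- writers_df[, "Age.At.Death", drop = FALSE]

class(one_col_vector)
class(one_col_df)
```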
In a next step, you can try subsetting with the subset() function.
Note that you can also turn to grep() to subset. For example, grep() can isolate the rows in the column Age.At.Death that have values that contain “4”.
Note that by subsetting, you basically stop considering certain values. This might mean that you remove certain features of a factor, by, for example, only considering the MALE members of writers_df.
Notice how all factor levels of this column still remain present, even though you have created a subset.
You can use factor() to remove the factor levels that are no longer present.
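As a rough sketch, subset(), grep() and factor() can be combined like this (the object names males and fours are chosen for this example):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  Gender = factor(c("MALE", "MALE", "MALE", "FEMALE"))
)

# Rows whose Age.At.Death contains a "4"
fours <- writers_df[grep("4", writers_df$Age.At.Death), ]
fours

# Only the MALE writers
males <- subset(writers_df, Gender == "MALE")
levels_before <- levels(males$Gender)   # FEMALE is still listed as a level

# Re-creating the factor drops the unused level
males$Gender <- factor(males$Gender)
levels(males$Gender)
```

The base R function droplevels() achieves the same for all factor columns of a data frame at once.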
How to Remove Columns and Rows from a Data Frame
If you want to remove values or entire columns, you can assign a NULL value to the desired unit.
To remove rows, the procedure is a bit more complicated. You define a new logical vector in which you list for every row whether to have it included or not.
Then, you apply this vector to writers_df.
Note that you can also do the opposite by just adding !, stating that the reverse is true. Also note that you can work with thresholds: you can, for example, specify that you only want to keep the writers that were older than forty when they died.
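The following sketch removes one column and then filters rows with a logical vector; the threshold of forty is the one mentioned above, and the object names are illustrative:

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  Death = c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")
)

# Remove a whole column
writers_df$Death <- NULL

# TRUE for every row to keep
keep <- writers_df$Age.At.Death > 40

older <- writers_df[keep, ]     # older than forty
younger <- writers_df[!keep, ]  # the reverse selection
older
younger
```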
How to add Rows and Columns to a Data Frame
Much in the same way that you used the [,] and $ notations to access and change single values, you can also easily add columns to writers_df.
Appending rows to an existing data frame is somewhat more complicated.
You can easily do this by first making a new row, respecting the column variables that have been defined in writers_df, and by then binding this row to the original data frame with the rbind() function.
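A minimal sketch of both operations follows. The Century column and the values of the new row are invented for illustration only:

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

# Add a column with the $ notation (hypothetical values)
writers_df$Century <- c(21, 19, 19, 19)

# Build a one-row data frame with the same columns (hypothetical writer)
new_row <- data.frame(Age.At.Death = 50, Name = "Mark", Century = 20,
                      stringsAsFactors = FALSE)

# Bind it to the original data frame
writers_df <- rbind(writers_df, new_row)
writers_df
```

Building the new row as a data frame instead of a plain vector keeps the column types intact, because a vector can only hold values of a single type.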
Why and how to Reshape an R Data Frame from Wide to Long Format and Vice Versa
When you have multiple values, spread out over multiple columns, for the same instance, your data is in the “wide” format.
On the other hand, your data is in the “long” format if there is one observation row per variable. You therefore have multiple rows per instance.
Let’s illustrate this with an example. Long data looks like this:
As you can see, there is one row for each value that you have in the Type variable. A lot of statistical tests favor this format.
The data would look like the following in the wide format:
You see that each column represents a unique pairing of the various factors with the values.
Since different functions may require you to input your data either in “long” or “wide” format, you might need to reshape your data set.
There are two main options that you can choose here: you can use the stack() function or you can try using the reshape() function.
The former is preferred when you work with simple data frames, while the latter is more often used on more complex ones, mostly because there’s a difference in the possibilities that both functions offer.
Make sure to keep on reading to know more about the differences in possibilities between the stack() and reshape() functions!
stack() for Simply Structured Data Frames
The stack() function basically concatenates or combines multiple vectors into a single vector, along with a factor that indicates where each observation originates from.
To go from wide to long format, you will have to stack your observations, since you want one observation row per variable, with multiple rows per instance.
In this case, you want to merge the columns Read, Write and Listen together, both in terms of names and in terms of values.
To go from long to wide format, you will need to unstack your data, which makes sense because you want to have one row per instance with each value present as a different variable.
Note here that you want to disentangle the Test and Result columns.
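The sketch below stacks and unstacks a small wide data set. The contents of observations_wide, including the Subject column and all scores, are assumptions made for this example, since the original data is not shown here:

```r
# Hypothetical wide data: one row per student, one column per test
observations_wide <- data.frame(
  Subject = c(1, 2),
  Gender = c("M", "F"),
  Read = c(10, 7),
  Write = c(8, 4),
  Listen = c(7, 6)
)

# Wide to long: stack the three test columns
long_format <- stack(observations_wide, select = c(Read, Write, Listen))
long_format    # column "values" holds the scores, "ind" the test names

# Long to wide: unstack again
wide_format <- unstack(long_format)
wide_format
```

Note that stack() names its result columns values and ind, and that the identifying columns are not carried along, which is one reason why reshape() is preferred for more complex data.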
reshape() for Complex Data Frames
This function is part of the stats package. It is similar to the stack() function, but is a little bit more elaborate. Read and see for yourself how reshaping your data works with the reshape() function.
To go from a wide to a long data format, you can first start off by entering the reshape() function.
The first argument should always be your original wide data set.
In this case, you can specify that you want to input observations_wide to be converted to a long data format.
Then, you start adding other arguments to the reshape() function:
- Include a list of variable names that define the different measurements through varying. In this case, you store the scores of specific tests in the columns “Read”, “Write” and “Listen”.
- Next, add the argument v.names to specify the name that you want to give to the variable that contains these values in your long dataset. In this case, you want to combine all scores for all reading, writing and listening tests into one variable, Result.
- You also need to give a name to the variable that describes the different measurements, which you input with the argument timevar. In this case, you want to give a name to the column that contains the types of tests that you give to your students. That’s why this column should be called “Test”.
- Then, you add the argument times, because you need to specify that the new column “Test” can only take three values, namely, the test components that you have stored: “Read”, “Write”, “Listen”.
- You’re finally there! Give in the end format for the data with the argument direction.
- Additionally, you can specify new row names with the argument new.row.names.
Tip: try leaving out this last argument and see what happens!
From long to wide, you take sort of the same steps. First, you take the reshape() function and give it its first argument, which is the data set that you want to reshape. The other arguments are as follows:
- timevar allows you to specify that the variable Test, which describes the different tests that you give to your students, should be decomposed.
- You also specify that the reshape() function shouldn’t take into account the identifying variables of the original data set, such as Gender. You put these column names into idvar.
- By not naming the variable Result, the reshape() function will know that both Test and Result should be recombined.
- You specify the direction of the reshaping, which is in this case, wide.
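The arguments listed above can be put together as in the following sketch. The data in observations_wide, including the Subject column, is hypothetical and only serves to make the example runnable:

```r
# Hypothetical wide data: one row per student
observations_wide <- data.frame(
  Subject = c(1, 2),
  Gender = c("M", "F"),
  Read = c(10, 7),
  Write = c(8, 4),
  Listen = c(7, 6)
)

# Wide to long
observations_long <- reshape(observations_wide,
                             varying = c("Read", "Write", "Listen"),
                             v.names = "Result",
                             timevar = "Test",
                             times = c("Read", "Write", "Listen"),
                             idvar = "Subject",
                             new.row.names = 1:6,
                             direction = "long")
observations_long

# Long to wide: Test is decomposed, Subject and Gender are left alone
observations_back <- reshape(observations_long,
                             timevar = "Test",
                             idvar = c("Subject", "Gender"),
                             direction = "wide")
observations_back
```

In the wide result, reshape() names the new columns by pasting the value variable and the test name together, for example Result.Read.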
Note that if you want you can also rename or sort the results of these new long and wide data formats! You can find detailed instructions below.
Reshaping Data Frames with tidyr
This package allows you to “easily tidy data with the spread() and gather() functions” and that’s exactly what you’re going to do if you use this package to reshape your data! Note that in recent versions of tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), although they still work.
If you want to convert from wide to long format, the principle stays similar to that of reshape(): you use the gather() function and you start specifying its arguments:
1. Your data set is the first argument to the gather() function.
2. Then, you specify the name of the column in which you will combine the names of Read, Write and Listen. In this case, you want to call it something like Test or Test.Type.
3. You enter the name of the column in which all the values of the Read, Write and Listen columns are listed.
4. You indicate which columns are supposed to be combined into one. In this case, that will be the columns from Read to Listen.
Note how the last argument specifies the columns in the same way as you did to subset writers_df or to select the columns of writers_df on which you wanted to perform mathematical operations.
You can also just specify the columns individually like this:
long_tidyr <- gather(observations_wide, Test, Result, Read, Write, Listen)
long_tidyr
The opposite direction, from long to wide format, is very similar to the function above, but this time with the spread() function.
Again, you take as the first argument your data set. Then, you specify the column that contains the new column names.
In this case, that is Test.
Lastly, you input the name of the column that contains the values that should be put into the new columns.
Reshaping Data Frames with reshape2
This package allows you to “flexibly reshape data”. To go from a wide to a long data format, you use its melt() function.
This function is pretty easy, since it just takes your data set and the id.vars argument, which is comparable to the idvar argument of the reshape() function. This argument allows you to specify which columns should be left alone by the function.
There are a couple more arguments that you can specify:
- measure.vars lists the original columns that will be combined into the destination column. If you leave out this argument, the melt() function will use all other variables as the measure.vars.
- variable.name specifies how you want to name that destination column. If you don’t specify this argument, you will have a column named “variable” in your result.
- value.name allows you to input the name of the column in which the values or test results will be stored. If you leave out this argument, this column will be named “value”.
You can also go from a long to a wide format with the reshape2 package with the dcast() function.
This is fairly easy: you first give in your data set, as always. Then, you give a formula whose left-hand side combines the columns which you don’t want to be touched.
In this case, you want to keep the identifying columns, such as Gender, as they are. The column Test, however, you want to split! So, that is the second part of your second argument, indicated by a ~. The last argument of this function is value.var, which names the column that holds the values of the different tests, in this case Result.
Sorting by columns might seem tricky, but this can be made easy by either using R’s built-in
order() function or by using a package.
You can for example sort by one of the data frame’s columns. You order the rows according to the values that are stored in the variable that you pass to order().
If you want to sort the values starting from high to low, you can just add the extra argument decreasing, which can only take logical values.
Remember that logical values are TRUE and FALSE.
Another way is to wrap the order() function in the rev() function. As the function’s name suggests, it provides a way to give you the reversed version of its argument, which is order(Name) in this case.
You can also add a - in front of the numeric variable that you have given to order on.
Note that the - sign only works on numeric values: variables with other data types, such as factors or characters, need to be converted to numeric first (for example with xtfrm()) before you can sort them this way.
You can also sort on two variables. In that case,
order() needs to have two arguments, so that you first sort by the first argument of the
order() function and then on the second argument.
You’ll see an example of this further on in the tutorial.
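The base R variants described above can be sketched as follows (the object names are chosen for this example):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Age.As.Writer = c(16, 18, 36, 36),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

# Low to high on one column
ascending <- writers_df[order(writers_df$Age.At.Death), ]

# High to low, three equivalent ideas
descending <- writers_df[order(writers_df$Age.At.Death, decreasing = TRUE), ]
descending_minus <- writers_df[order(-writers_df$Age.At.Death), ]
names_reversed <- writers_df[rev(order(writers_df$Name)), ]

# Two variables: first Age.As.Writer, then Age.At.Death to break ties
two_keys <- writers_df[order(writers_df$Age.As.Writer, writers_df$Age.At.Death), ]
two_keys
```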
The dplyr package, known for its abilities to manipulate data, has a specific function that allows you to sort rows by variables.
Dplyr’s function to make this happen is arrange().
The first argument of this function is the data set that you want to sort, while the second and third arguments are the variables that you choose to sort on.
In this case you sort first on the variable Age.At.Death and then on the second variable that you pass.
You can also use an approach where you use the with() function together with order() to get the same result.
Note that if you want to sort columns in descending order, you can wrap the variables in the desc() function.
Are you interested in doing much more with the dplyr package? Check out our Data Manipulation in R with dplyr course, which will teach you how to perform sophisticated data manipulation tasks using dplyr!
Also, check out our data manipulation with dplyr cheat sheet.
Other Options to Sort Data Frames
There are also many other packages that offer sorting functions. This section will only give a short overview of the packages that exist. Firstly, the taRifx package offers sort.data.frame(), by which the values in Age.At.Death can again be sorted in decreasing order:
library(taRifx)
sorted_data <- sort(writers_df, decreasing=TRUE, ~Age.At.Death)
sorted_data
Secondly, there’s also the package doBy that offers the function orderBy(). In this case, you want to order the values of Age.At.Death from high to low first, and then on the values of the Age.As.Writer variable:
library(doBy)
sorted_data_two <- orderBy(~-Age.At.Death+Age.As.Writer, writers_df)
sorted_data_two
How to Merge Data Frames
Merging Data Frames on Column Names
You can use the merge() function to join two, but only two, data frames.
Let’s say you have a data frame data2, which has a variable Age.At.Death that you also find in writers_df, with exactly the same values. You thus want to merge the two on the basis of this variable.
You can easily merge these two data frames.
Tip: check what happens if you change the order of the two arguments of the merge() function!
By default, this way of merging is equivalent to an inner join in SQL: only the rows with matching values in both data frames are kept.
Unfortunately, you’re not always this lucky. In many cases, some of the column names or variable values will differ, which makes it hard to follow the easy, standard procedure that was described just now. In addition, you may not always want to merge in the standard way that was described above.
In the following, some of the most common issues are listed and solved!
What If… (some of) the Data Frame’s Column Values are Different?
If (some of) the values of the variable on which you merge differ, you have a small problem, because the
merge() function supposes that these are the same so that any new variables that are present in the second data frame can be added correctly to the first.
Consider the following example:
> data2
  x.Age.At.Death Location
1             21        5
2             39        6
3             71        7
4             40        8
You see that the values for the attribute Age.At.Death do not fit with the ones that were defined for writers_df.
No worries, the merge() function provides extra arguments to solve this problem.
all.x allows you to keep all rows of writers_df in the resulting data frame and to add the Location variable to them, even though this column is not present in writers_df.
In this case, the values of the Location variable will be added to writers_df for those rows of which the values of the Age.At.Death attribute correspond. All rows where the Age.At.Death of the two data frames don’t correspond, will be filled up with NA-values.
Note that this join corresponds to a left outer join in SQL and that the default value of the all.x argument is FALSE, which means that one normally only takes into account the corresponding values of the merging variable.
Note also that you can specify the argument all.y=TRUE if you want to add an extra row for each row of data2 that has no matching row in writers_df.
For those who are familiar with SQL, this type of join corresponds to a right outer join.
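The different join types can be compared in a small sketch. The data2 used here takes its values from the printed example above, but names its key column Age.At.Death so that merge() can find the common column by itself; that naming is an assumption made for this example:

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

# Only the value 40 occurs in both data frames
data2 <- data.frame(
  Age.At.Death = c(21, 39, 71, 40),
  Location = c(5, 6, 7, 8)
)

inner_join <- merge(writers_df, data2)                # default: matches only
left_join <- merge(writers_df, data2, all.x = TRUE)   # keep all writers_df rows
right_join <- merge(writers_df, data2, all.y = TRUE)  # keep all data2 rows
full_join <- merge(writers_df, data2, all = TRUE)     # keep everything

left_join
```

Rows without a match get NA in the columns that come from the other data frame.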
What If… Both Data Frames Have the same Column Names?
What if your two data frames have exactly the same two variables, with or without the same values?
You can choose to keep all values from all corresponding variables and to add rows to the resulting data frame.
Or you can just choose to add values from one specific variable for which the ages of death correspond.
What If… the Data Frames’ Column Names are Different?
Lastly, what if the names of the variables on which you merge differ?
You just tell the merge() function which column to use in each data frame through the arguments by.x and by.y.
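A short sketch of merging on differently named columns; the column name Age in data2 and its values are assumptions made for this example:

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

# Same key values, but the key column has another name
data2 <- data.frame(
  Age = c(22, 40, 72, 41),
  Location = c(5, 6, 7, 8)
)

merged <- merge(writers_df, data2, by.x = "Age.At.Death", by.y = "Age")
merged
```

The result keeps the key column name of the first data frame.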
Merging Data Frames on Row Names
You can indeed merge the columns of two data frames that contain a distinct set of columns but some rows with the same names. The merge() function and its arguments come to the rescue!
Consider this second example:
Address <- c("50 West 10th", "77 St. Marks Place", "778 Park Avenue")
Maried <- c("YES", "NO", "YES")
limited_writers_df <- data.frame(Address, Maried)
You see that this data set contains three rows, marked with numbers 1 to 3, and two additional columns that are not in writers_df. To merge these two, you add the argument by to the merge() function and set it at the number 0, which specifies the row names.
Since you choose to keep all values from all corresponding variables and to add columns to the result, you set the all argument to TRUE.
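Put together, the merge on row names can be sketched like this (writers_df is reduced to two columns here to keep the example short; the object name writers_row_merged is an assumption):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

Address <- c("50 West 10th", "77 St. Marks Place", "778 Park Avenue")
Maried <- c("YES", "NO", "YES")
limited_writers_df <- data.frame(Address, Maried, stringsAsFactors = FALSE)

# by = 0 merges on the row names; all = TRUE keeps unmatched rows too
writers_row_merged <- merge(writers_df, limited_writers_df, by = 0, all = TRUE)
writers_row_merged
```

The row names end up in a new first column called Row.names, and the fourth writer gets NA for Address and Maried.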
It could be that the fields for rows that don’t occur in both data structures result in NA-values. You can easily solve this by removing them.
How to do this will be discussed below.
How to Remove Data Frame Rows and Columns with NA-Values
To remove all rows that contain NA-values, one of the easiest options is to use the na.omit() function, which takes your data frame as an argument.
You can apply it to the merged result from the previous section, which contains a lot of NA-values.
Note that if you only want to take part of the data frame into account when removing the NA-values, it’s better to use complete.cases().
In this case, you’re interested in keeping all rows for which the values of the columns that you select, such as Name, are complete.
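Both approaches are sketched below on a small data frame with one missing value; the data frame writers_na is built by hand for this example rather than taken from the merge result:

```r
writers_na <- data.frame(
  Name = c("John", "Edgar", "Walt", "Jane"),
  Age.At.Death = c(22, 40, 72, 41),
  Address = c("50 West 10th", "77 St. Marks Place", "778 Park Avenue", NA),
  stringsAsFactors = FALSE
)

# Drop every row that has an NA in any column
no_na <- na.omit(writers_na)

# Only require Name and Age.At.Death to be complete
selected_complete <- writers_na[complete.cases(writers_na[, c("Name", "Age.At.Death")]), ]

nrow(no_na)
nrow(selected_complete)
```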
How to Convert Between Data Structures
Convert Lists or Matrices to Data Frames
Lists or matrices that comply with the restrictions that the data frame structure imposes can be coerced into data frames with the as.data.frame() function.
Remember that a data frame is similar to the structure of a matrix, where the columns can be of different types. There are also similarities with lists, where each column is an element of the list and each element has the same length. Any matrices or lists that you want to convert need to satisfy these restrictions.
For example, a matrix A can be converted because each column contains values of the numeric data type. You enter the matrix A as an argument to the as.data.frame() function.
You can follow the same procedure for lists.
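A minimal sketch of both conversions; the contents of the matrix A and of the list are invented for illustration:

```r
# A numeric matrix with 2 rows and 3 columns (hypothetical values)
A <- matrix(1:6, nrow = 2)
A_df <- as.data.frame(A)
A_df     # columns get default names V1, V2, V3

# A list whose elements all have the same length (hypothetical values)
my_list <- list(x = 1:3, y = c("a", "b", "c"))
list_df <- as.data.frame(my_list)
list_df
```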
Convert Data Frames to Matrices or Lists
To make the opposite move, that is, to convert data frames to matrices and lists, you first have to check for yourself if this is possible. Does your writers_df contain one or more dimensions and what about the number of data types?
Rewatch the small animation of the introduction if you’re not sure what data structure to pick.
Once you have an answer, you can use the functions as.matrix() and as.list() to convert writers_df to a matrix or a list, respectively.
For those of you who want to specifically make numeric matrices, you can use the function data.matrix() or add an sapply() function to the as.matrix() function.
Note that with the current writers_df, which contains a mixture of data types, NA-values can be introduced in the resulting matrices when character values are coerced to numbers.
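The conversions can be sketched as follows on a reduced writers_df (the object names are chosen for this example):

```r
writers_df <- data.frame(
  Age.At.Death = c(22, 40, 72, 41),
  Name = c("John", "Edgar", "Walt", "Jane"),
  stringsAsFactors = FALSE
)

# Mixed types: everything is coerced to character
writers_matrix <- as.matrix(writers_df)

# One list element per column
writers_list <- as.list(writers_df)

# A numeric matrix from the numeric column only
numeric_matrix <- as.matrix(sapply(writers_df["Age.At.Death"], as.numeric))

# Coercing names to numbers yields NA (with a warning, suppressed here)
names_as_numbers <- suppressWarnings(as.numeric(writers_df$Name))
names_as_numbers
```

Because a matrix can only hold one data type, the mixed data frame becomes a character matrix, while forcing text into numbers produces NA-values.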
From Data Structures to Data Analysis, Data Manipulation and Data Visualization
Working with this R data structure is just the beginning of your data analysis!
If this tutorial has gotten you thrilled to dig deeper into programming with R, make sure to check out our free interactive Introduction to R course.
Those of you who are already more advanced with R and who want to take their skills to a higher level might be interested in our courses on data manipulation and data visualization.
Go to our course overview and take a look!