Course
Long to Wide Data in R
There is a time when people need to convert data in the long format (you call it "the long-form" and what it means will be clear later) to the wide format and the vice versa. The data in real life comes in all sorts of format, but most R functions are designed for one specific form.
For instance, lme4::lmer()
for mixed linear modeling, ggplot2::ggplot()
for visualization are deisgned for long-form. But there are other functions designed for wide-form such as mosaicplot()
. That means that you need to convert the form to do analysis or visualization unless the data is already processed and in the right form.
There are several options in R for format coversion but it can become quite confusing when you're just starting out.
This tutorial will show you how to understand the mechanics of converting the data between the long-form and the wide-form. You'll explore the following topics:
- You'll learn more about the long form.
- Then, you'll get to know the wide form.
- You'll recap the differences between the long and wide formats;
- After which you'll see which steps you'd need to undertake to go from one form to another.
- Lastly, you'll implement all of this with the help of R. You'll learn how to go from a long format to a wide format with the
melt()
anddcast()
functions.
The Data
name | gender | year2011 | year2012 | year2013 | year2014 | year2015 |
---|---|---|---|---|---|---|
Jackson Smith | M | 74.69 | 84.99 | 91.73 | 105.11 | 111.04 |
Sophia Johnson | F | NA | NA | NA | NA | 75.89 |
Emma Williams | F | NA | NA | 75.74 | 86.50 | 91.50 |
Aiden Jones | M | NA | NA | NA | 71.89 | 81.42 |
Liam Brown | M | 88.24 | 96.91 | 101.85 | 108.13 | 112.45 |
Lucas Davis | M | 70.60 | 83.78 | 94.17 | 100.03 | 106.35 |
Olivia Miller | F | 64.78 | 80.76 | 87.30 | 97.13 | 103.80 |
Ava Wilson | F | 88.77 | 96.45 | 104.72 | 112.84 | NA |
This table shows the height of babies in a small local area. The heights were measured from the year 2011 to the year 2015. As you might already know, NA
means ‘Not Available.’ From the position of NA’s, you can figure out the babies’ ages, but the exact age of a baby is not available from the table. The table looks well-organized, but for anyone who wants to study the growth of babies, there is a problem: the important information is not the absolute time at which a baby is measured but the age of a baby when he or she was measured. Could you insert this information into the table? Below, you can see one way:
name | gender | age0 | age1 | age2 | age3 | age4 | age5 | age6 |
---|---|---|---|---|---|---|---|---|
Jackson Smith | M | NA | 76.69 | 84.91 | 97.06 | 105.73 | 107.32 | NA |
Sophia Johnson | F | NA | 76.05 | NA | NA | NA | NA | NA |
Emma Williams | F | 72.65 | 82.46 | 91.76 | NA | NA | NA | NA |
Aiden Jones | M | 69.76 | 79.89 | NA | NA | NA | NA | NA |
Liam Brown | M | NA | NA | 85.34 | 93.22 | 105.78 | 105.84 | 114.82 |
Lucas Davis | M | 71.59 | 84.87 | 92.10 | 100.73 | 104.21 | NA | NA |
Olivia Miller | F | 62.74 | 81.39 | 92.02 | 98.32 | 106.44 | NA | NA |
Ava Wilson | F | NA | NA | 88.39 | 100.67 | 107.58 | 111.52 | NA |
The number of NA
's has increased. ## Wide Form Data But what would you do if you wanted to convey how many months a baby was old? It would be something like this:
name | gender | m4 | m6 | m7 | m10 | m12 | ... | m55 | m60 | m65 | m72 |
---|---|---|---|---|---|---|---|---|---|---|---|
Jackson Smith | M | NA | NA | NA | NA | 76.69 | ... | NA | 107.32 | NA | NA |
Sophia Johnson | F | NA | NA | NA | NA | 76.05 | ... | NA | NA | NA | NA |
Emma Williams | F | NA | NA | NA | 72.65 | NA | ... | NA | NA | NA | NA |
Aiden Jones | M | NA | 69.76 | NA | NA | NA | ... | NA | NA | NA | NA |
Liam Brown | M | NA | NA | NA | NA | NA | ... | NA | 105.84 | NA | 114.82 |
Lucas Davis | M | NA | NA | 71.59 | NA | NA | ... | 104.21 | NA | NA | NA |
Olivia Miller | F | 62.74 | NA | NA | NA | NA | ... | NA | NA | NA | NA |
Ava Wilson | F | NA | NA | NA | NA | NA | ... | NA | NA | 111.52 | NA |
There are tons of NA
’s in the table above. If the babies differ in their relative measurement times, the wide form can waste a lot of space. Another option is to include the ages of the babies when it was the year 2015, but that is only possible when the absolute measurement times are all the same. If the babies differ in their measurement time by months and you want to know the ages in months, it does not work. Then this example seems to be the only way. In this case, you can consider using long-form. ## Data in Long Form Whereas all measurements from one baby were shown in one row in wide-form, you'll only see one measurement from a baby in one row in long-form. The simplest way to obtain the latter would be displaying a child's name, the measurement time, and the measurement (height) in one row. This is exactly what this table shows:
name | gender | month | height |
---|---|---|---|
Olivia Miller | F | 4 | 62.74 |
Aiden Jones | M | 6 | 69.76 |
Lucas Davis | M | 7 | 71.59 |
Emma Williams | F | 10 | 72.65 |
Jackson Smith | M | 12 | 76.69 |
... | ... | ... | ... |
Liam Brown | M | 72 | 114.82 |
If you arrange the data sorting the names, you'd get something like the table you see below. Obviously, this table occupies smaller space than Table 3, which you can see above!
name | gender | month | height |
---|---|---|---|
Jackson Smith | M | 12 | 76.69 |
Jackson Smith | M | 24 | 84.91 |
Jackson Smith | M | 36 | 97.06 |
Jackson Smith | M | 48 | 105.73 |
Jackson Smith | M | 60 | 107.32 |
Sophia Johnson | F | 12 | 76.05 |
Emma Williams | F | 10 | 72.65 |
... | ... | ... | ... |
Ava Wilson | F | 65 | 111.52 |
Long versus Wide
Note again what the fundamental differences are between the wide-form and the long-form: one simple difference is that the wide-form displays many measurements from one individual in one row and the column names show what the measurements are.
This table shows Liam Brown’s heights in the year 2011-2015 in one row:
name | gender | year2011 | year2012 | year2013 | year2014 | year2015 |
---|---|---|---|---|---|---|
Liam Brown | M | 88.24 | 96.91 | 101.85 | 108.13 | 112.45 |
In long-form Liam Brown's heights are shown in one column:
name | gender | _ |
---|---|---|
Liam Brown | M | 85.34 |
Liam Brown | M | 93.22 |
Liam Brown | M | 105.78 |
Liam Brown | M | 105.84 |
Liam Brown | M | 114.82 |
As you can see, you can not know what the measurements are. If you write down the column name, it would look like this:
name | gender | year2011, 2012, 2013, 2014, 2015 |
---|---|---|
Liam Brown | M | 85.34 |
Liam Brown | M | 93.22 |
Liam Brown | M | 105.78 |
Liam Brown | M | 105.84 |
Liam Brown | M | 114.82 |
But this one has its own problem. Sophia Johnson’s height was not measured in the year 2011-2014. So as you can see below, the column name cannot be set easily:
name | gender | ? |
---|---|---|
Liam Brown | M | 85.34 |
Liam Brown | M | 93.22 |
Liam Brown | M | 105.78 |
Liam Brown | M | 105.84 |
Liam Brown | M | 114.82 |
Sophia Johnson | F | 76.05 |
From Wide-Form to Long-Form
Converting the wide-form into the long-form can be thought of as a step-by-step process. Before converting the measurements in one row into one column, you can make the table in such a way that it contains only one measurement in each row. Let's do that for this table:
name | gender | age0 | age1 | age2 | age3 | age4 | age5 | age6 |
---|---|---|---|---|---|---|---|---|
Emma Williams | F | 72.65 | 82.46 | 91.76 | NA | NA | NA | NA |
Aiden Jones | M | 69.76 | 79.89 | NA | NA | NA | NA | NA |
The result is like the table below:
name | gender | age0 | age1 | age2 | age3 | age4 | age5 | age6 |
---|---|---|---|---|---|---|---|---|
Emma Williams | F | 72.65 | NA | NA | NA | NA | ||
Emma Williams | F | 82.46 | NA | |||||
Emma Williams | F | 91.76 | ||||||
Aiden Jones | M | 69.76 | NA | NA | NA | NA | NA | |
Aiden Jones | M | 79.89 |
Before gathering those measurements scattered through columns into one column, you need to make an additional column that says what the measurements are, just like below:
name | gender | measure | age0 | age1 | age2 | age3 | age4 | age5 | age6 |
---|---|---|---|---|---|---|---|---|---|
Emma Williams | F | age0 | 72.65 | NA | NA | NA | NA | ||
Emma Williams | F | age1 | 82.46 | NA | |||||
Emma Williams | F | age2 | 91.76 | ||||||
Aiden Jones | M | age0 | 69.76 | NA | NA | NA | NA | NA | |
Aiden Jones | M | age1 | 79.89 |
The column name is usually either “measure”
or “key”
. You can notice in the above table that there are “age0”
, “age1”
, “age2”
, “age3”
, “age4”
, “age5”
in both column names and one column. If you omit the column names, the table will be like below.
name | gender | measure | _ | __ | ___ | ____ | _____ | ______ | _______ |
---|---|---|---|---|---|---|---|---|---|
Emma Williams | F | age0 | 72.65 | NA | NA | NA | NA | ||
Emma Williams | F | age1 | 82.46 | ||||||
Emma Williams | F | age2 | 91.76 | ||||||
Aiden Jones | M | age0 | 69.76 | NA | NA | NA | NA | NA | |
Aiden Jones | M | age1 | 79.89 |
Since there is only one measurement in each row, you can gather them in one column and the result will be the following:
name | gender | measure | |
---|---|---|---|
Emma Williams | F | age0 | 72.65 |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Aiden Jones | M | age0 | 69.76 |
Aiden Jones | M | age1 | 79.89 |
Let's make new column name for all the measurements. Usually, it is named as "value":
name | gender | measure | value |
---|---|---|---|
Emma Williams | F | age0 | 72.65 |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Aiden Jones | M | age0 | 69.76 |
Aiden Jones | M | age1 | 79.89 |
Let's look at the above table. Do you notice that "age" is repeating? You can omit repeating entity, which will give you the following result:
name | gender | age | value |
---|---|---|---|
Emma Williams | F | 0 | 72.65 |
Emma Williams | F | 1 | 82.46 |
Emma Williams | F | 2 | 91.76 |
Aiden Jones | M | 0 | 69.76 |
Aiden Jones | M | 1 | 79.89 |
Age
, which is repeating in every row is now a column name. Column name value
in the table above can be more specific. You replaced it with height
, which is more revealing. You can see the result in the table below:
name | gender | age | height |
---|---|---|---|
Emma Williams | F | 0 | 72.65 |
Emma Williams | F | 1 | 82.46 |
Emma Williams | F | 2 | 91.76 |
Aiden Jones | M | 0 | 69.76 |
Aiden Jones | M | 1 | 79.89 |
Let's compare Table 15 with Table 17. value
, one of the column names of Table 15, can mean anything, but height
of Table 17 has very specific meaning: it can only designate the height whereas the value in Table 15 could be any value like height, weight, speed or color. Another way would be like the following:
name | gender | measure | value |
---|---|---|---|
Emma Williams | F | height.age0 | 72.65 |
Emma Williams | F | height.age1 | 82.46 |
Emma Williams | F | height.age2 | 91.76 |
Aiden Jones | M | height.age0 | 69.76 |
Aiden Jones | M | height.age1 | 79.89 |
Let's consider the situation in which the measurements are diverse. If the measurements are height, weight and BMI, it will be like table below. But in this case, the column value
has numerical values in the different scale so it could be confusing reading them. In this table, height was measured in centimeter, and weight in kilogram:
name | gender | measure | value |
---|---|---|---|
Emma Williams | F | height | 72.65 |
Emma Williams | F | weight | 8.22 |
Emma Williams | F | BMI | 22.60 |
Aiden Jones | M | height | 69.76 |
Aiden Jones | M | weight | 8.51 |
So when converting the wide-form into the long-form, one needs to decide which values will be gathered into one column. After that, you can decide the column name.
Let's summarize what is been said so far: to convert the wide-from into the long-form one need to decide which columns to modify that they are in one column. In doing so, many columns are being combined into one column. So you need a column name for that column, which you can see in the next tables:
id1 | id2 | age0 | age1 | age2 |
---|---|---|---|---|
Emma Williams | F | NA | 82.46 | 91.76 |
Aiden Jones | M | 69.76 | 79.89 | NA |
Emma Williams | F | NA | 82.46 | 91.76 |
id1 | id2 | age0 | age1 | age2 |
---|---|---|---|---|
Emma Williams | F | NA | ||
Emma Williams | F | 82.46 | ||
Emma Williams | F | 91.76 | ||
Aiden Jones | M | 69.76 | ||
Aiden Jones | M | 79.89 | ||
Aiden Jones | M | NA | ||
Emma Williams | F | NA | ||
Emma Williams | F | 82.46 | ||
Emma Williams | F | 91.76 |
id1 | id2 | age_ | age0 | age1 | age2 |
---|---|---|---|---|---|
Emma Williams | F | age0 | NA | ||
Emma Williams | F | age1 | 82.46 | ||
Emma Williams | F | age2 | 91.76 | ||
Aiden Jones | M | age0 | 69.76 | ||
Aiden Jones | M | age1 | 79.89 | ||
Aiden Jones | M | age2 | NA | ||
Emma Williams | F | age0 | NA | ||
Emma Williams | F | age1 | 82.46 | ||
Emma Williams | F | age2 | 91.76 |
id1 | id2 | age_ | |
---|---|---|---|
Emma Williams | F | age0 | NA |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Aiden Jones | M | age0 | 69.76 |
Aiden Jones | M | age1 | 79.89 |
Aiden Jones | M | age2 | NA |
Emma Williams | F | age0 | NA |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
id1 | id2 | key | value |
---|---|---|---|
Emma Williams | F | age0 | NA |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Aiden Jones | M | age0 | 69.76 |
Aiden Jones | M | age1 | 79.89 |
Aiden Jones | M | age2 | NA |
Emma Williams | F | age0 | NA |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
id1 | id2 | key | value |
---|---|---|---|
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Aiden Jones | M | age0 | 69.76 |
Aiden Jones | M | age1 | 79.89 |
Emma Williams | F | age1 | 82.46 |
Emma Williams | F | age2 | 91.76 |
Let's look at the columns id1
and id2
in the above table. id2
does not specify an individual but it is the attribute(gender) of an individual. In other words, you can say it is a measurement. So you can convert it into the longer-form. The longest-form is made up of three columns, which are id
, key
and value
.
id | key | value |
---|---|---|
Emma Williams | gender | F |
Emma Williams | age1 | 82.46 |
Emma Williams | age2 | 91.76 |
Aiden Jones | gender | F |
Aiden Jones | age0 | 69.76 |
Aiden Jones | age1 | 79.89 |
Emma Williams | gender | F |
Emma Williams | age1 | 82.46 |
Emma Williams | age2 | 91.76 |
The above table shows the longest-form from table 25. But different people can have the same name.
In that case, long-form can be confusing since id
can not specify one person. You can put a number behind the name to ditinguish people(for example, using R function make.unique()
), but you can not differentiate the name that is altered and the original name with a number. Think of a person whose name is "John.2" for real!.
The obvious way around this is to assign a different number ID for a different person. In that case, names are another measurements.
The next table shows that even the names are considered as measurements. In a sense, this is the true longest-form.
id | key | value |
---|---|---|
1 | name | Emma Williams |
1 | gender | F |
1 | age1 | 82.46 |
1 | age2 | 91.76 |
2 | name | Aiden Jones |
2 | gender | F |
2 | age0 | 69.76 |
2 | age1 | 79.89 |
3 | name | Emma Williams |
3 | gender | F |
3 | age1 | 82.46 |
3 | age2 | 91.76 |
The longest-form is the easiest form for making a wide-form. If you reverse the process of converting the wide-form into the long-form, which is shown in tables 20 to 25, you get to the wide-form. The next tables below show this process:
id | key | name | gender | age0 | age1 | age2 |
---|---|---|---|---|---|---|
1 | name | Emma Williams | ||||
1 | gender | F | ||||
1 | age1 | 82.46 | ||||
1 | age2 | 91.76 | ||||
2 | name | Aiden Jones | ||||
2 | gender | F | ||||
2 | age0 | 69.76 | ||||
2 | age1 | 79.89 | ||||
3 | name | Emma Williams | ||||
3 | gender | F | ||||
3 | age1 | 82.46 | ||||
3 | age2 | 91.76 |
id | name | gender | age0 | age1 | age2 |
---|---|---|---|---|---|
1 | Emma Williams | ||||
1 | F | ||||
1 | 82.46 | ||||
1 | 91.76 | ||||
2 | Aiden Jones | ||||
2 | F | ||||
2 | 69.76 | ||||
2 | 79.89 | ||||
3 | Emma Williams | ||||
3 | F | ||||
3 | 82.46 | ||||
3 | 91.76 |
id | name | gender | age0 | age1 | age2 |
---|---|---|---|---|---|
1 | Emma Williams | F | 82.46 | 91.76 | |
1 | |||||
1 | |||||
1 | |||||
2 | Aiden Jones | F | 69.76 | 79.89 | |
2 | |||||
2 | |||||
2 | |||||
3 | Emma Williams | F | 82.46 | 91.76 | |
3 | |||||
3 | |||||
3 |
id | name | gender | age0 | age1 | age2 |
---|---|---|---|---|---|
1 | Emma Williams | F | 82.46 | 91.76 | |
2 | Aiden Jones | F | 69.76 | 79.89 | |
3 | Emma Williams | F | 82.46 | 91.76 |
The key factor in converting the wide-form into the long-form is deciding which columns will be gathered into one column.
But it does not have to be just one column. You could use two columns if that is necessary, just like in the table below:
name | gender | h2011 | h2012 | w2011 | w2012 |
---|---|---|---|---|---|
Jackson Smith | M | 74.69 | 84.99 | 9.60 | 12.0 |
Olivia Miller | F | NA | 80.76 | 7.15 | 10.7 |
Ava Wilson | F | 88.77 | 96.45 | NA | 15.0 |
Can you make it into the long-form?
Columns h2011
, h2012
, w2011
, w2012
mean heights and weights in the year 2011 and 2012. height
and weight
are measured in the different scales - cm, kg. So you can use only one column for both height and weight or you can use two columns, just like in this table:
name | gender | year | height | weight |
---|---|---|---|---|
Jackson Smith | M | 2011 | 74.69 | 9.60 |
Jackson Smith | M | 2012 | 84.99 | 12.00 |
Olivia Miller | F | 2011 | NA | 80.76 |
Olivia Miller | F | 2012 | 7.15 | 10.70 |
Ava Wilson | F | 2011 | 88.77 | 96.45 |
Ava Wilson | F | 2012 | NA | 15.00 |
Is this the long-form or the wide-form?
From Long to Wide Data in R
Let's do the coversion between the long-form and the wide-form with R. There are many functions and packages available for this task.
func | package | to_long_form | to_wide_form |
---|---|---|---|
stack/unstack | utils | stack | unstack |
reshape | stats | reshape(direction="long", ...) | reshape(direction="wide", ...) |
melt/dcast | reshape2 | melt | dcast |
gather/spread | tidyr | gather | spread |
Here I will explain how to use the melt()
and dcast()
functions of the reshape2
package. The reason is that stack()
and unstack()
are basic functions so one needs post-processing the result from them. Note that the gather()
and spread()
are tidyr
functions that are also very popular, but that won't be covered in this tutorial.
Another function that you might be able to use is reshape()
, which comes with the stats
package. In fact, don't be confused by this function, as there's also a package called reshape
! This package, together with reshape2
, was developed for those who were struggling with reshape()
. In addition, the function reshape()
seems to have an assumption that the data are longitudinal, which means measurements are repeated through time.
Let's convert the data in table 35 into the long-form. Assume that the table below stores a data.frame
called dat
.
name | gender | year2011 | year2012 | year2013 |
---|---|---|---|---|
Jackson Smith | M | 74.69 | 84.99 | 91.73 |
Emma Williams | F | NA | NA | 75.74 |
Liam Brown | M | 88.24 | NA | 101.85 |
Ava Wilson | F | 88.77 | 96.45 | NA |
babies = data.frame(name= c("Jackson Smith", "Emma Williams", "Liam Brown", "Ava Wilson"), gender = c("M", "F", "M", "F"), year2011= c(74.69, NA, 88.24, 88.77), year2012=c(84.99, NA, NA, 96.45), year2013=c(91.73, 75.74, 101.83, NA)) babies
## name gender year2011 year2012 year2013
## 1 Jackson Smith M 74.69 84.99 91.73
## 2 Emma Williams F NA NA 75.74
## 3 Liam Brown M 88.24 NA 101.83
## 4 Ava Wilson F 88.77 96.45 NA
The question that you want to ask is: "which columns to combine into one column?".
Those would be year2011
, year2012
, year2013
. When you use function melt()
, those columns should be listed for measure.vars
:
melt(babies, measure.vars = c("year2011", "year2012", "year2013"))
## name gender variable value
## 1 Jackson Smith M year2011 74.69
## 2 Emma Williams F year2011 NA
## 3 Liam Brown M year2011 88.24
## 4 Ava Wilson F year2011 88.77
## 5 Jackson Smith M year2012 84.99
## 6 Emma Williams F year2012 NA
## 7 Liam Brown M year2012 NA
## 8 Ava Wilson F year2012 96.45
## 9 Jackson Smith M year2013 91.73
## 10 Emma Williams F year2013 75.74
## 11 Liam Brown M year2013 101.83
## 12 Ava Wilson F year2013 NA
year2011
, year2012
, year2013
are the columns placed in 3rd, 4th, 5th so you can use that:
melt(babies, measure.vars = 3:5)
## name gender variable value
## 1 Jackson Smith M year2011 74.69
## 2 Emma Williams F year2011 NA
## 3 Liam Brown M year2011 88.24
## 4 Ava Wilson F year2011 88.77
## 5 Jackson Smith M year2012 84.99
## 6 Emma Williams F year2012 NA
## 7 Liam Brown M year2012 NA
## 8 Ava Wilson F year2012 96.45
## 9 Jackson Smith M year2013 91.73
## 10 Emma Williams F year2013 75.74
## 11 Liam Brown M year2013 101.83
## 12 Ava Wilson F year2013 NA
The result is in the table above. Note the column names variable
and value
. Table 23 and table 24 showed that the new column names are needed. Function melt()
has default names for new columns- variable
and value
. You can override the default value by using variable.name=
and value.name=
arguement. Let's do this below:
melt(babies, measure.vars=3:5, variable.name="year", value.name="height")
## name gender year height
## 1 Jackson Smith M year2011 74.69
## 2 Emma Williams F year2011 NA
## 3 Liam Brown M year2011 88.24
## 4 Ava Wilson F year2011 88.77
## 5 Jackson Smith M year2012 84.99
## 6 Emma Williams F year2012 NA
## 7 Liam Brown M year2012 NA
## 8 Ava Wilson F year2012 96.45
## 9 Jackson Smith M year2013 91.73
## 10 Emma Williams F year2013 75.74
## 11 Liam Brown M year2013 101.83
## 12 Ava Wilson F year2013 NA
Measurements were specified in the examples above. For those settings you are dealing with, Id
can indirectly specify the measurements. So you can choose columns for id rather than measurements.
The results are the same for the both below. The first one chooses columns for id, the second one chooses columns to be gathered:
melt(babies, id.vars=c("name","gender"))
melt(babies, measure.vars = c("year2011", "year2012", "year2013"))
## name gender variable value
## 1 Jackson Smith M year2011 74.69
## 2 Emma Williams F year2011 NA
## 3 Liam Brown M year2011 88.24
## 4 Ava Wilson F year2011 88.77
## 5 Jackson Smith M year2012 84.99
## 6 Emma Williams F year2012 NA
## 7 Liam Brown M year2012 NA
## 8 Ava Wilson F year2012 96.45
## 9 Jackson Smith M year2013 91.73
## 10 Emma Williams F year2013 75.74
## 11 Liam Brown M year2013 101.83
## 12 Ava Wilson F year2013 NA
If you specify both id.vars
and measure.vars
, columns not included in any of them are excluded from the result. For example, look at the code chunk below:
melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"))
## name variable value
## 1 Jackson Smith year2011 74.69
## 2 Emma Williams year2011 NA
## 3 Liam Brown year2011 88.24
## 4 Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 6 Emma Williams year2012 NA
## 7 Liam Brown year2012 NA
## 8 Ava Wilson year2012 96.45
In addition, na.rm=T
can eliminate NAs:
melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T)
## name variable value
## 1 Jackson Smith year2011 74.69
## 3 Liam Brown year2011 88.24
## 4 Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 8 Ava Wilson year2012 96.45
The dcast()
function does the reverse of melt()
. id
, variable
, and value
are the columns of the longest-form. dcast(data, id ~ variable, value.var="value")
takes the longest-form into the wide-form.
babiesLong <- melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T)
babiesLong
## name variable value
## 1 Jackson Smith year2011 74.69
## 3 Liam Brown year2011 88.24
## 4 Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 8 Ava Wilson year2012 96.45
babiesLong
has name for id
, variable for variable
, and value for value
. So you can use dcast()
to make it into long-form:
dcast(babiesLong, name ~ variable, value.var="value")
## name year2011 year2012
## 1 Ava Wilson 88.77 96.45
## 2 Jackson Smith 74.69 84.99
## 3 Liam Brown 88.24 NA
Conclusion
That seems about it. In this tutorial, you focused on understanding the logic and the process of reshaping data. Usually, long-form is preferred for analysis but there are execptions. Data presenation in wide-form can save space when it is appropriate but as the number of missing values grow, you would better convert to the long-form.
For format conversion, you need to focus on what are the measurements. The motto for the long-form is one measurement per one row and the wide-form has many measurements in one row. When you make measurements throughout columns gathered in one column, the original column names need to be preserved in one additional column. When you make measurements in one column spreaded out throughout columns, you need to figure out what the column names should be from variable
column.
Once you understand the process, using functions such as dcast()
and melt()
becomes a lot easier!
Relevant Courses
Course
Intermediate R
Course
Introduction to Importing Data in R
tutorial
Data Reshaping in R Tutorial
Olivia Smith
7 min
tutorial
Merging Datasets in R
tutorial
Data Types in R
tutorial
Introduction to Data frames in R
Ryan Sheehy
5 min
tutorial