Tutorials
r programming
+1

Long to Wide Data in R

Learn why you would transform your data from a long to a wide format and vice versa and explore how to do this in R with melt() and dcast()!

There is a time when people need to convert data in the long format (you call it "the long-form" and what it means will be clear later) to the wide format and the vice versa. The data in real life comes in all sorts of format, but most R functions are designed for one specific form.

For instance, lme4::lmer() for mixed linear modeling, ggplot2::ggplot() for visualization are deisgned for long-form. But there are other functions designed for wide-form such as mosaicplot(). That means that you need to convert the form to do analysis or visualization unless the data is already processed and in the right form.

There are several options in R for format coversion but it can become quite confusing when you're just starting out.

This tutorial will show you how to understand the mechanics of converting the data between the long-form and the wide-form. You'll explore the following topics:

  • You'll learn more about the long form.
  • Then, you'll get to know the wide form.
  • You'll recap the differences between the long and wide formats;
  • After which you'll see which steps you'd need to undertake to go from one form to another.
  • Lastly, you'll implement all of this with the help of R. You'll learn how to go from a long format to a wide format with the melt() and dcast() functions.

The Data

Table 01
name gender year2011 year2012 year2013 year2014 year2015
Jackson Smith M 74.69 84.99 91.73 105.11 111.04
Sophia Johnson F NA NA NA NA 75.89
Emma Williams F NA NA 75.74 86.50 91.50
Aiden Jones M NA NA NA 71.89 81.42
Liam Brown M 88.24 96.91 101.85 108.13 112.45
Lucas Davis M 70.60 83.78 94.17 100.03 106.35
Olivia Miller F 64.78 80.76 87.30 97.13 103.80
Ava Wilson F 88.77 96.45 104.72 112.84 NA

This table shows the height of babies in a small local area. The heights were measured from the year 2011 to the year 2015. As you might already know, NA means ‘Not Available.’ From the position of NA’s, you can figure out the babies’ ages, but the exact age of a baby is not available from the table. The table looks well-organized, but for anyone who wants to study the growth of babies, there is a problem: the important information is not the absolute time at which a baby is measured but the age of a baby when he or she was measured. Could you insert this information into the table? Below, you can see one way:

Table 02
name gender age0 age1 age2 age3 age4 age5 age6
Jackson Smith M NA 76.69 84.91 97.06 105.73 107.32 NA
Sophia Johnson F NA 76.05 NA NA NA NA NA
Emma Williams F 72.65 82.46 91.76 NA NA NA NA
Aiden Jones M 69.76 79.89 NA NA NA NA NA
Liam Brown M NA NA 85.34 93.22 105.78 105.84 114.82
Lucas Davis M 71.59 84.87 92.10 100.73 104.21 NA NA
Olivia Miller F 62.74 81.39 92.02 98.32 106.44 NA NA
Ava Wilson F NA NA 88.39 100.67 107.58 111.52 NA

The number of NA's has increased. ## Wide Form Data But what would you do if you wanted to convey how many months a baby was old? It would be something like this:

Table 03
name gender m4 m6 m7 m10 m12 ... m55 m60 m65 m72
Jackson Smith M NA NA NA NA 76.69 ... NA 107.32 NA NA
Sophia Johnson F NA NA NA NA 76.05 ... NA NA NA NA
Emma Williams F NA NA NA 72.65 NA ... NA NA NA NA
Aiden Jones M NA 69.76 NA NA NA ... NA NA NA NA
Liam Brown M NA NA NA NA NA ... NA 105.84 NA 114.82
Lucas Davis M NA NA 71.59 NA NA ... 104.21 NA NA NA
Olivia Miller F 62.74 NA NA NA NA ... NA NA NA NA
Ava Wilson F NA NA NA NA NA ... NA NA 111.52 NA

There are tons of NA’s in the table above. If the babies differ in their relative measurement times, the wide form can waste a lot of space. Another option is to include the ages of the babies when it was the year 2015, but that is only possible when the absolute measurement times are all the same. If the babies differ in their measurement time by months and you want to know the ages in months, it does not work. Then this example seems to be the only way. In this case, you can consider using long-form. ## Data in Long Form Whereas all measurements from one baby were shown in one row in wide-form, you'll only see one measurement from a baby in one row in long-form. The simplest way to obtain the latter would be displaying a child's name, the measurement time, and the measurement (height) in one row. This is exactly what this table shows:

Table 04
name gender month height
Olivia Miller F 4 62.74
Aiden Jones M 6 69.76
Lucas Davis M 7 71.59
Emma Williams F 10 72.65
Jackson Smith M 12 76.69
... ... ... ...
Liam Brown M 72 114.82

If you arrange the data sorting the names, you'd get something like the table you see below. Obviously, this table occupies smaller space than Table 3, which you can see above!

Table 05
name gender month height
Jackson Smith M 12 76.69
Jackson Smith M 24 84.91
Jackson Smith M 36 97.06
Jackson Smith M 48 105.73
Jackson Smith M 60 107.32
Sophia Johnson F 12 76.05
Emma Williams F 10 72.65
... ... ... ...
Ava Wilson F 65 111.52

Long versus Wide

Note again what the fundamental differences are between the wide-form and the long-form: one simple difference is that the wide-form displays many measurements from one individual in one row and the column names show what the measurements are.

This table shows Liam Brown’s heights in the year 2011-2015 in one row:

Table 06
name gender year2011 year2012 year2013 year2014 year2015
Liam Brown M 88.24 96.91 101.85 108.13 112.45
In long-form Liam Brown's heights are shown in one column:
Table 07
name gender _
Liam Brown M 85.34
Liam Brown M 93.22
Liam Brown M 105.78
Liam Brown M 105.84
Liam Brown M 114.82
As you can see, you can not know what the measurements are. If you write down the column name, it would look like this:
Table 08
name gender year2011, 2012, 2013, 2014, 2015
Liam Brown M 85.34
Liam Brown M 93.22
Liam Brown M 105.78
Liam Brown M 105.84
Liam Brown M 114.82
But this one has its own problem. Sophia Johnson’s height was not measured in the year 2011-2014. So as you can see below, the column name cannot be set easily:
Table 09
name gender ?
Liam Brown M 85.34
Liam Brown M 93.22
Liam Brown M 105.78
Liam Brown M 105.84
Liam Brown M 114.82
Sophia Johnson F 76.05

From Wide-Form to Long-Form

Converting the wide-form into the long-form can be thought of as a step-by-step process. Before converting the measurements in one row into one column, you can make the table in such a way that it contains only one measurement in each row. Let's do that for this table:

Table 10
name gender age0 age1 age2 age3 age4 age5 age6
Emma Williams F 72.65 82.46 91.76 NA NA NA NA
Aiden Jones M 69.76 79.89 NA NA NA NA NA

The result is like the table below:

Table 11
name gender age0 age1 age2 age3 age4 age5 age6
Emma Williams F 72.65 NA NA NA NA
Emma Williams F 82.46 NA
Emma Williams F 91.76
Aiden Jones M 69.76 NA NA NA NA NA
Aiden Jones M 79.89

Before gathering those measurements scattered through columns into one column, you need to make an additional column that says what the measurements are, just like below:

Table 12
name gender measure age0 age1 age2 age3 age4 age5 age6
Emma Williams F age0 72.65 NA NA NA NA
Emma Williams F age1 82.46 NA
Emma Williams F age2 91.76
Aiden Jones M age0 69.76 NA NA NA NA NA
Aiden Jones M age1 79.89

The column name is usually either “measure” or “key”. You can notice in the above table that there are “age0”, “age1”, “age2”, “age3”, “age4”, “age5” in both column names and one column. If you omit the column names, the table will be like below.

Table 13
name gender measure _ __ ___ ____ _____ ______ _______
Emma Williams F age0 72.65 NA NA NA NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76 NA NA NA NA NA
Aiden Jones M age1 79.89

Since there is only one measurement in each row, you can gather them in one column and the result will be the following:

Table 14
name gender measure
Emma Williams F age0 72.65
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89

Let's make new column name for all the measurements. Usually, it is named as "value":

Table 15
name gender measure value
Emma Williams F age0 72.65
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89

Let's look at the above table. Do you notice that "age" is repeating? You can omit repeating entity, which will give you the following result:

Table 16
name gender age value
Emma Williams F 0 72.65
Emma Williams F 1 82.46
Emma Williams F 2 91.76
Aiden Jones M 0 69.76
Aiden Jones M 1 79.89

Age, which is repeating in every row is now a column name. Column name value in the table above can be more specific. You replaced it with height, which is more revealing. You can see the result in the table below:

Table 17
name gender age height
Emma Williams F 0 72.65
Emma Williams F 1 82.46
Emma Williams F 2 91.76
Aiden Jones M 0 69.76
Aiden Jones M 1 79.89

Let's compare Table 15 with Table 17. value, one of the column names of Table 15, can mean anything, but height of Table 17 has very specific meaning: it can only designate the height whereas the value in Table 15 could be any value like height, weight, speed or color. Another way would be like the following:

Table 18
name gender measure value
Emma Williams F height.age0 72.65
Emma Williams F height.age1 82.46
Emma Williams F height.age2 91.76
Aiden Jones M height.age0 69.76
Aiden Jones M height.age1 79.89

Let's consider the situation in which the measurements are diverse. If the measurements are height, weight and BMI, it will be like table below. But in this case, the column value has numerical values in the different scale so it could be confusing reading them. In this table, height was measured in centimeter, and weight in kilogram:

Table 19
name gender measure value
Emma Williams F height 72.65
Emma Williams F weight 8.22
Emma Williams F BMI 22.60
Aiden Jones M height 69.76
Aiden Jones M weight 8.51

So when converting the wide-form into the long-form, one needs to decide which values will be gathered into one column. After that, you can decide the column name.

Let's summarize what is been said so far: to convert the wide-from into the long-form one need to decide which columns to modify that they are in one column. In doing so, many columns are being combined into one column. So you need a column name for that column, which you can see in the next tables:

Table 20
id1 id2 age0 age1 age2
Emma Williams F NA 82.46 91.76
Aiden Jones M 69.76 79.89 NA
Emma Williams F NA 82.46 91.76
Table 21
id1 id2 age0 age1 age2
Emma Williams F NA
Emma Williams F 82.46
Emma Williams F 91.76
Aiden Jones M 69.76
Aiden Jones M 79.89
Aiden Jones M NA
Emma Williams F NA
Emma Williams F 82.46
Emma Williams F 91.76
Table 22
id1 id2 age_ age0 age1 age2
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89
Aiden Jones M age2 NA
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Table 23
id1 id2 age_
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89
Aiden Jones M age2 NA
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Table 24
id1 id2 key value
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89
Aiden Jones M age2 NA
Emma Williams F age0 NA
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Table 25
id1 id2 key value
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Aiden Jones M age0 69.76
Aiden Jones M age1 79.89
Emma Williams F age1 82.46
Emma Williams F age2 91.76
Let's look at the columns id1 and id2 in the above table. id2 does not specify an individual but it is the attribute(gender) of an individual. In other words, you can say it is a measurement. So you can convert it into the longer-form. The longest-form is made up of three columns, which are id, key and value.

Table 26
id key value
Emma Williams gender F
Emma Williams age1 82.46
Emma Williams age2 91.76
Aiden Jones gender F
Aiden Jones age0 69.76
Aiden Jones age1 79.89
Emma Williams gender F
Emma Williams age1 82.46
Emma Williams age2 91.76

The above table shows the longest-form from table 25. But different people can have the same name.

In that case, long-form can be confusing since id can not specify one person. You can put a number behind the name to ditinguish people(for example, using R function make.unique()), but you can not differentiate the name that is altered and the original name with a number. Think of a person whose name is "John.2" for real!.

The obvious way around this is to assign a different number ID for a different person. In that case, names are another measurements.

The next table shows that even the names are considered as measurements. In a sense, this is the true longest-form.

Table 27
id key value
1 name Emma Williams
1 gender F
1 age1 82.46
1 age2 91.76
2 name Aiden Jones
2 gender F
2 age0 69.76
2 age1 79.89
3 name Emma Williams
3 gender F
3 age1 82.46
3 age2 91.76

The longest-form is the easiest form for making a wide-form. If you reverse the process of converting the wide-form into the long-form, which is shown in tables 20 to 25, you get to the wide-form. The next tables below show this process:

Table 28. Measurements in key column are repeated on the column name
id key name gender age0 age1 age2
1 name Emma Williams
1 gender F
1 age1 82.46
1 age2 91.76
2 name Aiden Jones
2 gender F
2 age0 69.76
2 age1 79.89
3 name Emma Williams
3 gender F
3 age1 82.46
3 age2 91.76
Table 29. key column can be erased
id name gender age0 age1 age2
1 Emma Williams
1 F
1 82.46
1 91.76
2 Aiden Jones
2 F
2 69.76
2 79.89
3 Emma Williams
3 F
3 82.46
3 91.76
Table 30. The same id can be gathered into one row
id name gender age0 age1 age2
1 Emma Williams F 82.46 91.76
1
1
1
2 Aiden Jones F 69.76 79.89
2
2
2
3 Emma Williams F 82.46 91.76
3
3
3
Table 31. Delete any row that has no measurements
id name gender age0 age1 age2
1 Emma Williams F 82.46 91.76
2 Aiden Jones F 69.76 79.89
3 Emma Williams F 82.46 91.76

The key factor in converting the wide-form into the long-form is deciding which columns will be gathered into one column.

But it does not have to be just one column. You could use two columns if that is necessary, just like in the table below:

Table 32
name gender h2011 h2012 w2011 w2012
Jackson Smith M 74.69 84.99 9.60 12.0
Olivia Miller F NA 80.76 7.15 10.7
Ava Wilson F 88.77 96.45 NA 15.0

Can you make it into the long-form?

Columns h2011, h2012, w2011, w2012 mean heights and weights in the year 2011 and 2012. height and weight are measured in the different scales - cm, kg. So you can use only one column for both height and weight or you can use two columns, just like in this table:

Table 33
name gender year height weight
Jackson Smith M 2011 74.69 9.60
Jackson Smith M 2012 84.99 12.00
Olivia Miller F 2011 NA 80.76
Olivia Miller F 2012 7.15 10.70
Ava Wilson F 2011 88.77 96.45
Ava Wilson F 2012 NA 15.00

Is this the long-form or the wide-form?

From Long to Wide Data in R

Let's do the coversion between the long-form and the wide-form with R. There are many functions and packages available for this task.

Table 34
func package to_long_form to_wide_form
stack/unstack utils stack unstack
reshape stats reshape(direction="long", ...) reshape(direction="wide", ...)
melt/dcast reshape2 melt dcast
gather/spread tidyr gather spread

Here I will explain how to use the melt() and dcast() functions of the reshape2 package. The reason is that stack() and unstack() are basic functions so one needs post-processing the result from them. Note that the gather() and spread() are tidyr functions that are also very popular, but that won't be covered in this tutorial.

Another function that you might be able to use is reshape(), which comes with the stats package. In fact, don't be confused by this function, as there's also a package called reshape! This package, together with reshape2, was developed for those who were struggling with reshape(). In addition, the function reshape() seems to have an assumption that the data are longitudinal, which means measurements are repeated through time.

Let's convert the data in table 35 into the long-form. Assume that the table below stores a data.frame called dat.

Table 35
name gender year2011 year2012 year2013
Jackson Smith M 74.69 84.99 91.73
Emma Williams F NA NA 75.74
Liam Brown M 88.24 NA 101.85
Ava Wilson F 88.77 96.45 NA

babies = data.frame(name= c("Jackson Smith", "Emma Williams", "Liam Brown", "Ava Wilson"), gender = c("M", "F", "M", "F"), year2011= c(74.69, NA, 88.24, 88.77), year2012=c(84.99, NA, NA, 96.45), year2013=c(91.73, 75.74, 101.83, NA)) babies

##            name gender year2011 year2012 year2013
## 1 Jackson Smith      M    74.69    84.99    91.73
## 2 Emma Williams      F       NA       NA    75.74
## 3    Liam Brown      M    88.24       NA   101.83
## 4    Ava Wilson      F    88.77    96.45       NA

The question that you want to ask is: "which columns to combine into one column?".

Those would be year2011, year2012, year2013. When you use function melt(), those columns should be listed for measure.vars:

melt(babies, measure.vars = c("year2011", "year2012", "year2013"))

##             name gender variable  value
## 1  Jackson Smith      M year2011  74.69
## 2  Emma Williams      F year2011     NA
## 3     Liam Brown      M year2011  88.24
## 4     Ava Wilson      F year2011  88.77
## 5  Jackson Smith      M year2012  84.99
## 6  Emma Williams      F year2012     NA
## 7     Liam Brown      M year2012     NA
## 8     Ava Wilson      F year2012  96.45
## 9  Jackson Smith      M year2013  91.73
## 10 Emma Williams      F year2013  75.74
## 11    Liam Brown      M year2013 101.83
## 12    Ava Wilson      F year2013     NA

year2011, year2012, year2013 are the columns placed in 3rd, 4th, 5th so you can use that:

melt(babies, measure.vars = 3:5)

##             name gender variable  value
## 1  Jackson Smith      M year2011  74.69
## 2  Emma Williams      F year2011     NA
## 3     Liam Brown      M year2011  88.24
## 4     Ava Wilson      F year2011  88.77
## 5  Jackson Smith      M year2012  84.99
## 6  Emma Williams      F year2012     NA
## 7     Liam Brown      M year2012     NA
## 8     Ava Wilson      F year2012  96.45
## 9  Jackson Smith      M year2013  91.73
## 10 Emma Williams      F year2013  75.74
## 11    Liam Brown      M year2013 101.83
## 12    Ava Wilson      F year2013     NA

The result is in the table above. Note the column names variable and value. Table 23 and table 24 showed that the new column names are needed. Function melt() has default names for new columns- variable and value. You can override the default value by using variable.name= and value.name= arguement. Let's do this below:

melt(babies, measure.vars=3:5, variable.name="year", value.name="height")

##             name gender     year height
## 1  Jackson Smith      M year2011  74.69
## 2  Emma Williams      F year2011     NA
## 3     Liam Brown      M year2011  88.24
## 4     Ava Wilson      F year2011  88.77
## 5  Jackson Smith      M year2012  84.99
## 6  Emma Williams      F year2012     NA
## 7     Liam Brown      M year2012     NA
## 8     Ava Wilson      F year2012  96.45
## 9  Jackson Smith      M year2013  91.73
## 10 Emma Williams      F year2013  75.74
## 11    Liam Brown      M year2013 101.83
## 12    Ava Wilson      F year2013     NA

Measurements were specified in the examples above. For those settings you are dealing with, Id can indirectly specify the measurements. So you can choose columns for id rather than measurements.

The results are the same for the both below. The first one chooses columns for id, the second one chooses columns to be gathered:

melt(babies, id.vars=c("name","gender"))

melt(babies, measure.vars = c("year2011", "year2012", "year2013"))

##             name gender variable  value
## 1  Jackson Smith      M year2011  74.69
## 2  Emma Williams      F year2011     NA
## 3     Liam Brown      M year2011  88.24
## 4     Ava Wilson      F year2011  88.77
## 5  Jackson Smith      M year2012  84.99
## 6  Emma Williams      F year2012     NA
## 7     Liam Brown      M year2012     NA
## 8     Ava Wilson      F year2012  96.45
## 9  Jackson Smith      M year2013  91.73
## 10 Emma Williams      F year2013  75.74
## 11    Liam Brown      M year2013 101.83
## 12    Ava Wilson      F year2013     NA

If you specify both id.vars and measure.vars, columns not included in any of them are excluded from the result. For example, look at the code chunk below:

melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"))

##            name variable value
## 1 Jackson Smith year2011 74.69
## 2 Emma Williams year2011    NA
## 3    Liam Brown year2011 88.24
## 4    Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 6 Emma Williams year2012    NA
## 7    Liam Brown year2012    NA
## 8    Ava Wilson year2012 96.45

In addition, na.rm=T can eliminate NAs:

melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T)

##            name variable value
## 1 Jackson Smith year2011 74.69
## 3    Liam Brown year2011 88.24
## 4    Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 8    Ava Wilson year2012 96.45

The dcast() function does the reverse of melt(). id, variable, and value are the columns of the longest-form. dcast(data, id ~ variable, value.var="value") takes the longest-form into the wide-form.

babiesLong <- melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T)
babiesLong

##            name variable value
## 1 Jackson Smith year2011 74.69
## 3    Liam Brown year2011 88.24
## 4    Ava Wilson year2011 88.77
## 5 Jackson Smith year2012 84.99
## 8    Ava Wilson year2012 96.45

babiesLong has name for id, variable for variable, and value for value. So you can use dcast() to make it into long-form:

dcast(babiesLong, name ~ variable, value.var="value")

##            name year2011 year2012
## 1    Ava Wilson    88.77    96.45
## 2 Jackson Smith    74.69    84.99
## 3    Liam Brown    88.24       NA

Conclusion

That seems about it. In this tutorial, you focused on understanding the logic and the process of reshaping data. Usually, long-form is preferred for analysis but there are execptions. Data presenation in wide-form can save space when it is appropriate but as the number of missing values grow, you would better convert to the long-form.

For format conversion, you need to focus on what are the measurements. The motto for the long-form is one measurement per one row and the wide-form has many measurements in one row. When you make measurements throughout columns gathered in one column, the original column names need to be preserved in one additional column. When you make measurements in one column spreaded out throughout columns, you need to figure out what the column names should be from variable column.

Once you understand the process, using functions such as dcast() and melt() becomes a lot easier!

Want to leave a comment?