There is a time when people need to convert data in the long format (you call it "the long-form" and what it means will be clear later) to the wide format and the vice versa. The data in real life comes in all sorts of format, but most R functions are designed for one specific form.
lme4::lmer() for mixed linear modeling,
ggplot2::ggplot() for visualization are deisgned for long-form. But there are other functions designed for wide-form such as
mosaicplot(). That means that you need to convert the form to do analysis or visualization unless the data is already processed and in the right form.
There are several options in R for format coversion but it can become quite confusing when you're just starting out.
This tutorial will show you how to understand the mechanics of converting the data between the long-form and the wide-form. You'll explore the following topics:
- You'll learn more about the long form.
- Then, you'll get to know the wide form.
- You'll recap the differences between the long and wide formats;
- After which you'll see which steps you'd need to undertake to go from one form to another.
- Lastly, you'll implement all of this with the help of R. You'll learn how to go from a long format to a wide format with the
This table shows the height of babies in a small local area. The heights were measured from the year 2011 to the year 2015. As you might already know,
NA means ‘Not Available.’ From the position of NA’s, you can figure out the babies’ ages, but the exact age of a baby is not available from the table. The table looks well-organized, but for anyone who wants to study the growth of babies, there is a problem: the important information is not the absolute time at which a baby is measured but the age of a baby when he or she was measured. Could you insert this information into the table? Below, you can see one way:
The number of
NA's has increased. ## Wide Form Data But what would you do if you wanted to convey how many months a baby was old? It would be something like this:
There are tons of
NA’s in the table above. If the babies differ in their relative measurement times, the wide form can waste a lot of space. Another option is to include the ages of the babies when it was the year 2015, but that is only possible when the absolute measurement times are all the same. If the babies differ in their measurement time by months and you want to know the ages in months, it does not work. Then this example seems to be the only way. In this case, you can consider using long-form. ## Data in Long Form Whereas all measurements from one baby were shown in one row in wide-form, you'll only see one measurement from a baby in one row in long-form. The simplest way to obtain the latter would be displaying a child's name, the measurement time, and the measurement (height) in one row. This is exactly what this table shows:
If you arrange the data sorting the names, you'd get something like the table you see below. Obviously, this table occupies smaller space than Table 3, which you can see above!
Long versus Wide
Note again what the fundamental differences are between the wide-form and the long-form: one simple difference is that the wide-form displays many measurements from one individual in one row and the column names show what the measurements are.
This table shows Liam Brown’s heights in the year 2011-2015 in one row:
In long-form Liam Brown's heights are shown in one column:
As you can see, you can not know what the measurements are. If you write down the column name, it would look like this:
|name||gender||year2011, 2012, 2013, 2014, 2015|
But this one has its own problem. Sophia Johnson’s height was not measured in the year 2011-2014. So as you can see below, the column name cannot be set easily:
From Wide-Form to Long-Form
Converting the wide-form into the long-form can be thought of as a step-by-step process. Before converting the measurements in one row into one column, you can make the table in such a way that it contains only one measurement in each row. Let's do that for this table:
The result is like the table below:
Before gathering those measurements scattered through columns into one column, you need to make an additional column that says what the measurements are, just like below:
The column name is usually either
“key”. You can notice in the above table that there are
“age5” in both column names and one column. If you omit the column names, the table will be like below.
Since there is only one measurement in each row, you can gather them in one column and the result will be the following:
Let's make new column name for all the measurements. Usually, it is named as "value":
Let's look at the above table. Do you notice that "age" is repeating? You can omit repeating entity, which will give you the following result:
Age, which is repeating in every row is now a column name. Column name
value in the table above can be more specific. You replaced it with
height, which is more revealing. You can see the result in the table below:
Let's compare Table 15 with Table 17.
value, one of the column names of Table 15, can mean anything, but
height of Table 17 has very specific meaning: it can only designate the height whereas the value in Table 15 could be any value like height, weight, speed or color. Another way would be like the following:
Let's consider the situation in which the measurements are diverse. If the measurements are height, weight and BMI, it will be like table below. But in this case, the column
value has numerical values in the different scale so it could be confusing reading them. In this table, height was measured in centimeter, and weight in kilogram:
So when converting the wide-form into the long-form, one needs to decide which values will be gathered into one column. After that, you can decide the column name.
Let's summarize what is been said so far: to convert the wide-from into the long-form one need to decide which columns to modify that they are in one column. In doing so, many columns are being combined into one column. So you need a column name for that column, which you can see in the next tables:
Let's look at the columns
id2 in the above table.
id2 does not specify an individual but it is the attribute(gender) of an individual. In other words, you can say it is a measurement. So you can convert it into the longer-form. The longest-form is made up of three columns, which are
The above table shows the longest-form from table 25. But different people can have the same name.
In that case, long-form can be confusing since
id can not specify one person. You can put a number behind the name to ditinguish people(for example, using R function
make.unique()), but you can not differentiate the name that is altered and the original name with a number. Think of a person whose name is "John.2" for real!.
The obvious way around this is to assign a different number ID for a different person. In that case, names are another measurements.
The next table shows that even the names are considered as measurements. In a sense, this is the true longest-form.
The longest-form is the easiest form for making a wide-form. If you reverse the process of converting the wide-form into the long-form, which is shown in tables 20 to 25, you get to the wide-form. The next tables below show this process:
The key factor in converting the wide-form into the long-form is deciding which columns will be gathered into one column.
But it does not have to be just one column. You could use two columns if that is necessary, just like in the table below:
Can you make it into the long-form?
w2012 mean heights and weights in the year 2011 and 2012.
weight are measured in the different scales - cm, kg. So you can use only one column for both height and weight or you can use two columns, just like in this table:
Is this the long-form or the wide-form?
From Long to Wide Data in R
Let's do the coversion between the long-form and the wide-form with R. There are many functions and packages available for this task.
|reshape||stats||reshape(direction="long", ...)||reshape(direction="wide", ...)|
Here I will explain how to use the
dcast() functions of the
reshape2 package. The reason is that
unstack() are basic functions so one needs post-processing the result from them. Note that the
tidyr functions that are also very popular, but that won't be covered in this tutorial.
Another function that you might be able to use is
reshape(), which comes with the
stats package. In fact, don't be confused by this function, as there's also a package called
reshape! This package, together with
reshape2, was developed for those who were struggling with
reshape(). In addition, the function
reshape() seems to have an assumption that the data are longitudinal, which means measurements are repeated through time.
Let's convert the data in table 35 into the long-form. Assume that the table below stores a
babies = data.frame(name= c("Jackson Smith", "Emma Williams", "Liam Brown", "Ava Wilson"), gender = c("M", "F", "M", "F"), year2011= c(74.69, NA, 88.24, 88.77), year2012=c(84.99, NA, NA, 96.45), year2013=c(91.73, 75.74, 101.83, NA)) babies
## name gender year2011 year2012 year2013 ## 1 Jackson Smith M 74.69 84.99 91.73 ## 2 Emma Williams F NA NA 75.74 ## 3 Liam Brown M 88.24 NA 101.83 ## 4 Ava Wilson F 88.77 96.45 NA
The question that you want to ask is: "which columns to combine into one column?".
Those would be
year2013. When you use function
melt(), those columns should be listed for
melt(babies, measure.vars = c("year2011", "year2012", "year2013")) ## name gender variable value ## 1 Jackson Smith M year2011 74.69 ## 2 Emma Williams F year2011 NA ## 3 Liam Brown M year2011 88.24 ## 4 Ava Wilson F year2011 88.77 ## 5 Jackson Smith M year2012 84.99 ## 6 Emma Williams F year2012 NA ## 7 Liam Brown M year2012 NA ## 8 Ava Wilson F year2012 96.45 ## 9 Jackson Smith M year2013 91.73 ## 10 Emma Williams F year2013 75.74 ## 11 Liam Brown M year2013 101.83 ## 12 Ava Wilson F year2013 NA
year2013 are the columns placed in 3rd, 4th, 5th so you can use that:
melt(babies, measure.vars = 3:5) ## name gender variable value ## 1 Jackson Smith M year2011 74.69 ## 2 Emma Williams F year2011 NA ## 3 Liam Brown M year2011 88.24 ## 4 Ava Wilson F year2011 88.77 ## 5 Jackson Smith M year2012 84.99 ## 6 Emma Williams F year2012 NA ## 7 Liam Brown M year2012 NA ## 8 Ava Wilson F year2012 96.45 ## 9 Jackson Smith M year2013 91.73 ## 10 Emma Williams F year2013 75.74 ## 11 Liam Brown M year2013 101.83 ## 12 Ava Wilson F year2013 NA
The result is in the table above. Note the column names
value. Table 23 and table 24 showed that the new column names are needed. Function
melt() has default names for new columns-
value. You can override the default value by using
value.name= arguement. Let's do this below:
melt(babies, measure.vars=3:5, variable.name="year", value.name="height") ## name gender year height ## 1 Jackson Smith M year2011 74.69 ## 2 Emma Williams F year2011 NA ## 3 Liam Brown M year2011 88.24 ## 4 Ava Wilson F year2011 88.77 ## 5 Jackson Smith M year2012 84.99 ## 6 Emma Williams F year2012 NA ## 7 Liam Brown M year2012 NA ## 8 Ava Wilson F year2012 96.45 ## 9 Jackson Smith M year2013 91.73 ## 10 Emma Williams F year2013 75.74 ## 11 Liam Brown M year2013 101.83 ## 12 Ava Wilson F year2013 NA
Measurements were specified in the examples above. For those settings you are dealing with,
Id can indirectly specify the measurements. So you can choose columns for id rather than measurements.
The results are the same for the both below. The first one chooses columns for id, the second one chooses columns to be gathered:
melt(babies, id.vars=c("name","gender")) melt(babies, measure.vars = c("year2011", "year2012", "year2013")) ## name gender variable value ## 1 Jackson Smith M year2011 74.69 ## 2 Emma Williams F year2011 NA ## 3 Liam Brown M year2011 88.24 ## 4 Ava Wilson F year2011 88.77 ## 5 Jackson Smith M year2012 84.99 ## 6 Emma Williams F year2012 NA ## 7 Liam Brown M year2012 NA ## 8 Ava Wilson F year2012 96.45 ## 9 Jackson Smith M year2013 91.73 ## 10 Emma Williams F year2013 75.74 ## 11 Liam Brown M year2013 101.83 ## 12 Ava Wilson F year2013 NA
If you specify both
measure.vars, columns not included in any of them are excluded from the result. For example, look at the code chunk below:
melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012")) ## name variable value ## 1 Jackson Smith year2011 74.69 ## 2 Emma Williams year2011 NA ## 3 Liam Brown year2011 88.24 ## 4 Ava Wilson year2011 88.77 ## 5 Jackson Smith year2012 84.99 ## 6 Emma Williams year2012 NA ## 7 Liam Brown year2012 NA ## 8 Ava Wilson year2012 96.45
na.rm=T can eliminate NAs:
melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T) ## name variable value ## 1 Jackson Smith year2011 74.69 ## 3 Liam Brown year2011 88.24 ## 4 Ava Wilson year2011 88.77 ## 5 Jackson Smith year2012 84.99 ## 8 Ava Wilson year2012 96.45
dcast() function does the reverse of
value are the columns of the longest-form.
dcast(data, id ~ variable, value.var="value") takes the longest-form into the wide-form.
babiesLong <- melt(babies, id.vars=c("name"), measure.vars=c("year2011", "year2012"), na.rm=T) babiesLong ## name variable value ## 1 Jackson Smith year2011 74.69 ## 3 Liam Brown year2011 88.24 ## 4 Ava Wilson year2011 88.77 ## 5 Jackson Smith year2012 84.99 ## 8 Ava Wilson year2012 96.45
babiesLong has name for
id, variable for
variable, and value for
value. So you can use
dcast() to make it into long-form:
dcast(babiesLong, name ~ variable, value.var="value") ## name year2011 year2012 ## 1 Ava Wilson 88.77 96.45 ## 2 Jackson Smith 74.69 84.99 ## 3 Liam Brown 88.24 NA
That seems about it. In this tutorial, you focused on understanding the logic and the process of reshaping data. Usually, long-form is preferred for analysis but there are execptions. Data presenation in wide-form can save space when it is appropriate but as the number of missing values grow, you would better convert to the long-form.
For format conversion, you need to focus on what are the measurements. The motto for the long-form is one measurement per one row and the wide-form has many measurements in one row. When you make measurements throughout columns gathered in one column, the original column names need to be preserved in one additional column. When you make measurements in one column spreaded out throughout columns, you need to figure out what the column names should be from
Once you understand the process, using functions such as
melt() becomes a lot easier!
Introduction to R
How to Become a Data Scientist in 8 StepsFind out everything you need to know about becoming a data scientist, and find out whether it’s the right career for you!
Predicting FIFA World Cup Qatar 2022 WinnersLearn to use Elo ratings to quantify national soccer team performance, and see how the model can be used to predict the winner of FIFA World Cup Qatar 2022.
Arne Warnke •
How Data Science is Changing SoccerWith the Fifa 2022 World Cup upon us, learn about the most widely used data science use-cases in soccer.
Regular Expressions Cheat SheetRegular expressions (regex or regexp) are a pattern of characters that describe an amount of text. Regular expressions are one of the most widely used tools in natural language processing and allow you to supercharge common text data manipulation tasks. Use this cheat sheet as a handy reminder when working with regular expressions.
DataCamp Team •
ggplot2 Cheat Sheetggplot2 is considered to be one of the most robust data visualization packages in any programming language. Use this cheat sheet to guide your ggplot2 learning journey.
DataCamp Team •