Joining DataFrames in pandas Tutorial
• September 19, 2018 • 14 min read

Have you ever tried solving a Kaggle challenge? If yes, you might have noticed that in most of the challenges, the data provided to you is present in multiple files, with some of the columns present in more than one files. Well, what is the first thing that comes to your mind? Join them of course!
Joining and merging DataFrames is the core process to start with data analysis and machine learning tasks. It is one of the toolkits which every Data Analyst or Data Scientist should master because in almost all the cases data comes from multiple source and files. You may need to bring all the data in one place by some sort of join logic and then start your analysis. People who work with SQL like query languages might know the importance of this task. Even if you want to build some machine learning models on some data, you may need to merge multiple csv
files together in a single DataFrame.
Thankfully you have the most popular library in python, pandas
to your rescue! pandas
provides various facilities for easily combining together Series, DataFrames, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
In this tutorial, you will practice a few standard techniques. More specifically, you will learn to:
- Concatenate DataFrames along row and column.
- Merge DataFrames on specific keys by different join logics like left-join, inner-join, etc.
- Time-series friendly merging provided in pandas
Along the way, you will also learn a few tricks which you require before and after joining.
Concatenate DataFrames
Start by importing the library you will be using throughout the tutorial: pandas
import pandas as pd
You will be performing all the operations in this tutorial on the dummy DataFrames that you will create. To create a DataFrame you can use python dictionary like:
dummy_data1 = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
Here the keys of the dictionary dummy_data1
are the column names and the values in the list are the data corresponding to each observation or row. To transform this into a pandas DataFrame, you will use the DataFrame()
function of pandas, along with its columns
argument to name your columns:
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])
df1
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
As you can notice, you now have a DataFrame with 3 columns id
, Feature1
, and Feature2
. There is an additional un-named column which pandas intrinsically creates as the row labels. Similar to the previous DataFrame df1
, you will create two more DataFrames df2
and df3
:
dummy_data2 = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])
df2
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
dummy_data3 = {
'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])
df3
id | Feature3 | |
---|---|---|
0 | 1 | 12 |
1 | 2 | 13 |
2 | 3 | 14 |
3 | 4 | 15 |
4 | 5 | 16 |
5 | 7 | 17 |
6 | 8 | 15 |
7 | 9 | 12 |
8 | 10 | 13 |
9 | 11 | 23 |
To simply concatenate the DataFrames along the row you can use the concat()
function in pandas
. You will have to pass the names of the DataFrames in a list as the argument to the concat()
function:
df_row = pd.concat([df1, df2])
df_row
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
You can notice that the two DataFrames df1
and df2
are now concatenated into a single DataFrame df_row
along the row. However, the row labels seem to be wrong! If you want the row labels to adjust automatically according to the join, you will have to set the argument ignore_index
as True
while calling the concat()
function:
df_row_reindex = pd.concat([df1, df2], ignore_index=True)
df_row_reindex
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
5 | 1 | K | L |
6 | 2 | M | N |
7 | 6 | O | P |
8 | 7 | Q | R |
9 | 8 | S | T |
Now the row labels are correct!
pandas
also provides you with an option to label the DataFrames, after the concatenation, with a key so that you may know which data came from which DataFrame. You can achieve the same by passing additional argument keys
specifying the label names of the DataFrames in a list. Here you will perform the same concatenation with keys as x
and y
for DataFrames df1
and df2
respectively.
frames = [df1,df2]
df_keys = pd.concat(frames, keys=['x', 'y'])
df_keys
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
Mentioning the keys also makes it easy to retrieve data corresponding to a particular DataFrame. You can retrieve the data of DataFrame df2
which had the label y
by using the loc
method:
df_keys.loc['y']
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
You can also pass a dictionary to concat()
, in which case the dictionary keys will be used for the keys
argument (unless other keys are specified):
pieces = {'x': df1, 'y': df2}
df_piece = pd.concat(pieces)
df_piece
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
It is worth noting that concat()
makes a full copy of the data, and continuosly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
To concatenate DataFrames along column, you can specify the axis
parameter as 1
:
df_col = pd.concat([df1,df2], axis=1)
df_col
id | Feature1 | Feature2 | id | Feature1 | Feature2 | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
Merge DataFrames
Another ubiquitous operation related to DataFrames is the merging operation. Two DataFrames might hold different kinds of information about the same entity and linked by some common feature/column. To join these DataFrames, pandas provides multiple functions like concat()
, merge()
, join()
, etc. In this section, you will practice using merge()
function of pandas.
You can join DataFrames df_row
(which you created by concatenating df1
and df2
along the row) and df3
on the common column (or key) id
. To do so, pass the names of the DataFrames and an additional argument on
as the name of the common column, here id
, to the merge()
function:
df_merge_col = pd.merge(df_row, df3, on='id')
df_merge_col
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
You can notice that the DataFrames are now merged into a single DataFrame based on the common values present in the id
column of both the DataFrames. For example, here id
value 1
was present with both A
, B
and K
, L
in the DataFrame df_row
hence this id
got repeated twice in the final DataFrame df_merge_col
with repeated value 12
of Feature3
which came from DataFrame df3
.
It might happen that the column on which you want to merge the DataFrames have different names (unlike in this case). For such merges, you will have to specify the arguments left_on
as the left DataFrame name and right_on
as the right DataFrame name, like :
df_merge_difkey = pd.merge(df_row, df3, left_on='id', right_on='id')
df_merge_difkey
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
You can also append rows to a DataFrame by passing a Series or dict to append()
function which returns a new DataFrame:
add_row = pd.Series(['10', 'X1', 'X2', 'X3'],
index=['id','Feature1', 'Feature2', 'Feature3'])
df_add_row = df_merge_col.append(add_row, ignore_index=True)
df_add_row
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
9 | 10 | X1 | X2 | X3 |
Join DataFrames
In this section, you will practice the various join logics available to merge pandas DataFrames based on some common column/key. The logic behind these joins is very much the same that you have in SQL when you join tables.
Full Outer Join
The FULL OUTER JOIN
combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaN
s for missing matches on either side. You can perform a full outer join by specifying the how
argument as outer
in the merge()
function:
df_outer = pd.merge(df1, df2, on='id', how='outer')
df_outer
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
You can notice that the resulting DataFrame had all the entries from both the tables with NaN
values for missing matches on either side. However, one more thing to notice is the suffix which got appended to the column names to show which column came from which DataFrame. The default suffixes are x
and y
, however, you can modify them by specifying the suffixes
argument in the merge()
function:
df_suffix = pd.merge(df1, df2, left_on='id',right_on='id',how='outer',suffixes=('_left','_right'))
df_suffix
id | Feature1_left | Feature2_left | Feature1_right | Feature2_right | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
Inner Join
The INNER JOIN
produces only the set of records that match in both DataFrame A and DataFrame B. You have to pass inner
in the how
argument of merge()
function to do inner join:
df_inner = pd.merge(df1, df2, on='id', how='inner')
df_inner
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
Right Join
The RIGHT JOIN
produces a complete set of records from DataFrame B (right DataFrame), with the matching records (where available) in DataFrame A (left DataFrame). If there is no match, the right side will contain null. You have to pass right
in the how
argument of merge()
function to do right join:
df_right = pd.merge(df1, df2, on='id', how='right')
df_right
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 6 | NaN | NaN | O | P |
3 | 7 | NaN | NaN | Q | R |
4 | 8 | NaN | NaN | S | T |
Left Join
The LEFT JOIN
produces a complete set of records from DataFrame A (left DataFrame), with the matching records (where available) in DataFrame B (right DataFrame). If there is no match, the left side will contain null. You have to pass left
in the how
argument of merge()
function to do left join:
df_left = pd.merge(df1, df2, on='id', how='left')
df_left
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
Joining on index
Sometimes you may have to perform the join on the indexes or the row labels. To do so, you have to specify right_index
(for the indexes of the right DataFrame) and left_index
(for the indexes of the left DataFrame) as True
:
df_index = pd.merge(df1, df2, right_index=True, left_index=True)
df_index
id_x | Feature1_x | Feature2_x | id_y | Feature1_y | Feature2_y | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
Time-series Friendly Merging
Pandas provides special functions for merging Time-series DataFrames. Perhaps the most useful and popular one is the merge_asof()
function. The merge_asof()
is similar to an ordered left-join except that you match on nearest key rather than equal keys. For each row in the left DataFrame, you select the last row in the right DataFrame whose on
key is less than the left’s key. Both DataFrames must be sorted by the key.
Optionally an asof merge can perform a group-wise merge. This matches the by
key equally, in addition to the nearest match on the on
key.
For example, you might have trades and quotes, and you want to asof merge them. Here the left DataFrame is chosen as trades
and right DataFrame as quotes
. They are asof merged on key time
and group-wise merged by their ticker
symbol.
trades = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.038',
'20160525 13:30:00.048',
'20160525 13:30:00.048',
'20160525 13:30:00.048']),
'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
'price': [51.95, 51.95,720.77, 720.92, 98.00],
'quantity': [75, 155,100, 100, 100]},
columns=['time', 'ticker', 'price', 'quantity'])
quotes = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.023',
'20160525 13:30:00.030',
'20160525 13:30:00.041',
'20160525 13:30:00.048',
'20160525 13:30:00.049',
'20160525 13:30:00.072',
'20160525 13:30:00.075']),
'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
columns=['time', 'ticker', 'bid', 'ask'])
trades
time | ticker | price | quantity | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 |
quotes
time | ticker | bid | ask | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | GOOG | 720.50 | 720.93 |
1 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 51.96 |
2 | 2016-05-25 13:30:00.030 | MSFT | 51.97 | 51.98 |
3 | 2016-05-25 13:30:00.041 | MSFT | 51.99 | 52.00 |
4 | 2016-05-25 13:30:00.048 | GOOG | 720.50 | 720.93 |
5 | 2016-05-25 13:30:00.049 | AAPL | 97.99 | 98.01 |
6 | 2016-05-25 13:30:00.072 | GOOG | 720.50 | 720.88 |
7 | 2016-05-25 13:30:00.075 | MSFT | 52.01 | 52.03 |
df_merge_asof = pd.merge_asof(trades, quotes,
on='time',
by='ticker')
df_merge_asof
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 51.97 | 51.98 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
If you observe carefully, you can notice the reason behind NaN
appearing in the AAPL
ticker row. Since the right DataFrame quotes
didn't have any time
value less than 13:30:00.048
(the time
in the left table) for AAPL
ticker, NaN
s were introduced in the bid
and ask
columns.
You can also set a predefined tolerance level for time
column. Suppose you only want asof merge within 2ms between the quote
time and the trade
time, then you will have to specify tolerance
argument:
df_merge_asof_tolerance = pd.merge_asof(trades, quotes,
on='time',
by='ticker',
tolerance=pd.Timedelta('2ms'))
df_merge_asof_tolerance
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | NaN | NaN |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
Notice the difference between the above and previous result. Rows are not merged if the time tolerance didn't match 2ms
.
Conclusion
Hurray! You have come to the end of the tutorial. In this tutorial, you learned to concatenate and merge DataFrames based on several logics using the concat()
and merge()
functions of pandas
library. Towards the end, you also practiced the special function merge_asof()
for merging Time-series DataFrames. Along the way, you also learned to play with indexes of the DataFrames. There are several other options you can explore for joining DataFrames in pandas, and I encourage you to look at its fantastic documentation. Happy exploring!
This tutorial used the following sources to help write it:
If you would like to learn more about pandas, take DataCamp's pandas Foundations course and check out our DataFrames in Python Pandas Tutorial.
← Back to tutorial