Joining DataFrames in pandas Tutorial
Have you ever tried solving a Kaggle challenge? If so, you might have noticed that the data is often spread across multiple files, with some columns appearing in more than one of them. What is the first thing that comes to mind? Join them, of course!
In this tutorial, you will practice a few standard pandas joining techniques. More specifically, you will learn to:
- Concatenate DataFrames along rows and columns.
- Merge DataFrames on specific keys using different join logic, such as left join, inner join, etc.
- Join DataFrames by index.
- Use the time-series-friendly merging that pandas provides.
Along the way, you will also learn a few tricks that you will need before and after joining.
Pandas joining
Joining and merging DataFrames is a core step in getting started with data analysis and machine learning tasks. It is a skill every data analyst or data scientist should master because, in almost all cases, data comes from multiple sources and files. You may need to bring all the data into one place with some sort of join logic before you can start your analysis. People who work with SQL-like query languages will recognize the importance of this task. Even if you just want to build a machine learning model on some data, you may first need to merge multiple CSV files into a single DataFrame.
Thankfully, you have the most popular data analysis library in Python, pandas, to your rescue! pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.
pandas Concatenate
Start by importing the library you will be using throughout the tutorial: pandas
import pandas as pd
You will be performing all the operations in this tutorial on the dummy DataFrames that you will create. To create a DataFrame, you can use a Python dictionary like:
dummy_data1 = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
Here, the keys of the dictionary dummy_data1 are the column names, and the values in the list are the data corresponding to each observation or row. To transform this into a pandas DataFrame, you will use the DataFrame() function of pandas, along with its columns argument to name your columns:
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])
df1
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
As you can see, you now have a DataFrame with three columns: id, Feature1, and Feature2. There is also an additional unnamed column, the index, which pandas creates automatically to label the rows. Similar to df1, you will now create two more DataFrames, df2 and df3:
dummy_data2 = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])
df2
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
dummy_data3 = {
'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])
df3
id | Feature3 | |
---|---|---|
0 | 1 | 12 |
1 | 2 | 13 |
2 | 3 | 14 |
3 | 4 | 15 |
4 | 5 | 16 |
5 | 7 | 17 |
6 | 8 | 15 |
7 | 9 | 12 |
8 | 10 | 13 |
9 | 11 | 23 |
concat()
To concatenate the DataFrames along the rows, you can use the concat() function in pandas. You pass the DataFrames to be combined as a list to concat():
df_row = pd.concat([df1, df2])
df_row
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
You can see that the two DataFrames df1 and df2 are now concatenated row-wise into a single DataFrame df_row. However, the row labels are simply carried over from the original DataFrames, so they repeat. If you want the row labels to adjust automatically to the new DataFrame, set the ignore_index argument to True when calling concat():
df_row_reindex = pd.concat([df1, df2], ignore_index=True)
df_row_reindex
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
5 | 1 | K | L |
6 | 2 | M | N |
7 | 6 | O | P |
8 | 7 | Q | R |
9 | 8 | S | T |
Now the row labels are correct!
pandas also gives you the option to label the DataFrames after the concatenation, so that you know which data came from which DataFrame. You achieve this by passing the additional keys argument: a list with one label per DataFrame. Here you will perform the same concatenation with the keys x and y for DataFrames df1 and df2, respectively.
frames = [df1,df2]
df_keys = pd.concat(frames, keys=['x', 'y'])
df_keys
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
Specifying the keys also makes it easy to retrieve the data corresponding to a particular DataFrame. You can retrieve the data of DataFrame df2, which has the label y, by using the .loc accessor:
df_keys.loc['y']
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
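As a side note (not part of the original example), concat() also accepts a names argument if you want to label the levels of the resulting hierarchical index; the level names used below are purely illustrative:
# Label the key level and the original row-label level of the MultiIndex
df_named = pd.concat(frames, keys=['x', 'y'], names=['source', 'row'])
df_named.loc['y']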
You can also pass a dictionary to concat(), in which case the dictionary keys will be used for the keys argument (unless other keys are specified):
pieces = {'x': df1, 'y': df2}
df_piece = pd.concat(pieces)
df_piece
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
It is worth noting that concat() makes a full copy of the data, and repeatedly calling this function can create a significant performance hit. If you need to combine several datasets, build the full list of DataFrames first, for example with a list comprehension, and call concat() once:
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
To concatenate DataFrames along columns, specify the axis argument as 1:
df_col = pd.concat([df1,df2], axis=1)
df_col
id | Feature1 | Feature2 | id | Feature1 | Feature2 | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
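Column-wise concatenation aligns rows on the index; df1 and df2 happen to share the index 0 to 4, which is why the rows line up one-to-one above. The following sketch (with a shifted index that is purely illustrative) shows what happens when the indexes only partially overlap, and how join='inner' keeps just the common labels:
# Give df2 an index that only partially overlaps df1's index (illustrative)
df2_shifted = df2.set_index(pd.Index([2, 3, 4, 5, 6]))

# Default outer alignment: index labels missing on one side are filled with NaN
pd.concat([df1, df2_shifted], axis=1)

# join='inner' keeps only the index labels present in both DataFrames
pd.concat([df1, df2_shifted], axis=1, join='inner')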
pandas Merge DataFrames
Another ubiquitous operation related to DataFrames is merging. Two DataFrames might hold different kinds of information about the same entity and be linked by some common feature/column. To join such DataFrames, pandas provides functions like concat(), merge(), join(), etc. In this section, you will practice using the merge() function of pandas.
You can join the DataFrames df_row (which you created by concatenating df1 and df2 along the rows) and df3 on the common column (or key) id. To do so, pass the two DataFrames and the additional argument on, set to the name of the common column, here id, to the merge() function:
df_merge_col = pd.merge(df_row, df3, on='id')
df_merge_col
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
You can see that the DataFrames are now merged into a single DataFrame based on the common values in the id column of both DataFrames. For example, the id value 1 appeared with both A, B and K, L in df_row, so that id shows up twice in the merged DataFrame df_merge_col, each time paired with the Feature3 value 12 that came from df3.
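Because the id values 1 and 2 occur more than once in df_row but only once in df3, this is a many-to-one merge. If you want pandas to verify that assumption for you, merge() accepts a validate argument (a small sketch, not part of the original example):
# Raises a MergeError if the keys on the right side are not unique
pd.merge(df_row, df3, on='id', validate='many_to_one')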
It might happen that the column on which you want to merge has different names in the two DataFrames (unlike in this case). For such merges, you have to specify the left_on argument with the key column of the left DataFrame and the right_on argument with the key column of the right DataFrame, like:
df_merge_difkey = pd.merge(df_row, df3, left_on='id', right_on='id')
df_merge_difkey
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
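The example above still uses id on both sides, so here is a quick sketch of a merge where the key names genuinely differ; the column name identifier is hypothetical and only used for illustration:
# Rename df3's key column so the two DataFrames have different key names
df3_renamed = df3.rename(columns={'id': 'identifier'})

# left_on names the key in the left DataFrame, right_on the key in the right one
df_merge_difkey2 = pd.merge(df_row, df3_renamed,
                            left_on='id', right_on='identifier')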
You can also append a row to a DataFrame. Older pandas versions did this with the append() method, but append() was deprecated and then removed in pandas 2.0; the recommended approach is to wrap the new row in a one-row DataFrame and pass it to concat(), which returns a new DataFrame:
add_row = pd.Series(['10', 'X1', 'X2', 'X3'],
                    index=['id', 'Feature1', 'Feature2', 'Feature3'])
df_add_row = pd.concat([df_merge_col, add_row.to_frame().T], ignore_index=True)
df_add_row
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
9 | 10 | X1 | X2 | X3 |
Types of pandas Join
In this section, you will practice the various kinds of join logic available for merging pandas DataFrames on a common column/key. The logic behind these joins is essentially the same as the logic you use when joining tables in SQL.
Full Outer Join
The FULL OUTER JOIN combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaNs for missing matches on either side. You can perform a full outer join by specifying the how argument as outer in the merge() function:
df_outer = pd.merge(df1, df2, on='id', how='outer')
df_outer
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
You can notice that the resulting DataFrame has all the entries from both tables, with NaN values for missing matches on either side. One more thing to notice is the suffix appended to the column names to show which column came from which DataFrame. The default suffixes are _x and _y; however, you can modify them by specifying the suffixes argument in the merge() function:
df_suffix = pd.merge(df1, df2, left_on='id', right_on='id', how='outer', suffixes=('_left', '_right'))
df_suffix
id | Feature1_left | Feature2_left | Feature1_right | Feature2_right | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
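If you also want to know which DataFrame each row came from, merge() accepts an indicator argument (a small sketch, not from the original tutorial):
# indicator=True adds a _merge column with values 'left_only', 'right_only' or 'both'
pd.merge(df1, df2, on='id', how='outer', indicator=True)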
Inner Join
The INNER JOIN produces only the set of records that match in both DataFrame A and DataFrame B. You pass inner to the how argument of the merge() function to do an inner join (inner is also the default value of how):
df_inner = pd.merge(df1, df2, on='id', how='inner')
df_inner
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
Right Join
The RIGHT JOIN produces the complete set of records from DataFrame B (the right DataFrame), with the matching records (where available) from DataFrame A (the left DataFrame). If there is no match, the columns coming from the left DataFrame contain NaN. You pass right to the how argument of the merge() function to do a right join:
df_right = pd.merge(df1, df2, on='id', how='right')
df_right
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 6 | NaN | NaN | O | P |
3 | 7 | NaN | NaN | Q | R |
4 | 8 | NaN | NaN | S | T |
Left Join
The LEFT JOIN produces the complete set of records from DataFrame A (the left DataFrame), with the matching records (where available) from DataFrame B (the right DataFrame). If there is no match, the columns coming from the right DataFrame contain NaN. You pass left to the how argument of the merge() function to do a left join:
df_left = pd.merge(df1, df2, on='id', how='left')
df_left
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
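To compare the four join types side by side, you can run the same merge with each how value and look at the number of rows you get back (a quick sketch, not part of the original tutorial):
# inner keeps 2 rows, left and right keep 5 rows each, outer keeps all 8
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='id', how=how)
    print(how, merged.shape)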
Joining on index
Sometimes you may have to perform the join on the indexes, or the row labels. To do so, you have to set right_index (for the index of the right DataFrame) and left_index (for the index of the left DataFrame) to True:
df_index = pd.merge(df1, df2, right_index=True, left_index=True)
df_index
id_x | Feature1_x | Feature2_x | id_y | Feature1_y | Feature2_y | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
pandas Join
The pandas DataFrame.join() method is used for joining DataFrames on their indexes. The optional `on` argument lets you join columns of the calling DataFrame against the index of the other DataFrame, and the `how` argument controls the join logic. By default, join() performs a left join.
pandas Join Two DataFrames
Let's join two DataFrames using .join(). Because both df2 and df3 have an id column, we provide `lsuffix` and `rsuffix` to avoid a column-overlap error. The join is performed on the index, not on the id column, so we need to either move the id column into the index or provide suffixes.
df2.join(df3, lsuffix='_left', rsuffix='_right')
id_left | Feature1 | Feature2 | id_right | Feature3 | |
---|---|---|---|---|---|
0 | 1 | K | L | 1 | 12 |
1 | 2 | M | N | 2 | 13 |
2 | 6 | O | P | 3 | 14 |
3 | 7 | Q | R | 4 | 15 |
4 | 8 | S | T | 5 | 16 |
We can also join on a column using the `on` argument. To apply the join successfully, we have to set df3's id column as its index and pass on='id'. By default, this performs a left join.
df2.join(df3.set_index('id'), on='id')
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | K | L | 12.0 |
1 | 2 | M | N | 13.0 |
2 | 6 | O | P | NaN |
3 | 7 | Q | R | 17.0 |
4 | 8 | S | T | 15.0 |
Notice that Feature3 now shows 12.0, 13.0, and NaN: the missing match for id 6 introduces a NaN, which makes pandas cast the column to float. Just like with the merge() function, we can change the join logic by providing a `how` argument. In this case, we will use an inner join.
df2.join(df3.set_index('id'), on='id', how='inner')
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | K | L | 12 |
1 | 2 | M | N | 13 |
3 | 7 | Q | R | 17 |
4 | 8 | S | T | 15 |
Time-series Friendly Merging
pandas provides special functions for merging time-series DataFrames. Perhaps the most useful and popular one is the merge_asof() function. merge_asof() is similar to an ordered left join, except that you match on the nearest key rather than on equal keys. For each row in the left DataFrame, you select the last row in the right DataFrame whose on key is less than or equal to the left's key. Both DataFrames must be sorted by the key.
Optionally, an asof merge can perform a group-wise merge. This matches on the by key exactly, in addition to the nearest match on the on key.
For example, you might have trades and quotes, and you want to asof merge them. Here the left DataFrame is chosen as trades and right DataFrame as quotes. They are asof merged on key time and group-wise merged by their ticker symbol.
trades = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.038',
'20160525 13:30:00.048',
'20160525 13:30:00.048',
'20160525 13:30:00.048']),
'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
'price': [51.95, 51.95,720.77, 720.92, 98.00],
'quantity': [75, 155,100, 100, 100]},
columns=['time', 'ticker', 'price', 'quantity'])
quotes = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.023',
'20160525 13:30:00.030',
'20160525 13:30:00.041',
'20160525 13:30:00.048',
'20160525 13:30:00.049',
'20160525 13:30:00.072',
'20160525 13:30:00.075']),
'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
columns=['time', 'ticker', 'bid', 'ask'])
trades
time | ticker | price | quantity | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 |
quotes
time | ticker | bid | ask | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | GOOG | 720.50 | 720.93 |
1 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 51.96 |
2 | 2016-05-25 13:30:00.030 | MSFT | 51.97 | 51.98 |
3 | 2016-05-25 13:30:00.041 | MSFT | 51.99 | 52.00 |
4 | 2016-05-25 13:30:00.048 | GOOG | 720.50 | 720.93 |
5 | 2016-05-25 13:30:00.049 | AAPL | 97.99 | 98.01 |
6 | 2016-05-25 13:30:00.072 | GOOG | 720.50 | 720.88 |
7 | 2016-05-25 13:30:00.075 | MSFT | 52.01 | 52.03 |
df_merge_asof = pd.merge_asof(trades, quotes,
on='time',
by='ticker')
df_merge_asof
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 51.97 | 51.98 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
If you observe carefully, you can see the reason for the NaN values in the AAPL row. Since the right DataFrame quotes didn't have any quote for the AAPL ticker at or before 13:30:00.048 (the trade time in the left table), NaNs were introduced in the bid and ask columns.
You can also set a tolerance for the time column. Suppose you only want to match quotes within 2 ms of the trade time; then you have to specify the tolerance argument:
df_merge_asof_tolerance = pd.merge_asof(trades, quotes,
on='time',
by='ticker',
tolerance=pd.Timedelta('2ms'))
df_merge_asof_tolerance
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | NaN | NaN |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
Notice the difference between this result and the previous one. Rows are left unmatched when the nearest quote falls outside the 2 ms tolerance: for example, the MSFT trade at 13:30:00.038 no longer picks up the quote from 13:30:00.030.
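By default, merge_asof() looks backward in time, but it also accepts a direction argument ('forward' or 'nearest') if that suits your data better. A small sketch, reusing the trades and quotes DataFrames from above:
# Match each trade with the closest quote in either direction, per ticker
df_nearest = pd.merge_asof(trades, quotes,
                           on='time',
                           by='ticker',
                           direction='nearest')
df_nearest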
Conclusion
In this tutorial, you learned to concatenate and merge DataFrames using several kinds of join logic with the concat() and merge() functions of the pandas library. Toward the end, you also practiced the special merge_asof() function for merging time-series DataFrames. Along the way, you learned to work with the indexes of the DataFrames. There are several other options you can explore for joining DataFrames in pandas, and I encourage you to look at its fantastic documentation. Happy exploring!
This tutorial drew on the official pandas documentation.
If you would like to learn more about pandas, take DataCamp's pandas Foundations course and check out our DataFrames in Python Pandas Tutorial.
DataCamp also has several other handy pandas tutorials including:
- Importing CSV data into pandas
- Adding columns in pandas
- Sorting values in pandas
- Removing duplicates in pandas with drop_duplicates
Happy learning!