Course
Have you ever tried solving a Kaggle challenge? If yes, you might have noticed that in most of the challenges, the data provided to you is present in multiple files, with some of the columns present in more than one of the files. Well, what is the first thing that comes to your mind? Join them of course!
In this tutorial, you will practice a few standard pandas joining techniques. More specifically, you will learn to:
- Concatenate DataFrames along row and column.
- Merge DataFrames on specific keys by different join logics like left-join, inner-join, etc.
- Join DataFrames by index.
- Time-series friendly merging provided in pandas
Along the way, you will also learn a few tricks which you require before and after joining.
Pandas joining
Joining and merging DataFrames is the core process to start with data analysis and machine learning tasks. It is one of the toolkits which every Data Analyst or Data Scientist should master because, in almost all cases, data comes from multiple sources and files. You may need to bring all the data in one place by some sort of join logic and then start your analysis. People who work with SQL-like query languages might know the importance of this task. Even if you want to build some machine learning models on some data, you may need to merge multiple csv files together in a single DataFrame.
Thankfully you have the most popular library in python, pandas to your rescue! pandas provides various facilities for easily combining together Series, DataFrames, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Run and edit the code from this tutorial online
Run codepandas Concatenate
Start by importing the library you will be using throughout the tutorial: pandas
import pandas as pd
You will be performing all the operations in this tutorial on the dummy DataFrames that you will create. To create a DataFrame, you can use a Python dictionary like:
dummy_data1 = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
Here, the keys of the dictionary dummy_data1 are the column names, and the values in the list are the data corresponding to each observation or row. To transform this into a pandas DataFrame, you will use the DataFrame() function of pandas, along with its columns argument to name your columns:
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])
df1
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
As you can notice, you now have a DataFrame with three columns id, Feature1, and Feature2. There is an additional un-named column which pandas intrinsically creates as the row labels. Similar to the previous DataFrame df1, you will create two more DataFrames df2 and df3 :
dummy_data2 = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])
df2
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
dummy_data3 = {
'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])
df3
id | Feature3 | |
---|---|---|
0 | 1 | 12 |
1 | 2 | 13 |
2 | 3 | 14 |
3 | 4 | 15 |
4 | 5 | 16 |
5 | 7 | 17 |
6 | 8 | 15 |
7 | 9 | 12 |
8 | 10 | 13 |
9 | 11 | 23 |
concat()
To simply concatenate the DataFrames along the row you can use the concat() function in pandas. You will have to pass the names of the DataFrames in a list as the argument to the concat() function:
df_row = pd.concat([df1, df2])
df_row
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
You can notice that the two DataFrames df1 and df2 are now concatenated into a single DataFrame df_row along the row. However, the row labels seem to be wrong. If you want the row labels to adjust automatically according to the join, you will have to set the argument ignore_index as True while calling the concat() function:
df_row_reindex = pd.concat([df1, df2], ignore_index=True)
df_row_reindex
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | A | B |
1 | 2 | C | D |
2 | 3 | E | F |
3 | 4 | G | H |
4 | 5 | I | J |
5 | 1 | K | L |
6 | 2 | M | N |
7 | 6 | O | P |
8 | 7 | Q | R |
9 | 8 | S | T |
Now the row labels are correct!
pandas also provides you with an option to label the DataFrames, after the concatenation, with a key so that you may know which data came from which DataFrame. You can achieve the same by passing additional argument keys specifying the label names of the DataFrames in a list. Here you will perform the same concatenation with keys as x and y for DataFrames df1 and df2 respectively.
frames = [df1,df2]
df_keys = pd.concat(frames, keys=['x', 'y'])
df_keys
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
Mentioning the keys also makes it easy to retrieve data corresponding to a particular DataFrame. You can retrieve the data of DataFrame df2 which had the label y by using the loc method:
df_keys.loc['y']
id | Feature1 | Feature2 | |
---|---|---|---|
0 | 1 | K | L |
1 | 2 | M | N |
2 | 6 | O | P |
3 | 7 | Q | R |
4 | 8 | S | T |
You can also pass a dictionary to concat(), in which case the dictionary keys will be used for the keys argument (unless other keys are specified):
pieces = {'x': df1, 'y': df2}
df_piece = pd.concat(pieces)
df_piece
id | Feature1 | Feature2 | ||
---|---|---|---|---|
x | 0 | 1 | A | B |
1 | 2 | C | D | |
2 | 3 | E | F | |
3 | 4 | G | H | |
4 | 5 | I | J | |
y | 0 | 1 | K | L |
1 | 2 | M | N | |
2 | 6 | O | P | |
3 | 7 | Q | R | |
4 | 8 | S | T |
It is worth noting that concat() makes a full copy of the data, and continuosly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
To concatenate DataFrames along column, you can specify the axis parameter as 1 :
df_col = pd.concat([df1,df2], axis=1)
df_col
id | Feature1 | Feature2 | id | Feature1 | Feature2 | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
pandas Merge DataFrames
Another ubiquitous operation related to DataFrames is the merging operation. Two DataFrames might hold different kinds of information about the same entity and be linked by some common feature/column. To join these DataFrames, pandas provides multiple functions like concat(), merge() , join(), etc. In this section, you will practice using merge() function of pandas.
You can join DataFrames df_row (which you created by concatenating df1 and df2 along the row) and df3 on the common column (or key) id. To do so, pass the names of the DataFrames and an additional argument on as the name of the common column, here id, to the merge() function:
df_merge_col = pd.merge(df_row, df3, on='id')
df_merge_col
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
You can notice that the DataFrames are now merged into a single DataFrame based on the common values present in the id column of both the DataFrames. For example, here id value 1 was present with both A, B and K, L in the DataFrame df_row hence this id got repeated twice in the final DataFrame df_merge_col with repeated value 12 of Feature3 which came from DataFrame df3.
It might happen that the column on which you want to merge the DataFrames have different names (unlike in this case). For such merges, you will have to specify the arguments left_on as the left DataFrame name and right_on as the right DataFrame name, like :
df_merge_difkey = pd.merge(df_row, df3, left_on='id', right_on='id')
df_merge_difkey
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
You can also append rows to a DataFrame by passing a Series or dict to append() function which returns a new DataFrame:
add_row = pd.Series(['10', 'X1', 'X2', 'X3'],
index=['id','Feature1', 'Feature2', 'Feature3'])
df_add_row = df_merge_col.append(add_row, ignore_index=True)
df_add_row
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | A | B | 12 |
1 | 1 | K | L | 12 |
2 | 2 | C | D | 13 |
3 | 2 | M | N | 13 |
4 | 3 | E | F | 14 |
5 | 4 | G | H | 15 |
6 | 5 | I | J | 16 |
7 | 7 | Q | R | 17 |
8 | 8 | S | T | 15 |
9 | 10 | X1 | X2 | X3 |
Types of pandas Join
In this section, you will practice the various join logics available to merge pandas DataFrames based on some common column/key. The logic behind these joins is very much the same that you have in SQL when you join tables.
Full Outer Join
The FULL OUTER JOIN combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaNs for missing matches on either side. You can perform a full outer join by specifying the how argument as outer in the merge() function:
df_outer = pd.merge(df1, df2, on='id', how='outer')
df_outer
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
You can notice that the resulting DataFrame had all the entries from both the tables with NaN values for missing matches on either side. However, one more thing to notice is the suffix which got appended to the column names to show which column came from which DataFrame. The default suffixes are x and y, however, you can modify them by specifying the suffixes argument in the merge() function:
df_suffix = pd.merge(df1, df2, left_on='id',right_on='id',how='outer',suffixes=('_left','_right'))
df_suffix
id | Feature1_left | Feature2_left | Feature1_right | Feature2_right | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
5 | 6 | NaN | NaN | O | P |
6 | 7 | NaN | NaN | Q | R |
7 | 8 | NaN | NaN | S | T |
Inner Join
The INNER JOIN produces only the set of records that match in both DataFrame A and DataFrame B. You have to pass inner in the how argument of merge() function to do inner join:
df_inner = pd.merge(df1, df2, on='id', how='inner')
df_inner
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
Right Join
The RIGHT JOIN produces a complete set of records from DataFrame B (right DataFrame), with the matching records (where available) in DataFrame A (left DataFrame). If there is no match, the right side will contain null. You have to pass right in the how argument of merge() function to do right join:
df_right = pd.merge(df1, df2, on='id', how='right')
df_right
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 6 | NaN | NaN | O | P |
3 | 7 | NaN | NaN | Q | R |
4 | 8 | NaN | NaN | S | T |
Left Join
The LEFT JOIN produces a complete set of records from DataFrame A (left DataFrame), with the matching records (where available) in DataFrame B (right DataFrame). If there is no match, the left side will contain null. You have to pass left in the how argument of merge() function to do left join:
df_left = pd.merge(df1, df2, on='id', how='left')
df_left
id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
---|---|---|---|---|---|
0 | 1 | A | B | K | L |
1 | 2 | C | D | M | N |
2 | 3 | E | F | NaN | NaN |
3 | 4 | G | H | NaN | NaN |
4 | 5 | I | J | NaN | NaN |
Joining on index
Sometimes you may have to perform the join on the indexes or the row labels. To do so, you have to specify right_index (for the indexes of the right DataFrame) and left_index (for the indexes of the left DataFrame) as True :
df_index = pd.merge(df1, df2, right_index=True, left_index=True)
df_index
id_x | Feature1_x | Feature2_x | id_y | Feature1_y | Feature2_y | |
---|---|---|---|---|---|---|
0 | 1 | A | B | 1 | K | L |
1 | 2 | C | D | 2 | M | N |
2 | 3 | E | F | 6 | O | P |
3 | 4 | G | H | 7 | Q | R |
4 | 5 | I | J | 8 | S | T |
pandas Join
Pandas DataFrame.join function is used for joining data frames on unique indexes. You can use the optional argument `on` to join column(s) names on the index and how arguments handle the operation of the two objects. By default, it will use inner join.
pandas Join Two Dataframes
Let’s join two data frames using .join. We have provided `lsuffix` and `rsuffix` to avoid raising the column overlapping error. It joins based on the index, not on the column, so we need to either change the ‘id’ column or provide a suffix.
df2.join(df3, lsuffix='_left', rsuffix='_right')
id_left | Feature1 | Feature2 | id_right | Feature3 | |
---|---|---|---|---|---|
0 | 1 | K | L | 1 | 12 |
1 | 2 | M | N | 2 | 13 |
2 | 6 | O | P | 3 | 14 |
3 | 7 | Q | R | 4 | 15 |
4 | 8 | S | T | 5 | 16 |
We can also join columns on the index using the `on` argument. To apply the join successfully, we have to df3 ‘id’ column to index and provide the `on` argument with the ‘id’ column. By default, it will use the left join.
df2.join(df3.set_index('id'), on='id')
id | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | K | L | 12.0 |
1 | 2 | M | N | 13.0 |
2 | 6 | O | P | NaN |
3 | 7 | Q | R | 17.0 |
4 | 8 | S | T | 15.0 |
Just like the merge function, we can change the operation of join by providing a `how` argument. In our case, we will be using an inner join.
df2.join(df3.set_index('id'), on='id', how = "inner")
id_left | Feature1 | Feature2 | Feature3 | |
---|---|---|---|---|
0 | 1 | K | L | 12 |
1 | 2 | M | N | 13 |
3 | 7 | Q | R | 17 |
4 | 8 | S | T | 15 |
Time-series Friendly Merging
Pandas provides special functions for merging Time-series DataFrames. Perhaps the most useful and popular one is the merge_asof() function. The merge_asof() is similar to an ordered left-join except that you match on nearest key rather than equal keys. For each row in the left DataFrame, you select the last row in the right DataFrame whose on key is less than the left’s key. Both DataFrames must be sorted by the key.
Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in addition to the nearest match on the on key.
For example, you might have trades and quotes, and you want to asof merge them. Here the left DataFrame is chosen as trades and right DataFrame as quotes. They are asof merged on key time and group-wise merged by their ticker symbol.
trades = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.038',
'20160525 13:30:00.048',
'20160525 13:30:00.048',
'20160525 13:30:00.048']),
'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
'price': [51.95, 51.95,720.77, 720.92, 98.00],
'quantity': [75, 155,100, 100, 100]},
columns=['time', 'ticker', 'price', 'quantity'])
quotes = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.023',
'20160525 13:30:00.030',
'20160525 13:30:00.041',
'20160525 13:30:00.048',
'20160525 13:30:00.049',
'20160525 13:30:00.072',
'20160525 13:30:00.075']),
'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
columns=['time', 'ticker', 'bid', 'ask'])
trades
time | ticker | price | quantity | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 |
quotes
time | ticker | bid | ask | |
---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | GOOG | 720.50 | 720.93 |
1 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 51.96 |
2 | 2016-05-25 13:30:00.030 | MSFT | 51.97 | 51.98 |
3 | 2016-05-25 13:30:00.041 | MSFT | 51.99 | 52.00 |
4 | 2016-05-25 13:30:00.048 | GOOG | 720.50 | 720.93 |
5 | 2016-05-25 13:30:00.049 | AAPL | 97.99 | 98.01 |
6 | 2016-05-25 13:30:00.072 | GOOG | 720.50 | 720.88 |
7 | 2016-05-25 13:30:00.075 | MSFT | 52.01 | 52.03 |
df_merge_asof = pd.merge_asof(trades, quotes,
on='time',
by='ticker')
df_merge_asof
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 51.97 | 51.98 |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
If you observe carefully, you can notice the reason behind NaN appearing in the AAPL ticker row. Since the right DataFrame quotes didn't have any time value less than 13:30:00.048 (the time in the left table) for AAPL ticker, NaNs were introduced in the bid and ask columns.
You can also set a predefined tolerance level for time column. Suppose you only want asof merge within 2ms between the quote time and the trade time, then you will have to specify tolerance argument:
df_merge_asof_tolerance = pd.merge_asof(trades, quotes,
on='time',
by='ticker',
tolerance=pd.Timedelta('2ms'))
df_merge_asof_tolerance
time | ticker | price | quantity | bid | ask | |
---|---|---|---|---|---|---|
0 | 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 51.95 | 51.96 |
1 | 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | NaN | NaN |
2 | 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 720.50 | 720.93 |
3 | 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 720.50 | 720.93 |
4 | 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NaN | NaN |
Notice the difference between the above and previous result. Rows are not merged if the time tolerance didn't match 2ms.
Conclusion
In this tutorial, you learned to concatenate and merge DataFrames based on several logics using the concat() and merge() functions of pandas library. Toward the end, you also practiced the special function merge_asof() for merging Time-series DataFrames. Along the way, you also learned to play with indexes of the DataFrames. There are several other options you can explore for joining DataFrames in pandas, and I encourage you to look at its fantastic documentation. Happy exploring!
This tutorial used this pandas documentation to help write it.
If you would like to learn more about pandas, take DataCamp's pandas Foundations course and check out our DataFrames in Python Pandas Tutorial.
DataCamp also has several other handy pandas tutorials including:
- Importing CSV data into pandas
- Adding columns in pandas
- Sorting values in pandas
- Removing duplicates in pandas with
drop_duplicates
Happy learning!
Learn more about Python and pandas
Course
Data Manipulation with pandas
Course
Joining Data with pandas
tutorial
Pandas Tutorial: DataFrames in Python
tutorial
Merging Datasets in R
tutorial
Hierarchical indices, groupby and pandas
tutorial
Data Preparation with pandas
tutorial
Python Select Columns Tutorial
DataCamp Team
7 min
tutorial