Joining Data In Pandas
Following from my last blog on grouping data in Pandas. I would like to continue this series on Python Pandas by discuss merging and joining data.
When working with a single dataset there is little need for this functionality but bring two or more datasets together is when data analysis gets interesting.
Pandas has a number of options for putting two or more dataframes together. We will explore some of these in the blog and hopefully make make the topic clear.
import pandas as pd
import numpy as np
Lets make some dataframes¶
# lets make some dataframes
df = pd.DataFrame(np.random.randint(0,100,size=(6, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,100,size=(6, 4)), columns=list('ABEF'))
df3 = pd.DataFrame(np.random.randint(0,100,size=(3, 1)), columns=list('G'))
df4 = pd.DataFrame(np.random.randint(0,100,size=(3, 1)), columns=list('H'))
# lets add some string columns
df['letters'] = pd.Series(['a','a','a','b','b','c'])
df2['letters'] = pd.Series(['c','a','b','b','c','c'])
df3['letters'] = pd.Series(['a','b','c'])
df4['letters'] = pd.Series(['a','y','z'])
df
A | B | C | D | letters | |
---|---|---|---|---|---|
0 | 70 | 56 | 85 | 74 | a |
1 | 53 | 44 | 92 | 80 | a |
2 | 3 | 22 | 71 | 99 | a |
3 | 32 | 70 | 23 | 58 | b |
4 | 99 | 84 | 61 | 40 | b |
5 | 81 | 54 | 59 | 4 | c |
Now once you have cleaned up your data you might want to join two or more datasets together. This might give you the opportunity to gain grater in sight than from a single dataset.
There are a few ways to join data in Pandas but really there is only two you need to know. These are concat() and merge(). Both of these come as top level functions so are called directly on Pandas.
Concat() simply stacks two frames together side by side or one on top of the other aligning only the index. Merge() is more like a QSL join and aligns data using an index or a column on the basis of common values. Merge will duplicate values if required.
Note - if you do not understand joins I would suggest getting up to speed first.
pd.concat()¶
The most basic concat tries to align horizontally on the column axis. Only the common columns align. Note this can be frustrating if column names are spelt wrong, or are named differently.
pd.concat([df, df2])
A | B | C | D | letters | E | F | |
---|---|---|---|---|---|---|---|
0 | 70 | 56 | 85.0 | 74.0 | a | NaN | NaN |
1 | 53 | 44 | 92.0 | 80.0 | a | NaN | NaN |
2 | 3 | 22 | 71.0 | 99.0 | a | NaN | NaN |
3 | 32 | 70 | 23.0 | 58.0 | b | NaN | NaN |
4 | 99 | 84 | 61.0 | 40.0 | b | NaN | NaN |
5 | 81 | 54 | 59.0 | 4.0 | c | NaN | NaN |
0 | 42 | 20 | NaN | NaN | c | 77.0 | 40.0 |
1 | 49 | 65 | NaN | NaN | a | 56.0 | 47.0 |
2 | 29 | 26 | NaN | NaN | b | 97.0 | 46.0 |
3 | 69 | 89 | NaN | NaN | b | 98.0 | 77.0 |
4 | 24 | 71 | NaN | NaN | c | 73.0 | 92.0 |
5 | 63 | 99 | NaN | NaN | c | 28.0 | 2.0 |
If we add axis=1 it aligns vertically on the row axis. Note that the axis is aligned but the common columns are not.
pd.concat([df, df2], axis=1)
A | B | C | D | letters | A | B | E | F | letters | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 70 | 56 | 85 | 74 | a | 42 | 20 | 77 | 40 | c |
1 | 53 | 44 | 92 | 80 | a | 49 | 65 | 56 | 47 | a |
2 | 3 | 22 | 71 | 99 | a | 29 | 26 | 97 | 46 | b |
3 | 32 | 70 | 23 | 58 | b | 69 | 89 | 98 | 77 | b |
4 | 99 | 84 | 61 | 40 | b | 24 | 71 | 73 | 92 | c |
5 | 81 | 54 | 59 | 4 | c | 63 | 99 | 28 | 2 | c |
Other useful Parameters are ignore_index= which will produce a new index. names= which will name the indexes. Sort=False which will keep the order.
pd.merge()¶
How if you want to join the data using common data in a column you will need to use the merge() method.
We will use df as the main dataframe and join the simple df3.
There are a number of parameters that you need to work with the most important are:
- left
- right
- how
- on
Left and Right are simply which dataframe do you want to use for left and right.
On is which column do you want to join on. This should be a data that is common to both dataframes.
how is what sort of join do you want to use. The two images below show this best.
pd.merge(left=df, right=df3, on='letters', how='left')
A | B | C | D | letters | G | |
---|---|---|---|---|---|---|
0 | 70 | 56 | 85 | 74 | a | 86 |
1 | 53 | 44 | 92 | 80 | a | 86 |
2 | 3 | 22 | 71 | 99 | a | 86 |
3 | 32 | 70 | 23 | 58 | b | 52 |
4 | 99 | 84 | 61 | 40 | b | 52 |
5 | 81 | 54 | 59 | 4 | c | 33 |
Understanding Joins¶
Joins can be tricky to learn but are essential for data analysis.
There are four main types:
- Inner - this captures only rows that are the same in both datasets
- Outer - this captures all data from both dataframes
- Left - this captures all data from the left frame and matching data from the right frame
- Right - this captures all data from the right frame and matching data from the left frame
Note when using Outer, Left or Right any rows that do not match will be filled with NAN's to signify blank data.
Lets do some simple examples using df3 and df4. We will join on 'letters' like before. You can see that 'a' is the only common row between the two letters columns.
First lets see df3 and df4.
df3
G | letters | |
---|---|---|
0 | 86 | a |
1 | 52 | b |
2 | 33 | c |
df4
H | letters | |
---|---|---|
0 | 7 | a |
1 | 86 | y |
2 | 70 | z |
Inner¶
Now lets merge on 'inner'. We get just one row which is the match/overlap.
pd.merge(left=df3, right=df4, on='letters', how='inner')
G | letters | H | |
---|---|---|---|
0 | 86 | a | 7 |
Outer¶
Now lets merge on 'outer'. We get all rows from both frames.
pd.merge(left=df3, right=df4, on='letters', how='outer')
G | letters | H | |
---|---|---|---|
0 | 86.0 | a | 7.0 |
1 | 52.0 | b | NaN |
2 | 33.0 | c | NaN |
3 | NaN | y | 86.0 |
4 | NaN | z | 70.0 |
Left¶
Now lefts merge on 'left'. We get all from the left frame and matches from the right.
pd.merge(left=df3, right=df4, on='letters', how='left')
G | letters | H | |
---|---|---|---|
0 | 86 | a | 7.0 |
1 | 52 | b | NaN |
2 | 33 | c | NaN |
Right¶
Finally lefts merge on 'right'. We get all from the right frame and matches from the left.
pd.merge(left=df3, right=df4, on='letters', how='right')
G | letters | H | |
---|---|---|---|
0 | 86.0 | a | 7 |
1 | NaN | y | 86 |
2 | NaN | z | 70 |
Recap we have looked at the Concat() and Merge() methods. There are other methods for doing similar operations e.g. append, join. But these are not really needed as concat and merge cover the use cases.
Remember Concat gives the flexibility to join based on the axis( all rows or all columns). Merge is based on any particular column each of the two dataframes, these columns are variables 'on' which to Join.
So that is all for today. Hope it was useful.