++++Notebook converted from Jupyter for blog publishing.
06-Combining-DataFrames
Combining DataFrames
Full Official Guide (Lots of examples!)
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html (opens in a new tab)
import numpy as np
import pandas as pdConcatenation
Directly "glue" together dataframes.
data_one = {'A': ['A0', 'A1', 'A2', 'A3'],'B': ['B0', 'B1', 'B2', 'B3']}data_two = {'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}one = pd.DataFrame(data_one)two = pd.DataFrame(data_two)oneA
B
0
A0
B0twoC
D
0
C0
D0Axis = 0
Concatenate along rows
axis0 = pd.concat([one,two],axis=0)axis0A
B
C
D
0Axis = 1
Concatenate along columns
axis1 = pd.concat([one,two],axis=1)axis1A
B
C
D
0Axis 0 , but columns match up
In case you wanted this:
two.columns = one.columnspd.concat([one,two])A
B
0
A0
B0Merge
Data Tables
registrations = pd.DataFrame({'reg_id':[1,2,3,4],'name':['Andrew','Bobo','Claire','David']})
logins = pd.DataFrame({'log_id':[1,2,3,4],'name':['Xavier','Andrew','Yolanda','Bobo']})registrationsreg_id
name
0
1
Andrewloginslog_id
name
0
1
Xavierpd.merge()
Merge pandas DataFrames based on key columns, similar to a SQL join. Results based on the how parameter.
help(pd.merge)Help on function merge in module pandas.core.reshape.merge:
merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None) -> 'DataFrame'
Merge DataFrame or named Series objects with a database-style join.
Inner,Left, Right, and Outer Joins
Inner Join
Match up where the key is present in BOTH tables. There should be no NaNs due to the join, since by definition to be part of the Inner Join they need info in both tables. Only Andrew and Bobo both registered and logged in.
# Notice pd.merge doesn't take in a list like concat
pd.merge(registrations,logins,how='inner',on='name')reg_id
name
log_id
0
1# Pandas smart enough to figure out key column (on parameter) if only one column name matches up
pd.merge(registrations,logins,how='inner')reg_id
name
log_id
0
1# Pandas reports an error if "on" key column isn't in both dataframes
# pd.merge(registrations,logins,how='inner',on='reg_id')Left Join
Match up AND include all rows from Left Table. Show everyone who registered on Left Table, if they don't have login info, then fill with NaN.
pd.merge(registrations,logins,how='left')reg_id
name
log_id
0
1Right Join
Match up AND include all rows from Right Table. Show everyone who logged in on the Right Table, if they don't have registration info, then fill with NaN.
pd.merge(registrations,logins,how='right')reg_id
name
log_id
0
1.0Outer Join
Match up on all info found in either Left or Right Table. Show everyone that's in the Log in table and the registrations table. Fill any missing info with NaN
pd.merge(registrations,logins,how='outer')reg_id
name
log_id
0
1.0Join on Index or Column
Use combinations of left_on,right_on,left_index,right_index to merge a column or index on each other
registrationsreg_id
name
0
1
Andrewloginslog_id
name
0
1
Xavierregistrations = registrations.set_index("name")registrationsreg_id
name
Andrew
1
Bobopd.merge(registrations,logins,left_index=True,right_on='name')reg_id
log_id
name
1
1pd.merge(logins,registrations,right_index=True,left_on='name')log_id
name
reg_id
1
2Dealing with differing key column names in joined tables
registrations = registrations.reset_index()registrationsname
reg_id
0
Andrew
1loginslog_id
name
0
1
Xavierregistrations.columns = ['reg_name','reg_id']registrationsreg_name
reg_id
0
Andrew
1# ERROR
# pd.merge(registrations,logins)pd.merge(registrations,logins,left_on='reg_name',right_on='name')reg_name
reg_id
log_id
name
0pd.merge(registrations,logins,left_on='reg_name',right_on='name').drop('reg_name',axis=1)reg_id
log_id
name
0
1Pandas automatically tags duplicate columns
registrations.columns = ['name','id']logins.columns = ['id','name']registrationsname
id
0
Andrew
1loginsid
name
0
1
Xavier# _x is for left
# _y is for right
pd.merge(registrations,logins,on='name')name
id_x
id_y
0
Andrewpd.merge(registrations,logins,on='name',suffixes=('_reg','_log'))name
id_reg
id_log
0
Andrew