Combining Datasets in Pandas: Concatenating, Joining, and Merging¶
In data analysis, we often need to combine datasets for a more comprehensive analysis. Pandas provides powerful methods for combining datasets: concatenating, joining, and merging. In this tutorial, we will cover these three methods with examples.
Concatenation¶
Concatenation involves stacking datasets either vertically (row-wise) or horizontally (column-wise). It is useful when you need to combine datasets with the same structure.
Creating DataFrames for Concatenation¶
We will start by creating three DataFrames with sales data from different months.
import pandas as pd
# Creating DataFrames with sales data
sales_jan = pd.DataFrame({
'Product': ['Apples', 'Bananas', 'Cherries'],
'Sales': [150, 200, 90],
}, index=['a', 'b', 'c'])
sales_feb = pd.DataFrame({
'Product': ['Apples', 'Bananas', 'Cherries'],
'Sales': [180, 210, 120],
}, index=['d', 'e', 'f'])
sales_mar = pd.DataFrame({
'Product': ['Apples', 'Bananas', 'Cherries'],
'Sales': [170, 220, 110],
}, index=['g', 'h', 'i'])
Concatenating DataFrames Vertically (Row-wise)¶
The pd.concat()
function is used to concatenate DataFrames. By default, it concatenates row-wise.
# Concatenating DataFrames for January, February, and March sales data
pd.concat([sales_jan, sales_feb, sales_mar])
Product | Sales | |
---|---|---|
a | Apples | 150 |
b | Bananas | 200 |
c | Cherries | 90 |
d | Apples | 180 |
e | Bananas | 210 |
f | Cherries | 120 |
g | Apples | 170 |
h | Bananas | 220 |
i | Cherries | 110 |
This combines the sales data from January, February, and March into one DataFrame.
Concatenating DataFrames Horizontally (Column-wise)¶
We can also concatenate the DataFrames column-wise by specifying axis=1
.
# Concatenating sales data column-wise
pd.concat([sales_jan, sales_feb, sales_mar], axis=1)
Product | Sales | Product | Sales | Product | Sales | |
---|---|---|---|---|---|---|
a | Apples | 150.0 | NaN | NaN | NaN | NaN |
b | Bananas | 200.0 | NaN | NaN | NaN | NaN |
c | Cherries | 90.0 | NaN | NaN | NaN | NaN |
d | NaN | NaN | Apples | 180.0 | NaN | NaN |
e | NaN | NaN | Bananas | 210.0 | NaN | NaN |
f | NaN | NaN | Cherries | 120.0 | NaN | NaN |
g | NaN | NaN | NaN | NaN | Apples | 170.0 |
h | NaN | NaN | NaN | NaN | Bananas | 220.0 |
i | NaN | NaN | NaN | NaN | Cherries | 110.0 |
This combines the DataFrames side by side. If the index values don’t match, Pandas fills in the missing entries with NaN
.
# Creating customer data
customers = pd.DataFrame({
'CustomerID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
# Creating purchase data
purchases = pd.DataFrame({
'CustomerID': [104, 101, 103],
'PurchaseAmount': [250, 300, 150]
})
Merging with Inner Join (Default)¶
The default merge type is an inner join, which returns rows where the values match in both DataFrames.
# Performing an inner join based on 'CustomerID'
pd.merge(customers, purchases, on='CustomerID')
CustomerID | Name | PurchaseAmount | |
---|---|---|---|
0 | 101 | Alice | 300 |
1 | 103 | Charlie | 150 |
2 | 104 | Diana | 250 |
This will give us a DataFrame with customers who made a purchase and their corresponding purchase amounts.
Outer, Left, and Right Joins¶
Pandas supports different types of joins:
- Outer Join: Includes all rows from both DataFrames, filling missing values with
NaN
. - Left Join: Includes all rows from the left DataFrame and matches from the right.
- Right Join: Includes all rows from the right DataFrame and matches from the left.
# Outer join to include all customers and purchases
pd.merge(customers, purchases, how='outer', on='CustomerID')
# Left join to include all customers
# pd.merge(customers, purchases, how='left', on='CustomerID')
# Right join to include all purchases
pd.merge(customers, purchases, how='right', on='CustomerID')
CustomerID | Name | PurchaseAmount | |
---|---|---|---|
0 | 104 | Diana | 250 |
1 | 101 | Alice | 300 |
2 | 103 | Charlie | 150 |
Merging on Multiple Columns¶
Let’s create two more DataFrames with sales data to demonstrate merging on multiple columns.
import pandas as pd
# Creating DataFrames with multiple keys
sales_info = pd.DataFrame({
'StoreID': ['S1', 'S2', 'S3', 'S4'],
'Product': ['Apples', 'Bananas', 'Cherries', 'Corn'],
'Sales': [120, 200, 90, 150]
})
store_info = pd.DataFrame({
'StoreID': ['S1', 'S2', 'S3', 'S4'],
'Product': ['Apples', 'Bananas', 'Cherries', 'Dates'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
'Manager': ['John', 'Anna', 'Peter', 'Mike']
})
# Merging on both 'StoreID' and 'Product'
merged_data = pd.merge(sales_info, store_info, on=['StoreID', 'Product'])
# Displaying merged DataFrame
print("Merged DataFrame:")
print(merged_data)
Merged DataFrame: StoreID Product Sales City Manager 0 S1 Apples 120 New York John 1 S2 Bananas 200 Los Angeles Anna 2 S3 Cherries 90 Chicago Peter
This will merge both DataFrames on StoreID
and add the additional information like city and manager.
Joining¶
Joining is used to combine DataFrames based on their index. It is ideal when you want to align datasets using the row labels.
Creating DataFrames for Joining¶
Let’s create two DataFrames with sales and revenue data where the indexes represent different stores.
# Creating sales and revenue data
sales_data = pd.DataFrame({
'StoreID': ['S1', 'S2', 'S3'],
'Sales': [300, 450, 150]
}, index=['NY', 'LA', 'CH'])
revenue_data = pd.DataFrame({
'Revenue': [1200, 1500, 700]
}, index=['NY', 'LA', 'SF'])
Joining DataFrames Based on Index¶
We can join these two DataFrames based on their index using the join()
method.
# Joining sales and revenue data based on index
sales_data.join(revenue_data)
StoreID | Sales | Revenue | |
---|---|---|---|
NY | S1 | 300 | 1200.0 |
LA | S2 | 450 | 1500.0 |
CH | S3 | 150 | NaN |
This will join the two DataFrames where their indexes match.
Inner and Outer Joins with Index¶
You can specify whether to perform an inner or outer join when joining on the index.
# Performing an inner join (only matching indexes)
sales_data.join(revenue_data, how='inner')
# Performing an outer join (includes all indexes)
sales_data.join(revenue_data, how='outer')
StoreID | Sales | Revenue | |
---|---|---|---|
CH | S3 | 150.0 | NaN |
LA | S2 | 450.0 | 1500.0 |
NY | S1 | 300.0 | 1200.0 |
SF | NaN | NaN | 700.0 |