Basics of Pandas in Python: Aggregation & Groupby Methods¶
Pandas is a powerful library in Python used for data manipulation and analysis. This tutorial will guide you through the basics of aggregation and groupby methods in Pandas, which are essential for summarizing and analyzing data.
Aggregation Methods¶
Aggregation involves computing a summary statistic (or statistics) over a set of values, such as a whole column or each group within it, using functions like mean, median, min, max, sum, and count.
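As a quick, self-contained illustration (using throwaway numbers, not the store data introduced below), an aggregation collapses a whole column of values into a single summary value:
import pandas as pd
# A small throwaway Series just to illustrate what aggregation means
s = pd.Series([10, 20, 30, 40])
s.mean()                        # 25.0
s.sum()                         # 100
s.agg(['min', 'max', 'count'])  # min 10, max 40, count 4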
Creating a Sample DataFrame¶
Let's start by creating a sample DataFrame that contains sales data for a fictional retail store. This data includes the Product ID, Category, Quantity Sold, and Revenue columns.
import pandas as pd
# Sample data
data = {
'Product ID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Office Supplies', 'Office Supplies'],
'Quantity Sold': [30, 45, 10, 5, 50, 60],
'Revenue': [3000, 4500, 2000, 1000, 2500, 3000]
}
df = pd.DataFrame(data)
Viewing the DataFrame¶
Before we perform any aggregation, it's good practice to understand the structure of the DataFrame.
# View the DataFrame
df
|   | Product ID | Category | Quantity Sold | Revenue |
|---|---|---|---|---|
| 0 | 101 | Electronics | 30 | 3000 |
| 1 | 102 | Electronics | 45 | 4500 |
| 2 | 103 | Furniture | 10 | 2000 |
| 3 | 104 | Furniture | 5 | 1000 |
| 4 | 105 | Office Supplies | 50 | 2500 |
| 5 | 106 | Office Supplies | 60 | 3000 |
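Beyond printing the DataFrame itself, a couple of standard inspection calls (a quick sketch, not part of the original walkthrough) show the column dtypes, non-null counts, and overall shape:
# Column names, dtypes, and non-null counts
df.info()
# Just the dtypes
df.dtypes
# Number of rows and columns
df.shape   # (6, 4)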
Calculating Summary Statistics¶
The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution.
# Get summary statistics
df.describe()
|   | Product ID | Quantity Sold | Revenue |
|---|---|---|---|
| count | 6.000000 | 6.000000 | 6.000000 |
| mean | 103.500000 | 33.333333 | 2666.666667 |
| std | 1.870829 | 22.286020 | 1169.045194 |
| min | 101.000000 | 5.000000 | 1000.000000 |
| 25% | 102.250000 | 15.000000 | 2125.000000 |
| 50% | 103.500000 | 37.500000 | 2750.000000 |
| 75% | 104.750000 | 48.750000 | 3000.000000 |
| max | 106.000000 | 60.000000 | 4500.000000 |
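By default, describe() summarizes only the numeric columns, which is why Category does not appear in the table above. Passing include='all' (a standard describe() parameter) adds non-numeric columns, reporting count, unique, top, and freq for them:
# Include non-numeric columns such as Category in the summary
df.describe(include='all')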
Transposing the Summary Statistics¶
Transposing the summary statistics can make them easier to read, especially when dealing with many columns.
# Transpose the summary statistics
df.describe().transpose()
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Product ID | 6.0 | 103.500000 | 1.870829 | 101.0 | 102.25 | 103.5 | 104.75 | 106.0 |
| Quantity Sold | 6.0 | 33.333333 | 22.286020 | 5.0 | 15.00 | 37.5 | 48.75 | 60.0 |
| Revenue | 6.0 | 2666.666667 | 1169.045194 | 1000.0 | 2125.00 | 2750.0 | 3000.00 | 4500.0 |
Mean¶
Calculates the average of the values in a column.
# Mean of Quantity Sold
df['Quantity Sold'].mean()
np.float64(33.333333333333336)
Median¶
Finds the middle value when the data is sorted.
# Median of Revenue
df['Revenue'].median()
np.float64(2750.0)
Mode¶
Identifies the most frequently occurring value(s).
# Mode of Quantity Sold
df['Quantity Sold'].mode()
0     5
1    10
2    30
3    45
4    50
5    60
Name: Quantity Sold, dtype: int64
Standard Deviation¶
Measures the amount of variation or dispersion of a set of values.
# Standard deviation of Revenue
df['Revenue'].std()
np.float64(1169.045194450012)
Variance¶
Calculates how far the numbers are spread out from their average value.
# Variance of Quantity Sold
df['Quantity Sold'].var()
np.float64(496.6666666666667)
Sum¶
Adds up all the values in a column.
# Total Quantity Sold
df['Quantity Sold'].sum()
np.int64(200)
Product¶
Calculates the product of all the values in a column. Note that the revenues here multiply out to roughly 2.0 × 10^20, which exceeds the range of the int64 dtype, so the result wraps around to a meaningless negative number.
# Product of Revenue
df['Revenue'].prod()
np.int64(-414184810805067776)
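If you genuinely need this product, one way to sidestep the int64 overflow (a sketch, trading exactness for range) is to cast to a floating-point dtype before multiplying:
# Cast to float64 so the product does not wrap around
df['Revenue'].astype('float64').prod()   # approximately 2.025e+20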
Groupby Methods¶
The groupby() method splits the DataFrame into groups based on one or more columns, so that aggregation functions can be applied to each group separately.
Mean by Group¶
Calculates the average for each group.
# Group the data by Category
grouped = df.groupby('Category')
# Mean Quantity Sold by Category
grouped['Quantity Sold'].mean()
Category
Electronics        37.5
Furniture           7.5
Office Supplies    55.0
Name: Quantity Sold, dtype: float64
Sum by Group¶
Calculates the total for each group.
# Total Revenue by Category
grouped['Revenue'].sum()
Category
Electronics        7500
Furniture          3000
Office Supplies    5500
Name: Revenue, dtype: int64
Min and Max by Group¶
Finds the minimum and maximum values within each group.
# Minimum and Maximum Quantity Sold by Category
grouped['Quantity Sold'].agg(['min', 'max'])
| Category | min | max |
|---|---|---|
| Electronics | 30 | 45 |
| Furniture | 5 | 10 |
| Office Supplies | 50 | 60 |
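The count aggregation mentioned at the start of this tutorial works the same way per group; size() is a closely related method that counts all rows, including any with missing values. A quick sketch (each Category here contains 2 rows):
# Number of non-null Product ID values in each Category
grouped['Product ID'].count()
# Number of rows in each Category
grouped.size()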
Grouping by Multiple Columns¶
You can also group the data by multiple columns.
# Group by Category and Product ID
grouped_multi = df.groupby(['Category', 'Product ID'])
Aggregations on Multiple Groups¶
# Sum of Quantity Sold for each Product in each Category
grouped_multi['Quantity Sold'].sum()
Category         Product ID
Electronics      101           30
                 102           45
Furniture        103           10
                 104            5
Office Supplies  105           50
                 106           60
Name: Quantity Sold, dtype: int64
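The result above is a Series with a two-level index. If a tabular layout is easier to read, unstack() (a standard reshaping method) pivots the inner level, here Product ID, into columns; this is an optional sketch rather than a required step, and combinations that do not occur show up as NaN:
# Pivot Product ID from the index into columns
grouped_multi['Quantity Sold'].sum().unstack()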
Using the aggregate() Function¶
The aggregate() function, commonly called via its shorter alias agg(), allows you to perform multiple aggregations on one or more columns.
# Multiple aggregations on Quantity Sold and Revenue
grouped.agg({
'Quantity Sold': ['min', 'max', 'mean'],
'Revenue': ['sum', 'mean']
})
| Category | Quantity Sold (min) | Quantity Sold (max) | Quantity Sold (mean) | Revenue (sum) | Revenue (mean) |
|---|---|---|---|---|---|
| Electronics | 30 | 45 | 37.5 | 7500 | 3750.0 |
| Furniture | 5 | 10 | 7.5 | 3000 | 1500.0 |
| Office Supplies | 50 | 60 | 55.0 | 5500 | 2750.0 |
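Recent pandas versions also support named aggregation, where each output column gets an explicit name instead of a two-level header; the column names below (total_revenue, avg_quantity) are just illustrative choices:
# Named aggregation: output columns take the names you choose
grouped.agg(
    total_revenue=('Revenue', 'sum'),
    avg_quantity=('Quantity Sold', 'mean')
)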
Applying Custom Functions¶
You can apply custom functions to your groups using the apply()
method.
# Define a custom function to calculate the range
def data_range(x):
return x.max() - x.min()
# Apply the custom function to Quantity Sold
grouped['Quantity Sold'].apply(data_range)
Category
Electronics        15
Furniture           5
Office Supplies    10
Name: Quantity Sold, dtype: int64
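The same range calculation can also be written inline with agg() and a lambda, which avoids defining a separate function; this is an equivalent sketch rather than a change to the approach above:
# Equivalent range calculation using agg() with a lambda
grouped['Quantity Sold'].agg(lambda x: x.max() - x.min())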