Pandas Groupby Average
GroupBy operations summarize and aggregate data, and one of the most common is calculating the average (mean) of each group within a DataFrame. Whether you're analyzing sales data by region, customer behavior by age group, or any other grouped data, the groupby() method combined with an aggregation function such as mean() computes the average for each group in a single expression.
Let's understand with a simple example:
import pandas as pd
data = {'Name': ['Emma', 'Hasan', 'Rob', 'Emma', 'Hasan'],
'Marks': [85, 70, 65, 90, 65]}
df = pd.DataFrame(data)
average_marks = df.groupby('Name')['Marks'].mean()
print(average_marks)
Output
Name
Emma     87.5
Hasan    67.5
Rob      65.0
Name: Marks, dtype: float64
A groupby operation involves three key steps, often called split-apply-combine:
- Splitting: The data is divided into groups based on specified criteria.
- Applying: A function (like mean, sum, etc.) is applied to each group.
- Combining: The results are combined back into a DataFrame or Series.
This pattern is useful because it reduces a large dataset to one summary value (or row) per group without any explicit loops in your own code.
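The three steps above can be sketched by hand to show what groupby() does. This is only an illustration using the Marks data from the first example; the names `groups`, `means`, and `manual` are hypothetical, and pandas does not necessarily perform the work this way internally.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Emma', 'Hasan', 'Rob', 'Emma', 'Hasan'],
    'Marks': [85, 70, 65, 90, 65],
})

# Splitting: iterating over a GroupBy object yields (key, sub-DataFrame) pairs
groups = {name: group for name, group in df.groupby('Name')}

# Applying: compute the mean of 'Marks' inside each group
means = {name: group['Marks'].mean() for name, group in groups.items()}

# Combining: collect the per-group results into a single Series
manual = pd.Series(means, name='Marks')
print(manual)

# The one-line equivalent used throughout this article
builtin = df.groupby('Name')['Marks'].mean()
print(builtin)
```

Both Series hold the same values, for example (85 + 90) / 2 = 87.5 for Emma.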
Method 1: Grouping by a Single Column
The most basic way to calculate group averages is to group by a single column and then apply the mean() function to the grouped data.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
'Age': [25, 30, 35, 28, 22],
'Salary': [50000, 60000, 70000, 55000, 48000]
}
df = pd.DataFrame(data)
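# Group by 'Gender' and average the numeric columns of each group.
# numeric_only=True skips the text column 'Name', which has no mean
# (recent pandas versions raise a TypeError without it).
grouped_data = df.groupby('Gender').mean(numeric_only=True)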
print(grouped_data)
Output
         Age   Salary
Gender
Female  25.0  51000.0
Male    32.5  65000.0
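When only some columns matter, an alternative is to select them before calling mean(). The following is a minimal sketch that reuses the data from this method; the variable names `selected` and `flat` are illustrative, and `as_index=False` is an optional setting that keeps Gender as a regular column instead of the index.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 70000, 55000, 48000]
}
df = pd.DataFrame(data)

# Select the columns to average explicitly, so 'Name' is never aggregated
selected = df.groupby('Gender')[['Age', 'Salary']].mean()
print(selected)

# as_index=False returns 'Gender' as an ordinary column rather than the index
flat = df.groupby('Gender', as_index=False)[['Age', 'Salary']].mean()
print(flat)
```

Selecting columns up front makes the intent explicit and avoids depending on which columns pandas treats as numeric.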
Method 2: Grouping by Multiple Columns
You can also group by multiple columns to calculate averages for more specific subgroups. This is helpful when you want to segment your data into more detailed categories, such as grouping by both Gender and Age.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
'Age': [25, 30, 35, 28, 22, 40, 29],
'Salary': [50000, 60000, 70000, 55000, 48000, 72000, 53000]
}
df = pd.DataFrame(data)
# Group by 'Gender' and 'Age', then average the remaining numeric columns
# (numeric_only=True skips the text column 'Name')
grouped_data = df.groupby(['Gender', 'Age']).mean(numeric_only=True)
print(grouped_data)
Output
             Salary
Gender Age
Female 22   48000.0
       25   50000.0
       28   55000.0
       29   53000.0
Male   30   60000.0
       35   70000.0
       40   72000.0
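Grouping by several columns produces a result indexed by a MultiIndex, one level per grouping column. In this dataset every (Gender, Age) pair contains a single person, so each mean simply equals that person's salary. The sketch below, with illustrative variable names `grouped`, `female_28`, and `flat`, shows one way to look up a single subgroup and to turn the index levels back into ordinary columns.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Age': [25, 30, 35, 28, 22, 40, 29],
    'Salary': [50000, 60000, 70000, 55000, 48000, 72000, 53000]
}
df = pd.DataFrame(data)

grouped = df.groupby(['Gender', 'Age']).mean(numeric_only=True)

# Look up one subgroup by passing a (Gender, Age) tuple to .loc
female_28 = grouped.loc[('Female', 28), 'Salary']
print(female_28)

# reset_index() converts the 'Gender' and 'Age' index levels into columns
flat = grouped.reset_index()
print(flat)
```

A flat table of this kind is often easier to filter, merge, or export than a MultiIndex result.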
Method 3: Grouping with Multiple Aggregation Functions
Sometimes you may want to calculate not just the average but several statistics (such as count, sum, or median) for each group. Pandas lets you apply multiple aggregation functions in one call using agg().
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
'Salary': [50000, 60000, 70000, 55000, 52000]
}
df = pd.DataFrame(data)
# Group by 'Gender' and calculate statistics for 'Salary'
grouped_df = df.groupby('Gender')['Salary'].agg(['mean', 'sum', 'count'])
print(grouped_df)
Output
                mean     sum  count
Gender
Female  52333.333333  157000      3
Male    65000.000000  130000      2
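If the default column names (mean, sum, count) are not descriptive enough, agg() also accepts keyword arguments of the form new_name=(column, function), which pandas calls named aggregation. The sketch below reuses the data from this method; the output names `avg_salary`, `total_salary`, and `headcount` are illustrative choices, not part of the original example.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
    'Salary': [50000, 60000, 70000, 55000, 52000]
}
df = pd.DataFrame(data)

# Each keyword names an output column; the tuple gives (source column, function)
summary = df.groupby('Gender').agg(
    avg_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    headcount=('Salary', 'count'),
)
print(summary)
```

The same call can draw on different source columns for each output, which the list form of agg() on a single selected column cannot do.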