++++

Data Science

May 2026×Notebook lesson

Notebook converted from Jupyter for blog publishing.

05-Groupby-Operations-and-MultiIndex

Driptanil DattaSoftware Developer

Groupby Operations and Multi-level Index

import numpy as np
import pandas as pd

Data

df = pd.read_csv('mpg.csv')

df

HTML

mpg
cylinders
displacement
horsepower
weight

groupby() method

# Creates a groupby object waiting for an aggregate method
df.groupby('model_year')

RESULT

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000246790FEC88>

Adding an aggregate method call. To use a grouped object, you need to tell pandas how you want to aggregate the data.

Common Options:

mean(): Compute mean of groups sum(): Compute sum of group values size(): Compute group sizes count(): Compute count of group std(): Standard deviation of groups var(): Compute variance of groups sem(): Standard error of the mean of groups describe(): Generates descriptive statistics first(): Compute first of group values last(): Compute last of group values nth() : Take nth value, or a subset if n is a list min(): Compute min of group values max(): Compute max of group values

Full List at the Online Documentation: https://pandas.pydata.org/docs/reference/groupby.html (opens in a new tab)

# model_year becomes the index! It is NOT a column name,it is now the name of the index
df.groupby('model_year').mean()

HTML

mpg
cylinders
displacement
weight
acceleration

avg_year = df.groupby('model_year').mean()

avg_year.index

RESULT

Int64Index([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], dtype='int64', name='model_year')

avg_year.columns

RESULT

Index(['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'origin'], dtype='object')

avg_year['mpg']

RESULT

model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000

df.groupby('model_year').mean()['mpg']

RESULT

model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000

df.groupby('model_year').describe()

HTML

mpg
cylinders
...
acceleration
origin

df.groupby('model_year').describe().transpose()

HTML

model_year
70
71
72
73

Groupby Multiple Columns

Let's explore average mpg per year per cylinder count

df.groupby(['model_year','cylinders']).mean()

HTML

mpg
displacement
weight
acceleration
origin

df.groupby(['model_year','cylinders']).mean().index

RESULT

MultiIndex([(70, 4),
            (70, 6),
            (70, 8),
            (71, 4),
            (71, 6),

MultiIndex

The MultiIndex Object

year_cyl = df.groupby(['model_year','cylinders']).mean()

year_cyl

HTML

mpg
displacement
weight
acceleration
origin

year_cyl.index

RESULT

MultiIndex([(70, 4),
            (70, 6),
            (70, 8),
            (71, 4),
            (71, 6),

year_cyl.index.levels

RESULT

FrozenList([[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [3, 4, 5, 6, 8]])

year_cyl.index.names

RESULT

FrozenList(['model_year', 'cylinders'])

Indexing with the Hierarchical Index

Full Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html (opens in a new tab)

year_cyl.head()

HTML

mpg
displacement
weight
acceleration
origin

Grab Based on Outside Index

year_cyl.loc[70]

HTML

mpg
displacement
weight
acceleration
origin

year_cyl.loc[[70,72]]

HTML

mpg
displacement
weight
acceleration
origin

Grab a Single Row

year_cyl.loc[(70,8)]

RESULT

mpg               14.111111
displacement     367.555556
weight          3940.055556
acceleration      11.194444
origin             1.000000

Grab Based on Cross-section with .xs()

This method takes a key{:python} argument to select data at a particular level of a MultiIndex.

Parameters

key : label or tuple of label Label contained in the index, or partially in a MultiIndex. axis : {0 or 'index', 1 or 'columns'}, default 0 Axis to retrieve cross-section on. level : object, defaults to first n levels (n=1 or len(key)) In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.

year_cyl.xs(key=70,axis=0,level='model_year')

HTML

mpg
displacement
weight
acceleration
origin

# Mean column values for 4 cylinders per year
year_cyl.xs(key=4,axis=0,level='cylinders')

HTML

mpg
displacement
weight
acceleration
origin

Careful note!

Keep in mind, its usually much easier to filter out values before running a groupby() call, so you should attempt to filter out any values/categories you don't want to use. For example, its much easier to remove 4 cylinder cars before the groupby() call, very difficult to this sort of thing after a group by.

df[df['cylinders'].isin([6,8])].groupby(['model_year','cylinders']).mean()

HTML

mpg
displacement
weight
acceleration
origin

Swap Levels

Swapping Levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#swapping-levels-with-swaplevel (opens in a new tab)
Generalized Method is reorder_levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#reordering-levels-with-reorder-levels (opens in a new tab)

year_cyl.swaplevel().head()

HTML

mpg
displacement
weight
acceleration
origin

Sorting MultiIndex

https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#sorting-a-multiindex (opens in a new tab)

year_cyl.sort_index(level='model_year',ascending=False)

HTML

mpg
displacement
weight
acceleration
origin

year_cyl.sort_index(level='cylinders',ascending=False)

HTML

mpg
displacement
weight
acceleration
origin

Advanced: agg() method

The agg() method allows you to customize what aggregate functions you want per category

df

HTML

mpg
cylinders
displacement
horsepower
weight

agg() on a DataFrame

# These strings need to match up with built-in method names
df.agg(['median','mean'])

HTML

mpg
cylinders
displacement
weight
acceleration

df.agg(['sum','mean'])[['mpg','weight']]

HTML

mpg
weight
sum
9358.800000
1.182229e+06

Specify aggregate methods per column

agg() is very powerful,allowing you to pass in a dictionary where the keys are the columns and the values are a list of aggregate methods.

df.agg({'mpg':['median','mean'],'weight':['mean','std']})

HTML

mpg
weight
mean
23.514573
2970.424623

agg() with groupby()

df.groupby('model_year').agg({'mpg':['median','mean'],'weight':['mean','std']})

HTML

mpg
weight
median
mean
mean

04 Missing Data 06 Combining Dataframes