++++Notebook converted from Jupyter for blog publishing.
05-Groupby-Operations-and-MultiIndex
Groupby Operations and Multi-level Index
import numpy as np
import pandas as pdData
df = pd.read_csv('mpg.csv')dfmpg
cylinders
displacement
horsepower
weightgroupby() method
# Creates a groupby object waiting for an aggregate method
df.groupby('model_year')<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000246790FEC88>Adding an aggregate method call. To use a grouped object, you need to tell pandas how you want to aggregate the data.
Common Options:
mean(): Compute mean of groups sum(): Compute sum of group values size(): Compute group sizes count(): Compute count of group std(): Standard deviation of groups var(): Compute variance of groups sem(): Standard error of the mean of groups describe(): Generates descriptive statistics first(): Compute first of group values last(): Compute last of group values nth() : Take nth value, or a subset if n is a list min(): Compute min of group values max(): Compute max of group values
Full List at the Online Documentation: https://pandas.pydata.org/docs/reference/groupby.html (opens in a new tab)
# model_year becomes the index! It is NOT a column name,it is now the name of the index
df.groupby('model_year').mean()mpg
cylinders
displacement
weight
accelerationavg_year = df.groupby('model_year').mean()avg_year.indexInt64Index([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], dtype='int64', name='model_year')avg_year.columnsIndex(['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'origin'], dtype='object')avg_year['mpg']model_year
70 17.689655
71 21.250000
72 18.714286
73 17.100000df.groupby('model_year').mean()['mpg']model_year
70 17.689655
71 21.250000
72 18.714286
73 17.100000df.groupby('model_year').describe()mpg
cylinders
...
acceleration
origindf.groupby('model_year').describe().transpose()model_year
70
71
72
73Groupby Multiple Columns
Let's explore average mpg per year per cylinder count
df.groupby(['model_year','cylinders']).mean()mpg
displacement
weight
acceleration
origindf.groupby(['model_year','cylinders']).mean().indexMultiIndex([(70, 4),
(70, 6),
(70, 8),
(71, 4),
(71, 6),MultiIndex
The MultiIndex Object
year_cyl = df.groupby(['model_year','cylinders']).mean()year_cylmpg
displacement
weight
acceleration
originyear_cyl.indexMultiIndex([(70, 4),
(70, 6),
(70, 8),
(71, 4),
(71, 6),year_cyl.index.levelsFrozenList([[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [3, 4, 5, 6, 8]])year_cyl.index.namesFrozenList(['model_year', 'cylinders'])Indexing with the Hierarchical Index
Full Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html (opens in a new tab)
year_cyl.head()mpg
displacement
weight
acceleration
originGrab Based on Outside Index
year_cyl.loc[70]mpg
displacement
weight
acceleration
originyear_cyl.loc[[70,72]]mpg
displacement
weight
acceleration
originGrab a Single Row
year_cyl.loc[(70,8)]mpg 14.111111
displacement 367.555556
weight 3940.055556
acceleration 11.194444
origin 1.000000Grab Based on Cross-section with .xs()
This method takes a key{:python} argument to select data at a particular
level of a MultiIndex.
Parameters
key : label or tuple of label Label contained in the index, or partially in a MultiIndex. axis : {0 or 'index', 1 or 'columns'}, default 0 Axis to retrieve cross-section on. level : object, defaults to first n levels (n=1 or len(key)) In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.
year_cyl.xs(key=70,axis=0,level='model_year')mpg
displacement
weight
acceleration
origin# Mean column values for 4 cylinders per year
year_cyl.xs(key=4,axis=0,level='cylinders')mpg
displacement
weight
acceleration
originCareful note!
Keep in mind, its usually much easier to filter out values before running a groupby() call, so you should attempt to filter out any values/categories you don't want to use. For example, its much easier to remove 4 cylinder cars before the groupby() call, very difficult to this sort of thing after a group by.
df[df['cylinders'].isin([6,8])].groupby(['model_year','cylinders']).mean()mpg
displacement
weight
acceleration
originSwap Levels
- Swapping Levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#swapping-levels-with-swaplevel (opens in a new tab)
- Generalized Method is reorder_levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#reordering-levels-with-reorder-levels (opens in a new tab)
year_cyl.swaplevel().head()mpg
displacement
weight
acceleration
originSorting MultiIndex
year_cyl.sort_index(level='model_year',ascending=False)mpg
displacement
weight
acceleration
originyear_cyl.sort_index(level='cylinders',ascending=False)mpg
displacement
weight
acceleration
originAdvanced: agg() method
The agg() method allows you to customize what aggregate functions you want per category
dfmpg
cylinders
displacement
horsepower
weightagg() on a DataFrame
# These strings need to match up with built-in method names
df.agg(['median','mean'])mpg
cylinders
displacement
weight
accelerationdf.agg(['sum','mean'])[['mpg','weight']]mpg
weight
sum
9358.800000
1.182229e+06Specify aggregate methods per column
agg() is very powerful,allowing you to pass in a dictionary where the keys are the columns and the values are a list of aggregate methods.
df.agg({'mpg':['median','mean'],'weight':['mean','std']})mpg
weight
mean
23.514573
2970.424623agg() with groupby()
df.groupby('model_year').agg({'mpg':['median','mean'],'weight':['mean','std']})mpg
weight
median
mean
mean