Useful Methods
Let's cover some useful methods and functions built into pandas. This is just a small sampling of what pandas offers, but these are some of the most commonly used. The documentation is a great resource for exploring more methods and functions (we will introduce more further along in the course). Here is a list of the functions and methods we'll cover here (click on one to jump to that section in this notebook):
- apply() method
- apply() with a function
- apply() with a lambda expression
- apply() on multiple columns
- describe()
- sort_values()
- corr()
- idxmin and idxmax
- value_counts
- replace
- unique and nunique
- map
- duplicated and drop_duplicates
- between
- sample
- nlargest
Make sure to view the video lessons to get the full explanation!
The .apply() method
Here we will learn about a very useful method known as apply on a DataFrame. This lets us apply and broadcast custom functions across a DataFrame column.
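Before loading the real dataset, here is a minimal sketch of the idea (the toy Series below is my own, not part of the lesson): .apply() calls a plain Python function once per element and returns a new Series of the results.

```python
import pandas as pd

# Hypothetical toy data, just to illustrate the mechanics
prices = pd.Series([5.0, 12.5, 40.0])

def double(num):
    # Called once per element of the Series
    return num * 2

doubled = prices.apply(double)
print(doubled.tolist())  # [10.0, 25.0, 80.0]
```

The original Series is left untouched; .apply() always returns a new object.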
```python
import pandas as pd
import numpy as np

df = pd.read_csv('tips.csv')
df.head()
```

(output: the first five rows, with columns total_bill, tip, sex, smoker, day, ...)

apply with a function
```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
...
```

```python
def last_four(num):
    return str(num)[-4:]

df['CC Number'][0]
# 3560325168603410

last_four(3560325168603410)
# '3410'

df['last_four'] = df['CC Number'].apply(last_four)
df.head()
```

(output: the first five rows, now including the new last_four column)

Using .apply() with more complex functions
```python
df['total_bill'].mean()
# 19.78594262295082

def yelp(price):
    if price < 10:
        return '$'
    elif price >= 10 and price < 30:
        return '$$'
    else:
        return '$$$'

df['Expensive'] = df['total_bill'].apply(yelp)
# df
```

apply with lambda
```python
def simple(num):
    return num*2

lambda num: num*2
# <function __main__.<lambda>(num)>

df['total_bill'].apply(lambda bill: bill*0.18)
```

```
0    3.0582
1    1.8612
2    3.7818
3    4.2624
4    4.4262
...
```

apply that uses multiple columns
Note, there are several ways to do this:
```python
df.head()
```

```python
def quality(total_bill, tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"

df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'], df['tip']), axis=1)
df.head()
```

```python
import numpy as np

df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
df.head()
```

So, which one is faster?
```python
import timeit

# code snippet to be executed only once
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill, tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"
'''

# code snippets whose execution time is to be measured
stmt_one = '''
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
'''

stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''

timeit.timeit(setup=setup, stmt=stmt_one, number=1000)
# 5.0198852999999986

timeit.timeit(setup=setup, stmt=stmt_two, number=1000)
# 0.21840849999999534
```

Wow! Vectorization is much faster! Keep np.vectorize() in mind for the future.
Full details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
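As a quick standalone sketch of what np.vectorize is doing (the bucket function and its threshold are my own toy example, not from the lesson): it wraps a function written for scalars so it can be called directly on whole arrays, broadcasting extra scalar arguments.

```python
import numpy as np

def bucket(x, threshold):
    # Plain Python function that expects scalar inputs
    return 'high' if x > threshold else 'low'

# np.vectorize returns a wrapper that accepts arrays
vec_bucket = np.vectorize(bucket)
result = vec_bucket(np.array([1.0, 7.5, 3.2]), 3.0)
print(result)  # ['low' 'high' 'high']
```

Note that np.vectorize is a convenience loop, not true compiled vectorization, but as the timings above show it still handily beats a row-wise .apply() with axis=1.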
df.describe for statistical summaries
```python
df.describe()
```

(summary statistics, one column per numeric field: total_bill, tip, size, price_per_person, CC Number)

```python
df.describe().transpose()
```

(the same table transposed, with the statistics as columns: count, mean, std, min, 25%, ...)

sort_values()
```python
df.sort_values('tip')
```

```python
# Helpful if you want to reorder after a sort
# https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
df.sort_values(['tip','size'])
```

df.corr() for correlation checks
Wikipedia on Correlation
```python
df.corr()
```

(pairwise correlations between the numeric columns: total_bill, tip, size, price_per_person, CC Number)

```python
df[['total_bill','tip']].corr()
```

```
            total_bill       tip
total_bill    1.000000  0.675734
tip           0.675734  1.000000
```

idxmin and idxmax
```python
df.head()
```

```python
df['total_bill'].max()
# 50.81

df['total_bill'].idxmax()
# 170

df['total_bill'].idxmin()
# 67

df.iloc[67]
```

```
total_bill      3.07
tip                1
sex           Female
smoker           Yes
day              Sat
...
```

```python
df.iloc[170]
```

```
total_bill    50.81
tip              10
sex            Male
smoker          Yes
day             Sat
...
```

value_counts
Nice method to quickly get a count per category. Only makes sense on categorical columns.
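A minimal sketch on a hand-made Series (my own toy data, not the tips dataset): value_counts() returns a Series of counts indexed by category, sorted from most to least frequent, and normalize=True (an extra parameter beyond what the lesson shows) converts the counts to proportions.

```python
import pandas as pd

days = pd.Series(['Sun', 'Sat', 'Sun', 'Sun', 'Sat', 'Fri'])

counts = days.value_counts()
print(counts['Sun'], counts['Sat'], counts['Fri'])  # 3 2 1

# normalize=True returns proportions instead of raw counts
print(days.value_counts(normalize=True)['Sun'])  # 0.5
```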
```python
df.head()
```

```python
df['sex'].value_counts()
```

```
Male      157
Female     87
Name: sex, dtype: int64
```

replace
Quickly replace one value with another.
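A small sketch with made-up values: the parallel-lists form is an extra I'm adding here (the lesson itself only uses single scalars), and it swaps several values in one call.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird'])

# Single value -> single value
print(s.replace('cat', 'feline').tolist())
# ['feline', 'dog', 'bird']

# Parallel lists replace several values at once, position by position
print(s.replace(['cat', 'dog'], ['feline', 'canine']).tolist())
# ['feline', 'canine', 'bird']
```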
```python
df.head()
```

```python
df['Tip Quality'].replace(to_replace='Other', value='Ok')
```

```
0    Ok
1    Ok
2    Ok
3    Ok
4    Ok
...
```

```python
df['Tip Quality'] = df['Tip Quality'].replace(to_replace='Other', value='Ok')
df.head()
```

unique
```python
df['size'].unique()
# array([2, 3, 4, 1, 6, 5], dtype=int64)

df['size'].nunique()
# 6

df['time'].unique()
# array(['Dinner', 'Lunch'], dtype=object)
```

map
```python
my_map = {'Dinner':'D', 'Lunch':'L'}
df['time'].map(my_map)
```

```
0    D
1    D
2    D
3    D
4    D
...
```

```python
df.head()
```

Duplicates
.duplicated() and .drop_duplicates()
```python
# Returns True for rows that duplicate an earlier row
# (the first occurrence itself is marked False)
df.duplicated()
```

```
0    False
1    False
2    False
3    False
4    False
...
```

```python
simple_df = pd.DataFrame([1,2,2], ['a','b','c'])
simple_df
```

```
   0
a  1
b  2
c  2
```

```python
simple_df.duplicated()
```

```
a    False
b    False
c     True
dtype: bool
```

```python
simple_df.drop_duplicates()
```

```
   0
a  1
b  2
```

between
- left: a scalar value that defines the left boundary
- right: a scalar value that defines the right boundary
- inclusive: a Boolean, True by default; if False, the two boundary values themselves are excluded from the check
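A quick sketch of the boundary behavior on made-up numbers. Note an assumption here: in pandas 1.3 and later, inclusive takes a string ('both', 'neither', 'left', 'right') rather than a Boolean, so this sketch uses the string form; 'both' is equivalent to the older inclusive=True used below.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20, 25])

# 'both' keeps the endpoints (the older inclusive=True)
print(s.between(10, 20, inclusive='both').tolist())
# [False, True, True, True, False]

# 'neither' excludes both endpoints (the older inclusive=False)
print(s.between(10, 20, inclusive='neither').tolist())
# [False, False, True, False, False]
```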
```python
df['total_bill'].between(10, 20, inclusive=True)
```

```
0     True
1     True
2    False
3    False
4    False
...
```

```python
df[df['total_bill'].between(10, 20, inclusive=True)]
```

sample
```python
# 5 random rows
df.sample(5)
```

```python
# A random 10% of the rows
df.sample(frac=0.1)
```

nlargest and nsmallest
```python
# The 10 rows with the largest tip (nsmallest works the same way)
df.nlargest(10, 'tip')
```