Useful Methods
Let's cover some useful methods and functions built into pandas. This is just a small sampling of what pandas offers, but these are some of the most commonly used. The documentation is a great resource for exploring more methods and functions (we will introduce more further along in the course). Here is a list of the functions and methods we'll cover here (click on one to jump to that section in this notebook):
- apply() method
- apply() with a function
- apply() with a lambda expression
- apply() on multiple columns
- describe()
- sort_values()
- corr()
- idxmin and idxmax
- value_counts
- replace
- unique and nunique
- map
- duplicated and drop_duplicates
- between
- sample
- nlargest
Make sure to view the video lessons to get the full explanation!
The .apply() method
Here we will learn about a very useful method known as apply on a DataFrame. This lets us apply and broadcast custom functions across a DataFrame column.
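Before loading the real dataset, here is a minimal sketch of the idea (the toy Series below is my own, not part of the lesson): .apply() calls a plain Python function once per element and returns a new Series of the results.

```python
import pandas as pd

# Hypothetical toy data, just to illustrate the mechanics
prices = pd.Series([5.0, 12.5, 40.0])

def double(num):
    # Called once per element of the Series
    return num * 2

doubled = prices.apply(double)
print(doubled.tolist())  # [10.0, 25.0, 80.0]
```

The original Series is left untouched; .apply() always returns a new object.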
```python
import pandas as pd
import numpy as np

df = pd.read_csv('tips.csv')
df.head()
```

(output: the first five rows, with columns total_bill, tip, sex, smoker, day, ...)

apply with a function
```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
...
```

```python
def last_four(num):
    return str(num)[-4:]

df['CC Number'][0]
# 3560325168603410

last_four(3560325168603410)
# '3410'

df['last_four'] = df['CC Number'].apply(last_four)
df.head()
```

(output: the first five rows, now including the new last_four column)

Using .apply() with more complex functions
```python
df['total_bill'].mean()
# 19.78594262295082

def yelp(price):
    if price < 10:
        return '$'
    elif price >= 10 and price < 30:
        return '$$'
    else:
        return '$$$'

df['Expensive'] = df['total_bill'].apply(yelp)
# df
```

apply with lambda
```python
def simple(num):
    return num*2

lambda num: num*2
# <function __main__.<lambda>(num)>

df['total_bill'].apply(lambda bill: bill*0.18)
```

```
0    3.0582
1    1.8612
2    3.7818
3    4.2624
4    4.4262
...
```

apply that uses multiple columns
Note, there are several ways to do this:
```python
df.head()
```

```python
def quality(total_bill, tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"

df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'], df['tip']), axis=1)
df.head()
```

```python
import numpy as np

df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
df.head()
```

So, which one is faster?
```python
import timeit

# code snippet to be executed only once
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill, tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"
'''

# code snippets whose execution time is to be measured
stmt_one = '''
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
'''

stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''

timeit.timeit(setup=setup, stmt=stmt_one, number=1000)
# 5.0198852999999986

timeit.timeit(setup=setup, stmt=stmt_two, number=1000)
# 0.21840849999999534
```

Wow! Vectorization is much faster! Keep np.vectorize() in mind for the future.
Full details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
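As a quick standalone sketch of what np.vectorize is doing (the bucket function and its threshold are my own toy example, not from the lesson): it wraps a function written for scalars so it can be called directly on whole arrays, broadcasting extra scalar arguments.

```python
import numpy as np

def bucket(x, threshold):
    # Plain Python function that expects scalar inputs
    return 'high' if x > threshold else 'low'

# np.vectorize returns a wrapper that accepts arrays
vec_bucket = np.vectorize(bucket)
result = vec_bucket(np.array([1.0, 7.5, 3.2]), 3.0)
print(result)  # ['low' 'high' 'high']
```

Note that np.vectorize is a convenience loop, not true compiled vectorization, but as the timings above show it still handily beats a row-wise .apply() with axis=1.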
df.describe for statistical summaries
```python
df.describe()
```

(summary statistics, one column per numeric field: total_bill, tip, size, price_per_person, CC Number)

```python
df.describe().transpose()
```

(the same table transposed, with the statistics as columns: count, mean, std, min, 25%, ...)

sort_values()
```python
df.sort_values('tip')
```

```python
# Helpful if you want to reorder after a sort
# https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
df.sort_values(['tip','size'])
```

df.corr() for correlation checks
Wikipedia on Correlation
```python
df.corr()
```

(pairwise correlations between the numeric columns: total_bill, tip, size, price_per_person, CC Number)

```python
df[['total_bill','tip']].corr()
```

```
            total_bill       tip
total_bill    1.000000  0.675734
tip           0.675734  1.000000
```

idxmin and idxmax
```python
df.head()
```

```python
df['total_bill'].max()
# 50.81

df['total_bill'].idxmax()
# 170

df['total_bill'].idxmin()
# 67

df.iloc[67]
```

```
total_bill      3.07
tip                1
sex           Female
smoker           Yes
day              Sat
...
```

```python
df.iloc[170]
```

```
total_bill    50.81
tip              10
sex            Male
smoker          Yes
day             Sat
...
```

value_counts
Nice method to quickly get a count per category. Only makes sense on categorical columns.
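A minimal sketch on a hand-made Series (my own toy data, not the tips dataset): value_counts() returns a Series of counts indexed by category, sorted from most to least frequent, and normalize=True (an extra parameter beyond what the lesson shows) converts the counts to proportions.

```python
import pandas as pd

days = pd.Series(['Sun', 'Sat', 'Sun', 'Sun', 'Sat', 'Fri'])

counts = days.value_counts()
print(counts['Sun'], counts['Sat'], counts['Fri'])  # 3 2 1

# normalize=True returns proportions instead of raw counts
print(days.value_counts(normalize=True)['Sun'])  # 0.5
```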
```python
df.head()
```

```python
df['sex'].value_counts()
```

```
Male      157
Female     87
Name: sex, dtype: int64
```

replace
Quickly replace one value with another.
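A small sketch with made-up values: the parallel-lists form is an extra I'm adding here (the lesson itself only uses single scalars), and it swaps several values in one call.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird'])

# Single value -> single value
print(s.replace('cat', 'feline').tolist())
# ['feline', 'dog', 'bird']

# Parallel lists replace several values at once, position by position
print(s.replace(['cat', 'dog'], ['feline', 'canine']).tolist())
# ['feline', 'canine', 'bird']
```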
```python
df.head()
```

```python
df['Tip Quality'].replace(to_replace='Other', value='Ok')
```

```
0    Ok
1    Ok
2    Ok
3    Ok
4    Ok
...
```

```python
df['Tip Quality'] = df['Tip Quality'].replace(to_replace='Other', value='Ok')
df.head()
```

unique
```python
df['size'].unique()
# array([2, 3, 4, 1, 6, 5], dtype=int64)

df['size'].nunique()
# 6

df['time'].unique()
# array(['Dinner', 'Lunch'], dtype=object)
```

map
```python
my_map = {'Dinner':'D', 'Lunch':'L'}
df['time'].map(my_map)
```

```
0    D
1    D
2    D
3    D
4    D
...
```

```python
df.head()
```

Duplicates
.duplicated() and .drop_duplicates()
```python
# Returns True for rows that duplicate an earlier row
# (the first occurrence itself is marked False)
df.duplicated()
```

```
0    False
1    False
2    False
3    False
4    False
...
```

```python
simple_df = pd.DataFrame([1,2,2], ['a','b','c'])
simple_df
```

```
   0
a  1
b  2
c  2
```

```python
simple_df.duplicated()
```

```
a    False
b    False
c     True
dtype: bool
```

```python
simple_df.drop_duplicates()
```

```
   0
a  1
b  2
```

between
- left: a scalar value that defines the left boundary
- right: a scalar value that defines the right boundary
- inclusive: a Boolean, True by default; if False, the two boundary values themselves are excluded from the check
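A quick sketch of the boundary behavior on made-up numbers. Note an assumption here: in pandas 1.3 and later, inclusive takes a string ('both', 'neither', 'left', 'right') rather than a Boolean, so this sketch uses the string form; 'both' is equivalent to the older inclusive=True used below.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20, 25])

# 'both' keeps the endpoints (the older inclusive=True)
print(s.between(10, 20, inclusive='both').tolist())
# [False, True, True, True, False]

# 'neither' excludes both endpoints (the older inclusive=False)
print(s.between(10, 20, inclusive='neither').tolist())
# [False, False, True, False, False]
```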
```python
df['total_bill'].between(10, 20, inclusive=True)
```

```
0     True
1     True
2    False
3    False
4    False
...
```

```python
df[df['total_bill'].between(10, 20, inclusive=True)]
```

sample
```python
# 5 random rows
df.sample(5)
```

```python
# A random 10% of the rows
df.sample(frac=0.1)
```

nlargest and nsmallest
```python
# The 10 rows with the largest tip (nsmallest works the same way)
df.nlargest(10, 'tip')
```