Pandas
03 Useful Methods
Data Science
May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.

Driptanil Datta, Software Developer

Useful Methods

Let's cover some useful methods and functions built into pandas. This is only a small sample of what pandas offers, but these are some of the most commonly used. The pandas documentation is a great resource for exploring more methods and functions (we will introduce more further along in the course). This notebook covers the following: .apply(), describe(), sort_values(), corr(), idxmin() and idxmax(), value_counts(), replace(), unique() and nunique(), map(), duplicated() and drop_duplicates(), between(), sample(), and nlargest() and nsmallest().

Make sure to view the video lessons to get the full explanation!

The .apply() method

Here we will learn about a very useful method called apply. It lets us run a custom function on every value in a DataFrame column and collect the results in a new Series.
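
Before loading the real data, here is a minimal, self-contained sketch of the idea. The toy DataFrame and the helper name add_tax are made up for illustration and do not come from tips.csv.

```python
import pandas as pd

# Hypothetical toy data; not taken from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0]})

def add_tax(amount):
    # Illustrative helper: add a flat 10% tax to a single value
    return round(amount * 1.10, 2)

# apply calls add_tax once per value and returns a new Series
toy['with_tax'] = toy['total_bill'].apply(add_tax)
print(toy)
```

The function only ever sees one value at a time, which is why ordinary Python logic works inside it.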

import pandas as pd
import numpy as np
df = pd.read_csv('tips.csv')
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)

apply with a function

df.info()
STDOUT (truncated)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
def last_four(num):
    # Convert to a string so it can be sliced, then keep the final four characters
    return str(num)[-4:]
df['CC Number'][0]
RESULT
3560325168603410
last_four(3560325168603410)
RESULT
'3410'
df['last_four'] = df['CC Number'].apply(last_four)
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)

Using .apply() with more complex functions

df['total_bill'].mean()
RESULT
19.78594262295082
def yelp(price):
    # Bucket a bill into Yelp-style price tiers
    if price < 10:
        return '$'
    elif price < 30:  # already known to be >= 10 here
        return '$$'
    else:
        return '$$$'
df['Expensive'] = df['total_bill'].apply(yelp)
# df
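
As a hedged aside, the same bucketing can also be expressed with pd.cut, which is a different technique from apply and is not used in the original lesson. The sketch below uses made-up bills chosen to hit both boundaries and checks that the two approaches agree.

```python
import numpy as np
import pandas as pd

def yelp(price):
    if price < 10:
        return '$'
    elif price < 30:
        return '$$'
    else:
        return '$$$'

# Hypothetical bills chosen to hit each bucket and both boundaries
bills = pd.Series([5.0, 10.0, 29.99, 30.0, 48.5])

via_apply = bills.apply(yelp)

# right=False closes each bin on the left: [-inf, 10), [10, 30), [30, inf)
via_cut = pd.cut(bills, bins=[-np.inf, 10, 30, np.inf],
                 right=False, labels=['$', '$$', '$$$'])

print(via_apply.tolist())
print(via_cut.astype(str).tolist())
```

pd.cut returns a categorical Series, so the sketch converts it to strings before comparing it with the apply result.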

apply with lambda

def simple(num):
    return num*2
lambda num: num*2
RESULT
<function __main__.<lambda>(num)>
df['total_bill'].apply(lambda bill: bill * 0.18)
RESULT (truncated)
0      3.0582
1      1.8612
2      3.7818
3      4.2624
4      4.4262

apply that uses multiple columns

Note that there are several ways to do this. This Stack Overflow thread walks through the options:

https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column
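
To make "several ways" concrete, here is a small sketch that runs the same two-argument function three ways on a made-up DataFrame (the rows are hypothetical, not from tips.csv) and shows that they produce the same labels.

```python
import numpy as np
import pandas as pd

def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"

# Hypothetical rows, not taken from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 40.0],
                    'tip': [3.0, 2.0, 12.0]})

# 1) Row-wise apply: each row arrives as a Series
by_apply = toy.apply(lambda row: quality(row['total_bill'], row['tip']), axis=1)

# 2) Plain Python loop over the two columns
by_zip = [quality(b, t) for b, t in zip(toy['total_bill'], toy['tip'])]

# 3) np.vectorize wraps quality so it accepts whole columns
by_vectorize = np.vectorize(quality)(toy['total_bill'], toy['tip'])

print(by_apply.tolist())
print(by_zip)
print(by_vectorize.tolist())
```

All three call quality once per row; they differ mainly in how much overhead surrounds each call.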

df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
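def quality(total_bill, tip):
    # Label a tip as generous when it exceeds 25% of the bill
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"
# axis=1 passes each row to the lambda as a Series
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda row: quality(row['total_bill'],row['tip']),axis=1)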
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
import numpy as np
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)

So, which one is faster?

import timeit 
  
# code snippet to be executed only once 
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill,tip):
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"
'''
  
# code snippet whose execution time is to be measured 
stmt_one = '''
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda row: quality(row['total_bill'],row['tip']),axis=1)
'''
 
stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''
timeit.timeit(setup = setup, 
                    stmt = stmt_one, 
                    number = 1000)
RESULT
5.0198852999999986
timeit.timeit(setup = setup, 
                    stmt = stmt_two, 
                    number = 1000)
RESULT
0.21840849999999534

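Wow! np.vectorize() is much faster here: 1,000 runs took about 0.22 seconds versus about 5.02 seconds for apply with axis=1, roughly a 23x speedup. Bear in mind that the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so the gain comes mostly from avoiding the per-row overhead of apply rather than from true vectorization. Keep np.vectorize() in mind for the future.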

Full Details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
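
np.vectorize still calls quality once per row. As a hedged aside that goes beyond the original lesson, a rule this simple can also be written as column arithmetic with np.where, so no Python function is called per row. The sketch below uses made-up rows and makes no timing claim.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not taken from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 40.0],
                    'tip': [3.0, 2.0, 12.0]})

# The division and comparison run on whole columns at once
is_generous = toy['tip'] / toy['total_bill'] > 0.25

# np.where picks 'Generous' where the mask is True and 'Other' elsewhere
toy['Tip Quality'] = np.where(is_generous, 'Generous', 'Other')
print(toy)
```

This only works when the logic can be expressed as column operations; for arbitrary Python functions, apply or np.vectorize remain the practical options.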

df.describe for statistical summaries

df.describe()
HTML (table output, truncated; columns shown: total_bill, tip, size, price_per_person, CC Number)
df.describe().transpose()
HTML (table output, truncated; columns shown: count, mean, std, min, 25%)

sort_values()

df.sort_values('tip')
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
# Helpful if you want to reorder after a sort
# https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
df.sort_values(['tip','size'])
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
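
When a list of columns is passed, ties in the first column are broken by the second. The sketch below shows this on made-up rows, along with the ascending parameter, which accepts one flag per sort column.

```python
import pandas as pd

# Hypothetical rows with a tie on 'tip'
toy = pd.DataFrame({'tip': [2.0, 1.0, 2.0],
                    'size': [4, 2, 3]})

# Sort by tip, then by size within equal tips (both ascending)
by_tip_then_size = toy.sort_values(['tip', 'size'])

# Largest tips first, smallest party size first within equal tips
mixed = toy.sort_values(['tip', 'size'], ascending=[False, True])

print(by_tip_then_size)
print(mixed)
```

sort_values returns a new DataFrame and keeps the original index labels, which is why the index appears shuffled after sorting.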

df.corr() for correlation checks

Wikipedia on Correlation

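# numeric_only=True skips the text columns; pandas 2.0+ raises an error without it
df.corr(numeric_only=True)
HTML (table output, truncated; columns shown: total_bill, tip, size, price_per_person, CC Number)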
df[['total_bill','tip']].corr()
HTML (table output, truncated)
            total_bill       tip
total_bill    1.000000  0.675734
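
A DataFrame that mixes numeric and text columns needs a little care with corr. The following sketch assumes pandas 1.5 or newer, where the numeric_only parameter is available, and uses made-up rows whose tip is exactly 10% of the bill.

```python
import pandas as pd

# Hypothetical rows: tip is exactly 10% of total_bill, plus one text column
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0],
                    'tip': [1.0, 2.0, 3.0, 4.0],
                    'day': ['Sat', 'Sun', 'Sat', 'Sun']})

# numeric_only=True leaves 'day' out of the correlation matrix
corr = toy.corr(numeric_only=True)
print(corr)
```

Because the two numeric columns are perfectly linearly related here, every entry in the resulting 2x2 matrix is 1 (up to floating-point rounding).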

idxmin and idxmax

df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
df['total_bill'].max()
RESULT
50.81
df['total_bill'].idxmax()
RESULT
170
df['total_bill'].idxmin()
RESULT
67
# idxmin() returns an index label, so look the row up with .loc
df.loc[67]
RESULT (truncated)
total_bill                      3.07
tip                                1
sex                           Female
smoker                           Yes
day                              Sat
df.loc[170]
RESULT (truncated)
total_bill                     50.81
tip                               10
sex                             Male
smoker                           Yes
day                              Sat
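
idxmax and idxmin return index labels, not integer positions. The two coincide for tips.csv only because it has the default 0 to 243 index. A small sketch with made-up string labels makes the difference visible.

```python
import pandas as pd

# Hypothetical bills with string labels instead of the default integer index
bills = pd.Series([16.99, 50.81, 3.07], index=['a', 'b', 'c'])

largest_label = bills.idxmax()   # label of the largest value
smallest_label = bills.idxmin()  # label of the smallest value

# Labels go with .loc; .iloc expects integer positions
print(largest_label, bills.loc[largest_label])
print(smallest_label, bills.loc[smallest_label])
```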

value_counts

A handy method for quickly counting how many rows fall into each category. It is most useful on categorical columns.

df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
df['sex'].value_counts()
RESULT
Male      157
Female     87
Name: sex, dtype: int64
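
Counts are often easier to read as proportions. As a short sketch on made-up values, passing normalize=True returns each category's share of the total instead of its raw count.

```python
import pandas as pd

# Hypothetical day labels
days = pd.Series(['Sat', 'Sun', 'Sat', 'Thur', 'Sat', 'Sun'])

counts = days.value_counts()                # rows per category, largest first
shares = days.value_counts(normalize=True)  # same categories as fractions of the total

print(counts)
print(shares)
```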

replace

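Quickly replace one value with another. replace returns a new Series and leaves the original untouched, so assign the result back to the column to keep the change.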

df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
df['Tip Quality'].replace(to_replace='Other',value='Ok')
RESULT (truncated)
0            Ok
1            Ok
2            Ok
3            Ok
4            Ok
df['Tip Quality'] = df['Tip Quality'].replace(to_replace='Other',value='Ok')
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
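
replace also accepts a dictionary, which is convenient when several values need recoding at once. A minimal sketch on made-up labels (the label 'Great' is invented for illustration):

```python
import pandas as pd

# Hypothetical labels
quality = pd.Series(['Other', 'Generous', 'Other'])

# replace returns a new Series; quality itself is left unchanged
renamed = quality.replace(to_replace='Other', value='Ok')

# A dict maps several old values to new ones in one call
recoded = quality.replace({'Other': 'Ok', 'Generous': 'Great'})

print(renamed.tolist())
print(recoded.tolist())
```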

unique

df['size'].unique()
RESULT
array([2, 3, 4, 1, 6, 5], dtype=int64)
df['size'].nunique()
RESULT
6
df['time'].unique()
RESULT
array(['Dinner', 'Lunch'], dtype=object)

map

my_map = {'Dinner':'D','Lunch':'L'}
df['time'].map(my_map)
RESULT (truncated)
0      D
1      D
2      D
3      D
4      D
df.head()
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
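
Two details are easy to miss with map: it returns a new Series rather than changing df, and any value without a key in the dictionary becomes NaN. The sketch below contrasts that with replace, using a made-up extra value 'Brunch' that is not in the mapping.

```python
import pandas as pd

# 'Brunch' is a made-up value with no entry in my_map
times = pd.Series(['Dinner', 'Lunch', 'Brunch'])
my_map = {'Dinner': 'D', 'Lunch': 'L'}

mapped = times.map(my_map)        # 'Brunch' has no key, so it becomes NaN
replaced = times.replace(my_map)  # 'Brunch' has no key, so it is left as-is

print(mapped)
print(replaced)
```

Use map when the dictionary should cover every value and a gap ought to show up as missing data; use replace when only some values need changing.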

Duplicates

.duplicated() and .drop_duplicates()

# Returns True for every row that repeats an earlier row; the first occurrence is marked False
df.duplicated()
RESULT (truncated)
0      False
1      False
2      False
3      False
4      False
simple_df = pd.DataFrame([1, 2, 2], index=['a', 'b', 'c'])
simple_df
HTML (table output, truncated)
   0
a  1
b  2
simple_df.duplicated()
RESULT
a    False
b    False
c     True
dtype: bool
simple_df.drop_duplicates()
HTML
   0
a  1
b  2
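
Which copy counts as the duplicate is controlled by the keep parameter. As a short sketch on the same three-row frame, the default keep='first' flags later copies, keep='last' flags earlier ones, and keep=False flags every copy.

```python
import pandas as pd

simple_df = pd.DataFrame([1, 2, 2], index=['a', 'b', 'c'])

first = simple_df.duplicated()             # default keep='first': flags 'c' only
last = simple_df.duplicated(keep='last')   # flags 'b' only
both = simple_df.duplicated(keep=False)    # flags 'b' and 'c'

# drop_duplicates takes the same keep parameter
kept_last = simple_df.drop_duplicates(keep='last')

print(first.tolist(), last.tolist(), both.tolist())
print(kept_last)
```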

between

left: A scalar value that defines the left boundary.
right: A scalar value that defines the right boundary.
inclusive: Which boundaries to include: 'both' (the default), 'neither', 'left', or 'right'. Older pandas versions took a Boolean here (True for both boundaries, False for neither); that form was deprecated in pandas 1.3 and removed in pandas 2.0.

df['total_bill'].between(10,20,inclusive='both')
RESULT (truncated)
0       True
1       True
2      False
3      False
4      False
df[df['total_bill'].between(10,20,inclusive='both')]
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
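
The effect of inclusive is easiest to see on values that sit exactly on the boundaries. This sketch assumes pandas 1.3 or newer, where inclusive takes a string, and uses made-up bills.

```python
import pandas as pd

# Hypothetical bills: two sit exactly on the boundaries
bills = pd.Series([10.0, 15.0, 20.0])

both = bills.between(10, 20)                          # default inclusive='both'
neither = bills.between(10, 20, inclusive='neither')  # excludes 10 and 20
left = bills.between(10, 20, inclusive='left')        # includes 10, excludes 20

print(both.tolist())
print(neither.tolist())
print(left.tolist())
```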

sample

df.sample(5)
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
df.sample(frac=0.1)
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
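
sample draws different rows on every call unless a seed is fixed. A minimal sketch on a made-up 20-row frame, where the seed value 42 is arbitrary:

```python
import pandas as pd

# Hypothetical 20-row frame
toy = pd.DataFrame({'tip': range(20)})

# The same random_state gives the same rows on every run
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

# frac is a fraction of the rows: 10% of 20 rows is 2 rows
tenth = toy.sample(frac=0.1, random_state=42)

print(first_draw)
print(len(tenth))
```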

nlargest and nsmallest

df.nlargest(10,'tip')
HTML (table output, truncated; columns shown: total_bill, tip, sex, smoker, day)
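
nsmallest works the same way in the other direction, and nlargest gives the same rows as sorting in descending order and taking the head. A short sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows, not taken from tips.csv
toy = pd.DataFrame({'total_bill': [16.99, 50.81, 3.07, 24.59],
                    'tip': [1.01, 10.0, 1.0, 3.61]})

top_two = toy.nlargest(2, 'tip')      # two largest tips, largest first
bottom_two = toy.nsmallest(2, 'tip')  # two smallest tips, smallest first

# Equivalent result via an explicit sort
via_sort = toy.sort_values('tip', ascending=False).head(2)

print(top_two)
print(bottom_two)
```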

© 2026 Driptanil Datta. All rights reserved.