Pandas
04 Missing Data
Data Science
May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.


Driptanil Datta, Software Developer

Missing Data

Make sure to review the video for a full discussion of strategies for dealing with missing data.


What Null/NA/nan objects look like:

Source: https://github.com/pandas-dev/pandas/issues/28095

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan for float data, np.nan or None for object-dtype data, and pd.NaT for datetime-like data. The goal of pd.NA is to provide a "missing" indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type.

import numpy as np
import pandas as pd
np.nan
RESULT
nan
pd.NA
RESULT
<NA>
pd.NaT
RESULT
NaT
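
To see which marker shows up for which dtype, here is a minimal sketch. The three Series hold invented sample values (they are not from the lesson's dataset), and "Int64" is the name of pandas' nullable integer dtype.

```python
import numpy as np
import pandas as pd

# Invented sample data: one missing entry per Series.
floats = pd.Series([1.0, np.nan])                        # float data uses np.nan
ints = pd.Series([1, None], dtype="Int64")               # nullable integer uses pd.NA
dates = pd.Series(pd.to_datetime(["2026-05-01", None]))  # datetime data uses pd.NaT

for name, s in [("floats", floats), ("ints", ints), ("dates", dates)]:
    # isna() treats all three markers the same way
    print(name, s.dtype, int(s.isna().sum()))
```

Whatever the marker is, isna() reports exactly one missing entry in each Series, which is why it is usually the safer way to check.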


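Note! Avoid ordinary comparisons with missing values

A missing value is unknown, so there is no way to know whether two missing values are equal to each other. That is why np.nan == np.nan is False and pd.NA == pd.NA returns <NA>. The "in" and "is" checks below return True only because they test object identity (np.nan is one single object), not equality of values.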

np.nan == np.nan
RESULT
False
np.nan in [np.nan]
RESULT
True
np.nan is np.nan
RESULT
True
pd.NA == pd.NA
RESULT
<NA>
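
Since == cannot find missing values, the usual alternative is pd.isna() or the .isna() method. A small sketch with an invented Series of scores:

```python
import numpy as np
import pandas as pd

scores = pd.Series([8.0, np.nan, 6.0])  # invented sample data

equality_mask = scores == np.nan  # False everywhere, even at the missing position
isna_mask = scores.isna()         # True exactly where the value is missing

print(equality_mask.tolist())
print(isna_mask.tolist())
print(pd.isna(np.nan), pd.isna(pd.NA), pd.isna(pd.NaT))
```

The equality mask never flags the missing row, while the isna() mask does.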

Data

People were asked to score their opinions of actors on a 1-10 scale before and after watching one of their movies. However, some data is missing.

df = pd.read_csv('movie_scores.csv')
df
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)

Checking and Selecting for Null Values

df
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
df.isnull()  # True wherever a value is missing
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
df.notnull()  # True wherever a value is present
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
df['first_name']
RESULT
0      Tom
1      NaN
2     Hugh
3    Oprah
4     Emma
(output truncated)
df[df['first_name'].notnull()]  # keep only rows that have a first name
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
df[df['pre_movie_score'].isnull() & df['sex'].notnull()]  # rows with no pre-movie score but a recorded sex
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
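
The same checks can be tried without the CSV file. The sketch below uses a tiny invented frame (a stand-in, not the real movie_scores.csv) to count missing values per column and to combine two boolean masks.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the lesson's data (not the real movie_scores.csv).
toy = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "sex": ["m", np.nan, "m"],
    "pre_movie_score": [8.0, np.nan, np.nan],
})

# Summing a boolean frame counts the True values in each column.
missing_per_column = toy.isnull().sum()

# Combine masks with & to select rows missing a score but with a known sex.
no_score_known_sex = toy[toy["pre_movie_score"].isnull() & toy["sex"].notnull()]

print(missing_per_column)
print(no_score_known_sex)
```

Only the last row qualifies here: its score is missing but its sex is recorded.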

Drop Data

df
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
help(df.dropna)
STDOUT
Help on method dropna in module pandas.core.frame:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
(output truncated)
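df.dropna()  # drop every row that contains at least one missing value
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)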
df.dropna(thresh=1)  # keep rows that have at least 1 non-missing value
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
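df.dropna(axis=1)  # drop every column that contains at least one missing value
HTML
(table output truncated; only the row index 0-4 is visible, with no column names)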
df.dropna(thresh=4, axis=1)  # keep only columns with at least 4 non-missing values
HTML
(table output truncated; visible columns: first_name, last_name, age, sex)
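
To make the dropna options easier to compare, here is a small sketch on invented data (again a stand-in for the lesson's file). It runs the how, thresh, subset and axis parameters from the help output above and records which rows or columns survive.

```python
import numpy as np
import pandas as pd

# Invented sample data; non-missing values per row: 3, 0, 1, 2.
toy = pd.DataFrame({
    "first_name": ["Tom", np.nan, np.nan, "Oprah"],
    "age": [58.0, np.nan, 51.0, np.nan],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0],
})

complete_rows = toy.dropna()                # rows with no missing values at all
not_all_missing = toy.dropna(how="all")     # drop only rows where everything is missing
at_least_two = toy.dropna(thresh=2)         # keep rows with at least 2 non-missing values
has_age = toy.dropna(subset=["age"])        # only look at the age column
complete_cols = toy.dropna(axis=1)          # drop columns that contain any missing value

for label, result in [
    ("default", complete_rows),
    ("how='all'", not_all_missing),
    ("thresh=2", at_least_two),
    ("subset=['age']", has_age),
]:
    print(label, result.index.tolist())
print("axis=1", complete_cols.columns.tolist())
```

Every column in this toy frame has a gap, so axis=1 with the default settings removes all of them. That is worth keeping in mind before dropping columns on real data.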

Fill Data

df
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
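df.fillna("NEW VALUE!")  # replace every missing value, in every column, with the same value
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)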
df['first_name'].fillna("Empty")
RESULT
0      Tom
1    Empty
2     Hugh
3    Oprah
4     Emma
(output truncated)
df['first_name'] = df['first_name'].fillna("Empty")
df
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)
df['pre_movie_score'].mean()
RESULT
7.0
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())
RESULT
0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
(output truncated)
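df.fillna(df.mean(numeric_only=True))  # fill numeric columns with their own mean; numeric_only=True skips the text columns, which recent pandas versions refuse to average
HTML
(table output truncated; visible columns: first_name, last_name, age, sex, pre_movie_score)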
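
Here is a hedged sketch of the two fill patterns above on an invented two-column frame: filling numeric columns with their mean, and passing a dict to fillna so each column gets its own replacement value.

```python
import numpy as np
import pandas as pd

# Invented sample data (not the real movie_scores.csv).
toy = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "pre_movie_score": [8.0, np.nan, 6.0],
})

# Means of numeric columns only; text columns are skipped.
numeric_means = toy.mean(numeric_only=True)

# Filling with a Series matches on column names, so first_name stays missing.
mean_filled = toy.fillna(numeric_means)

# A dict gives each column its own fill value.
fully_filled = toy.fillna({
    "first_name": "Empty",
    "pre_movie_score": numeric_means["pre_movie_score"],
})

print(mean_filled)
print(fully_filled)
```

The dict form is handy when a single fill value would not make sense for every column, for example a string placeholder for names and a mean for scores.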

Filling with Interpolation

Be careful with this technique: make sure you really understand whether interpolation is a valid choice for your data. Also note that several interpolation methods are available; the default is linear.

Full Docs on this Method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}
ser = pd.Series(airline_tix)
ser
RESULT
first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64
ser.interpolate()
RESULT
first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
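
The 75.0 above is simply the midpoint between 100 and 50. With more than one gap in a row, the default linear method spaces the filled values evenly between the known neighbours. A minimal sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# Invented sample data: two consecutive gaps between 10 and 40.
gaps = pd.Series([10.0, np.nan, np.nan, 40.0])

# Default method is linear: equal steps from 10 up to 40.
filled = gaps.interpolate()

print(filled.tolist())
```

Each step is (40 - 10) / 3 = 10, so the gaps become 20 and 30.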
ser.interpolate(method='spline')
ERROR
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
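
The error message hints at why the index matters: the default linear method treats values as equally spaced and ignores the index, while other methods use the index values themselves. A minimal sketch with invented data, using method="index" as the illustration (chosen here as an assumption that it behaves as documented and, unlike spline, does not need SciPy):

```python
import numpy as np
import pandas as pd

# Invented sample data on an unevenly spaced numeric index.
spaced = pd.Series([0.0, np.nan, 40.0], index=[0, 1, 4])

by_position = spaced.interpolate()             # ignores the index: halfway between 0 and 40
by_index = spaced.interpolate(method="index")  # index 1 is a quarter of the way from 0 to 4

print(by_position.tolist())
print(by_index.tolist())
```

With string labels such as 'first' or 'business' there are no distances to work with, which is why index-aware methods ask for a numeric or datetime index.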
df = pd.DataFrame(ser,columns=['Price'])
df
HTML
(table output truncated; column: Price; visible rows: first 100.0, business NaN)
df.interpolate()
HTML
(table output truncated; column: Price; visible rows: first 100.0, business 75.0)
df = df.reset_index()
df
HTML
(table output truncated; columns: index, Price; visible row 0: first, 100.0)
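df.interpolate(method='spline', order=2)  # works now that reset_index() supplies a numeric index; the spline method also requires SciPy
HTML
(table output truncated; columns: index, Price; visible row 0: first, 100.000000)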