
Data Science

May 2026 × Notebook lesson

Notebook converted from Jupyter for blog publishing.

04-Missing-Data

Driptanil Datta, Software Developer

Missing Data

Make sure to review the video for a full discussion of strategies for dealing with missing data.

What Null/NA/nan objects look like:

Source: https://github.com/pandas-dev/pandas/issues/28095

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type.

import numpy as np
import pandas as pd

np.nan

RESULT

nan

pd.NA

RESULT

<NA>

pd.NaT

RESULT

NaT
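
As a rough illustration of how these three markers map onto column dtypes, here is a minimal sketch. The Series names and values are invented for the example and are not part of the lesson's data.

```python
import numpy as np
import pandas as pd

# Default float column: missing entries are stored as np.nan.
floats = pd.Series([1.0, None, 3.0])

# Nullable integer dtype (note the capital "I"): missing entries are pd.NA,
# and the remaining values stay integers instead of being cast to float.
ints = pd.Series([1, None, 3], dtype="Int64")

# Datetime column: missing entries are pd.NaT.
dates = pd.to_datetime(pd.Series(["2020-01-01", None]))

print(floats.dtype, ints.dtype, dates.dtype)
print(ints[1] is pd.NA)     # the nullable column holds the pd.NA singleton
print(dates[1] is pd.NaT)   # the datetime column holds pd.NaT

# isna() treats all three markers the same way.
print(floats.isna().tolist(), ints.isna().tolist(), dates.isna().tolist())
```

Whichever marker a column uses, isna() and notna() give a uniform way to find it.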

Note! Avoid ordinary comparisons with missing values.

The reasoning is that a missing value is unknown, so there is no way to tell whether two missing values are equal to each other.

np.nan == np.nan

RESULT

False

np.nan in [np.nan]

RESULT

True

np.nan is np.nan

RESULT

True

pd.NA == pd.NA

RESULT

<NA>
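
The `in` and `is` checks above return True only because Python tests object identity before equality, and `np.nan` is a single shared object. That behaviour is easy to break, so a safer pattern is to ask pandas directly. The following is a small sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

value = np.nan

# Equality is the wrong test: NaN never compares equal, even to itself.
print(value == np.nan)

# pd.isna() is a reliable check for scalars of any missing type ...
print(pd.isna(value), pd.isna(pd.NA), pd.isna(pd.NaT), pd.isna(None))

# ... and it works element-wise on a Series.
scores = pd.Series([8.0, np.nan, 6.0])
print(pd.isna(scores).tolist())

# pd.NA propagates through comparisons, so the result has no truth value
# and cannot be used directly in an `if` statement.
try:
    bool(pd.NA == 1)
except TypeError as exc:
    print("TypeError:", exc)
```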

Data

People were asked to score their opinions of actors on a 1-10 scale before and after watching one of their movies. However, some data is missing.

df = pd.read_csv('movie_scores.csv')

df

HTML

first_name
last_name
age
sex
pre_movie_score

Checking and Selecting for Null Values

df

HTML

first_name
last_name
age
sex
pre_movie_score

df.isnull()

HTML

first_name
last_name
age
sex
pre_movie_score

df.notnull()

HTML

first_name
last_name
age
sex
pre_movie_score

df['first_name']

RESULT

0      Tom
1      NaN
2     Hugh
3    Oprah
4     Emma

df[df['first_name'].notnull()]

HTML

first_name
last_name
age
sex
pre_movie_score

df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]

HTML

first_name
last_name
age
sex
pre_movie_score
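
A common follow-up to these checks is counting how many values are missing in each column. The sketch below uses a small made-up frame as a stand-in, since the real contents of movie_scores.csv are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_scores.csv; the values are invented.
people = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "sex": ["m", np.nan, "m"],
    "pre_movie_score": [8.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; summing it counts missing values per column.
missing_per_column = people.isnull().sum()
print(missing_per_column)

# Boolean masks combine with & (and) or | (or). Comparisons such as
# people["pre_movie_score"] > 5 would need their own parentheses.
score_missing_but_sex_known = people[
    people["pre_movie_score"].isnull() & people["sex"].notnull()
]
print(score_missing_but_sex_known)
```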

Drop Data

df

HTML

first_name
last_name
age
sex
pre_movie_score

help(df.dropna)

STDOUT

Help on method dropna in module pandas.core.frame:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.

df.dropna()

HTML

first_name
last_name
age
sex
pre_movie_score

df.dropna(thresh=1)

HTML

first_name
last_name
age
sex
pre_movie_score

df.dropna(axis=1)

HTML

df.dropna(thresh=4,axis=1)

HTML

first_name
last_name
age
sex
0
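
The dropna parameters listed in the help output are easier to compare side by side. This is a minimal sketch on invented data, not the lesson's file. Note that thresh is the minimum number of non-missing values a row (or column) needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the contents of movie_scores.csv.
people = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Emma"],
    "age": [50.0, np.nan, 51.0, 30.0],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0],
})

# Default (how='any'): drop rows containing at least one missing value.
complete_rows = people.dropna()

# how='all': drop only rows where every value is missing.
not_empty_rows = people.dropna(how="all")

# thresh=2: keep rows that have at least 2 non-missing values.
at_least_two = people.dropna(thresh=2)

# subset: only the listed columns decide whether a row is dropped.
has_score = people.dropna(subset=["pre_movie_score"])

# axis=1: apply the rule to columns instead of rows.
full_columns = people.dropna(axis=1)

for name, result in [("any", complete_rows), ("all", not_empty_rows),
                     ("thresh=2", at_least_two), ("subset", has_score),
                     ("axis=1", full_columns)]:
    print(name, result.shape)
```

Every call returns a new frame; the original is left untouched unless the result is assigned back.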

Fill Data

df

HTML

first_name
last_name
age
sex
pre_movie_score

df.fillna("NEW VALUE!")

HTML

first_name
last_name
age
sex
pre_movie_score

df['first_name'].fillna("Empty")

RESULT

0      Tom
1    Empty
2     Hugh
3    Oprah
4     Emma

df['first_name'] = df['first_name'].fillna("Empty")

df

HTML

first_name
last_name
age
sex
pre_movie_score

df['pre_movie_score'].mean()

RESULT

7.0

df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

RESULT

df.fillna(df.mean(numeric_only=True))

HTML

first_name
last_name
age
sex
pre_movie_score
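
When different columns need different fill values, fillna also accepts a dict keyed by column name. The sketch below assumes a tiny invented frame. It also shows that restricting the mean to numeric columns leaves text columns unfilled.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "pre_movie_score": [8.0, np.nan, 6.0],
})

# A dict maps column name -> fill value, so each column gets its own rule.
filled = people.fillna({
    "first_name": "Empty",
    "pre_movie_score": people["pre_movie_score"].mean(),  # mean() skips NaN
})
print(filled)

# Filling with column means only touches numeric columns;
# the missing first_name stays missing.
numeric_filled = people.fillna(people.mean(numeric_only=True))
print(numeric_filled)
```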

Filling with Interpolation

Be careful with this technique: make sure you understand whether interpolation is a valid choice for your data before using it. Several methods are available, and the default is linear.

Full Docs on this Method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

ser = pd.Series(airline_tix)

ser

RESULT

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

ser.interpolate()

RESULT

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
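
The filled value of 75.0 is simply the midpoint of its neighbours, 100 and 50, because the default linear method treats rows as equally spaced and ignores the index labels. To see the difference this makes, here is a small sketch with an invented Series on an unevenly spaced numeric index, comparing it with method='index'.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at unevenly spaced positions 0, 1 and 10.
readings = pd.Series([0.0, np.nan, 100.0], index=[0, 1, 10])

# method='linear' ignores the index and treats rows as equally spaced,
# so the gap is filled with the midpoint of its neighbours.
by_position = readings.interpolate(method="linear")

# method='index' uses the numeric index values as the x-coordinates,
# so the gap at position 1 sits one tenth of the way from 0 to 100.
by_index = readings.interpolate(method="index")

print(by_position.loc[1], by_index.loc[1])
```

Which result is appropriate depends on whether the index carries real spacing information for your data.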

ser.interpolate(method='spline')

ERROR

ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-163-106f2287918c> in <module>
----> 1 ser.interpolate(method='spline')

df = pd.DataFrame(ser,columns=['Price'])

df

HTML

Price
first
100.0
business
NaN

df.interpolate()

HTML

Price
first
100.0
business
75.0

df = df.reset_index()

df

HTML

index
Price
0
first
100.0

df.interpolate(method='spline',order=2)

HTML

index
Price
0
first
100.000000
