Notebook converted from Jupyter for blog publishing.
04-Missing-Data
Missing Data
Make sure to review the video for a full discussion on the strategies of dealing with missing data.
What Null/NA/nan objects look like:
Source: https://github.com/pandas-dev/pandas/issues/28095
A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type.
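To make the quoted description concrete, here is a minimal sketch of pd.NA appearing in the nullable dtypes it mentions. The variable names and values are made up for illustration; the dtype strings "Int64", "boolean" and "string" are the pandas aliases for those nullable types.

```python
import pandas as pd

# Nullable extension arrays mark missing entries with pd.NA.
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None, False], dtype="boolean")
strs = pd.array(["a", None, "c"], dtype="string")

# Without a nullable dtype, the same integers are upcast to float64 and use NaN.
floats = pd.Series([1, None, 3])

print(ints.dtype, ints[1] is pd.NA)
print(bools.dtype, bools[1] is pd.NA)
print(strs.dtype, pd.isna(strs).tolist())
print(floats.dtype)
```

The contrast with the last line is the practical point: with "Int64" the column stays integer-typed even though a value is missing.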
import numpy as np
import pandas as pd

np.nan
nan

pd.NA
<NA>

pd.NaT
NaT

Note! Typical comparisons should be avoided with missing values.
- https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
- https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true
The reasoning is that, since we don't know these values, we can't know whether they are equal to each other.
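Since equality cannot be relied on to detect missing values, a dedicated check such as pd.isna is the usual alternative. The following is a small sketch with a made-up list of values, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A mix of missing markers and ordinary values (illustrative only).
values = [np.nan, None, pd.NA, pd.NaT, 0, ""]

# pd.isna answers "is this missing?" directly, without using ==.
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# pd.NA == pd.NA evaluates to pd.NA, which refuses to be used as a plain boolean.
try:
    bool(pd.NA == pd.NA)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True
print(na_is_ambiguous)
```

Note that 0 and the empty string are ordinary values, so they are not reported as missing.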
np.nan == np.nan
False

np.nan in [np.nan]
True

np.nan is np.nan
True

pd.NA == pd.NA
<NA>

Data
People were asked to score their opinions of actors on a 1-10 scale before and after watching one of their movies. However, some data is missing.
df = pd.read_csv('movie_scores.csv')
df
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

Checking and Selecting for Null Values
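The cells in this section depend on movie_scores.csv. For a version that runs on its own, here is a minimal sketch using a small made-up DataFrame. The names and values are hypothetical and are not the contents of the CSV; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with some missing entries.
people = pd.DataFrame({
    "first_name": ["Ann", np.nan, "Bob", "Cy"],
    "sex": ["f", np.nan, "m", "m"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0],
})

# Boolean frame: True wherever a value is missing.
null_mask = people.isnull()
print(null_mask.sum())  # missing count per column

# Keep only rows where first_name is present.
named = people[people["first_name"].notnull()]

# Combine conditions with &: score missing but sex recorded.
unscored = people[people["pre_movie_score"].isnull() & people["sex"].notnull()]
print(unscored)
```

Each condition produces a boolean Series, so the usual element-wise operators (&, |, ~) can combine them before indexing.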
df
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df.isnull()
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df.notnull()
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df['first_name']
0 Tom
1 NaN
2 Hugh
3 Oprah
4 Emma

df[df['first_name'].notnull()]
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

Drop Data
df
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

help(df.dropna)
Help on method dropna in module pandas.core.frame:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.
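The parameters in that signature are easier to compare side by side. Below is a sketch on a made-up DataFrame (hypothetical names and values, not movie_scores.csv) showing how `how`, `thresh`, `subset` and `axis` change what gets dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is entirely missing,
# row 2 has two values, row 3 has one.
scores = pd.DataFrame({
    "name": ["Ann", np.nan, "Bob", np.nan],
    "age": [30.0, np.nan, 41.0, np.nan],
    "score": [8.0, np.nan, np.nan, 6.0],
})

complete_rows = scores.dropna()                   # drop rows with any missing value
not_empty_rows = scores.dropna(how="all")         # drop only rows where everything is missing
two_or_more = scores.dropna(thresh=2)             # keep rows with at least 2 non-missing values
score_present = scores.dropna(subset=["score"])   # only look at the score column
complete_cols = scores.dropna(axis=1)             # drop columns with any missing value

print(complete_rows.index.tolist())
print(not_empty_rows.index.tolist())
print(two_or_more.index.tolist())
print(score_present.index.tolist())
print(complete_cols.shape)
```

In this sketch every column contains at least one missing value, so the axis=1 call leaves a frame with rows but no columns.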
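df.dropna()
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df.dropna(thresh=1)
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df.dropna(axis=1)
[Table output; no columns shown, index 0 to 4]

df.dropna(thresh=4, axis=1)
[Table output; columns shown: first_name, last_name, age, sex]

Fill Data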
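As a self-contained sketch of the filling patterns used in this section, the following uses a small made-up DataFrame (hypothetical values, not movie_scores.csv). It fills a text column with a placeholder and a numeric column with its mean, then shows the same result with a per-column dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
people = pd.DataFrame({
    "first_name": ["Ann", np.nan, "Bob"],
    "pre_movie_score": [8.0, np.nan, 6.0],
})

filled = people.copy()

# Text column: replace missing names with a placeholder string.
filled["first_name"] = filled["first_name"].fillna("Empty")

# Numeric column: mean() skips NaN by default, so this is (8 + 6) / 2.
mean_score = people["pre_movie_score"].mean()
filled["pre_movie_score"] = filled["pre_movie_score"].fillna(mean_score)

# The same fill in one call, using a dict of column -> fill value.
filled_by_dict = people.fillna({"first_name": "Empty", "pre_movie_score": mean_score})

print(filled)
```

Assigning the result back (or building a new frame) matters because fillna returns a new object rather than modifying the original.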
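df
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df.fillna("NEW VALUE!")
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df['first_name'].fillna("Empty")
0 Tom
1 Empty
2 Hugh
3 Oprah
4 Emma

df['first_name'] = df['first_name'].fillna("Empty")
df
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

df['pre_movie_score'].mean()
7.0

df['pre_movie_score'].fillna(df['pre_movie_score'].mean())
0 8.0
1 7.0
2 7.0
3 6.0
4 7.0

# numeric_only=True limits the mean to numeric columns; recent pandas versions raise a TypeError on text columns otherwise
df.fillna(df.mean(numeric_only=True))
[Table output; columns shown: first_name, last_name, age, sex, pre_movie_score]

Filling with Interpolation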
Be careful with this technique: make sure you understand whether it is a valid choice for your data. Note also that several methods are available; the default is linear.
Full Docs on this Method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
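To illustrate why the choice of method matters, here is a minimal sketch comparing the default linear method with method="index" on a made-up Series whose index is unevenly spaced. The data is hypothetical and chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at positions 0, 1 and 4; the middle one is missing.
readings = pd.Series([0.0, np.nan, 40.0], index=[0, 1, 4])

# Default (linear): ignores the index and treats values as equally spaced,
# so the gap is filled with the midpoint of its neighbours.
by_position = readings.interpolate()

# method="index": uses the numeric index values, so position 1 sits a quarter
# of the way from 0 to 4.
by_index = readings.interpolate(method="index")

print(by_position.loc[1])
print(by_index.loc[1])
```

The two results differ (20.0 versus 10.0), which is one reason to check that the chosen method matches how the data was actually sampled.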
airline_tix = {'first': 100, 'business': np.nan, 'economy-plus': 50, 'economy': 30}
ser = pd.Series(airline_tix)
ser
first 100.0
business NaN
economy-plus 50.0
economy 30.0
dtype: float64

ser.interpolate()
first 100.0
business 75.0
economy-plus 50.0
economy 30.0
dtype: float64

ser.interpolate(method='spline')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-163-106f2287918c> in <module>
----> 1 ser.interpolate(method='spline')
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.

df = pd.DataFrame(ser, columns=['Price'])
df
          Price
first     100.0
business    NaN
[remaining rows not preserved]

df.interpolate()
          Price
first     100.0
business   75.0
[remaining rows not preserved]

df = df.reset_index()
df
   index  Price
0  first  100.0
[remaining rows not preserved]

df.interpolate(method='spline', order=2)
   index       Price
0  first  100.000000
[remaining rows not preserved]