🚀
Feature Engineering
00 Dealing with Outliers
++++
Data Science
May 2026×Notebook lesson

Notebook converted from Jupyter for blog publishing.

00-Dealing-with-Outliers

Driptanil Datta
Driptanil DattaSoftware Developer

Dealing with Outliers

In statistics, an outlier is a data point that differs significantly from other observations.An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Remember that even if a data point is an outlier, its still a data point! Carefully consider your data, its sources, and your goals whenver deciding to remove an outlier. Each case is different!

Lecture Goals

  • Understand different mathmatical definitions of outliers
  • Use Python tools to recognize outliers and remove them

Useful Links


Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Generating Data

# Choose a mean,standard deviation, and number of samples
 
def create_ages(mu=50,sigma=13,num_samples=100,seed=42):
 
    # Set a random seed in the same cell as the random call to get the same values as us
    # We set seed to 42 (42 is an arbitrary choice from Hitchhiker's Guide to the Galaxy)
    np.random.seed(seed)
 
    sample_ages = np.random.normal(loc=mu,scale=sigma,size=num_samples)
    sample_ages = np.round(sample_ages,decimals=0)
    
    return sample_ages
sample = create_ages()
sample
RESULT
MORE
array([56., 48., 58., 70., 47., 47., 71., 60., 44., 57., 44., 44., 53.,
       25., 28., 43., 37., 54., 38., 32., 69., 47., 51., 31., 43., 51.,
       35., 55., 42., 46., 42., 74., 50., 36., 61., 34., 53., 25., 33.,
       53., 60., 52., 48., 46., 31., 41., 44., 64., 54., 27., 54., 45.,
       41., 58., 63., 62., 39., 46., 54., 63., 44., 48., 36., 34., 61.,

Visualize and Describe the Data

sns.distplot(sample,bins=10,kde=False)
RESULT
<AxesSubplot:>
PLOT
Output 1
sns.boxplot(sample)
RESULT
<AxesSubplot:>
PLOT
Output 2
ser = pd.Series(sample)
ser.describe()
RESULT
MORE
count    100.00000
mean      48.66000
std       11.82039
min       16.00000
25%       42.00000

Trimming or Fixing Based Off Domain Knowledge

If we know we're dealing with a dataset pertaining to voting age (18 years old in the USA), then it makes sense to either drop anything less than that OR fix values lower than 18 and push them up to 18.

ser[ser > 18]
RESULT
MORE
0     56.0
1     48.0
2     58.0
3     70.0
4     47.0
# It dropped one person
len(ser[ser > 18])
RESULT
99
def fix_values(age):
    
    if age < 18:
        return 18
    else:
        return age
# "Fixes" one person's age
ser.apply(fix_values)
RESULT
MORE
0     56.0
1     48.0
2     58.0
3     70.0
4     47.0
len(ser.apply(fix_values))
RESULT
100

There are many ways to identify and remove outliers:

Ames Data Set

Let's explore any extreme outliers in our Ames Housing Data Set

df = pd.read_csv("../DATA/Ames_Housing_Data.csv")
df.head()
HTML
MORE
PID
MS SubClass
MS Zoning
Lot Frontage
Lot Area
sns.heatmap(df.corr())
RESULT
<AxesSubplot:>
PLOT
Output 3
df.corr()['SalePrice'].sort_values()
RESULT
MORE
PID               -0.246521
Enclosed Porch    -0.128787
Kitchen AbvGr     -0.119814
Overall Cond      -0.101697
MS SubClass       -0.085092
sns.distplot(df["SalePrice"])
RESULT
<AxesSubplot:xlabel='SalePrice'>
PLOT
Output 4
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)
RESULT
<AxesSubplot:xlabel='Overall Qual', ylabel='SalePrice'>
PLOT
Output 5
df[(df['Overall Qual']>8) & (df['SalePrice']<200000)]
HTML
MORE
PID
MS SubClass
MS Zoning
Lot Frontage
Lot Area
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)
RESULT
<AxesSubplot:xlabel='Gr Liv Area', ylabel='SalePrice'>
PLOT
Output 6
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)]
HTML
MORE
PID
MS SubClass
MS Zoning
Lot Frontage
Lot Area
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index
RESULT
Int64Index([1498, 2180, 2181], dtype='int64')
ind_drop = df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index
df = df.drop(ind_drop,axis=0)
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)
RESULT
<AxesSubplot:xlabel='Gr Liv Area', ylabel='SalePrice'>
PLOT
Output 7
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)
RESULT
<AxesSubplot:xlabel='Overall Qual', ylabel='SalePrice'>
PLOT
Output 8
df.to_csv("../DATA/Ames_outliers_removed.csv",index=False)

Drip

Driptanil Datta

Software Developer

Building full-stack systems, one commit at a time. This blog is a centralized learning archive for developers.

Legal Notes
Disclaimer

The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP

Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.

System Operational

© 2026 Driptanil Datta. All rights reserved.