May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.


Driptanil Datta, Software Developer

DBSCAN Hyperparameters

Let's explore the hyperparameters for DBSCAN and how they can change results!

DBSCAN and Clustering Examples

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
two_blobs = pd.read_csv('../DATA/cluster_two_blobs.csv')
two_blobs_outliers = pd.read_csv('../DATA/cluster_two_blobs_outliers.csv')
sns.scatterplot(data=two_blobs,x='X1',y='X2')
RESULT
<AxesSubplot:xlabel='X1', ylabel='X2'>
PLOT
Output 1
# plt.figure(figsize=(10,6),dpi=200)
sns.scatterplot(data=two_blobs_outliers,x='X1',y='X2')
RESULT
<AxesSubplot:xlabel='X1', ylabel='X2'>
PLOT
Output 2

Label Discovery

def display_categories(model,data):
    # Fit the model, then color each point by its cluster label (-1 marks outliers)
    labels = model.fit_predict(data)
    sns.scatterplot(data=data,x='X1',y='X2',hue=labels,palette='Set1')

DBSCAN

from sklearn.cluster import DBSCAN
help(DBSCAN)
STDOUT
Help on class DBSCAN in module sklearn.cluster._dbscan:

class DBSCAN(sklearn.base.ClusterMixin, sklearn.base.BaseEstimator)
 |  DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
 |  
dbscan = DBSCAN()
display_categories(dbscan,two_blobs)
PLOT
Output 3
 
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 4

Epsilon

eps : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
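To make the two extremes of eps concrete, here is a minimal sketch on synthetic data. The `make_blobs` dataset and the `summarize` helper are assumptions for illustration; they stand in for the lesson's CSV files, which are not reproduced here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the lesson's data (an assumption, not the real CSV)
X, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, 3)],
                  cluster_std=0.5, random_state=42)

def summarize(eps):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

tiny = summarize(1e-6)   # each neighborhood holds only the point itself
huge = summarize(100.0)  # every point is a neighbor of every other point
print("tiny eps:", tiny)
print("huge eps:", huge)
```

With the default min_samples=5, the tiny eps leaves no point with enough neighbors to be a core point, so everything is labeled -1. The huge eps puts every point in one shared neighborhood, so everything lands in a single cluster.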

# Tiny Epsilon --> Tiny Max Distance --> Everything is an outlier (class=-1)
dbscan = DBSCAN(eps=0.001)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 5
# Huge Epsilon --> Huge Max Distance --> Everything is in the same cluster (class=0)
dbscan = DBSCAN(eps=10)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 6
# How to find a good epsilon?
plt.figure(figsize=(10,6),dpi=200)
dbscan = DBSCAN(eps=1)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 7
dbscan.labels_
RESULT
array([ 0,  1,  0, ..., -1, -1, -1], dtype=int64)
dbscan.labels_ == -1
RESULT
array([False, False, False, ...,  True,  True,  True])
np.sum(dbscan.labels_ == -1)
RESULT
3
100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
RESULT
0.29910269192422734
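Beyond counting outliers, it can help to see the size of every cluster at once. The sketch below uses `np.unique` with `return_counts=True` on the labels; the synthetic `make_blobs` data is an assumption standing in for the lesson's CSV.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the lesson's data (an assumption, not the real CSV)
X, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, 3)],
                  cluster_std=0.5, random_state=42)

labels = DBSCAN(eps=0.5).fit_predict(X)

# Map each label to how many points carry it; the key -1, if present, is noise
values, counts = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(values.tolist(), counts.tolist()))
print(cluster_sizes)
```

A very small cluster in this summary is often a hint that eps or min_samples is splitting off points you might rather treat as outliers.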

Charting reasonable Epsilon values

# bend the knee! https://raghavan.usc.edu/papers/kneedle-simplex11.pdf
# np.arange(start=0.01,stop=10,step=0.01)
outlier_percent = []
number_of_outliers = []
 
for eps in np.linspace(0.001,10,100):
    
    # Create Model
    dbscan = DBSCAN(eps=eps)
    dbscan.fit(two_blobs_outliers)
    
    # Log Number of Outliers
    number_of_outliers.append(np.sum(dbscan.labels_ == -1))
    
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    
    outlier_percent.append(perc_outliers)
sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
RESULT
Text(0.5, 0, 'Epsilon Value')
PLOT
Output 8
sns.lineplot(x=np.linspace(0.001,10,100),y=number_of_outliers)
plt.ylabel("Number of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.xlim(0,1)
RESULT
(0.0, 1.0)
PLOT
Output 9
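A related way to look for the "knee" is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the bend. This is a different technique from the outlier-count sweep above, sketched here under assumptions: the synthetic `make_blobs` data replaces the lesson's CSV, and k is set to DBSCAN's default min_samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the lesson's data (an assumption, not the real CSV)
X, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, 3)],
                  cluster_std=0.5, random_state=42)

min_samples = 5  # DBSCAN's default

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Last column: the eps at which each point would become a core point
k_distances = np.sort(distances[:, -1])
print(k_distances[:5], k_distances[-5:])
```

Plotting `k_distances` with `plt.plot(k_distances)` gives a curve that stays flat for points inside dense regions and rises sharply for isolated points. An eps near the bend is a reasonable starting value to try.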

Do we want to think in terms of percentage targeting instead?

If so, you could "target" a percentage, like choose a range producing 1%-5% as outliers.

sns.lineplot(x=np.linspace(0.001,10,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.ylim(0,5)
plt.xlim(0,2)
plt.hlines(y=1,xmin=0,xmax=2,colors='red',ls='--')
RESULT
<matplotlib.collections.LineCollection at 0x19a401a0af0>
PLOT
Output 10
# Try an epsilon picked from the chart above
dbscan = DBSCAN(eps=0.4)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 11
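The percentage target can also be applied in code rather than read off the chart. The sketch below is an assumption-laden illustration: the data is synthetic (two blobs plus three hand-placed stray points), and `outlier_percent` and `smallest_eps_within` are hypothetical helper names, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in: two blobs plus three stray points (an assumption)
blobs, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, 3)],
                      cluster_std=0.5, random_state=42)
strays = np.array([[0.0, 6.0], [6.0, 0.0], [-6.0, 0.0]])
X = np.vstack([blobs, strays])

def outlier_percent(eps, data):
    """Percentage of points DBSCAN labels as noise (-1) for this eps."""
    labels = DBSCAN(eps=eps).fit_predict(data)
    return 100 * np.sum(labels == -1) / len(labels)

eps_grid = np.linspace(0.001, 10, 100)
percents = np.array([outlier_percent(eps, X) for eps in eps_grid])

def smallest_eps_within(target_percent):
    """Smallest eps on the grid with at most target_percent outliers, or None."""
    mask = percents <= target_percent
    return float(eps_grid[mask][0]) if mask.any() else None

chosen = smallest_eps_within(1.0)  # mirrors the 1% red line on the chart
print(chosen)
```

Because the noise set can only shrink as eps grows (with min_samples fixed), the first grid value that meets the target is the tightest eps on the grid that satisfies it.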

Do we want to think in terms of number of outliers targeting instead?

If so, you could "target" a number of outliers, such as 3 points as outliers.

sns.lineplot(x=np.linspace(0.001,10,100),y=number_of_outliers)
plt.ylabel("Number of Points Classified as Outliers")
plt.xlabel("Epsilon Value")
plt.ylim(0,10)
plt.xlim(0,6)
plt.hlines(y=3,xmin=0,xmax=6,colors='red',ls='--')
RESULT
<matplotlib.collections.LineCollection at 0x19a40070670>
PLOT
Output 12
# Try an epsilon picked from the chart above, aiming for the 3-outlier line
dbscan = DBSCAN(eps=0.75)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 13

Minimum Samples

min_samples : int, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
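To see what "core point" means in practice, the sketch below counts core points and noise points for a few min_samples values at a fixed eps. The synthetic `make_blobs` data and the chosen values (1, 5, 20, 50) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the lesson's data (an assumption, not the real CSV)
X, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, 3)],
                  cluster_std=0.5, random_state=42)

results = {}
for n in (1, 5, 20, 50):
    model = DBSCAN(eps=0.5, min_samples=n).fit(X)
    n_core = len(model.core_sample_indices_)     # points with >= n neighbors
    n_noise = int(np.sum(model.labels_ == -1))   # points in no cluster
    results[n] = (n_core, n_noise)
    print(f"min_samples={n:>2}  core points={n_core:>3}  noise points={n_noise:>3}")
```

With min_samples=1 every point counts itself and is therefore a core point, so nothing can be labeled as noise. Raising min_samples can only reduce the number of core points, which in turn can only increase the number of outliers.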

How to choose minimum number of points?

https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan

outlier_percent = []
 
for n in np.arange(1,100):
    
    # Create Model
    dbscan = DBSCAN(min_samples=n)
    dbscan.fit(two_blobs_outliers)
    
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    
    outlier_percent.append(perc_outliers)
sns.lineplot(x=np.arange(1,100),y=outlier_percent)
plt.ylabel("Percentage of Points Classified as Outliers")
plt.xlabel("Minimum Number of Samples")
RESULT
Text(0.5, 0, 'Minimum Number of Samples')
PLOT
Output 14
# Common rule of thumb: min_samples = 2 * number of dimensions (columns)
num_dim = two_blobs_outliers.shape[1]
 
dbscan = DBSCAN(min_samples=2*num_dim)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 15
num_dim = two_blobs_outliers.shape[1]
 
dbscan = DBSCAN(eps=0.75,min_samples=2*num_dim)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 16
# min_samples=1 --> every point is its own core point --> no outliers at all
dbscan = DBSCAN(min_samples=1)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 17
dbscan = DBSCAN(eps=0.75,min_samples=1)
display_categories(dbscan,two_blobs_outliers)
PLOT
Output 18


© 2026 Driptanil Datta. All rights reserved.