Hierarchical Clustering
Data Science
May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.


Driptanil Datta, Software Developer

Hierarchical Clustering

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The Data

df = pd.read_csv('../DATA/cluster_mpg.csv')
df = df.dropna()
df.head()
[Table output truncated; visible columns: mpg, cylinders, displacement, horsepower, weight]
df.describe()
[Table output truncated; visible columns: mpg, cylinders, displacement, horsepower, weight]
df['origin'].value_counts()
RESULT
usa       245
japan      79
europe     68
Name: origin, dtype: int64
df_w_dummies = pd.get_dummies(df.drop('name',axis=1))
df_w_dummies
[Table output truncated; visible columns: mpg, cylinders, displacement, horsepower, weight]
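As a small illustration of what pd.get_dummies does to a string column such as origin, here is a minimal sketch on a made-up three-row frame. The values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy frame: the values are invented for illustration.
toy = pd.DataFrame({
    "mpg": [18.0, 31.0, 26.0],
    "origin": ["usa", "japan", "europe"],
})

# Numeric columns pass through unchanged; the string column becomes
# one indicator column per category.
toy_dummies = pd.get_dummies(toy)
dummy_columns = list(toy_dummies.columns)
print(dummy_columns)
```

Each row ends up with exactly one "on" indicator among the origin columns, which matters later when reasoning about distances between rows.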

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_w_dummies)
scaled_data
RESULT
array([[0.2393617 , 1.        , 0.61757106, ..., 0.        , 0.        ,
        1.        ],
       [0.15957447, 1.        , 0.72868217, ..., 0.        , 0.        ,
        1.        ],
       [0.2393617 , 1.        , 0.64599483, ..., 0.        , 0.        ,
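To make the scaling step concrete, the sketch below compares MinMaxScaler with the formula it applies per column, (x - min) / (max - min), on a tiny made-up array. The numbers are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two columns on very different scales.
toy = np.array([[10.0, 3000.0],
                [20.0, 2000.0],
                [30.0, 4000.0]])

scaled_toy = MinMaxScaler().fit_transform(toy)

# The same result computed by hand, column by column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(scaled_toy)
```

After scaling, every column runs from 0 to 1, so no single feature dominates the distance calculations used by the clustering.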
scaled_df = pd.DataFrame(scaled_data,columns=df_w_dummies.columns)
plt.figure(figsize=(15,8))
sns.heatmap(scaled_df,cmap='magma');
PLOT
Output 1
sns.clustermap(scaled_df,row_cluster=False)
RESULT
<seaborn.matrix.ClusterGrid at 0x1a8a23ef2b0>
PLOT
Output 2
sns.clustermap(scaled_df,col_cluster=False)
RESULT
<seaborn.matrix.ClusterGrid at 0x1a8a45f21c0>
PLOT
Output 3

Using Scikit-Learn

from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=4)
cluster_labels = model.fit_predict(scaled_df)
cluster_labels
RESULT
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 0, 0, 3, 2, 2, 2,
       2, 2, 0, 1, 1, 1, 1, 3, 0, 3, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 2, 2, 2, 3, 3, 2, 0, 3, 0, 2, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 2, 2, 0, 3, 3, 0, 3, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 0, 3, 0, 3,
plt.figure(figsize=(12,4),dpi=200)
sns.scatterplot(data=df,x='mpg',y='weight',hue=cluster_labels)
RESULT
<AxesSubplot:xlabel='mpg', ylabel='weight'>
PLOT
Output 4
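The same API can be checked on data where the answer is obvious. This is a minimal sketch with two made-up, well-separated groups of points; it assumes the default ward linkage, as in the cell above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three points near (0, 0), three near (1, 1).
toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
])

toy_model = AgglomerativeClustering(n_clusters=2)  # default linkage is 'ward'
toy_labels = toy_model.fit_predict(toy)
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is that points in the same group share a label.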

Exploring Number of Clusters with Dendrograms

Make sure to read the SciPy dendrogram documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html

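With n_clusters=None and distance_threshold=0, no merge is applied to the labels, so every point ends up as its own cluster while the full merge tree is still computed.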

model = AgglomerativeClustering(n_clusters=None,distance_threshold=0)
cluster_labels = model.fit_predict(scaled_df)
cluster_labels
RESULT
array([247, 252, 360, 302, 326, 381, 384, 338, 300, 279, 217, 311, 377,
       281, 232, 334, 272, 375, 354, 333, 317, 345, 329, 289, 305, 383,
       290, 205, 355, 269, 202, 144, 245, 297, 386, 358, 199, 337, 330,
       339, 293, 352, 283, 196, 253, 168, 378, 331, 201, 268, 256, 361,
       250, 197, 246, 371, 324, 230, 203, 261, 380, 376, 308, 389, 332,
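A quick way to see what this setting does is to inspect the fitted attributes on a small made-up array. The sketch below assumes random toy data, not the notebook's dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
toy = rng.random((8, 3))  # hypothetical data: 8 points, 3 features

toy_model = AgglomerativeClustering(n_clusters=None, distance_threshold=0)
toy_labels = toy_model.fit_predict(toy)

# Every point keeps its own label, but the merge tree is still recorded.
print(toy_model.n_clusters_)      # number of clusters found
print(toy_model.children_.shape)  # one row per merge in the full tree
print(toy_model.distances_.shape) # one merge distance per row of children_
```

The children_ and distances_ attributes are what a dendrogram needs, which is why this zero-threshold fit is a common first step.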
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy

Linkage Model

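# Caution: hierarchy.linkage treats each row of its input as an observation,
# so passing model.children_ clusters the merge index pairs rather than the
# scaled features. The distances in this matrix are therefore not on the
# 0-to-1 feature scale.
linkage_matrix = hierarchy.linkage(model.children_)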
linkage_matrix
RESULT
array([[ 67.        , 161.        ,   1.41421356,   2.        ],
       [ 10.        ,  45.        ,   1.41421356,   2.        ],
       [ 47.        ,  99.        ,   1.41421356,   2.        ],
       ...,
       [340.        , 777.        ,  56.40035461, 389.        ],
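Note that hierarchy.linkage expects observations (or a condensed distance matrix) as input, so calling it on model.children_ clusters index pairs rather than the scaled features. One alternative, sketched below on made-up data, is to assemble the linkage matrix from the fitted model's own children_ and distances_, following the pattern used in scikit-learn's dendrogram example. The helper name linkage_from_model is hypothetical, and the comparison against SciPy's ward linkage is included only as a sanity check.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering


def linkage_from_model(fitted_model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering.

    Hypothetical helper: requires distances_ to be available, e.g. by
    fitting with distance_threshold set or compute_distances=True.
    """
    n_samples = len(fitted_model.labels_)
    counts = np.zeros(fitted_model.children_.shape[0])
    for i, merge in enumerate(fitted_model.children_):
        current = 0
        for child in merge:
            if child < n_samples:
                current += 1  # an original observation
            else:
                current += counts[child - n_samples]  # an earlier merge
        counts[i] = current
    return np.column_stack(
        [fitted_model.children_, fitted_model.distances_, counts]
    ).astype(float)


rng = np.random.default_rng(1)
toy = rng.random((12, 4))  # hypothetical data

toy_model = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(toy)
Z_model = linkage_from_model(toy_model)

# Linkage computed directly from the features, for comparison.
Z_scipy = hierarchy.linkage(toy, method="ward")

# no_plot=True returns the dendrogram layout without drawing anything.
dn_toy = hierarchy.dendrogram(Z_model, no_plot=True)
```

With a matrix built this way, the merge heights are in the same units as the scaled features, so they can be compared with point-to-point distances.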
plt.figure(figsize=(20,10))
# Warning! This plot will take a while!
dn = hierarchy.dendrogram(linkage_matrix)
PLOT
Output 5
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=48)
PLOT
Output 6
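To see the effect of truncate_mode='lastp' without waiting for a plot, the sketch below counts the leaves in the returned layout. It assumes random toy data and uses no_plot=True so nothing is drawn.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
toy = rng.random((30, 3))  # hypothetical data: 30 points
Z_toy = hierarchy.linkage(toy, method="ward")

full = hierarchy.dendrogram(Z_toy, no_plot=True)
truncated = hierarchy.dendrogram(Z_toy, truncate_mode="lastp", p=5, no_plot=True)

n_full_leaves = len(full["ivl"])
n_truncated_leaves = len(truncated["ivl"])
print(n_full_leaves, n_truncated_leaves)
```

The truncated version collapses everything below the last p merged clusters into single leaves, which is why it renders so much faster on a few hundred points.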

Choosing a Threshold Distance

What is the distance between two points?

scaled_df.describe()
[Table output truncated; visible columns: mpg, cylinders, displacement, horsepower, weight]
scaled_df['mpg'].idxmax()
RESULT
320
scaled_df['mpg'].idxmin()
RESULT
28
# https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
a = scaled_df.iloc[320]
b = scaled_df.iloc[28]
dist = np.linalg.norm(a-b)
dist
RESULT
2.3852929970374714
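For reference, np.linalg.norm of a difference vector is the Euclidean distance written out in full. A minimal check on two made-up points:

```python
import numpy as np

# Hypothetical points chosen so the answer is easy to verify by hand.
p = np.array([0.0, 0.0, 0.0])
q = np.array([1.0, 2.0, 2.0])

dist_norm = np.linalg.norm(p - q)
dist_manual = np.sqrt(np.sum((p - q) ** 2))  # sqrt(1 + 4 + 4)
print(dist_norm, dist_manual)
```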

Max possible distance?

Recall Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance

np.sqrt(len(scaled_df.columns))
RESULT
3.1622776601683795
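The square root of the column count is an upper bound: it assumes two rows differ by the full 0-to-1 range in every column. The sketch below checks that bound and also a slightly tighter one. It assumes the 10 columns split into 7 scaled numeric features and 3 origin indicator columns, which is inferred from the outputs above (three origin values, sqrt(10) as the result) rather than stated in the notebook. Since a row has exactly one origin, two rows can differ in at most two of the three indicator columns.

```python
import numpy as np

n_numeric = 7  # assumption: 10 columns total minus 3 origin indicators
n_dummies = 3

# Loose bound: every column differs by 1.
loose_bound = np.sqrt(n_numeric + n_dummies)

# Two hypothetical rows at opposite extremes with different origins.
row_a = np.concatenate([np.zeros(n_numeric), [1.0, 0.0, 0.0]])
row_b = np.concatenate([np.ones(n_numeric), [0.0, 1.0, 0.0]])
reachable = np.linalg.norm(row_a - row_b)

print(loose_bound, reachable)
```

Either figure gives a sense of scale for picking a threshold: the observed distance of about 2.39 between the highest-mpg and lowest-mpg rows is already a large share of the maximum.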

Creating a Model Based on Distance Threshold

  • distance_threshold
    • The linkage distance threshold at or above which clusters will not be merged.
model = AgglomerativeClustering(n_clusters=None,distance_threshold=2)
cluster_labels = model.fit_predict(scaled_data)
cluster_labels
RESULT
array([ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  1,  4,  4,
        4,  1,  0,  0,  0,  0,  0,  4,  3,  3,  3,  3,  1,  7,  1,  4,  4,
        4,  4,  4,  3,  3,  3,  3,  3,  3,  3,  4,  7,  4,  4,  7,  0,  0,
        0,  1,  1,  0,  7,  1,  7,  0,  7,  7,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  1,  3,  3,  3,  3,  0,  0,  0,  0,  7,  1,  1,  7,  1,  3,
np.unique(cluster_labels)
RESULT
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64)
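The relationship between the threshold and the number of clusters can be explored on toy data. This is a minimal sketch assuming random made-up points; the thresholds are arbitrary and chosen only to show the trend.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
toy = rng.random((40, 4))  # hypothetical data

cluster_counts = {}
for threshold in (0.5, 1.0, 2.0):
    toy_model = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold)
    toy_labels = toy_model.fit_predict(toy)
    # n_clusters_ reports how many clusters survived at this threshold.
    cluster_counts[threshold] = toy_model.n_clusters_

print(cluster_counts)
```

A higher threshold allows more merges, so the cluster count can only stay the same or go down as the threshold increases.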

Linkage Matrix

Source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage

A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
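The description above is easier to follow on a case small enough to work by hand. The sketch below uses four made-up points on a line and SciPy's default single linkage (not the ward linkage scikit-learn uses by default), so each merge distance is simply the smallest gap between the two clusters.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical 1-D observations, indexed 0..3 (n = 4).
points = np.array([[0.0], [1.0], [5.0], [7.0]])

Z_toy = hierarchy.linkage(points, method="single")
print(Z_toy)

# Reading the rows:
#   row 0 merges observations 0 and 1 (gap 1) into cluster n + 0 = 4
#   row 1 merges observations 2 and 3 (gap 2) into cluster n + 1 = 5
#   row 2 merges clusters 4 and 5 (closest pair is 1.0 and 5.0, gap 4)
#         into cluster n + 2 = 6, which holds all 4 observations
```

Any index of 4 or higher in the first two columns refers to a cluster formed by an earlier row, which is how the whole tree is encoded in a flat array.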

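# Caution: as before, this clusters the merge index pairs in model.children_,
# not the scaled features, so the distances are not on the feature scale.
linkage_matrix = hierarchy.linkage(model.children_)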
linkage_matrix
RESULT
array([[ 67.        , 161.        ,   1.41421356,   2.        ],
       [ 10.        ,  45.        ,   1.41421356,   2.        ],
       [ 47.        ,  99.        ,   1.41421356,   2.        ],
       ...,
       [340.        , 777.        ,  56.40035461, 389.        ],
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(linkage_matrix,truncate_mode='lastp',p=11)
PLOT
Output 7
© 2026 Driptanil Datta. All rights reserved.