Random Forests
00 Random Forest Classification
Data Science
May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.

Driptanil Datta, Software Developer

Random Forest - Classification

The Data

We will use the same dataset throughout our discussion of classification with tree-based methods (Decision Trees, Random Forests, and Gradient Boosted Trees) so that we can compare performance metrics across these related models.

We will work with the "Palmer Penguins" dataset, as it is simple enough to help us fully understand how changing hyperparameters can change classification results.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

Summary: The data folder contains two CSV files. For intro courses/examples, you probably want to use the first one (penguins_size.csv).

  • penguins_size.csv: Simplified data from original penguin data sets. Contains variables:

    • species: penguin species (Chinstrap, Adélie, or Gentoo)
    • culmen_length_mm: culmen length (mm)
    • culmen_depth_mm: culmen depth (mm)
    • flipper_length_mm: flipper length (mm)
    • body_mass_g: body mass (g)
    • island: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
    • sex: penguin sex
  • (Not used) penguins_lter.csv: Original combined data for 3 penguin species

Note: The culmen is "the upper ridge of a bird's beak"

Our goal is to create a model that predicts a penguin's species from its physical attributes. Researchers could then use that model to classify penguins in the field without needing an experienced biologist.

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("../DATA/penguins_size.csv")
df = df.dropna()
df.head()
RESULT
(Preview table from df.head(), truncated. Visible columns: species, island, culmen_length_mm, culmen_depth_mm, flipper_length_mm.)

Train | Test Split

X = pd.get_dummies(df.drop('species',axis=1),drop_first=True)
y = df['species']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
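
To make the encoding step concrete, here is a minimal sketch of what pd.get_dummies(..., drop_first=True) does to categorical columns. The toy frame and its values are hypothetical stand-ins, not rows from the penguin data.

```python
import pandas as pd

# Toy frame with hypothetical values, mimicking a few penguin columns
toy = pd.DataFrame({
    "body_mass_g": [3750, 5000, 3800],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["MALE", "FEMALE", "MALE"],
})

# drop_first=True drops the alphabetically first level of each categorical
# column (here "Biscoe" and "FEMALE"), which becomes the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Numeric columns pass through unchanged, and each categorical column with k levels becomes k-1 indicator columns.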

Random Forest Classification

from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)
STDOUT
Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
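# Use 10 trees. max_features='sqrt' is what the older 'auto' setting meant for
# classifiers; 'auto' has been removed from recent scikit-learn releases.
model = RandomForestClassifier(n_estimators=10,max_features='sqrt',random_state=101)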
model.fit(X_train,y_train)
RESULT
RandomForestClassifier(n_estimators=10, random_state=101)
preds = model.predict(X_test)
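
As a rough illustration of what the forest does internally, the sketch below builds a small ensemble by hand: each tree is trained on a bootstrap sample and the ensemble predicts by majority vote. It is a simplification under stated assumptions: the data is synthetic (make_classification), the variable names are hypothetical, and scikit-learn's own RandomForestClassifier averages class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the penguin data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=101)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

rng = np.random.default_rng(101)
trees = []
for _ in range(10):
    # Bootstrap: draw len(Xtr) row indices with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    # max_features="sqrt": each split considers a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 1_000_000)))
    trees.append(tree.fit(Xtr[idx], ytr[idx]))

# One row of predictions per tree, then the most common label per test row
votes = np.stack([t.predict(Xte) for t in trees])
majority = np.array([np.bincount(col).argmax() for col in votes.T])
manual_accuracy = (majority == yte).mean()
print(manual_accuracy)
```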

Evaluation

from sklearn.metrics import confusion_matrix,classification_report,ConfusionMatrixDisplay,accuracy_score
confusion_matrix(y_test,preds)
RESULT
array([[39,  2,  0],
       [ 1, 22,  0],
       [ 0,  0, 37]], dtype=int64)
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(model,X_test,y_test)
RESULT
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1384588b0d0>
PLOT
Output 1
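
The confusion matrix above already contains the headline metrics. A minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix copied from the result above
cm = np.array([[39, 2, 0],
               [1, 22, 0],
               [0, 0, 37]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)

print(accuracy)
print(recall, precision)
```

Here 98 of the 101 test penguins sit on the diagonal, so the accuracy is about 97%.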

Feature Importance

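feature_importances_ is a very useful attribute of the trained model: it reports each feature's relative contribution to the forest's splits, in the same order as the columns of X, and the values sum to 1.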

model.feature_importances_
RESULT
array([0.35324545, 0.13320651, 0.1985798 , 0.12074795, 0.14244127,
       0.03781403, 0.00677831, 0.00718669])
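
The raw array is easier to read once it is paired with column names. The sketch below shows one way to do that on synthetic data; the frame, the column names (feat_a to feat_d), and rf_demo are hypothetical and only stand in for X and model from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=200, n_features=4, n_informative=2,
                                    n_redundant=0, random_state=101)
X_demo = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

rf_demo = RandomForestClassifier(n_estimators=10, random_state=101).fit(X_demo, y_demo)

# feature_importances_ follows the column order of the training frame
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects, the equivalent would be pd.Series(model.feature_importances_, index=X.columns).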

Choosing correct number of trees

Let's explore if continually adding more trees improves performance...

test_error = []
 
for n in range(1,40):
    # Use n trees ('sqrt' replaces the removed 'auto' option)
    model = RandomForestClassifier(n_estimators=n,max_features='sqrt')
    model.fit(X_train,y_train)
    test_preds = model.predict(X_test)
    # accuracy_score expects the true labels first, then the predictions
    test_error.append(1-accuracy_score(y_test,test_preds))
plt.plot(range(1,40),test_error,label='Test Error')
plt.legend()
RESULT
<matplotlib.legend.Legend at 0x138491ef760>
PLOT
Output 2

There are clearly diminishing returns: on such a small dataset, we have extracted nearly all the available information after about 5 trees. Because no random_state is set inside the loop, the exact curve changes from run to run.
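
An alternative way to watch the effect of adding trees, without refitting from scratch or touching the test set, is to grow one forest incrementally with warm_start=True and track its out-of-bag error. This is a sketch on synthetic data, not the notebook's method; the tree counts are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the penguin data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     random_state=101)

# warm_start=True keeps the existing trees and only adds new ones on each fit
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                bootstrap=True, random_state=101)

tree_counts = [20, 40, 60, 80, 100]
oob_errors = []
for n in tree_counts:
    forest.set_params(n_estimators=n)
    forest.fit(X_demo, y_demo)
    # oob_score_ is the accuracy on rows each tree did not see during training
    oob_errors.append(1 - forest.oob_score_)

print(list(zip(tree_counts, oob_errors)))
```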

Random Forest - HyperParameter Exploration

Dataset: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

df = pd.read_csv("../DATA/data_banknote_authentication.csv")
df.head()
RESULT
(Preview table from df.head(), truncated. Columns: Variance_Wavelet, Skewness_Wavelet, Curtosis_Wavelet, Image_Entropy, Class.)
sns.pairplot(df,hue='Class')
RESULT
<seaborn.axisgrid.PairGrid at 0x13849319fa0>
PLOT
Output 3
X = df.drop("Class",axis=1)
y = df["Class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=101)
from sklearn.model_selection import GridSearchCV
n_estimators=[64,100,128,200]
max_features= [2,3,4]
bootstrap = [True,False]
param_grid = {'n_estimators':n_estimators,
             'max_features':max_features,
             'bootstrap':bootstrap}
# oob_score is set on the estimator rather than searched over. It is only valid when
# bootstrap=True, so the bootstrap=False fits fail and are scored as nan (see the warning below).
rfc = RandomForestClassifier(oob_score=True)
grid = GridSearchCV(rfc,param_grid)
grid.fit(X_train,y_train)
STDERR
c:\users\marcial\anaconda_new\envs\ml_master\lib\site-packages\sklearn\model_selection\_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "c:\users\marcial\anaconda_new\envs\ml_master\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\marcial\anaconda_new\envs\ml_master\lib\site-packages\sklearn\ensemble\_forest.py", line 351, in fit
RESULT
GridSearchCV(estimator=RandomForestClassifier(oob_score=True),
             param_grid={'bootstrap': [True, False], 'max_features': [2, 3, 4],
                         'n_estimators': [64, 100, 128, 200]})
grid.best_params_
RESULT
{'bootstrap': True, 'max_features': 2, 'n_estimators': 64}
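
The failed fits in the warning come from combining oob_score=True with bootstrap=False. One way to avoid them, sketched here on synthetic data with deliberately small hypothetical grid values, is to pass GridSearchCV a list of dictionaries so that only valid combinations are tried.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the banknote data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=101)

shared = {"n_estimators": [16, 32], "max_features": [2, 3]}
valid_grid = [
    # Out-of-bag scoring is only searched where bootstrapping is on
    {**shared, "bootstrap": [True], "oob_score": [True, False]},
    # Without bootstrapping there are no out-of-bag rows to score
    {**shared, "bootstrap": [False], "oob_score": [False]},
]

search = GridSearchCV(RandomForestClassifier(random_state=101), valid_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Each dictionary is expanded separately, so this search covers 8 + 4 = 12 parameter settings and none of them raises during fitting.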
predictions = grid.predict(X_test)
print(classification_report(y_test,predictions))
STDOUT
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       124
           1       0.98      1.00      0.99        82
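from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(grid,X_test,y_test)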
RESULT
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x138493358e0>
PLOT
Output 4
# No underscore, reports back original oob_score parameter
grid.best_estimator_.oob_score
RESULT
True
# With underscore, reports back fitted attribute of oob_score
grid.best_estimator_.oob_score_
RESULT
0.9939965694682675
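
The out-of-bag score exists because each bootstrap sample leaves some rows out: drawing n rows with replacement from n rows misses roughly 1/e, or about 37%, of them. A small simulation with NumPy (the row count and seed are arbitrary) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# One bootstrap sample: n_rows indices drawn with replacement
sample = rng.integers(0, n_rows, size=n_rows)

# Rows that never appear in the sample are "out of bag" for that tree
oob_fraction = 1 - np.unique(sample).size / n_rows
print(oob_fraction, np.exp(-1))
```

Each tree can be evaluated on its own left-out rows, which is what oob_score_ aggregates across the forest.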

Understanding Number of Estimators (Trees)

Let's plot the error against the number of estimators.

from sklearn.metrics import accuracy_score
errors = []
misclassifications = []
 
for n in range(1,64):
    rfc = RandomForestClassifier(n_estimators=n,bootstrap=True,max_features=2)
    rfc.fit(X_train,y_train)
    preds = rfc.predict(X_test)
    err = 1 - accuracy_score(y_test,preds)
    # preds != y_test is a boolean mask; summing it counts the misclassified rows
    n_missed = np.sum(preds != y_test)
    errors.append(err)
    misclassifications.append(n_missed)
plt.plot(range(1,64),errors)
RESULT
[<matplotlib.lines.Line2D at 0x13849748310>]
PLOT
Output 5
plt.plot(range(1,64),misclassifications)
RESULT
[<matplotlib.lines.Line2D at 0x13849791c10>]
PLOT
Output 6
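
The misclassification count in the loop relies on NumPy treating True as 1 when summing. A tiny example with made-up labels shows the idea and how the count relates to the error rate:

```python
import numpy as np

# Hypothetical labels for five test rows
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mismatch = y_pred != y_true           # boolean mask, True where the model was wrong
n_missed = int(np.sum(mismatch))      # True counts as 1, so the sum is the miss count
error_rate = n_missed / len(y_true)   # same quantity as 1 - accuracy

print(mismatch, n_missed, error_rate)
```

The two plots above therefore show the same curve at different scales: the error is the number of misclassifications divided by the size of the test set.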

© 2026 Driptanil Datta. All rights reserved.