01-Gradient-Boosting
Data Science · May 2026
Driptanil Datta, Software Developer
Notebook converted from Jupyter for blog publishing.
Gradient Boosting and GridSearch
The Data
Mushroom Hunting: Edible or Poisonous?
Data Source: https://archive.ics.uci.edu/ml/datasets/Mushroom
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom: no rule like "leaflets three, let it be" for poisonous Oak and Ivy.
Attribute Information:
- cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
- bruises?: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w,distant=d
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
- veil-type: partial=p,universal=u
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
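These single-letter codes are how the values appear in the CSV; decoding them to readable labels is a one-line pandas `map`. A sketch for the odor column (the mapping dict is transcribed from the attribute list above):

```python
import pandas as pd

# Decode map for the odor column, taken from the attribute list above
odor_map = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
            "f": "foul", "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}

codes = pd.Series(["n", "f", "a"])
print(codes.map(odor_map).tolist())  # → ['none', 'foul', 'almond']
```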
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../DATA/mushrooms.csv")
df.head()

(output truncated: first columns are class, cap-shape, cap-surface, cap-color, bruises, …)

Data Prep
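Every column in this dataset is a categorical string, so it has to be one-hot encoded before scikit-learn's tree ensembles can use it. A minimal sketch of what `pd.get_dummies` with `drop_first=True` does, on a toy frame whose column name simply mirrors the real data:

```python
import pandas as pd

toy = pd.DataFrame({"cap-shape": ["x", "b", "x"]})
encoded = pd.get_dummies(toy, drop_first=True)

# Categories are sorted ('b', 'x') and the first one is dropped as the baseline,
# so only the 'cap-shape_x' indicator column remains.
print(list(encoded.columns))  # → ['cap-shape_x']
```

Dropping the first level avoids a fully redundant column: for tree models this is optional, but it keeps the feature matrix smaller.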
X = df.drop('class', axis=1)
y = df['class']
X = pd.get_dummies(X, drop_first=True)
X.head()

(output truncated: first columns are cap-shape_c, cap-shape_f, cap-shape_k, cap-shape_s, cap-shape_x, …)

y.head()
0    p
1    e
2    e
3    p
4    e

Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=101)

Gradient Boosting and Grid Search with CV
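Before reaching for scikit-learn's implementation, the boosting idea itself can be sketched in a few lines: fit a weak learner to the residuals of the current ensemble, add a damped version of its predictions, and repeat. An illustrative toy on synthetic 1-D regression data (not the mushroom dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0])

# Start from the mean prediction, then fit each stump to the current residuals
pred = np.full_like(y_toy, y_toy.mean())
learning_rate = 0.1
for _ in range(100):
    residual = y_toy - pred
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residual)
    pred += learning_rate * stump.predict(X_toy)

# The boosted ensemble should beat the constant-mean baseline
print(np.abs(y_toy - pred).mean() < np.abs(y_toy - y_toy.mean()).mean())  # → True
```

The `learning_rate` and the number of rounds trade off against each other, which is exactly why the grid search below tunes `n_estimators` and tree depth.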
from sklearn.ensemble import GradientBoostingClassifier

help(GradientBoostingClassifier)

Help on class GradientBoostingClassifier in module sklearn.ensemble._gb:

class GradientBoostingClassifier(sklearn.base.ClassifierMixin, BaseGradientBoosting)
 |  GradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [1, 5, 10, 20, 40, 100], 'max_depth': [3, 4, 5, 6]}
gb_model = GradientBoostingClassifier()
grid = GridSearchCV(gb_model, param_grid)

Fit to Training Data with CV Search
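Grid search trains one model per parameter combination per cross-validation fold, so cost grows multiplicatively. A quick count for the grid above, assuming `GridSearchCV`'s default of 5 folds:

```python
n_estimators_options = [1, 5, 10, 20, 40, 100]
max_depth_options = [3, 4, 5, 6]
cv_folds = 5  # GridSearchCV default

# 6 x 4 parameter combinations, each fitted once per fold
total_fits = len(n_estimators_options) * len(max_depth_options) * cv_folds
print(total_fits)  # → 120, plus one final refit on the full training set
```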
grid.fit(X_train, y_train)
GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'max_depth': [3, 4, 5, 6],
                         'n_estimators': [1, 5, 10, 20, 40, 100]})

grid.best_params_

{'max_depth': 3, 'n_estimators': 100}

Performance
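Beyond `best_params_`, the fitted search records a cross-validated score for every parameter combination in `cv_results_`, which is worth inspecting before touching the test set. A self-contained sketch on synthetic data (the `demo_*` names are placeholders, not the notebook's objects):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
demo_grid = GridSearchCV(GradientBoostingClassifier(),
                         {"n_estimators": [5, 20], "max_depth": [2, 3]})
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, with its mean CV score
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_max_depth", "param_n_estimators", "mean_test_score"]]
print(results.sort_values("mean_test_score", ascending=False))
```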
from sklearn.metrics import classification_report, plot_confusion_matrix, accuracy_score

predictions = grid.predict(X_test)
predictions

array(['p', 'e', 'p', ..., 'p', 'p', 'e'], dtype=object)

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           e       1.00      1.00      1.00       655
           p       1.00      1.00      1.00       564
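`accuracy_score` was imported above but never called; for completeness, a tiny sketch of what it computes on hand-made labels:

```python
from sklearn.metrics import accuracy_score

y_true = ["p", "e", "p", "e"]
y_pred = ["p", "e", "e", "e"]  # one of four labels wrong
print(accuracy_score(y_true, y_pred))  # → 0.75
```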
grid.best_estimator_.feature_importances_

array([2.91150176e-04, 1.55427847e-17, 2.67658844e-21, 0.00000000e+00,
       1.11459235e-16, 1.05030313e-03, 3.26837862e-18, 9.23288948e-17,
       3.33934930e-18, 0.00000000e+00, 1.27133255e-17, 0.00000000e+00,
       3.56629935e-17, 2.46527883e-21, 0.00000000e+00, 5.60405971e-07,
       2.31055039e-03, 5.13955090e-02, 1.84253604e-04, 1.40371481e-02, …])

feat_import = grid.best_estimator_.feature_importances_
imp_feats = pd.DataFrame(index=X.columns, data=feat_import, columns=['Importance'])
imp_feats
(output truncated)
               Importance
cap-shape_c  2.911502e-04
cap-shape_f  1.554278e-17

imp_feats.sort_values("Importance", ascending=False)
(output truncated)
              Importance
odor_n          0.614744
stalk-root_c    0.135977

imp_feats.describe().transpose()
(output truncated: count, mean, std, min, 25%, …)

imp_feats = imp_feats[imp_feats['Importance'] > 0.000527]
imp_feats.sort_values('Importance')
(output truncated)
                          Importance
population_y                0.000550
stalk-color-above-ring_w    0.000575

plt.figure(figsize=(14, 6), dpi=200)
sns.barplot(data=imp_feats.sort_values('Importance'), x=imp_feats.sort_values('Importance').index, y='Importance')
plt.xticks(rotation=90);

(bar plot: one bar per retained feature, sorted by importance)
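The importances plotted above are impurity-based, which can overstate features with many dummy columns; permutation importance is a common cross-check. A self-contained sketch on synthetic data (not the mushroom dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X_demo, y_demo = make_classification(n_samples=300, n_informative=3, random_state=42)
model = GradientBoostingClassifier(n_estimators=50).fit(X_demo, y_demo)

# Shuffle each feature column and measure how much the score drops
perm = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=42)
print(perm.importances_mean.shape)  # one mean importance per feature
```

Features whose shuffling barely changes the score would be candidates for the same kind of thresholding applied to `imp_feats` above.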
