
Data Science

May 2026×Notebook lesson

Notebook converted from Jupyter for blog publishing.

00-Logistic-Regression

Driptanil Datta, Software Developer

Logistic Regression

Imports

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Data

An experiment was conducted on 5000 participants to study the effects of age and physical health on hearing loss, specifically the ability to hear high-pitched tones. The data shows the results of the study: participants were evaluated and scored for physical ability, then took an audio test (pass/no pass) that evaluated their ability to hear high frequencies. The age of each participant was also noted. Is it possible to build a model that predicts someone's likelihood of hearing the high-frequency sound based solely on their features (age and physical score)?

Features
- age - Age of participant in years
- physical_score - Score achieved during physical exam
Label/Target
- test_result - 0 if no pass, 1 if test passed

df = pd.read_csv('../DATA/hearing_test.csv')

df.head()

HTML

age
physical_score
test_result
0
33.0

Exploratory Data Analysis and Visualization

Feel free to explore the data further on your own.

df.info()

STDOUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----

df.describe()

HTML

age
physical_score
test_result
count
5000.000000

df['test_result'].value_counts()

RESULT

1    3000
0    2000
Name: test_result, dtype: int64

sns.countplot(data=df,x='test_result')

RESULT

<AxesSubplot:xlabel='test_result', ylabel='count'>

PLOT

sns.boxplot(x='test_result',y='age',data=df)

RESULT

<AxesSubplot:xlabel='test_result', ylabel='age'>

PLOT

sns.boxplot(x='test_result',y='physical_score',data=df)

RESULT

<AxesSubplot:xlabel='test_result', ylabel='physical_score'>

PLOT

sns.scatterplot(x='age',y='physical_score',data=df,hue='test_result')

RESULT

<AxesSubplot:xlabel='age', ylabel='physical_score'>

PLOT

sns.pairplot(df,hue='test_result')

RESULT

<seaborn.axisgrid.PairGrid at 0x19ceae2fd08>

PLOT

sns.heatmap(df.corr(),annot=True)

RESULT

<AxesSubplot:>

PLOT

sns.scatterplot(x='physical_score',y='test_result',data=df)

RESULT

<AxesSubplot:xlabel='physical_score', ylabel='test_result'>

PLOT

sns.scatterplot(x='age',y='test_result',data=df)

RESULT

<AxesSubplot:xlabel='age', ylabel='test_result'>

PLOT

Easily discover new plot types with a Google search! Searching for "3d matplotlib scatter plot" quickly takes you to: https://matplotlib.org/3.1.1/gallery/mplot3d/scatter3d.html

from mpl_toolkits.mplot3d import Axes3D 
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'],df['physical_score'],df['test_result'],c=df['test_result'])

RESULT

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x19ceaf878c8>

PLOT

Train | Test Split and Scaling

X = df.drop('test_result',axis=1)
y = df['test_result']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

scaler = StandardScaler()

scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
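
The scaler above is fit on the training split only and then reused on the test split, so no statistics from the test rows leak into training. A minimal sketch of what that means, using a small made-up array rather than the hearing data (the names `train`, `test` and `manual_test` are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration; not taken from hearing_test.csv
train = np.array([[20.0, 30.0], [40.0, 35.0], [60.0, 40.0]])
test = np.array([[50.0, 45.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
scaled_test = scaler.transform(test)        # reuses the training mean_ and scale_

# transform() computes (x - mean_) / scale_ with the statistics learned from train
manual_test = (test - scaler.mean_) / scaler.scale_

print(scaler.mean_)                            # per-column training means
print(scaled_train.mean(axis=0))               # approximately zero after scaling
print(np.allclose(scaled_test, manual_test))   # same result as the manual formula
```

Calling fit_transform on the test split instead would standardize it with its own mean and spread, which the model never saw during training.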

Logistic Regression Model

from sklearn.linear_model import LogisticRegression

# help(LogisticRegression)

# help(LogisticRegressionCV)

log_model = LogisticRegression()

log_model.fit(scaled_X_train,y_train)

RESULT

LogisticRegression()

Coefficient Interpretation

Things to remember:

- These coefficients relate to the odds and cannot be interpreted directly, as they can in linear regression.
- We trained on a scaled version of the data, so each coefficient is expressed per standard deviation of its feature rather than per raw unit.
- It is much easier to interpret the coefficients relative to one another than to interpret each coefficient's relationship with the probability of the target/label class.

Make sure to watch the video explanation, also check out the links below:

The odds ratio

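For a continuous independent variable the odds ratio can be defined as:

$$\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$

Here $\beta_0$ is the intercept and $\beta_1$ is the coefficient of x. This exponential relationship provides an interpretation for $\beta_1$:

The odds multiply by $e^{\beta_1}$ for every 1-unit increase in x.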
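
The statement about the odds can be checked numerically. The sketch below assumes a single-feature logistic model with hypothetical values for the intercept and slope (`beta0`, `beta1` and the helper `odds` are illustrative names, not part of the notebook):

```python
import numpy as np

def odds(x, beta0, beta1):
    """Odds p / (1 - p) under a logistic model with linear term beta0 + beta1 * x."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)

# Hypothetical coefficients chosen only for illustration
beta0, beta1 = -1.0, 0.5

# Ratio of the odds at x + 1 to the odds at x, for x = 2
ratio = odds(3.0, beta0, beta1) / odds(2.0, beta0, beta1)

print(ratio, np.exp(beta1))  # the two values agree up to floating-point error
```

The ratio does not depend on the starting value of x, which is why a single number per feature summarizes its effect on the odds.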

log_model.coef_

RESULT

array([[-0.94953524,  3.45991194]])

This means:

- We can expect the odds of passing the test to decrease as age increases (the age coefficient is negative).
- We can expect the odds of passing the test to increase as the physical score increases (the physical_score coefficient is positive).
- Because both features were standardized, one unit here is one standard deviation of the feature, so the magnitudes can be compared directly: physical_score (3.46) is a stronger predictor than age (0.95).
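
To put numbers on this, the printed coefficients can be exponentiated to get odds multipliers. This is a small sketch that copies the values from the output above; the column order (age, physical_score) is assumed from how X was built, and the name `odds_multiplier` is illustrative:

```python
import numpy as np

# Coefficients as printed by log_model.coef_ above, in column order (age, physical_score)
coef = np.array([-0.94953524, 3.45991194])

# exp(coef) is the factor applied to the odds of passing per 1-unit increase
# in the scaled feature, i.e. per one standard deviation of the raw feature
odds_multiplier = np.exp(coef)

for name, factor in zip(["age", "physical_score"], odds_multiplier):
    print(f"{name}: odds multiplied by {factor:.2f}")
```

A multiplier below 1 shrinks the odds and a multiplier above 1 grows them, which matches the signs discussed above.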

Model Performance on Classification Tasks

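# Note: plot_confusion_matrix was removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,plot_confusion_matrix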

y_pred = log_model.predict(scaled_X_test)

accuracy_score(y_test,y_pred)

RESULT

0.93

confusion_matrix(y_test,y_pred)

RESULT

array([[172,  21],
       [ 14, 293]], dtype=int64)

plot_confusion_matrix(log_model,scaled_X_test,y_test)

RESULT

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19ceb65e588>

PLOT

# Normalized over the true labels, so each row sums to 1
plot_confusion_matrix(log_model,scaled_X_test,y_test,normalize='true')

RESULT

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19ceb691b88>

PLOT
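
As a rough illustration of what normalize='true' does, the raw confusion matrix printed earlier can be normalized by hand with NumPy. The values are copied from that output; `cm_true` is an illustrative name:

```python
import numpy as np

# Confusion matrix printed earlier: rows are true classes, columns are predictions
cm = np.array([[172, 21],
               [14, 293]])

# Divide every row by its total, which is what normalize='true' does
cm_true = cm / cm.sum(axis=1, keepdims=True)

print(cm_true.round(3))
```

Each diagonal entry of the normalized matrix is then the recall of the corresponding class.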

print(classification_report(y_test,y_pred))

STDOUT

              precision    recall  f1-score   support

           0       0.92      0.89      0.91       193
           1       0.93      0.95      0.94       307
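
The figures in this report follow directly from the confusion matrix shown earlier. A short sketch that recomputes them by hand, using the counts copied from that output (variable names are illustrative):

```python
import numpy as np

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = np.array([[172, 21],
               [14, 293]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Class 1 (test passed)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (no pass): the roles of the two classes are swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

Rounded to two decimals, these agree with the accuracy score and the per-class rows of the report.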

X_train.iloc[0]

RESULT

age               32.0
physical_score    43.0
Name: 141, dtype: float64

y_train.iloc[0]

RESULT

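# The output below shows 0% probability for class 0 and 100% for class 1.
# Caution: log_model was fit on scaled features, but this row is passed unscaled,
# so the probabilities saturate; apply scaler.transform to the row first for a meaningful estimate.
log_model.predict_proba(X_train.iloc[0].values.reshape(1, -1))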

RESULT

array([[0., 1.]])

log_model.predict(X_train.iloc[0].values.reshape(1, -1))

RESULT

array([1], dtype=int64)
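
Since the model was trained on scaled features, a single row should go through the same scaler before prediction. The sketch below shows the difference on a synthetic stand-in for the hearing data; the data, the label rule and the example row are all invented for illustration and are not the notebook's values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hearing data (made-up, for illustration only)
rng = np.random.default_rng(0)
age = rng.uniform(18, 90, size=500)
physical_score = rng.uniform(10, 50, size=500)
X = np.column_stack([age, physical_score])
# Label rule invented for this sketch: higher score and lower age -> pass
y = (physical_score - 0.3 * age + rng.normal(0, 3, size=500) > 15).astype(int)

scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X), y)

raw_row = np.array([[30.0, 20.0]])  # hypothetical participant: age 30, score 20
proba_raw = model.predict_proba(raw_row)                       # wrong: unscaled input
proba_scaled = model.predict_proba(scaler.transform(raw_row))  # matches the training scale

print(proba_raw)
print(proba_scaled)
```

Wrapping the scaler and the model in a scikit-learn Pipeline is a common way to avoid this mistake, because the pipeline applies the scaling automatically on every predict call.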

Evaluating Curves and AUC

Make sure to watch the video on this!

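# Note: plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2;
# newer versions provide PrecisionRecallDisplay.from_estimator and RocCurveDisplay.from_estimator instead
from sklearn.metrics import precision_recall_curve,plot_precision_recall_curve,plot_roc_curve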

plot_precision_recall_curve(log_model,scaled_X_test,y_test)

RESULT

<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x19cec76dac8>

PLOT

plot_roc_curve(log_model,scaled_X_test,y_test)

RESULT

<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19ceb5c4288>

PLOT
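
The plots above are drawn from arrays that can also be computed directly. A minimal sketch with four made-up labels and scores (not the hearing data) shows the underlying numbers; with the notebook's objects one would presumably pass y_test and log_model.predict_proba(scaled_X_test)[:, 1] instead:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false positive rate vs true positive rate per threshold
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)

# Points of the precision-recall curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

print(auc)  # 0.75
```

An AUC of 1.0 would mean every positive example is scored above every negative one, while 0.5 corresponds to random ranking.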
