
Data Science

May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.

03-PCA-Exercise-Project-Solutions

Driptanil Datta, Software Developer

Principal Component Analysis - Project Exercise Solutions

GOAL: Figure out which handwritten digits are most differentiated with PCA.

Imagine you are working on an image recognition service for a postal service. It would be very useful to be able to read in the digits automatically, even if they are handwritten. (Quick note: this is how modern postal services have worked for a long time now, and it is actually more accurate than a human.) The manager of the postal service wants to know which handwritten numbers are the hardest to tell apart, so he can focus on getting more labeled examples of those digits. You will have a dataset of handwritten digits (a very famous dataset) and you will perform PCA to get better insight into which numbers are easily separable from the rest.

Data

Background:

E. Alpaydin, Fevzi Alimoglu, Department of Computer Engineering, Bogazici University, 80815 Istanbul, Turkey. alpaydin '@' boun.edu.tr

Data Set Information from Original Authors:

We create a digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing. This database is also available in the UNIPEN format.

We use a WACOM PL-100V pressure sensitive tablet with an integrated LCD display and a cordless stylus. The input and display areas are located in the same place. Attached to the serial port of an Intel 486 based PC, it allows us to collect handwriting samples. The tablet sends $x$ and $y$ tablet coordinates and pressure level values of the pen at fixed time intervals (sampling rate) of 100 milliseconds.

These writers are asked to write 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. Subjects are monitored only during the first entry screens. Each screen contains five boxes with the digits to be written displayed above. Subjects are told to write only inside these boxes. If they make a mistake or are unhappy with their writing, they are instructed to clear the content of a box by using an on-screen button. The first ten digits are ignored because most writers are not familiar with this type of input device, but subjects are not aware of this.

SOURCE: https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

Complete the Tasks in bold below

TASK: Run the cells below to import the libraries and relevant data set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

digits = pd.read_csv('../DATA/digits.csv')

digits

RESULT

(DataFrame preview; the first columns shown are pixel_0_0, pixel_0_1, pixel_0_2, pixel_0_3, pixel_0_4.)

TASK: Create a new DataFrame called pixels that consists only of the pixel feature values by dropping the number_label column.

#CODE HERE

pixels = digits.drop('number_label',axis=1)

pixels

RESULT

(DataFrame preview; the first columns shown are pixel_0_0, pixel_0_1, pixel_0_2, pixel_0_3, pixel_0_4.)

Displaying an Image

TASK: Grab a single image row representation by getting the first row of the pixels DataFrame.

#CODE HERE

single_image = pixels.iloc[0]

single_image

RESULT

pixel_0_0     0.0
pixel_0_1     0.0
pixel_0_2     5.0
pixel_0_3    13.0
pixel_0_4     9.0

TASK: Convert this single row Series into a numpy array.

#CODE HERE

single_image.to_numpy()

RESULT

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

TASK: Reshape this numpy array into an (8,8) array.

#CODE HERE

single_image.to_numpy().shape

RESULT

(64,)

single_image.to_numpy().reshape(8,8)

RESULT

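array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])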
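The reshape works because NumPy fills the (8,8) array in row-major order, which appears to match the pixel_row_col naming of the columns. The sketch below rebuilds the first image from the 64 values printed earlier and checks that mapping. The helper name flat_index is hypothetical and only used for illustration.

```python
import numpy as np

# First row of the pixels DataFrame, copied from the 64-value array output above.
flat = np.array([
    0, 0, 5, 13, 9, 1, 0, 0,
    0, 0, 13, 15, 10, 15, 5, 0,
    0, 3, 15, 2, 0, 11, 8, 0,
    0, 4, 12, 0, 0, 8, 8, 0,
    0, 5, 8, 0, 0, 9, 8, 0,
    0, 4, 11, 0, 1, 12, 7, 0,
    0, 2, 14, 5, 10, 12, 0, 0,
    0, 0, 6, 13, 10, 0, 0, 0,
], dtype=float)

# reshape uses row-major (C) order by default: the first 8 values become row 0.
image = flat.reshape(8, 8)


def flat_index(row, col, width=8):
    """Hypothetical helper: position of pixel (row, col) in the flat vector."""
    return row * width + col


# pixel_1_2 sits at flat position 10 and lands at image[1, 2].
assert image[1, 2] == flat[flat_index(1, 2)]
```

If the columns were stored in a different order, the reshaped image would look scrambled, so plotting one image is a quick sanity check on the layout.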

TASK: Use Matplotlib or Seaborn to display the array as an image representation of the number drawn. Remember your palette or cmap choice would change the colors, but not the actual pixel values.

#CODE HERE

plt.imshow(single_image.to_numpy().reshape(8,8))

RESULT

<matplotlib.image.AxesImage at 0x1d45ca0e608>

PLOT

plt.imshow(single_image.to_numpy().reshape(8,8),cmap='gray')

RESULT

<matplotlib.image.AxesImage at 0x1d45c508f88>

PLOT

sns.heatmap(single_image.to_numpy().reshape(8,8),annot=True,cmap='gray')

RESULT

<AxesSubplot:>

PLOT

Now let's move on to PCA.

Scaling Data

TASK: Use Scikit-Learn to scale the pixel feature dataframe.

#CODE HERE

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_pixels = scaler.fit_transform(pixels)

scaled_pixels

RESULT

array([[ 0.        , -0.33501649, -0.04308102, ..., -1.14664746,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  0.54856067,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  1.56568555,
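For intuition, here is a minimal NumPy sketch of the transformation StandardScaler applies: subtract each column's mean and divide by its population standard deviation. A column with zero variance is left at 0 rather than divided by zero, which would explain the leading 0. in every row above if pixel_0_0 is constant across the dataset (an assumption, not verified here). The function name standardize and the toy data are illustrative only.

```python
import numpy as np


def standardize(X):
    """Column-wise z-scores: (x - mean) / std, with constant columns mapped to 0."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0      # constant column: divide by 1 so the result is 0, not NaN
    return (X - mean) / std


# Toy data: the first column is constant, like a pixel that is never inked.
toy = np.array([[0.0, 1.0, 10.0],
                [0.0, 3.0, 20.0],
                [0.0, 5.0, 30.0]])
scaled = standardize(toy)
```

After scaling, every non-constant column has mean 0 and standard deviation 1, so no single pixel dominates the principal components just because its raw values are larger.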

PCA

TASK: Perform PCA on the scaled pixel data set with 2 components.

from sklearn.decomposition import PCA

pca_model = PCA(n_components=2)

pca_pixels = pca_model.fit_transform(scaled_pixels)

TASK: How much variance is explained by 2 principal components?

#CODE HERE

np.sum(pca_model.explained_variance_ratio_)

RESULT

0.21594970492246052
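Roughly 21.6% means the 2D projection discards most of the variance in the 64 scaled pixels, so the scatterplot is only a partial view. To show where such a ratio comes from, the sketch below computes explained variance ratios directly from the singular values of centered data and then counts how many components are needed to reach a chosen threshold. The function name, the synthetic data, and the 90% threshold are all assumptions for illustration.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values come back sorted from largest to smallest.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic data with five features of very different spread.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

ratios = explained_variance_ratio(toy)
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 90%.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
```

In the notebook, the same cumulative view should be available by fitting PCA without n_components and applying np.cumsum to explained_variance_ratio_.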

TASK: Create a scatterplot of the digits in the 2 dimensional PCA space, color/label based on the original number_label column in the original dataset.

#CODE HERE

plt.figure(figsize=(10,6),dpi=150)
labels = digits['number_label'].values
sns.scatterplot(x=pca_pixels[:,0],y=pca_pixels[:,1],hue=labels,palette='Set1')
plt.legend(loc=(1.05,0))

RESULT

<matplotlib.legend.Legend at 0x1d45c6c33c8>

PLOT

TASK: Which numbers are the most "distinct"?

# You should see label #4 as being the most separated group,
# implying it's the most distinct; a similar situation holds for #2, #6 and #9.
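Reading separation off a scatterplot is subjective, so one rough way to back it up is to measure, for each digit, the distance from its centroid to the nearest other centroid in PCA space. The sketch below is one possible heuristic, not the method used in this notebook; the function name nearest_centroid_gap and the toy clusters are hypothetical. It ignores how spread out each cluster is, so treat the numbers as a first pass.

```python
import numpy as np


def nearest_centroid_gap(points, labels):
    """For each class, distance from its centroid to the closest other centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    # Pairwise Euclidean distances between all centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)   # ignore each centroid's distance to itself
    return dict(zip(classes.tolist(), dists.min(axis=1).tolist()))


# Toy stand-in for the PCA output: classes 0 and 1 sit close together, class 2 is far away.
rng = np.random.default_rng(1)
toy_points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(1, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(8, 8), scale=0.3, size=(50, 2)),
])
toy_labels = np.repeat([0, 1, 2], 50)

gaps = nearest_centroid_gap(toy_points, toy_labels)
most_isolated = max(gaps, key=gaps.get)   # class with the largest gap
```

In the notebook, the call would look like nearest_centroid_gap(pca_pixels, digits['number_label']); digits with the smallest gaps are candidates for the ones that are hardest to tell apart.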

Bonus Challenge

TASK: Create an "interactive" 3D plot of the result of PCA with 3 principal components. There are lots of ways to do this, including different libraries like plotly or bokeh, but you can actually do this with just Matplotlib and Jupyter Notebook. Search Google and StackOverflow if you get stuck; lots of solutions are posted online.

#CODE HERE

from sklearn.decomposition import PCA

pca_model = PCA(n_components=3)

pca_pixels = pca_model.fit_transform(scaled_pixels)

from mpl_toolkits import mplot3d

plt.figure(figsize=(8,8),dpi=150)
ax = plt.axes(projection='3d')
ax.scatter3D(pca_pixels[:,0],pca_pixels[:,1],pca_pixels[:,2],c=digits['number_label']);

PLOT

%matplotlib notebook

ax = plt.axes(projection='3d')
ax.scatter3D(pca_pixels[:,0],pca_pixels[:,1],pca_pixels[:,2],c=digits['number_label']);

RESULT

<IPython.core.display.Javascript object>

Great Job!
