Data Science

May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.

Driptanil Datta, Software Developer

Dealing with Categorical Data

Many machine learning models cannot deal with categorical data stored as strings. For example, linear regression cannot apply a beta coefficient to colors like "red" or "blue". Instead, we need to convert these categories into "dummy" variables, a process otherwise known as "one-hot" encoding.
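As a minimal sketch of the idea, assuming a small made-up color column (the `colors` series and its values are illustrative and not part of the Ames data), `pd.get_dummies` turns each category into its own indicator column:

```python
import pandas as pd

# Hypothetical toy data, not part of the Ames data set
colors = pd.Series(["red", "blue", "red", "green"], name="color")

# One indicator column per category; dtype=int gives 0/1 rather than booleans
color_dummies = pd.get_dummies(colors, dtype=int)
print(color_dummies)
```

Each row has exactly one column set to 1, which is where the name "one-hot" comes from.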

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data

We will open the .csv file that was "cleaned" in the previous lectures to remove outliers and NaN values.

df = pd.read_csv("../DATA/Ames_NO_Missing_Data.csv")

df.head()

HTML

MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street

Data Description

with open('../DATA/Ames_Housing_Feature_Description.txt','r') as f: 
    print(f.read())

STDOUT

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES

Numerical Column to Categorical

We need to be careful when it comes to encoding categories as numbers. We want to make sure that the numerical relationship makes sense for a model. For example, the encoding MSSubClass is essentially just a number code per class:

MSSubClass: Identifies the type of dwelling involved in the sale.

        20   1-STORY 1946 & NEWER ALL STYLES
        30   1-STORY 1945 & OLDER
        40   1-STORY W/FINISHED ATTIC ALL AGES
        45   1-1/2 STORY - UNFINISHED ALL AGES
        50   1-1/2 STORY FINISHED ALL AGES
        60   2-STORY 1946 & NEWER
        70   2-STORY 1945 & OLDER
        75   2-1/2 STORY ALL AGES
        80   SPLIT OR MULTI-LEVEL
        85   SPLIT FOYER
        90   DUPLEX - ALL STYLES AND AGES
        120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
        150  1-1/2 STORY PUD - ALL AGES
        160  2-STORY PUD - 1946 & NEWER
        180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
        190  2 FAMILY CONVERSION - ALL STYLES AND AGES

The number itself does not appear to have a relationship to the other numbers. While 30 > 20 is True, it doesn't really make sense that "1-STORY 1945 & OLDER" > "1-STORY 1946 & NEWER ALL STYLES". Keep in mind, this isn't always the case, for example 1st class seats versus 2nd class seats encoded as 1 and 2. Make sure you fully understand your data set to examine what needs to be converted/changed.
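For the opposite case, where the order does carry meaning, a numeric code can be kept. The sketch below assumes a made-up seat-class column (`seats` and `seat_order` are hypothetical names and values, not from the Ames data) and uses an ordered `pd.Categorical` to produce codes that respect the ranking:

```python
import pandas as pd

# Hypothetical ordered category: seat classes, where 1st < 2nd < 3rd is meaningful
seats = pd.Series(["2nd", "1st", "3rd", "1st"])
seat_order = ["1st", "2nd", "3rd"]

# An ordered categorical records the ranking explicitly
seat_cat = pd.Categorical(seats, categories=seat_order, ordered=True)

# .codes are 0-based positions in seat_order; add 1 to get 1, 2, 3
seat_codes = pd.Series(seat_cat.codes) + 1
print(seat_codes.tolist())
```

A column like MSSubClass has no such ranking, which is why it is converted to strings and one-hot encoded instead.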

MSSubClass

# Convert to String
df['MS SubClass'] = df['MS SubClass'].apply(str)

Creating "Dummy" Variables

Avoiding MultiCollinearity and the Dummy Variable Trap

https://stats.stackexchange.com/questions/144372/dummy-variable-trap

person_state = pd.Series(['Dead', 'Alive', 'Dead', 'Alive', 'Dead', 'Dead'])

person_state

RESULT

0     Dead
1    Alive
2     Dead
3    Alive
4     Dead

pd.get_dummies(person_state)

HTML

   Alive  Dead
0      0     1

pd.get_dummies(person_state,drop_first=True)

HTML

   Dead
0     1
1     0
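One way to see why `drop_first=True` matters is to check the rank of the design matrix a linear model would use. The sketch below assumes an intercept column of ones is added to the dummies (the names `X_full` and `X_reduced` are illustrative). With both dummy columns, `Alive + Dead` always equals the intercept, so one column is redundant:

```python
import numpy as np
import pandas as pd

person_state = pd.Series(['Dead', 'Alive', 'Dead', 'Alive', 'Dead', 'Dead'])

full = pd.get_dummies(person_state, dtype=float)
reduced = pd.get_dummies(person_state, drop_first=True, dtype=float)

# Design matrices with an intercept column of ones prepended
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

# Rank below the column count means the columns are perfectly collinear
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 3 columns, rank 2
print(X_reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

Dropping one dummy per category removes the redundancy without losing information, since the dropped category is implied when all remaining dummies are 0.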

Creating Dummy Variables from Object Columns

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

df.select_dtypes(include='object')

HTML

MS SubClass | MS Zoning | Street | Lot Shape | Land Contour

df_nums = df.select_dtypes(exclude='object')
df_objs = df.select_dtypes(include='object')

df_nums.info()

STDOUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----

df_objs.info()

STDOUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Data columns (total 40 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  -----

Converting

df_objs = pd.get_dummies(df_objs,drop_first=True)

final_df = pd.concat([df_nums,df_objs],axis=1)

final_df

HTML

Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built

Final Thoughts

Keep in mind that we don't know whether all 274 columns are useful. More columns do not necessarily lead to better results. In fact, we may want to remove some columns (or later use a model with regularization to select the important columns for us). What we have done here has greatly increased the number of columns relative to the number of rows, which may actually lead to worse performance (although you won't know until you have compared multiple models and approaches).
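To get a feel for where the extra columns come from, the number of dummy columns can be estimated before converting. This is a sketch on a made-up stand-in for `df_objs` (`toy_objs` and its values are hypothetical). With `drop_first=True`, each categorical column contributes its number of distinct categories minus one:

```python
import pandas as pd

# Hypothetical stand-in for df_objs with two categorical columns
toy_objs = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Lot Shape": ["Reg", "IR1", "IR2", "Reg"],
})

# Each column adds (number of distinct categories - 1) dummy columns
expected_cols = int((toy_objs.nunique() - 1).sum())

# Compare against what get_dummies actually produces
actual_cols = pd.get_dummies(toy_objs, drop_first=True).shape[1]
print(expected_cols, actual_cols)
```

Running the same `nunique()` check on the real object columns would show which features contribute the most dummy columns and are candidates for grouping or removal.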

final_df.corr()['SalePrice'].sort_values()

RESULT

Exter Qual_TA       -0.591459
Kitchen Qual_TA     -0.527461
Fireplace Qu_None   -0.481740
Bsmt Qual_TA        -0.453022
Garage Finish_Unf   -0.422363

OverallQual: Rates the overall material and finish of the house

        10   Very Excellent
        9    Excellent
        8    Very Good
        7    Good
        6    Above Average
        5    Average
        4    Below Average
        3    Fair
        2    Poor
        1    Very Poor

Most likely a human realtor rated this "Overall Qual" column, which means it very likely takes many of the other features into account. It also means that any future house we intend to predict a price for will need this "Overall Qual" feature, which implies that every new house priced with our ML model will still require a human to rate it!

Save Final DF

final_df.to_csv('../DATA/AMES_Final_DF.csv')
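Note that `to_csv` writes the index as an extra unnamed column by default. The sketch below uses a made-up two-row frame and an in-memory buffer (`toy_df` and its values are hypothetical) to show what comes back when the file is read again, with and without `index=False`:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for final_df
toy_df = pd.DataFrame({"Lot Area": [8450, 9600], "SalePrice": [208500, 181500]})

# Default behaviour: the index is written as an unnamed first column
buffer_with_index = io.StringIO()
toy_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

# index=False leaves the index out of the file
buffer_without_index = io.StringIO()
toy_df.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
without_index = pd.read_csv(buffer_without_index)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the file is saved with the default settings, reading it back with `pd.read_csv(..., index_col=0)` is one way to avoid carrying the extra column into later notebooks.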
