Notebook converted from Jupyter for blog publishing.
Dealing with Categorical Data
Many machine learning models cannot deal with categorical data stored as strings. For example, linear regression cannot apply a beta coefficient to colors like "red" or "blue". Instead, we need to convert these categories into "dummy" variables, otherwise known as "one-hot" encoding.
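As a minimal sketch of what one-hot encoding produces, the snippet below encodes a small, made-up color column. The `colors` frame and its values are hypothetical and are not part of the housing data; `dtype=int` is passed only so the result prints as 0/1.

```python
import pandas as pd

# Hypothetical toy data: a color column stored as strings.
colors = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One new column per category; each row has a single 1 marking its color.
one_hot = pd.get_dummies(colors["color"], dtype=int)
print(one_hot)
```

Each resulting column is numeric, so a model such as linear regression can assign it its own coefficient.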
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
We will open the .csv file that was "cleaned" in the previous lectures to remove outliers and NaN values.
df = pd.read_csv("../DATA/Ames_NO_Missing_Data.csv")
df.head()
Output (truncated), columns: MS SubClass, MS Zoning, Lot Frontage, Lot Area, Street
Data Description
with open('../DATA/Ames_Housing_Feature_Description.txt','r') as f:
    print(f.read())
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
Numerical Column to Categorical
We need to be careful when it comes to encoding categories as numbers. We want to make sure that the numerical relationship makes sense for a model. For example, the MSSubClass encoding is essentially just a number code per class:
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
The codes themselves carry no numerical relationship to one another. While 30 > 20 is True, it makes no sense to say that "1-STORY 1945 & OLDER" > "1-STORY 1946 & NEWER ALL STYLES". This is not always the case: 1st class versus 2nd class seats encoded as 1 and 2, for example, do have a meaningful order. Make sure you understand your data set well enough to decide what needs to be converted or changed.
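The effect of the column's dtype can be sketched on a small, made-up frame. The `toy` frame, its `code` and `area` columns, and their values are hypothetical. The sketch uses `astype(str)`, which for this purpose plays the same role as `apply(str)`: while the codes are integers, `pd.get_dummies` leaves them as a single numeric column; once they are strings, each code becomes its own category.

```python
import pandas as pd

# Hypothetical toy frame: "code" is a nominal label that happens to be an integer.
toy = pd.DataFrame({"code": [20, 30, 20, 60], "area": [800, 950, 700, 1200]})

# With an integer dtype, get_dummies leaves "code" untouched.
as_numbers = pd.get_dummies(toy, dtype=int)

# After converting to string, "code" is treated as a category and expanded.
toy["code"] = toy["code"].astype(str)
as_categories = pd.get_dummies(toy, dtype=int)

print(list(as_numbers.columns))
print(list(as_categories.columns))
```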
MSSubClass
# Convert to String
df['MS SubClass'] = df['MS SubClass'].apply(str)
Creating "Dummy" Variables
Avoiding Multicollinearity and the Dummy Variable Trap
https://stats.stackexchange.com/questions/144372/dummy-variable-trap
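One way to see the trap, sketched here under the assumption of a linear model with an intercept term: if every category gets its own dummy column, those columns always sum to 1, which duplicates the intercept column and leaves the design matrix rank deficient. The names `X_full` and `X_reduced` are illustrative only; the series is the six-value Dead/Alive example used in this section.

```python
import numpy as np
import pandas as pd

state = pd.Series(["Dead", "Alive", "Dead", "Alive", "Dead", "Dead"])

full = pd.get_dummies(state, dtype=int)                      # columns: Alive, Dead
reduced = pd.get_dummies(state, drop_first=True, dtype=int)  # column: Dead only

# Design matrices with an intercept column of ones prepended.
ones = np.ones(len(state))
X_full = np.column_stack([ones, full.to_numpy()])
X_reduced = np.column_stack([ones, reduced.to_numpy()])

# Alive + Dead equals the intercept column, so X_full loses a rank.
print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1], "columns")
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1], "columns")
```

Dropping one dummy per feature removes the redundancy without losing information, since the dropped category is implied whenever all remaining dummies are 0.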
person_state = pd.Series(['Dead','Alive','Dead','Alive','Dead','Dead'])
person_state
0 Dead
1 Alive
2 Dead
3 Alive
4 Dead
pd.get_dummies(person_state)
   Alive  Dead
0      0     1
pd.get_dummies(person_state,drop_first=True)
   Dead
0     1
1     0
Creating Dummy Variables from Object Columns
df.select_dtypes(include='object')
Output (truncated), columns: MS SubClass, MS Zoning, Street, Lot Shape, Land Contour
df_nums = df.select_dtypes(exclude='object')
df_objs = df.select_dtypes(include='object')
df_nums.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
df_objs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
Converting
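The split, encode, and concatenate pattern used in this section can be sketched on a small, made-up frame. The `toy` frame and its values are hypothetical stand-ins for the housing data, and the string columns are given `dtype=object` explicitly, an assumption made so that `select_dtypes(include='object')` picks them up.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "Lot Shape": pd.Series(["Reg", "IR1", "Reg"], dtype=object),
})

# 1. Split into numeric and string (object) columns.
toy_nums = toy.select_dtypes(exclude="object")
toy_objs = toy.select_dtypes(include="object")

# 2. Encode only the string columns, dropping the first category of each.
toy_objs = pd.get_dummies(toy_objs, drop_first=True, dtype=int)

# 3. Put the two halves back together column-wise.
toy_final = pd.concat([toy_nums, toy_objs], axis=1)
print(toy_final)
```

Splitting first keeps the numeric columns untouched and makes it easy to inspect exactly which columns are about to be expanded.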
df_objs = pd.get_dummies(df_objs,drop_first=True)
final_df = pd.concat([df_nums,df_objs],axis=1)
final_df
Output (truncated), columns: Lot Frontage, Lot Area, Overall Qual, Overall Cond, Year Built
Final Thoughts
Keep in mind, we don't know whether all 274 columns are useful. More columns do not necessarily lead to better results. In fact, we may want to remove columns (or later on use a model with regularization to choose important columns for us). What we have done here has greatly reduced the ratio of rows to columns, which may actually lead to worse performance (however, you won't know until you've actually compared multiple models/approaches).
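How quickly the column count grows can be estimated before encoding. As a rough sketch on a made-up frame (the `toy_objs` frame here and its values are hypothetical), with `drop_first=True` each string column contributes its number of distinct categories minus one:

```python
import pandas as pd

# Hypothetical toy frame of string columns.
toy_objs = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),
    "Lot Shape": pd.Series(["Reg", "IR1", "IR2", "Reg"], dtype=object),
})

# With drop_first=True, each column yields (number of categories - 1) dummies.
expected = int((toy_objs.nunique() - 1).sum())
actual = pd.get_dummies(toy_objs, drop_first=True, dtype=int).shape[1]
print(expected, actual)
```

Columns with many distinct values therefore dominate the final width, which makes them natural candidates to review first when trimming features.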
final_df.corr()['SalePrice'].sort_values()
Exter Qual_TA -0.591459
Kitchen Qual_TA -0.527461
Fireplace Qu_None -0.481740
Bsmt Qual_TA -0.453022
Garage Finish_Unf -0.422363
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Most likely a human realtor rated this "Overall Qual" column, which means it very likely takes many of the other features into account. It also means that any future house we intend to predict a price for will need this "Overall Qual" feature, which implies that every new house priced with our ML model will still require a human to rate it!
Save Final DF
final_df.to_csv('../DATA/AMES_Final_DF.csv')
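One detail worth knowing when reloading a saved file: by default `to_csv` also writes the DataFrame index, which comes back from `read_csv` as an extra unnamed column. The sketch below shows the difference on a small, made-up frame, using an in-memory buffer instead of a file on disk. Whether to pass `index=False` is a choice for your own workflow, not something the original notebook prescribes.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the final encoded data.
toy = pd.DataFrame({"Lot Area": [8450, 9600], "Street_Pave": [1, 0]})

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False: only the data columns are written.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```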