++++Notebook converted from Jupyter for blog publishing.
01-DataFrames
DataFrames
Throughout the course, most of our data exploration will be done with DataFrames. DataFrames are an extremely powerful tool and a natural extension of the Pandas Series. By definition all a DataFrame is:
A Pandas DataFrame consists of multiple Pandas Series that share index values.
Imports
import numpy as np
import pandas as pdCreating a DataFrame from Python Objects
# help(pd.DataFrame)# Make sure the seed is in the same cell as the random call
# https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
np.random.seed(101)
mydata = np.random.randint(0,101,(4,3))mydataarray([[95, 11, 81],
[70, 63, 87],
[75, 9, 77],
[40, 4, 63]])myindex = ['CA','NY','AZ','TX']mycolumns = ['Jan','Feb','Mar']df = pd.DataFrame(data=mydata)
df0
1
2
0
95df = pd.DataFrame(data=mydata,index=myindex)
df0
1
2
CA
95df = pd.DataFrame(data=mydata,index=myindex,columns=mycolumns)
dfJan
Feb
Mar
CA
95df.info()<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
Jan 4 non-null int32
Feb 4 non-null int32Reading a .csv file for a DataFrame
NOTE: We will go over all kinds of data inputs and outputs (.html, .csv, .xlxs , etc...) later on in the course! For now we just need to read in a simple .csv file.
CSV
Comma Separated Values files are text files that use commas as field delimeters.
Unless you're running the virtual environment included with the course, you may need to install xlrd and openpyxl.
In your terminal/command prompt run:
conda install xlrd conda install openpyxl
Then restart Jupyter Notebook. (or use pip install if you aren't using the Anaconda Distribution)
Understanding File Paths
You have two options when reading a file with pandas:
-
If your .py file or .ipynb notebook is located in the exact same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:
df = pd.read_csv('some_file.csv')
-
Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:
df = pd.read_csv("C:\Users\myself\files\some_file.csv")
Print your current directory file path with pwd
pwd'C:\\Users\\Marcial\\Pierian-Data-Courses\\Machine-Learning-MasterClass\\03-Pandas'List the files in your current directory with ls
ls Volume in drive C has no label.
Volume Serial Number is 3652-BD2F
Directory of C:\Users\Marcial\Pierian-Data-Courses\Machine-Learning-MasterClass\03-Pandas
df = pd.read_csv('tips.csv')dftotal_bill
tip
sex
smoker
dayAbout this DataSet (in case you are interested)
-
Description
- One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
-
Format
- A data frame with 244 rows and 7 variables
-
Details
- tip in dollars,
- bill in dollars,
- sex of the bill payer,
- whether there were smokers in the party,
- day of the week,
- time of day,
- size of the party.
In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).
- References
- Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing:
- Note: We created some additional columns with Fake data, including Name, CC Number, and Payment ID.
DataFrames
Obtaining Basic Information About DataFrame
df.columnsIndex(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
dtype='object')df.indexRangeIndex(start=0, stop=244, step=1)df.head(3)total_bill
tip
sex
smoker
daydf.tail(3)total_bill
tip
sex
smoker
daydf.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
total_bill 244 non-null float64
tip 244 non-null float64len(df)244df.describe()total_bill
tip
size
price_per_person
CC Numberdf.describe().transpose()count
mean
std
min
25%Selection and Indexing
Let's learn how to retrieve information from a DataFrame.
COLUMNS
We will begin be learning how to extract information based on the columns
df.head()total_bill
tip
sex
smoker
dayGrab a Single Column
df['total_bill']0 16.99
1 10.34
2 21.01
3 23.68
4 24.59type(df['total_bill'])pandas.core.series.SeriesGrab Multiple Columns
# Note how its a python list of column names! Thus the double brackets.
df[['total_bill','tip']]total_bill
tip
0
16.99
1.01Create New Columns
df['tip_percentage'] = 100* df['tip'] / df['total_bill']df.head()total_bill
tip
sex
smoker
daydf['price_per_person'] = df['total_bill'] / df['size']df.head()total_bill
tip
sex
smoker
dayhelp(np.round)Help on function round_ in module numpy:
round_(a, decimals=0, out=None)
Round an array to the given number of decimals.
Adjust Existing Columns
# Because pandas is based on numpy, we get awesome capabilities with numpy's universal functions!
df['price_per_person'] = np.round(df['price_per_person'],2)df.head()total_bill
tip
sex
smoker
dayRemove Columns
# df.drop('tip_percentage',axis=1)df = df.drop("tip_percentage",axis=1)df.head()total_bill
tip
sex
smoker
dayIndex Basics
Before going over the same retrieval tasks for rows, let's build some basic understanding of the pandas DataFrame Index.
df.head()total_bill
tip
sex
smoker
daydf.indexRangeIndex(start=0, stop=244, step=1)df.set_index('Payment ID')total_bill
tip
sex
smoker
daydf.head()total_bill
tip
sex
smoker
daydf = df.set_index('Payment ID')df.head()total_bill
tip
sex
smoker
daydf = df.reset_index()df.head()Payment ID
total_bill
tip
sex
smokerROWS
Let's now explore these same concepts but with Rows.
df.head()Payment ID
total_bill
tip
sex
smokerdf = df.set_index('Payment ID')df.head()total_bill
tip
sex
smoker
dayGrab a Single Row
# Integer Based
df.iloc[0]total_bill 16.99
tip 1.01
sex Female
smoker No
day Sun# Name Based
df.loc['Sun2959']total_bill 16.99
tip 1.01
sex Female
smoker No
day SunGrab Multiple Rows
df.iloc[0:4]total_bill
tip
sex
smoker
daydf.loc[['Sun2959','Sun5260']]total_bill
tip
sex
smoker
dayRemove Row
Typically are datasets will be large enough that we won't remove rows like this since we won't know thier row location for some specific condition, instead, we drop rows based on conditions such as missing data or column values. The next lecture will cover this in a lot more detail.
df.head()total_bill
tip
sex
smoker
daydf.drop('Sun2959',axis=0).head()total_bill
tip
sex
smoker
day# Error if you have a named index!
# df.drop(0,axis=0).head()Insert a New Row
Pretty rare to add a single row like this. Usually you use pd.concat() to add many rows at once. You could use the .append() method with a list of pd.Series() objects, but you won't see us do this with realistic real-world data.
one_row = df.iloc[0]one_rowtotal_bill 16.99
tip 1.01
sex Female
smoker No
day Suntype(one_row)pandas.core.series.Seriesdf.tail()total_bill
tip
sex
smoker
daydf.append(one_row).tail()total_bill
tip
sex
smoker
day