++++

Data Science

May 2026×Notebook lesson

Notebook converted from Jupyter for blog publishing.

01-DataFrames

Driptanil DattaSoftware Developer

DataFrames

Throughout the course, most of our data exploration will be done with DataFrames. DataFrames are an extremely powerful tool and a natural extension of the Pandas Series. By definition all a DataFrame is:

A Pandas DataFrame consists of multiple Pandas Series that share index values.

Imports

import numpy as np
import pandas as pd

Creating a DataFrame from Python Objects

# help(pd.DataFrame)

# Make sure the seed is in the same cell as the random call
# https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
np.random.seed(101)
mydata = np.random.randint(0,101,(4,3))

mydata

RESULT

array([[95, 11, 81],
       [70, 63, 87],
       [75,  9, 77],
       [40,  4, 63]])

myindex = ['CA','NY','AZ','TX']

mycolumns = ['Jan','Feb','Mar']

df = pd.DataFrame(data=mydata)
df

HTML

df = pd.DataFrame(data=mydata,index=myindex)
df

HTML

df = pd.DataFrame(data=mydata,index=myindex,columns=mycolumns)
df

HTML

Jan
Feb
Mar
CA
95

df.info()

STDOUT

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
Jan    4 non-null int32
Feb    4 non-null int32

Reading a .csv file for a DataFrame

NOTE: We will go over all kinds of data inputs and outputs (.html, .csv, .xlxs , etc...) later on in the course! For now we just need to read in a simple .csv file.

CSV

Comma Separated Values files are text files that use commas as field delimeters.
Unless you're running the virtual environment included with the course, you may need to install xlrd and openpyxl.
In your terminal/command prompt run:

conda install xlrd conda install openpyxl

Then restart Jupyter Notebook. (or use pip install if you aren't using the Anaconda Distribution)

Understanding File Paths

You have two options when reading a file with pandas:

If your .py file or .ipynb notebook is located in the exact same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:

df = pd.read_csv('some_file.csv')
Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:

df = pd.read_csv("C:\Users\myself\files\some_file.csv")

Print your current directory file path with pwd

pwd

RESULT

'C:\\Users\\Marcial\\Pierian-Data-Courses\\Machine-Learning-MasterClass\\03-Pandas'

List the files in your current directory with ls

ls

STDOUT

 Volume in drive C has no label.
 Volume Serial Number is 3652-BD2F

 Directory of C:\Users\Marcial\Pierian-Data-Courses\Machine-Learning-MasterClass\03-Pandas

df = pd.read_csv('tips.csv')

df

HTML

total_bill
tip
sex
smoker
day

About this DataSet (in case you are interested)

Description
- One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
Format
- A data frame with 244 rows and 7 variables
Details
- tip in dollars,
- bill in dollars,
- sex of the bill payer,
- whether there were smokers in the party,
- day of the week,
- time of day,
- size of the party.

In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).

References
- Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing:
Note: We created some additional columns with Fake data, including Name, CC Number, and Payment ID.

DataFrames

Obtaining Basic Information About DataFrame

df.columns

RESULT

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

df.index

RESULT

RangeIndex(start=0, stop=244, step=1)

df.head(3)

HTML

total_bill
tip
sex
smoker
day

df.tail(3)

HTML

total_bill
tip
sex
smoker
day

df.info()

STDOUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
total_bill          244 non-null float64
tip                 244 non-null float64

len(df)

RESULT

df.describe()

HTML

total_bill
tip
size
price_per_person
CC Number

df.describe().transpose()

HTML

count
mean
std
min
25%

Selection and Indexing

Let's learn how to retrieve information from a DataFrame.

COLUMNS

We will begin be learning how to extract information based on the columns

df.head()

HTML

total_bill
tip
sex
smoker
day

Grab a Single Column

df['total_bill']

RESULT

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59

type(df['total_bill'])

RESULT

pandas.core.series.Series

Grab Multiple Columns

# Note how its a python list of column names! Thus the double brackets.
df[['total_bill','tip']]

HTML

total_bill
tip
0
16.99
1.01

Create New Columns

df['tip_percentage'] = 100* df['tip'] / df['total_bill']

df.head()

HTML

total_bill
tip
sex
smoker
day

df['price_per_person'] = df['total_bill'] / df['size']

df.head()

HTML

total_bill
tip
sex
smoker
day

help(np.round)

STDOUT

Help on function round_ in module numpy:

round_(a, decimals=0, out=None)
    Round an array to the given number of decimals.

Adjust Existing Columns

# Because pandas is based on numpy, we get awesome capabilities with numpy's universal functions!
df['price_per_person'] = np.round(df['price_per_person'],2)

df.head()

HTML

total_bill
tip
sex
smoker
day

Remove Columns

# df.drop('tip_percentage',axis=1)

df = df.drop("tip_percentage",axis=1)

df.head()

HTML

total_bill
tip
sex
smoker
day

Index Basics

Before going over the same retrieval tasks for rows, let's build some basic understanding of the pandas DataFrame Index.

df.head()

HTML

total_bill
tip
sex
smoker
day

df.index

RESULT

RangeIndex(start=0, stop=244, step=1)

df.set_index('Payment ID')

HTML

total_bill
tip
sex
smoker
day

df.head()

HTML

total_bill
tip
sex
smoker
day

df = df.set_index('Payment ID')

df.head()

HTML

total_bill
tip
sex
smoker
day

df = df.reset_index()

df.head()

HTML

Payment ID
total_bill
tip
sex
smoker

ROWS

Let's now explore these same concepts but with Rows.

df.head()

HTML

Payment ID
total_bill
tip
sex
smoker

df = df.set_index('Payment ID')

df.head()

HTML

total_bill
tip
sex
smoker
day

Grab a Single Row

# Integer Based
df.iloc[0]

RESULT

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun

# Name Based
df.loc['Sun2959']

RESULT

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun

Grab Multiple Rows

df.iloc[0:4]

HTML

total_bill
tip
sex
smoker
day

df.loc[['Sun2959','Sun5260']]

HTML

total_bill
tip
sex
smoker
day

Remove Row

Typically are datasets will be large enough that we won't remove rows like this since we won't know thier row location for some specific condition, instead, we drop rows based on conditions such as missing data or column values. The next lecture will cover this in a lot more detail.

df.head()

HTML

total_bill
tip
sex
smoker
day

df.drop('Sun2959',axis=0).head()

HTML

total_bill
tip
sex
smoker
day

# Error if you have a named index!
# df.drop(0,axis=0).head()

Insert a New Row

Pretty rare to add a single row like this. Usually you use pd.concat() to add many rows at once. You could use the .append() method with a list of pd.Series() objects, but you won't see us do this with realistic real-world data.

one_row = df.iloc[0]

one_row

RESULT

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun

type(one_row)

RESULT

pandas.core.series.Series

df.tail()

HTML

total_bill
tip
sex
smoker
day

df.append(one_row).tail()

HTML

total_bill
tip
sex
smoker
day

00 Series 02 Conditional Filtering