Naive Bayes and NLP
02 Text Classification Assessment
Data Science
May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.


Driptanil Datta, Software Developer

Text Classification Assessment

Goal: Given a set of text movie reviews that have been labeled negative or positive, train a model that predicts the label of a review from its text.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

Complete the tasks in bold below!

TASK: Perform imports and load the dataset into a pandas DataFrame. For this exercise you can load the dataset from '../DATA/moviereviews.csv'.

# CODE HERE
import numpy as np
import pandas as pd

# Load the labeled movie reviews into a DataFrame
df = pd.read_csv('../DATA/moviereviews.csv')

# Preview the first rows
df.head()
  label                                             review
0   neg  how do films like mouse hunt get into theatres...

TASK: Check to see if there are any missing values in the dataframe.

#CODE HERE
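
One way to approach this check is to count nulls per column with `isnull().sum()`. The sketch below uses a tiny hypothetical DataFrame as a stand-in for the real reviews, so the values and the name `missing_counts` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (not the real dataset)
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["a dull film", np.nan, "not funny at all", "a real delight"],
})

# Count missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)
```

On the real data, the same call on `df` shows how many reviews are NaN.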

TASK: Remove any reviews that are NaN
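
A minimal sketch of one possible approach, again on a hypothetical toy DataFrame: `dropna` with `subset` removes only the rows whose review text is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (not the real dataset)
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["a dull film", np.nan, "not funny at all", "a real delight"],
})

# Drop rows where the review column is NaN
df = df.dropna(subset=["review"])
print(len(df))
```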

TASK: Check whether any reviews are blank strings rather than NaN. A review's text could be "" or " " or a longer whitespace-only string. How would you check for this? There are many ways! Once you have found the blank reviews, remove them as well.
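
One of the many possible ways, sketched on hypothetical toy data: strip whitespace from each review and compare the result to the empty string. This catches both "" and whitespace-only strings, whereas `str.isspace()` alone would miss the empty string. The name `is_blank` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (not the real dataset)
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["a dull film", "", "   ", "a real delight"],
})

# A review is blank if nothing remains after stripping whitespace
is_blank = df["review"].str.strip() == ""

# Inspect which rows are blank, then keep only the non-blank ones
print(df[is_blank].index.tolist())
df = df[~is_blank]
```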

TASK: Confirm the value counts per label:

#CODE HERE
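
A short sketch of how the counts could be confirmed with `value_counts()`. The toy labels below are hypothetical and do not reflect the real class balance.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (not the real dataset)
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos", "pos"],
    "review": ["dull", "fun", "slow", "sharp", "moving"],
})

# Number of reviews per label
label_counts = df["label"].value_counts()
print(label_counts)
```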

EDA on Bag of Words

Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note that this is a bonus task, as we did not show this in the lectures, but a quick Google search should put you on the right path.

#CODE HERE

Training and Data

TASK: Split the data into features and a label (X and y) and then perform a train/test split. You may use whatever settings you like. To compare your results to the solution notebook, use test_size=0.20, random_state=101.

#CODE HERE
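
A minimal sketch of the split using the settings named in the task. The ten toy rows are hypothetical; on the real data, `X` would be the review column and `y` the label column of the cleaned DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned reviews DataFrame (not the real dataset)
df = pd.DataFrame({
    "label": ["neg", "pos"] * 5,
    "review": [f"toy review number {i}" for i in range(10)],
})

# Features are the raw review text; the label is the sentiment
X = df["review"]
y = df["label"]

# Hold out 20% of the rows for testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=101
)
print(len(X_train), len(X_test))
```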

Training a Model

TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.

#CODE HERE
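
A sketch of one such pipeline, assuming `MultinomialNB` as the model since the task leaves the choice open. The training sentences and the step names `"tfidf"` and `"nb"` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data (not the real dataset)
X_train = [
    "boring plot and terrible acting",
    "a terrible waste of time",
    "boring and far too long",
    "a wonderful and moving story",
    "great acting and a wonderful cast",
    "great fun from start to finish",
]
y_train = ["neg", "neg", "neg", "pos", "pos", "pos"]

# Step 1 turns raw text into TF-IDF vectors, step 2 fits the classifier on them
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(X_train, y_train)

# The fitted pipeline accepts raw text directly
predictions = pipe.predict(["wonderful great story", "boring terrible time"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, it is fit on the training text only, and the same transformation is applied automatically at prediction time.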

TASK: Create a classification report and plot a confusion matrix based on the results of your Pipeline.

#CODE HERE
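
A sketch of the evaluation step, assuming scikit-learn 1.0 or later and matplotlib are installed. The labels and predictions below are hypothetical placeholders; in the exercise, the predictions would come from calling `predict` on the fitted pipeline with the test reviews.

```python
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)

# Hypothetical true labels and predictions (placeholders for pipeline output)
y_test = ["neg", "neg", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "pos", "neg"]

# Precision, recall and F1 per label
print(classification_report(y_test, y_pred))

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_test, y_pred, labels=["neg", "pos"])

# Draw the matrix; in a notebook the figure renders inline
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"])
disp.plot()
```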

Great job!

© 2026 Driptanil Datta. All rights reserved.