Naive Bayes and NLP
03 Text Classification Assessment Solution
Data Science · May 2026 · Notebook lesson

Notebook converted from Jupyter for blog publishing.

Driptanil Datta, Software Developer
Text Classification Assessment - Solution

Goal: Given a set of text movie reviews that have been labeled negative or positive, train a model that predicts the sentiment of a review.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

Complete the tasks in bold below!

TASK: Perform imports and load the dataset into a pandas DataFrame. For this exercise you can load the dataset from '../DATA/moviereviews.csv'.

# CODE HERE
import numpy as np
import pandas as pd
df = pd.read_csv('../DATA/moviereviews.csv')
df.head()
	label	review
0	neg	how do films like mouse hunt get into theatres...

TASK: Check to see if there are any missing values in the DataFrame.

#CODE HERE
# Check for NaN values:
df.isnull().sum()
RESULT
label      0
review    35
dtype: int64

TASK: Remove any reviews that are NaN

df = df.dropna()

TASK: Check to see if any reviews are blank strings and not just NaN. Note: this means a review text could be "" or " " or some other larger whitespace-only string. How would you check for this? Note: there are many ways! Once you've discovered the blank-string reviews, go ahead and remove them as well.

df['review'].str.isspace().sum()
RESULT
27
df[df['review'].str.isspace()]
(truncated HTML table output: the 27 whitespace-only rows, e.g. index 57 (label neg) and index 71)
df = df[~df['review'].str.isspace()]
df.info()
STDOUT
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1938 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
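As the task notes, there are many ways to catch blank reviews. One subtlety: `''.isspace()` is `False`, so the `str.isspace()` check above finds whitespace-only strings but would miss truly empty ones. A minimal sketch on a made-up four-row frame, using a `str.strip()`-based check that catches both cases:

```python
import pandas as pd

# Hypothetical sample frame: two real reviews, one empty string, one whitespace-only
df = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['bad film', 'great film', '', '   ']
})

# str.strip() reduces any whitespace-only string (including '') to '',
# so this flags both '' and '   ' in one pass
blank = df['review'].str.strip() == ''
print(blank.sum())  # 2

# Drop the flagged rows
df = df[~blank]
print(len(df))  # 2
```

On the real dataset this check would only matter if truly empty strings were present; here the 27 whitespace-only rows are the same either way.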

TASK: Confirm the value counts per label:

#CODE HERE
df['label'].value_counts()
RESULT
pos    969
neg    969
Name: label, dtype: int64

EDA on Bag of Words

Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note: this is a bonus task, as we did not show this in the lectures, but a quick cursory Google search should put you on the right path.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
matrix = cv.fit_transform(df[df['label']=='neg']['review'])
# Note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is the current equivalent
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])
# sort from largest to smallest
print("Top 20 words used for Negative reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
STDOUT
Top 20 words used for Negative reviews.
[('film', 4063), ('movie', 3131), ('like', 1808), ('just', 1480), ('time', 1127), ('good', 1117), ('bad', 997), ('character', 926), ('story', 908), ('plot', 888), ('characters', 838), ('make', 813), ('really', 743), ('way', 734), ('little', 696), ('don', 683), ('does', 666), ('doesn', 648), ('action', 635), ('scene', 634)]
matrix = cv.fit_transform(df[df['label']=='pos']['review'])
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])
# sort from largest to smallest
print("Top 20 words used for Positive reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
STDOUT
Top 20 words used for Positive reviews.
[('film', 5002), ('movie', 2389), ('like', 1721), ('just', 1273), ('story', 1199), ('good', 1193), ('time', 1175), ('character', 1037), ('life', 1032), ('characters', 957), ('way', 864), ('films', 851), ('does', 828), ('best', 788), ('people', 769), ('make', 764), ('little', 751), ('really', 731), ('man', 728), ('new', 702)]

Training and Data

TASK: Split the data into features and a label (X and y), then perform a train/test split. You may use whatever settings you like; to compare your results to the solution notebook, use test_size=0.20, random_state=101.

#CODE HERE
from sklearn.model_selection import train_test_split
 
X = df['review']
y = df['label']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
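One optional refinement not used above: passing `stratify=y` keeps the pos/neg ratio identical in both splits, which is useful on smaller or imbalanced datasets. A sketch on made-up labels:

```python
from sklearn.model_selection import train_test_split

# Made-up placeholder reviews; the notebook uses df['review'] / df['label']
X = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8']
y = ['pos', 'neg'] * 4

# stratify=y preserves the 50/50 class balance in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y)
print(sorted(y_test))  # ['neg', 'pos'] -- exactly one sample of each class
```

Since this dataset ended up balanced (969 per label), plain splitting works fine here too.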

Training a Model

TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.

#CODE HERE
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
# Feed the training data through the pipeline
pipe.fit(X_train, y_train)
RESULT
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
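Once fitted, the pipeline applies the same TF-IDF transform to raw strings before classifying, so new reviews can be passed straight to `predict`. A self-contained sketch with a made-up four-review training set (the notebook itself fits on `X_train`/`y_train`):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration
texts = ["great wonderful film", "loved it great acting",
         "terrible boring film", "awful waste of time"]
labels = ["pos", "pos", "neg", "neg"]

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
pipe.fit(texts, labels)

# Raw text in, label out -- no manual vectorization step needed
print(pipe.predict(["what a great wonderful movie"]))  # ['pos']
```

This is the main payoff of wrapping the vectorizer and classifier together: the test set (and any future review) goes through exactly the vocabulary and IDF weights learned from the training data.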

TASK: Create a classification report and plot a confusion matrix based on the results of your pipeline.

#CODE HERE
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent.
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
preds = pipe.predict(X_test)
print(classification_report(y_test, preds))
STDOUT
              precision    recall  f1-score   support

         neg       0.81      0.86      0.83       191
         pos       0.85      0.81      0.83       197
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)
PLOT
(confusion matrix for the test set)

Great job!
