Notebook converted from Jupyter for blog publishing.
# Text Classification Assessment - Solution
**Goal:** Given a set of text movie reviews that have been labeled negative or positive, build a classifier that predicts the label from the review text.
For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/
Complete the tasks in bold below!
**TASK: Perform imports and load the dataset into a pandas DataFrame.**

For this exercise you can load the dataset from `'../DATA/moviereviews.csv'`.

```python
import numpy as np
import pandas as pd

df = pd.read_csv('../DATA/moviereviews.csv')
df.head()
```

```
  label                                             review
0   neg  how do films like mouse hunt get into theatres...
```

**TASK: Check to see if there are any missing values in the dataframe.**
```python
# Check for NaN values:
df.isnull().sum()
```

```
label      0
review    35
dtype: int64
```

**TASK: Remove any reviews that are NaN.**
```python
df = df.dropna()
```

**TASK: Check to see if any reviews are blank strings and not just NaN.** Note: this means a review text could just be `""` or `" "` or some other larger blank string. How would you check for this? Note: there are many ways! Once you've discovered the reviews that are blank strings, go ahead and remove them as well.
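Before the solution below, here is a hedged sketch of one of the "many ways": strip the whitespace and compare to the empty string. The tiny DataFrame here is a made-up stand-in for the real reviews data, but the column names match this dataset. Note that `str.strip() == ''` also catches truly empty strings, which `str.isspace()` alone would miss (`''.isspace()` is `False`).

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for the real moviereviews data
df_toy = pd.DataFrame({'label': ['neg', 'pos', 'neg'],
                       'review': ['great film', '   ', '']})

# Strip whitespace and test for emptiness; flags both '   ' and ''
blank_mask = df_toy['review'].str.strip() == ''
print(blank_mask.sum())
```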
```python
df['review'].str.isspace().sum()
```

```
27
```

```python
df[df['review'].str.isspace()]
```

(Output: the 27 rows, such as index 57 and 71, whose `label` is filled in but whose `review` is only whitespace.)

```python
df = df[~df['review'].str.isspace()]
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1938 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   1938 non-null   object
 1   review  1938 non-null   object
dtypes: object(2)
```

**TASK: Confirm the value counts per label:**
```python
df['label'].value_counts()
```

```
pos    969
neg    969
Name: label, dtype: int64
```

## EDA on Bag of Words

**Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type?** Note: this is a bonus task, as we did not show this in the lectures, but a quick cursory Google search should put you on the right path.
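Before the CountVectorizer solution, a hedged sketch of the underlying idea using only the standard library: tally tokens with `collections.Counter` and take the most common ones. The reviews and the stop-word list here are made up for illustration; CountVectorizer does the same counting at scale with a real stop-word list.

```python
from collections import Counter

# Toy stand-ins for the real review texts and for sklearn's stop-word list
toy_reviews = ['the film was a good film', 'the plot was bad']
toy_stop_words = {'the', 'was', 'a'}

counts = Counter()
for review in toy_reviews:
    # Count every whitespace-separated token that is not a stop word
    counts.update(w for w in review.split() if w not in toy_stop_words)

# most_common(n) is what the CountVectorizer solution reproduces with a sort
print(counts.most_common(2))
```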
```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
matrix = cv.fit_transform(df[df['label'] == 'neg']['review'])

# On scikit-learn versions before 1.0, use cv.get_feature_names() instead
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])

# Sort from largest to smallest
print("Top 20 words used for Negative reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
```

```
Top 20 words used for Negative reviews.
[('film', 4063), ('movie', 3131), ('like', 1808), ('just', 1480), ('time', 1127), ('good', 1117), ('bad', 997), ('character', 926), ('story', 908), ('plot', 888), ('characters', 838), ('make', 813), ('really', 743), ('way', 734), ('little', 696), ('don', 683), ('does', 666), ('doesn', 648), ('action', 635), ('scene', 634)]
```

```python
matrix = cv.fit_transform(df[df['label'] == 'pos']['review'])
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])

# Sort from largest to smallest
print("Top 20 words used for Positive reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
```

```
Top 20 words used for Positive reviews.
[('film', 5002), ('movie', 2389), ('like', 1721), ('just', 1273), ('story', 1199), ('good', 1193), ('time', 1175), ('character', 1037), ('life', 1032), ('characters', 957), ('way', 864), ('films', 851), ('does', 828), ('best', 788), ('people', 769), ('make', 764), ('little', 751), ('really', 731), ('man', 728), ('new', 702)]
```

## Training and Data
**TASK: Split the data into features and a label (X and y) and then perform a train/test split.** You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.20, random_state=101`.
```python
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
```

## Training a Model
**TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.**
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('svc', LinearSVC())])

# Feed the training data through the pipeline
pipe.fit(X_train, y_train)
```

```
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
```

**TASK: Create a classification report and plot a confusion matrix based on the results of your Pipeline.**
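Before the plotted version, a hedged sketch of what the confusion matrix plot encodes: `confusion_matrix` returns the raw count array, with rows for true labels and columns for predicted labels (sorted alphabetically, so `neg` then `pos`). The labels here are toy stand-ins, not the real `y_test` and predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels standing in for y_test and the pipeline's preds
y_true = ['neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'pos', 'pos', 'pos', 'neg']

# Rows are true labels, columns are predicted labels ('neg', 'pos')
cm = confusion_matrix(y_true, y_pred)
print(cm)
```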
```python
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the replacement
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

preds = pipe.predict(X_test)
print(classification_report(y_test, preds))
```

```
              precision    recall  f1-score   support

         neg       0.81      0.86      0.83       191
         pos       0.85      0.81      0.83       197
```

```python
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)
```

```
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1f370d0b790>
```
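`MultinomialNB` was imported earlier but never used. As an optional follow-up sketch, the same pipeline pattern works with Naive Bayes swapped in for `LinearSVC`; here it is fit on a tiny made-up corpus rather than the real `X_train`/`y_train`, so the texts and labels below are illustrative assumptions only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus; in the notebook this would be X_train and y_train
texts = ['a wonderful film', 'truly awful movie', 'wonderful story', 'awful plot']
labels = ['pos', 'neg', 'pos', 'neg']

# Same two-step pattern: vectorize, then fit a classifier
nb_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                    ('nb', MultinomialNB())])
nb_pipe.fit(texts, labels)

print(nb_pipe.predict(['wonderful wonderful film']))
```

Comparing this model's classification report against the LinearSVC pipeline on the real test split is a natural way to extend the assessment.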