Driptanil Datta, Software Developer

Text Classification Assessment

Goal: Given a set of text movie reviews that have been labeled negative or positive, train a model that predicts the label of a review from its text.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

Complete the tasks in bold below!

TASK: Perform imports and load the dataset into a pandas DataFrame. For this exercise you can load the dataset from '../DATA/moviereviews.csv'.

# CODE HERE

import numpy as np
import pandas as pd

df = pd.read_csv('../DATA/moviereviews.csv')

df.head()

  label  review
0 neg    how do films like mouse hunt get into theatres...

TASK: Check to see if there are any missing values in the dataframe.

#CODE HERE
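One possible way to approach this check, shown as a minimal sketch: `isnull().sum()` counts the missing entries in each column. The small DataFrame below is a hypothetical stand-in with the same two columns as the real data, not the actual reviews.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real reviews DataFrame
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull and slow", np.nan, "a mess", "a delight"],
})

# Number of missing (NaN) values in each column
missing_counts = df.isnull().sum()
print(missing_counts)
```

On the real DataFrame, a non-zero count in the review column indicates rows that need to be dropped.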

TASK: Remove any reviews that are NaN.
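A minimal sketch of one way to drop the missing reviews, assuming the text column is named `review` as in the `df.head()` output. The toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second review is missing
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull and slow", np.nan, "a mess", "a delight"],
})

# Keep only the rows whose review is not NaN
df_clean = df.dropna(subset=["review"])
print(len(df), "->", len(df_clean))
```

`dropna` returns a new DataFrame by default, so the result has to be assigned back (or `inplace=True` passed) for the change to stick.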

TASK: Check to see if any reviews are blank strings and not just NaN. Note: This means a review text could just be "" or " " or some longer string made up only of whitespace. How would you check for this? There are many ways! Once you've found the reviews that are blank strings, remove them as well.
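One of the many possible approaches, as a sketch: strip the whitespace from every review and compare the result with the empty string. This catches "", " ", and longer whitespace-only strings in a single pass. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; three of the reviews contain only whitespace or nothing
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg", "pos"],
    "review": ["good film", "", "   ", "bad film", "\n\t"],
})

# True where the review is empty once surrounding whitespace is removed
is_blank = df["review"].str.strip() == ""
blank_count = int(is_blank.sum())
print("blank reviews:", blank_count)

# Keep only the non-blank reviews
df_no_blanks = df[~is_blank]
```

`Series.str.isspace()` is a tempting alternative, but it returns False for the empty string "", so it would miss that case on its own.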

TASK: Confirm the value counts per label:

#CODE HERE
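A minimal sketch of the count, assuming the column is named `label`. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: three negative and two positive reviews
df = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos", "neg"],
    "review": ["bad", "good", "awful", "great", "poor"],
})

# Number of rows for each distinct label
counts = df["label"].value_counts()
print(counts)
```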

EDA on Bag of Words

Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note that this is a bonus task, as we did not show this in the lectures, but a quick Google search should put you on the right path.

#CODE HERE

Training and Data

TASK: Split the data into features and a label (X and y) and then perform a train/test split. You may use whatever settings you like. To compare your results to the solution notebook, use test_size=0.20, random_state=101.

#CODE HERE
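A minimal sketch of the split using the settings named in the task. The ten toy rows are hypothetical and only there to make the example runnable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short reviews with alternating labels
df = pd.DataFrame({
    "label": ["neg", "pos"] * 5,
    "review": [f"toy review number {i}" for i in range(10)],
})

X = df["review"]  # feature: the raw review text
y = df["label"]   # target: the sentiment label

# Hold out 20% of the rows for testing; the seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=101
)
print(len(X_train), len(X_test))
```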

Training a Model

TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.

#CODE HERE
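A sketch of one possible pipeline. The task leaves the model open, so `LinearSVC` here is only an assumption, and the step names `"tfidf"` and `"svc"` as well as the toy training data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data
X_train = [
    "a dull and boring film",
    "terrible acting and a weak plot",
    "boring, slow and forgettable",
    "a wonderful and charming story",
    "great cast and a brilliant script",
    "charming, funny and wonderful",
]
y_train = ["neg", "neg", "neg", "pos", "pos", "pos"]

# Step 1 turns raw text into TF-IDF vectors; step 2 fits a classifier on them
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC()),  # assumed model choice; any classifier could go here
])
pipe.fit(X_train, y_train)

predictions = pipe.predict(["what a wonderful film"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, `predict` accepts raw strings directly and the TF-IDF vocabulary is learned from the training data only.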

TASK: Create a classification report and plot a confusion matrix based on the results of your Pipeline.

#CODE HERE

Great job!
