Natural language processing: The bag-of-words algorithm
files needed = ('spam.csv', 'newsgroups.zip')
In this lecture we're going to shift gears from dealing with numerical data to text data.
Working with text as data is known as Natural Language Processing (NLP). A common use of NLP is categorizing a set of text. Perhaps the most ubiquitous example is a spam filter. It reads the text of a message and determines if it is "spam" or "ham."
We'll employ one simple NLP algorithm, known as the bag-of-words algorithm, to classify SMS messages as spam. There are more sophisticated methods, but this will give us the big idea; they typically amount to tweaks to how we process the data or more complex classifier (discrete) models.
Spam Messages
Spam emails/messages belong to the broad category of unsolicited messages received by a user. Spam occupies unwanted space and bandwidth, amplifies the threat of viruses, and in general exploits a user's connection to social networks. Plus, they're annoying.
Our goal is to classify a message as spam (unwanted message) or ham (wanted message).
Languages are harder for algorithms to interpret and analyze than numeric data since:
- Sentences are not of fixed lengths, but most algorithms require a standard input vector size.
- Most algorithms cannot understand words as input: hence, each word needs to be represented by some numeric value.
So our method is:
- Preprocessing: Clean up the text. This is the new stuff.
- Estimate a model on the training data: Let \(\text{word}_{ji}\) be the number of times the word \(j\) occurs in message \(i\)
$$\text{Pr}(\text{message}_{i}=1|\text{words}) = \text{logit}(\beta_{0} + \beta_{1}\text{word}_{1i}+ \beta_{2}\text{word}_{2i}+\cdots)$$
- Use the estimated model to filter incoming messages
I'm using the machine-learning term "classifier" in this notebook, but everywhere you see the word "classifier," you can replace it with "Logit" and nothing changes. Similarly, you can replace the word "features" with "exogenous variables."
import numpy as np
import pandas as pd
A simple example
Three messages.
# Corpus is a fancy word for a collection (or a body) of text.
# Label marks a message as spam (1) or not spam (0).
corpus = [('Text 1', 'You have won a prize. Call today to claim.', 1),
          ('Text 2', 'It is your mother. Call me.', 0),
          ('Text 3', 'Are you around today? I need a favor.', 0)]
data = pd.DataFrame(corpus, columns=['Document Number','Text of Documents', 'Label'])
data.head()
Even though Python is good at working with text, we still need to convert our text into numeric data before a classifier model can analyze it. Let's create a matrix with the word counts. Each row of the matrix is an observation (a message) and each column is a word. The cells in the matrix are the number of times that word is found in the message.
The scikit package gives us the CountVectorizer to do this for us.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Text of Documents'])
print(X.toarray())
Note that X is an array-like object. We are in the realm of scikit, which doesn't use DataFrames. Let's turn this back into a DataFrame, though, so we can see things clearly.
cols = vectorizer.get_feature_names_out()
exog = pd.DataFrame(X.toarray(), columns=cols, index=data['Document Number'])
exog.head()
The exog DataFrame contains our features and the data['Label'] column contains our outcome variable. We now have the data ready to estimate a classifier model (e.g., a logit regression).
In statsmodels, this would be:
sm.Logit(data['Label'], exog).fit()
This dataset is too small to actually fit a model, so let's move on to something bigger.
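If you wanted to write that statsmodels version out in full, a minimal sketch might look like the cell below. This is for the syntax only; as noted, three messages is far too few to estimate anything, so the fit line is left commented out.
import statsmodels.api as sm
# add_constant() supplies the intercept (the beta_0 in the equation above).
# With only three messages this will not converge to anything sensible:
# res = sm.Logit(data['Label'], sm.add_constant(exog)).fit()
# print(res.summary())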
Note that this methodology of turning text into data is not limited to classification problems. For example, we could use this approach to connect stock performance with FOMC statements to predict how the Federal Reserve's statements on the economy influence the S&P500, the Dow Jones, individual stocks, and government treasury prices. NLP is a broad topic and a lot of fun.
The new package
Text data come with a whole set of problems to deal with: misspellings, different versions of the same word, capitalization. We will use the Natural Language Toolkit (nltk) to help us process the text data. It comes with Anaconda, but if you need to install it:
pip install --user nltk
import nltk
Detecting spam messages
The dataset that we are using is an SMS spam collection dataset. It contains over 5,500 messages in English. There are two columns. The first column tells us whether the text is 'ham' or 'spam' and the second column contains the actual text message.
dataset = pd.read_csv('spam.csv')
dataset.rename(columns = {'v1': 'labels', 'v2': 'message'}, inplace = True)
dataset['label'] = dataset['labels'].map({'ham': 0, 'spam': 1})
dataset
Data Preprocessing
This is the part that makes NLP different from working with numeric data. We need to clean up the text and turn it into a feature matrix.
- 'I am helping raise $100 for UW Madison' (original)
- 'i am helping raise $100 for uw madison' (homogenize capitalization)
- 'i am helping raise for uw madison' (remove non-alphabetic characters)
- ['i', 'am', 'helping', 'raise', 'for', 'uw', 'madison'] (tokenize)
- ['helping', 'raise', 'uw', 'madison'] (remove stop words)
- ['help', 'raise', 'uw', 'madison'] (stem and lemmatize)
- 'help raise uw madison' (back to a single string)
Then create the feature matrix
help | raise | uw | madison |
---|---|---|---|
1 | 1 | 1 | 1 |
1. Homogenize the capitalization
We don't want to worry about 'Hello' not being equal to 'hello'. Let's make everything lowercase.
# 1. Homogenize capitalization
dataset['message'] = dataset['message'].str.lower()
dataset.tail()
2. Remove non-alphabetic characters
Our algorithm will only use words to characterize messages. Dropping everything else is not necessary, but it simplifies our approach today. Perhaps messages with numbers in them are more likely to be spam?
We will remove them using a regular expression. We have not covered 'regex' (there is never enough time!), but regex is a powerful string-search language that is part of Python and most other programming languages. I have a notebook on regex here, which you can work through if you are interested.
The code to remove the non-alphabetic characters is
dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
The regex part is the '[^A-Za-z]'. It says: "find everything that is not the letters A through Z or a through z." We replace the non-alphabetic stuff with a space.
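To see the pattern in isolation, here is a quick illustration using Python's built-in re module on the example sentence from earlier. This is a standalone demo, not part of our pipeline.
import re
# Every character that is not a letter (the '$100' and the ordinary spaces) becomes a space.
print(re.sub('[^A-Za-z]', ' ', 'I am helping raise $100 for UW Madison'))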
# 2. Remove non-alphabetic characters
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset.tail()
3. Tokenize the strings
Break the strings up into lists of words, which are easier to process. This is very similar to using .str.split(' '). Here we use the tokenizer method from nltk. It is a bit more sophisticated than a simple split.
from nltk.tokenize import word_tokenize as wt
We also need to download the 'punkt' tokenizer models.
# 3. Tokenize the strings.
# Download the 'punkt' tokenizer models.
nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt
dataset['message'] = dataset['message'].apply(wt)
dataset.tail()
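To see what the tokenizer buys us over a plain split, here is a small side-by-side on a raw sentence. This is an illustration only; our messages have already had punctuation stripped, so the difference matters less here.
# The tokenizer splits off punctuation and contractions; a plain split only breaks on spaces.
sentence = "don't forget to call me today!"
print(sentence.split(' '))
print(wt(sentence))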
4. Removing stop words
Now we eliminate stop words—words in the text that add no specific meaning. They often involve prepositions, helping verbs, and articles (e.g., in, the, an, is). Since these add no value to our model, let's get rid of them.
Fortunately, linguists have already compiled lists of stop words, so we can readily identify and exclude them.
from nltk.corpus import stopwords
stop_wrds = stopwords.words('english')
stop_wrds is a list of English-language stop words.
We need to loop through the lists and check for stop words. I will write a small function that does the looping and then apply it to the DataFrame's column using .apply().
Again, we need to download the stop words first.
# 4. Remove stop words.
nltk.download('stopwords')
from nltk.corpus import stopwords
# I think there is a better way to do this using sets...
def remove_stops(x):
    stop_wrds = stopwords.words('english')
    temp = []
    for word in x:
        if word not in stop_wrds:
            temp.append(word)
    return temp
dataset['message'] = dataset['message'].apply(remove_stops)
dataset.tail()
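The comment in the cell above hints at a better way: build the stop-word collection once, as a set, so the lookup is fast and the list isn't rebuilt on every call. A sketch (remove_stops_fast is just an illustrative name):
# Build the set of stop words once.
stop_set = set(stopwords.words('english'))

def remove_stops_fast(x):
    # Keep only the words that are not in the stop-word set.
    return [word for word in x if word not in stop_set]

# This would do the same job as remove_stops(), only faster:
# dataset['message'] = dataset['message'].apply(remove_stops_fast)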
5. Stemming and Lemmatization
Words like act, actor, and acting are all versions of the same root word (act). Stemming and lemmatization are techniques used to truncate words in order to get the stem or the base word. The difference between the two methods is that after stemming, the stem may not be an actual word, whereas lemmatization always produces a real word, which makes the corpus easier for humans to interpret.
For example, studies could be stemmed as studi (not a word), but will be lemmatized as study (an existing word). To be honest, this feels like a rabbit hole, so I'm treating this stuff as a black box and trusting that the linguists are doing a good job.
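If you want to see the difference for yourself, here is a quick comparison on the word 'studies'. The lemmatizer needs one extra download; treat this as a side illustration.
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                           # data the lemmatizer needs
print(PorterStemmer().stem('studies'))             # 'studi' -- not a real word
print(WordNetLemmatizer().lemmatize('studies'))    # 'study' -- a real word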
Let's stem these words.
# 5. Stemming and lemmatization
from nltk.stem.porter import PorterStemmer
def stem_it(x):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in x]
dataset['message'] = dataset['message'].apply(stem_it)
dataset.tail()
That seemed like a lot of work, but it always does when we are first learning something. Putting all the code together, the processing is simply:
dataset['message'] = dataset['message'].str.lower()
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset['message'] = dataset['message'].apply(wt)
dataset['message'] = dataset['message'].apply(remove_stops)
dataset['message'] = dataset['message'].apply(stem_it)
You could even wrap all that up in a function, too...
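Something along these lines would work (a sketch: it reuses the imports and helper functions defined above, and preprocess is just a name I made up):
def preprocess(messages):
    # messages is a Series of raw text; returns a Series of lists of cleaned, stemmed words.
    out = messages.str.lower()
    out = out.str.replace('[^A-Za-z]', ' ', regex=True)
    out = out.apply(wt)
    out = out.apply(remove_stops)
    return out.apply(stem_it)

# Our dataset is already processed, but on a fresh copy you would call:
# dataset['message'] = preprocess(dataset['message'])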
Below we use the CountVectorizer() method. It can do some of this preprocessing, too.
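For example, a sketch of what leaning on its built-in options might look like (lowercase and stop_words are standard CountVectorizer parameters; note that it will not stem for you):
# Lowercase the text and drop English stop words inside the vectorizer itself.
cv_builtin = CountVectorizer(lowercase=True, stop_words='english', max_features=100)
# X_alt = cv_builtin.fit_transform(raw_messages)   # raw_messages = a hypothetical unprocessed text column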
Create the feature matrix
We are done preprocessing.
- Turn the lists of words back into strings.
- Create the feature matrix using CountVectorizer.
dataset['message'] = dataset['message'].str.join(' ')
# The matrix of word counts. I am limiting the feature matrix to 100 columns.
# Create the vectorizer
cv = CountVectorizer(max_features=100)
# Fit the vectorizer to the dataset. X is the matrix of word counts.
X = cv.fit_transform(dataset['message'])
exog = pd.DataFrame(X.toarray())
# The outcome data.
endog = dataset['label']
#cv.vocabulary_
Estimate the logit model
I'm using the scikit-learn package to estimate the logit model. This is the same kind of logit model we have estimated with statsmodels, but scikit-learn has a faster implementation.
The syntax is a bit different, but it's the same idea.
from sklearn.linear_model import LogisticRegression
res_spam = LogisticRegression(random_state=0, max_iter=1000).fit(exog, endog)
scikit does not have a .summary() function like statsmodels. In this kind of application, we are not concerned with the individual coefficients. We don't really care why a word is good at classifying spam, we just care that the model works. This is identification vs. prediction.
You can still recover all the parameters, etc., but not in a neat table.
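If you do want to peek at them, the fitted object stores the estimates as attributes. A quick sketch, lining the slope coefficients up with their words:
# Intercept and slope coefficients from the fitted model.
print(res_spam.intercept_)
coefs = pd.Series(res_spam.coef_[0], index=cv.get_feature_names_out())
# The largest coefficients belong to the words that push a message toward spam.
print(coefs.sort_values().tail())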
How is our in-sample fit? Compare the predicted values to the actual values. The .score() method computes the share of the messages properly categorized.
res_spam.score(exog, endog)*100
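That number is just the share of correct predictions, so you can reproduce it by hand as a sanity check:
# Fraction of messages where the predicted label matches the actual label.
(res_spam.predict(exog) == endog).mean()*100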
The confusion matrix is (from the docs):
pred_table[i,j] refers to the number of times 'i' was observed and the model predicted 'j'.
Correct predictions are along the diagonal.
from sklearn.metrics import confusion_matrix
#confusion_matrix(endog, res_spam.predict(exog))
pd.DataFrame(confusion_matrix(endog, res_spam.predict(exog)))
Not too shabby!
Filtering our messages
Let's test our spam filter on two messages, which I have already preprocessed.
'hello want bye'
'free give www'
I don't know if these are spam or not. I will use the estimated model to make the prediction.
# Use the same CountVectorizer.
# It already has the vocabulary determined from our dataset.
# xx is our matrix of word counts
xx = cv.transform(['hello want bye', 'free give www'])
# res_spam is our estimated model.
# Use .predict() to make the predictions.
res_spam.predict(xx)
Our model thinks the first message is not spam, but the second is. I think 'free' and 'www' are driving the result...
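If you want to see how confident the model is, rather than just the 0/1 call, you can look at the predicted probabilities (a quick sketch):
# Each row is [probability of ham (0), probability of spam (1)].
res_spam.predict_proba(xx)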
Practice
We're going to practice using the "20 Newsgroup" data set which is
a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The goal is to categorize each post based on its content.
1. Download and load the file 'newsgroups.csv'. Only import the first 500 rows. Try the nrows option of .read_csv().
article is the message, category code is the newsgroup category code, and category is the newsgroup category name.
Our goal: Create a classifier that predicts the category code of an article.
2. How many articles are there in each category? Looks like it's time for .groupby().
3. Process the text data. All the code to do this is gathered in the cell above the "Create the feature matrix" heading.
4. Turn the lists of words in article into strings.
5. Create the feature matrix. I used 100 features again.
6. Create the outcome variable (the Series that contains the category codes).
7. Estimate the logit model.
8. Use .score() to check the in-sample fit.
9. Compute the confusion matrix.
10. Go back to step 1 and increase the number of rows to 1000. Rerun your code. Does the accuracy improve?
    Go back to step 5 and add more features to your exogenous variables. Does the accuracy improve?
    Do you see any patterns in the confusion matrix?
Want more?
This notebook was meant to give you an idea of how to turn words into data. We used a very simple "bag-of-words" model where we only considered how often a word appears. There are more complex methods (term frequency-inverse document frequency, ngrams,...) and alternative modeling (machine learning methods).
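As a small taste of those extensions, here is a sketch of two drop-in variations on the vectorizer; everything downstream (the logit, the score, the confusion matrix) would stay the same.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Weight words by term frequency-inverse document frequency instead of raw counts.
tfidf = TfidfVectorizer(max_features=100)

# Or keep raw counts, but let one- and two-word phrases (bigrams) be features.
cv_bigrams = CountVectorizer(ngram_range=(1, 2), max_features=100)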
Besides classifying text, NLP is used to deduce the sentiment of a block of text. Is the earnings call positive or negative? Is the Fed statement hawkish or dovish? Is the review positive or negative? NLP models form the basis for automatic translation software and voice recognition. Lots of cool stuff.
If you want to learn more, Dan Jurafsky and James H. Martin have a free book online that I think is quite good. As always, you can google "nlp tutorial" and see what is out there.