
Building and deploying a spam classifier

Building and deploying spam classifier (Part 1): Building a kickass spam classifier

What’s this all about?

So we’re in the age of AI and machine learning. Cool! There are tons and tons of articles, books and blog posts about advances in machine learning, new approaches and breakthroughs in deep learning, game-playing records being broken by new reinforcement learning algorithms... yada yada yada. But what does it all mean to the everyday app developer? There seems to be a huge gap between creating all these amazing new AI algorithms and actually using them. As an AI enthusiast, I love to learn about AI and daydream about the kinds of problems we can solve with it. There’s a lot of work going on in the AI space right now, and it’s both fascinating and exciting. But this post is not about building a kick-ass AI model (okay, maybe it is a little). It’s about using one. This post is an attempt to show the set of steps necessary to build an application that harnesses the amazing power of machine learning.

What we are building

We’ll be building and deploying a spam classifier. A spam classifier is an AI model that can categorize an input message into spam or ham (not-spam). After building this classifier, we’ll deploy it as a web API and then we’ll build a React web application to use it.

spam bot - spam message identifier

Part 1 - The classifier

To build our shiny new classifier, we’ll use the amazing scikit-learn library. We’ll also use some other cool Python libraries like Pandas, NumPy and Matplotlib. To build and test our classifier in an interactive environment, we’ll be working in a Jupyter notebook. The best way to get all these amazing Python libraries in a nice, clean, manageable environment is by installing the Anaconda distribution.

Alright chum, time to start writing some code.

First, we need some data. Let’s grab a pre-classified SMS message dataset from the Kaggle machine learning data repository. You can download the dataset as spam.csv.

So we have our data. Cool. Next, let’s see what it looks like. First, we’ll import it into a Pandas dataframe.

import pandas as pd

# the Kaggle file isn't UTF-8 encoded, so specify latin-1
msg_df = pd.read_csv('spam.csv', encoding='latin-1')

messages dataframe first row

Looks like the dataframe contains many extra columns that we don’t need so we’ll copy the columns we need into a new dataframe.

msg_df_2 = pd.DataFrame({'label': msg_df['v1'].values, 'message': msg_df['v2'].values})

photo 2

This looks much better. Now let’s see a description of our dataset.
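The summary in the screenshot presumably comes from a call like `msg_df_2.describe()`. As a quick illustration of what `describe()` reports for text columns, here’s a sketch on a tiny stand-in frame (the `demo_df` data below is made up, not from the real dataset):

```python
import pandas as pd

# a stand-in frame with the same two columns as our cleaned dataframe
demo_df = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'message': ['hi there', 'free entry!', 'see you soon'],
})

# for text (object) columns, describe() reports count, unique, top and freq
summary = demo_df.describe()
print(summary)
```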


photo 3

From the results, we can see that we have two labels - spam and ham - and we have about 5,571 instances. Sweet!

Let’s create some global lists to store our data. We’ll need these later to build our model.

msgs = []
labels = []
ulabels = []

def import_data():
    global msgs, labels, ulabels
    msgs = msg_df_2['message']
    labels = msg_df_2['label']
    ulabels = sorted(list(set(labels)))

%time import_data()

photo 5

This tells us how long it took to read our data.

Now let’s find out how much spam and ham we have in this dataset. We’ll create a function that helps us find the ratio of spam to ham in our dataset.

from collections import Counter

def count_data(labels, categories):
    c = Counter(categories)
    cont = dict(c)
    tot = sum(cont.values())
    # summary of counts and percentages per category
    d = {
        "category": labels,
        "msgs": [cont[l] for l in labels],
        "percent": [cont[l] / tot for l in labels]
    }
    print(d)
    return cont

cont = count_data(ulabels, labels)


Let’s use the fantastic pylab library to plot a pie chart showing how much spam compared to ham we have in our dataset.

First, let’s enable chart plotting in our Jupyter notebook

%matplotlib inline


import pylab as pl

def categories_pie_plot(cont, tit):
    global ulabels
    sizes = [cont[l] for l in ulabels]
    pl.pie(sizes, explode=(0, 0), labels=ulabels, autopct='%1.1f%%', shadow=True, startangle=90)
    pl.title(tit)

categories_pie_plot(cont, "Plotting categories")

We get a nice little pie chart that looks like this:

photo 6

So our dataset has way more ham than spam. No surprise there really. Most real-world datasets are imbalanced like this, with far more examples of one class than the others.

Now we’ll split our data into a training set and a test set. We’ll train our model using the training set and we’ll use the test set to evaluate the performance of our model.

from sklearn.utils import shuffle

X_train = []
y_train = []
X_test = []
y_test = []

def split_data():
    global msgs, labels
    global X_train, X_test, y_train, y_test, ulabels
    N = len(msgs)
    Ntrain = int(N * 0.7)
    msgs, labels = shuffle(msgs, labels, random_state=0)
    X_train = msgs[:Ntrain]
    y_train = labels[:Ntrain]
    X_test = msgs[Ntrain:]
    y_test = labels[Ntrain:]

%time split_data()

Notice that we’re using the scikit-learn shuffle function to shuffle our data before slicing out the training and test sets. We’re doing this to make sure that the spam-ham ratio in the training and test sets stays roughly the same as in the original dataset.
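A tiny illustration of why `shuffle` is safe here (toy data, not from our dataset): it applies the same permutation to both sequences, so every message stays paired with its own label.

```python
from sklearn.utils import shuffle

# toy messages whose labels encode each message's original position
X = ['msg0', 'msg1', 'msg2', 'msg3']
y = [0, 1, 2, 3]

# shuffle permutes X and y with the SAME permutation,
# so the message-label pairing is preserved
Xs, ys = shuffle(X, y, random_state=0)
print(Xs, ys)
```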

Let’s verify that these ratios are preserved in the training and test datasets.

For the training dataset

train_ratio = count_data(ulabels, y_train)

photo 8

categories_pie_plot(train_ratio, "Categories % in training set")

photo 9

And the testing dataset

test_ratio = count_data(ulabels, y_test)


categories_pie_plot(test_ratio, "Categories % in test set")

photo 11

So we can see that the shuffle function preserves the class ratios nicely.

Training our model

To train our model, we’ll use scikit-learn’s awesome Pipeline feature. Pipelines are structures that let us connect the output of one scikit-learn estimator or transformer to the input of another. Sort of like a… pipeline. ^_^

Our model pipeline contains three stages:

1. The CountVectorizer

Converts the input string to a word-count vector, which is a useful numerical representation of a sentence or phrase in machine learning.

2. The TfidfTransformer

This converts the raw word counts into tf-idf values, calculated as [term frequency * inverse document frequency].

3. The Multinomial Naive Bayes classifier

The scikit-learn Naive Bayes classifier works really well for text classification problems.

The output of our awesome pipeline is an array containing the prediction for the input vector.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy as np
import pprint

nlabels = len(ulabels)

nrows = nlabels
ncols = nlabels

# make sure every entry is a plain string (guards against stray NaN values)
X_train = X_train.astype(str)
X_test = X_test.astype(str)

text_clf = None 

def train_test():
    global X_train, y_train, X_test, y_test
    global text_clf
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB())
    text_clf =, y_train)
    predicted = text_clf.predict(X_test)
    return predicted

%time predicted = train_test()

So we trained our classifier. Let’s find out how well it performs on our test data. First, we’ll compute the accuracy.

metrics.accuracy_score(y_test, predicted)

photo 12

That score looks pretty high. Let’s compute a confusion matrix to see what the score for each class looks like.

mat = metrics.confusion_matrix(y_test, predicted, labels=ulabels)
cm = mat.astype('float') / mat.sum(axis=1)[:, np.newaxis]
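To make the normalization step concrete, here’s the same row-wise division applied to a made-up 2x2 count matrix (not our real results): dividing each row by its row sum turns raw counts into per-class recall.

```python
import numpy as np

# toy confusion matrix: rows = true class (ham, spam), cols = predicted class
mat = np.array([[50, 0],
                [3, 7]])

# divide each row by its row sum: 100% of ham correct, 70% of spam correct
cm = mat.astype('float') / mat.sum(axis=1)[:, np.newaxis]
print(cm)  # [[1.  0. ] [0.3 0.7]]
```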

photo 15

Let’s use the amazing matplotlib library to plot a pretty confusion matrix.

import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, '{:5.2f}'.format(cm[i, j]), horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plot_confusion_matrix(cm, ulabels, title='Confusion matrix')

photo 14

Looks like our model is classifying about 31% of the spam messages as ham, but it seems to be correctly classifying all our ham messages. Not bad. There are several ways we can improve this model:

  • By using a more balanced dataset
  • By using several different classifiers such as Logistic regression and Random forest classifiers and picking the one that gives the best performance.
  • By using k-fold cross-validation to tune the model parameters to squeeze out some accuracy from our classifier.
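As a taste of the second and third ideas, here’s a hedged sketch (on synthetic stand-in data from `make_classification`, not our SMS dataset) of comparing classifiers with 5-fold cross-validation:

```python
from sklearn.datasets import make_classification  # stand-in data
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
X = abs(X)  # MultinomialNB requires non-negative features

for name, clf in [('nb', MultinomialNB()),
                  ('logreg', LogisticRegression(max_iter=1000)),
                  ('rf', RandomForestClassifier(random_state=0))]:
    # 5-fold cross-validation: mean accuracy over the held-out folds
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```

The classifier with the best mean score would be the one to keep and tune.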

But we’re not going to do any of that. This post is about creating and deploying a machine learning model so onwards we march!

Serializing our model

Scikitlearn provides a neat way for us to serialize our entire model pipeline:

from sklearn.externals import joblib
# note: in newer versions of scikit-learn, use `import joblib` instead

with open('text-classifier.pkl', 'wb') as model_file:
    joblib.dump(text_clf, model_file)

After running this, we should see a sweet little ‘text-classifier.pkl’ file in our working directory. Let’s take it for a spin to see if it works.

with open('text-classifier.pkl', 'rb') as model_file:
    saved_model = joblib.load(model_file)

r1 = saved_model.predict(['free wkly comp'])
r2 = saved_model.predict(['where is the jetpack'])


Output: r1 = ['spam'], r2 = ['ham']

This brings us to the end of my first post. In the next post, I’ll show you how to set up a web API using our saved model and build a React app to use the API.

The code for this article is available on GitHub.


To write this article, I found this repository by Andres Soto Villaverde very helpful. Check it out and give it a star!