Project 2
Spam filtering
Overview
You’ll be writing a reasonable-quality spam filter that takes a raw email and outputs whether it’s likely spam or non-spam (sometimes called ham). I’ll list out the details of the main baseline algorithm used in spam filtering for you to implement, with examples. Your program will “train” on several examples of spam and non-spam; you’ll then “test” it on a bunch of emails that are labeled as spam or non-spam, computing the accuracy of the spam filter (the percentage of times it guessed correctly).
Early part
The full assignment isn’t ready for distribution yet, but if you’d like, you can start on part of it now; the rest of the assignment will be ready by Tuesday the 29th.
The first step of the project is to load emails from files (one file per email) in a certain format. You should write a class to represent an email; its constructor should take a filename, load the file, and populate the object’s data.
The Email class should at least contain the following fields (a rough sketch follows the list):
1. From - should contain the email address of the sender. If you want to store the name of the person, put it in a separate field.
2. Subject - should contain the subject line as a C or C++ string.
3. Body - should contain all of the text of the email, either as a C or C++ string.
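Here’s a rough sketch of what such a class might look like; the exact names and layout are up to you:

    #include <fstream>
    #include <string>

    // One possible shape for the Email class; the names are illustrative.
    class Email {
    public:
        std::string from;     // sender's email address
        std::string subject;  // the subject line
        std::string body;     // all of the body text

        // Takes a filename, loads the file, and populates the fields.
        explicit Email(const std::string& filename) {
            std::ifstream in(filename.c_str());
            // ... read header and body lines into from/subject/body
            //     (the loading loop is sketched under part 1 below) ...
        }
    };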
Unfortunately, email formats can vary somewhat. Different email clients handle word wrap differently, as you’ll see when you look through the examples, and the email headers that appear vary from email to email. You’ll get the idea when you look through the examples I give, and also when I release the full training set.
Frequency lists
If you get done with that, you can work on building a frequency list for the subject and, separately, for the body of each email. A frequency list is a mapping from words to the number of times each word occurred.
You can use the map class in STL to represent this data structure.
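For example, here’s a tiny sketch of a frequency list built with map (the input text is made up):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::map<std::string, int> freq;     // word -> number of occurrences
        std::istringstream in("buy stock now buy");
        std::string word;
        while (in >> word)
            ++freq[word];                    // a missing key starts at 0
        for (const auto& entry : freq)
            std::cout << entry.first << ": " << entry.second << "\n";
        // prints: buy: 2, now: 1, stock: 1 (one per line)
    }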
Examples of spam and non-spam raw format
Here are 4 emails in a zip file so you can see what email formats (roughly) look like and can get started on the first part. Note that some of those headers are specific to the email servers we use here, but at the least you can count on From, Subject, and the body. To unzip them under UNIX, there’s the unzip command.
Parts of the project
1. Loading the emails
Even just loading the emails will be somewhat tricky. First off, make sure you download the data.
1.1. First, you need to know which files to load! You’ll have a file called “training” which looks like:

(so make sure your program is set up like that: have a project2 directory, a data directory for the data, and a code directory where you put the code and these index files)
1.2. So you need to load each of these files and loop with >> to read the filenames (don’t worry, they won’t have any spaces in them).
1.3. For each file, you have to use file streams like ifstream (though it’s not so hard).
1.4. Then you’ll need some kind of loop with a getline function call (remember, the getline for char[] and the one for string are different!).
1.5. Then you have to make sure you can tell the difference between header lines and body lines (header lines are at the top, and most start with a “word” followed by a colon).
1.6. For header lines, you’re going to have to see if it’s a header line you actually care about (From or Subject). If it is, you’re going to have to extract the part of the line after “From: ” or “Subject: ” and either store it in an Email object or process it right away.
1.7. For body lines, you’re going to have to either append to a char[] or string to save the body, or just process them right away. A sketch of the whole loading loop follows this list.
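Here’s a rough sketch that puts steps 1.1-1.7 together. The index filename, the exact header tests, and printing instead of storing are all just assumptions to keep the sketch short; adapt them to your own Email class:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream index("training");           // one email filename per line
        std::string filename;
        while (index >> filename) {                // filenames have no spaces
            std::ifstream email(filename.c_str());
            std::string line, from, subject, body;
            bool inBody = false;
            while (std::getline(email, line)) {
                if (!inBody && line.empty()) {     // blank line ends the headers
                    inBody = true;                 // (watch out for stray spaces)
                } else if (!inBody) {
                    if (line.compare(0, 6, "From: ") == 0)
                        from = line.substr(6);
                    else if (line.compare(0, 9, "Subject: ") == 0)
                        subject = line.substr(9);
                } else {
                    body += line + " ";            // save body lines for later
                }
            }
            std::cout << filename << " from=" << from << "\n";
        }
    }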
2. Processing each field (extracting stuff, tokenizing, etc.)
2.1. To process the From field: if it contains “<”, extract the part between “<” and “>”. Otherwise, just extract the whole thing.
2.2. To process the subject, you’re going to have to tokenize the part after “Subject:”. Tokenizing means that you split one string into multiple strings, usually at the whitespace. There are two ways to do this, depending on whether you use a char[] or the string class:
2.2.1. For char[], there’s the strtok function.
2.2.2. For string, you can make an istringstream connected to the string, and the normal >> operator does tokenization anyway.
2.3. To process the body, you have to do the same tokenization process, but it’s a little bit easier. Once you know you’re in the body, you can just use the standard >> operator on the file input stream and that’ll tokenize it for you. A sketch of the From extraction and subject tokenization follows this list.
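Here’s a sketch of the From extraction (step 2.1) and the istringstream tokenization (step 2.2.2); the function name is made up:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Step 2.1: pull the address out of a From field.
    std::string extractAddress(const std::string& from) {
        std::string::size_type lt = from.find('<');
        if (lt == std::string::npos)
            return from;                          // no "<", take the whole thing
        std::string::size_type gt = from.find('>', lt);
        return from.substr(lt + 1, gt - lt - 1);  // the part between "<" and ">"
    }

    int main() {
        std::cout << extractAddress("Jim Bo <jimbo@enron.com>") << "\n";
        // prints: jimbo@enron.com

        // Step 2.2.2: >> on an istringstream splits at whitespace.
        std::istringstream in("Re: project 2 question");
        std::string word;
        while (in >> word)
            std::cout << word << "\n";            // Re: / project / 2 / question
    }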
3. Processing each email (computing some numbers)
3.1. You’re going to keep track of all of the words used in spam emails and in non-spam emails. Same for sender addresses.
3.2. So for each spam email (either during or after loading it), you’re going to loop through all of the words in the subject and body and add to the number of times that word occurred in spam. Similarly for non-spam. You can use a map data structure from the STL to represent this. So you’ll have some data structure that represents information like this:
Spam:
enlarge: 50, stock: 40, watches: 40
Non-spam:
assignment: 10, drive: 5, house: 6, dog: 1, computer: 2, car: 2
3.3. You should have a separate, but similar, data structure for the senders of the emails. It would look something like this (a sketch of the counting loop follows these examples):
Spam:
Asdasdasd@aol.com: 1, kdjddj@blah.ru: 1, …
Non-spam:
no-reply@verizon.net: 4, jimbo@enron.com: 5, ...
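Here’s a sketch of the counting loop from 3.2; the text is hard-coded here, but in your program the words would come from the emails you loaded:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::map<std::string, int> spamWords, nonspamWords;

        // Count every word of one spam email's subject and body.
        std::istringstream spam("enlarge your stock stock");
        std::string word;
        while (spam >> word)
            ++spamWords[word];

        // Same loop, but into nonspamWords, for a non-spam email.
        std::istringstream nonspam("assignment due friday");
        while (nonspam >> word)
            ++nonspamWords[word];

        std::cout << "stock in spam: " << spamWords["stock"] << "\n";  // 2
    }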
4. Processing spam and non-spam training data (computing more numbers)
4.1. So now you have the frequency of each word in spam subjects/bodies and the same for non-spam. The next step is to convert those frequencies to probabilities. All you do is loop through each map and add up the values, then divide all the values by the total.
The reason you convert to probabilities is that you might have 500 times more spam than non-spam, so innocuous words like “assignment” might actually be more frequent in spam just because there’s more of it, even though those words make up a smaller percentage of spam words.
4.2. Do the same thing for the From data structure and convert from frequencies to probabilities. A sketch of the conversion follows.
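Here’s a sketch of the conversion (the check for an empty map is just defensive):

    #include <iostream>
    #include <map>
    #include <string>

    // Step 4.1: add up the values, then divide each value by the total.
    std::map<std::string, double>
    toProbabilities(const std::map<std::string, int>& freq) {
        double total = 0;
        for (const auto& entry : freq)
            total += entry.second;
        std::map<std::string, double> prob;
        if (total == 0)
            return prob;                     // nothing to divide by
        for (const auto& entry : freq)
            prob[entry.first] = entry.second / total;
        return prob;
    }

    int main() {
        std::map<std::string, int> counts;
        counts["enlarge"] = 50; counts["stock"] = 40; counts["watches"] = 10;
        std::cout << toProbabilities(counts)["stock"] << "\n";  // 40/100 = 0.4
    }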
5. Applying the method to classify a single email
5.1. Suppose you load a single email and you’re testing to see whether it’s spam or not. This is how you should do it for this project.
5.2. First off, you should maintain a score for spam and a score for nonspam.
5.3. When you see the From field, add its probability from the spam data structure to the spam score, and add its probability from the nonspam data structure to the nonspam score.
5.4. When you see the Subject field, add the probability of each word in it similarly: for the spam score, add each word’s probability from the spam data structure; for the nonspam score, add each word’s probability from the nonspam data structure.
5.5. Same deal for the body of the email.
5.6. At the end of all that processing, you have a score for spam and a score for nonspam. Classify the email as whichever one has the higher score. A sketch of the scoring follows this list.
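Here’s a sketch of the scoring. One thing the steps above don’t spell out is what to do with a word that never showed up in training; in this sketch I’m assuming it just adds 0 to that score:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Add up the probabilities of every word in text under one model.
    double score(const std::string& text,
                 const std::map<std::string, double>& prob) {
        double total = 0;
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            auto it = prob.find(word);
            if (it != prob.end())
                total += it->second;          // unseen words add nothing
        }
        return total;
    }

    int main() {
        std::map<std::string, double> spamProb, nonspamProb;
        spamProb["stock"] = 0.4;              // made-up trained probabilities
        nonspamProb["assignment"] = 0.38;
        std::string text = "new stock tips";  // subject and body joined together
        bool isSpam = score(text, spamProb) > score(text, nonspamProb);
        std::cout << (isSpam ? "spam" : "nonspam") << "\n";   // spam
    }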
6. Large-scale testing
6.1. You’ll load a testing file (same format as the training file) and classify each email as spam or nonspam. Keep track of how many times you were correct or incorrect for spam and for nonspam. The zip file with all of the testing emails and all of the index files is now available.
6.2. Output two different evaluations (a sketch follows this list):
6.2.1. Accuracy: 100% * (correct_spam + correct_nonspam) / (correct_spam + correct_nonspam + incorrect_spam + incorrect_nonspam)
6.2.2. Also output, for the spam category, the numbers and percentages correct and incorrect. Do the same for the nonspam category.
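Here’s a sketch of the two evaluations with made-up tallies:

    #include <iostream>

    int main() {
        // Suppose the testing loop produced these tallies (made-up numbers).
        int correct_spam = 80, incorrect_spam = 20;
        int correct_nonspam = 90, incorrect_nonspam = 10;

        int total = correct_spam + correct_nonspam
                  + incorrect_spam + incorrect_nonspam;
        double accuracy = 100.0 * (correct_spam + correct_nonspam) / total;
        std::cout << "Accuracy: " << accuracy << "%\n";       // 85%

        // Per-category breakdown (shown for spam; do nonspam the same way).
        std::cout << "Spam: " << correct_spam << " correct ("
                  << 100.0 * correct_spam / (correct_spam + incorrect_spam)
                  << "%), " << incorrect_spam << " incorrect\n";
    }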
Notes
•Some emails actually have multiple body parts: you might have a plain-text version of the email and an HTML version, and the email client chooses which to show. You don’t have to worry about this for your project; once the header ends, you can just pretend that everything else is body text. Sure, it won’t be as accurate, but it should work 90% as well and take a lot less effort.
•You can assume that the header section ends when you see a blank line (it might have some whitespace on it; you’ll have to check and debug).
•Here are some functions that should help you:
•For chopping up each line with string, use a combination of string.find and string.substr
•For chopping up each line with char[], use a combination of strstr (for finding a string inside another) and strncpy (to copy only a portion of the string into a new string)
•For tokenizing, use istringstream for string and strtok for char[]
•For storing the frequencies, use map<string, int> (example of this on p861)
•I won’t give you the testing data at first (you can just use the training data files for testing too at first, it should classify very well)
•If you want to lowercase all of the words, that’s fine. If you do, just as a learning exercise, I suggest comparing the accuracy with lowercasing vs. without.
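If you do lowercase, here’s a sketch of a helper using transform from <algorithm> (the function name is made up):

    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <string>

    // Lowercase a copy of a word so "Stock" and "stock" share one map key.
    std::string lowercase(std::string word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return word;
    }

    int main() {
        std::cout << lowercase("Enlarge") << "\n";  // enlarge
    }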