Department of Electrical and Computer Engineering
Assistant Professor Gregory R. Kriehn
FC6 spambayes
spambayes is a statistical Bayesian anti-spam filter, initially based on the work of Paul Graham and his technical paper "A Plan for Spam". It is currently my preferred anti-spam filter, as it allows for "ham", "spam", and "unsure" classifications. I have never had a problem with false positives using spambayes, although once in a blue moon an e-mail will fall into the unsure classification when it is supposed to be ham (the rest of the messages that end up with the unsure classification are actually spam). For additional information with regard to spambayes, see the background information provided on the spambayes webpage.

At this point, I assume that you already have your e-mail properly set up. If not, I strongly suggest reading the sendmail and fetchmail pages. As a review, I have chosen to set up my e-mail by using fetchmail to poll the University's mail server once a minute. Each message is then forwarded via formail, one e-mail at a time, through procmail, which launches python to run the spambayes script sb_filter.py to classify the e-mail as ham, spam, or unsure. fetchmail was discussed in detail on the fetchmail page (and to a lesser degree, formail and procmail as well), so spambayes will be primarily discussed here, although we will have to do some further editing of the ~/.procmailrc file.

Install spambayes
The first step in the installation process is to download the tar package, which can be found at:

http://spambayes.sourceforge.net/download.html
 
Scroll down to the spambayes-1.0.4.tar.gz file, and click on it for download. Because there is no rpm available, we will have to install the application by hand. The standard place to put user-installed packages is in /usr/local/, and I like to place source code in a directory called /usr/local/src/[application]. Let's do so now:

~> sudo mkdir /usr/local/src/spambayes
Copy the file over to the /usr/local/src/spambayes directory and change into the directory:
~> sudo cp ~/Desktop/spambayes-1.0.4.tar.gz /usr/local/src/spambayes/.
~> cd /usr/local/src/spambayes
The next step is to unzip and untar the package and delete the source file:
~> sudo tar vfzx spambayes-1.0.4.tar.gz
~> sudo rm spambayes-1.0.4.tar.gz
You should see the package unzip and untar itself, creating a subdirectory called spambayes-1.0.4. Unfortunately, spambayes typically has its ownership and permissions set incorrectly (which creates a number of security holes), so the first thing I like to do is fix the ownership:
~> sudo chown -R root:root spambayes-1.0.4
This command will recursively change the ownership of the files to root. The next thing to do, which is a bit more tedious, is to correct the permissions. All files, except those that are executable, should give the owner read/write access and give group and other members read-only access (rw-r--r--). Executable access should be granted only to executable files and to subdirectories. This can be done by recursively clearing all permissions, granting read access to everyone and write access to the owner, and then correcting the subdirectories and executable files:
~> sudo chmod -R 000 spambayes-1.0.4
~> sudo chmod -R ugo+r spambayes-1.0.4
~> sudo chmod -R u+w spambayes-1.0.4
Since spambayes-1.0.4 is a subdirectory, we need to make it executable so that we can use the cd command to change into the directory:
~> sudo chmod ugo+x spambayes-1.0.4
~> cd spambayes-1.0.4
Next, use ls to look at the contents of the spambayes-1.0.4 subdirectory and repeat the process of changing all of the listed subdirectories (and subsequent subdirectories) to have executable access using the "sudo chmod ugo+x [filename]" command. Any files that have a .sh extension (I believe there are 4 of them scattered throughout the subdirectories of spambayes-1.0.4) should be granted executable access as well, as they are executable shell scripts.
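The manual walk through the subdirectories can also be scripted with find. The following is a minimal sketch and not part of the procedure I originally used: the function name fix_tree_perms is made up for illustration, and it assumes that directories and files ending in .sh are the only things in the tree that need executable access.

```shell
# Hypothetical helper: normalise permissions on an unpacked source tree.
# Assumption: only directories and *.sh files need execute access.
fix_tree_perms() {
    tree="$1"
    # Directories: rwxr-xr-x so they can be entered and listed
    find "$tree" -type d -exec chmod 755 {} +
    # Regular files: rw-r--r--
    find "$tree" -type f -exec chmod 644 {} +
    # Shell scripts: rwxr-xr-x
    find "$tree" -type f -name '*.sh' -exec chmod 755 {} +
}

# Usage, from /usr/local/src/spambayes in a root shell:
#   fix_tree_perms spambayes-1.0.4
```

Because sudo cannot call a shell function directly, either run the function from a root shell or type the three find commands individually with sudo in front of each.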

Once this process is complete, use cd to change into the spambayes-1.0.4 subdirectory (if you are not already there). Using the ls command, you should notice a python script called setup.py. spambayes can now be installed with the following command:
~> sudo python setup.py install
Setup an e-mail training schedule
With spambayes installed, it is time to set up a cron job so that spambayes can train on ham and spam every night (well, I have it set to train every morning at 4:45 am). This is done using the sb_mboxtrain.py script. Edit /etc/crontab and add the following at the end of the file:

45 4 * * * root python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_mboxtrain.py -d /home/[user]/.hammiedb -g /home/[user]/[path]/[to]/[local]/[mail]/[inbox] -s /home/[user]/[path]/[to]/[local]/[spam]/[box] >> /var/log/spambayes/spambayes.log
Please note that all of this information needs to span a single line in the /etc/crontab file. The -d option provides the location of the database that will be generated as you train your e-mail, which should be located in /home/[user]/.hammiedb. The -g option means that you are training e-mail in the subsequent inbox location as ham, and the -s option trains e-mail in the subsequent location as spam. The output of this process will be stored in a log file located in /var/log/spambayes/spambayes.log. As spambayes is first learning how to distinguish ham from spam, you will need to be very careful to dump/move e-mail to the appropriate location based upon the e-mail client you happen to be using. I tend to favor evolution, so this section of my /etc/crontab file looks like:
45 4 * * * root python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_mboxtrain.py -d /home/kriehn/.hammiedb -g /home/kriehn/.evolution/mail/local/Inbox -s /home/kriehn/.evolution/mail/local/Inbox.sbd/Spam >> /var/log/spambayes/spambayes.log
Ham is assumed to be located in evolution's Inbox, while spam is located in a Spam folder. For a while, you will have to sort your messages by hand, until spambayes is able to distinguish your true ham from spam. You can also train on old e-mail messages lying around to speed things up; see the spambayes Linux webpage for additional information.
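If you do not want to wait for the first nightly run, the same training command can be run once by hand. The following is a minimal sketch under a few assumptions: the helper name train_once and the SB_TRAIN variable are made up for illustration, and the trainer path and the -d, -g, and -s options are simply copied from the cron entry above.

```shell
# Trainer command copied from the cron entry; override SB_TRAIN if yours differs.
SB_TRAIN="${SB_TRAIN:-python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_mboxtrain.py}"

# Hypothetical helper: train once on a ham mailbox and a spam mailbox.
# Usage: train_once DATABASE HAM_MBOX SPAM_MBOX
train_once() {
    db="$1"; ham="$2"; spam="$3"
    # Refuse to run if either mailbox is missing, rather than train on nothing.
    for box in "$ham" "$spam"; do
        if [ ! -f "$box" ]; then
            echo "train_once: no such mailbox: $box" >&2
            return 1
        fi
    done
    # Same -d/-g/-s options as the nightly cron job.
    $SB_TRAIN -d "$db" -g "$ham" -s "$spam"
}

# Example with the evolution paths used above:
#   train_once /home/kriehn/.hammiedb \
#       /home/kriehn/.evolution/mail/local/Inbox \
#       /home/kriehn/.evolution/mail/local/Inbox.sbd/Spam
```

Run it as the same user that the cron job uses so that the database file ends up with the same ownership.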

Next, create the /var/log/spambayes directory:
~> sudo mkdir /var/log/spambayes
Let's also set up a log rotation file /etc/logrotate.d/spambayes that contains the following:
/var/log/spambayes/spambayes.log {
        notifempty
        weekly
        missingok
        rotate 4
}
Setup formail via ~/.forward
With spambayes fully installed, the next step is to verify that you have a .forward file in your home directory (/home/[user]/.forward) so that incoming e-mail is forwarded through procmail by formail. Make sure that you have the following in your .forward file:
"|exec /usr/bin/procmail -f-||exit 75 #[user]"
Setup procmail and ~/.procmailrc
When procmail is launched, it checks for a local .procmailrc configuration file located in your home directory. It is here that we will tell procmail to run the appropriate spambayes script sb_filter.py to filter incoming e-mail. Edit the /home/[user]/.procmailrc file that was created while setting up fetchmail, and add the following information:
# Next may be needed if you invoke programs from your procmailrc
# Details in Check Your $SHELL and $PATH in Troubleshooting below
SHELL=/bin/sh

# Directory for storing procmail configuration and log files
# You can name this environment variable anything you like
# (for example PROCMAILDIR) or, if you prefer, don't set it
# (but then don't refer to it!)
PMDIR=$HOME

# Put ## before LOGFILE if you want no logging (not recommended)
LOGFILE=$HOME/.maillog

# To insert a blank line between each message's log entry,
# uncomment next two lines (this is helpful for debugging)
LOG="
"

# Set to yes when debugging
VERBOSE=no

# Remove ## when debugging; set to no if you want minimal logging
## LOGABSTRACT=all

# Replace $HOME/mail with your mailbox directory
# Mutt and elm use $HOME/Mail
# Pine uses $HOME/mail
# Netscape Messenger uses $HOME/nsmail
# Some NNTP clients, such as slrn & nn, use $HOME/News
# Mailboxes in maildir format are often put in $HOME/Maildir
# NOTE: Upon reading the next line, Procmail does a chdir to $MAILDIR
#       and relative paths are relative to $MAILDIR

MAILDIR=$HOME/.evolution/mail/local/Inbox.sbd/ #Make sure this directory exists!
DEFAULT=/var/spool/mail/[user]

:0 fw:hamlock
| python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_filter.py -d /home/[user]/.hammiedb

# Was it spam?
:0
* ^X-Spambayes-Classification: spam
${MAILDIR}/Spam

# Unsure?
:0
* ^X-Spambayes-Classification: unsure
${MAILDIR}/Unsure

# Put everything else in the INBOX
:0:
$DEFAULT
Note that "| python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_filter.py -d /home/[user]/.hammiedb" should span a single line.  

Two mail directories are defined here: MAILDIR, which contains the location of the Inbox of my e-mail client (Evolution), and DEFAULT, which contains the location of the mail spool (/var/spool/mail/[user]). As e-mail is fetched from the university server, procmail pipes each message through the spambayes script sb_filter.py using python, which adds a metatag to the e-mail header indicating whether it thinks the message is "ham", "spam", or "unsure". For example, if it looks at the e-mail and thinks it is ham, the following metatag will be used:
X-Spambayes-Classification: ham;
With a metatag placed in every e-mail, procmail then checks to see if the tag contains the "spam" or "unsure" keyword. If so, procmail will dump the e-mail into the $MAILDIR/Spam or $MAILDIR/Unsure folder within my e-mail client. All other mail is placed back into /var/spool/mail/[user]. This way, I have pre-sorted my incoming mail, so that when I take a look at my mail spool directory, all of the spam has already been removed. I find this to be especially useful when I have to log onto my server remotely to check my mail using pine: the spam has already been removed and I no longer have to sort through it manually.
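The routing decision made by these recipes can be sketched outside of procmail, which is handy for checking a saved message by hand. This is only an illustration and not something procmail runs: the function name route_message is made up, it matches case-sensitively (procmail conditions are case-insensitive by default), and it prints the word DEFAULT instead of delivering anything.

```shell
# Hypothetical helper: read one message on stdin and print the folder the
# recipes above would choose. Only the header block (everything up to the
# first blank line) is examined, which is what procmail does by default.
route_message() {
    sed '/^$/q' | awk '
        /^X-Spambayes-Classification: spam/   { print "Spam";   found = 1; exit }
        /^X-Spambayes-Classification: unsure/ { print "Unsure"; found = 1; exit }
        END { if (!found) print "DEFAULT" }
    '
}

# Usage (saved-message.txt is a placeholder for any message saved to disk):
#   route_message < saved-message.txt
```

A message tagged as ham, or one with no tag at all, falls through both tests and goes to DEFAULT, just as it does in the last recipe of the .procmailrc file.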

E-Mail Client Message Filters
As an alternative, after the metatags have been placed in every e-mail, you could just dump everything back to /var/spool/mail/[user] and set up your e-mail client to sort incoming e-mail via message filters. When your e-mail client picks up mail from the mail spool, the filters would then move e-mail into the local Inbox folder, a Spam folder, or an Unsure folder. The disadvantage of this, however, is that you must run your e-mail client before your e-mail can be sorted.

If you want to do this using evolution, click on Edit -> Message Filters to set up a spam filter and an unsure filter. Click on Add, and type in "Spam" for the Search name. Next choose Specific header from the drop down box, and type in "X-Spambayes-Classification" in the box adjacent to it. Choose contains from the next drop down box and type in "spam" in the box adjacent to it. Choose Move to Folder from the third drop down box, and choose your Spam directory after that. Then hit OK. You have just set up a rule that will check retrieved messages for the metatag X-Spambayes-Classification spam, and if it finds it, it will move the e-mail into your Spam folder. Set up a similar rule for the Unsure folder. The metatag you should be looking for is X-Spambayes-Classification unsure. Any e-mail that does not match these two rules is treated as ham and will be dumped to your local Inbox.

The nice thing about spambayes is that with the nightly cron job set up, if an e-mail is misclassified, you can move the e-mail to the appropriate folder, and at 4:45 am, spambayes will re-train on the Inbox and Spam folders and correct itself from any prior mistakes. You can check to see if an e-mail has been trained by spambayes by looking for the X-Spambayes-Trained metatag. After a few days, you will find that spambayes makes very few mistakes in its initial classification of the e-mail. The biggest thing that you have to monitor is the Unsure folder, as it is important to move the e-mails in this folder either to your Inbox or Spam folder, depending on whether the e-mail is ham or spam. This way, spambayes can continue to refine its adaptive algorithm to train on the different types of e-mail you are receiving.
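To keep an eye on the Unsure folder, or to see how many messages carry the trained metatag, a rough count can be taken straight from the mailbox file. This is a minimal sketch that assumes the folders are plain mbox files, as the evolution paths above suggest. The function names are made up for illustration, and the counts are approximate because any body line that happens to start with the same text is counted too.

```shell
# Hypothetical helper: count messages in an mbox file by counting the
# "From " separator lines that begin each message.
count_messages() {
    grep -c '^From ' "$1"
}

# Hypothetical helper: count lines carrying the X-Spambayes-Trained metatag.
count_trained() {
    grep -c '^X-Spambayes-Trained:' "$1"
}

# Usage with the folder layout from the .procmailrc file above:
#   count_messages "$HOME/.evolution/mail/local/Inbox.sbd/Unsure"
#   count_trained  "$HOME/.evolution/mail/local/Inbox"
```

If count_messages reports anything other than 0 for the Unsure folder, there is mail waiting to be sorted by hand.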

No more presidents of Zimbabwe offering $10,000,000.00 if you provide them with your bank account information. No more male augmentation pills, rol3x watch3z, and all that other junk overwhelming your Inbox. Life is good.