spambayes is a statistical Bayesian anti-spam filter, initially based on the work of Paul Graham and his technical paper "A Plan for Spam". It is currently my preferred anti-spam filter, as it allows for "ham", "spam", and "unsure" classifications. I have never had a problem with false positives using spambayes, although once in a blue moon an e-mail will fall into the unsure classification when it is supposed to be ham (the rest of the messages that end up with the unsure classification are actually spam). For additional information with regard to spambayes, see the background information provided on the spambayes webpage.
At this point, I assume that you already have your e-mail properly set up. If not, I strongly suggest reading the sendmail and fetchmail pages. As a review, I have chosen to set up my e-mail by using fetchmail to poll the University's mail server once a minute. Each message is then forwarded via formail, one e-mail at a time, through procmail, which launches python to run the spambayes script sb_filter.py to classify the e-mail as ham, spam, or unsure. fetchmail was discussed in detail on the fetchmail page (and to a lesser degree, formail and procmail as well), so this page focuses on spambayes, although we will have to do some further editing of the ~/.procmailrc file.
Install spambayes
The first step in the installation process is to download the tar package, which can be found at:
Scroll down to the spambayes-1.0.4.tar.gz file, and click on it for download. Because there is no rpm available, we will have to install the application by hand. The standard place to put user-installed packages is in /usr/local/, and I like to place source code in a directory called /usr/local/src/[application]. Let's do so now:
~> sudo mkdir /usr/local/src/spambayes
Copy the file over to the /usr/local/src/spambayes directory and change into the directory:
~> sudo cp ~/Desktop/spambayes-1.0.4.tar.gz /usr/local/src/spambayes/.
~> cd /usr/local/src/spambayes
The next step is to unzip and untar the package and delete the source file:
~> sudo tar vfzx spambayes-1.0.4.tar.gz
~> sudo rm spambayes-1.0.4.tar.gz
You should see the package unzip and untar itself, creating a subdirectory called spambayes-1.0.4.
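If you would rather confirm what the archive will create before extracting it (and before deleting the tarball), the listing mode of tar can help. The following is a minimal sketch; tar_top_level is a hypothetical helper name of my own, not something shipped with spambayes.

```shell
# Hypothetical helper: list the distinct top-level entries inside a
# gzip-compressed tarball without extracting anything.
tar_top_level() {
    # t = list contents, z = gunzip, f = read from the named file
    tar tzf "$1" | cut -d/ -f1 | sort -u
}
```

Running tar_top_level spambayes-1.0.4.tar.gz should print a single line if the archive unpacks into one subdirectory, as described above.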
Unfortunately, spambayes typically has its ownership and permissions set incorrectly (which creates a number of security holes), so the first thing I like to do is fix the ownership:
~> sudo chown -R root:root spambayes-1.0.4
This command recursively changes the ownership of the files to root. The next thing to do, which is a bit more tedious, is to correct the permissions. All files, except those that are executable, should allow read/write access for the user and read-only access for group and other members. Executable access should be granted only to executable files and to subdirectories. This can be done by recursively changing everything to read/write, read, read access, and then correcting the subdirectories and executable files:
~> sudo chmod -R 000 spambayes-1.0.4
~> sudo chmod -R ugo+r spambayes-1.0.4
~> sudo chmod -R u+w spambayes-1.0.4
Since spambayes-1.0.4 is a subdirectory, we need to make it executable so that we can use the cd command to change into the directory:
~> sudo chmod ugo+x spambayes-1.0.4
~> cd spambayes-1.0.4
Next, use ls to look at the contents of the spambayes-1.0.4 subdirectory and repeat the process of changing all of the listed subdirectories (and their subdirectories) to have executable access using the "sudo chmod ugo+x [filename]" command. Any files that have a .sh extension (I believe there are 4 of them scattered throughout the subdirectories of spambayes-1.0.4) should be granted executable access as well, as they are executable shell scripts.
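The per-directory chmod pass described above can be approximated in one go with find. This is a minimal sketch rather than part of the original procedure: fix_perms is a hypothetical helper name, and the numeric modes 755 and 644 are my reading of the "read/write, read, read" scheme, with execute added only for directories and .sh files.

```shell
# Hypothetical helper: normalise permissions under a source tree.
# Usage: fix_perms DIR
fix_perms() {
    tree="$1"
    # Directories need the execute bit so they can be entered: rwxr-xr-x
    find "$tree" -type d -exec chmod 755 {} +
    # Regular files get read/write for the owner, read for everyone else: rw-r--r--
    find "$tree" -type f -exec chmod 644 {} +
    # Shell scripts are the only files given execute access: rwxr-xr-x
    find "$tree" -type f -name '*.sh' -exec chmod 755 {} +
}
```

Because the files are owned by root at this point, the function would have to run in a root shell. Alternatively, the three find commands can be typed one at a time with sudo in front, using spambayes-1.0.4 in place of "$tree".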
Once this process is complete, use cd to change into the spambayes-1.0.4 subdirectory. Using the ls command, you should notice a python script called setup.py. spambayes can now be installed with the following command:
~> sudo python setup.py install
Setup an e-mail training schedule
With spambayes installed, it is time to set up a cron job so that spambayes can train on ham and spam every night (well, I have it set to train every morning at 4:45 am). This is done using the sb_mboxtrain.py script. Edit /etc/crontab and add the following at the end of the file:
45 4 * * * root python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_mboxtrain.py -d /home/[user]/.hammiedb -g /home/[user]/[path]/[to]/[local]/[mail]/[inbox] -s /home/[user]/[path]/[to]/[local]/[spam]/[box] >> /var/log/spambayes/spambayes.log
Please note that all of this information needs to span a single line in the /etc/crontab file. The -d option provides the location of the database that will be generated as you train your e-mail, which should be located in /home/[user]/.hammiedb. The -g option means that you are training e-mail in the subsequent inbox location as ham, and the -s option trains e-mail in the subsequent location as spam. The output of this process will be stored in a log file located in /var/log/spambayes/spambayes.log.
As spambayes is first learning how to distinguish ham from spam, you will need to be very careful to move e-mail to the appropriate location for the e-mail client you happen to be using. I tend to favor evolution, so this section of my /etc/crontab file looks like:
45 4 * * * root python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_mboxtrain.py -d /home/kriehn/.hammiedb -g /home/kriehn/.evolution/mail/local/Inbox -s /home/kriehn/.evolution/mail/local/Inbox.sbd/Spam >> /var/log/spambayes/spambayes.log
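Before the first scheduled run, it can be worth confirming that the ham and spam folders named in the crontab entry exist and actually contain mail. The sketch below assumes the folders are stored in mbox format, where each message starts with a line beginning "From "; count_mbox is a hypothetical helper name, and the count is only approximate.

```shell
# Hypothetical helper: count messages in an mbox file by counting the
# "From " separator lines. Prints 0 if the file does not exist.
count_mbox() {
    if [ -f "$1" ]; then
        # grep -c exits non-zero when the count is 0, so mask that status
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}
```

For example, count_mbox /home/[user]/[path]/[to]/[local]/[spam]/[box] printing 0 would suggest that the -s path is wrong or that no spam has been moved there yet.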
Ham is assumed to be located in evolution's Inbox, while spam is located in a Spam folder. For a while, you will have to sort your messages by hand, until spambayes is able to distinguish your true ham from spam. You can train off old e-mail messages lying around as well to speed things up; see the spambayes Linux webpage for additional information.
Next, create the /var/log/spambayes directory:
~> sudo mkdir /var/log/spambayes
Let's also set up a log rotation file /etc/logrotate.d/spambayes that contains the following:
/var/log/spambayes/spambayes.log {
    notifempty
    weekly
    missingok
    rotate 4
}
Setup formail via ~/.forward
With spambayes fully installed, the next step is to verify that you have a .forward file in your home directory (/home/[user]/.forward) so that incoming e-mail is forwarded through procmail by formail. Make sure that you have the following in your .forward file:
"|exec /usr/bin/procmail -f-||exit 75 #[user]"
Setup procmail and ~/.procmailrc
When procmail is launched, it checks for a local .procmailrc configuration file located in your home directory. It is here that we will tell procmail to run the appropriate spambayes script sb_filter.py to filter incoming e-mail. Edit the /home/[user]/.procmailrc file that was created while setting up fetchmail, and add the following information:
# Next may be needed if you invoke programs from your procmailrc
# Details in Check Your $SHELL and $PATH in Troubleshooting below
SHELL=/bin/sh
# Directory for storing procmail configuration and log files
# You can name this environment variable anything you like
# (for example PROCMAILDIR) or, if you prefer, don't set it
# (but then don't refer to it!)
PMDIR=$HOME
# Put ## before LOGFILE if you want no logging (not recommended)
LOGFILE=$HOME/.maillog
# To insert a blank line between each message's log entry,
# uncomment next two lines (this is helpful for debugging)
LOG="
"
# Set to yes when debugging
VERBOSE=no
# Remove ## when debugging; set to no if you want minimal logging
## LOGABSTRACT=all
# Replace $HOME/mail with your mailbox directory
# Mutt and elm use $HOME/Mail
# Pine uses $HOME/mail
# Netscape Messenger uses $HOME/nsmail
# Some NNTP clients, such as slrn & nn, use $HOME/News
# Mailboxes in maildir format are often put in $HOME/Maildir
# NOTE: Upon reading the next line, Procmail does a chdir to $MAILDIR
# and relative paths are relative to $MAILDIR
MAILDIR=$HOME/.evolution/mail/local/Inbox.sbd/ #Make sure this directory exists!
DEFAULT=/var/spool/mail/[user]
:0 fw:hamlock
| python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_filter.py -d /home/[user]/.hammiedb
# Was it spam?
:0
* ^X-Spambayes-Classification: spam
${MAILDIR}/Spam
# Unsure?
:0
* ^X-Spambayes-Classification: unsure
${MAILDIR}/Unsure
# Put everything else in the INBOX
:0:
$DEFAULT
Note that "| python /usr/local/src/spambayes/spambayes-1.0.4/scripts/sb_filter.py -d /home/[user]/.hammiedb" should span a single line.
Two mail locations are defined here: MAILDIR, which points to the directory holding the subfolders of the Inbox in my e-mail client (Evolution), and DEFAULT, which contains the location of the mail spool (/var/spool/mail/[user]). As e-mail is fetched from the university server, procmail forwards the mail through the spambayes script sb_filter.py using python, which adds a metatag to the e-mail header indicating whether it thinks the message is "ham" or "spam", or whether it is "unsure". For example, if it looks at the e-mail and thinks it is ham, the following metatag will be used:
X-Spambayes-Classification: ham;
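To check by hand how a single saved message was tagged, the header block can be scanned for this metatag. The following is a minimal sketch: classification is a hypothetical helper name, it reads one message on standard input, and it assumes the verdict is the word that follows the header name, up to the semicolon. Folded (multi-line) headers are not handled.

```shell
# Hypothetical helper: print the X-Spambayes-Classification verdict of
# one message read from standard input (prints nothing if untagged).
classification() {
    awk '
        /^$/ { exit }    # a blank line ends the header block
        tolower($0) ~ /^x-spambayes-classification:/ {
            sub(/^[^:]*:[ \t]*/, "")    # drop the header name
            sub(/;.*$/, "")             # drop anything after the semicolon
            print
            exit
        }
    '
}
```

For example, classification < message.txt would print ham, spam, or unsure for a message that has passed through sb_filter.py, where message.txt is a placeholder for a file holding a single e-mail.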
With a metatag placed in every e-mail, procmail then checks whether the tag contains the "spam" or "unsure" keyword. If so, procmail will dump the e-mail into the $MAILDIR/Spam or $MAILDIR/Unsure folder within my e-mail client. All other mail is placed back into /var/spool/mail/[user]. This way, I have pre-sorted my incoming mail, so that when I take a look at my mail spool directory, all of the spam has already been removed. I find this to be especially useful when I have to log onto my server remotely to check my mail using pine: the spam has already been removed and I no longer have to sort through it manually.
E-Mail Client Message Filters
As an alternative, after the metatags have been placed in every e-mail, you could just dump everything back to /var/spool/mail/[user] and set up your e-mail client to sort incoming e-mail via message filters. When your e-mail client picks up mail from the mail spool, the filters would then move e-mail into the local Inbox folder, a Spam folder, or an Unsure folder. The disadvantage of this, however, is that you must run your e-mail client before your e-mail can be sorted.
If you want to do this using evolution, click on Edit -> Message Filters to set up a spam and unsure filter. Then click on Add, and type in "Spam" for the Search name. Next choose Specific header from the drop down box, and type in "X-Spambayes-Classification" in the box adjacent to it. Choose contains from the next drop down box and type in "spam" in the box adjacent to it. Choose Move to Folder from the third drop down box, and choose your Spam directory after that. Then hit OK. You have just set up a rule that will check retrieved messages for an X-Spambayes-Classification metatag containing "spam", and if it finds one, it will move the e-mail into your Spam folder. Set up a similar rule for the Unsure folder, looking for "unsure" in the same metatag. Any e-mail that does not meet these two rules is ham, and will be dumped to your local Inbox.
The nice thing about spambayes is that with the nightly cron job set up, if an e-mail is misclassified, you can move the e-mail to the appropriate folder, and at 4:45 am, spambayes will re-train off the Inbox and Spam folders and correct itself from any prior mistakes. You can check to see if an e-mail has been trained by spambayes by looking for the X-Spambayes-Trained metatag. After a few days, you will find that spambayes makes very few mistakes in its initial classification of the e-mail. The biggest thing that you have to monitor is the Unsure folder, as it is important to move the e-mails in this folder to either your Inbox or your Spam folder, depending on whether the e-mail is ham or spam. This way, spambayes can continue to refine its adaptive algorithm to train on the different types of e-mail you are receiving.
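A rough way to see whether the nightly run has caught up with a folder is to compare the number of messages with the number of X-Spambayes-Trained metatags. This is only a sketch under two assumptions: the folder is an mbox file, and each trained message carries one header line starting with "X-Spambayes-Trained:". The helper name untrained_count is hypothetical, and a message body that happens to contain a matching line would skew the count.

```shell
# Hypothetical helper: estimate how many messages in an mbox file have
# not yet been trained on. Returns 1 if the file does not exist.
untrained_count() {
    [ -f "$1" ] || { echo "no such mbox: $1" >&2; return 1; }
    # grep -c exits non-zero on a zero count, so mask that status
    total=$(grep -c '^From ' "$1" || true)
    trained=$(grep -c '^X-Spambayes-Trained:' "$1" || true)
    echo $((total - trained))
}
```

A result of 0 for both the Inbox and the Spam folder the morning after moving a misclassified message would suggest that the cron job picked it up.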
No more presidents of Zimbabwe offering $10,000,000.00 if you provide them with your bank account information. No more male augmentation pills, rol3x watch3z, and all that other junk overwhelming your Inbox. Life is good.


