Deleting spam from a POP3 server
For most of my business-related mail, I use Outlook 2000 on a Windows 2000 box. I had installed the Spambayes client for Outlook 2000 to combat increasing amounts of spam, all stemming from one stupid email to a Usenet group. While Spambayes did a wonderful job of detecting spam, I still had to download the garbage to my Windows client before it could be detected. Also, when I checked my personal email via the web, I had difficulty seeing the real mail amongst all the spam.
What I needed was a POP3 client that would check the POP server
every so often, and delete the spam before I even saw it. Since the
hydrus web and mail server was up 24x7, I had a platform to support
this putative POP3 spam purger. I searched on the web, but couldn't
find a ready-made solution. [N.B. I subsequently noticed that
Spambayes came with a sb_imapfilter.py script, which
offered the same kind of facilities I was after].
To start development and testing, I downloaded Spambayes onto my
Debian-based desktop machine. The installation was slightly
complicated because it required the Python distutils
modules. These are not installed as part of the standard Debian
Python package, but come with the additional python23-dev
package.
Writing a Python script to act as a POP3 client seemed easy enough,
using the Python-supplied poplib module; the key was how
easy it would be to integrate the spam detection capabilities of
Spambayes.
As it turned out, it was remarkably easy; a testament to Python and the developers of Spambayes. I thought that it would not be possible to copy the training database from Windows to UNIX and use it. I was wrong - it worked just fine. For the purposes of research, I did create a training database from scratch using the existing set of ham and spam messages from Outlook. See below.
The first time I tested the POP client, I found that the percentages reported by Spambayes were not as high as I expected. I tracked this down to the fact that the downloaded mails did not begin with the "From" line that starts each message in a UNIX mailbox. Once I forced this line onto the front of each mail, before passing it to Spambayes, the scores were more reasonable.
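The fix amounts to prepending one line to the text retrieved from the server. Here is a minimal sketch of that step on its own, written for a current Python 3 rather than the Python 2.3 used for the script on this page. The function name with_mbox_from_line and its parameters are my own for illustration; the "From xxx" placeholder sender is the one the script uses.

```python
import time


def with_mbox_from_line(lines, sender="xxx", when=None):
    """Join the lines of a retrieved message and prepend a "From " line.

    `lines` is the list of lines returned by poplib's retr() (bytes in
    Python 3). `when` is a time in seconds since the epoch; None means now.
    """
    stamp = time.asctime(time.localtime(when))
    # Decode bytes as latin-1 so arbitrary 8-bit mail never raises here.
    text_lines = [
        ln.decode("latin-1") if isinstance(ln, bytes) else ln for ln in lines
    ]
    return "\n".join(["From %s %s" % (sender, stamp)] + text_lines)
```

The returned string is what would then be handed to the Spambayes scorer.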
The next task was to port the solution to FreeBSD. There didn't
seem to be a port for Spambayes, so I copied across the installation
tar file from the Debian machine. No problems this time with
missing distutils modules. I copied the spam database I
had created under Debian, and started testing. One immediate
problem surfaced; Python could not open the spam database,
complaining that the module _bsddb could not be found. It
appeared I needed to install another port to support the BSD
database formats - databases/py-bsddb, which offers a set
of Python wrappers for the BSD databases. Note this is an older
version of the Python wrappers; the newer version can be found in
databases/py-bsddb3. The database was in the older format
because I had used Python 2.3 on Debian.
Once the Python script seemed to work, all that remained was to create a crontab to run it at regular intervals (every 4 hours to start with).
To Be Done
- Beef up the error detection and reporting.
- The spam database is likely to drift out of date over time. I suspect I can fix this by periodically copying the latest version of the training database to crimson.
- Deletion of spam may be a bit drastic. Rewrite to use the IMAP protocol to move spam to a folder on the server. (Note that the Spambayes script sb_imapfilter.py seems to offer just this capability.)
- Use some sort of encryption to hide the username and password.
Popdespam.py source code
Here's the current source code for the popdespam.py script. Help yourself if you find it useful.
#!/usr/local/bin/python
"""
NAME
    popdespam.py

SYNOPSIS
    popdespam.py [-n] [-v]
        -n  do not actually delete any spam
        -v  verbose mode

DESCRIPTION
    Deletes spam from a POP3 server, using scoring
    provided by Spambayes. A summary of the messages found and
    deleted is displayed.
"""
import getopt
import time
import poplib
import sys
from spambayes import hammie

#########################################
# key parameter settings
#########################################
server=""        # name of pop server
username=""      # pop server username
password=""      # pop server password
spamdb=""        # location of spambayes database
verbose=False
dodelete=True
seen_file=""     # file to store last message number seen
max_size=100000  # ignore messages greater than this number of bytes
#########################################

# read command line options (if any)
try:
    opts,args = getopt.getopt(sys.argv[1:],'vn')
    for o,v in opts:
        if o == '-v':
            verbose = True
        elif o == '-n':
            dodelete = False
except getopt.GetoptError,e:
    print "%s: illegal argument: %s" % (sys.argv[0],e.opt)
    sys.exit(1)

h = hammie.open(spamdb)
p = poplib.POP3(server)
status = p.user(username)
if verbose: print status
status = p.pass_(password)
if verbose: print status
stat = p.stat()
if verbose: print "# Messages:",stat[0],"; # bytes",stat[1]
nmsgs = stat[0]

# get highest message seen in last run
try:
    seen = int(open(seen_file).read())
except IOError:
    seen = 0

# if # of messages in mailbox now is less than seen, assume mailbox
# has been emptied since last run, therefore all messages must be
# scanned
if nmsgs < seen:
    seen = 0
if verbose: print "last seen:",seen

msg_list = p.list()[1]
if verbose: print msg_list

ndel = 0
current_time = time.asctime(time.localtime(time.time()))
try:
    for i in msg_list:
        # each entry is "<message number> <size in bytes>"
        msg_index = i.split()
        i = int(msg_index[0])
        size = int(msg_index[1])
        if i <= seen: continue
        if size > max_size: continue
        msgt = p.retr(i)
        if verbose: print "Message:",i,"; # bytes:",msgt[2],
        # force a "From" line onto the front of the message
        msg = "From xxx "+current_time
        for line in msgt[1]:
            msg = msg+"\n"+line
        spamprob = h.score(msg)
        if verbose: print "Score:",spamprob
        if spamprob>.9 and dodelete:
            ndel += 1
            p.dele(i)
except poplib.error_proto,e:
    print "[%s] (%s) poplib error: %s" % (current_time,server,e)
    p.quit()
    sys.exit(0)

status = p.quit()
if verbose: print status
print "[%s] (%s) %4d msgs; %4d deleted" % (current_time,server,nmsgs,ndel)

# save seen message number
open(seen_file,mode="w").write("%d" % (nmsgs - ndel,))
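The decision about which messages to download is the fiddliest part of the script, so it may help to see it in isolation. The following is a sketch only, written for a current Python 3 (the script above is Python 2). The function name messages_to_scan is hypothetical and does not appear in the script; the logic is meant to mirror the loop above: reset the "seen" marker if the mailbox has shrunk, then skip anything already seen or larger than max_size.

```python
def messages_to_scan(list_lines, nmsgs, seen, max_size):
    """Return the message numbers that should be retrieved and scored.

    list_lines -- entries from the POP3 LIST response, each of the form
                  "<message number> <size in bytes>" (bytes or str)
    nmsgs      -- message count reported by STAT
    seen       -- highest message number handled by the previous run
    max_size   -- messages larger than this many bytes are ignored
    """
    # Fewer messages than last time: assume the mailbox was emptied,
    # so everything in it is new.
    if nmsgs < seen:
        seen = 0
    wanted = []
    for entry in list_lines:
        if isinstance(entry, bytes):
            entry = entry.decode("ascii")
        fields = entry.split()
        number, size = int(fields[0]), int(fields[1])
        if number <= seen or size > max_size:
            continue
        wanted.append(number)
    return wanted
```

For example, with three messages of 500, 200000 and 800 bytes, a seen value of 1 and the default max_size of 100000, only message 3 would be retrieved.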
Training a UNIX version of Spambayes with Windows mail messages
Since my existing training database was on Windows, I decided to see how easy it would be to create a new one on the Debian machine, using the mail messages from my Windows mail client. It was relatively easy (if tedious) to export all the Outlook messages as text files and transfer them to Debian. However, the signature used to detect the start of a mail message in UNIX mailboxes, "^From name Day Mon nn HH:MM:SS Year" (where "^" indicates start of line), did not exist in the mail exported from Outlook. I had to spend some time creating the right lines by hand, using Emacs and the existing information in the file. To create a training database, use a command of the form:
sb_mboxtrain.py -d sb.db -g ham.txt -s spam.txt
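I added the "From" lines by hand, but the same result could probably be had from a few lines of Python. The sketch below is not what I actually ran; it assumes a current Python 3, and that each exported message sits in its own text file in one directory. The function name build_mbox is made up for illustration. It relies on the standard mailbox module, which writes a "From " separator line itself for any message that lacks one.

```python
import mailbox
import os


def build_mbox(message_dir, mbox_path):
    """Pack every file in message_dir into one mbox file at mbox_path.

    Returns the number of messages in the resulting mailbox.
    Assumes mbox_path lies outside message_dir.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for name in sorted(os.listdir(message_dir)):
            path = os.path.join(message_dir, name)
            if not os.path.isfile(path):
                continue
            # Read as bytes: exported mail is not guaranteed to be ASCII.
            with open(path, "rb") as fh:
                box.add(fh.read())
        box.flush()
        return len(box)
    finally:
        box.close()
```

Running this once over the exported ham and once over the exported spam would produce the two files named in the command above.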