Script: Exporting Emails for Spamassassin Training

In trying to figure out the most effective way to use our tickets tagged as spam to train our spam filters, I came up with a solution that seems to work pretty well. This does require you to use the console, so fair warning for those of you who are faint of heart. :smiley:

I have posted a snippet on Github that you can use to look for tickets with a “spam” tag and then export the raw emails as files that you can then feed into sa-learn to train your Spamassassin filters.

Find tickets that are spam and write them to files

You can use it as follows:

  1. Create a folder that is writeable by Zammad
  2. Download the script
  3. Run the script: zammad run rails r extract_spam.rb
  4. Move the files to your email server and train away

I wanted to post this here as it was not apparent how to do this after searching online.


Cool.

Some comments on the code:

  1. for ticket in Ticket.all do
    

    can be quite resource intensive, as it loads all tickets into memory before iterating over them.

    It is better to use something like

    Ticket.find_each do |ticket|
    

    which iterates over the tickets in batches of 1000.

  2. if not ticket.articles.first.as_raw.nil? then
      ...
      file.write("#{ticket.articles.first.as_raw.content}")
    

    should read

    raw_article = ticket.articles.first.as_raw
    if raw_article
      ...
      file.write raw_article.content
    

    as ticket.articles.first queries the database each time it is called (storing the result saves one query).
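
Putting both suggestions together, the loop could look roughly like the sketch below. This is not the original script, just a minimal illustration: `export_spam`, the `ticket-<id>.eml` file name pattern, and the use of `tag_list` to read a ticket's tags are assumptions on my part.

```ruby
require 'fileutils'

# Writes the raw source of each spam-tagged ticket's first article to out_dir
# and returns the number of files written.
# `tickets` can be anything that yields ticket-like objects.
def export_spam(tickets, out_dir, tag: 'spam')
  FileUtils.mkdir_p(out_dir)
  written = 0

  tickets.each do |ticket|
    # tag_list is assumed to return the ticket's tag names as strings
    next unless ticket.tag_list.include?(tag)

    first_article = ticket.articles.first
    next unless first_article

    # Look the raw mail up once and reuse the result below
    raw_article = first_article.as_raw
    next unless raw_article

    # Hypothetical file name pattern
    File.binwrite(File.join(out_dir, "ticket-#{ticket.id}.eml"), raw_article.content)
    written += 1
  end

  written
end
```

Inside the Zammad console you would pass it a batched enumerator, for example `export_spam(Ticket.find_each, '/path/to/writeable/folder')`, so tickets are still loaded 1000 at a time.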


This is great - glad to have seen you actually implemented the idea. Spam in the helpdesk is a real pain point.

I’m curious if you have found an optimal way to deal with the detected spam in Zammad (not what your agents manually mark as spam, rather what your server has already detected as spam).

We are allowing emails with X-Spam-Flag: YES to come to Zammad, but then have a Zammad email filter that adds the ‘spam’ tag and closes those tickets automatically.
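
For anyone who wants to apply the same check to exported raw mails, here is a rough Ruby sketch of what that header match amounts to. It is not how the Zammad filter is implemented; `spam_flagged?` is a hypothetical helper and it only handles the simple unfolded form of the header.

```ruby
# Returns true when the header section of a raw message carries
# "X-Spam-Flag: YES". Only the part before the first blank line is
# inspected, so a body that merely quotes the header does not match.
def spam_flagged?(raw_email)
  headers = raw_email.split(/\r?\n\r?\n/, 2).first.to_s
  headers.match?(/^X-Spam-Flag:[ \t]*YES[ \t]*\r?$/i)
end
```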

Perhaps we should use different tags for agent marked spam vs SpamAssassin marked spam that is auto-closed on arrival to avoid a feedback loop.
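
As a sketch of that idea, an export for training could pick only tickets an agent tagged and skip the ones auto-tagged on arrival. Both tag names and the `training_candidates` helper are made up for illustration, and `tag_list` is assumed to return a ticket's tag names.

```ruby
# Selects tickets worth feeding to sa-learn: tagged by an agent, but not
# already flagged by SpamAssassin when they arrived.
# Both tag names are hypothetical defaults.
def training_candidates(tickets, agent_tag: 'spam', auto_tag: 'spam-auto')
  tickets.select do |ticket|
    tags = ticket.tag_list # assumed to be an array of tag name strings
    tags.include?(agent_tag) && !tags.include?(auto_tag)
  end
end
```

That way the filter only learns from mail it missed, rather than from its own earlier verdicts.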

Anyways, we still have a fair amount of garbage tickets in our system but at least they are closed and if need be we could still find them.