spam

A bayesian spam filter implemented in Clojure, based on the one in Practical Common Lisp. Using some of the improvements on tokens from Paul Graham's Better Bayesian Filtering.

Usage

Tested on the ham and spam from SpamAssassin's Public Corpus. The program expects the emails to be tested on to be organized into the copus/spam and corpus/ham directories.

With a collection of emails in those directories, you can run the standalone jar with the following command, which takes 1 argument as the fraction of the corpus to use for cross validation:

java -jar -Xmx1g spam-filter-standalone.jar

The parameters at the beginning of core.clj file (min-spam-score and max-ham-score) can be tuned to change the behavior of the filter.

Results

Using 1/5 as the fraction for cross-validation, I obtained the following results:

          total : 1869  100.00 %
        correct : 1808   96.74 %
 false-positive : 3       0.16 %
 false-negative : 7       0.37 %
     missed-ham : 12      0.64 %
    missed-spam : 39      2.09 %

Increasing the cv fraction to 1/10 improved the results to some extent, as would be expected with a larger training set:

          total : 934   100.00 %
        correct : 912    97.64 %
 false-positive : 0       0.00 %
 false-negative : 3       0.32 %
     missed-ham : 5       0.54 %
    missed-spam : 14      1.50 %

missed-ham and missed-spam are the emails that the algorithm classified as :unsure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.markdown

README.markdown

spam

Usage

Results

License

Files

README.markdown

Latest commit

History

README.markdown

File metadata and controls

spam

Usage

Results

License