updates to CRM114_Mailfilter_HOWTO.txt
Hi, All.
I'm back to looking at CRM114, and I decided to update some of the
documentation, working from "the (possibly) slightly unstable latest
mainline version" -- thanks for the continued good work, and I hope the
updates prove useful. This is the second of two.
I mostly reworked the formatting and the step titles to make them
consistent, fixing a few obvious typos in the process.
/Jskud
--- crm114-20120205-ORIG/CRM114_Mailfilter_HOWTO.txt 2009-09-11 11:25:57.000000000 -0700
+++ crm114-20120205/CRM114_Mailfilter_HOWTO.txt 2012-02-06 19:52:36.000000000 -0800
@@ -7,7 +7,7 @@
The CRM114 & Mailfilter HOWTO
-Bill Yerazunis, 2003-09-18
- (last update 2009-03-02)
+ (last update 2012-02-06)
This is the CRM114 Mailfilter HOWTO. It describes how to set up CRM114
@@ -31,7 +31,7 @@
----------------------------------------------------------
-That said, I hope CRM114, Mailreaver, and Mailreaver is useful to you;
+That said, I hope CRM114, Mailfilter, and Mailreaver are useful to you;
it's been very useful to me. It's been keeping my mailbox clear of
clutter for since 2002; I'm convinced it has better performance than
I-the-human at killing spam without accidentally deleting important
@@ -64,12 +64,15 @@
- Bill Yerazunis (wsy@...)
--------------------------------------------------------------------
- Step 0: Scientes Inamicae (Know Thy Enemy)
+------------------------------------------------------------------------
+------------------------------------------------------------------------
+
+
+ Step 0: Scientes Inamicae (Know Thy Enemy)
-These are the major steps in using CRM114 Mailfilter. The steps are
-pretty simple:
+These are the other major steps in using CRM114 Mailfilter. The steps
+are pretty simple:
1) Downloading what you need
@@ -85,7 +88,7 @@
(editing one file, most likely change is ONE line, and we tell
you which one)
- 3) Setting up the needed auxilliary files
+ 4) Setting up other needed files
(not more than 2 files to edit of no more than 5 lines each,
plus typing one or two commands)
@@ -115,10 +118,11 @@
don't need to know this, but you may find it useful.
+------------------------------------------------------------------------
+------------------------------------------------------------------------
--------------------------------------------------------------------------
- Step 1: Downloading.
+ Step 1: Downloading What You Need
Get yourself a copy of a CRM114 kit. The kits can always be found by
visiting the CRM114 homepage at:
@@ -168,16 +172,14 @@
Download the kits you will need (at least one of .src.tar.gz or
.i386.tar.gz or .i386.rpm) and then proceed to "Step 2: Setting Up the
-Executables"
-
-
-
---------------------------------------------------------------------------
+Executables".
+------------------------------------------------------------------------
+------------------------------------------------------------------------
- Step 2: Setting Up the Executables
+ Step 2: Setting Up the Executables
In this step, you will install four binaries into your system.
The four binaries are:
@@ -262,8 +264,9 @@
Congratulations! You've now completed the installation of CRM114 and
-utilities from prebuilt binaries. Proceed to "Step 3: Setting Up Needed
-Files.
+utilities from prebuilt binaries. Proceed to "Step 3: Configuring
+Mailfilter or Mailreaver".
+
-----
@@ -382,14 +385,14 @@
Congratulations! You've now completed the installation of CRM114 and
-utilities from source. Move on to the next step - "Step 3: Setting Up
-Your .CSS Files" .
-
+utilities from source. Move on to the next step - "Step 3: Configuring
+Mailfilter or Mailreaver".
------------------------------------------------------------------------
------------------------------------------------------------------------
+
Step 3: Configuring Mailfilter or Mailreaver
In this step you will tell Mailfilter or MailReaver what you want it
@@ -500,11 +503,11 @@
Now, proceed to "Step 4: Setting Up Other Needed Files" .
---------------------------------------------------------------------
---------------------------------------------------------------------
+------------------------------------------------------------------------
+------------------------------------------------------------------------
- Step 4: Setting Up Other Needed Files
+ Step 4: Setting Up Other Needed Files
Now that the crm114 language is working, you need to set up your
.css files, your rewrites.mfp file, and your priolist.mfp file.
@@ -512,8 +515,8 @@
All of these files need to exist (either by being there, or by
being symlinked to) the directory where CRM114 will "run in"
when an actual mail comes in. Usually this is your per-user
-directory on the mail server (if your mail server is also your
-home directory, then it's there.). If this is inconvenient,
+directory on the mail server (if your mail server also provides your
+home directory, then it's there.). If this is inconvenient,
you can use the --fileprefix option on the command line to
tell CRM114 to "change over" to a different directory. The files
that need to be in the home (or --fileprefix) directory are:
@@ -582,9 +585,9 @@
whitelists", you can now say "yes, and they're even _prioritized_
blacklists and whitelists!".
+ -----
-
- Step 4 Part 1 - Setting up the Rewrites file.
+ Step 4 Part 1 - Setting up the Rewrites file.
To set up the rewrites.mfp file, edit the file "rewrites.mfp" and
replace the placeholders (in this case, "wsy", "merl.com", and
@@ -628,23 +631,23 @@
router, etc, add lines in rewrites.mfp for each email name, email
address, server, router, and so forth. This is something you really
_should_ do, if you have more than one email path leading to the
-account that leads to an account that is being filtered by CRM114 (if
+account that leads to an account that is being filtered by CRM114. (If
you don't, a lot of learning will have to be repeated for each path,
which will cost you accuracy and use up valuable feature slots in the
.css files that you could use in more valuable ways otherwise. On the
other hand, if you have multiple email addresses that all channel
-through one CRM114 fileset, and the addresses recieve very different
+through one CRM114 fileset, and the addresses receive very different
ratios of spam and nonspam (or, very differnt *types* of spam), then
it _might_ be to your advantage to not use rewrites.mfp, (just replace
it with an empty file), so that the extra statistical information of
-the incoming email address is not lost)
+the incoming email address is not lost.)
If all this confuses you to no end, just make rewrites.mfp be an
-empty file and everything should decently well.
+empty file and everything should work decently well.
- -----
+ -----
- Step 4 Part 2 - Setting up the .CSS files
+ Step 4 Part 2 - Setting up the .CSS files
You have a choice here. You can either build your own files from your
@@ -662,7 +665,7 @@
If your mail service runs on your local machine (say, you have just
one machine - and I do hope you have a firewall in that case), then
-mailfilter will almost certainly "run" in your home directory- the
+mailfilter will almost certainly "run" in your home directory - the
directory you're in when you log in.
If your mail service runs on a mail server (not your local machine),
@@ -695,7 +698,7 @@
Once you have these empty files you will have a high (50% or so)
error rate for the first few hours, till you have 'taught' CRM114
what your particular mix of spam and nonspam looks like. Proceed
-below to "Step 4: Configuring Mailfilter".
+below to "Step 5: Engaging Mailfilter".
Many people want to "preload" their spam collection into CRM114. This
used to be a bad idea. CRM114 is optimized for TOE learning - "Train
@@ -741,8 +744,8 @@
-----
- Step 4 Part 2 Method C - BETA TEST - Using mailtrainer.crm to
- Build .CSS Files
+ Step 4 Part 2 Method C - BETA TEST -
+ Using mailtrainer.crm to Build .CSS Files
New in 20060101 is the "mailtrainer.crm" program. This program
accepts two directories of "archetype" good and spam email, and runs
@@ -767,7 +770,7 @@
-----
- Step 4 Part 2 Method D - ALPHA TEST -- MAKEFILE Build And
+ Step 4 Part 2 Method D - ALPHA TEST -- MAKEFILE Build And
Preload .CSS Files From Fresh Spam and Nonspam
CAUTION - this applies ONLY to kits 20060606 and later!!! DO NOT DO
@@ -816,7 +819,7 @@
installs post 20060606 . Versions prior to that will hose you if
you do this.
- --------
+ -----
Step 4 Part 3 - Checking your installation
@@ -893,6 +896,7 @@
Note: this works fine for the default classifiers like Markov, OSB,
and OSB Unique, but _not_ for Winnow, Hyperspace, or Corellative
classifiers; for OSBF classifiers use osbf-util instead of cssutil.
+See ./CLASSIFY_DETAILS.txt for a description of the classifiers.
Type in:
@@ -959,10 +963,12 @@
there are similarities. That's pretty much typical- and it's a good sign
that your filtering should be quite accurate.
-Now, move on to "Step 4: Configuring Mailfilter".
+Now, move on to "Step 5: Engaging Mailfilter".
+
+
+------------------------------------------------------------------------
+------------------------------------------------------------------------
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
Step 5: Engaging Mailfilter
@@ -985,7 +991,7 @@
-----
- Step 5 Method A: For Procmail and Maildrop Users
+ Step 5 Method A: For Procmail and Maildrop Users
For Procmail users just add a procmail recipe to .procmailrc to run
CRM114 and mailfilter whenever your other procmail rules fail to
@@ -1011,7 +1017,8 @@
To use mailreaver instead of mailfilter, just put "mailreaver.crm"
in instead of "mailfilter.crm" .
-If you get the test message, proceed to "Step 6: Training CRM114".
+If you get the test message, proceed to "Step 6: Training CRM114 and
+Mailfilter".
-----
@@ -1059,7 +1066,6 @@
----------------------------------------------------------------------------
----------------------------------------------------------------------------
-
Advanced Topic: Huge Emails and Denial Of Service Avoidance
CRM114 has a number of built-in anti-Denial-of-Service (anti-DoS)
@@ -1089,10 +1095,9 @@
mail/crm-spam
-
-----
- Step 5 Method B: The .forward hook file
+ Step 5 Method B: The .forward hook file
For .forward hook users you should be aware that you should NOT put a
direct link to crm in /etc/smrsh; since crm can do arbitrary things,
@@ -1118,7 +1123,8 @@
----
Once you have engaged CRM114 mailfilter, you now get to train it to
-recognize spam and nonspam. Proceed to "Step 6: Training CRM114".
+recognize spam and nonspam. Proceed to "Step 6: Training CRM114 and
+Mailfilter".
Note: CRM114 contains a design decision that you may have to play
with. Instead of doing memory management games, which both consume
@@ -1138,7 +1144,9 @@
a buffer-shuffling dance to minimize time spent reclaiming and
compactifying memory.
----------------------------------------------------------------------------
+
+------------------------------------------------------------------------
+------------------------------------------------------------------------
Step 6: Training CRM114 and Mailfilter
@@ -1162,16 +1170,15 @@
interchangeably here; the instructions say "mailfilter.crm" but
mailreaver.crm works exactly the same way from the user point of view.
- * Mail-to-Myself with In-Line Commands to retrain (Method A)
- * shell commands to retrain (Method B)
- * Mutt direct interface (Method C)
- * Some Other Interface (Method D)
-
+ * Method A: mail-to-myself with in-line commands to retrain
+ * Method B: shell commands to retrain
+ * Method C: Mutt direct interface
+ * Method D: some other interface
-Whatever Way You Train : try to train _approximately_ equal amounts of spam and
-nonspam. If you are within 50% one way or the other, performance will
-be very good.
+Whatever Way You Train: try to train _approximately_ equal amounts of
+spam and nonspam. If you are within 50% one way or the other,
+performance will be very good.
If you are running mailfilter.crm:
@@ -1198,8 +1205,9 @@
error per thousand).
+ -----
- Step 6 Method A: Mail-to-Myself
+ Step 6 Method A: Mail-to-Myself
The first way is to use the in-line command feature. Just forward
the mistake back to yourself, with full headers (except edit out any
@@ -1247,25 +1255,24 @@
If you are a mailreaver user, you also have a priority system you can
access, either by editing your priolist.mfp file directly or by
sending youself email in the following forms (where mypwd is the
-command passworda_regex_pattern is what will be used for priority
+command password, and a_regex_pattern is what will be used for priority
matching. Priority matches can occur in both the headers and body of
the text.)
command mypwd maxprio +a_regex_pattern - sets a maximum priority GOOD
command mypwd maxprio -a_regex_pattern - sets a maximum priority SPAM
- command mypwd minprio +a_regex_pattern - sets a maximum priority GOOD
- command mypwd minprio -a_regex_pattern - sets a maximum priority SPAM
+ command mypwd minprio +a_regex_pattern - sets a minimum priority GOOD
+ command mypwd minprio -a_regex_pattern - sets a minimum priority SPAM
command mypwd delprio a_regex_pattern - deletes the first priority
- list entry that fully matches
- the regex pattern
-
-
+ list entry that fully matches
+ the regex pattern
+ -----
- Step 6 Method B: Shell commands to retrain
+ Step 6 Method B: Shell commands to retrain
- >> For mailfilter users (mailreaver is different - skip to below! <<
+ >> For mailfilter users (mailreaver is different - skip to below!) <<
The second way to train in spam and nonspam is to use mailfilter.crm's
shell command line options. When you find a spam that was mistakenly
@@ -1286,7 +1293,7 @@
[[ If you are using mailreaver.crm instead of mailfilter.crm, and
cacheing is enabled, you don't even need to pipe in the full text in,
all that's needed is either the intact X-CRM114-CacheID: line or the
- Message-ID line containing an intact sfid. That's another reason to
+ Message-ID line containing an intact SFID. That's another reason to
switch to mailreaver! :) ]]
>> For mailreaver.crm users <<
@@ -1294,7 +1301,7 @@
You're in luck, assuming you have taken the default and left cacheing
turned on. All you need to pipe into mailreaver for training is any
text or text fragment containing an intact X-CRM114-CacheID: line or
-the Message-ID line containing an intact sfid; mailreaver will go get
+the Message-ID line containing an intact SFID; mailreaver will go get
the exact incoming text of the message and train it, so you don't need
to worry about munged headers.
@@ -1352,9 +1359,9 @@
file; instead use the file so noted.
+ -----
-
- Part 6 Method C: For Mutt Users
+ Step 6 Method C: For Mutt Users
(Contributed by Mathieu Doidy and Joost van Baal:)
@@ -1375,8 +1382,9 @@
* esc-h will tag a message, falsely classified as spam, as ham.
+ -----
- Part 6 Method D: Some Other Method
+ Step 6 Method D: Some Other Method
There are at least five other ways to retrain CRM114. Some interface
@@ -1458,12 +1466,11 @@
of daily use and about a gigabyte of email).
-
------------------------------------------------------------------------
-
+------------------------------------------------------------------------
+------------------------------------------------------------------------
- Step 7: Adding Priority Lists, Whitelists, and Blacklists
+ Step 7: Adding Priority Lists, Whitelists, and Blacklists
If you really want, you can add white, black, and priority lists
to CRM114. Most people don't need them, but there are always
@@ -1497,10 +1504,11 @@
Lastly (well, actually firstly, because prio-listing happens before
whitelisting or blacklisting) any mail that matches any regex in
-priolist.mfp . The format of priolist.mfp is that the first character
-on the line is a + or a -, which indicates "whitelist" or "blacklist",
-and the rest of the line is a regex. These regexes are tested
-in the order given in the file. An empty file is perfectly acceptable.
+priolist.mfp is handled. The format of priolist.mfp is that the first
+character on the line is a + or a -, which indicates "whitelist" or
+"blacklist", and the rest of the line is a regex. These regexes are
+tested in the order given in the file. An empty file is perfectly
+acceptable.
For examples of how to set up the whitelist, blacklist, and priolist
files, see the included "whitelist.mfp.example", "blacklist.mfp.example",
@@ -1513,10 +1521,11 @@
add, otherwise you may get a rude surprise some day.
-----------------------------------------------------------------
+------------------------------------------------------------------------
+------------------------------------------------------------------------
- Step 8: Useful Utilities
+ Step 8: Useful Utilities
You don't _need_ to know the stuff in this section to set up and use
CRM114 and mailfilter or mailreaver, but it might be useful to you- or
@@ -1539,7 +1548,7 @@
The cssutil utility:
-
+ -------------------
Usage is
@@ -1560,9 +1569,7 @@
-s css-size - if no cssfile found, create new
cssfile with this many buckets.
-S css-size - same as -s, but round up to next
- 2^n + 1 boundary.
-
-
+ 2^k + 1 boundary.
The cssdiff utility
@@ -1572,8 +1579,7 @@
./cssdiff somefile.css anotherfile.css
-which writes out a summary of how two different .css files are.
-
+which writes out a summary of how different two .css files are.
The cssmerge utility
@@ -1600,19 +1606,16 @@
-s NNNN -new file length, if needed
-
-
-
- Enlarging a .css file
- ---------------------
+ Enlarging a .css file
+ ---------------------
One of the advantages of CRM114 is that the .css files are relatively
small and of fixed size; they don't grow out of control and never need
trimming if you use <microgroom>, which is the default.
The disadvantage of this is that if your spam/nonspam discrimination
-is too convoluted, it won't be able to sort them out ( in trek-speak
-this is a high-order nonlinearity in the discrimination function ).
+is too convoluted, it won't be able to sort them out (in trek-speak
+this is a high-order nonlinearity in the discrimination function).
The fix in this situation is to increase the dimensionality of the
feature space. The number of dimensions is about 1/12 the number of
bytes in the .css files; this works well at about a million dimensions
@@ -1640,7 +1643,7 @@
You can even combine steps 1 and 2, because newer versions of cssmerge
will create a new file if needed (the -s N flag sets the number of slots
-in the new file; -S N does the same thing but rounds up to a 2^N+1
+in the new file; -S N does the same thing but rounds up to a 2^k+1
boundary, which is recommended ).
For example, here's how to increase the size of the spam.css file
@@ -1657,6 +1660,7 @@
--------------------------------------------------------------------
APPENDIX 1
+
Using mailtrainer.crm
@@ -1902,12 +1906,11 @@
improve your accuracy still more.
+------------------------------------------------------------------------
----------------------------------------------------------------------
-
-That's all! If you have errors or updates (or find bugs!) please
-let me know; the best way is to join the CRM114-general mailing list; it's
-on the webpage:
+That's all! If you have errors or updates (or find bugs!) please let me
+know; the best way is to join the CRM114-general mailing list; it's on
+the webpage:
http://crm114.sourceforge.net
@@ -1920,3 +1923,5 @@
Enjoy, and good luck.
-Bill Yerazunis
+
+[]