C. Fischer | 6 Mar 13:58

ifile + MIME

these days, with complicated MIME messages and ifile being applied to the
anti-spam domain, it seems ifile should grok MIME.

<URL:http://www.ivarch.com/programs/qsf.shtml>

qsf has quite sensible MIME handling:  only MIME types "text/*" are
classified, with HTML tags stripped and proper qp and base64 decoding.

question:  would it suffice to delete matching "<...>" pairs, or do they
sometimes get escaped in some way, or is it legal to qp/base64 encode them?

i'm thinking of taking qsf's MIME parser and adding it to ifile's lexer.
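
to make the idea concrete, here is a minimal sketch of "decode first, then
strip tags", restricted to "text/*" parts.  it uses python's standard email
package, not qsf's parser, and the function name classifiable_text is made up
for illustration.  because the transfer encoding (qp/base64) is undone before
tag stripping, encoded "<...>" pairs are caught as well.  the regex is naive:
it does not handle ">" inside attribute values, comments or scripts.

```python
import email
import re
from html import unescape

# Naive tag matcher: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def classifiable_text(raw_message):
    """Return the decoded text of all text/* parts, with HTML tags removed.

    Hypothetical helper for illustration; not part of ifile or qsf.
    """
    msg = email.message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers, images, attachments and so on.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable / base64 transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        if part.get_content_subtype() == "html":
            # Strip tags first, then turn entities like &amp; into characters.
            text = unescape(TAG_RE.sub(" ", text))
        chunks.append(text)
    return "\n".join(chunks)
```

the returned string would then be what gets handed to the lexer.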

  clemens
C. Fischer | 7 Mar 12:48

usage of ifile's threshold option?

could somebody please give an example of using ifile's `-T' (--threshold)
option?  i want to know how to derive a specific number for it.

  clemens
C. Fischer | 7 Mar 12:47

naive bayes algorithm in ifile?

another idea i'm toying with is making a (portable) standard-prolog
implementation of naive bayes for (email/usenet) text classification.  the
free prolog systems have improved a lot over the years, and i want to know
whether a prolog implementation is fast enough.

given n categories, t[i] tokens per category and m[i] messages per category
(i in {1..n}), and for every token a record (age, c:i) with i in {1..n}, could
somebody please give a simple, plain-english description of the algorithm
needed to classify a message?  i also need to understand how token ageing can
be used to keep the database small, so that it contains only the tokens that
contribute the most to classification and drops the rest.

do i really need floating point operations or can i get away with integer
arithmetic?  could rational numbers be a better solution?
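
for reference, a minimal sketch of textbook multinomial naive bayes with
add-one (laplace) smoothing.  this is an assumption about the general
technique, not a description of what ifile actually does, and it leaves token
ageing out entirely.  the class and method names are made up for
illustration.  all stored state is integer counts; floating point only enters
through the logarithm used for scoring.

```python
import math
from collections import Counter


class NaiveBayes:
    """Textbook multinomial naive Bayes; illustrative only, not ifile's code."""

    def __init__(self):
        self.token_counts = {}            # category -> Counter(token -> count)
        self.message_counts = Counter()   # category -> number of messages

    def train(self, category, tokens):
        """Add one message (a list of tokens) to a category."""
        self.token_counts.setdefault(category, Counter()).update(tokens)
        self.message_counts[category] += 1

    def classify(self, tokens):
        """Return (category, log score) pairs, best category first."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_msgs = sum(self.message_counts.values())
        scores = {}
        for cat, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Prior: fraction of training messages filed under this category.
            score = math.log(self.message_counts[cat] / total_msgs)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[tok] + 1) / (total_tokens + len(vocab)))
            scores[cat] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

usage would be something like nb.train("spam", tokens) per message, then
nb.classify(tokens)[0][0] for the winning category.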

  clemens
C. Fischer | 7 Mar 12:46

anon cvs access?

what happened to ifile's CVS repository?

,----
| /src/ifile
| 0 p2 # cvs -z0 up
|  -> main loop with CVSROOT=:pserver:anoncvs@subversions.gnu.org:/cvsroot/ifile
|  -> Connecting to subversions.gnu.org(199.232.41.3):2401
| cvs [update aborted]: connect to subversions.gnu.org(199.232.41.3):2401 failed: Operation timed out
|  -> Lock_Cleanup()
`----

has something like the host or the organization of the repo changed?

  clemens
Paolo | 8 Mar 14:44

Re: usage of ifile's threshold option?

On Mon, Mar 07, 2005 at 12:48:13PM +0100, C. Fischer wrote:
> could somebody please give an example of using ifile's `-T' (--threshold)
> option?  i want to know how to derive a specific number for it.

hello Clemens,

-T was introduced to allow for a 'grey zone' between the 2 winning
categories (among 2 or more in the database). I.e., in a sense, it makes
1 further bin 'on the fly', into which the test item is thrown whenever
the 2 topmost ranks (r0 and r1 below) are closer than the threshold, in
relative terms, according to the formula you get with --help:

R=(r0-r1)/(r0+r1), R*1000 < THRESH

if THRESH > 0.
Actually, you get 2 'grey zones', as you'd get a response like cat1,cat2
or cat2,cat1 according to which rank is the absolute max.
In spam filtering, e.g., you can do a coarse classification with a large
threshold and less computationally expensive preprocessing, then, on a
2nd pass, reprocess whatever made it into the 'unsure' bin with a narrower
threshold, better preprocessing, MIME decoding etc.
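
A small sketch of the test above, to show how a number for THRESH can be
picked. This is not ifile's source: the function name threshold_verdict is
made up, and it assumes both ratings are positive with r0 >= r1; how ifile
scales its ratings internally is not covered here.

```python
def threshold_verdict(ranked, thresh):
    """Apply R = (r0 - r1) / (r0 + r1), R * 1000 < THRESH to the top 2 ranks.

    ranked: list of (category, rating) pairs, best first; ratings are
    assumed positive (an assumption of this sketch, not a fact about ifile).
    Returns [best] for a clear win, [best, runner_up] for the grey zone.
    """
    (c0, r0), (c1, r1) = ranked[0], ranked[1]
    rel = (r0 - r1) / (r0 + r1)   # relative distance between the top 2 ranks
    if thresh > 0 and rel * 1000 < thresh:
        return [c0, c1]           # too close to call: report both, best first
    return [c0]
```

Read this way, THRESH is the smallest relative gap, in thousandths, that
still counts as a clear win: THRESH=100 sends everything with R below 0.1 to
the grey zone, THRESH=10 only what has R below 0.01.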

- In a previous msg you mentioned MIME processing: AFAICT, that's not very
effective WRT spam/ham classification - see reports in other projects, e.g.
CRM114 (crm114.sf.net) - see there as well for a link to 'normalizemime', a
tool to mangle/sanitize an RFC [2]822 msg in UTF-*.

- For possible algos/how to implement BCR, besides ifile itself and related
papers, see the comments in the crm114 code; you may also want to have a look
at dbacl / L. Breyer's sw/site: http://www.lbreyer.com/emailtut.html