Hi,
Maybe an example will clarify this issue.
Assume the following binary data:
3 features F1, F2, F3 (all three binary: 1/0)
1 class attribute Phishing (1=Yes, 0=No)
Now, assume you only consider the phishing emails to train your model. You
might have the following data:
F1 F2 F3 C
1 0 1 1
1 1 0 1
1 0 1 1
1 1 1 1
1 0 1 1
1 1 0 1
Your model will probably create the following rule: If F1=1 => C=1
(accuracy = 100% on the training data).
However, the non-phishing data you left out might have been:
F1 F2 F3 C
1 0 0 0
1 0 0 0
1 0 0 0
Given this additional data, your previous model will recognize every phishing
email, but will also classify all non-phishing emails as phishing. Maybe F1
simply means “E-mail has a subject”, which is true of nearly every email.
However, if your classifier gets all the data (both phishing and
non-phishing), it might find the following rule:
If F2=1 or F3=1 => C=1.
Now your model will classify all phishing emails as phishing, but will
recognize the normal emails as normal and no longer as phishing.
Of course, this is an academic example, but it illustrates the danger of
leaving out the negative examples.
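The toy example can be checked in a few lines of Python. This is only a sketch: the two rules are hard-coded as functions (the names `rule_phishing_only` and `rule_all_data` are made up here), not learned by Weka or any real classifier.

```python
# Toy data from the example: each row is (F1, F2, F3, C), C = 1 means phishing.
phishing = [
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (1, 0, 1, 1),
    (1, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 1),
]
ham = [
    (1, 0, 0, 0),
    (1, 0, 0, 0),
    (1, 0, 0, 0),
]

def rule_phishing_only(f1, f2, f3):
    """Rule a learner might pick from phishing-only data: F1=1 => C=1."""
    return 1 if f1 == 1 else 0

def rule_all_data(f1, f2, f3):
    """Rule a learner might pick from all data: F2=1 or F3=1 => C=1."""
    return 1 if (f2 == 1 or f3 == 1) else 0

def accuracy(rule, rows):
    """Fraction of rows whose predicted class equals the true class."""
    correct = sum(1 for f1, f2, f3, c in rows if rule(f1, f2, f3) == c)
    return correct / len(rows)

def false_positives(rule, rows):
    """Number of normal emails (C=0) that the rule flags as phishing."""
    return sum(1 for f1, f2, f3, c in rows if c == 0 and rule(f1, f2, f3) == 1)

all_rows = phishing + ham
for name, rule in [("F1=1", rule_phishing_only), ("F2=1 or F3=1", rule_all_data)]:
    print(name, "| accuracy on phishing only:", accuracy(rule, phishing))
    print(name, "| accuracy on all data:", round(accuracy(rule, all_rows), 3))
    print(name, "| ham flagged as phishing:", false_positives(rule, ham))
```

Both rules look perfect when evaluated on phishing emails alone. Only the ham rows reveal that the first rule flags every normal email.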
Cheers,
Benoît
From: wekalist-bounces <at> list.scms.waikato.ac.nz [mailto:wekalist-bounces <at> list.scms.waikato.ac.nz] On Behalf Of Madhusudhanan chandrasekaran
Sent: Monday, 1 September 2008 16:46
To: Weka machine learning workbench list.
Subject: Re: [Wekalist] A general classification question
Hello,
Thanks! I'm still not able to fathom the downside. Assuming my features are
binary, wouldn't most of the ham emails lack those features, and hence slip
past the classifier? Is there a more concrete theoretical explanation for
this? A textbook-style answer is also fine, since it is a very "primitive"
question. Moreover, what kind of errors would result in this case? Type I and
Type II errors may occur in all cases.
Thanks,
_Madhu
On Mon, Sep 1, 2008 at 10:11 AM, Benoît Depaire <benoit.depaire <at> uhasselt.be>
wrote:
I think you should add the "ham corpus" (I assume these are emails which are
NOT phishing emails). The reason is that there are two types of errors you can
make. In your case:
You could classify a phishing email as non-phishing
You could classify a normal email as phishing
If you only have phishing emails, the model can only learn what the
characteristics of phishing emails are. Therefore your final model will
probably be too protective, classifying many normal emails as phishing emails.
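To make the two error types concrete, here is a small Python sketch that counts them separately. The function name `count_errors` and the sample labels are invented for illustration, and the "always phishing" predictor merely stands in for an overly protective model.

```python
def count_errors(true_labels, predicted_labels):
    """Return (missed_phishing, false_alarms) with 1 = phishing, 0 = normal."""
    missed_phishing = 0  # phishing email classified as non-phishing
    false_alarms = 0     # normal email classified as phishing
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 0:
            missed_phishing += 1
        elif truth == 0 and pred == 1:
            false_alarms += 1
    return missed_phishing, false_alarms

# Made-up labels: 4 phishing emails followed by 3 normal ones.
truth = [1, 1, 1, 1, 0, 0, 0]
# An overly protective model that flags every email as phishing.
always_phishing = [1] * len(truth)

missed, alarms = count_errors(truth, always_phishing)
print("missed phishing:", missed)
print("false alarms:", alarms)
```

Such a model never misses a phishing email, yet every normal email becomes a false alarm. That second error can only be measured, or learned from, when normal emails are present.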
By the way, if you're building a classifier to discern between those two
types, your class variable should have variance. If you just want to learn the
characteristics of phishing emails, you're not really building a model, but
rather running some descriptive statistics.
In general, you need positive and negative cases if you want your model to
learn to discern between them.
Cheers,
Benoît Depaire
Hello,
Caveat -- this has little to do with Weka. Sorry if this is the wrong forum,
but I am confident I will get some help here.
I'm facing a conundrum while classifying text into two classes. Suppose I want
to know whether a given email is a phishing email or not. Is it fine if I
train my model using only the phishing corpus, and then test it against both
the phishing and ham corpora? What's the point of including the ham corpus in
the training set, especially if it is going to vary according to each user's
mailbox? It would only increase the entropy of the model. Please let me know
if I am missing something IMPORTANT. I see that most of the spam
classification papers have also used a ham set in training.
Thanks,
_Madhu
_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist