Peter Reutemann | 1 Sep 2008 03:33

Re: choosing words based on TF-IDF score

I only had a quick look at the code, since I'm not familiar with text
classification...

> I have two questions about using WEKA for text classification.
>
> First, I'm converting a dataset into ARFF using TextDirectoryToArff and
> StringToWordVector.
> I want to know how StringToWordVector chooses the set of words if the -W option
> is set.

It just counts the occurrences of words, i.e., the substrings that the
chosen tokenizer returns, and keeps roughly the -W most frequent ones
(the default for -W is 1000). This is the basis for the
dictionary the filter uses.

> Second, while using TextDirectoryToArff, I put the ham and spam datasets
> into a single directory.
> Based on the filename attribute in the generated ARFF file, I then change them
> to their respective classes.
>
> What would I have to do to select the top 'n' words in each class, say based
> on the tf-idf score?

The ranking is solely done on the number of occurrences. TF/IDF
transformation happens after the dictionary has been created.
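
To make the order of operations concrete, here is a minimal Python sketch
(not Weka's code). It assumes a plain whitespace tokenizer and the textbook
weighting tf * log(N / df); the function names build_dictionary and
tfidf_vectors are made up for illustration, and Weka's exact TF/IDF formula
may differ.

```python
import math
from collections import Counter


def build_dictionary(docs, words_to_keep):
    """Keep the most frequent tokens over all documents (count-based ranking)."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return [word for word, _ in counts.most_common(words_to_keep)]


def tfidf_vectors(docs, dictionary):
    """Weight only the words that are already in the dictionary.

    The weighting is applied after the dictionary is fixed, so it cannot
    change which words were selected.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each dictionary word occur?
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in dictionary}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in dictionary])
    return vectors


# Toy corpus, invented for this example.
docs = ["free money now", "free offer now", "meeting at noon"]
dictionary = build_dictionary(docs, words_to_keep=2)
vectors = tfidf_vectors(docs, dictionary)
print(dictionary)
print(vectors)
```

A word with a high tf-idf score but few occurrences never reaches the
weighting step in this sketch, because it was already dropped when the
dictionary was built.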

> I know that StringToWordVector has option to represent each word (attribute)
> as their tf/idf scores.
> But it does not help me. Moreover, if I  separate the spam and ham dataset
> by keeping them in

slander | 1 Sep 2008 11:29

Re: basic help in weka - neural nets


I think this is my last question...
"Correlation coefficient  0.9797"
As this number has to be between 0 and 1, I presume I can use this as a
percentage, i.e. the neural network was 97.97% successful in predicting the
4th number? Or is it not possible to use a percentage at all?
Thanks for all your help and advice.

Peter Reutemann wrote:
> 
>> what other method/figure would I have to use
>> instead to be able to determine how well the neural network predicted the
>> 4th number from the previous 3.
> 
> The correlation coefficient gives you that. The closer to 1 the better
> (and trying to minimize the errors, of course). If I'm not completely
> mistaken, Weka uses Pearson's coefficient:
>   http://en.wikipedia.org/wiki/Correlation_coefficient
> 

-- 
View this message in context: http://www.nabble.com/basic-help-in-weka---neural-nets-tp19202120p19251557.html
Sent from the WEKA mailing list archive at Nabble.com.

_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist

Peter Reutemann | 1 Sep 2008 12:23

Re: basic help in weka - neural nets

> "Correlation coefficient  0.9797"
> As this number has to be between 0 and 1, I presume I can use this as
> percentage, i.e. the neural network was 97.97% successful in predicting the
> 4th number?

Correlation is not accuracy. "Accuracy" is based on rows in the dataset,
i.e., the percentage of correctly classified rows. From the
"correlation coefficient" (= CC) you cannot deduce how many rows were
predicted well: there can be one prediction that was way off and pushes
the CC down to 0.7, whereas all other predictions were really close to
the actual value. On the other hand, a CC of 0.8, though higher in
value, can mean that a lot of the predictions were off by a larger
margin than in the case of a CC of 0.7. Hence it is important to look
at the CC *and* at the errors, e.g., root mean squared error (= RMSE).
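
A small Python sketch of why the two numbers have to be read together. It
assumes Pearson's coefficient (which ranges from -1 to 1, not 0 to 1) and
uses made-up values; the helper names pearson and rmse are for illustration
only and are not Weka calls.

```python
import math


def pearson(actual, predicted):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Invented data: ten actual values and two hypothetical sets of predictions.
actual = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [a + 5 for a in actual]                # every prediction is off by 5
one_outlier = [1, 2, 3, 4, 15, 6, 7, 8, 9, 10]   # exact, except one value off by 10

for name, predicted in (("shifted", shifted), ("one outlier", one_outlier)):
    print(name, "CC =", round(pearson(actual, predicted), 4),
          "RMSE =", round(rmse(actual, predicted), 4))
```

The shifted predictions correlate perfectly with the actual values although
every single one is wrong, while the predictions with one outlier get a
clearly lower CC although nine out of ten are exact and the RMSE is smaller.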

> or is it not possible to use a percentage at all?

I've never come across "percentage" being used in conjunction with the
correlation coefficient.

Cheers, Peter
-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174

A general classification question

Hello,

Caveat: this has little to do with Weka. Sorry if this is the wrong forum, but I am confident I will get
some help here.

I'm facing a conundrum while classifying text into two segments. Suppose I want to know whether a given
email is a phishing email or not. Is it fine if I train my model using only the phishing corpus, and then
test it against both the phishing and ham corpora? What's the point of including the ham corpus in the training set,
especially if it is going to vary according to each user's mailbox? It would only increase the entropy of the
model. Please let me know if I am missing something IMPORTANT. I see that most of the spam classification
papers have also used a ham set in training.

Thanks,
_Madhu
Benoît Depaire | 1 Sep 2008 16:11

RE: A general classification question

I think you should add the "ham corpus" (I assume these are emails which are NOT phishing emails). The reason is that there are two types of errors you can make. In your case:

  You could classify a phishing email as non-phishing
  You could classify a normal email as phishing

If you only have phishing emails, the model can only learn what the characteristics of phishing emails are. Therefore your final model will probably be too protective, classifying many normal emails as phishing emails.

By the way, if you're building a classifier to discern between those two types, your class variable should have variance. If you just want to learn the characteristics of phishing emails, you're not really building a model, but rather running some descriptive statistics.

In general, you need positive and negative cases if you want your model to learn to discern between them.

Cheers,

Benoît Depaire

From: wekalist-bounces <at> list.scms.waikato.ac.nz [mailto:wekalist-bounces <at> list.scms.waikato.ac.nz] On Behalf Of Madhusudhanan chandrasekaran
Sent: Monday 1 September 2008 15:15
To: Weka machine learning workbench list.
Subject: [Wekalist] A general classification question

 


Re: A general classification question

Hello,

Thanks! I still can't fathom the downside. Assume that my features are binary: wouldn't most of the ham emails lack the
features, and hence slip past the classifier? Is there a more concrete theoretical explanation for this? A textbook answer is also fine, since it is a very "primitive"
question. Moreover, what kind of errors would result in this case? Type I and Type II errors may occur in all cases.


Thanks,
_Madhu


Benoît Depaire | 1 Sep 2008 17:26

RE: A general classification question

Hi,

Maybe an example will clarify this issue.

Assume the following binary data:

  3 features F1, F2, F3 (all three binary: 1/0)
  1 class attribute C = Phishing (1 = Yes, 0 = No)

Now, assume you only consider the phishing emails to train your model. You might have the following data:

F1  F2  F3  C
1   0   1   1
1   1   0   1
1   0   1   1
1   1   1   1
1   0   1   1
1   1   0   1

Your model will probably create the following rule: If F1=1 => C=1 (Acc. = 100% on training data)

However, the non-phishing data you left out might have been:

F1  F2  F3  C
1   0   0   0
1   0   0   0
1   0   0   0

Given this additional data, your previous model will recognize every phishing email, but will also classify all non-phishing emails as phishing. Maybe F1 means "E-mail has a subject".

However, if your classifier gets all data (both phishing and non-phishing), it might find the following rule:

If F2=1 or F3=1 => C=1.

Now your model will classify all phishing emails as phishing, and will recognize the normal emails as normal rather than as phishing.

Of course, this is an academic example, but it illustrates the danger of leaving out the negative examples.
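
The example can be checked with a few lines of Python. The rows are copied
from the two tables above; the function names (rule_f1, rule_f2_or_f3,
error_counts) are made up for this sketch, and the two rules are written by
hand rather than learned by a classifier.

```python
# Rows are (F1, F2, F3, C); C = 1 means phishing, C = 0 means normal email.
phishing = [
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (1, 0, 1, 1),
    (1, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 1),
]
ham = [
    (1, 0, 0, 0),
    (1, 0, 0, 0),
    (1, 0, 0, 0),
]


def rule_f1(row):
    """Rule suggested by the phishing-only data: F1 = 1 => phishing."""
    return 1 if row[0] == 1 else 0


def rule_f2_or_f3(row):
    """Rule suggested by the full data: F2 = 1 or F3 = 1 => phishing."""
    return 1 if row[1] == 1 or row[2] == 1 else 0


def error_counts(rule, rows):
    """Return (false positives, false negatives) of a rule on labelled rows."""
    false_pos = sum(1 for r in rows if rule(r) == 1 and r[3] == 0)
    false_neg = sum(1 for r in rows if rule(r) == 0 and r[3] == 1)
    return false_pos, false_neg


all_rows = phishing + ham
print("F1 rule, phishing only:", error_counts(rule_f1, phishing))
print("F1 rule, all data:", error_counts(rule_f1, all_rows))
print("F2-or-F3 rule, all data:", error_counts(rule_f2_or_f3, all_rows))
```

The first rule makes no errors on the phishing-only data but flags all three
normal emails once they are included; the second rule makes no errors on
either class.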

 

Cheers,

Benoît

 

From: wekalist-bounces <at> list.scms.waikato.ac.nz [mailto:wekalist-bounces <at> list.scms.waikato.ac.nz] On Behalf Of Madhusudhanan chandrasekaran
Sent: Monday 1 September 2008 16:46
To: Weka machine learning workbench list.
Subject: Re: [Wekalist] A general classification question

 

Nissim | 1 Sep 2008 23:04

X-Means Algorithm


Hi !!!

Does the current version of Weka contain the X-Means algorithm? I don't see it in
the Explorer or the jar file.
If not, what was the last version that contained a stable implementation?

Thanks,
Nissim
-- 
View this message in context: http://www.nabble.com/X-Means-Algorithm-tp19260704p19260704.html
Peter Reutemann | 1 Sep 2008 23:31

Re: X-Means Algorithm

> Does the current version of Weka contain the X-Means algorithm? I don't see it in
> the Explorer or the jar file.
> If not, what was the last version that contained a stable implementation?

X-Means is only available in the developer version.

Cheers, Peter
Mark Hall | 1 Sep 2008 23:38

Re: X-Means Algorithm

The development version of Weka (3.5.8) has the XMeans clusterer.

Cheers,
Mark.


--
Mark Hall
Senior Developer/Consultant, Pentaho Open Source Business Intelligence
Citadel International, Suite 340, 5950 Hazeltine National Dr.,
Orlando, FL 32822, USA
+64 7 847-3537 office, +64 21 399-132 mobile, +1 815 550-8637 fax,
Skype: mark.andrew.hall, Yahoo: mark_andrew_hall
Download the latest release today <http://www.sourceforge.net/projects/pentaho>
