Bernhard Pfahringer | 1 Sep 2011 08:07

Re: exporting ROC curves in a graphic format

Hi,

I would save the outcome as CSV and then use
tools like gnuplot or Excel to produce the
graphs you like.

cheers, Bernhard
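
A minimal sketch of the CSV route suggested above, in Python (not part of Weka). It assumes the saved threshold-curve file is a dense ARFF file whose @data rows are plain comma-separated values; the function name arff_to_csv and the sample column names are illustrative assumptions, so check them against the file Weka actually wrote.

```python
import csv
import io
import shlex


def arff_to_csv(arff_text):
    """Convert a dense ARFF file to CSV text.

    Handles only the simple case: @attribute lines followed by
    comma-separated @data rows. Sparse rows are not supported.
    """
    names = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            # single quotes are the usual ARFF quoting character
            rows.append(next(csv.reader([line], quotechar="'")))
        elif lower.startswith("@attribute"):
            # shlex honours the quotes around names that contain spaces
            names.append(shlex.split(line)[1])
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample; real files contain more columns and many more rows.
SAMPLE = """@relation ThresholdCurve
@attribute 'False Positive Rate' numeric
@attribute 'True Positive Rate' numeric
@attribute Threshold numeric
@data
1,1,0
0.25,0.75,0.5
0,0,1
"""

if __name__ == "__main__":
    print(arff_to_csv(SAMPLE))
```

The resulting file can then be loaded into gnuplot or Excel, where line colour and thickness are set in the plotting tool itself.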

On Wed, Aug 31, 2011 at 10:19 PM, Marina <marina.popovscaia <at> gmail.com> wrote:
> Hi,
>
> I have constructed a number of ROC curves for different classifiers
> ("Visualize threshold curve" function).
> I can save the outcome only as an ARFF file, but I need to have all plots in
> some graphic file format.
> I would also need to be able to modify the ROC curves, e.g. their
> colors and thickness.
>
> Could anyone advise me how to do this? (Other than taking Print Screen
> captures :))
>
> Thank you!
>
> Marina

Harri Saarikoski | 1 Sep 2011 08:19

Re: Classifier Configuration Advices



2011/8/31 Paul Mayer <equilibrium87 <at> gmail.com>
2011/8/26 wessel van persie <wesselvanpersie <at> googlemail.com>
Dear Paul,

For a given learning algorithm and data set you can find optimal parameters using weka.classifiers.meta.CVParameterSelection.

Furthermore it is important to use the Weka Experimenter when trying out lots of different algorithms and settings.
When only using the Weka Explorer you might get a high performance score purely by chance.

When you have the performance results from various learners you can post them to the mailing list.
Maybe someone is able to deduce why some learners perform poorly while others perform better.

Best regards,

Wessel


_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
 
Dear Harri, dear Wessel,

Thanks for your advice! I will give CVParameterSelection a try and will play around with the mentioned parameters. As a
result of last week's work, I ended up splitting the initial set to be used for training into 66% training and 33% evaluation instances.
I ran 10-fold cross-validation on the 66% and built an ensemble classifier from the 10 interim models, which votes for the
most popular class for an unseen instance in the evaluation set. It should be a formally more correct approach than the one used before.

Aside from that, I am running into serious trouble when it comes to normalization and standardization. So far I have tried three variants: transforming to z-values,
scaling with the median and MAD, and scaling with the median and IQR. Some distributions are skewed, and I have one feature which
has lots of 0s and 1s. My naive approach would have been to simply shift the values by an arbitrary constant and log-transform afterwards;
however, I was told that this is far from being as trivial a task as I thought it would be. Would you suggest at least computing the
sqrt for skewed distributions?

A transformation of a feature's distribution is not necessarily required. What determines the need to transform is how differently its values are associated with the target classes, not the marginal distribution of the feature values itself, as you seem to suggest, so it may not cause trouble at all. To find out, compare the class probabilities of that feature with those of the other features by running e.g. NaiveBayes, which outputs such probabilities p(f|c1) and p(f|c2). If p(f|c) = abs(p(f|c1) - p(f|c2)) is greater for the f(eature) in question than for the others, then something should be done to avoid that one feature dominating the (any) classifier's decision. (NB: domination by one feature should be eliminated not to remove a good feature but to obtain a maximally homogeneous feature space, i.e. one with similarly distributed p(f|c)'s across features.)

As for normalization, I do not really know what method to apply. One feature shows significant outliers, others don't. Overall, this whole
problem should only affect the MLP, right? Would you simply suggest using WEKA's normalization implementation? Outlier handling would
be required in that case though, wouldn't it?

(What do you mean by outliers and their handling?)

Only if you find, based on the above p(f|c), that the feature should be transformed can we work out how the transform should be done (I will need to see the distribution/dataset).

best, Harri

PS: Weka's Normalize filter just rescales the distribution; it doesn't transform it. Since it can only be applied to all features globally, it shouldn't have any effect on performance.
(Rather, some math function should be applied; see the MathExpression filter.)
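
A minimal sketch, in plain Python rather than Weka, of the distinction drawn in the PS: a linear rescaling keeps the shape of a distribution, while a function such as log(1 + x) changes it. The sample values and the helper names (min_max, log1p_all, skewness) are made up for illustration and are not Weka's implementation.

```python
import math


def min_max(values):
    """Linearly rescale values to [0, 1]; order and relative spacing are kept."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log1p_all(values):
    """Apply log(1 + x), which is defined at 0 and compresses large values."""
    return [math.log1p(v) for v in values]


def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed feature containing 0s and 1s.
raw = [0, 0, 1, 1, 2, 5, 20, 400]

if __name__ == "__main__":
    print("raw      ", skewness(raw))
    print("rescaled ", skewness(min_max(raw)))    # same skewness as raw
    print("log(1+x) ", skewness(log1p_all(raw)))  # noticeably smaller
```

Using log(1 + x) instead of log(x) avoids the problem with zero values without having to pick an arbitrary shift, though whether any transform is needed at all is a separate question, as discussed above.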



Thanks again & best regards,
Paul





--
-----------------
Harri Saarikoski
data mining specialist
Finland

Harri Saarikoski | 1 Sep 2011 09:02

Re: CV vs. Test Set



2011/8/31 Ronchi, Davide <davide.ronchi.10 <at> ucl.ac.uk>
Hi all.

In a machine learning experiment I have encountered this problem: if I train a DT with half of the data as the training set and half as the test set, I obtain a classifier precision (94%) that is lower than when I use all the data and run 10x CV (99%). The data set is huge (more than 1.2 million samples) and the classifier is trained to classify each sample into one of two classes. I have a feeling that this is due to the fact that in the 10x CV case the training is done over 90% of the data set, thus yielding better results than when using a 50/50 split. Could anyone point me to an explanation of why this happens? I have read the Weka 3.6 book, but I did not find the chapter comparing the different evaluation techniques very clear.

90/10 yielding higher accuracies than 50/50 should no longer be true with 1.2 million samples, because the effect of training volume should disappear after tens of thousands of samples at the latest.
-> Investigate the class distribution: if one of the two classes has considerably fewer samples, then it is effectively the same as if you had that few samples in total, which would explain the higher accuracy. (Otherwise I guess it is the very unlikely but possible random sampling effect, i.e. that the extra 40% contain information that the 50%, or 600,000 samples, didn't contain.)

best, Harri
 

Thanks,
Davide Ronchi





--
-----------------
Harri Saarikoski
data mining specialist
Finland

Shyam kumar K | 1 Sep 2011 09:48

Fw: help sought in executing xmeans algorithm in gui environment



--- On Wed, 31/8/11, Shyam kumar K <shyamnssc <at> yahoo.in> wrote:

From: Shyam kumar K <shyamnssc <at> yahoo.in>
Subject: help sought in executing xmeans algorithm in gui environment
To: "wekalist <at> list.scms.waikato.ac.nz" <wekalist <at> list.scms.waikato.ac.nz>
Date: Wednesday, 31 August, 2011, 12:24 PM

01/09/2011 1.15 pm
Sir,

Could someone please help me resolve the following problem?

I am using Weka 3.7.4 to compare the performance of the k-means and X-means algorithms as part of my research work.
Weka 3.7.4 ships with SimpleKMeans. I have used the package manager to download the latest version (1.0.1) of the X-means algorithm
and successfully added the package to Weka 3.7.4.

When I try to run X-means on my dataset of 650 instances (it contains only numeric data, and k-means accepts it and runs successfully), I am not able to execute it, as the start clustering option appears to be disabled.

When I choose among the clusterers available in the Explorer, the newly added packages are shown in a slightly different colour. Is this because X-means is not properly installed?

If someone has encountered the same problem, please help me run the algorithm.
I tried to set the properties by right-clicking on the name XMeans, but that also failed, as I am still not able to run the X-means package.

Please guide me through this difficulty.
regards

shyam kumar 
kottayam , Kerala, india
email shyamnssc <at> yahoo.in

shyamnss11 <at> gmail.com


Paul Mayer | 1 Sep 2011 12:54

Re: Classifier Configuration Advices



2011/9/1 Harri Saarikoski <harri.saarikoski <at> gmail.com>


2011/8/31 Paul Mayer <equilibrium87 <at> gmail.com>
 
Dear Harri, dear Wessel,

Thanks for your advice! I will give CVParameterSelection a try and will play around with the mentioned parameters. As a
result of last week's work, I ended up splitting the initial set to be used for training into 66% training and 33% evaluation instances.
I ran 10-fold cross-validation on the 66% and built an ensemble classifier from the 10 interim models, which votes for the
most popular class for an unseen instance in the evaluation set. It should be a formally more correct approach than the one used before.

Aside from that, I am running into serious trouble when it comes to normalization and standardization. So far I have tried three variants: transforming to z-values,
scaling with the median and MAD, and scaling with the median and IQR. Some distributions are skewed, and I have one feature which
has lots of 0s and 1s. My naive approach would have been to simply shift the values by an arbitrary constant and log-transform afterwards;
however, I was told that this is far from being as trivial a task as I thought it would be. Would you suggest at least computing the
sqrt for skewed distributions?

A transformation of a feature's distribution is not necessarily required. What determines the need to transform is how differently its values are associated with the target classes, not the marginal distribution of the feature values itself, as you seem to suggest, so it may not cause trouble at all. To find out, compare the class probabilities of that feature with those of the other features by running e.g. NaiveBayes, which outputs such probabilities p(f|c1) and p(f|c2). If p(f|c) = abs(p(f|c1) - p(f|c2)) is greater for the f(eature) in question than for the others, then something should be done to avoid that one feature dominating the (any) classifier's decision. (NB: domination by one feature should be eliminated not to remove a good feature but to obtain a maximally homogeneous feature space, i.e. one with similarly distributed p(f|c)'s across features.)

only if you find based on the above p(f|c) that the feature should be transformed, we can work out how the transform should be done (will need to see the distribution/dataset)

best, Harri

PS: Weka's Normalize filter just rescales the distribution; it doesn't transform it. Since it can only be applied to all features globally, it shouldn't have any effect on performance.
(Rather, some math function should be applied; see the MathExpression filter.)

--
-----------------
Harri Saarikoski
data mining specialist
Finland




Dear Harri,

Thanks again for your answer! I hope I understood your instructions correctly: I took the raw data and ran the Naive Bayes classifier using just one feature at a time.
Below is a list of the true positive rate, the true negative rate and the absolute difference of the two values under 10-fold cross-validation (or was I supposed to simply build the classifier on the whole training data?):

Feature  TP rate               TN rate               |TP - TN|
F1       0.6832627118644068    0.4046610169491525    0.27860169491525427
F2       0.6451271186440678    0.423728813559322     0.22139830508474578
F3       0.5603813559322034    0.5095338983050848    0.05084745762711862
F4       0.7139830508474576    0.3919491525423729    0.3220338983050847
F5       0.7658898305084746    0.4427966101694915    0.3230932203389831
F6       0.6228813559322034    0.4682203389830508    0.15466101694915257
F7       0.3580508474576271    0.7002118644067796    0.3421610169491525
F8       0.548728813559322     0.628177966101695     0.07944915254237295
F9       0.9141949152542372    0.07309322033898305   0.8411016949152542
F10      0.7902542372881356    0.3156779661016949    0.4745762711864407
F11      0.6313559322033898    0.5720338983050848    0.05932203389830504
F12      0.7838983050847458    0.6091101694915254    0.17478813559322037
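
A minimal sketch of how the third column follows from the first two and how the features rank by it. The rates are the values from the list above rounded to four decimals; the function name rank_by_difference is an illustrative assumption.

```python
# (feature, true positive rate, true negative rate), rounded to 4 decimals
RATES = [
    ("F1", 0.6833, 0.4047),
    ("F2", 0.6451, 0.4237),
    ("F3", 0.5604, 0.5095),
    ("F4", 0.7140, 0.3919),
    ("F5", 0.7659, 0.4428),
    ("F6", 0.6229, 0.4682),
    ("F7", 0.3581, 0.7002),
    ("F8", 0.5487, 0.6282),
    ("F9", 0.9142, 0.0731),
    ("F10", 0.7903, 0.3157),
    ("F11", 0.6314, 0.5720),
    ("F12", 0.7839, 0.6091),
]


def rank_by_difference(rates):
    """Return (feature, |TP rate - TN rate|) pairs, largest difference first."""
    diffs = [(name, abs(tp - tn)) for name, tp, tn in rates]
    return sorted(diffs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, diff in rank_by_difference(RATES):
        print(f"{name:4s} {diff:.4f}")
```

Sorted this way, the feature with the largest gap comes first, which makes it easy to see which one stands apart from the rest.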

Feature 9 in particular appears to be quite problematic.
May I attach a compressed tar file showing density plots of the corresponding (raw data) distributions? Or should I provide them in some other way?
I would be glad if anyone could help me out with that.
By outlier handling I meant dropping the significantly deviating values before normalizing to e.g. [-1,1], because feature 9 in particular shows extremely large values (3489232) while its median lies at 57316.65. Squeezing the whole feature range down without removing some values first (winsorizing, for example, was suggested to me) would probably render the feature completely useless.

Thanks again & Best Regards,
Paul
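
A minimal sketch of the winsorize-then-rescale idea mentioned above, in plain Python. Only the extreme value 3489232 is taken from the post; the remaining sample values, the 5th/95th percentile limits and the function names are assumptions chosen for illustration, not a recommendation for this dataset.

```python
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values outside the given percentiles to the percentile values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def scale_to_unit_interval(values):
    """Linearly map values to [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


# 21 made-up "typical" values plus the extreme value quoted in the post.
typical = [50000 + 700 * i for i in range(21)]
data = typical + [3489232]

if __name__ == "__main__":
    naive = scale_to_unit_interval(data)
    print("typical range without winsorizing:", min(naive[:-1]), max(naive[:-1]))
    robust = scale_to_unit_interval(winsorize(data))
    print("typical range after winsorizing:  ", min(robust[:-1]), max(robust[:-1]))
```

Without the clamping step the typical values all land within a hair of -1, which is the "useless feature" effect described in the message; after clamping they spread over most of the interval.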

Ronchi, Davide | 1 Sep 2011 15:41

Re: CV vs. Test Set


On 1 Sep 2011, at 08:02, Harri Saarikoski wrote:

> 
> 
> 2011/8/31 Ronchi, Davide <davide.ronchi.10 <at> ucl.ac.uk>
> Hi all.
> 
> In a machine learning experiment I have encountered this problem: if I train a DT with half of the data as
> training set and half as test set, I obtain a classifier precision (94%) that is lower than when I use all the
> data for training and I run 10x CV (99%). The data set is huge (more than 1.2 million samples) and the
> classifier is trained to classify each sample into two different classes. I have a feeling that this is due
> to the fact that in the 10x CV case the training is done over 90% of the data set, thus yielding better results
> than when using a 50/50 split. Could anyone point me to an explanation of why this happens? I have read the
> Weka 3.6 book, but I haven't found very clear the chapter in which they compare different evaluation techniques.
>
> 90/10 yielding higher accuracies than 50/50 should no longer be true with 1.2 mill samples because effect
> of training volume should disappear after tens of thousands at the latest
> -> investigate the class distribution: i.e. one of the two classes has considerably fewer samples, then
> it's effectively the same as if you had as few samples in total which would then explain higher accuracy
> (otherwise I guess it's the very unlikely but possible random sampling effect, i.e. that the extra 40%
> contain information that the 50% or 600,000 samples didn't contain)

Yes, that is the case. ~70% of the samples belong to one class, and ~30% belong to the other. Unfortunately,
this is how our data is, and we can't do anything about it :(

Which form of validation do you think is the most realistic in this case? 10x CV or 50/50?

Thanks,
Davide

wessel van persie | 1 Sep 2011 16:37

Re: Classifier Configuration Advices

Dear Paul,

It is very hard to give advice on how to build a good classifier without seeing the actual data.

Best regards,

Wessel

Joel Lang | 1 Sep 2011 22:05

Special Characters in String Values

Hello

I have a dataset with a string attribute and one of its values contains
the Unicode character 'Information Separator Two' (\u001E)

http://www.fileformat.info/info/unicode/char/1e/index.htm

When the dataset is written to an ARFF file, the value containing the
character is not quoted. This leads to a problem when the dataset is
read back in, probably because the StreamTokenizer recognizes the
character as a separator. As a fix I've added the character to the list
of chars in Utils.quote() which require quoting. Presumably, there are
other characters that would need to be added there as well.

Best,

Joel
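
A minimal sketch of the kind of rule described above, in plain Python. It is not Weka's Utils.quote(); the character set and the function names needs_quoting and quote_if_needed are hypothetical. The assumption illustrated here is that any control character (code point below 0x20, which includes \u001E) is treated as requiring quotes, rather than listing such characters one by one.

```python
# Hypothetical set of printable characters that force quoting.
NEEDS_QUOTING = set(" ,{}%'\"\\")


def needs_quoting(value):
    """True if the value is empty or contains a separator-like or control character."""
    return value == "" or any(
        ch in NEEDS_QUOTING or ord(ch) < 0x20 for ch in value
    )


def quote_if_needed(value):
    """Wrap the value in single quotes when required, escaping \\ and '."""
    if not needs_quoting(value):
        return value
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


if __name__ == "__main__":
    for sample in ["plain", "two words", "a\u001eb", "it's"]:
        print(repr(quote_if_needed(sample)))
```

Checking a whole range of code points instead of an explicit list is one way to cover the "other characters" mentioned at the end of the message, at the cost of quoting more values than strictly necessary.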

--

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Mark Hall | 1 Sep 2011 22:43

Re: CV vs. Test Set

On 2/09/11 1:41 AM, Ronchi, Davide wrote:
>
> On 1 Sep 2011, at 08:02, Harri Saarikoski wrote:
>
>>
>>
>> 2011/8/31 Ronchi, Davide<davide.ronchi.10 <at> ucl.ac.uk>
>> Hi all.
>>
>> In a machine learning experiment I have encountered this problem: if I train a DT with half of the data as
>> training set and half as test set, I obtain a classifier precision (94%) that is lower than when I use all the
>> data for training and I run 10x CV (99%). The data set is huge (more than 1.2 million samples) and the
>> classifier is trained to classify each sample into two different classes. I have a feeling that this is due
>> to the fact that in the 10x CV case the training is done over 90% of the data set, thus yielding better results
>> than when using a 50/50 split. Could anyone point me to an explanation of why this happens? I have read the
>> Weka 3.6 book, but I haven't found very clear the chapter in which they compare different evaluation techniques.
>>
>> 90/10 yielding higher accuracies than 50/50 should no longer be true with 1.2 mill samples because
>> effect of training volume should disappear after tens of thousands at the latest
>> ->  investigate the class distribution: i.e. one of the two classes has considerably fewer samples, then
>> it's effectively the same as if you had as few samples in total which would then explain higher accuracy
>> (otherwise I guess it's the very unlikely but possible random sampling effect, i.e. that the extra 40%
>> contain information that the 50% or 600,000 samples didn't contain)
>
> Yes, that is the case. ~70% of the samples belongs to one class, and ~30% belong to the other.
> Unfortunately, this is how our data is, and we can't do anything about it :(
>
> Which form of validation do you think is the most realistic in this case? 10x CV or 50/50?

Weka performs randomization and stratification of the data when doing a 
cross-validation. A percentage split just randomizes the data first. It 
might be worth trying a 2-fold cross-validation to see how this compares 
to the random 50/50 split and the 10-fold cross-validation.

Cheers,
Mark.
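
A minimal sketch of the difference described above between a stratified cross-validation and a purely random percentage split, in plain Python. It is not Weka's implementation; the 70/30 labels mirror the class ratio reported in the thread, and the function names, the seed and the data size are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds so every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal the shuffled indices of each class round-robin over the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def random_split(labels, train_fraction=0.5, seed=0):
    """Shuffle all indices and cut once, ignoring class membership."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_fraction)
    return indices[:cut], indices[cut:]


def minority_share(labels, indices, minority="b"):
    """Fraction of the given indices that belong to the minority class."""
    counts = Counter(labels[i] for i in indices)
    return counts[minority] / len(indices)


# Made-up labels with the ~70/30 ratio mentioned in the thread.
labels = ["a"] * 700 + ["b"] * 300

if __name__ == "__main__":
    for fold in stratified_folds(labels, 2):
        print("stratified fold:", minority_share(labels, fold))
    train, test = random_split(labels)
    print("random split test part:", minority_share(labels, test))
```

In the stratified folds the minority share is the same in every fold, while in the random split it only approximates the overall ratio. With a very large dataset the two usually end up close, which is why comparing a 2-fold cross-validation against the 50/50 split is a useful check.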


