Harri Saarikoski | 1 Jan 05:39 2010

Re: Attribute Ranking in Weka



2009/12/31 Feras <feras_o <at> yahoo.com>
 
Hi,
 
Is there a way in Weka to rank the attributes from best to worst by assigning them numeric scores representing their importance in the data?

See manually insertable feature weights: the XRFF format in the Weka docs.
These are limited to some classifiers only; see the Wekalist archives on feature/attribute weights.
 
 
If available, please provide me with the code or link to do it using some sample dataset.
 
Cheers,
 
Feras
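
For readers who want to see what "ranking attributes by a numeric score" looks like, here is a minimal sketch that scores nominal attributes by information gain and sorts them best first. It is not Weka code and not the XRFF weighting mentioned above; Weka's attribute selection (for example the InfoGainAttributeEval evaluator with the Ranker search method) produces a ranking of this general kind. The toy data and the function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the class entropy left after splitting on one nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def rank_attributes(columns, labels):
    """Return (attribute name, score) pairs, best first."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: two nominal attributes and a class label.
columns = {
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy":   ["yes", "no", "yes", "no"],
}
labels = ["no", "no", "yes", "yes"]

ranking = rank_attributes(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy set "outlook" separates the classes perfectly and "windy" not at all, so "outlook" is ranked first.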


_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html




--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
_______________________________________________
Joanna Pschorr | 1 Jan 18:02 2010

Re: How to understand data format after applying StringToWordVector?

>> My Question is can anybody explain what the data means after applying
>> StringToWordVector? See  http://old.nabble.com/file/p26970215/data.txt
>> data.txt  data file. I know StringToWordVector transform data into sort of
>> TF-IDF transformation.
>> But see a test instance before and after applying the StringToWordVector:
>>
>> @data
>> 'matricid stori hors wayn panti pop highwai noir thriller refer ',5
>>
>> @data
>> {0 5,150 1.555245,201 1.662094,207 2.268923,219 2.623,232 1.606613,262
>> 0.793745,279 2.268923,305 2.423594}
>>
>> what 0 5, 150 1.555245 means?
>>
>
Hi,

I must correct myself: The pair of values 0 5 is the class of course,
I must have overlooked that your class names are numbers, sorry for
that. Terms not found in the training data can't possibly appear in
the vector or even get a weight. The attribute containing class names
(Trainraiting) is the first one (position 0) and the category of your
test data record is 5 so then it makes sense.

Regards,
\joanna

_______________________________________________

musi | 1 Jan 18:48 2010

Re: How to understand data format after applying StringToWordVector?


Yes, I got it; it makes sense.
thanks,

Joanna Pschorr wrote:
> 
>>> My Question is can anybody explain what the data means after applying
>>> StringToWordVector? See  http://old.nabble.com/file/p26970215/data.txt
>>> data.txt  data file. I know StringToWordVector transform data into sort
>>> of
>>> TF-IDF transformation.
>>> But see a test instance before and after applying the
>>> StringToWordVector:
>>>
>>> @data
>>> 'matricid stori hors wayn panti pop highwai noir thriller refer ',5
>>>
>>> @data
>>> {0 5,150 1.555245,201 1.662094,207 2.268923,219 2.623,232 1.606613,262
>>> 0.793745,279 2.268923,305 2.423594}
>>>
>>> what 0 5, 150 1.555245 means?
>>>
>>
> Hi,
> 
> I must correct myself: The pair of values 0 5 is the class of course,
> I must have overlooked that your class names are numbers, sorry for
> that. Terms not found in the training data can't possibly appear in
> the vector or even get a weight. The attribute containing class names
> (Trainraiting) is the first one (position 0) and the category of your
> test data record is 5 so then it makes sense.
> 
> Regards,
> \joanna
> 


_______________________________________________
musi | 1 Jan 18:51 2010

SMOreg error: exceptionjava.lang.ArrayIndexOutOfBoundsException:


I used SMOreg and I am getting the following error while evaluating the
model (using the same dataset)

exceptionjava.lang.ArrayIndexOutOfBoundsException: 11
java.lang.ArrayIndexOutOfBoundsException: 11
        at weka.core.SparseInstance.index(Unknown Source)
        at weka.classifiers.functions.supportVector.RegOptimizer.SVMOutput(Unknown Source)
        at weka.classifiers.functions.SMOreg.classifyInstance(Unknown Source)
        at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Unknown Source)
        at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
        at netflix.weka.SVMRegTextByAttribute.evaluatePredictiveModel(SVMRegTextByAttribute.java:617)
        at netflix.weka.SVMRegTextByAttribute.doNBSteps(SVMRegTextByAttribute.java:338)
        at netflix.weka.SVMRegTextByAttribute.main(SVMRegTextByAttribute.java:754)

Can someone explain why this happens? BTW, I am using StringToWordVector with
the "-T" and "-O" options to convert the text into numeric values. My
output class is an integer in the range 1-5.

(I should mention that the same dataset works well with LinearRegression in
Weka.)
Thanks for your time, 

_______________________________________________
Harri Saarikoski | 2 Jan 12:31 2010

Re: SMOreg error: exceptionjava.lang.ArrayIndexOutOfBoundsException:



2010/1/1 musi <eng.musi <at> gmail.com>


I used SMOreg and I am getting the following error while evaluating the
model (using the same dataset)

exceptionjava.lang.ArrayIndexOutOfBoundsException: 11

While we wait for a Weka code expert to give the list a hand, we may recycle Peter's old replies from the Wekalist archives:
http://old.nabble.com/ArrayIndexOutOfBoundsException---due-to-change-in-training-dataset-td26602953.html

Does that help you? It refers to StringToWordVector too.
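
The title of the linked thread points at a mismatch between the training data and the data being classified. A quick way to test for that is sketched below, on the assumption (not confirmed in this thread) that the training and test sets were vectorised separately and ended up with different attribute lists. In that case a sparse row can carry an attribute index that the model's header has no slot for. The function name, the sample row and the attribute count of 11 are all hypothetical.

```python
def out_of_range_indices(sparse_row, num_attributes):
    """Return the attribute indices in a sparse ARFF row that the header cannot hold."""
    indices = []
    for pair in sparse_row.strip().strip("{}").split(","):
        # Each entry is "<attribute index> <value>"; only the index matters here.
        index_text, _value = pair.strip().split(None, 1)
        indices.append(int(index_text))
    return [i for i in indices if i >= num_attributes]

# Hypothetical case: a model trained on 11 attributes (valid indices 0..10)
# is handed a row that was vectorised against a larger dictionary.
row = "{0 5,3 1.2,11 0.7,40 2.1}"
bad = out_of_range_indices(row, 11)
print(bad)  # [11, 40]
```

An empty result for every test row would suggest that the cause lies elsewhere.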
 
java.lang.ArrayIndexOutOfBoundsException: 11
       at weka.core.SparseInstance.index(Unknown Source)
       at weka.classifiers.functions.supportVector.RegOptimizer.SVMOutput(Unknown Source)
       at weka.classifiers.functions.SMOreg.classifyInstance(Unknown Source)
       at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Unknown Source)
       at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
       at netflix.weka.SVMRegTextByAttribute.evaluatePredictiveModel(SVMRegTextByAttribute.java:617)
       at netflix.weka.SVMRegTextByAttribute.doNBSteps(SVMRegTextByAttribute.java:338)
       at netflix.weka.SVMRegTextByAttribute.main(SVMRegTextByAttribute.java:754)

Can someone explain why this happens? BTW, I am using StringToWordVector with
the "-T" and "-O" options to convert the text into numeric values. My
output class is an integer in the range 1-5.

(I should mention that the same dataset works well with LinearRegression in
Weka.)
Thanks for your time,







--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
_______________________________________________
Harri Saarikoski | 2 Jan 12:44 2010

Re: How to understand data format after applying StringToWordVector?



2009/12/30 musi <eng.musi <at> gmail.com>

Hi,

I am applying StringToWordVector to text strings. I have a training set
and a test set, and I am using a Java program to apply StringToWordVector and
train a classifier (Naive Bayes and SVM). I have 5 nominal classes
{1,2,3,4,5}.

My question: can anybody explain what the data means after applying
StringToWordVector? See the data file at
http://old.nabble.com/file/p26970215/data.txt
I know StringToWordVector applies a sort of TF-IDF transformation to the data.
Here is a test instance before and after applying StringToWordVector:

@data
'matricid stori hors wayn panti pop highwai noir thriller refer ',5

@data
{0 5,150 1.555245,201 1.662094,207 2.268923,219 2.623,232 1.606613,262
0.793745,279 2.268923,305 2.423594}

What do 0 5 and 150 1.555245 mean?


This is Weka ARFF in sparse format, which is convenient when there are many binary attributes.
It omits, for each instance, the attributes whose value is zero, and thus saves time and space.

In my word sense disambiguation set, which I don't think is dissimilar to a text mining set:
@attribute 'keen<>to' {0,1}
@attribute 'pursued<>we' {0,1}
@attribute 'senseclass' {...}
@data
{13 1, 282 1, 493 1, 1033 1, 1388 1, 1712 1, 1791 1, 1961 1, 2528 1, 2884 1, 4308 1, 4704 1, 4900 art~1:06:00::}

The first number of each pair is the attribute's index (counting from 0) in the ARFF list of attributes;
the second is its value, which here is always '1', the only non-zero value allowed, except for the class.

In your data the second number appears to be the product of the TF-IDF transformation of columns
that originally (before the transformation) were a bunch of ones and zeroes
(i.e. word present or absent). All the other attributes (e.g. 1-149)
have value zero for this instance, i.e. those words do not occur in it.
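
To make the index/value reading concrete, here is a small sketch that parses the sparse row from the question and expands it. It only illustrates the format described above and is not Weka's own reader. Values are kept as text, and the total of 306 attributes is an assumption (the highest index in the row is 305; the real header may list more).

```python
def parse_sparse_row(row):
    """Turn a sparse ARFF data row into {attribute index: value text}."""
    entries = {}
    for pair in row.strip().strip("{}").split(","):
        index_text, value = pair.strip().split(None, 1)
        entries[int(index_text)] = value
    return entries

def to_dense(entries, num_attributes, missing="0"):
    """Expand the sparse entries; every numeric attribute that is not listed is zero."""
    return [entries.get(i, missing) for i in range(num_attributes)]

row = ("{0 5,150 1.555245,201 1.662094,207 2.268923,219 2.623,"
       "232 1.606613,262 0.793745,279 2.268923,305 2.423594}")
entries = parse_sparse_row(row)
print(entries[0])     # value of attribute 0 (the class): 5
print(entries[150])   # weight of the word at index 150: 1.555245

dense = to_dense(entries, 306)  # 306 is an assumed attribute count
print(dense[1], dense[150])     # an omitted attribute and a listed one
```

So "0 5" reads as "attribute 0 has value 5" and "150 1.555245" as "attribute 150 has value 1.555245"; everything not listed is zero.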

Harri
 
I am applying the same StringToWordVector over training set and test set.
Furthermore, I am not getting good scores for this problem using Naive
Bayes. What can be wrong?

Thanks for your reply,
Rgrds,

Mustansar Ali Ghazanfar  (Musi)
PhD, ISIS group, ECS,
University of Southampton, UK










--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
_______________________________________________
sunwing | 2 Jan 12:51 2010

Re: about REPtree


Dear Harri,

Thank you very much for your kind help.

Xiaoyu


_______________________________________________
Nancy Adam | 2 Jan 23:47 2010

RMSE and R2

 

Hi everyone,

Happy New Year.

 

Can anyone please help me with these questions?

Is the RMSE that Weka computes based on the training set or on the test set?

 

Can R2 be computed from RMSE?

 

When can I say that one regression program is better than another in terms of RMSE, i.e. when its RMSE is bigger or when it is smaller than the other program's?

 

Many thanks,

Nancy


_______________________________________________
Harri Saarikoski | 3 Jan 11:16 2010

Re: RMSE and R2



2010/1/3 Nancy Adam <nancyadam84 <at> hotmail.com>

 

Hi everyone,

Happy New Year.

 

Can anyone please help me with these questions?

Is the RMSE that Weka computes based on the training set or on the test set?


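It depends on the evaluation option:
with cross-validation, on the held-out folds of the data you supplied (each fold is predicted by a model trained on the other folds);
with a separate test set, on that test set;
with evaluation on the training set, on the training data itself.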
 

 

Can R2 be computed from RMSE?


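Both are computed from the same information: the actual outcomes and their predicted values (and hence the offsets between them).
Root mean squared error is the square root of the mean squared error,
while R2 is the square of the sample correlation coefficient between actual and predicted values.
RMSE alone is therefore not enough to get R2: you also need the variance of the actual values
(for the coefficient-of-determination form, R2 = 1 - MSE / variance of the actuals).
A practical difference in how they apply to different cases is that the correlation coefficient is
bounded between -1 and 1 (so its square lies between 0 and 1),
while RMSE, which models the quantity of error, is unbounded, from 0 to infinity, and is expressed in the units of the target.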
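
As a worked illustration of how the two measures relate, the sketch below computes RMSE, the squared correlation coefficient, and the coefficient of determination obtained from RMSE plus the variance of the actual values (1 - MSE / variance). The numbers are made up and the function names are only for this example; it does not reproduce Weka's evaluation output.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def correlation(actual, predicted):
    """Sample (Pearson) correlation coefficient, always between -1 and 1."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def r2_from_rmse(rmse_value, actual):
    """Coefficient of determination: 1 - MSE / variance of the actual values."""
    n = len(actual)
    mean_a = sum(actual) / n
    variance = sum((a - mean_a) ** 2 for a in actual) / n
    return 1.0 - rmse_value ** 2 / variance

# Made-up actual and predicted values.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.5, 4.5]

error = rmse(actual, predicted)
r_squared = correlation(actual, predicted) ** 2
r2_det = r2_from_rmse(error, actual)
print(error, r_squared, r2_det)
```

For these numbers the squared correlation is about 0.903 while 1 - MSE / variance is 0.9, so the two forms of R2 are close but not identical in general, and neither can be recovered from RMSE without also knowing the spread of the actual values.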

 

When can I say that one regression program is better than another in terms of RMSE, i.e. when its RMSE is bigger or when it is smaller than the other program's?


Good = when the error that RMSE measures is as small as possible, so the program with the smaller RMSE (on the same test data) is the better one.

Harri

 

Many thanks,

Nancy







--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
_______________________________________________
Mark Hall | 3 Jan 22:34 2010

Re: about REPtree

Harri Saarikoski wrote:
> 
> 
> 2009/12/30 sunwing <ap.quxiaoyu <at> archi.kyoto-u.ac.jp 
> <mailto:ap.quxiaoyu <at> archi.kyoto-u.ac.jp>>
> 
> 
>     Dear all,
> 
>     1)I am wondering the meaning of 
>     numbers in the leave node of a REPtree.
>     for example:
> 
>     4:68.83 (41.57/1989.69) [24.01/4049.91]
>     I only know"4" at the beginning is the series number of node, but
>     what do
>     the following numbers mean?
> 
> 
> well, in a reptree built from cpu dataset:
> MMAX < 6100
> |   MMAX < 3500 : 20.03 (21/5.14) [13/5]
> |   MMAX >= 3500 : 28.5 (32/8.41) [12/23.49]
> MMAX >= 6100 : 40.89 (19/16.14) [14/41.38]
> 
> the bit before ":" is the conditional rule ("feature value <=> value 
> cutoff") that was built
> the first number after it is class value, to which the rule is attached
> which in sets with numeric class is naturally a number
> cf. reptree on nominal class sets (iris):
> petallength < 2.5 : Iris-setosa (33/0) [17/0]
> petallength >= 2.5
> |   petalwidth < 1.75 : Iris-versicolor (36/3) [18/2]
> |   petalwidth >= 1.75 : Iris-virginica (31/1) [15/0]
> 
> then come the brackets... phew
> 
> using reptree model on iris set as example:
> -> first figures in both brackets sum up to 150, so for each rule, 
> that's the number of instances matching that rule
> and their sum is total number of instances in the set
> -> latter figures in both brackets sum up to 6, so for each rule, this 
> is the number of instances misclassified by that rule
> and their sum is the total number of misclassified instances
> 
> first/second set of brackets seem to be evaluation statistics of each 
> rule on (non-current) [current] class

The second set of brackets holds the statistics computed on the hold-out set that 
is used for pruning the tree; the first set is computed on the data used to grow it.
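
The reading above can be checked against the iris tree quoted in this thread with a small sketch that pulls the two bracket pairs out of each leaf line. The regular expression is written only from the leaf format shown in the thread; it is an assumption that other REPTree output follows the same layout, and the field names are made up.

```python
import re

# ": <prediction> (<grow count>/<grow error>) [<prune count>/<prune error>]"
LEAF = re.compile(
    r":\s*(?P<prediction>\S+)\s*"
    r"\((?P<grow_n>[\d.]+)/(?P<grow_err>[\d.]+)\)\s*"
    r"\[(?P<prune_n>[\d.]+)/(?P<prune_err>[\d.]+)\]"
)

def parse_leaf(line):
    """Return the leaf statistics of one tree line, or None for a non-leaf line."""
    m = LEAF.search(line)
    if m is None:
        return None
    return {
        "prediction": m.group("prediction"),
        "grow": (float(m.group("grow_n")), float(m.group("grow_err"))),
        "prune": (float(m.group("prune_n")), float(m.group("prune_err"))),
    }

tree = """\
petallength < 2.5 : Iris-setosa (33/0) [17/0]
petallength >= 2.5
|   petalwidth < 1.75 : Iris-versicolor (36/3) [18/2]
|   petalwidth >= 1.75 : Iris-virginica (31/1) [15/0]
"""
leaves = [leaf for leaf in map(parse_leaf, tree.splitlines()) if leaf]
grow_total = sum(leaf["grow"][0] for leaf in leaves)
prune_total = sum(leaf["prune"][0] for leaf in leaves)
print(grow_total, prune_total)
```

For the iris tree the parenthesised counts add up to 100 and the square-bracket counts to 50, which together cover the 150 instances: the data is split into a part used to grow the tree and a hold-out part used to prune it.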

Cheers,
Mark.

_______________________________________________
