J D | 1 Aug 2007 01:06

Tricks to improve J48 model performance?

Hello. First let me say that I'm something of a novice at data mining, so thank you in advance for bearing with me. I'm running J48 with default parameters on a table with a boolean dependent variable (the class) and 2954 independent variables (attributes), all of which are integer and decimal values.

I've been adding new attributes to try to improve the model's performance, but the predictive accuracy never rises above 60%, even though the new attributes are being used in the model. Now here's the weird part: if I look through the tree and remove attributes that don't seem to be predictive, the accuracy goes down.

I'm getting some good classification groups, so I should be able to get better than 60% accuracy. The fact that it sticks at 60% makes me think that I've hit some plateau with the J48 algorithm and my data set size. Maybe a change in parameters or algorithm would help? Or maybe I need to use guided attribute selection? WEKA is so full of options that I'm a little confused about where to start. Any suggestions or pointers to references would be greatly appreciated.

_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
Mark Hall | 1 Aug 2007 06:18

Re: [Re: testing M of R classifiers]

On Mon, Jul 30, 2007 at 06:57:02AM -0700, Tim Menzies wrote:

> 
> One follow-up question. When I look at the files written via -threshold-file, I see that
> their size differs depending on the learner, e.g. the threshold files generated
> by logistic regression and bayes are typically much larger than, say, those from jrip.

Weka generates an ROC curve by first ordering predictions according to
the probability assigned to the positive class. The curve is generated
by varying a threshold on the probability assigned to the positive
class. I.e., start at 1 and gradually reduce the value down to 0. The
number of true and false positives can simply be read off of the
ranked list. Since rules and decision trees compute probabilities
based on frequencies accumulated from rather small samples at the
leaves, there are usually many ties in the probabilities assigned to
the positive class. Because of this, there are only a few true
threshold values (i.e. points in the ranked list where the probability
changes value).
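
(Illustration only, not Weka's own code: a minimal Java sketch of the
procedure described above, run on made-up probabilities. The class name
RocSketch and the sample data are hypothetical. It ranks predictions by
positive-class probability and emits one curve point per distinct
probability value, so tied scores collapse into a single threshold.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy ROC construction: rank by positive-class probability, sweep a threshold. */
final class RocSketch {

    /** One curve point: true/false positive counts at a given threshold. */
    static final class Point {
        final double threshold;
        final int truePositives;
        final int falsePositives;

        Point(double threshold, int truePositives, int falsePositives) {
            this.threshold = threshold;
            this.truePositives = truePositives;
            this.falsePositives = falsePositives;
        }
    }

    /**
     * probs[i] is the probability assigned to the positive class for instance i;
     * positive[i] says whether instance i really is positive.
     */
    static List<Point> rocPoints(double[] probs, boolean[] positive) {
        Integer[] order = new Integer[probs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Rank predictions from highest to lowest positive-class probability.
        Arrays.sort(order, (a, b) -> Double.compare(probs[b], probs[a]));

        List<Point> points = new ArrayList<>();
        int tp = 0;
        int fp = 0;
        for (int k = 0; k < order.length; k++) {
            int i = order[k];
            if (positive[i]) {
                tp++;
            } else {
                fp++;
            }
            // Only emit a point where the probability changes value (or at the end):
            // tied predictions cannot be separated by any threshold.
            boolean lastOfTie = k == order.length - 1 || probs[order[k + 1]] != probs[i];
            if (lastOfTie) {
                points.add(new Point(probs[i], tp, fp));
            }
        }
        return points;
    }

    public static void main(String[] args) {
        boolean[] actual = {true, true, false, true, false, false};
        // Leaf frequencies: many ties, as with trees and rules.
        double[] treeLike = {0.9, 0.9, 0.9, 0.2, 0.2, 0.2};
        // Smoothly varying scores, as with logistic regression.
        double[] logisticLike = {0.93, 0.81, 0.64, 0.55, 0.31, 0.12};
        System.out.println("tree-like points:     " + rocPoints(treeLike, actual).size());
        System.out.println("logistic-like points: " + rocPoints(logisticLike, actual).size());
    }
}
```

With this sample data the tree-like scores give 2 curve points and the
logistic-like scores give 6, which mirrors the difference in threshold
file sizes.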

This is all discussed in the book by Witten and Frank.

Cheers,
Mark.

-- 
Mark Hall
Senior Lecturer
Department of Computer Science
University of Waikato
Hamilton
New Zealand
www.cs.waikato.ac.nz/~mhall

yue jiang | 1 Aug 2007 06:43

the detail information of instances' classification

Hello,

In Weka, when we use a classifier to train a model and predict a test 
dataset, it usually outputs a confusion matrix (or accuracy, error rate).

For example, if I have a test dataset with the following format:

              variable1,   var2,   .....   class
instance 1    x11,         x12,    .....   True
instance 2    x21,         x22,    .....   False
instance 3    x31,         x32,    .....   False
.....

How can I find the detailed classification result? That is, how can I 
know whether "instance 1" is classified as "True" or "False"?
Which parameter do I need to specify in Weka?

Any help will be greatly appreciated.

Yue Jiang
Marcin Zajączkowski | 1 Aug 2007 10:04

Problem with text classification (with use a classifier)

Hi,

I tried to do a simple text classification using Weka. As a test I 
wrote code that classifies a single word into one of two categories 
using the NaiveBayes classifier.
I'm able to create a training set, convert it using StringToWordVector, 
build the classifier and evaluate it, but I have problems using the 
classifier (it behaves strangely).

Note: because of the code snippets this post is quite long, so the 
question is in the marked section at the end of it.

First I create a training set:

Attribute wordAttribute = new Attribute("wordAttribute", (FastVector)null);

FastVector classVal = new FastVector(2);
classVal.addElement(SHORTER);
classVal.addElement(LONGER);
Attribute classAttribute = new Attribute("category", classVal);

FastVector attributes = new FastVector(2);
attributes.addElement(wordAttribute);
attributes.addElement(classAttribute);

Instances trainingSet = new Instances("Rel", attributes, 10);
trainingSet.setClassIndex(1);

and fill it using a construction like:

Instance tempElem = new Instance(2);
tempElem.setValue((Attribute)attributes.elementAt(0), word);
tempElem.setValue((Attribute)attributes.elementAt(1), category);

After that the training set looks like this:

@relation Rel

@attribute wordAttribute string
@attribute category {shorter,longer}

@data
ala,shorter
ela,shorter
ania,shorter
ula,shorter
malgosia,longer
krystyna,longer
mateuszek,longer
tadeuszek,longer

I convert it to a word vector:

StringToWordVector stringToWordVector = new StringToWordVector();

stringToWordVector.setInputFormat(trainingSet);
for (int i=0; i<trainingSet.numInstances(); i++) {
     stringToWordVector.input(trainingSet.instance(i));
}
if (!stringToWordVector.batchFinished()) {
     throw ...
}

Instances wordSet = stringToWordVector.getOutputFormat();
Instance tempInstance = null;
while ((tempInstance = stringToWordVector.output()) != null) {
     binarySet.add(tempInstance);
}

which looks like this:

@relation 'RelWordVector'

@attribute category {shorter,longer}
@attribute ala numeric
@attribute ania numeric
@attribute ela numeric
@attribute ula numeric
@attribute krystyna numeric
@attribute malgosia numeric
@attribute mateuszek numeric
@attribute tadeuszek numeric

@data
{1 1}
{3 1}
{2 1}
{4 1}
{0 longer,6 1}
{0 longer,5 1}
{0 longer,7 1}
{0 longer,8 1}

Evaluation result using training set or another prepared set looks ok:
Correctly Classified Instances           8              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0.0033
Root mean squared error                  0.0033
Relative absolute error                  0.6522 %
Root relative squared error              0.6522 %
Total Number of Instances                8

===== place with the question =====

But I'm unable to use the classifier on a specific object.

I make an instance:
         Instance tempElem = new Instance(2);
         tempElem.setDataset(trainingSet);
         tempElem.setValue(0, "grrrr");

Looks:
"grrrr,?

Convert it to a word vector:
StringToWordVector testStringToWordVector = new StringToWordVector();
testStringToWordVector.setInputFormat(trainingSet);
testStringToWordVector.input(tempElem);
testStringToWordVector.batchFinished();
Instance binaryTempSet = testStringToWordVector.output();

It looks: {0 ?,1 1}

And get distribution:
double[] distribution = model.distributionForInstance(binaryTempSet);
for (double val : distribution) {
     logger.info(val);
}

Which is *always*:
0.9952004499323915
0.004799550067608458
regardless of which value is set...

I tried many combinations, but I'm not able to get a good result. I'm 
not experienced in using Weka, so it could be some trivial mistake (or 
maybe not).
Could you tell me what I'm doing wrong when using the classifier?

Thanks for help
Marcin

Peter Reutemann | 1 Aug 2007 12:06

Re: Problem with text classification (with use a classifier)

> I tried to make a simple text classification using Weka. As a test I
> wrote a classification of a single word to a specified category (one of
> two) using NaiveBayes classifier.
> I'm able to make a training set, convert it using StringToWordVector,
> build classifier and evaluate it, but I have problems with using
> classifier (works weird).
>
> Note: Due to code snip sets post is quite long, so the question is in
> the marked section at the at of it.
>
>
> First I create a training set:
>
> Attribute wordAttribute = new Attribute("wordAttribute",
> (FastVector)null);
>
> FastVector classVal = new FastVector(2);
> classVal.addElement(SHORTER);
> classVal.addElement(LONGER);
> Attribute classAttribute = new Attribute("category", classVal);
>
> FastVector attributes = new FastVector(2);
> attributes.addElement(wordAttribute);
> attributes.addElement(classAttribute);
>
> Instances trainingSet = new Instances("Rel", attributes, 10);
> trainingSet.setClassIndex(1);
>
> and fill it using construction like:
>
> Instance tempElem = new Instance(2);
> tempElem.setValue((Attribute)attributes.elementAt(0), word);
> tempElem.setValue((Attribute)attributes.elementAt(1), category);
>
> After that training set looks:
>
> @relation Rel
>
> @attribute wordAttribute string
> @attribute category {shorter,longer}
>
> @data
> ala,shorter
> ela,shorter
> ania,shorter
> ula,shorter
> malgosia,longer
> krystyna,longer
> mateuszek,longer
> tadeuszek,longer
>
>
> I convert it to a word vector:
>
> StringToWordVector stringToWordVector = new StringToWordVector();
>
> stringToWordVector.setInputFormat(trainingSet);
> for (int i=0; i<trainingSet.numInstances(); i++) {
>      stringToWordVector.input(trainingSet.instance(i));
> }
> if (!stringToWordVector.batchFinished()) {
>      throw ...
> }
>
> Instances wordSet = stringToWordVector.getOutputFormat();
> Instance tempInstance = null;
> while ((tempInstance = stringToWordVector.output()) != null) {
>      binarySet.add(tempInstance);
> }
>
> which looks:
>
> @relation 'RelWordVector'
>
> @attribute category {shorter,longer}
> @attribute ala numeric
> @attribute ania numeric
> @attribute ela numeric
> @attribute ula numeric
> @attribute krystyna numeric
> @attribute malgosia numeric
> @attribute mateuszek numeric
> @attribute tadeuszek numeric
>
> @data
> {1 1}
> {3 1}
> {2 1}
> {4 1}
> {0 longer,6 1}
> {0 longer,5 1}
> {0 longer,7 1}
> {0 longer,8 1}
>
> Evaluation result using training set or another prepared set looks ok:
> Correctly Classified Instances           8              100      %
> Incorrectly Classified Instances         0                0      %
> Kappa statistic                          1
> Mean absolute error                      0.0033
> Root mean squared error                  0.0033
> Relative absolute error                  0.6522 %
> Root relative squared error              0.6522 %
> Total Number of Instances                8
>
>
> ===== place with the question =====
>
> But I'm unable to use a classifier on specified object.
>
> I make an instance:
>          Instance tempElem = new Instance(2);
>          tempElem.setDataset(trainingSet);
>          tempElem.setValue(0, "grrrr");
>
> Looks:
> "grrrr,?
>
> Convert it to a word vector:
> StringToWordVector testStringToWordVector = new StringToWordVector();

Wrong! You have to use the *same* filter for any subsequent
conversions, otherwise you initialize the filter again and the
dictionary changes. See "Batch filtering" in the "Use Weka in your Java
code" article:
http://weka.sourceforge.net/wiki/index.php/Use_Weka_in_your_Java_code#Batch_filtering

> testStringToWordVector.setInputFormat(trainingSet);
> testStringToWordVector.input(tempElem);
> testStringToWordVector.batchFinished();
> Instance binaryTempSet = testStringToWordVector.output();
>
> It looks: {0 ?,1 1}
>
> And get distribution:
> double[] distribution = model.distributionForInstance(binaryTempSet);
> for (double val : distribution) {
>      logger.info(val);
> }
>
> Which is *always*:
> 0.9952004499323915
> 0.004799550067608458
> independently what is set as a value...

You create a filter with a dictionary of only 1 word, but set the dataset
wrongly to the training set. That results of course in always the same
output (the index will always point to the same word in the training set,
no matter what actual word was in the second dataset).
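
(Illustration only: a toy Java sketch of the effect described above. It
does not use Weka; ToyWordVectorFilter is a hypothetical stand-in for
StringToWordVector that learns its dictionary from the first batch it
sees. A filter trained on the training set maps different words to
different vectors, while a freshly created filter fed a single word
always produces the same one-column vector.)

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy word-vector filter: the dictionary is fixed by the first (training) batch. */
final class ToyWordVectorFilter {
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    private boolean trained = false;

    /** First batch: learn the dictionary from the words, then convert them. */
    int[][] fitAndConvert(List<String> words) {
        for (String word : words) {
            dictionary.putIfAbsent(word, dictionary.size());
        }
        trained = true;
        return convert(words);
    }

    /** Later batches: reuse the existing dictionary; unknown words give an all-zero row. */
    int[][] convert(List<String> words) {
        if (!trained) {
            throw new IllegalStateException("call fitAndConvert on the training data first");
        }
        int[][] vectors = new int[words.size()][dictionary.size()];
        for (int row = 0; row < words.size(); row++) {
            Integer column = dictionary.get(words.get(row));
            if (column != null) {
                vectors[row][column] = 1;
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList("ala", "ela", "malgosia", "krystyna");

        // Correct: one filter, trained once, reused for every later instance.
        ToyWordVectorFilter trainedFilter = new ToyWordVectorFilter();
        trainedFilter.fitAndConvert(training);
        for (String word : Arrays.asList("krystyna", "grrrr")) {
            int[] vector = trainedFilter.convert(Arrays.asList(word))[0];
            System.out.println("same filter:  " + word + " -> " + Arrays.toString(vector));
        }

        // Wrong: a new filter per test word builds a one-word dictionary each time,
        // so every word ends up as the identical vector.
        for (String word : Arrays.asList("krystyna", "grrrr")) {
            int[] vector = new ToyWordVectorFilter().fitAndConvert(Arrays.asList(word))[0];
            System.out.println("fresh filter: " + word + " -> " + Arrays.toString(vector));
        }
    }
}
```

A classifier trained on four columns has no sensible way to interpret
the one-column vectors from the fresh filter, hence the constant output.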

...

HTH

Cheers, Peter

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
Marcin Zajączkowski | 1 Aug 2007 14:07

Re: Problem with text classification (with use a classifier)

On 2007-08-01 12:06, Peter Reutemann wrote:
>> I tried to make a simple text classification using Weka. As a test I
>> wrote a classification of a single word to a specified category (one of
>> two) using NaiveBayes classifier.
>> I'm able to make a training set, convert it using StringToWordVector,
>> build classifier and evaluate it, but I have problems with using
>> classifier (works weird).
(...)
>> But I'm unable to use a classifier on specified object.
>>
>> I make an instance:
>>          Instance tempElem = new Instance(2);
>>          tempElem.setDataset(trainingSet);
>>          tempElem.setValue(0, "grrrr");
>>
>> Looks:
>> "grrrr,?
>>
>> Convert it to a word vector:
>> StringToWordVector testStringToWordVector = new StringToWordVector();
> 
> Wrong! You have to use the *same* filter for converting any subsequent
> transformations, otherwise you initialize the filter again and the
> dictionary changes. See "Batch filtering" in the "Use Weka in your Java
> code" article:
> http://weka.sourceforge.net/wiki/index.php/Use_Weka_in_your_Java_code#Batch_filtering

Right. What is more, Filter.useFilter() is much more convenient than using 
custom loops.

>> testStringToWordVector.setInputFormat(trainingSet);
>> testStringToWordVector.input(tempElem);
>> testStringToWordVector.batchFinished();
>> Instance binaryTempSet = testStringToWordVector.output();
>>
>> It looks: {0 ?,1 1}
>>
>> And get distribution:
>> double[] distribution = model.distributionForInstance(binaryTempSet);
>> for (double val : distribution) {
>>      logger.info(val);
>> }
>>
>> Which is *always*:
>> 0.9952004499323915
>> 0.004799550067608458
>> independently what is set as a value...
> 
> You create a filter with a dictionary of only 1 word, but set the dataset
> wrongly to the training set. That results of course in always the same
> output (the index will always point to the same word in the training set,
> no matter what actual word was in the second dataset).

Thanks. Now I've got it.

So that means that if I had to use the classifier in a different 
program run (to classify a new element), I would have to save both the 
Classifier (e.g. NaiveBayes) and the Filter (StringToWordVector) 
somewhere (in my case most likely in a database).
Does Weka have any facilities for saving those objects, or is the 
easiest way to keep them serialized (here in a database)?

Thanks for help
Marcin

Bioinformatics Group | 1 Aug 2007 15:34

Alt-Shift-Left click doesn't work in JAVA

Dear all,

I tried to use the Alt-Shift-Left click shortcut from an application I built
using the code reported herein:

http://weka.sourceforge.net/wiki/index.php/Generating_ROC_curve

I tried to compile and launch it from within NetBeans 5.5.1 (the IDE
I'm using) and it's all OK. I can save images without any problem!
However, when I try to save the ROC image after launching the jar file from the
command line, e.g.
java -jar "C:\Documents and Settings\Admin\JavaApplication\dist\JavaApplication.jar"
the save dialog box seems to return a null pointer, generating a null pointer
exception:

java.lang.NullPointerException
        at weka.gui.visualize.PrintableComponent.saveComponent(Unknown Source)
        at weka.gui.visualize.PrintablePanel.saveComponent(Unknown Source)
        at weka.gui.treevisualizer.TreeVisualizer.mousePressed(Unknown Source)
        at java.awt.AWTEventMulticaster.mousePressed(Unknown Source)
        at java.awt.Component.processMouseEvent(Unknown Source)
        at javax.swing.JComponent.processMouseEvent(Unknown Source)
        at java.awt.Component.processEvent(Unknown Source)
        at java.awt.Container.processEvent(Unknown Source)
        at java.awt.Component.dispatchEventImpl(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Window.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)



(I have found something about this problem in the javadoc for the
saveComponent() method in the TreeVisualizer class, but the description seems to
be a little bit concise :) ).
What kind of problem could it be? I tried to launch the jar file by
double-clicking on it, but it shows the same problem. I experienced the same
problem even when trying to save the JPEG of a tree.
Many thanks in advance!

Filippo

P.S. I've imported weka.jar (v. 3.5.6) as a library in my project
 
Peter Reutemann | 1 Aug 2007 22:34

Re: Re: Problem with text classification (with use a classifier)

...

> So, that means that in case I had to use classifier in a different 
> program runs (to classify new element) I would have to save a Classifier 
> (e.g. NaiveBayes) and a Filter (StringToWordVector) somewhere (in my 
> case most likely in a database).
> has Weka any facilities in saving that objects or it's the easiest way 
> to keep it serialized (here in a database)?

All filters and classifiers are serializable. If you use the 
FilteredClassifier in conjunction with the StringToWordVector (and your 
choice of base classifier), you don't have to worry about saving both 
filter and classifier, only the classifier. I've never saved a serialized 
Java object in a database, but I guess you should be able to save it as 
a BLOB. For more information on serialization, see this Wiki article:
   http://weka.sourceforge.net/wiki/index.php/Serialization

And you could use a ByteArrayOutputStream instead of a FileOutputStream 
and save the resulting byte array in the database (and retrieve it via a 
ByteArrayInputStream again).
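
(Illustration only: a minimal Java sketch of the byte-array round trip
suggested above, using only the standard library. ToyModel is a
hypothetical stand-in for a trained classifier; any Serializable object
is handled the same way. Storing the byte array in a BLOB column is left
out, since that depends on your database and driver.)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

/** Serializes an object to a byte array (e.g. for a BLOB) and back. */
final class BlobSerialization {

    /** Hypothetical stand-in for a trained, serializable model. */
    static final class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final double[] weights;

        ToyModel(String name, double[] weights) {
            this.name = name;
            this.weights = weights;
        }
    }

    /** Writes the object into an in-memory buffer instead of a file. */
    static byte[] toBytes(Serializable object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds the object from bytes previously produced by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel model = new ToyModel("toy", new double[] {0.25, 0.75});
        byte[] blob = toBytes(model);               // this is what would go into the database
        ToyModel restored = (ToyModel) fromBytes(blob);
        System.out.println(restored.name + " " + Arrays.toString(restored.weights));
    }
}
```

The restored object is only usable if the same class versions are on the
classpath when it is read back.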

HTH

Cheers, Peter

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174

Peter Reutemann | 1 Aug 2007 22:50

Re: Alt-Shift-Left click doesn't work in JAVA

> I tried to use the Alt-Shift-Left click shortcut from an application I built
> using the code reported herein:
> 
> http://weka.sourceforge.net/wiki/index.php/Generating_ROC_curve
> 
> I tried to compile and launch it from within Netbeans 5.5.1 (the compiler
> i'm using) and it's all ok. I can save images without any problem!
> However, when I try to save the ROC image launching the jar file from the
> command line (
> e.g.
> java -jar "C:\Documents and
> Settings\Admin\JavaApplication\dist\JavaApplication.jar")
> the save dialog box seems to return a null pointer generating a null pointer
> exception 
> 
> java.lang.NullPointerException
>         at weka.gui.visualize.PrintableComponent.saveComponent(Unknown
> Source)
>         at weka.gui.visualize.PrintablePanel.saveComponent(Unknown Source)
>         at weka.gui.treevisualizer.TreeVisualizer.mousePressed(Unknown
> Source)
>         at java.awt.AWTEventMulticaster.mousePressed(Unknown Source)
...
> (I have found something about this problem in the javadoc regarding
> saveComponent() method in TreeVisualizer class but the description seems to
> be a little bit concise :) ).
> What kind of problem could it be? I tried to launch the jar file
> double-clicking on it but it shows the same problem. I experienced the same
> problem even trying to save the jpeg of a tree.
> Many thanks in adavance!
> 
> Filippo
> 
> P.S. I've imported weka.jar (v. 3.5.6) as library in my project

I assume that you only have a few classes in your Jar that are directly 
needed to compile it. But the weka.gui.visualize.PrintableComponent 
class discovers all classes derived from the class 
weka.gui.visualize.JComponentWriter in the package "weka.gui.visualize" 
and displays them in the combobox of the save dialog. If you don't 
bundle at least one such class from this particular package in your Jar 
as well, it won't work of course.

Since you're using NetBeans *with* Weka it works from within the IDE, 
but by invoking your Jar with the "-jar" option (which ignores the 
CLASSPATH environment variable!) it crashes with a NullPointerException.

Either add the necessary classes to your Jar (in the appropriate 
package) or supply the weka.jar as well in your Java call (but then with 
"-classpath" and not "-jar", since "-jar" only takes one Jar).
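
(Illustration only: a small Java diagnostic, assuming you can add a few
lines to your own application. ClasspathCheck and isLoadable are
hypothetical names. It prints the effective class path and reports
whether the JComponentWriter base class mentioned above can be loaded,
which makes the difference between the IDE launch and the "-jar" launch
visible.)

```java
/** Reports whether a class can be found on the class path the JVM is actually using. */
final class ClasspathCheck {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With "-jar" this property names only the launched Jar, not CLASSPATH.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));

        String writerBase = "weka.gui.visualize.JComponentWriter";
        System.out.println(writerBase + " loadable: " + isLoadable(writerBase));
    }
}
```

If the second line prints "false" when started with "-jar" but "true"
from within the IDE, the Weka classes are missing from the runtime class
path.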

HTH

Cheers, Peter

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174
Peter Reutemann | 2 Aug 2007 00:10

Jython and Weka

Hey!

I've submitted several new classes to the developer version of Weka (CVS 
or snapshot) so that new classifiers can also be written in Jython, not 
only in Java. Could be useful for the Python gurus out there... ;-)

Read the FAQ "Can I use Weka from Python?" for a quick overview:
http://weka.sourceforge.net/wiki/index.php/Frequently_Asked_Questions#Can_I_use_Weka_from_Python.3F

The Wiki article "Using Weka from Jython" explains in detail how to 
access Weka classes from Jython and also how to implement your own 
classifier that can be used within the Weka framework (with a Jython 
example implementation of ZeroR):
http://weka.sourceforge.net/wiki/index.php/Using_Weka_from_Jython

Cheers, Peter

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174
