From: Peter Reutemann <fracpete <at> gmail.com>
To: Weka machine learning workbench list. <wekalist <at> list.scms.waikato.ac.nz>
Sent: Sun, November 1, 2009 1:27:19 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors
> I have looked up the FAQ and have used the following:
>
> 0) Made 2 Directories mydata and testing. In mydata there are 2 directories,
> active and lazy, with text files in each. In testing, there are two
> directories, active and lazy which have one file each. Both the files in
> testing/active and testing/lazy are also present in the training (mydata)
> directories.
>
> 1)java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> ./testing/ > testing1.arff
>
> 2)java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> ./mydata/ > training1.arff
>
> 3)java -classpath ./weka.jar
> weka.filters.unsupervised.attribute.StringToWordVector -tokenizer
> "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -b -i training1.arff -o
> training2.arff -r testing1.arff -s testing2.arff
>
> Now I have a training and testing set in the same format.
>
> I used the GUI to select a classifier, build a model, and save it.
>
> 4)java -classpath ./weka.jar weka.classifiers.lazy.LWL -l
> lazy.LWL_oct_30_2009.model -T testing2.arff -p 0
Why do you use the GUI for generating the model? Why don't you
just build the classifier on the fly?
-t: training set
-T: test set
If you really want to save the model and re-use it later, then use the
-d option to do this.
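
A minimal sketch of what this could look like, reusing the file names from the steps quoted above. The `run` helper, the function names and `lwl.model` are illustrative assumptions, not part of Weka; `-c 1` is added because StringToWordVector leaves the class as the first attribute. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
WEKA_JAR=./weka.jar   # path taken from the quoted steps

# Hypothetical helper: print the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Build the classifier on the fly and print predictions for the test set.
# -t: training set, -T: test set, -c 1: class is the first attribute.
train_and_predict() {
  run java -classpath "$WEKA_JAR" weka.classifiers.lazy.LWL \
    -t training2.arff -T testing2.arff -c 1 -p 0
}

# Alternative: save the model with -d, then load it later with -l.
save_then_predict() {
  run java -classpath "$WEKA_JAR" weka.classifiers.lazy.LWL \
    -t training2.arff -c 1 -d lwl.model &&
  run java -classpath "$WEKA_JAR" weka.classifiers.lazy.LWL \
    -l lwl.model -T testing2.arff -c 1 -p 0
}
```

Calling `train_and_predict` covers the on-the-fly case; `save_then_predict` is only needed if the model should be reused later.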
> The output is as follows:
> === Predictions on test data ===
>
> inst# actual predicted error
> 1 0 0 0
> 2 0 0 0
>
> did something go wrong?
>
> The model has
>
> Relative absolute error 19.021 %
> Root relative squared error 59.9506 %
> Attributes: 1018
> Test mode: 16-fold cross-validation
> Instances: 18
Since you loaded the data in the Explorer, you should have noticed
that StringToWordVector puts the class attribute in the first
position. Unless you specify otherwise, the last attribute is used as
the class attribute (Explorer and command line), so your model was most
likely built to predict one of the word attributes instead.
BTW 16-fold CV is rather odd...
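
As a rough sketch, the class attribute and the number of folds can both be set on the command line: `-c` takes the 1-based index of the class attribute and `-x` the number of cross-validation folds. The `run` helper and the function name are illustrative assumptions, and 10 folds is just the usual default, not a recommendation specific to this data. Commands are printed unless `DRY_RUN=0`.

```shell
#!/bin/sh
WEKA_JAR=./weka.jar   # path taken from the quoted steps

# Hypothetical helper: print the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Cross-validate on the training set only.
# $1: number of folds (defaults to 10); -c 1: class is the first attribute.
cross_validate() {
  run java -classpath "$WEKA_JAR" weka.classifiers.lazy.LWL \
    -t training2.arff -c 1 -x "${1:-10}"
}
```

For example, `cross_validate` uses 10 folds and `cross_validate 5` uses 5.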
> Also, consider the case where I have a file and want to test if it is lazy
> or active (classification), will I have to re-do the whole procedure again
> Steps 1 through 4 !! It takes quite a few minutes on my puny laptop to
> crunch out the tokenized result.
There's nothing you can do about that: machine learning and data
mining are computationally expensive and memory-intensive processes.
But instead of performing those steps manually, automate them by
writing a script (batch, bash, Groovy, Jython, Perl, ...).
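
One possible shape for such a script, as a sketch only: it chains steps 1 to 4 from the quoted message, with the last step building the classifier on the fly. The `run` helper and the `pipeline` function are illustrative assumptions; directory and file names come from the quoted steps, and `-c 1` reflects the class attribute being first after StringToWordVector. Commands are printed by default; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}
CP=./weka.jar   # path taken from the quoted steps

# Hypothetical helper. Usage: run OUTFILE CMD...
# Sends stdout of CMD to OUTFILE; "-" leaves stdout alone.
run() {
  out=$1; shift
  if [ "$DRY_RUN" = 1 ]; then
    if [ "$out" = "-" ]; then echo "$*"; else echo "$* > $out"; fi
  elif [ "$out" = "-" ]; then
    "$@"
  else
    "$@" > "$out"
  fi
}

pipeline() {
  # Steps 1 and 2: convert the text directories to ARFF.
  run testing1.arff java -classpath "$CP" \
    weka.core.converters.TextDirectoryLoader -dir ./testing/ &&
  run training1.arff java -classpath "$CP" \
    weka.core.converters.TextDirectoryLoader -dir ./mydata/ &&
  # Step 3: batch mode (-b) filters both sets with the same dictionary.
  run - java -classpath "$CP" \
    weka.filters.unsupervised.attribute.StringToWordVector \
    -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" \
    -b -i training1.arff -o training2.arff -r testing1.arff -s testing2.arff &&
  # Step 4: train and predict in one call.
  run - java -classpath "$CP" weka.classifiers.lazy.LWL \
    -t training2.arff -T testing2.arff -c 1 -p 0
}
```

Calling `pipeline` first shows the four commands (the dry-run output drops the quotes around the tokenizer argument); `DRY_RUN=0` before sourcing the script makes it run them, stopping at the first step that fails.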
Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Thanks Peter, much obliged