Peter Reutemann | 1 Nov 2008 01:08

Re: Stream Classification of Data Set

> Hi. I'm a student new to machine learning and interested in dealing with a
> large data set. I've read the FAQ article concerning the topic
> (http://weka.sourceforge.net/wiki/index.php/Classifying_large_datasets), and
> I'm interested in stream, rather than batch methods of dealing with this
> problem. You mention that the UpdateableClassifier interface handles this
> but considering that this pretty much only covers Lazy and Bayesian
> classifiers I'm guessing this is actually intended for incremental
> classifiers (ie. train, use, then train some more), a subset of classifiers
> that don't require the full data set to reside in memory. Unless I'm
> mistaken several other algorithms such as neural networks, not to mention
> the humble conjunctive rule should be able to handle input sequentially
> (without random access), and hence have a 'buildClassifier' method that
> handles a stream of data to reduce the memory burden.
>
> I thought this was rather basic so I've been scrounging around for a couple
> days looking for such capabilities. Before I start mucking around the source
> or resort to MOA I was wondering if:
> a. I'm wrong and all non-Bayesian and non-Lazy classifiers truly need random
> access to the training data or...

Yes, most of the classifiers need that kind of access (at least the
way they are implemented) in order to compute statistics for
attributes, such as information gain.
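
To illustrate the kind of statistic meant here, the following is a minimal plain-Java sketch of information gain computed from a table of value/class counts. It is not Weka's implementation; the class and method names (InfoGainSketch, entropy, infoGain) are made up for this example. The counts must be complete before the gain can be evaluated, and a tree learner typically repeats the calculation on the subset of data at every node, which is one reason batch implementations keep the data accessible.

```java
import java.util.Arrays;

class InfoGainSketch {
    /** Shannon entropy (in bits) of a vector of class counts. */
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** counts[v][k] = number of instances with attribute value v and class k. */
    static double infoGain(int[][] counts) {
        int numClasses = counts[0].length;
        int[] classTotals = new int[numClasses];
        int total = 0;
        for (int[] row : counts) {
            for (int k = 0; k < numClasses; k++) {
                classTotals[k] += row[k];
                total += row[k];
            }
        }
        // Weighted entropy that remains after splitting on the attribute.
        double remainder = 0.0;
        for (int[] row : counts) {
            int rowTotal = Arrays.stream(row).sum();
            if (rowTotal > 0) {
                remainder += ((double) rowTotal / total) * entropy(row);
            }
        }
        return entropy(classTotals) - remainder;
    }

    public static void main(String[] args) {
        int[][] perfectSplit = {{5, 0}, {0, 5}}; // attribute separates the classes
        int[][] uselessSplit = {{3, 3}, {2, 2}}; // attribute carries no information
        System.out.println("perfect split: " + infoGain(perfectSplit));
        System.out.println("useless split: " + infoGain(uselessSplit));
    }
}
```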

> b. Weka already has these basic capabilities and I'm just not spotting them

Incremental classifier support is there; what is needed are contributions.

> or...
> c. Weka simply lacks such capabilities.

Peter Reutemann | 1 Nov 2008 01:30

Re: Query regarding testing on a testset

> The classification problem that I am looking at is {Spam,Legitimate} emails.
>
> I have trained a Naive model and a J48 model based on a training set of 600
> emails (half spam and half legit emails).
> I have stored the models on my disk and now I am using those while testing
> on a testset.
>
> Now, when I run the test, Naive filter seems to think that only the 1st test
> instance is legitimate while the rest are all spam.
> And this is the result even if I change the test set! It always gives all
> spam except the 1st one!
[...]
> The piece of code I have written for testing is as follows:
>
>          // loading the Naive trained model stored on disk
>         Classifier classifier = testspam.loadClassifierModel(model);
>
>         Instances unlabeled = new Instances(
>                 new BufferedReader(
>                   new FileReader(testdataset)));
>
>         //set class attribute
>         unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
>
>         //create copy
>         Instances labeled = new Instances(unlabeled);
>
>         //label instances
>         for (int i = 0; i < unlabeled.numInstances(); i++) {
>             double clsLabel =

Peter Reutemann | 1 Nov 2008 01:49

Re: Libsvm in Weka :Cannot handle binary class

> I'm having the same problem as above.
> I am trying to use Libsvm with One-class SVM option in Weka.
> I get error of Cannot Handle Binary Class when ever I perform
> the one-class svm operation.
>
> My arff file looks a bit like:
> .....
> @attribute Feature12 {1,2,3}
> @attribute Feature13 real
> @attribute Feature14 real
> @attribute Feature15 {1,0}
> @data
> .....
> Feature15 is the main class.
> I've tried two things
> 1) Left the arff file unchanged and created an Instances made up of
> only one class, however I still get "Cannot handle binary..."
> error when I try build the LibSvm Classifier on the instances.

It expects a class attribute with 1 class label, but yours has 2.
Hence the exception "cannot handle binary..."

> 2) Changed my arff file from "@attribute Feature15 {1,0}" to
> "@attribute Feature15 {1}"
> However in my code I get a "nominal value not declared in header"

Your data lists a label that is not declared in the header.
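
As a rough illustration of what that error means, here is a small plain-Java sketch that compares the values found in a data column against the labels declared for a nominal attribute. It is not Weka's ARFF parser; NominalCheckSketch, parseLabels and undeclared are hypothetical names used only here. With the header reduced to {1}, any row that still carries the value 0 is reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NominalCheckSketch {
    /** Parses the labels from a nominal declaration such as "{1,0}". */
    static Set<String> parseLabels(String spec) {
        String inner = spec.trim();
        inner = inner.substring(1, inner.length() - 1); // drop the braces
        Set<String> labels = new LinkedHashSet<>();
        for (String label : inner.split(",")) {
            labels.add(label.trim());
        }
        return labels;
    }

    /** Returns the data values that the header does not declare ("?" means missing). */
    static List<String> undeclared(Set<String> declared, List<String> values) {
        List<String> bad = new ArrayList<>();
        for (String v : values) {
            if (!v.equals("?") && !declared.contains(v)) {
                bad.add(v);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Set<String> declared = parseLabels("{1}");
        List<String> column = Arrays.asList("1", "0", "1", "?");
        System.out.println("undeclared values: " + undeclared(declared, column));
    }
}
```

So either the rows with the other label have to be removed from the data section, or the header has to keep declaring both labels.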

> Is it that there is something wrong with my arff file? i.e. was not
> converted properly from csv?

Peter Reutemann | 1 Nov 2008 01:52

Re: Probability of instance belonging to class

> I have a group of instances which belong to a class and there is only one
> class, say, students who have scored high in the subjects.
> Now, I would like to know what is the probability of the new test students
> to fall in this category of high scorers and then I want to
> sort these students in terms of probability.
> Is there any such utility in Weka to help me achieve this?

Not really. You could perform one-class classification with LibSVM and
then see whether a student is a high-scorer or not (but without a
probability). See the LibSVM article on the wiki (a link to that article
is provided in the FAQ "How do I use libsvm in Weka?").

Cheers, Peter
--

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
Jui Deshpande | 1 Nov 2008 02:08

Re: Query regarding testing on a testset

Thanks for replying so promptly!

Was the training data you used all numeric? I ask because for spam/legit data, my .arff file has only string attributes, so before calling buildClassifier() I first apply the StringToWordVector filter to it and then train on the result.

Could that be the reason for the issue that I might be facing?

Also, I am currently not using any filter on the unlabeled test file. Do I need to apply the same filter to that as well before testing it?

Jui

On Fri, Oct 31, 2008 at 8:30 PM, Peter Reutemann <fracpete <at> waikato.ac.nz> wrote:
> The classification problem that I am looking at is {Spam,Legitimate} emails.
>
> I have trained a Naive model and a J48 model based on a training set of 600
> emails (half spam and half legit emails).
> I have stored the models on my disk and now I am using those while testing
> on a testset.
>
> Now, when I run the test, Naive filter seems to think that only the 1st test
> instance is legitimate while the rest are all spam.
> And this is the result even if I change the test set! It always gives all
> spam except the 1st one!
[...]
> The piece of code I have written for testing is as follows:
>
>          // loading the Naive trained model stored on disk
>         Classifier classifier = testspam.loadClassifierModel(model);
>
>         Instances unlabeled = new Instances(
>                 new BufferedReader(
>                   new FileReader(testdataset)));
>
>         //set class attribute
>         unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
>
>         //create copy
>         Instances labeled = new Instances(unlabeled);
>
>         //label instances
>         for (int i = 0; i < unlabeled.numInstances(); i++) {
>             double clsLabel =
> classifier.classifyInstance(unlabeled.instance(i));
>             labeled.instance(i).setClassValue(clsLabel);
>
>             System.out.println(clsLabel + " -> " +
> unlabeled.classAttribute().value((int)clsLabel));
>         }

I've just tested your code (slightly modified version attached) on the
anneal UCI dataset (split into 2, training and unlabeled data). It
worked fine for me, I got all sorts of different labels predicted by
NaiveBayes (I assume that's the classifier you're using when you refer
to "Naive").

Just try your setup on a completely different dataset and see whether
you see a change in behaviour.

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist


Mark Hall | 1 Nov 2008 02:38

Re: Stream Classification of Data Set


On Nov 1, 2008, at 7:30 AM, Damian Johnson wrote:
>
>
> I thought this was rather basic so I've been scrounging around for a  
> couple days looking for such capabilities. Before I start mucking  
> around the source or resort to MOA I was wondering if:
> a. I'm wrong and all non-Bayesian and non-Lazy classifiers truly  
> need random access to the training data or...

Yes. There are some simpler methods that can learn incrementally.  
Weka has some (Naive Bayes, AODE and such like). Neural networks that  
do updates after each instance rather than after a batch can be  
trained incrementally as well. Ditto for the perceptron training  
procedure. More sophisticated learning methods typically need access  
to all the data. There are some decision tree learners (Hoeffding  
trees for example) that can learn incrementally. MOA has an  
implementation of Hoeffding trees but Weka doesn't, yet.
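
A minimal sketch of the per-instance perceptron update mentioned above, in plain Java and independent of Weka (OnlinePerceptron and trainOnStream are hypothetical names for this example). Each call to update() looks at one instance and then discards it, so the learner never needs the whole data set in memory; the passes parameter only mimics a longer stream over a tiny toy data set.

```java
class OnlinePerceptron {
    private final double[] weights;
    private double bias;
    private final double learningRate;

    OnlinePerceptron(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Returns +1 or -1 for a single instance. */
    int predict(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum >= 0 ? 1 : -1;
    }

    /** Incremental step: adjusts the model only when this instance is misclassified. */
    void update(double[] x, int label) {
        if (predict(x) != label) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] += learningRate * label * x[i];
            }
            bias += learningRate * label;
        }
    }

    /** Feeds the instances one at a time; 'passes' mimics a longer stream. */
    static OnlinePerceptron trainOnStream(double[][] xs, int[] ys, int passes) {
        OnlinePerceptron p = new OnlinePerceptron(xs[0].length, 1.0);
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < xs.length; i++) {
                p.update(xs[i], ys[i]);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] ys = {-1, 1, 1, 1}; // logical OR, which is linearly separable
        OnlinePerceptron p = trainOnStream(xs, ys, 10);
        for (double[] x : xs) {
            System.out.println(x[0] + ", " + x[1] + " -> " + p.predict(x));
        }
    }
}
```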
>
> c. Weka simply lacks such capabilities.

As Peter said, Weka is open source, so contributions of these kinds of  
algorithms/improvements are more than welcome.

>
>
> In addition Weka appears to be doing some rather... er, substantial  
> data duplication internally. I tried running Simple Linear  
> Regression over a tiny subset of my data as a sort of 'hello  
> world' (only 335 megs), while allocating 1.5 gigs of memory to the  
> task. However, it still died with an OutOfMemoryError citing the  
> copyInstances method of the Instances class. Hence I need to wonder  
> why Weka's demanding more than four times the memory of the data  
> set? Internally copying the whole data set seems more than a little  
> unwise - is Weka only intended for tiny, toy sets of data?

Internal copying (note that most copying is only shallow; modification  
of values results in deep-copying of the instance(s) in question) is  
used to protect the original data for other schemes that might be  
accessing it concurrently. If you have an application where this is  
not the case, then you can modify the code to avoid this.
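
To make the shallow versus deep distinction concrete, here is a plain-Java sketch of the general idea. It is not the code of Weka's Instances class; CopySketch, shallowCopy and setValue are names invented for this illustration. The copy is a new list that shares the row objects with the original, and a row is cloned only at the moment it is modified.

```java
import java.util.ArrayList;
import java.util.List;

class CopySketch {
    /** Shallow copy: a new list that shares the same row arrays. */
    static List<double[]> shallowCopy(List<double[]> data) {
        return new ArrayList<>(data);
    }

    /** Copy-on-write: clone the affected row before changing it. */
    static void setValue(List<double[]> data, int row, int col, double value) {
        double[] copy = data.get(row).clone();
        copy[col] = value;
        data.set(row, copy);
    }

    public static void main(String[] args) {
        List<double[]> original = new ArrayList<>();
        original.add(new double[]{1.0, 2.0});
        original.add(new double[]{3.0, 4.0});

        List<double[]> copy = shallowCopy(original);
        setValue(copy, 0, 0, 99.0);

        System.out.println("original row 0: " + original.get(0)[0]); // unchanged
        System.out.println("copied row 0:   " + copy.get(0)[0]);     // modified
        System.out.println("row 1 shared:   " + (copy.get(1) == original.get(1)));
    }
}
```

With this scheme the extra memory is one reference per row plus a full row for each modified instance, so the cost depends on how much of the data a scheme actually changes.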

>
>
> One final, only somewhat related question: a PhD graph theory  
> student in my lab mentioned that Weka's only really intended as a  
> teaching, rather than production tool. From what I've seen so far  
> I'm inclined to agree but I'd like to check - is Weka commonly used  
> outside of the classroom?

It is true to say that Weka is used heavily in academia. However,  
algorithms/models have been OEM'd by commercial companies and are  
being used in production environments.

Cheers,
Mark.

--
Mark Hall
Senior Developer/Consultant, Pentaho Open Source Business Intelligence
Citadel International, Suite 340, 5950 Hazeltine National Dr.,
Orlando, FL 32822, USA
+64 7 847-3537 office, +64 21 399-132 mobile, +1 815 550-8637 fax,
Skype: mark.andrew.hall, Yahoo: mark_andrew_hall
Download the latest release today <http://www.sourceforge.net/projects/pentaho>

Jui Deshpande | 1 Nov 2008 03:01

Re: Query regarding testing on a testset

Hi,

As you suggested, I ran my training and testing on iris.arff (the example file that ships with Weka), and it does give different values in the test results.

However, for my own files it still gives the same result (first instance legitimate, all the rest spam).

I have attached a very short, basic training file and a test file just to give an idea of what kind of files I am looking at.

Is the problem with using the StringToWordVector filter before training? Should I be using something else?

Thanks a lot in advance!

Regards,
Jui




Attachment (trainingData.arff): application/octet-stream, 863 bytes
Attachment (testdata-basic.arff): application/octet-stream, 818 bytes
Peter Reutemann | 1 Nov 2008 04:01

Re: Query regarding testing on a testset

> Was the training data you used all numeric?

No, nominal and numeric attributes mixed.

> This is because for spam/legit
> data, my .arff file has all string attributes and so before saying
> buildClassifier(), I first use the StringToWordVector filter on it and then
> train it.
>
> Could that be the reason for the issue that I might be facing?
>
> Also, I am currently not using any filter on the unlabeled test file. Do I
> need to apply the same filter to that as well before testing it?

Yes, of course. How else would the data get transformed correctly?

Instead of just using your base classifier, i.e., NaiveBayes, use the
FilteredClassifier meta-classifier in conjunction with your base
classifier and the StringToWordVector. Serializing such a trained
model will also serialize the filter and you can use it on your test
data.
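
The reason behind this can be sketched in plain Java. The code below is not Weka's StringToWordVector or FilteredClassifier; BagOfWordsSketch, fit and transform are hypothetical names. The word dictionary is built once from the training documents and then reused unchanged for every test document, so train and test vectors have the same attributes in the same order. A model that was trained on word vectors but is fed raw or differently filtered test data has no such guarantee.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BagOfWordsSketch {
    // Word -> attribute index, fixed after fit().
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Builds the dictionary from the training documents only. */
    void fit(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    vocabulary.putIfAbsent(token, vocabulary.size());
                }
            }
        }
    }

    /** Maps any document onto the training dictionary; unknown words are ignored. */
    double[] transform(String doc) {
        double[] vector = new double[vocabulary.size()];
        for (String token : doc.toLowerCase().split("\\W+")) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        BagOfWordsSketch filter = new BagOfWordsSketch();
        filter.fit(Arrays.asList("cheap pills now", "meeting at noon"));
        // The test document is mapped with the dictionary learned from training data.
        System.out.println(Arrays.toString(filter.transform("cheap cheap meeting today")));
    }
}
```

Keeping the filter and the classifier together in one serialized object, as the FilteredClassifier does, means the dictionary built at training time is available again when the test data is processed.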

Cheers, Peter
--

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
Peter Reutemann | 2 Nov 2008 00:22

new wiki launched

Hi everybody

After many, many hours of converting articles of the old
MediaWiki-based WekaWiki and adding them to the new Wikispaces-based
wiki on sourceforge.net, the migration is almost complete. The new
wiki is available at the following URL:
  http://weka.wiki.sourceforge.net/

Why the migration? When sourceforge.net moved to a new data center
(https://sourceforge.net/community/forum/topic.php?id=3410), they also
modified the webserver setup, which broke our wiki setups and left
uploaded files (screenshots, source code files, etc.) inaccessible.
Since sourceforge.net now offers a wiki itself, sparing us from
maintaining our own software, we decided to move to it, even though it
is unfortunately less powerful. We used a backup of ours to restore
the uploaded files. For now, the old WekaWiki links to the articles in
the new wiki, but it might get removed altogether in the future.

A note on the WekaDoc wiki: we've started on a new Weka manual, which
encompasses the tutorials that come with Weka at the moment (Explorer,
Experimenter, Bayesnet, etc.) plus additional information from the
WekaDoc wiki. Currently, Mark is adding the last bits and pieces.

Since we had to rely on an older backup, a few files could not be
restored. I was wondering whether anybody on the list has the
originals lying around somewhere locally. The following articles were
affected:

* Generating cross-validation folds (Java approach)
  - CrossValidationSingleRun.java
  - CrossValidationMultipleRuns.java
  - CrossValidationAddPrediction.java

* Using Weka via Jepp:
  - J48ViaJepp.txt (python script)

Please send them to me off-list if you have any of them. Thanks!

Cheers, Peter
--

-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
tianguanhua | 1 Nov 2008 15:22

Re: [Wekalist] bayesian network TAN

I have installed Weka 3.5.8 and selected the Bayesian network classifier
in the Explorer, but I could not find a TAN option. How do you select the
search algorithm, and where is it? I am really confused; could you help
me?

Thanks 

-----Original Message-----
From: wekalist-bounces <at> list.scms.waikato.ac.nz
[mailto:wekalist-bounces <at> list.scms.waikato.ac.nz] On behalf of jsu <at> site.uottawa.ca
Sent: 31 October 2008 21:00
To: Weka machine learning workbench list.
Subject: Re: [Wekalist] bayesian network TAN

Hi:

If you are looking for TAN as a classifier, you can find it in WEKA.
Just go to BayesNet and select TAN as the search algorithm.

Cheers

Jiang

> Hi
>   I have downloaded Weka 3.5.8. TAN is not implemented in this version,
> is that right? Which version of Weka implements TAN Bayesian networks?
> Or is there anyone interested in Bayesian networks who knows where I
> could get an open source or free software tool for TAN Bayesian
> networks?
>
> Thanks in advance
>
