Martin | 18 Dec 14:39 2014

Re: Boosting

Good morning Mathias,

The default settings for AdaBoostM1 work well.

In WEKA, to do classification with AdaBoostM1 (a metalearner), follow these steps:
Open WEKA Explorer -> Classify tab -> Choose button (select weka.classifiers.meta.AdaBoostM1) -> hit "Start"

WEKA will display the results for you (e.g. Precision, Recall, ROC Area, and the Confusion Matrix), and all you have to do is understand those metrics and discuss them in your report.

The same steps also apply to Bagging, Stacking, and Vote.

When you are working with any learning algorithm, you have to know the data type (e.g. numeric or nominal) of both the attributes and the class, and check it against the capabilities of the particular algorithm (they should match).

Here are the capabilities of AdaBoostM1:

Class -- Nominal class, Binary class, Missing class values.

Attributes -- Empty nominal attributes, Unary attributes, Binary attributes, Numeric attributes, Nominal attributes, Missing values, Date attributes.
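(Editor's note: for intuition about what AdaBoostM1 does under the hood, the core of AdaBoost.M1 is a reweighting loop: instances the current weak learner misclassifies get relatively more weight in the next round. The following is a minimal, self-contained sketch of one boosting round — illustrative only, not Weka's implementation; the weights and correctness flags are made up, and it assumes the weighted error is strictly between 0 and 0.5, as AdaBoost.M1 requires.)

```java
import java.util.Arrays;

public class AdaBoostRound {
    // One AdaBoost.M1 round: given instance weights and the weak learner's
    // correctness on each instance, compute beta = eps / (1 - eps) and
    // return the updated, renormalized weights. Assumes 0 < eps < 0.5.
    static double[] reweight(double[] w, boolean[] correct) {
        double eps = 0.0;                        // weighted error
        for (int i = 0; i < w.length; i++)
            if (!correct[i]) eps += w[i];
        double beta = eps / (1.0 - eps);         // < 1 when eps < 0.5
        double[] out = w.clone();
        double sum = 0.0;
        for (int i = 0; i < out.length; i++) {
            if (correct[i]) out[i] *= beta;      // shrink correctly classified
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++)
            out[i] /= sum;                       // renormalize to sum to 1
        return out;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};
        boolean[] correct = {true, true, true, false};
        // The single misclassified instance ends up with the largest weight.
        System.out.println(Arrays.toString(reweight(w, correct)));
    }
}
```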


Good luck and all the best to you.

Best Regards,
Martin

On 18 December 2014 at 20:18, Mathias Mantelli [via WEKA] <[hidden email]> wrote:
Hi all,

One question: in Weka, is Boosting = AdaBoostM1?



If you reply to this email, your message will be added to the discussion below:
http://weka.8497.n7.nabble.com/Boosting-tp33187.html
To unsubscribe from WEKA, click here.
NAML

View this message in context: Re: Boosting
Sent from the WEKA mailing list archive at Nabble.com.
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.waikato.ac.nz
List info and subscription status: http://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Martin | 18 Dec 13:55 2014

Re: Boosting

Yes. Boosting is a family of algorithms, and AdaBoost is the most influential one. In Weka, AdaBoostM1 boosts using the AdaBoost M1 method.

Hope this helps.
Martin

On 18 December 2014 at 20:18, Mathias Mantelli [via WEKA] <[hidden email]> wrote:
Hi all,

One question: in Weka, is Boosting = AdaBoostM1?



Ralf | 18 Dec 11:19 2014

Re: OnlineDataset to be Normalized or Not

I do not know what your data looks like. In any case, the filter "weka.filters.unsupervised.attribute.Normalize" normally normalizes all numeric values in the dataset. Since the type of your data is string (as I assume), the Normalize filter will not affect your data. Thus, you can go without normalization.
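(Editor's note: for numeric attributes, the Normalize filter rescales each value into [0, 1] using min-max scaling. The following is a self-contained sketch of that transform — not Weka's code; the toy column values are made up.)

```java
public class MinMaxNormalize {
    // Rescale a numeric column into [0, 1]: (x - min) / (max - min).
    static double[] normalize(double[] col) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : col) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++)
            out[i] = range == 0 ? 0.0 : (col[i] - min) / range; // constant column -> all 0
        return out;
    }

    public static void main(String[] args) {
        double[] ages = {20, 35, 50};  // toy numeric attribute
        System.out.println(java.util.Arrays.toString(normalize(ages)));
        // -> [0.0, 0.5, 1.0]
    }
}
```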

Regards,
Ralf

On Thu, Dec 18, 2014 at 5:30 PM, pzha.msc [via WEKA] <[hidden email]> wrote:
Respected Sir,
                     If I go without normalization and use the online datasets which are available, is there any problem, as I am using SimpleKMeans, DBSCAN, etc.?


If you reply to this email, your message will be added to the discussion below:
http://weka.8497.n7.nabble.com/DataSet-uploading-tp32859p33183.html

Pimwadee Chaovalit | 18 Dec 10:01 2014

Kmeans cluster centroids for training and testing set

Hello,

I am experimenting with clustering a dataset of 14,008 rows with k-means. I have chosen "Percentage split 66%" for the cluster mode. The cluster result produced is composed of two sets: one for the training set with the full data of 14,008 rows, the other for the test split with 9,245 rows.

While I was analyzing the centroids, trying to get a feel for the character of each group, I noticed something curious. According to the result posted below, the centroid for group #1 of the training set seems to match the centroid for group #1 of the test set. But the centroid for group #2 of training seems to match that of group #3 of testing, and #3 of training matches #2 of testing. Group #0 represents the group of missing values.

My question is: Does Weka take the cluster centroids from the training run and then use them in the evaluation of the test set? If so, why do the same clusters (of the training and test sets) have different cluster numbers (as in the case of cluster #2 and cluster #3)? Or does Weka re-cluster everything from scratch on the test data? If so, how can that be model evaluation? I really need to explore the characters of the clusters within this dataset, so it is quite important to me to understand whether the groups stay the same or evolve during the model evaluation step.

I would appreciate it if anyone has insight into how the software performs model evaluation in this case.

Thank you for answers in advance,
Pimwadee

*************************************************************

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 21
Within cluster sum of squared errors: 3.412647650457991E20

Cluster centroids:
                                Cluster#
Attribute    Full Data          0          1          2          3
               (14008)       (74)     (5426)     (4995)     (3513)
==================================================================
D               0.1135    missing     0.0168     0.0608     0.7048
             +/-0.2401         --  +/-0.0757  +/-0.1055  +/-0.2126

M               0.6939    missing     0.9686     0.4516      0.484
             +/-0.2934         --  +/-0.0765  +/-0.1491  +/-0.2842

T             Infinity   Infinity      0.041     0.0818     0.1719
                +/-NaN     +/-NaN  +/-0.2481  +/-0.2207  +/-0.3444

Time taken to build model (full training data) : 0.5 seconds

=== Model and evaluation on test split ===

kMeans
======

Number of iterations: 17
Within cluster sum of squared errors: 2.1213755665009135E20

Cluster centroids:
                                Cluster#
Attribute    Full Data          0          1          2          3
                (9245)       (46)     (3585)     (2308)     (3306)
==================================================================
D               0.1106    missing     0.0168     0.6942     0.0569
             +/-0.2368         --  +/-0.0748  +/-0.2154  +/-0.1018

M               0.6948    missing     0.9677     0.4832     0.4553
             +/-0.2915         --  +/-0.0771  +/-0.2811  +/-0.1491

T             Infinity   Infinity     0.0398     0.1863     0.0825
                +/-NaN     +/-NaN  +/-0.2458  +/-0.3722   +/-0.221

Time taken to build model (percentage split) : 0.21 seconds
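(Editor's note on the cluster-numbering question: cluster IDs assigned by k-means are arbitrary labels, so the same underlying group can receive a different number in a second run. One way to line up two sets of clusters is to pair each test centroid with its nearest training centroid. The sketch below is illustrative only — it is not what Weka does internally, and the toy centroids are rounded D/M values taken from the output above.)

```java
public class CentroidMatcher {
    // For each centroid in b, return the index of its nearest centroid in a
    // (squared Euclidean distance). This relabels b's clusters in a's terms.
    static int[] match(double[][] a, double[][] b) {
        int[] map = new int[b.length];
        for (int j = 0; j < b.length; j++) {
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < a.length; i++) {
                double d = 0;
                for (int k = 0; k < a[i].length; k++) {
                    double diff = a[i][k] - b[j][k];
                    d += diff * diff;
                }
                if (d < best) { best = d; map[j] = i; }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // (D, M) centroids for training clusters 1-3 and test clusters 1-3.
        double[][] train = {{0.0168, 0.9686}, {0.0608, 0.4516}, {0.7048, 0.484}};
        double[][] test  = {{0.0168, 0.9677}, {0.6942, 0.4832}, {0.0569, 0.4553}};
        // Test cluster 2 pairs with training cluster 3, and vice versa.
        System.out.println(java.util.Arrays.toString(match(train, test)));
        // -> [0, 2, 1]
    }
}
```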

---
Disclaimer:

This e-mail and any files transmitted with it may contain confidential and proprietary information of the
National Science and Technology Development Agency (NSTDA), Thailand. They are intended solely for the
use of the addressed individuals or entities. If you are not the intended recipient, you are required to
immediately delete this e-mail and its contents from your system. Any disclosure, distribution, or
action based upon the contents of this e-mail is strictly prohibited. Any views or opinions presented in
this e-mail are solely those of the sender and do not necessarily represent those of NSTDA. NSTDA does not
accept any responsibility for the content of this message or the consequences of any actions taken on the
basis of the information provided. NSTDA accepts no liability for any
  damage caused by any virus or malware which may be inserted in this e-mail during transmission.

Ralf | 18 Dec 08:15 2014

Re: OnlineDataset to be Normalized or Not

You are still able to apply the Normalize filter to the dataset.

Any action(s) in the preprocessing stage should be taken based on the data itself and on what you want to get from this data. In addition, the algorithm that you want to use plays an important role in determining the action to take. Using supervised and/or unsupervised filters simply helps you deal with the data more easily. Finally, you need to ensure that your test set remains independent from the training one.

To sum up, there is no standard method for the preprocessing stage; it all depends on the data that you have and the learning algorithm that you want to use.
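(Editor's note: the point about keeping the test set independent applies to preprocessing as well — fit any scaling parameters on the training data only, then reuse them on the test data. A minimal sketch with made-up numbers, not Weka code:)

```java
public class TrainOnlyScaling {
    // Estimate min-max parameters from the TRAINING column only.
    static double[] fit(double[] train) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : train) { min = Math.min(min, v); max = Math.max(max, v); }
        return new double[]{min, max};
    }

    // Apply the training-set parameters to any column (train or test).
    static double[] apply(double[] col, double[] p) {
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++)
            out[i] = (col[i] - p[0]) / (p[1] - p[0]);
        return out;
    }

    public static void main(String[] args) {
        double[] train = {10, 20, 30};
        double[] test = {25, 40};   // test values never influence fit()
        double[] p = fit(train);
        System.out.println(java.util.Arrays.toString(apply(test, p)));
        // -> [0.75, 1.5]  (a test value outside the training range can
        //    legitimately fall outside [0, 1])
    }
}
```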

You can refer to the book "Data Mining: Practical Machine Learning Tools and Techniques".

Regards,
Ralf

On Thu, Dec 18, 2014 at 12:39 PM, pzha.msc [via WEKA] <[hidden email]> wrote:
Respected Sir,
                     Do the datasets available online on your website and other sites need to be normalized, or are they already normalized? Can I use a dataset directly in a text clustering method without normalizing it?
Can I get all the details of filtering, both supervised and unsupervised, in the preprocessing stage? Which filter is used for which algorithms, when, and where?
I have used some online datasets like Reuters, abalone, etc. I applied them directly without going through filtering methods like normalization or discretization. I used StringToWordVector only for text data.
I have searched a lot for the best WEKA manual covering all the preprocessing and filtering methods and when and where to use them.
Will you send me the material for this?

If you reply to this email, your message will be added to the discussion below:
http://weka.8497.n7.nabble.com/DataSet-uploading-tp32859p33177.html

Kenan | 17 Dec 16:28 2014

How to predict several unknown attributes of an Instance with the WEKA API?

Hello,

I am trying to build a classifier that can predict whether a room will be occupied at a specific time of day, when only part of the day is known.

The only classes that I want to predict are: true or false.

The room is a meeting room in a company building, and the gathered data shows that most of the time the room is unoccupied (class = false), especially between 00:00-12:00 and 16:00-24:00 (European time).

The problem is that the classifier ALWAYS classifies the unknown attributes as FALSE.

Here is what my model looks like:

An hour of the day is split up into 10-minute intervals, so an hour has 6 time slots and the whole day has 6 * 24 = 144.

The attributes of my data look like this:

dayname, season, timeslot1, timeslot2, timeslot3, timeslot4, timeslot5, timeslot6, timeslot7, timeslot8, timeslot9, timeslot10,timeslot11..... timeslot144
monday     winter     false          false           false           false          true            true           true            false           false          false            false                false
tuesday     summer  false          false           false           false          true            true           true            false           false          false            false                false
wednes     spring     false          false           false           false          true            true           true            false           false          false            false                false
thursday    autumn   false          false           false           false          true            true           true            false           false          false            false                false
friday         summer  false          false           false           false          true            true           true            false           false          false            false                false
......

I provided the classifier with data for a whole year, and built it like this:

private SMO buildClassifier(Instances dataset) {
    SMO classifier = new SMO();
    try {
        // Cross-validate first to estimate performance on held-out folds...
        Evaluation eval = new Evaluation(dataset);
        eval.crossValidateModel(classifier, dataset, 10, new Random(1));
        System.out.println(eval.toSummaryString("\nResults\n\n", false));

        // ...then build the final model on the full dataset.
        classifier.buildClassifier(dataset);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return classifier;
}

The outcome of the evaluation looks pretty good:

Correctly Classified Instances         111              100      %
Incorrectly Classified Instances         0                0          %
Kappa statistic                                   1     
Mean absolute error                          0     
Root mean squared error                  0     
Relative absolute error                      0       %
Root relative squared error               0        %
Coverage of cases (0.95 level)         100    %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               111     


A day which I want to classify, with unknown attributes, can look like this:

dayname, season, timeslot1, timeslot2, timeslot3, timeslot4, timeslot5, timeslot6, timeslot7, timeslot8, timeslot9, timeslot10,timeslot11..... timeslot144
monday     winter     false          false           false           false          ?                ?                ?                ?                ?                ?                 ?                      ?

Programmatically, I find out which time attributes are unknown, and in a loop I change the dataset's "setClassIndex" value.

Again, my problem is that the classifier always returns class FALSE, even between 12:00 and 16:00, where the room should be occupied at least some of the time.

Am I doing something wrong? Does WEKA support multiple unknown attributes this way?

Thanks in advance
Kenan Imamovic
double d s | 17 Dec 19:46 2014

getting evaulation metrics (results)

Hi all,

I am using the RWeka package, and I'm wondering what command(s) should be used to get the results (Correctly Classified Instances, ROC Area, Precision, Recall, and the confusion matrix) in R with RWeka?

Thanks
Sandler
Juan Manuel Barreneche | 17 Dec 18:50 2014

Small Bug


Hello list, I would like to report this (rather small) bug.

Summary: the instance numbering in the predictions output table starts going wrong after 999,999 instances (using a separate test set). In particular, each number gets repeated 10 times after that point.

The following is a sample of the last instances of my toy example:

 inst#     actual  predicted error prediction
     1       1:c0       1:c0       0.627
     2       2:c1       1:c0   +   0.627
     3       2:c1       1:c0   +   0.627
     4       2:c1       1:c0   +   0.627
     5       1:c0       1:c0       0.627
     6       1:c0       1:c0       0.627
     7       1:c0       1:c0       0.627
     8       2:c1       1:c0   +   0.627
     9       1:c0       1:c0       0.627
    10       1:c0       1:c0       0.627
    11       1:c0       1:c0       0.627
...
999995       1:c0       1:c0       0.627
999996       1:c0       1:c0       0.627
999997       1:c0       1:c0       0.627
999998       1:c0       1:c0       0.627
999999       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       2:c1       1:c0   +   0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100000       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       1:c0       1:c0       0.627
100001       2:c1       1:c0   +   0.627
100002       2:c1       1:c0   +   0.627

As you can see, everything looks OK in the first 999,999 rows of this output, but after that, the numbers start repeating with the mentioned pattern.

To replicate this example, you can use Weka's built-in data generators. Here are the configurations I used:

// For the test set:
weka.datagenerators.classifiers.classification.RDG1 -r example_test -S 1 -n 1000020 -a 3 -c 2 -N 0 -I 0 -M 1 -R 10


// For the train set:
weka.datagenerators.classifiers.classification.RDG1 -r example_train -S 1 -n 20000 -a 3 -c 2 -N 0 -I 0 -M 1 -R 10

// The classifier used:
weka.classifiers.rules.ZeroR

// The output predictions format:

weka.classifiers.evaluation.output.prediction.PlainText
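(Editor's note, an unconfirmed guess at the cause: the pattern is consistent with the inst# column being printed through a fixed-width formatter that keeps only the first six digits, so 1000000 through 1000009 all render as "100000", ten at a time. A sketch of that failure mode — not Weka's actual code:)

```java
public class FixedWidthTruncation {
    // Emulate a formatter that truncates an instance number to 6 characters.
    static String fmt(long inst) {
        String s = Long.toString(inst);
        return s.length() <= 6 ? s : s.substring(0, 6); // digits beyond 6 dropped
    }

    public static void main(String[] args) {
        System.out.println(fmt(999999));   // 999999 : still unique
        System.out.println(fmt(1000000));  // 100000 : collides...
        System.out.println(fmt(1000005));  // 100000
        System.out.println(fmt(1000010));  // 100001 : ...ten numbers at a time
    }
}
```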

Hope this is of some help!

Happy Holidays

Juan Manuel
Kshitij Judah | 17 Dec 00:21 2014

Storing additional information to identify a weka instance

Hello,
   I use Weka for my data mining tasks at work. I am looking for a way to attach additional information to a Weka instance that can be used to identify that instance. 

For example, if I am trying to classify webpages into news versus non-news using Weka, I need a way to attach the webpage URL to each webpage instance so that I can use this info to quickly know which webpage an instance represents. 

Machine learning package Mallet provides a way to do this. I am wondering if something similar exists for Weka. Any help is greatly appreciated. 

Thanks,
Kshitij
Mark Hall | 17 Dec 02:56 2014

Weka 3.6.12 and 3.7.12 releases

Hi everyone!

New versions of Weka are available for download from the Weka homepage:

* Weka 3.6.12 - stable book 3rd edition version. It is available as ZIP,
with Win32 installer, Win32 installer incl. JRE 1.7.0_72, Win64 installer,
Win64 installer incl. 64 bit JRE 1.7.0_72 and Mac OS X application (both
Oracle and Apple JVM versions).

* Weka 3.7.12 - development version. It is available as ZIP, with Win32
installer, Win32 installer incl. JRE 1.7.0_72, Win64 installer, Win64
installer incl. 64 bit JRE 1.7.0_72 and Mac OS X application (both Oracle
and Apple JVM versions).

Both versions contain a significant number of bug fixes, so upgrading to the
new versions is recommended. Stable Weka 3.6 receives bug fixes only; the
development version receives bug fixes and new features.

Weka homepage:
http://www.cs.waikato.ac.nz/~ml/weka/

Pentaho data mining community documentation:
http://wiki.pentaho.com/display/Pentaho+Data+Mining+Community+Documentation

Packages for Weka>=3.7.2 can be browsed online at:
http://weka.sourceforge.net/packageMetaData/

Note: It might take a while before Sourceforge.net has propagated all the
files to its mirrors.

What's new in 3.7.12?

Some highlights
---------------

In core weka:

* GUIChooser now has a plugin extension point that allows implementations
of GUIChooser.GUIChooserMenuPlugin to appear as entries in either the
Tools or Visualization menus
* SubsetByExpression filter now has support for regexp matching
* weka.classifiers.IterativeClassifierOptimizer - a classifier that can
efficiently optimize the number of iterations for a base classifier that
implements IterativeClassifier
* Speedup for LogitBoost in the two class case
* weka.filters.supervised.instance.ClassBalancer - a simple filter to
balance the weight of classes
* New class hierarchy for stopwords algorithms. Includes new methods to
read custom stopwords from a file and apply multiple stopwords algorithms
* Ability to turn off capabilities checking in Weka algorithms. Improves
runtime for ensemble methods that create a lot of simple base classifiers
* Memory savings in weka.core.Attribute
* Improvements in runtime for SimpleKMeans and EM
* weka.estimators.UnivariateMixtureEstimator - new mixture estimator

In packages:

* New discriminantAnalysis package. Provides an implementation of Fisher's
linear discriminant analysis
* Quartile estimators, correlation matrix heat map and k-means++
clustering in distributed Weka
* Support for default settings for GridSearch via a properties file
* Improvements in scripting with the addition of the official Groovy console
(kfGroovy package) from the Groovy project and TigerJython (new
tigerjython package) as the Jython console via the GUIChooser
* Support for the latest version of MLR in the RPlugin package
* EAR4 package contributed by Vahid Jalali
* StudentFilters package contributed by Chris Gearhart
* graphgram package contributed by Johannes Schneider

As usual, for a complete list of changes refer to the changelogs.

Cheers,
The Weka Team


