Ashraf M. Kibriya | 1 Sep 2005 02:23

Re: please help me...urgent

>> Hi, I'm a student at Bicocca University in Milan, Italy, and this is the
>> first time I have used Weka.
>>
>> I want to create a text classifier for some documents from my university.
>>
>> I have a database of only 142 entries for both training and
>> testing (a multilabel problem with around 100 categories).
>>
>> I did the preprocessing myself, because my documents are
>> written in Italian, and this language is very difficult to preprocess
>> with an automatic tool.
>>
>> I have now built all the files in ARFF format, and I want to ask whether
>> there is an algorithm that is best for my problem. So far I have used
>> Multiclass to aggregate my files, together with a neural network, but it
>> takes more than two hours to build a model. Is that normal?
>  
>
Try Multinomial Naive Bayes or Complement Naive Bayes; they are generally better suited to
text classification problems. For higher accuracy, you might want to try an SVM (i.e. the SMO classifier in
Weka), although it will take longer to run.

Regards,
Ashraf
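
To make the Multinomial Naive Bayes suggestion above more concrete, here is a minimal sketch of the idea in plain Java. It is not Weka's implementation: the class name TinyMultinomialNB, its methods, and the add-one smoothing are assumptions chosen for illustration. Each class is scored by its log prior plus the log probability of every word in the document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative multinomial Naive Bayes over word counts (hypothetical class,
// not part of Weka). Uses add-one (Laplace) smoothing.
class TinyMultinomialNB {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Add one training document (already tokenised) with its class label.
    void train(String label, String[] tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // Return the label with the highest log posterior for the given tokens.
    String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior of the class
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String t : tokens) {
                int c = counts.getOrDefault(t, 0);
                // smoothed log likelihood of the word given the class
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Training is a single pass over the word counts, which is one reason this family of models is usually much faster to build than a neural network on text data.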
Ricardo Farfán | 1 Sep 2005 04:54

Which data points belong to each cluster in the k-means algorithm?

Hello,

I have a Java implementation of cluster analysis and I'm working with k-means,
but I want to know which data points make up each cluster in the k-means algorithm.

Regards,
Ricardo Farfán

_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
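
One way to see which data points end up in each k-means cluster is to assign every point to its nearest centroid once the centroids are known. The sketch below does this in plain Java under stated assumptions: the class ClusterMembership and its method names are hypothetical, the data are plain double arrays, and squared Euclidean distance is used. In Weka itself, a built clusterer exposes clusterInstance(Instance), which returns the cluster index for a single instance and can be used in the same kind of loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: groups row indices by nearest centroid.
class ClusterMembership {
    // Index of the centroid closest to the point (squared Euclidean distance).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // For each cluster, the list of row indices of the points assigned to it.
    static List<List<Integer>> members(double[][] data, double[][] centroids) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < data.length; i++) {
            result.get(nearestCentroid(data[i], centroids)).add(i);
        }
        return result;
    }
}
```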
Peter Reutemann | 1 Sep 2005 05:15

Re: distributing process

> I am trying out distributed processing with Weka's Experimenter, but I
> am having difficulties. Which steps must I follow?

You can find a step-by-step HOWTO for *nix on the Wiki:
   http://weka.sourceforge.net/wiki/index.php/Remote_Experiment

HTH

Cheers, Peter
-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174
Stefan Kuhn | 1 Sep 2005 19:43

programming problem

Hi all,
I'm using weka j48 in a webapp. I'm building the model from time to time. For 
this, I add all my training instances to an Instances object. When I later 
want to classify something, it seems I have to do a setDataset with my 
Instances object from the training on the instance to classify. For me, this 
means I have to hold a pointer to Instances in memory all the time. Since I 
have 352 attributes and some thousand instances, I want to avoid this. Is 
this possible? I would like to only hold the J48 object in memory.
Thanks,
Stefan
-- 
Stefan Kuhn M. A.
Cologne University BioInformatics Center (http://www.cubic.uni-koeln.de)
Zülpicher Str. 47, 50674 Cologne
Tel: +49(0)221-470-7428   Fax: +49 (0) 221-470-7786
My public PGP key is available at http://pgp.mit.edu

David | 1 Sep 2005 19:30

Re: programming problem

Hello,
 
I simply delete the instances when I want to shrink the model:
 
ex.:
 
 private void Shrink() {
     m_TrainingSet.delete();
     m_TestingSet.delete();
 }
 
This removes all instances while the dataset "prototype" (the attribute descriptions) remains intact, so the emptied Instances object can still be passed to setDataset.
 
For logging purposes, you can back up the instances by dumping them to an .arff file on disk.
 
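 // Needs: java.io.File, java.io.FileOutputStream, weka.core.converters.ArffSaver.
 // WekaLearner and the PRINT_* constants come from my own code.
 public static void PrintToArffFile(WekaLearner pi_wl, String pi_filename, int pi_Set) {
  ArffSaver as = new ArffSaver();
  try {
   // point the saver at the output file
   File fOut = new File(pi_filename);
   FileOutputStream os = new FileOutputStream(fOut);
   as.setDestination(os);
   as.setFile(fOut);
   // choose which set of instances to dump
   if (pi_Set == PRINT_TRAINING_SET) {
    as.setInstances(pi_wl.GetTrainInstances());
   } else if (pi_Set == PRINT_TESTING_SET) {
    as.setInstances(pi_wl.GetTestInstances());
   } else {
    throw new Error("You must specify whether to output training or testing set.");
   }
   // write header and all instances in one go
   as.writeBatch();
  } catch (Exception e) {
   // keep the original exception as the cause
   throw new Error("Unable to write the Weka ARFF file", e);
  }
 }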
 
David

 
--
Balie - Baseline Information Extraction
http://balie.sourceforge.net
[Open Source ~ 100% Java ~ Using Weka ~ Multilingual]
Peter Reutemann | 1 Sep 2005 22:43

Re: Prediction

> I am having a problem calling Weka from the emacs shell on my Windows 
> machine. I have read the notes found at http://alex.seewald.at/WEKA/ . I 
> think it is a CLASSPATH setting problem.

I normally work with Cygwin under Win32 or on a Linux box.

> Working from the directory to which I downloaded the file 
> callClassifier.class (http://alex.seewald.at/WEKA/callClassifier.class), 
> I get:
> 
> >set CLASSPATH .:$Weka-3-4/weka.jar
> 
> set CLASSPATH .:$Weka-3-4/weka.jar
> 
> Environment variable CLASSPATH not defined
...

Since you're using Windows, why not set up the classpath there?
- right-click on "My Computer"
- select "Properties"
   (the alternative is via the Control Panel and then "System")
- under WinXP go to the "Advanced" tab
- click on "Environment Variables"
- create a new user variable (or system-wide for all users if you
   prefer): CLASSPATH
- and add the following value (assuming that Weka is installed in
   "c:\program files\weka-3-4")
   .;c:\program files\weka-3-4\weka.jar
- close all the dialogs with "OK"

Running Weka from command line:
- Start -> Run... and enter "cmd" followed by <Enter>
- e.g., run this command to see J48's help:
   java weka.classifiers.trees.J48 -h
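
If it is unclear whether the CLASSPATH set in the steps above is actually being picked up, a small check from Java can help. This is only a sketch: the class name ClasspathCheck and the method isOnClasspath are made up for illustration. It prints the classpath the JVM sees and reports whether J48 can be located.

```java
// Hypothetical diagnostic helper, not part of Weka.
class ClasspathCheck {
    // True if the named class can be found by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static code
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The classpath as the running JVM sees it
        System.out.println("java.class.path = "
            + System.getProperty("java.class.path"));
        System.out.println("J48 found: "
            + isOnClasspath("weka.classifiers.trees.J48"));
    }
}
```

If "J48 found" prints false, the weka.jar entry is missing from the classpath or points to the wrong location.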

HTH

Cheers, Peter
-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174
Eibe Frank | 1 Sep 2005 22:45

Re: Very different results in 2 weka SMO versions?

No, they shouldn't be viewed that way.

Cheers,
Eibe

On Aug 31, 2005, at 5:41 PM, yee seng chan wrote:

> Eibe, thanks for the prompt reply! I've 1 more question below:
>  
>> The reason is that the old version used the standard sigmoid function
>> (i.e. without any fitting) to transform the SVM's output into [0,1]. 
> So this sigmoid function is not parameterized (i.e. not fitted to the 
> training data). In this case, can the real numbers output from this 
> old SMO version be considered to be proper probability estimates?
>  
> Thanks,
> Yee Seng.
>
>> In
>> default mode, the new version computes a "probability" based on the
>> number of votes each class receives in pairwise classification. To get
>> proper probability estimates you need to turn on the logistic model
>> option, which fits a logistic model (i.e. parametrized sigmoid) to the
>> output of each SVM in the pairwise classifier and then performs
>> pairwise coupling.
>>
>> Cheers,
>> Eibe
>>
>> On Aug 31, 2005, at 3:32 AM, yee seng chan wrote:
>>
>> > Hi,
>> >
>> > When executed on the same train.arff and eval.arff, I obtained very
>> > different results for 2 different versions of weka SMO. Below are the
>> > commands:
>> >
>> > java -mx500m -cp /home/grad/chanys/sw/weka-3-4-5/weka.jar
>> > weka.classifiers.meta.MultiClassClassifier -t train.arff -T eval.arff
>> > -p 1 -W weka.classifiers.functions.SMO
>> >
>> > java -mx500m -cp /home/grad/chanys/wsd/sw/weka-3-2
>> > weka.classifiers.MultiClassClassifier -t train.arff -T eval.arff -p 1
>> > -W weka.classifiers.SMO
>> >
>> > On the older version weka-3-2, the real numbers returned for each test
>> > instance average 0.7, but for the newer weka-3-4-5, the
>> > real numbers returned for each test instance are 1.0 (i.e. 100%
>> > confident of belonging to a certain class!).
>> >
>> > What's the difference between these 2 versions? The reason why I want
>> > to switch to the newer version is because it can fit logistic models
>> > to the SVM outputs.
>> >
>> > Yee Seng.
>>
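
The three kinds of output discussed in this thread can be contrasted with a small sketch. All names here are assumptions for illustration, not Weka's code: standardSigmoid is the unfitted transform, fittedSigmoid is a parametrized sigmoid whose parameters a and b would be estimated from training data (the form 1 / (1 + exp(a*f + b)) is the usual one for such fits and is assumed here), and voteShares turns pairwise vote counts into fractions. How Weka normalizes votes internally is not stated in the thread, so dividing by the total is also an assumption.

```java
// Hypothetical helper, not part of Weka.
class SvmOutputToProbability {
    // Standard sigmoid with no fitted parameters: maps any real value into (0, 1).
    static double standardSigmoid(double svmOutput) {
        return 1.0 / (1.0 + Math.exp(-svmOutput));
    }

    // Parametrized sigmoid; a and b are assumed to come from fitting to training data.
    static double fittedSigmoid(double svmOutput, double a, double b) {
        return 1.0 / (1.0 + Math.exp(a * svmOutput + b));
    }

    // Vote shares from pairwise classification: votes[i] divided by total votes.
    static double[] voteShares(int[] votes) {
        int total = 0;
        for (int v : votes) {
            total += v;
        }
        double[] shares = new double[votes.length];
        for (int i = 0; i < votes.length; i++) {
            shares[i] = (double) votes[i] / total;
        }
        return shares;
    }
}
```

With only two classes there is a single pairwise vote, so the winning class gets a share of exactly 1.0, which is consistent with the 1.0 values reported for the newer version in default mode.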

Tertius algorithm


  Hi guys, I'm using the Tertius association rule algorithm, but I need some 
pseudocode or an article about it. 

Same for Apriori. 

Thanks 
Peter Reutemann | 2 Sep 2005 00:13

Re: Tertius algorithm

>   Hi guys, I'm using the Tertius association rule algorithm, but I need some 
> pseudocode or an article about it. 
> 
> Same for Apriori. 

You'll find paper references in the Javadoc of each algorithm.

Cheers, Peter
-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/     +64 (7) 838-4466 Ext. 5174
Ashraf M. Kibriya | 2 Sep 2005 04:08

Re: StringToWordVector -W option

I don't think you can specify that option from the GUI. However, if you
run the filter from the command line, there is an option "-c" that
allows you to set the class index. If you specify a class attribute
(index of the attribute) using this option, then StringToWordVector will
try to keep the top 100 words for class A, the top 100 words for class B, and so on.

Regards,
Ashraf

Dwi Sianto Mansjur wrote:

> I have looked more into StringToWordVector
> I don't think there is an option to select top 100 words for class A ,
> top 100 words for class B, and so on
> If there is please let me know the option
>
> thanks
>
> On 8/23/05, Ashraf M. Kibriya <amk14 <at> cs.waikato.ac.nz> wrote:
>
>     It keeps the 1000 most frequently occurring words, i.e. those with the
>     highest count, regardless of the number of documents they appear in.
>
>     The documentation says the number is approximate because if, for
>     example, the 1000th word has frequency 20, and there are another
>     four words with frequency 20, then it will keep 1004 words,
>     including all those which have the same count and are at the
>     boundary.
>
>
>     Regards,
>     Ashraf
>
>
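
The boundary behaviour described in the quoted message (keeping every word tied with the last one that makes the cut) can be sketched in plain Java as follows. This is an illustration of the rule, not StringToWordVector's code; the class TopWordsWithTies and the method keep are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Weka.
class TopWordsWithTies {
    // Keep the k most frequent words, plus any words tied with the k-th one.
    // The result is returned in alphabetical order.
    static List<String> keep(Map<String, Integer> counts, int k) {
        if (k <= 0 || counts.isEmpty()) {
            return new ArrayList<>();
        }
        // Sort the counts from highest to lowest to find the cut-off value
        List<Integer> sorted = new ArrayList<>(counts.values());
        Collections.sort(sorted, Collections.reverseOrder());
        int threshold = sorted.get(Math.min(k, sorted.size()) - 1);
        // Every word whose count reaches the cut-off is kept, ties included
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

With counts a=30, b=20, c=20, d=20, e=5 and k=2, four words are kept rather than two, for the same reason that 1004 words are kept in the example from the quoted message.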
