Carmelo Saffioti | 1 Feb 11:43 2009

Building a model on past data for future predictions

Hi everybody,
I'm working on a classification task: building a model to make predictions on future data by training it on past data.
 
1) If I evaluate it with cross-validation its accuracy is about 68%
 
2) If I evaluate it partitioning the data set into training set (with data from Month 1 to Month N-1) and test set (with data of Month N), its accuracy is about 18%
 
Therefore the predictions are not accurate enough, according to experiment (2). Do you have any ideas on how I could achieve better results?
 
 
Thank you in advance for your precious help
Cheers
Carmelo
_______________________________________________
Wekalist mailing list
Wekalist <at> list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
Pablo Bermejo | 1 Feb 12:45 2009

Re: Building a model on past data for future predictions

Hello,

you may find useful the paper "Automatic Categorization of Email into 
Folders" by Bekkerman, McCallum and Huang. They present your kind of 
evaluation 2), which they call "time-based split evaluation". The first 
thing to do to improve your accuracy is to delete from the test set 
those instances tagged with labels that do not exist in your training 
set; that is, you don't want to discover new classes but classify 
documents using existing classes.
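That preprocessing step can be sketched in plain Python (the (features, label) pairs are a stand-in representation, not Weka's Instances API; in Weka you would filter the test set before evaluation):

```python
def drop_unseen_labels(train, test):
    """Remove test instances whose class label never appears in training.

    Each instance is a (features, label) pair -- a stand-in for Weka's
    Instances objects, not the actual API.
    """
    seen = {label for _, label in train}
    return [(x, y) for x, y in test if y in seen]

train = [([1.0, 2.0], "spam"), ([0.5, 1.5], "ham")]
test = [([1.1, 2.1], "spam"), ([9.0, 9.0], "new-folder")]  # unseen label

filtered = drop_unseen_labels(train, test)  # keeps only the "spam" instance
```

Accuracy on the filtered test set then measures only how well the model distinguishes among the classes it was actually trained on.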

cheers,
pablo.

Carmelo Saffioti wrote:
> Hi everybody,
> I'm working on a classification task: building a model to make 
> predictions on future data by training it on past data.
>  
> 1) If I evaluate it with cross-validation its accuracy is about 68%
>  
> 2) If I evaluate it partitioning the data set into training set (with 
> data from Month 1 to Month N-1) and test set (with data of Month N), its 
> accuracy is about 18%
>  
> Therefore the predictions are not accurate enough, according to 
> experiment (2). Do you have any ideas on how I could achieve better 
> results?
>  
>  
> Thank you in advance for your precious help
> Cheers
> Carmelo
> 
> 
Harri M.T. Saarikoski | 1 Feb 17:12 2009

Re: Building a model on past data for future predictions

Quoting "Carmelo Saffioti" <csaffi <at> libero.it>:

> Hi everybody,
> I'm working on a classification task: building a model to make  
> predictions on future data by training it on past data.
>
> 1) If I evaluate it with cross-validation its accuracy is about 68%
>
> 2) If I evaluate it partitioning the data set into training set  
> (with data from Month 1 to Month N-1) and test set (with data of  
> Month N), its accuracy is about 18%

Well, it seems you have about the same amount of training instances for 
both, so the only explanation is in the way Weka implements 
cross-validation: so-called stratified evaluation, where each 
train/test fold pair has exactly the same class distribution, so 
classifiers are trained to expect that distribution from the test 
instances.
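A toy illustration of the difference (plain Python; the data is invented for the example): if the class mix drifts over time, a time-based split trains and tests on very different distributions, while stratified cross-validation folds would each mirror the overall mix:

```python
from collections import Counter

def class_distribution(labels):
    """Fraction of each class value in a list of labels."""
    n = len(labels)
    return {c: round(k / n, 2) for c, k in Counter(labels).items()}

# Invented labels in time order: class "B" only appears in later months.
labels_in_time_order = ["A"] * 8 + ["A", "B"] * 4 + ["B"] * 8

train = labels_in_time_order[:16]  # months 1..N-1
test = labels_in_time_order[16:]   # month N

print(class_distribution(train))   # {'A': 0.75, 'B': 0.25}
print(class_distribution(test))    # {'B': 1.0}
```

A classifier trained to expect the training mix will be badly calibrated for month N, which is one way a gap like 68% vs. 18% can arise.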

>
> Therefore the predictions are not accurate enough, according to  
> experiment (2). Do you have any ideas on how I could achieve  
> better results?

Hmm, do you want better results, or more realistic ones?

That level of drop is serious, so there might be something else at  
fault too; hard to say.

>
>
> Thank you in advance for your precious help
> Cheers
> Carmelo

-- 

kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland

Bernhard Pfahringer | 1 Feb 20:23 2009

Re: Building a model on past data for future predictions

> Hi everybody,
> I'm working on a classification task: building a model to make
> predictions on future data by training it on past data.
>
> 1) If I evaluate it with cross-validation its accuracy is about 68%
>

If this is time series data, you cannot use cross-validation.
The proper procedure is training with data up to a certain
date, and then predict for data after that date.

Cross-validation is wrong for two reasons:
- it would use data from the future to predict past events,
which you cannot do in a real application
- time series often change little (both in input and output)
from one point to the next, so e.g. nearest neighbor in
cross-validation has a good chance of producing very high
results, which again would be meaningless in practice,
unless you always want to predict only one step ahead.

One more thing: usually the more into the future you try
to predict, the larger your errors get.
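The procedure described above can be sketched in plain Python (the (timestamp, features, label) triples are a stand-in representation, not Weka's API):

```python
def time_based_split(instances, cutoff):
    """Train on everything up to the cutoff date, test on everything
    after it -- never the other way around."""
    train = [inst for inst in instances if inst[0] <= cutoff]
    test = [inst for inst in instances if inst[0] > cutoff]
    return train, test

# Invented monthly data: (month, features, label).
data = [(m, [float(m)], "up" if m % 2 else "down") for m in range(1, 13)]

train, test = time_based_split(data, cutoff=11)  # months 1-11 vs. month 12
```

Repeating this with a sliding cutoff (train on months 1..k, test on month k+1, for several values of k) gives a more stable estimate than a single split.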

Bernhard
Eido Vaiss | 1 Feb 21:33 2009

how can i get more info about the parameters of the MLP classifier ?

hi
I want to learn more than what is written in the Capabilities dialog box 
about the meaning of each parameter of the MLP classifier,
because I need to analyze my results and try to get better results than 
I got with the default values of this classifier.
I read the Weka Manual 3.6.0, released on 18 Dec 2008, and I didn't find 
a section that explains the meaning of the MLP classifier's parameters 
in depth.
I'd very much appreciate any kind of guidance or help,
and if someone has a good reference on this subject, that could also 
help me.
with honor, eido.
J K Rai | 2 Feb 06:55 2009

Interperting the output of MultilayerPerceptron

Hi All,

I am using MultilayerPerceptron for classification. How do I interpret the output shown in the Explorer, reproduced below?

=== Run information ===

Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation:     percent_change_IPKC-weka.filters.unsupervised.attribute.AddExpression-E((a11/(a11+a65))^0.5)*a12-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E((a65/(a11+a65))^0.5)*a66-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E(((a111+a112)/a112)^0.01)*a111-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E(((a111+a112)/a111)^0.01)*a112-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E1/a25-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E1/a79-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E(((a111+a112)/a112)^0.5)*((a111+a112)^-0.5)*(a111)-Nexpression-weka.filters.unsupervised.attribute.AddExpression-E(((a111+a112)/a111)^0.5)*((a111+a112)^-0.5)*(a112)-Nexpression-weka.filters.unsupervised.attribute.Remove-R1-10,13-64,67-110-weka.filters.unsupervised.attribute.Remove-R2,4-9,11-12-weka.filters.unsupervised.attribute.AddExpression-Ea1^2-Nexpression-weka.filters.unsupervised.attribute.Remove-R4-weka.filters.unsupervised.attribute.AddExpression-Ea2-Np1_LAST_LEVEL_CACHE_REFERENCES_PKI-weka.filters.unsupervised.attribute.AddExpression-Ea1-Np2_LAST_LEVEL_CACHE_REFERENCES_PKI-weka.filters.unsupervised.attribute.AddExpression-Ea3-N1/a25-weka.filters.unsupervised.attribute.Remove-R1-3
Instances:    1588
Attributes:   3
              p1
              p2
              1/a25
Test mode:    user supplied test set: 1588 instances

=== Classifier model (full training set) ===

Linear Node 0
    Inputs    Weights
    Threshold    -0.1883350530483593
    Node 1    -0.7755199142546875
Sigmoid Node 1
    Inputs    Weights
    Threshold    -4.167807246181952
    Attrib p1    -9.609509851641793
    Attrib p2    -0.0338491069080399
Class
    Input
    Node 0


Time taken to build model: 1.41 seconds

=== Evaluation on test set ===
=== Summary ===

Correlation coefficient                  0.8331
Mean absolute error                      1.6474
Root mean squared error                  2.723
Relative absolute error                 57.7396 %
Root relative squared error             57.7092 %
Total Number of Instances             1588   

---------------------------------------------------------------------------------------------

*Question 1:  What is the meaning, and what are the implications, of the values below?
========

Linear Node 0
    Inputs    Weights
    Threshold    -0.1883350530483593
    Node 1    -0.7755199142546875
Sigmoid Node 1
    Inputs    Weights
    Threshold    -4.167807246181952
    Attrib p1    -9.609509851641793
    Attrib p2    -0.0338491069080399
Class
    Input
    Node 0
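For what it's worth, the printed numbers can be read as a network with one sigmoid hidden unit feeding one linear output unit, where each "Threshold" is that node's bias. A sketch of how they combine (plain Python; note that Weka's MultilayerPerceptron also normalizes the attributes, and the class value for numeric prediction, internally, which this sketch ignores, so it will not reproduce the Explorer's predictions exactly):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(p1, p2):
    """Combine the printed weights into a raw prediction."""
    # Sigmoid Node 1: hidden unit over the two attributes.
    node1 = sigmoid(-4.167807246181952
                    + (-9.609509851641793) * p1
                    + (-0.0338491069080399) * p2)
    # Linear Node 0: the output unit (linear because the class is numeric).
    return -0.1883350530483593 + (-0.7755199142546875) * node1

print(predict(0.0, 0.0))
```

The large negative weight on p1 says the hidden unit is driven almost entirely by p1; the tiny weight on p2 says p2 contributes very little.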

Thanks and regards,
JK

Carmelo Saffioti | 2 Feb 11:48 2009

Re: Building a model on past data for future predictions

Thank you Pablo for your interesting answer.

What I need is exactly a time-based split for the train/test data sets. As 
you suggested, the first thing I'll do is delete from the test set the 
elements whose labels do not exist in my training set.

However, I have a question about this: in the real-world application I'll 
have data with labels that do not exist in my training set, so evaluating 
the model this way doesn't seem realistic... What do you think?

Thank you again
Cheers
Carmelo

----- Original Message ----- 
From: "Pablo Bermejo" <pbermejo <at> dsi.uclm.es>
To: "Weka machine learning workbench list." 
<wekalist <at> list.scms.waikato.ac.nz>
Sent: Sunday, February 01, 2009 12:45 PM
Subject: Re: [Wekalist] Building a model on past data for future predictions

Hello,

you may find useful the paper "Automatic Categorization of Email into
Folders" by Bekkerman, McCallum and Huang. They present your kind of
evaluation 2), which they call "time-based split evaluation". The first
thing to do to improve your accuracy is to delete from the test set
those instances tagged with labels that do not exist in your training
set; that is, you don't want to discover new classes but classify
documents using existing classes.

cheers,
pablo.

Pablo Bermejo | 2 Feb 12:17 2009

Re: Building a model on past data for future predictions

Hello Carmelo,

I thought you were trying to perform supervised classification: training 
your classifier with pre-labeled instances and then testing by assigning 
the most probable class value to your test instances. If that is the 
case, you could just use the method Instances.deleteWithMissingClass() 
on your training data.

If what you are trying to do is feed your classifier instances with 
both existing and missing classes, that's beyond my scope, sorry.

Besides, since you have several time splits, you could also label your 
first time split by hand and then test on your second time split. Then, 
when you train on both the first and second time splits, you might use 
for your second split the labels predicted previously while testing. Of 
course, this means you carry your prediction errors along through your 
evaluation.

I also recommend that you read the preprocessing steps described in the 
paper by Bekkerman that I suggested in my previous mail.
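The rolling labeling scheme described above can be sketched as follows (plain Python; the majority-class learner is a trivial stand-in for a real classifier):

```python
def majority_class(labeled):
    """Trivial stand-in classifier: always predicts the most common
    training label (any real classifier could be substituted)."""
    labels = [y for _, y in labeled]
    return max(set(labels), key=labels.count)

def rolling_self_labeling(splits):
    """The first time split is hand-labeled; each later split is labeled
    by a model trained on everything labeled so far."""
    labeled = list(splits[0])          # (features, label) pairs
    predictions = []
    for split in splits[1:]:           # later splits: features only
        model = majority_class(labeled)
        preds = [model for _ in split]
        predictions.append(preds)
        labeled += list(zip(split, preds))
    return predictions

splits = [
    [([0.1], "ham"), ([0.2], "ham"), ([0.9], "spam")],  # hand-labeled
    [[0.15], [0.85]],                                    # unlabeled
    [[0.4]],                                             # unlabeled
]
preds = rolling_self_labeling(splits)
```

Because each round trains on the previous round's predictions, any early mistakes are baked into later models, which is exactly the error propagation mentioned above.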

Hope this helps.

Regards,
Pablo.

Carmelo Saffioti wrote:
> Thank you Pablo for your interesting answer.
> 
> What I need is exactly a time-based split for the train/test data sets. 
> As you suggested, the first thing I'll do is delete from the test set 
> the elements whose labels do not exist in my training set.
> 
> However, I have a question about this: in the real-world application 
> I'll have data with labels that do not exist in my training set, so 
> evaluating the model this way doesn't seem realistic... What do 
> you think?
> 
> 
> Thank you again
> Cheers
> Carmelo
Carmelo Saffioti | 2 Feb 15:36 2009

Re: Building a model on past data for future predictions

Thank you Bernhard.

I understand that for a more realistic model evaluation I cannot use 
cross-validation, but should use a time-based train/test split...
Has anybody ever worked on this type of problem? What can you suggest 
for developing better predictive models and achieving better results?

----- Original Message ----- 
From: "Bernhard Pfahringer" <bernhard.pfahringer <at> gmail.com>
To: "Weka machine learning workbench list." 
<wekalist <at> list.scms.waikato.ac.nz>
Sent: Sunday, February 01, 2009 8:23 PM
Subject: Re: [Wekalist] Building a model on past data for future predictions

>> Hi everybody,
>> I'm working on a classification task: building a model to make
>> predictions on future data by training it on past data.
>>
>> 1) If I evaluate it with cross-validation its accuracy is about 68%
>>
>
> If this is time series data, you cannot use cross-validation.
> The proper procedure is training with data up to a certain
> date, and then predict for data after that date.
>
> Cross-validation is wrong for two reasons:
> - it would use data from the future to predict past events,
> which you cannot do in a real application
> - time series often change little (both in input and output)
> from one point to the next, so e.g. nearest neighbor in
> cross-validation has a good chance of producing very high
> results, which again would be meaningless in practice,
> unless you always want to predict only one step ahead.
>
> One more thing: usually the more into the future you try
> to predict, the larger your errors get.
>
> Bernhard
srai100 | 2 Feb 15:53 2009

stacking classifiers with different attribute sets


Hi,
Hi,
I would like to use the stacking classifier provided in Weka.
I would like to choose different attribute sets (from the same training set)
for the different classifiers chosen (both at level 0 and level 1). Is it
possible to do that using the WEKA Explorer?
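As far as I know, the usual route in Weka is to wrap each level-0 classifier in a weka.classifiers.meta.FilteredClassifier with a Remove filter, so each base learner sees only its attribute subset; that combination is configurable from the Explorer. The idea can be sketched in plain Python with trivial stand-in learners:

```python
def nearest_centroid_fit(rows, labels, cols):
    """Trivial level-0 learner restricted to the column subset 'cols'
    (a stand-in for a Weka classifier wrapped in FilteredClassifier)."""
    centroids = {}
    for label in set(labels):
        pts = [[r[c] for c in cols] for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(v) / len(v) for v in zip(*pts)]
    def predict(row):
        x = [row[c] for c in cols]
        dist = lambda cent: sum((a - b) ** 2 for a, b in zip(x, cent))
        return min(centroids, key=lambda l: dist(centroids[l]))
    return predict

def stack_predict(base_models, combiner, row):
    """The level-1 model sees the level-0 predictions as its inputs."""
    level0_out = [m(row) for m in base_models]
    return combiner(level0_out)

rows = [[0.0, 5.0, 1.0], [0.1, 4.0, 1.1], [9.0, 5.0, 8.0], [8.0, 6.0, 9.0]]
labels = ["a", "a", "b", "b"]

m1 = nearest_centroid_fit(rows, labels, cols=[0])   # sees attribute 0 only
m2 = nearest_centroid_fit(rows, labels, cols=[2])   # sees attribute 2 only
vote = lambda outs: max(set(outs), key=outs.count)  # trivial level-1 combiner

print(stack_predict([m1, m2], vote, [0.05, 9.9, 1.0]))
```

Here each base model is trained on a different attribute subset of the same data, and the combiner plays the role of the level-1 classifier.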
Thanks in advance!

Regards,
Sharat


