Re: Building a model on past data for future predictions
Pablo Bermejo <pbermejo <at> dsi.uclm.es>
2009-02-02 11:17:17 GMT
Hello Carmelo,
I thought you were trying to perform supervised classification: training
your classifier on pre-labeled instances and then testing it by assigning
the most probable class value to each test instance. If that is the
case, you could simply call Instances.deleteWithMissingClass() on your
training data.
If what you are trying to do is feed your classifier instances with
both existing and missing classes, that's beyond my expertise, sorry.
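
For the supervised case, here is a minimal plain-Java sketch of the two
clean-up steps discussed in this thread: dropping training records that
have no class label (the effect Instances.deleteWithMissingClass() has on
a Weka dataset) and dropping test records whose label never occurs in the
training set. The SplitFilter and Example names are hypothetical and not
part of Weka; a null label stands in for a missing class.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a Weka class) for cleaning a time-based split. */
class SplitFilter {

    /** One record: a feature vector plus a class label (null = missing class). */
    static final class Example {
        final double[] features;
        final String label;

        Example(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Returns a new list containing only the records that have a class label. */
    static List<Example> dropMissingClass(List<Example> data) {
        List<Example> kept = new ArrayList<>();
        for (Example e : data) {
            if (e.label != null) {
                kept.add(e);
            }
        }
        return kept;
    }

    /** Returns only the test records whose label also occurs in the training set. */
    static List<Example> dropUnseenLabels(List<Example> train, List<Example> test) {
        Set<String> known = new HashSet<>();
        for (Example e : train) {
            if (e.label != null) {
                known.add(e.label);
            }
        }
        List<Example> kept = new ArrayList<>();
        for (Example e : test) {
            if (e.label != null && known.contains(e.label)) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Both methods return new lists and leave their inputs untouched, so the
unfiltered test set stays available if you also want to measure accuracy
with the unseen labels included.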
Besides, since you have several time splits, you could also label your
first time split by hand and then test on your second time split. Then,
when you train on both the first and second time splits, you could use
the labels predicted during testing as the labels for the second split.
Of course, this means that prediction errors are carried forward through
the rest of your evaluation.
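
The incremental scheme above could be sketched roughly as follows. This is
only an illustration in plain Java: the class and method names are
hypothetical, and a 1-nearest-neighbour rule stands in for whatever
classifier you actually use. Each later split is predicted with the model
built from everything before it, and its predicted labels are then added
to the training pool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: train on split 1, then reuse predicted labels for later splits. */
class TimeSplitSelfTraining {

    /** A feature vector with its (hand-assigned or predicted) label. */
    static final class Labeled {
        final double[] x;
        final String label;

        Labeled(double[] x, String label) {
            this.x = x;
            this.label = label;
        }
    }

    /** Placeholder classifier: label of the closest pool member (squared Euclidean distance). */
    static String predict(List<Labeled> pool, double[] x) {
        if (pool.isEmpty()) {
            throw new IllegalArgumentException("training pool must not be empty");
        }
        Labeled best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Labeled p : pool) {
            double d = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = p.x[i] - x[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = p;
            }
        }
        return best.label;
    }

    /**
     * firstSplit is labeled by hand; laterSplits are unlabeled and ordered by time.
     * Returns the predicted labels for each later split, in order.
     */
    static List<List<String>> run(List<Labeled> firstSplit, List<List<double[]>> laterSplits) {
        List<Labeled> pool = new ArrayList<>(firstSplit);
        List<List<String>> allPredictions = new ArrayList<>();
        for (List<double[]> split : laterSplits) {
            // Predict the whole split with the pool as it stood before this split.
            List<String> predictions = new ArrayList<>();
            for (double[] x : split) {
                predictions.add(predict(pool, x));
            }
            // Only now add the split to the pool, using its predicted labels.
            for (int i = 0; i < split.size(); i++) {
                pool.add(new Labeled(split.get(i), predictions.get(i)));
            }
            allPredictions.add(predictions);
        }
        return allPredictions;
    }
}
```

Because the predicted labels are never corrected, a wrong prediction in
one split can pull later predictions in the same direction, which is the
error carry-over mentioned above.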
I also recommend reading about the preprocessing steps described in the
paper by Bekkerman et al. that I suggested in my previous mail.
Hope this helps.
Regards,
Pablo.
Carmelo Saffioti wrote:
> Thank you Pablo for your interesting answer.
>
> What I need is exactly a time-based split into train and test sets. As
> you suggested, the first thing I'll do is delete from the test set the
> instances whose labels do not exist in my training set.
>
> However, I have a question about this: in the real-world application
> I'll have data whose labels do not exist in my training set, so
> evaluating the model in this way doesn't seem realistic... What do
> you think?
>
>
> Thank you again
> Cheers
> Carmelo
>
>
> ----- Original Message ----- From: "Pablo Bermejo" <pbermejo <at> dsi.uclm.es>
> To: "Weka machine learning workbench list."
> <wekalist <at> list.scms.waikato.ac.nz>
> Sent: Sunday, February 01, 2009 12:45 PM
> Subject: Re: [Wekalist] Building a model on past data for future
> predictions
>
>
> Hello,
>
> you may find the paper "Automatic Categorization of Email into
> Folders" by Bekkerman, McCallum and Huang useful. They present your kind
> of evaluation 2), which they call "Time-based split evaluation". The first
> thing to do to improve your accuracy is to delete from the test set
> those instances tagged with labels that do not exist in your training set;
> that is, you don't want to discover new classes but to classify documents
> using existing classes.
>
> cheers,
> pablo.
>
> Carmelo Saffioti wrote:
>> Hi everybody,
>> I'm working on a classification task: building a model that is trained
>> on past data to make predictions about the future.
>>
>> 1) If I evaluate it with cross-validation its accuracy is about 68%
>>
>> 2) If I evaluate it partitioning the data set into training set (with
>> data from Month 1 to Month N-1) and test set (with data of Month N), its
>> accuracy is about 18%
>>
>> Therefore, according to experiment (2), the predictions are not accurate
>> enough. Do you have any idea how I can achieve better
>> results?
>>
>>
>> Thank you in advance for your precious help
>> Cheers
>> Carmelo
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Wekalist mailing list
>> Wekalist <at> list.scms.waikato.ac.nz
>> https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist