2011/9/1 Harri Saarikoski
<harri.saarikoski <at> gmail.com>
2011/8/31 Paul Mayer
<equilibrium87 <at> gmail.com>
Dear Harri, dear Wessel,
Thanks for your advice! I will give CVParameterSelection a try and will play around with the mentioned parameters. As a
result of last week's work, I ended up splitting the initial set to
be used for training into 66% training and 33% evaluation instances.
I ran 10-fold cross-validation on the 66% and built an ensemble
classifier from the 10 interim models, which votes for the
most popular class for an unseen instance in the evaluation set. It
should be a formally more correct approach than the one used before.
Aside from that, I am running into serious trouble when it comes to normalization and standardization. So far I am trying three variants: transforming to z-values,
scaling with the median and MAD, and scaling with the median and IQR. Some distributions are skewed, and I have one feature which
has lots of 0s and 1s. My naive approach would have been to simply shift the values by an arbitrary constant and log-transform afterwards,
but I was told that this is far from being as trivial a task as I thought it would be. Would you suggest at least computing the
sqrt for skewed distributions?
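As a reference for the three scalings and the two transforms mentioned in the question, here is a minimal Python sketch using only the standard library. The function names are illustrative, the quartiles come from the default method of statistics.quantiles (other tools may compute them slightly differently), and log1p is used as one common way to handle zeros, which is an assumption rather than what was tried in this thread.

```python
import math
import statistics

def _scale(xs, center, spread):
    # A spread of zero happens, for example, when more than half of the
    # values are identical (MAD == 0), so fail loudly instead of dividing.
    if spread == 0:
        raise ValueError("spread is zero; this scaling is undefined here")
    return [(x - center) / spread for x in xs]

def z_scores(xs):
    """Center on the mean, scale by the sample standard deviation."""
    return _scale(xs, statistics.fmean(xs), statistics.stdev(xs))

def median_mad(xs):
    """Center on the median, scale by the median absolute deviation."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return _scale(xs, med, mad)

def median_iqr(xs):
    """Center on the median, scale by the interquartile range."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return _scale(xs, statistics.median(xs), q3 - q1)

def log1p_transform(xs):
    """log(1 + x); defined at x == 0, requires x > -1."""
    return [math.log1p(x) for x in xs]

def sqrt_transform(xs):
    """Square root; requires non-negative values."""
    return [math.sqrt(x) for x in xs]
```

Note that a feature made up mostly of a single repeated value can have a MAD of zero, in which case the median/MAD variant cannot be applied as is.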
A transformation of a feature's distribution is not necessarily required. What determines whether a transform is needed is how differently the feature's values are associated with the target classes, not the marginal distribution of the feature values itself, as you seem to suggest, so the skew may not give any trouble at all.
To check, compare the class probabilities of that feature with those of the other features, e.g. by running NaiveBayes, which outputs the probabilities p(f|c1) and p(f|c2). If p(f|c)=abs(p(f|c1)-p(f|c2)) is greater for the f(eature) in question than for the others, then something should be done to keep that one feature from dominating the (any) classifier's decision. (NB: domination by one feature should be eliminated not to remove a good feature but to have a maximally homogeneous feature space, i.e. one where the features have similarly distributed p(f|c)'s.)
Only if you find, based on the above p(f|c), that the feature should be transformed can we work out how the transform should be done (I will need to see the distribution/dataset).
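One possible reading of the p(f|c) comparison above, sketched in Python: estimate p(f|c1) and p(f|c2) with histograms over shared bins and sum the absolute differences. The binning, the factor 0.5 (which keeps the result between 0 and 1) and the name class_conditional_gap are assumptions made for illustration; NaiveBayes in Weka estimates these probabilities in its own way.

```python
def class_conditional_gap(values, labels, c1, c2, bins=10):
    """Gap between histogram estimates of p(f|c1) and p(f|c2).

    Returns 0.0 when both classes have the same binned distribution and
    1.0 when they share no bin. Both classes must occur in `labels`.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: use a single bin

    def histogram(cls):
        counts = [0] * bins
        total = 0
        for value, label in zip(values, labels):
            if label == cls:
                index = min(int((value - lo) / width), bins - 1)
                counts[index] += 1
                total += 1
        return [count / total for count in counts]

    p1, p2 = histogram(c1), histogram(c2)
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

if __name__ == "__main__":
    # Toy data: the feature separates the two classes completely.
    print(class_conditional_gap([0, 0, 1, 1], ["a", "a", "b", "b"], "a", "b"))
```

Computing this gap for every feature and comparing the values gives a rough picture of whether one feature stands out from the rest.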
best, Harri
PS: Weka's Normalize filter just rescales the distribution; it doesn't transform
it. And since it can only be applied to all features globally, it shouldn't
have any effect on performance
(rather, some math function should be applied; see the MathExpression filter).
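The difference between rescaling and transforming can be illustrated with a small Python sketch: min-max rescaling leaves the skewness of a sample unchanged, while a log transform changes the shape. The sample values are made up, and the functions are plain-Python stand-ins, not the Weka Normalize or MathExpression filters.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: the mean of the cubed standardized values."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in xs])

def min_max(xs):
    """Rescale linearly to [0, 1]; the shape of the distribution is kept."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

if __name__ == "__main__":
    sample = [1, 2, 3, 4, 50]  # made-up right-skewed sample
    print(skewness(sample))
    print(skewness(min_max(sample)))                   # same up to rounding
    print(skewness([math.log(x) for x in sample]))     # smaller for this sample
```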
--
-----------------
Harri Saarikoski
data mining specialist
Finland
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Dear Harri,
Thanks again for your answer! I hope I got your instructions right: I took the raw data and ran the Naive Bayes classifier using just one feature at a time.
Appended is a list of the true positive rate, the true negative rate and the absolute difference of the two for each feature, obtained with 10-fold cross-validation (or was I supposed to simply build the classifier on the whole training data?):
Feature  TP                  TN                   abs(TP-TN)
F1       0.6832627118644068  0.4046610169491525   0.27860169491525427
F2       0.6451271186440678  0.423728813559322    0.22139830508474578
F3       0.5603813559322034  0.5095338983050848   0.05084745762711862
F4       0.7139830508474576  0.3919491525423729   0.3220338983050847
F5       0.7658898305084746  0.4427966101694915   0.3230932203389831
F6       0.6228813559322034  0.4682203389830508   0.15466101694915257
F7       0.3580508474576271  0.7002118644067796   0.3421610169491525
F8       0.548728813559322   0.628177966101695    0.07944915254237295
F9       0.9141949152542372  0.07309322033898305  0.8411016949152542
F10      0.7902542372881356  0.3156779661016949   0.4745762711864407
F11      0.6313559322033898  0.5720338983050848   0.05932203389830504
F12      0.7838983050847458  0.6091101694915254   0.17478813559322037
Especially feature 9 appears to be quite problematic.
May I include a compressed tar file showing density plots of the corresponding (raw data) distributions? Or should I provide them in some other way?
Would be glad if anyone could help me out on that.
When talking about outlier handling, I was referring to dropping the significantly deviating values before normalizing to e.g. [-1,1], because feature 9 in particular shows extremely large values (3489232) while the median lies at 57316.65. "Cramming" the whole feature space down without first removing some values (winsorizing, for example, was suggested to me) would probably render the feature completely useless.
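To make the concern concrete, here is a small Python sketch of winsorizing before scaling to [-1,1]. Only the values 3489232 and 57316.65 come from this message; the other numbers are invented for illustration, and the 5%/95% cut-offs and the simple index-based quantile rule are assumptions.

```python
def winsorize(xs, lower_q=0.05, upper_q=0.95):
    """Clip values to the empirical lower and upper quantiles.

    Uses a simple index-based quantile (no interpolation).
    """
    ordered = sorted(xs)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(x, lo), hi) for x in xs]

def scale_to_unit_interval(xs):
    """Linearly rescale to [-1, 1]."""
    lo, hi = min(xs), max(xs)
    return [2 * (x - lo) / (hi - lo) - 1 for x in xs]

if __name__ == "__main__":
    # Hypothetical sample around the reported median with one extreme value.
    feature = [50000, 55000, 57316.65, 60000, 65000, 3489232]
    print(scale_to_unit_interval(feature))             # most values end up near -1
    print(scale_to_unit_interval(winsorize(feature)))  # values spread over the range
```

Without clipping, the single extreme value pushes almost everything else to the lower end of the interval, which is the effect described above.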
Thanks again & Best Regards,
Paul