20 May 2013 11:52
Unable to understand the word frequency calculation from text files to arff program
sobiya khan <sobi16 <at> gmail.com>
2013-05-20 09:52:26 GMT
Dear Mark,
We found the program (ARFF files from Text Collections on http://weka.wikispaces.com/) for converting text files to ARFF. However, the original program had some errors. We fixed them, and it now produces output.
However, there are a few points in the output that we are unable to decipher. Kindly clarify the following:
1. The modified program and the sample files are attached for your reference. The program extracted the words that were more than 2 characters long. However, we didn't understand how it calculated the weight or frequency of the words it found. How does this value represent the weight of a word in the document?
2. Secondly, how could we build a clustering on top of this (using the Java API)? Something as simple as dividing the text files/words by their meaning or topic, such as Personal/Official/Suspicious. For example, if we use SimpleKMeans, how does Weka identify the centroids among these words from the various documents, and how does it determine the similarity among them? Similarity measures such as Euclidean distance are calculated on numeric data, so how would we apply such a distance measure to words? What would be the criterion of similarity for clustering these words into separate groups?
3. Thirdly, does Weka provide any means of tagging the clusters with the top N keywords for topic identification? Or is there any mechanism in Weka for topic identification after clustering has been performed?
I know my questions are a bit too basic, but it feels like we are missing some piece here. Kindly help.
Any help is most appreciated.
Thanks in advance.
Warm Regards
sobiya