### [mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

hsharma mailinglists <hsharma.mailinglists <at> gmail.com>

2015-09-15 21:58:39 GMT

Hello,
I have some questions around large-scale clustering. I would like to
arrive at a methodology for determining an appropriate value of K for
K-means clustering (at least for my scenario, if not in general). More
details follow below (apologies for the verbosity, but I wanted to
provide as much context as I could).
By 'large' I mean to imply:
* large in the number of points to cluster
* large in the dimensionality of the vector/feature-space
* large in the number of clusters
It would be doubly great if someone here has had to perform clustering
in a similar setting (similar in terms of the number of datapoints,
the nature/type of data being clustered, the number of clusters being
formed, etc.) and is willing to share their war story.
:: The context ::
I'm trying to cluster ~18 million sparse term-frequency vectors with
Mahout 0.9. These vectors were not derived from documents in a text
corpus; rather, each data point is associated with a finite number of
discrete symbols, and its vector was generated from those symbols. The
overall symbol set is quite large, hence the sparsity of the vector
representation. I assumed that I don't require idf-weighting because
none of these symbols can occur more than once per datapoint/vector.
Additionally, each vector has a fixed number of non-zero dimensions.
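To make the representation concrete, here is a minimal sketch (not from the original post; symbol names, the vocabulary-building helper, and the fixed non-zero count of 3 are all illustrative assumptions) of how such sparse, binary term-frequency vectors could be built when each data point is a set of distinct symbols:

```python
def build_vocab(datapoints):
    """Assign a stable dimension index to every symbol seen in the data.

    Sorting within each point keeps index assignment deterministic.
    """
    vocab = {}
    for symbols in datapoints:
        for s in sorted(symbols):
            vocab.setdefault(s, len(vocab))
    return vocab

def to_sparse_tf(symbols, vocab):
    """Sparse term-frequency vector as {dimension_index: count}.

    Because no symbol repeats within a data point, every stored
    count is 1, which is why idf-style reweighting adds nothing
    at the per-vector level.
    """
    return {vocab[s]: 1 for s in symbols}

# Toy data: each point carries a fixed number (here 3) of distinct
# symbols drawn from a potentially very large symbol set -- mirroring
# the fixed non-zero dimensionality described above.
points = [
    {"sym_a", "sym_b", "sym_c"},
    {"sym_b", "sym_d", "sym_e"},
]
vocab = build_vocab(points)
vectors = [to_sparse_tf(p, vocab) for p in points]
```

At scale, the same idea would be expressed as Mahout `RandomAccessSparseVector`s in a SequenceFile rather than Python dicts, but the structure of each vector is the same.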