Oisin Boydell | 30 Jul 12:42 2014

Clusterdump output format


We have been using K-Means to cluster a fairly large dataset (just under a million 128-dimensional vectors of
floating point values, about 9.2GB in space-delimited file format). We’re using Hadoop 2.2.0 and
Mahout 0.9. The dataset is first converted from simple space-delimited format into
RandomAccessSparseVector format for K-Means using the
org.apache.mahout.clustering.conversion.InputDriver utility.

We’re not using Canopy clustering to determine the initial clusters as we want a specific number of
clusters (100,000) so we let K-Means create the initial random 100,000 centroids:

./mahout kmeans -i /lookandlearn/vectors_all -c /data/initial_centres -o /data/clusters_output -k 100000 -x 20 -ow -xm mapreduce

It all runs fine and we then extract the computed centroids using the clusterdump utility:

./mahout clusterdump -i /data/clusters_output/clusters-1-final/ -o ./clusters.txt -of TEXT

The clusters.txt output file contains the expected 100,000 lines (one cluster per line). However, there
seem to be some idiosyncrasies in the output format…

If we add up all the values of n for each cluster, which should be the number of data points belonging to each
cluster, we get a total of 39,160,754. But we expect this to be the same as the number of input points
(9,769,004) as each input point should belong to a single cluster. We are not sure why the sum of n values is
nearly 4 times as large as the number of input points.
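
The check described above can be reproduced with a few lines of Java. This is only a minimal sketch: it assumes each cluster line in clusters.txt carries its point count as a field of the form n=<integer>, and the class name ClusterSizeSum, the method sumClusterSizes and the sample lines in main are hypothetical, made up for illustration rather than taken from real clusterdump output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: sums the n values found in clusterdump TEXT lines.
final class ClusterSizeSum {
    // Assumption: the point count appears as "n=<integer>" once per cluster line.
    private static final Pattern N_FIELD = Pattern.compile("\\bn=(\\d+)");

    static long sumClusterSizes(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            Matcher m = N_FIELD.matcher(line);
            if (m.find()) {
                // Only the first n= field on each line is counted.
                total += Long.parseLong(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to clusters.txt, or run without arguments to use made-up sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("VL-1{n=3 c=[0.1, 0.2] r=[0.0, 0.1]}",
                                "VL-2{n=4 c=[0.3, 0.4] r=[0.0, 0.0]}");
        System.out.println("sum of n = " + sumClusterSizes(lines));
    }
}
```

Comparing the printed total with the number of input vectors shows whether points are being counted more than once. It does not explain why.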

We also notice that the vector output for the cluster centroids and radii seems to come in a couple of
different formats. The majority use a simple comma-separated array format, e.g.

c=[0.008, 0.006, 0.009, 0.014, 0.006, 0.003, 0.007, 0.005, 0.032, 0.004, 0.001, 0.003, 0.002, 0.002,
0.007, 0.017, 0.011, 0.002, 0.001, 0.014, 0.032, 0.015, 0.001, 0.002, 0.025, 0.007, 0.001, 0.007,

Simanchal | 30 Jul 12:15 2014

different methods of OnlineLogisticRegression class


I am new to Mahout Classification.

I am referring to the MIA book and came across the 20 newsgroups classification example.

What is the significance of the methods below on the OnlineLogisticRegression object, and what
guidelines should be followed when choosing their values?


OnlineLogisticRegression learningAlgorithm = new
OnlineLogisticRegression(20, 1000, new

Thanks.

vaibhav srivastava | 30 Jul 06:17 2014

How to measure the LDA efficiency

Hi, I am able to run LDA but wanted to know how to measure the accuracy of the topic modeling, or
any other efficiency measurement.
Thanks in advance

Bob Morris | 29 Jul 20:43 2014

Printing clusters of NamedVectors in mahout 0.9

In a thread beginning at [1], Bikash Gupta asks about what seems to be the
same issue I have, namely getting cluster lists from the output of a
clustering of NamedVector objects. (In my case the clusters come from a
CanopyDriver call.)

In the thread at [1] I understand Suneel's answer of Feb 24 2014 at 00:35,
but I don't understand where the Iterable comes from. That is (?), I don't
understand what creates the object named clusterWritable in
Suneel's answer when one wants to look at

Any help, including full examples for 0.9, will be appreciated.



Robert A. Morris

Emeritus Professor  of Computer Science
100 Morrissey Blvd
Boston, MA 02125-3390

Filtered Push Project
Harvard University Herbaria
Harvard University


Jonathan Cooper-Ellis | 29 Jul 19:02 2014

How to get document count for TFIDF calculate method?

Hey guys,

I'm trying to make a Bayesian classifier, but I'm having a hard time
figuring out how to programmatically determine the value of the numDocs
param for the calculate method in TFIDF, using the files generated when building the
model on the command line.

I saw some code that did it like this:

int numDocs = documentFrequency.get(-1).intValue();

Where documentFrequency is a HashMap<Integer,Long> read from
frequency.file-0, but there's no key -1 in the file, so it's giving me an NPE
when I try to pass that to tfidf.calculate.

Anyone know what I'm doing wrong?
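
The NPE itself comes from plain Java behaviour: HashMap.get returns null for a missing key, and calling intValue() on that null throws. The following is a minimal sketch of a guarded lookup, assuming the document count is expected under key -1 as in the snippet above. The class NumDocsLookup and its method are hypothetical names, and the sketch does not say where Mahout actually stores the count.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper: looks up the document count without risking an NPE.
final class NumDocsLookup {
    // Assumption taken from the quoted snippet: the count is stored under key -1.
    static final int NUM_DOCS_KEY = -1;

    static OptionalInt numDocs(Map<Integer, Long> documentFrequency) {
        Long count = documentFrequency.get(NUM_DOCS_KEY); // null when the key is absent
        return count == null
                ? OptionalInt.empty()
                : OptionalInt.of(Math.toIntExact(count)); // fails loudly on int overflow
    }

    public static void main(String[] args) {
        Map<Integer, Long> documentFrequency = new HashMap<>();
        documentFrequency.put(0, 12L); // an ordinary term id with its document frequency
        OptionalInt numDocs = numDocs(documentFrequency);
        if (numDocs.isPresent()) {
            System.out.println("numDocs = " + numDocs.getAsInt());
        } else {
            System.out.println("key -1 is missing; the count has to come from another file");
        }
    }
}
```

Returning an OptionalInt makes the missing-key case explicit at the call site instead of surfacing later inside tfidf.calculate.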


Luca Filipponi | 29 Jul 09:23 2014

Naive Bayes Classifier Sentiment Analysis

Hi, I am trying to develop sentiment analysis on Italian tweets from Twitter using the naive Bayes
classifier, but I’m having some trouble.

My idea was to classify a lot of tweets as positive, negative or neutral, and use them as the training set for
the classifier. To do that I’ve written a sequence file in the format <Text,Text>, where the key
is /label/tweetID and the value is the text, and then the text of the whole dataset is converted into tfidf vectors
using the mahout utilities.

Then I’m using the command:

./mahout trainnb and ./mahout testnb to check the classifier, and the score is as expected (I get nearly
100% because the test set is the same as the training set)

My question is: if I want to use a test set that is unlabeled, how should it be created? If the key format
isn’t like:

key = /label/

the classifier can’t find the label and I get an exception.

In a new dataset, though, the tweets will obviously be unlabeled because they still need to be classified, so I don’t know what
to put in the key of the sequence file.

I’ve searched online for examples, but the only ones I’ve found use the split command on the
original dataset and then test on part of it, which isn’t my case.

Any ideas for developing a better sentiment analysis are welcome. Thanks in advance for the help.

Nicholas Demusz | 28 Jul 21:53 2014

OnlineLogisticRegression sgd, calculating confidence value

I am trying to do some classification with Mahout's
OnlineLogisticRegression. I've built a model and have it trained on 5
categories of interest to me. However, I was under the impression that the
classify() and classifyFull() methods would return a vector of floats that
totaled 1.0. Instead I get a vector back that only has a 1 in the
index position of the category that it thinks the item belongs in. Is
this the normal behavior? I have about 500 training items for each
category. I've played with the value of lambda some but it doesn't change.

If this is the intended outcome, could someone point me to a way to provide
a confidence value for items that I classify, or should I be looking at a

My goal is to have some sort of confidence score to indicate the level of
certainty that this is what it says it is, as well as put the exemplar data
into a category.

Ivan Brusic | 25 Jul 23:04 2014

Was the Vector hierarchy ever Serializable?

I am in the midst of upgrading our Mahout library in order to take
advantage of all the excellent recent additions.

As far as I can tell, the library was based off a snapshot of 0.5. The code
does not use any of the Mahout algorithms, just a few of the data
structures such as DenseVector. The existing code builds a Java object
which is then serialized and distributed. After upgrading to 0.9, I noticed
I was no longer able to deserialize objects since DenseVector is
not Serializable. After inspecting the old jar, it seems that AbstractVector
was declared Serializable.

So either someone at my company added serialization to the Mahout classes
or they were Serializable at some point. I am assuming the former. Is this
the case? I looked at the commits and at no point was anything Serializable.

Since the classes are not Serializable and no longer inherit from Writable,
is there an existing strategy to output Mahout structures? Would hate to
write wrapper classes or once again modify the source.


Pat Ferrel | 25 Jul 20:39 2014

Re: recommenditembased returns 0 records from last map-reduce job

Sorry I haven’t read this thread carefully but it looks like you may be using the wrong IDs.

For most Mahout jobs you have to prepare your data to have Mahout IDs. You do this by looking at each datum, and as
you see a new unique application-specific user or item ID you give it a Mahout ID, starting from 0. So Mahout IDs
can be thought of as row and column numbers in a matrix. The Mahout IDs for rows will be 0 through (number of rows - 1), and the same
goes for columns.

This always requires that you translate into Mahout IDs then after the job is run translate back into your
application IDs. You need a bi-directional dictionary of some type. I use a HashBiMap from Guava.
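
A minimal sketch of such a dictionary follows. To keep it dependency-free it swaps the Guava HashBiMap mentioned above for a plain HashMap plus an ArrayList; the class name IdDictionary and its methods are hypothetical. One instance would be kept for users and a separate one for items, since rows and columns are numbered independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bi-directional dictionary: application ID <-> Mahout ID (0, 1, 2, ...).
final class IdDictionary {
    private final Map<String, Integer> toMahout = new HashMap<>();
    private final List<String> toApplication = new ArrayList<>();

    // Returns the Mahout ID for an application ID, assigning the next free one on first sight.
    int mahoutId(String applicationId) {
        Integer existing = toMahout.get(applicationId);
        if (existing != null) {
            return existing;
        }
        int next = toApplication.size(); // IDs are handed out densely, starting from 0
        toMahout.put(applicationId, next);
        toApplication.add(applicationId);
        return next;
    }

    // Translates a Mahout ID from the job output back to the application ID.
    String applicationId(int mahoutId) {
        return toApplication.get(mahoutId);
    }

    int size() {
        return toApplication.size();
    }

    public static void main(String[] args) {
        IdDictionary users = new IdDictionary();
        for (String user : new String[] {"u-981", "u-17", "u-981"}) {
            System.out.println(user + " -> " + users.mahoutId(user));
        }
        System.out.println("1 -> " + users.applicationId(1));
    }
}
```

The dictionary has to be persisted alongside the job output, otherwise the translation back into application IDs is lost.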

Also I’d avoid the threshold for now. If you get that wrong it will mess things up badly and is very hard to
tune. It’s there for completeness but I never use it.

On Jul 25, 2014, at 12:55 AM, Serega Sheypak <serega.sheypak <at> gmail.com> wrote:

Hi, nothing helps...
I do use mahout 0.9 compiled for CDH 4.7
I do provide only positive values
I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I do use Pearson correlation and preference values are: 1.0 and 2.0

2014-07-22 9:32 GMT+04:00 Serega Sheypak <serega.sheypak <at> gmail.com>:

> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> Right now I don't see how can it help me. As far as I know the stuff I try

Serega Sheypak | 24 Jul 08:57 2014

itemsimilarity returns few similar items

Hi, I'm trying to calc itemsimilarity using ItemSimilarityJob.
Here are my counts:
input dataset: user_id, item_id, pref: 16M
distinct items: 700K
distinct users: 4M

bucketed preferences per user:
count_of_preferences   count_of_users
1                      2M
2                      600K
3                      300K
4                      300R

threshold: 0.91

It returns ~2000 rows for ~1000 distinct items.
What am I doing wrong?

Martin, Nick | 24 Jul 07:27 2014


So I know fpgrowth was sent out to pasture a few months ago. As luck would have it I need to do this kind of thing now.

Would my only option now be to pull the source (per Sebastian's note in the JIRA)? Could I roll back from 0.9 to
a prev version to pick it back up?

Any other options? I don't *think* we'd be able to bite off algo maintenance, so I'm guessing that probably rules out
getting it dropped back into the distro.

Sent from my iPhone