Parimi Rohit | 20 Nov 01:34 2014

Bi-Factorization vs Tri-Factorization for recommender systems

Hi All,

Are there any (dis)advantages to using tri-factorization (||X - USV'||) as
opposed to bi-factorization (||X - UV'||) for recommender systems? I have
been reading a lot about tri-factorization and how it can be seen as
co-clustering of rows and columns, and was wondering whether such a technique is
implemented in Mahout?
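
To make the two objectives concrete, here is a minimal sketch in plain Python (not a Mahout API; the matrices and the helper names matmul and transpose are invented for illustration). It shows that, absent extra constraints such as non-negativity or orthogonality on U and V, the middle factor S can be folded into U, so U S V' and (U S) V' give the same reconstruction of X:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

# Illustrative factors (assumed values): 3 rows x 2 row-clusters,
# a 2 x 2 mixing matrix S, and 4 columns x 2 column-clusters.
U = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
S = [[2.0, 1.0], [0.0, 3.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]

# Tri-factorization reconstruction: U (S V')
tri = matmul(U, matmul(S, transpose(V)))

# Bi-factorization reconstruction with S absorbed into the left factor: (U S) V'
U_folded = matmul(U, S)
bi = matmul(U_folded, transpose(V))

print(tri == bi)
```

Any practical difference between the two would therefore come from the constraints placed on the factors, not from the extra matrix alone.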

Also, I am particularly interested in implicit-feedback datasets, and the
only MF approach I am aware of is the ALS-WR for implicit feedback data
implemented in Mahout. Are there any other MF techniques? If not, is it
possible (and useful) to extend some tri-factorization to handle
implicit feedback along the lines of "Collaborative Filtering for Implicit
Feedback Datasets" (the approach implemented in Mahout)?

I apologize for any inconvenience as this question is very general and
might not be relevant to Mahout and I would really appreciate any

Hersheeta Chandankar | 19 Nov 08:11 2014

Scores of Complementary Naive Bayes Classifier

Hi,

I've been working with Mahout's Complementary Naive Bayes classifier for
categorization of documents.

Could anyone help me understand how exactly the 'scores' given by the
classifier are calculated? What factors do they depend on?

Is there any way to set a threshold value for the score?

Thanks,
Lee S | 19 Nov 04:12 2014

How to deal with categorical and date data in Mahout?

Hi all:
Do you have any good practices for dealing with categorical data?
Does Mahout provide a utility class that can do the conversion?
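
For context, the kind of conversion being asked about can be sketched in plain Python (this is a generic illustration, not a Mahout class; the field names, the category list, and the helpers one_hot and date_features are all assumptions). A categorical value becomes a one-hot vector and a date is split into numeric parts:

```python
from datetime import datetime

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]

def date_features(text, fmt="%Y-%m-%d"):
    """Turn a date string into numeric features: year, month, day, weekday (Mon=0)."""
    d = datetime.strptime(text, fmt)
    return [float(d.year), float(d.month), float(d.day), float(d.weekday())]

# Hypothetical category vocabulary and record.
colors = ["blue", "green", "red"]
record = {"color": "green", "signup": "2014-11-19"}

# Concatenate the encoded categorical field with the date features.
vector = one_hot(record["color"], colors) + date_features(record["signup"])
print(vector)
```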
Eyal Allweil | 18 Nov 15:12 2014

Using a predefined test set instead of using the split command when using the classifiers

Hi all,
I am trying to test Mahout's classifier, but skip the split step (as was asked in this question). I have three
categories and "other" for my test. When I use the split command and let Mahout split it (say, with
randomSelectionPct = 20) I get perfectly good results. But when I try splitting the input myself into test
and training (so I can compare with other solutions) the classifier chooses "other" for everything.
I am following the suggestion from this question. Does anyone have an idea of how I can do this, or what I am
doing wrong?
Thanks in advance!
Eyal

Mahout 0.9: Using own test set instead of using split command

Richard Hanson | 17 Nov 14:49 2014

Incremental refresh() with Psql

Hello, I'm trying to use Mahout with PostgreSQL via the JDBC DataModel (ReloadFromJDBCDataModel wrapped around PostgreSQLBooleanPrefJDBCDataModel) and want to be able to read only new elements from the database when calling refresh(). Is that possible? I have seen that the MongoDB implementation does something along those lines, and that FileDataModel does too, via delta files.
The reason for doing this is that reading in the whole DB each time refresh() is called is very time-consuming, partly because the database is not necessarily local.

Thank you & regards,
Donni Khan | 17 Nov 14:01 2014

How to choose the initial clusters for K-means from TF-IDF vectors

Hi All,

I'm working on text clustering. I want to select specific documents (as
vectors) to be the centroids for k-means.
I have created the TF-IDF vectors for my dataset using Mahout, and I would like
to choose the initial clusters from these TF-IDF vectors.
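
The idea can be sketched outside Mahout as follows (plain Python; the document ids and vectors are invented, and this does not show how to write the seed clusters in the form Mahout's k-means expects). The vectors of the chosen documents are used as the initial centroids, and every document is then assigned to its nearest centroid:

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical TF-IDF vectors keyed by document id.
tfidf = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.8, 0.2, 0.1],
    "doc3": [0.0, 0.1, 0.9],
    "doc4": [0.1, 0.0, 0.8],
}

# Documents hand-picked to act as the initial centroids.
seed_ids = ["doc1", "doc3"]
centroids = [tfidf[d] for d in seed_ids]

# First assignment step: each document goes to the index of its nearest seed.
assignment = {
    doc: min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
    for doc, vec in tfidf.items()
}
print(assignment)
```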

Does anyone have an idea how I can do this with Mahout?

Many thanks in advance.
optimusfan | 14 Nov 16:09 2014

Poor performance of the MatrixMultiplicationJob


I'm working on implementing a custom algorithm using the Mahout library. The algorithm requires matrix
multiplication, which I saw was available at the object level (.times) as well as being implemented in the
MatrixMultiplicationJob.  I am currently testing a step in the algorithm that requires me to multiply a
10x2.4m matrix by one that is 2.4mx2.4m.  The performance has been awful, taking 11-12 hours to complete. 
This might be fine if it was the extent of the algorithm, but I will have multiple similarly sized steps, all
of which will be repeated in a loop.

I dug into this further, looking at the job running on my Hadoop cluster (Google Cloud Compute, 3 nodes @ 16 GB
each).  I noticed that the job appeared to only be running a single map and thus on a single node, as opposed to
previous steps such as TransposeJob that ran multiple mappers and finished in a fraction of the time. 
Researching it a bit further, I found a handful of concerning posts such as the two below:

Hadoop File Splits : CompositeInputFormat : Inner Join
"I am using CompositeInputFormat to provide input to a hadoop job. The number of splits generated is the
total number of files given as input to CompositeInputFormat..."

MatrixMultiplicationJob runs with 1 mapper only ?
"Hi, I am trying to multiple dense matrix of size [100 x 100k]. The size of the file is 104MB and
with default block sizeof 64MB only 2 blocks are getting created."

So, my questions are as follows. Is the MatrixMultiplicationJob truly limited to running on a single
node? If so, it seems fairly useless. And if so, what is the recommended way to do reasonably large
multiplications such as the ones I require?
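
For reference, here is a minimal sketch of why a product like this can in principle be spread over several workers. It illustrates the general sum-of-outer-products scheme in plain Python, not the actual code of MatrixMultiplicationJob, and the tiny matrices stand in for the 10 x 2.4m and 2.4m x 2.4m inputs. A·B equals the sum over k of the outer product of column k of A with row k of B, so disjoint ranges of k can be processed independently and the partial results added:

```python
def partial_product(a, b, ks):
    """Sum of the outer products a[:, k] * b[k, :] over the given indices k."""
    rows, cols = len(a), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for k in ks:
        for i in range(rows):
            for j in range(cols):
                out[i][j] += a[i][k] * b[k][j]
    return out

def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

# Small stand-in matrices (assumed values).
A = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 2.0]]          # 2 x 4
B = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [3.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 1.0]]          # 4 x 4

full = partial_product(A, B, range(4))      # everything on one worker
worker1 = partial_product(A, B, [0, 1])     # first half of the shared dimension
worker2 = partial_product(A, B, [2, 3])     # second half
combined = add(worker1, worker2)

print(combined == full)
```

Each partial result here has the shape of the final product (10 x 2.4m in the case above), so how finely the shared dimension is split trades parallelism against the cost of summing the partial matrices.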
mw | 14 Nov 15:11 2014

Configuring Mahout Maven Project to use hadoop2


I am working on a REST API for Mahout called kornakapi.
I heard that it is possible to compile the Mahout trunk so that it is
compatible with Hadoop 2. How can I do that? Is it possible to do that
via the pom.xml?


Sean Farrell | 12 Nov 05:53 2014

Kmeans -k cluster flag not being recognised

Hi all,

I'm working through the Kmeans clustering example in 'Mahout in Action' and
I've run into an issue regarding randomly generating the initial cluster
centroids. According to MIA (and the examples on the Mahout web page) if
you set the -k flag then the algorithm will use a random seed generator to
produce initial cluster centroids for however many clusters you select
(i.e. the number after -k). However, I'm getting an illegal state exception
error saying that no clusters are found in my directory path and that I
should check my -c argument (which sets the path for the initial cluster
centroids sequence file). Reading through the output prior to the error it
seems as though the -k flag is not being recognised.

A search through the mailing list archive finds that this is not a new
problem, but I can't find a solution posted anywhere (other than one case
where upgrading from v0.7 to v0.8 fixed it). Does anyone know if this has
been solved?

Here are the commands I am using:

> mahout kmeans -i /user/hdfs/Vectors/reuters-vectors/tfidf-vectors/ \
    -c /user/hdfs/Vectors/reuters-initial-clusters/ \
    -o /user/hdfs/Vectors/reuters-kmeans-clusters/ \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -cd 1.0 -k 20 -x 20 -cl

And here is the output:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/bin/hadoop and
14/11/12 15:23:56 WARN driver.MahoutDriver: No kmeans.props found on
classpath, will use command-line arguments only
14/11/12 15:23:57 INFO common.AbstractJob: Command line arguments:
--maxIter=[20], --method=[mapreduce], --numClusters=[20],
--output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0],
14/11/12 15:23:57 INFO common.HadoopUtil: Deleting
14/11/12 15:23:58 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new compressor
14/11/12 15:23:58 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to
14/11/12 15:23:58 INFO kmeans.KMeansDriver: Input:
/user/hdfs/Vectors/reuters-vectors/tfidf-vectors Clusters In:
/user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed Out:
14/11/12 15:23:58 INFO kmeans.KMeansDriver: convergence: 1.0 max
Iterations: 20
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.IllegalStateException: No input
clusters found in
/user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed. Check your -c
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.lang.reflect.Method.invoke(
        at org.apache.mahout.driver.MahoutDriver.main(
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.lang.reflect.Method.invoke(
        at org.apache.hadoop.util.RunJar.main(
Donni Khan | 11 Nov 15:36 2014

Remove instance from SequenceFile

Hi All,

I'm working on text mining using Mahout algorithms. I'm calculating
the similarity between text documents. First I computed the TF-IDF vectors for all
documents (SequenceFile format). While computing the similarity, I found that
many documents do not have any similar documents, so I would like to remove
those documents from the SequenceFile vectors.

Any idea how to do that?

Thanks in advance,

vsinha | 11 Nov 11:59 2014

Naive Bayes Classification

Hi. I'm new to Mahout and am building a classification mechanism on 
Windows, which needs me to save the trained model into a file. The 
ModelSerializer class is only available for SGD based algorithms from the 
looks of it. Can someone please help me with saving the Complementary 
Naive Bayes model after training along with its conf and other helper 
files into a given path?