Donni Khan | 30 Mar 09:39 2015

Text clustering with SVD

Hello Mahout users,

I'm working on text clustering and would like to reduce the number of
features to improve the clustering process.
I would like to apply Singular Value Decomposition (SVD) before the
clustering step. I would be thankful to hear from anyone who has used this
before. Is it a good idea for clustering?
Is there any other method in Mahout to reduce the text features before
clustering?
Does anyone have an idea how I can apply SVD using Java code?

Thanks in advance,
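
A minimal sketch of the idea in the question above, assuming a small dense
document-by-term matrix held in memory. It does not use Mahout at all: it
finds the top-k right singular vectors with plain power iteration plus
deflation (a simple stand-in for a real SVD routine) and projects each
document onto them, so every document ends up with k features instead of one
per term. The class and method names are made up for illustration.

```java
import java.util.Arrays;

/** Sketch only: plain-Java truncated SVD by power iteration, not Mahout's implementation. */
class TruncatedSvdSketch {

    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += x[i] * y[i];
        return sum;
    }

    /** Computes (A^T A) v without forming A^T A explicitly. */
    static double[] multiplyAtA(double[][] a, double[] v) {
        double[] result = new double[v.length];
        for (double[] row : a) {
            double rowDotV = dot(row, v);
            for (int j = 0; j < v.length; j++) result[j] += rowDotV * row[j];
        }
        return result;
    }

    /** Top-k right singular vectors of a (one per row of the result). */
    static double[][] topRightSingularVectors(double[][] a, int k, int iterations) {
        int numTerms = a[0].length;
        double[][] vs = new double[k][];
        for (int c = 0; c < k; c++) {
            double[] v = new double[numTerms];
            for (int j = 0; j < numTerms; j++) v[j] = 1.0 + 0.01 * j; // fixed, non-zero start vector
            for (int it = 0; it < iterations; it++) {
                v = multiplyAtA(a, v);
                for (int p = 0; p < c; p++) { // deflation: remove directions already found
                    double overlap = dot(v, vs[p]);
                    for (int j = 0; j < numTerms; j++) v[j] -= overlap * vs[p][j];
                }
                double norm = Math.sqrt(dot(v, v));
                if (norm == 0.0) break; // matrix has rank lower than k
                for (int j = 0; j < numTerms; j++) v[j] /= norm;
            }
            vs[c] = v;
        }
        return vs;
    }

    /** Maps each document (row of a) to its k coordinates in the reduced space. */
    static double[][] project(double[][] a, double[][] vs) {
        double[][] reduced = new double[a.length][vs.length];
        for (int i = 0; i < a.length; i++)
            for (int c = 0; c < vs.length; c++) reduced[i][c] = dot(a[i], vs[c]);
        return reduced;
    }

    public static void main(String[] args) {
        // Rows are documents, columns are term counts (toy data).
        double[][] docs = {
            {2, 1, 0, 0}, {4, 2, 0, 0}, {0, 0, 1, 3}, {0, 0, 2, 6}
        };
        double[][] reduced = project(docs, topRightSingularVectors(docs, 2, 100));
        for (double[] row : reduced) System.out.println(Arrays.toString(row));
    }
}
```

In the toy data the first two documents share one set of terms and the last
two share another, so after the projection each pair lies along its own axis
of the two-dimensional space. The reduced rows are what a clustering step
would then consume. For a real corpus the matrix is sparse and large, so a
library SVD would be the practical choice.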
Hersheeta Chandankar | 30 Mar 09:18 2015

Latent Semantic Analysis for Document Categorization

Hi Ted,

Thank you for the quick reply.
It would be of great help if you could please explain what kind of 'linking
information between documents' I should look for.
Pat Ferrel | 27 Mar 14:45 2015

Need help on Windows

We are trying to put together the next release of Mahout but need a volunteer who uses it on Windows. The
script used to launch Hadoop MapReduce jobs and Spark jobs needs to be updated for the latest Windows and to
include new CLI bits that run on Spark. The issue is about updating the mahout.cmd launcher script.

If you can help, check out this Jira <>

Apache Mahout 
Hersheeta Chandankar | 26 Mar 13:55 2015

Latent Semantic Analysis for Document Categorization


I'm working on a document categorization project in which I have some
crawled text documents on different topics that I want to categorize into
pre-decided categories such as travel, sports, education, etc.
Currently my approach is to build a NaiveBayes classification model in
Mahout, which has given a good accuracy of 70%-75%. But I would still like
to improve the accuracy by capturing the semantic dependencies between the
words of the documents.
I've read about Latent Semantic Analysis (LSA), which creates a term-document
matrix and subjects it to a mathematical transformation called Singular Value
Decomposition (SVD).
I had thought of first subjecting the raw documents to LSA, followed by
k-means clustering on the LSA output, and then giving the clustered output as
input to the NaiveBayes classifier.
But on trying out LSA in Mahout, the end result seemed to be in numerical
format, and after clustering it was not accepted by the NaiveBayes
classifier.

Is my experimental approach wrong? Has anybody worked on a similar issue?
Could someone help me with the implementation of LSA or suggest any other
approach for semantic analysis of text documents?

Ted Dunning | 26 Mar 02:03 2015

Re: Fw: Mahout dataset Vectorization

This is an old question that I just dredged up in my email.

There is still a question about your format here.  When you say "IPs" do
you mean that you have a list of IP addresses?

Or is this a server web-log?  Does that mean that the destination IP is
implicit?  If so, you might be able to see a weak signal due to time
proximity of different IP addresses, but I can't see that you would see
much else.  Time proximity might give you a hint about wide-spread attacks.

On Wed, Feb 18, 2015 at 6:49 AM, Raghuveer <alwaysraghu <at>> wrote:

> Hi,
> I was going through Mahout presentations online and came across your email
> ID. I have a few issues when I want to analyse my dataset.
> I am trying to find out how I can use my dataset to present some
> relations. I have a dataset of the form
> IPs,timestamp,bytes_tranferred
> What relationships can I derive from this set so that I can present some
> meaningful values using Mahout? Currently I am planning to use this set to
> show which client (in the IPs column) had more traffic in a given time
> period, so I guess I will have to group IPs together. Are there any
> better ideas, and how can I do it using Java code? It would be really
> helpful if you could show me a sample for this issue. Kindly suggest.
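
The grouping described in the quoted message (which client had the most
traffic in a given time window) can be sketched in plain Java without any
Mahout classes. This is only an illustration under assumptions: each record is
taken to be one line of the form ip,timestamp,bytes, the timestamp is assumed
to be in epoch seconds, and the class and method names are invented here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: plain-Java aggregation, no Mahout classes involved. */
class TrafficByClientSketch {

    /** Sums bytes per client IP within fixed time buckets. Lines look like "ip,epochSeconds,bytes". */
    static Map<Long, Map<String, Long>> bytesPerBucket(List<String> lines, long bucketSeconds) {
        Map<Long, Map<String, Long>> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String ip = fields[0].trim();
            long timestamp = Long.parseLong(fields[1].trim());
            long bytes = Long.parseLong(fields[2].trim());
            // Start of the bucket this record falls into.
            long bucketStart = Math.floorDiv(timestamp, bucketSeconds) * bucketSeconds;
            totals.computeIfAbsent(bucketStart, b -> new TreeMap<>()).merge(ip, bytes, Long::sum);
        }
        return totals;
    }

    /** Returns the IP with the most bytes in one bucket, or null for an empty bucket. */
    static String busiestClient(Map<String, Long> bucket) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : bucket.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "10.0.0.1,0,100", "10.0.0.2,30,500", "10.0.0.1,59,450", "10.0.0.1,60,10");
        bytesPerBucket(lines, 60).forEach((bucket, perIp) ->
            System.out.println(bucket + " -> " + perIp + " busiest=" + busiestClient(perIp)));
    }
}
```

The per-bucket totals are also a natural starting point for the time-proximity
signal mentioned in the reply: IPs that are active in the same buckets can be
compared once the data is in this shape.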

Jayani Withanawasam | 24 Mar 11:12 2015

Error in TF-IDF vector creation - java.lang.IllegalStateException


I'm trying to get text classification working in Mahout 1.0 on Hadoop in
fully distributed mode (Ubuntu 12.04 / Hadoop 2.6).

There, I get the following error during TF vector creation.

mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

Exception in thread "main" java.lang.IllegalStateException: Job failed!
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.lang.reflect.Method.invoke(

James Parker | 19 Mar 15:11 2015

Mahout ALS RecommenderJob error

I am trying the ALS recommender using Java code. These are the steps I follow:
1.  I create a DatasetSplitter job

2.  I create a ParallelALSFactorizationJob job

3.  I create a FactorizationEvaluator job

4.  I create a RecommenderJob job

Everything is OK for steps 1 to 3. But when I run the RecommenderJob, I get
the following error:
 java.lang.Exception: java.lang.RuntimeException:
java.lang.ClassCastException: cannot be cast to

I'm surprised to see that if I run steps 1 to 3 using my Java code and the
last step using the command line "mahout recommendfactorized ........"
(with the files generated by my Java code in steps 1 to 3), recommendations
are correctly generated.

Thank you very much


mw | 18 Mar 17:54 2015

using trainDocTopicModel to approximate p(topic|document)


I am trying to use a topic model to approximate p(topic|document) like this:

TopicModel model = new TopicModel(hadoopConf, conf.getEta(),
conf.getAlpha(), dict, trainingThreads, modelWeight,
Vector docTopics = new DenseVector(new
Matrix docTopicModel = new SparseRowMatrix(model.getNumTopics(),
int maxIters = 5000;
for(int i = 0; i < maxIters; i++) {
      model.trainDocTopicModel(document, docTopics, docTopicModel);
}

For some reason the values in docTopics don't change much.
Does that indicate that the current document does not fit the model?
Could there be another explanation?
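
One possible explanation, offered as a sketch and not as a description of
Mahout's TopicModel internals: an iterative estimate of p(topic|document)
with the topic-term probabilities held fixed tends to settle at a fixed point
after a few passes, so further iterations barely move it. The plain-Java code
below uses a simplified EM-style update with no smoothing prior; the class,
method names, and toy numbers are all assumptions made for illustration.

```java
/** Sketch only: a simplified EM-style update, not Mahout's TopicModel code. */
class DocTopicInferenceSketch {

    /** One update of p(topic|doc), holding p(term|topic) fixed. */
    static double[] update(double[] termCounts, double[][] termGivenTopic, double[] docTopics) {
        int numTopics = docTopics.length;
        double[] next = new double[numTopics];
        double total = 0.0;
        for (int w = 0; w < termCounts.length; w++) {
            if (termCounts[w] == 0.0) continue;
            double norm = 0.0;
            for (int k = 0; k < numTopics; k++) norm += docTopics[k] * termGivenTopic[k][w];
            if (norm == 0.0) continue; // no topic can explain this term
            for (int k = 0; k < numTopics; k++) {
                // Share of term w's count attributed to topic k.
                next[k] += termCounts[w] * docTopics[k] * termGivenTopic[k][w] / norm;
            }
            total += termCounts[w];
        }
        if (total == 0.0) return docTopics.clone(); // nothing to learn from
        for (int k = 0; k < numTopics; k++) next[k] /= total;
        return next;
    }

    /** Repeats update() in place until the largest change drops below tolerance; returns iterations used. */
    static int iterateUntilStable(double[] termCounts, double[][] termGivenTopic,
                                  double[] docTopics, int maxIters, double tolerance) {
        for (int i = 1; i <= maxIters; i++) {
            double[] next = update(termCounts, termGivenTopic, docTopics);
            double maxChange = 0.0;
            for (int k = 0; k < next.length; k++) {
                maxChange = Math.max(maxChange, Math.abs(next[k] - docTopics[k]));
                docTopics[k] = next[k];
            }
            if (maxChange < tolerance) return i;
        }
        return maxIters;
    }

    public static void main(String[] args) {
        double[][] termGivenTopic = {{0.9, 0.1}, {0.1, 0.9}}; // rows: topics, columns: terms
        double[] termCounts = {10, 0};                        // the document only uses term 0
        double[] docTopics = {0.5, 0.5};                      // uniform starting point
        int used = iterateUntilStable(termCounts, termGivenTopic, docTopics, 5000, 1e-9);
        System.out.println("iterations=" + used + " p(topic|doc)=" + java.util.Arrays.toString(docTopics));
    }
}
```

If the real values stay close to their starting point from the very first
iteration, that would point to something else, such as the document sharing
few terms with the model's dictionary, which this sketch does not cover.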


mw | 17 Mar 09:46 2015

Importing tfidf from training set


I am running LDA on a training set to create a topic model.
To calculate p(topic|document) on unseen data, I need to import the
inverse document frequencies from the training set.
Is there a way to do that in Mahout?
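
The underlying idea can be sketched in plain Java, independent of whatever
Mahout provides for it: keep the per-term document frequencies and the
document count from the training set, and use them to weight the term counts
of each unseen document. The smoothed formula ln(N / (df + 1)) + 1, the raw
term count as tf, and all names below are assumptions for illustration and may
differ from the weighting used when the training vectors were built.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: reuses training-set document frequencies for unseen documents. */
class TrainingIdfSketch {

    /** IDF from training statistics; the smoothed form ln(N / (df + 1)) + 1 is an assumption. */
    static double idf(int docFreq, int numTrainingDocs) {
        return Math.log((double) numTrainingDocs / (docFreq + 1)) + 1.0;
    }

    /** Weights a new document's term counts with training-set IDF values. */
    static Map<String, Double> tfidf(Map<String, Integer> termCounts,
                                     Map<String, Integer> trainingDocFreq,
                                     int numTrainingDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            Integer df = trainingDocFreq.get(e.getKey());
            if (df == null) continue; // term unseen in training: the model has no entry for it
            weights.put(e.getKey(), e.getValue() * idf(df, numTrainingDocs));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> trainingDocFreq = new HashMap<>();
        trainingDocFreq.put("mahout", 9);
        trainingDocFreq.put("the", 99);

        Map<String, Integer> newDoc = new HashMap<>();
        newDoc.put("mahout", 2);
        newDoc.put("the", 5);
        newDoc.put("spark", 1); // not in the training dictionary

        System.out.println(tfidf(newDoc, trainingDocFreq, 100));
    }
}
```

Terms that never occurred in the training set are dropped here because the
topic model has no probability for them. The unseen document must also be
mapped through the same dictionary (term to index) as the training data
before it is handed to the model.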


Jeff Isenhart | 12 Mar 04:24 2015

demo spark-itemsimilarity; empty output

I am trying to run the example found here:

The data (demoItems.csv added to hdfs) is just copied from the example:
But when I run
mahout spark-itemsimilarity -i demoItems.csv -o output2 -fc 1 -ic 2
I get empty _SUCCESS and part-00000 files.

Any ideas?
Hartwig Anzt | 11 Mar 15:33 2015

error hard to digest

Dear Mahout-Users,

I run ALS on a dataset, and it works for feature space sizes 10, 20,
and 30. When I try to run it with a feature space size of 50, I receive
the error below (sorry for the long mail, I wanted to include all
output). It looks like the first iterations are fine, and then it
breaks? It is not a memory problem, I've checked that...

Many thanks for your help!


Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

15/03/09 11:27:20 INFO common.AbstractJob: Command line arguments: 
{--alpha=[50], --endPhase=[2147483647], --implicitFeedback=[true], 
--lambda=[0.1], --numFeatures=[50], --numIterations=[3], 
--numThreadsPerSolver=[16], --output=[output_16], --startPhase=[0], 
15/03/09 11:27:21 INFO util.NativeCodeLoader: Loaded the native-hadoop 
15/03/09 11:27:21 INFO input.FileInputFormat: Total input paths to 
process : 1