chirag lakhani | 18 Dec 15:31 2014

deploying Naive Bayes model in Java App

I have looked around for information about this but I still feel unsure
about whether it is possible.  I have recently developed a Naive Bayes
model in scikit-learn that takes some short text data (along with a few
other categorical features) and builds a classifier.  I would like
to ultimately deploy this classifier in our Java-based app.

I am thinking of using Mahout to build the model, especially given that
we have a Hadoop cluster: it would be good to train the model there,
serialize it, and then migrate it to our Java app
environment.  I wanted to know whether this is possible and what I would
ultimately need to do to get it to work.

The Mahout in Action book says that Naive Bayes only works for
"text-like" variables; does that mean you cannot combine categorical
variables with text variables?
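On the last question: a common trick for mixing categorical features into a "text-like" Naive Bayes is to encode each categorical value as a synthetic token appended to the document's token stream, so the model still sees a single bag of features. A minimal, hypothetical sketch (plain JDK, not the Mahout API; the feature names are made up):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeatureTokens {
    // Turn free text plus categorical features into one token list,
    // encoding each categorical as a single "name=value" pseudo-token.
    static List<String> toTokens(String text, Map<String, String> categoricals) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        for (Map.Entry<String, String> e : categoricals.entrySet()) {
            tokens.add(e.getKey() + "=" + e.getValue());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> cats = new LinkedHashMap<>();
        cats.put("region", "emea");
        cats.put("channel", "mobile");
        System.out.println(toTokens("Great fast shipping", cats));
        // [great, fast, shipping, region=emea, channel=mobile]
    }
}
```

Because "region=emea" can never collide with a real word, the classifier treats the categorical values as just more vocabulary, which is what a text-only Naive Bayes needs.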


万代豊 | 17 Dec 08:33 2014

Advice needed for Mahout heap size allocation (seq2sparse failure)

After several successful runs of other Mahout k-means jobs in the past,
I'm suddenly facing a heap error (below) in the Mahout
seq2sparse step (Mahout 0.7 on Hadoop 0.20.203, pseudo-distributed).

[hadoop@localhost TEST]$ $MAHOUT_HOME/bin/mahout seq2sparse --namedVector
-i TEST/TEST-seqfile/ -o TEST/TEST-namedVector -ow -a
org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md
3 -x 90 -ml 50 -seq -n 2
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/12/16 22:52:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
n-gram size is: 1
14/12/16 22:52:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
LLR value: 50.0
14/12/16 22:52:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
reduce tasks: 1
14/12/16 22:52:57 INFO input.FileInputFormat: Total input paths to process
: 10
14/12/16 22:52:57 INFO mapred.JobClient: Running job: job_201412162229_0005
14/12/16 22:52:58 INFO mapred.JobClient:  map 0% reduce 0%
14/12/16 22:53:27 INFO mapred.JobClient: Task Id :
attempt_201412162229_0005_m_000000_0, Status : FAILED
Error: Java heap space
14/12/16 22:53:29 INFO mapred.JobClient: Task Id :
attempt_201412162229_0005_m_000001_0, Status : FAILED
Error: Java heap space
14/12/16 22:53:40 INFO mapred.JobClient:  map 2% reduce 0%
14/12/16 22:53:42 INFO mapred.JobClient: Task Id :
attempt_201412162229_0005_m_000000_1, Status : FAILED
(Continue reading)
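Since the failures are in the map tasks, the usual fix on a pseudo-distributed 0.20.x setup is to raise the heap of the map/reduce child JVMs; MAHOUT_HEAPSIZE only affects the local client JVM once the job actually runs on Hadoop. A sketch of the relevant mapred-site.xml fragment (the 1024m value is an assumption to tune; restart the TaskTracker after changing it):

```xml
<!-- mapred-site.xml: raise the heap for map/reduce child JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Lowering the seq2sparse `-chunk` size from 200 MB is another lever if memory stays tight.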

ARROYO MANCEBO David | 17 Dec 10:09 2014

Tanimoto Coefficient

Hi mahouters,

Is the Tanimoto coefficient useful and acceptable for user similarity, or only for item similarity?

       public static void main(String[] args) {
             try {
                    DataModel model = new FileDataModel(new File("data/dataset.csv"));
                    UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
                    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
                    UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
                    List<RecommendedItem> recommendations = recommender.recommend(2, 3);
                    for (RecommendedItem recommendation : recommendations) {
                           System.out.println(recommendation);
                    }
             } catch (IOException e) {
                    e.printStackTrace();
             } catch (TasteException e) {
                    e.printStackTrace();
             }
       }
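For what it's worth, Tanimoto is defined on sets, so it applies to user similarity just as well as to item similarity: for users it compares the sets of items each user has touched, ignoring preference values. The underlying formula, |A ∩ B| / |A ∪ B|, in a self-contained sketch (plain JDK, not the Taste API; the user item-sets are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Tanimoto {
    // Tanimoto (Jaccard) coefficient over two users' item-ID sets:
    // |intersection| / |union|
    static double similarity(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<Long> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    public static void main(String[] args) {
        Set<Long> u1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> u2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(similarity(u1, u2)); // 2 shared / 4 total = 0.5
    }
}
```

Because it discards rating values, it is most appropriate when your data is boolean (viewed/bought), for users and items alike.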

Pasmanik, Paul | 15 Dec 22:40 2014

Question about choice of a recommender

Hi, I am trying to figure out which recommender, or combination of techniques, to use.
I have a system where a user is presented with a list of pre-defined products (max 50) when he/she is viewing a
video clip from a finite (not huge) set. The relationship between a video clip and which products to show is
statically defined and can change from time to time. We collect clicks and resulting buys for these
products. What we are trying to do is come up with a recommender engine that would display the top 4-5
products that a user would most likely click on and possibly buy (limited to the pre-defined list of up to 50
or so products for that particular video clip).
We can think of a set of these pre-defined products as a category, but a product can appear in multiple categories.

What would be the best approach?


Lee S | 15 Dec 10:03 2014

How can I include mahout 0.9 with hadoop 2.3 in my project?

Hi all:
I use Gradle to manage dependencies in my project.

dependencies {
     compile 'org.apache.mahout:mahout-core:0.9'
}

When Gradle builds, Mahout compiled against Hadoop 1.2.1 is downloaded.

Do I need to compile Mahout with Hadoop 2.3.0 and then include it into my
project locally?
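The prebuilt mahout-core 0.9 artifact was compiled against Hadoop 1.2.1, so to run against Hadoop 2.x you generally do need to rebuild Mahout from source against Hadoop 2 and publish that jar locally. If you just want to stop Gradle pulling Hadoop 1.2.1 in transitively, an exclusion looks like this (a sketch; the hadoop-client coordinate is one common choice, and swapping jars alone may not be enough if the bytecode is incompatible):

```groovy
dependencies {
    compile('org.apache.mahout:mahout-core:0.9') {
        // keep Mahout's transitive Hadoop 1.2.1 out of the classpath
        exclude group: 'org.apache.hadoop', module: 'hadoop-core'
    }
    compile 'org.apache.hadoop:hadoop-client:2.3.0'
}
```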
Suman Somasundar | 12 Dec 23:46 2014

Providing classification labels to Naive Bayes

I tried to run the Naïve Bayes program on a digit recognition data set. If the program itself extracts the
labels, it runs fine.

If I provide a file which contains the labels, then the program throws an exception saying that the number of
labels is 0.

Why is this happening?

Pat Ferrel | 10 Dec 20:02 2014

Re: Question about Mahout spark-itemsimilarity

The idea is to put each of the indicators in a doc or db row with a field for each indicator.

If you index docs as csv files (not DB columns), with a row per doc, you'll have the following for [AtA]:

doc-ID,item1-ID item2-ID item10-ID  and so on

The doc-ID field contains the item-ID and the indicator field contains a space delimited list of similar items

You add a field for each of the cross-cooccurrence indicators. You calculate [AtA] on every run of the
job, but [AtB] is built from a different action, so the IDs for A and B don't have to be the same. For
instance, if you have a user's location saved periodically, you could treat the action of "being at a
location" as the input to B. You feed A and B into the job and get back two text csv sparse matrices. Combine
them and index the following:

combined csv:
item1000-ID,item1-ID item2-ID item10-ID ...,location1-ID location2-ID …

This has two indicators in two fields, created by joining the [AtA] and [AtB] files. The item/doc-ID for the row
always comes from A; if you look at the output, that much should be clear. Set up Elasticsearch to index the
docs with two fields.

Then the query uses the history of the user that you want to make recommendations for.

I don't know the Elasticsearch query format, but the idea is to use the user's history of items interacted
with and locations as the query.

Q: field1:”item1-ID item2-ID item3-ID” field2:”location1-ID locationN-ID”

field1 contains the data from the primary action indicator and the query string is the user’s history of
the primary action. Field2 contains the location cross-cooccurrence indicators and so will be
(Continue reading)
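For reference, the query Pat describes might look like this in Elasticsearch's JSON DSL (a hedged sketch only; the field names follow the example above and the ID values are placeholders for the user's actual history):

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field1": "item1-ID item2-ID item3-ID" } },
        { "match": { "field2": "location1-ID locationN-ID" } }
      ]
    }
  }
}
```

The `should` clauses mean a doc scores higher the more of the user's history tokens its indicator fields contain, which is the OR-style similarity match this recommender pattern relies on.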

Gruszowska Natalia | 10 Dec 16:40 2014

Item-based collaborative filtering in Mahout - without isolating users

Hi All,

In Mahout there is a method for item-based collaborative filtering called itemsimilarity,
which returns the "similarity" between each pair of items.
In theory, the similarity between two items should be calculated only over users who ranked both items.
During testing I realized that in Mahout it works differently.
Below are two examples.

Example 1. items are 11-12
In the example below the similarity between items 11 and 12 should equal 1, but the Mahout output is 0.36. It
looks like Mahout treats null as 0.
Similarity between items:
101     102     0.36602540378443865

Matrix with preferences:
            11       12
1                     1
2                     1
3           1         1
4                     1

Example 2. items are 101-103.
The similarity between items 101 and 102 should be calculated using only the ranks of users 4 and 5, and the
same holds for items 101 and 103 (at least in theory). Here (101,103) comes out more similar than (101,102),
and it shouldn't.
Similarity between items:
101     102     0.2612038749637414
101     103     0.4340578302732228
102     103     0.2600070276638468

(Continue reading)
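The null-as-zero effect in Example 1 is easy to reproduce with a plain cosine over full preference columns (a toy sketch, not Mahout's exact implementation, which depends on the similarity class chosen):

```java
public class CosineDemo {
    // Plain cosine similarity over full vectors (missing prefs treated as 0).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Example 1's matrix: only user 3 rated item 11; all four users rated item 12.
        double[] item11 = {0, 0, 1, 0};
        double[] item12 = {1, 1, 1, 1};
        System.out.println(cosine(item11, item12)); // 0.5, not 1.0
        // Restricted to the co-rating users (user 3 only), the similarity would be 1.0.
    }
}
```

So the exact number depends on the measure, but any measure that fills in 0 for a missing preference will pull the similarity below the co-rated-only value, which matches what you observed.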

Anne Sauve | 9 Dec 22:57 2014

computing the distance between 2 values from fuzzyKmeans clustering clusteredPoints

Hello there,

I have been trying for a while to compute the pairwise distance between all the 
clustered points in order to fill in a distanceMatrix which will then be used to 
compute the silhouette of my clustering.

Here is my code (I am using Mahout 0.8):

 SequenceFile.Reader reader1 = new SequenceFile.Reader(fs, new 
Path("data/testdata/output8Box/clusteredPoints" + "/part-m-00000"), conf);
 IntWritable key1 = new IntWritable();
 WeightedVectorWritable value1 = new WeightedVectorWritable();

 List<NamedVector> clusters = new ArrayList<NamedVector>();
 while (, value1)) {
	 System.out.println(value1.toString() + " belongs to cluster " + key1.toString());
	 NamedVector cluster = (NamedVector) value1.getVector();
	 clusters.add(cluster);
 }
 reader1.close();

// Compute the distanceMatrix
DistanceMeasure measure = new CosineDistanceMeasure();
for (int i = 0; i < clusters.size(); i++) {
	for (int j = i + 1; j < clusters.size(); j++) {
		double d = measure.distance(clusters.get(i), clusters.get(j));
		System.out.println("dist " + i + " ; " + j + " : " + d);
	}
}

When I run it I am getting the following exception:
(Continue reading)
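Once the distance matrix is filled, the silhouette also needs each point's cluster assignment. A self-contained sketch of the per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b is the smallest mean distance to any other cluster (array names are hypothetical; averaging s(i) over all points gives the clustering's silhouette score):

```java
public class Silhouette {
    // s(i) = (b - a) / max(a, b):
    //   a = mean distance from point i to the other members of its own cluster,
    //   b = smallest mean distance from point i to the members of another cluster.
    static double silhouette(int i, double[][] dist, int[] assign, int numClusters) {
        double[] sum = new double[numClusters];
        int[] count = new int[numClusters];
        for (int j = 0; j < assign.length; j++) {
            if (j == i) continue;
            sum[assign[j]] += dist[i][j];
            count[assign[j]]++;
        }
        int own = assign[i];
        if (count[own] == 0) return 0.0; // convention for singleton clusters
        double a = sum[own] / count[own];
        double b = Double.MAX_VALUE;
        for (int c = 0; c < numClusters; c++) {
            if (c != own && count[c] > 0) b = Math.min(b, sum[c] / count[c]);
        }
        return (b - a) / Math.max(a, b);
    }

    public static void main(String[] args) {
        // Toy symmetric distance matrix: points 0 and 1 are close, point 2 is far.
        double[][] dist = {
            {0.0, 0.1, 1.0},
            {0.1, 0.0, 1.0},
            {1.0, 1.0, 0.0}
        };
        int[] assign = {0, 0, 1};
        System.out.println(silhouette(0, dist, assign, 2)); // (1.0 - 0.1) / 1.0 = 0.9
    }
}
```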

Makis P | 9 Dec 14:58 2014

General Instructions about Mahout

Hello everyone!

I am new to Mahout and a university student! As part of my diploma thesis 
I must use Mahout's k-means algorithm and change it a little bit to produce 
code that will solve some of the problems I study. I am sending this message 
because I am seeking general information about Mahout and how 
to use it.

First of all, let me tell you what I have done so far:

1) I downloaded the book "Mahout Cookbook", followed the instructions, 
and set up Hadoop and Maven.
2) After that I followed instructions on 
how to set up Mahout in Eclipse.

Now in Eclipse I see mahout, mahout-buildtools, etc., but no mahout-core jars as 
far as I can see. So I installed Mahout (assuming that the installation was 
OK), and now how should I proceed to work with Mahout? How can I run the 
given code via Eclipse?

My question is whether somebody could give me the general idea of how things work 
with Eclipse and Mahout, or whether somebody has a tutorial that describes such a setup.

Thanks in advance for any help.

With friendly greetings,
(Continue reading)

shweta agrawal | 8 Dec 06:39 2014

Mahout clustering

I am new to Mahout. I am working on Mahout clustering to detect topics. I
have done Mahout k-means clustering and got the top terms of each cluster,
but I want the document IDs of the clusters. How can I tell which document is in
which cluster?

Thanks and Regards