Pat Ferrel | 28 Aug 20:16 2014

Content-based recommendations

When we do cooccurrence recs with a search engine we index:

    itemID, list-of-indicator-items

Then search on the indicator field with user item history.

Could we use a similar approach for content-based recs? Imagine a content site where we have run the text
through a pipeline that narrows the input to important tokens (Lucene analyzer + LLR with some kind of
threshold). Then this goes into RowSimilarity:

docID, list-of-important-terms 

docID, list-of-similar-docs

Then index the list-of-similar-docs and query with the user's doc history. The idea is to personalize the
content-based recs rather than just show "docs like this one".
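The query step can be sketched in a few lines. Here a plain in-memory map stands in for the search index, and candidate docs are scored by how many docs from the user's history appear in their indexed list-of-similar-docs (OR-query semantics); all IDs and data below are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PersonalizedContentRecs {
    // docID -> list-of-similar-docs, as produced by RowSimilarity (illustrative data)
    static Map<String, List<String>> similarDocs = Map.of(
        "doc1", List.of("doc2", "doc3"),
        "doc2", List.of("doc1", "doc4"),
        "doc5", List.of("doc3", "doc4"));

    // Score each candidate doc by how many docs in the user's history appear
    // in its indexed similar-docs list -- the same semantics a search engine
    // applies when the history is used as an OR query on that field.
    static List<String> recommend(List<String> userHistory) {
        Map<String, Integer> score = new HashMap<>();
        for (Map.Entry<String, List<String>> e : similarDocs.entrySet()) {
            if (userHistory.contains(e.getKey())) continue; // don't re-show history
            int hits = 0;
            for (String h : userHistory) if (e.getValue().contains(h)) hits++;
            if (hits > 0) score.put(e.getKey(), hits);
        }
        List<String> recs = new ArrayList<>(score.keySet());
        recs.sort((a, b) -> score.get(b) - score.get(a));
        return recs;
    }

    public static void main(String[] args) {
        // A user who has read doc3 and doc4 gets doc5 ranked first.
        System.out.println(recommend(List.of("doc3", "doc4")));
    }
}
```

A real search engine would also weight the matches (e.g. by TF-IDF of the indicator field), but the ranking idea is the same.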
Jozua Sijsling | 26 Aug 12:51 2014

ItemSimilarity based on k-means results

Dear Mahout users,

I am using a GenericBooleanPrefItemBasedRecommender where item similarity
is predetermined and based on k-means clustering results. I'm finding it
hard to rate the similarity of items even with the cluster data.

Given two items in the same cluster, I attempt to rate their similarity
based on the following calculation:

double distanceSquared = vector1.getDistanceSquared(vector2);
double radiusSquared = radius.getLengthSquared();
double diameterSquared = radiusSquared * 4;
double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
double similarity = distanceRate * 2 - 1;
double result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors where each vector will
have thousands of dimensions.

I find that with this simple calculation I am making the false assumption
that radius.getLengthSquared() * 4 is a valid means of determining the
maximum squared distance between any two points in a cluster. This might have
worked if the clusters were n-spheres having the same radius on every axis,
but that is not the case. Often distanceSquared will be much larger than
diameterSquared, so my assumption fails and so does my calculation.
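The failure is easy to reproduce: the radius a clusterer reports is a per-dimension spread of the members (roughly a standard deviation), not a bound on how far members can sit apart, so two tail points can easily exceed the assumed diameter. A self-contained illustration with plain arrays in place of Mahout Vectors (the numbers are made up):

```java
public class RadiusBoundDemo {
    // Squared Euclidean distance between two points.
    static double distanceSquared(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Suppose the cluster's radius vector is (1, 1): a spread of ~1 per axis.
        double[] radius = {1, 1};
        double radiusSquared = distanceSquared(radius, new double[] {0, 0}); // 2.0
        double diameterSquared = radiusSquared * 4;                          // 8.0

        // Two legitimate cluster members sitting in the distribution's tails:
        double[] p1 = {3, 0};
        double[] p2 = {-3, 0};
        double d2 = distanceSquared(p1, p2); // 36.0, far beyond "diameterSquared"

        System.out.println(d2 > diameterSquared); // the assumed bound is violated
    }
}
```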

Given that my approach does not work, how should I determine a 'similarity
score' based on in-cluster distance? Working with ~40k dimensions, I fear
that any optimal solution is still going to be very slow, so I am perfectly
(Continue reading)

Wei Li | 25 Aug 08:22 2014


Hi Mahout users:

    We have tried the item-based CF recommender with (user_id, item_id,
rating) data, but the recommendation output has fewer records than we expected.
For example, if we have 1000 users, the output should have 1000 records, one
for each user, right?

Sergi - | 24 Aug 03:19 2014

Java heap space using TrainNaiveBayesJob

Hi everybody,

I am implementing a classifier that handles a large amount of data using naive Bayes, with EMR as the "hadoop
cluster". By large amount of data I mean that the final models are around 45GB. While the feature
extraction step works fine, calling TrainNaiveBayesJob results in the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.RandomAccessSparseVector.setQuick(
	at org.apache.mahout.math.VectorWritable.readFields(
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(
	...
	at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(
It took me a little while to realize that the Naive Bayes MR job finished fine on each of the reducers, but
after the reduce step the namenode gets the models from the reducers, loads them into memory, validates them,
and after validation serializes them. My second thought was to increase the heap memory of the
namenode (using bootstrap-actions in EMR,
s3://elasticmapreduce/bootstrap-actions/configure-daemons --namenode-heap-size=60000), but
even with this setup I am receiving the same exception.

Has somebody dealt with a similar problem? Any suggestion? (other than trying a better master node). 
(Continue reading)

jay vyas | 23 Aug 03:51 2014

Extremely Slow ALS Recommender in 0.8, but faster in 0.9.

Hi Mahout. I'm finding that in 0.8 my ALS recommender goes extremely slowly,
but it goes very fast in 0.9.

- At the time the job slows down, there is virtually no disk I/O.
- The CPU cycles fluctuate up and down during this time, but they aren't at
100% the whole time.
- Mapper percentages increase by about 2% every hour, so it's essentially
a hanging job (i.e. the input data set is super small, and the exact same
command finishes in about 2 minutes when using Mahout 0.9).

To look into it,

- I saw that, between these two releases, an improvement has been made, but
there is no associated patch.

- Doing a quick git diff, it's clear that the VectorSumCombiner seems to
have been removed (probably so that Mahout could use a utility-class vector sum instead).

I'm wondering: has anyone ever seen the 0.8 ALS Recommender job progress
*very* slowly, so slowly that it's essentially hanging? Like I said, this
bug appears to be gone in 0.9.

Any thoughts would be appreciated.



(Continue reading)

Wei Li | 22 Aug 10:56 2014

Mahout LDA issue: small word probability in each topic

Hi All:

    I have successfully compiled Mahout 0.9 on Hadoop and submitted the
LDA CVB model; most of the parameters are set to default values and
--maxIter is set to 25. After we got the model, we found that the word
probability in each topic is quite small; most of them are about 0.00001,
nearly equal. Is that OK? Can we change the alpha and beta to change the word
distribution in each topic? Thanks all :)
Wei Zhang | 21 Aug 15:25 2014

does Mahout support Hadoop 2.5.0 ?


After a system upgrade, we have a Hadoop 2.5.0 cloud instance, and we are
trying to run Mahout on top of it. We are using Mahout 0.9 (downloaded
from ). The download page indicates Mahout supports Hadoop 2.2.0.

But even with mvn clean install -Dhadoop2.version=2.2.0 -DskipTests=true,
it doesn't appear that any Hadoop 2.x jar is downloaded anywhere (i.e., only
the Hadoop 1.2.1 jar is downloaded, which is what the pom.xml dictates).

Further, a simple HDFS I/O call fails (i.e., creating an HDFS file).
It seems Hadoop 2.2.0 is not quite compatible with Mahout 0.9.

My question is:
(1) Does Mahout support Hadoop 2.5.0, or should I check out the code from
git (as suggested at ) in order to get a Hadoop 2.x-friendly Mahout?


Pat Ferrel | 20 Aug 23:23 2014

spark-itemsimilarity output

Got a question on Twitter about this so here’s the answer:

Hadoop itemsimilarity takes pairs of interactions as input (user id<tab>item id<tab>strength) and
outputs pairs (item id1<tab>item id2<tab>strength). Furthermore the output is only above the diagonal
so redundant pairs are not output.

spark-itemsimilarity takes the same input but outputs what is basically a DRM in text form, including redundant
pairs, with each row sorted by LLR strength. It also preserves the IDs that were input, even if they were
strings. For example, using the default delimiters the output might be:
(itemID1<tab>itemID2:strength2<space>itemID3:strength3…) This change of format is because:

1) one primary use of spark-itemsimilarity is with a search engine to create a recommender, and this format
(minus the strengths) is easily indexed.
2) the output is basically a sorted list of similar items for each item, which is a better format when the user
expects to see a list of similar items.
3) it is assumed that preserving the application IDs is a benefit.
4) the various parts of the output can be parsed with simple string “split” methods available in all languages.
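A minimal sketch of such parsing with plain string splits, assuming the default delimiters described above (tab between fields, space between item:strength pairs, colon inside each pair); the example IDs and strengths are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ItemSimilarityLineParser {
    // Parse one output line: itemID1<tab>itemID2:strength2<space>itemID3:strength3
    // Returns the similar items in their original (LLR-sorted) order.
    static Map<String, Double> parse(String line) {
        String[] fields = line.split("\t");
        Map<String, Double> similar = new LinkedHashMap<>();
        if (fields.length > 1) {
            for (String pair : fields[1].split(" ")) {
                String[] idAndStrength = pair.split(":");
                similar.put(idAndStrength[0], Double.parseDouble(idAndStrength[1]));
            }
        }
        return similar;
    }

    public static void main(String[] args) {
        // String item IDs are preserved, as described above.
        Map<String, Double> sims = parse("iphone\tipad:174.2 galaxy:12.7");
        System.out.println(sims); // similar items for "iphone", strongest first
    }
}
```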

If anyone can see a need to reproduce the old hadoop type output please speak up.
Wei Zhang | 20 Aug 00:16 2014

any pointer to run wikipedia bayes example


I have been able to run the bayesian network 20news group example provided
at Mahout website.

I am interested in running the Wikipedia bayes example, as it is a much
larger dataset.
From several googling attempts,  I figured it is a bit different workflow
than running the 20news group example -- e.g., I would need to provide a
categories.txt file, and invoke WikipediaXmlSplitter,  call
wikipediaDataSetCreator and etc.

I am wondering is there a document somewhere that describes the process of
running Wikipedia bayes example ?  seems no
longer work.

Greatly appreciated!

Pat Ferrel | 19 Aug 21:08 2014

Re: mapreduce ItemSimilarity input optimization

If you have purchase data, train with that. Purchase data is always a much better signal than view data.
Don't worry that you have 100 views per purchase; trust me, this will give you much better recommendations.

Filter out unrelated categories. So for electronics, filter by throwing away clothes and home appliances,
but don't filter out other possibly related categories like accessories. There are other ways to do this,
but they are more complicated; let's get this working first.

You should consider doing personalized recommendations next, instead of only “other people purchased
these items”. Right now all users see the same recs, since you have no personalization yet.

If you are keeping user purchase history you can train itemsimilarity with that. It will give you
“other people purchased these items” data, i.e. similar items. Then index the similarity data with
a search engine (Solr, Elasticsearch) and query with the current user's purchase history as the query. The
results will be an ordered list of items that are personalized recs.
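The query itself is just the user's purchase history OR'd together against the indexed indicator field. A minimal sketch, assuming Lucene/Solr query syntax; the field name "indicators" is illustrative, not a fixed convention:

```java
import java.util.List;

public class RecQueryBuilder {
    // Build the search-engine query from the user's purchase history:
    // an OR of the purchased item IDs against the indexed indicator field.
    static String buildQuery(List<String> purchaseHistory) {
        return "indicators:(" + String.join(" OR ", purchaseHistory) + ")";
    }

    public static void main(String[] args) {
        // Illustrative item IDs; the search engine ranks the hits,
        // and that ranked list is the personalized recs.
        System.out.println(buildQuery(List.of("iphone5", "lightning-cable")));
    }
}
```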

I think you can get a free copy of “Practical Machine Learning” from the MapR site; it describes the
Solr recommender method that uses itemsimilarity.

On Aug 19, 2014, at 10:23 AM, Serega Sheypak <serega.sheypak <at>> wrote:

Yes, ecommerce.
>> #2 data includes #1 data, right?
Yes, #1 is the "raw" output of ItemSimilarity recommendations.
#2 is #1 with the category filter applied.

I can't drop the #1 "look-with" recs, since #2 (with category filter) doesn't have
accessories. The category filter would remove accessory recommendations for
an iPhone and leave only other iPhones.

(Continue reading)

Pat Ferrel | 18 Aug 19:09 2014

Row Similarity

The spark version of itemsimilarity only has LLR as a metric. But what about RSJ (RowSimilarityJob)? It's a
pretty simple thing to convert itemsimilarity to rowsimilarity, but RSJ has some uses beyond collaborative
filtering. Are some of the other similarity metrics needed?
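For reference, LLR here is Dunning's G² test over the 2x2 table of cooccurrence counts. A minimal self-contained sketch following the shape of Mahout's LogLikelihood.logLikelihoodRatio (the helper names here are illustrative):

```java
public class Llr {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // "Entropy" in the unnormalized form used for G^2:
    // xLogX(sum) minus the sum of xLogX over the elements.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0;
        for (long x : counts) {
            result += xLogX(x);
            sum += x;
        }
        return xLogX(sum) - result;
    }

    // k11: both events cooccur; k12/k21: one event only; k22: neither.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against tiny negative values from rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // Perfectly correlated events score high; independent events score ~0.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
        System.out.println(logLikelihoodRatio(5, 5, 5, 5));
    }
}
```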

Specifically, text comparison typically weights the terms in the DRM with TF-IDF. Then a user would apply
SIMILARITY_COSINE or SIMILARITY_TANIMOTO. Then again, I've seen some using LLR for this too.

Are some of the other similarity metrics needed for RSJ?
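For reference, a minimal sketch of what SIMILARITY_COSINE computes over two TF-IDF-weighted rows (dense arrays stand in for sparse Mahout Vectors; the numbers are made up):

```java
public class CosineRowSimilarity {
    // Cosine similarity: dot(a, b) / (|a| * |b|); it lies in [0, 1]
    // for non-negative TF-IDF weights.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two toy "documents" over a 4-term vocabulary, TF-IDF weighted.
        double[] doc1 = {0.5, 0.0, 1.2, 0.0};
        double[] doc2 = {0.5, 0.3, 1.2, 0.0};
        System.out.println(cosine(doc1, doc2));
    }
}
```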