Serega Sheypak | 24 Jul 08:57 2014

itemsimilarity returns few similar items

Hi, I'm trying to calculate item similarity using ItemSimilarityJob.
Here are my counts:
input dataset: user_id, item_id, pref: 16M
distinct items: 700K
distinct users: 4M

Preferences per user, bucketed:
count_of_preferences    count_of_users
1                       2M
2                       600K
3                       300K
4                       300K

threshold: 0.91

It returns ~2000 rows for ~1000 distinct items.
What am I doing wrong?
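
A minimal sketch in Python (toy data, not Mahout code) of one factor that may be at work here: a user with a single preference produces no item pairs, so going by the bucket counts above, roughly 2M of the 4M users cannot contribute to any similarity at all. The function name `cooccurring_pairs` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurring_pairs(prefs):
    """Count, for each unordered item pair, the users who have both items."""
    items_by_user = defaultdict(set)
    for user_id, item_id in prefs:
        items_by_user[user_id].add(item_id)

    pair_counts = defaultdict(int)
    for items in items_by_user.values():
        # A user with only one item yields no combinations of size 2.
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

# Toy data: users 1 and 2 have one preference each, user 3 has two.
prefs = [(1, "A"), (2, "B"), (3, "A"), (3, "B")]
pairs = cooccurring_pairs(prefs)
print(pairs)
```

Only user 3 contributes a pair in this toy input. Counting how many real users have two or more preferences, and how many item pairs survive the 0.91 threshold, might be a reasonable first diagnostic.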
Martin, Nick | 24 Jul 07:27 2014


So I know fpgrowth was put out to pasture a few months ago. As luck would have it, I need to do this kind of thing now.

Would my only option now be to pull the source (per Sebastian's note in the JIRA)? Could I roll back from 0.9 to
a previous version to pick it back up?

Any other options? I don't *think* we'd be able to bite off algo maintenance, so I'm guessing that probably
rules out getting it dropped back into the distro.

Sent from my iPhone
Bojan Kostić | 23 Jul 11:10 2014

Streaming kmeans question


As I see it from the examples, streamingkmeans creates centroids for clusters.
My question is: how can I use those centroids? Is there a way to pass them
to kmeans for clustering?
In the past I used canopy and then kmeans.
I searched the user lists and the Mahout JIRA and could not find any clue.
Some other users had the same question, but no answers were given.
I can see that there is BallKMeans in the source code and that it is used
in StreamingKMeansReducer:getBestCentroids, but I can't figure out where
the actual clustering of data happens.
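
For illustration only, one generic way to use a set of centroids is to assign every point to its nearest centroid. The Python sketch below shows that step on toy 2-D data. It is not Mahout code and makes no claim about what streamingkmeans or BallKMeans do internally; the function name `assign_to_centroids` and the data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_to_centroids(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    assignments = []
    for point in points:
        best = min(range(len(centroids)),
                   key=lambda i: squared_distance(point, centroids[i]))
        assignments.append(best)
    return assignments

# Toy data: two centroids and three points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (-2.0, 0.5)]
assignments = assign_to_centroids(points, centroids)
print(assignments)
```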

What am I missing?

I am playing with Mahout 0.9 and Mahout trunk.

Best regards.

Bojan Kostić
Edith Au | 22 Jul 17:59 2014

RowSimilarityJob with sparse matrix skips rows


My RowSimilarityJob returns a DRM with some rows missing. The input file
is very sparse: there are about 600 columns, but only 1 to 6 of them have a
value in each row. The missing rows are among those with only 1 or 2 values
filled. Not all rows with 1 or 2 values are missing, just some of them, and
the missing rows are not always the same from one RowSimilarityJob execution
to the next.

What I would like to achieve is to find the relative strength between
rows. For example, if there are 600 books, and user1 and user2 each like
only one book (the same book), then there should be a correlation between
these 2 users.

But my RowSimilarityJob output file seems to skip some of the users with
sparse preferences. I am running the job locally with 4 options: input,
output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be the right
approach to pick up similarity between users with sparse preferences?
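
To make the two-user example concrete, here is a minimal Python sketch of the standard log-likelihood ratio (G-squared) statistic for a 2x2 contingency table, applied to two rows that share their single non-zero column out of 600. This is the textbook formula, not Mahout's code; whether SIMILARITY_LOGLIKELIHOOD computes exactly this value, and why some rows are dropped, is not established here. The function name `llr` is hypothetical.

```python
import math

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table.

    k11: columns set in both rows, k12: only in row A,
    k21: only in row B,           k22: in neither row.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # treat 0 * log(0) as 0
            total += k * math.log(k * n / (rows[r] * cols[c]))
    return 2.0 * total

# user1 and user2 each like exactly one book, and it is the same book.
score = llr(1, 0, 0, 599)
# A table with no association between the two rows gives a score of zero.
independent = llr(1, 1, 1, 1)
print(score, independent)
```

Under this formula the shared-book pair gets a clearly positive score, so a correlation between such users is at least expected in principle.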


Helena | 22 Jul 11:17 2014

Adding unrated items to data model


I am trying to build a content-based recommendation system using
Mahout. I use PostgreSQLJDBCDataModel.
I haven't found a way to include unrated items when generating the data model.
Is this possible, and if so, how can it be done?

Thanks in advance, 

anaa dewet | 21 Jul 23:15 2014

Mahout on Cygwin

Hi,

Can I run Mahout on Windows using Cygwin?

If yes, can anybody please direct me to a tutorial or a
step-by-step description?

Thanks in advance,
dewet
anaa dewet | 21 Jul 23:14 2014

LDA help

Hi,
I sent this before, but I do not know whether it was posted to the group successfully. It is my first question here.
I would like to run LDA on my own set of documents. I used the seqdirectory utility to convert a directory of text
documents into sequence files. However, I need to set each sequence file key to a row ID generated from a
database or to a column value from a CSV file. That is, given a large CSV file where each line is a document, I would
like to create a sequence file for each line, taking the key from the first column. Is there an option to
determine what the key of the sequence file will be?
For the LDA results: how can I interpret the term-topic and doc-topic files that are generated after running LDA or cvb?
I appreciate any help,
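
A minimal sketch in Python of the keying step described above, assuming a CSV whose first column is the document ID and whose second column is the document text. It only produces (key, text) pairs; writing them out as SequenceFile records is not shown, and the function name `csv_to_records` and the sample rows are hypothetical.

```python
import csv
import io

def csv_to_records(csv_text):
    """Yield (key, document) pairs, taking the key from the first column."""
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key, document = row[0].strip(), row[1].strip()
        yield key, document

# Toy input standing in for the large CSV file.
sample = '101,"first document text"\n102,"second document, with a comma"\n'
records = list(csv_to_records(sample))
print(records)
```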
Andreas Spalas | 21 Jul 21:43 2014

Help for Grouping similar items together. Clustering/Classification problem?


These days I am exploring the Mahout framework in order to solve a specific
problem. I have a CSV file with 1.5 million items (products)
with the following format:
id, product_title
1, Apple IPHONE 5
2, Samsung Galaxy S5

and I would like to group the items-products together in terms of category
so for example in the above case both products would be under "Technology"
or "Smartphones" Category.

I would like to know if this is possible to handle in Mahout and whether
someone would choose clustering or classification way in order to solve
such a problem.

As I am currently studying "Mahout in Action", I saw that for the clustering
case I have to transform my data into a SequenceFile and find a way to
vectorize it, and I don't really see whether this is applicable to my case at
the moment. For the second case, classification, I understand that I have
to provide some training data with a target variable (in my case "Category")
in order to create a model for the classification system. I can extend
my dataset with this extra info, but is it going to work?

Can anyone give me some advice on how to handle this particular problem? Is
it even possible to do it in Mahout? Any direction would be appreciated!
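
To illustrate what vectorizing product titles could look like, here is a minimal Python sketch using plain term counts and cosine similarity. It is not Mahout's vectorization pipeline; the helper names `vectorize` and `cosine` and the third title are hypothetical additions for the example.

```python
import math
from collections import Counter

def vectorize(title):
    """Turn a title into a sparse term-count vector (lowercased tokens)."""
    return Counter(title.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

iphone = vectorize("Apple IPHONE 5")
galaxy = vectorize("Samsung Galaxy S5")
iphone_case = vectorize("Apple iPhone 5 case")  # hypothetical extra title

sim_phones = cosine(iphone, galaxy)
sim_iphone_case = cosine(iphone, iphone_case)
print(sim_phones, sim_iphone_case)
```

The two phones share no tokens, so their similarity is zero under this scheme. That suggests clustering on raw titles alone may not place them in a common "Smartphones" group, and that labeled training data for classification could be worth the effort.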


Serega Sheypak | 20 Jul 20:57 2014

recommenditembased returns 0 records from last map-reduce job

Hi, I'm trying to compute item similarity.
I gather the items that users visit during shopping and then create a file with these rows:
user_id, item_id, weight (where weight can be one of [1.0, 1.6, 1.9], depending on
user action type and data source)
-item_id, item_id, 1.0 (from the items dictionary)

and I provide a usersFile, where user_id = -item_id.

The idea is to get item similarity. If any user visits an item named "A", I want
to show him items "B", "c", "xxx" using the preferences of other users.

The problem is that the last (???) MapReduce job returns 0 rows.

Here are my settings:

sudo -u oozie mahout recommenditembased \
                    --input visited_items_with_inverted_items \
                    --output result \
                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
                    --usersFile inverted_items \
                    --numRecommendations 500 \
                    --booleanData false \
                    --maxPrefsPerUser 100 \
                    --maxSimilaritiesPerItem 500 \
                    --minPrefsPerUser 0 \
                    --maxPrefsPerUserInItemSimilarity 30 \
                    --threshold 0.91 \
                    --tempDir  temp \

Andrew Musselman | 18 Jul 07:38 2014

New post on the AWS blog

I wrote a post about building a recommender using Mahout on Amazon EMR and it finally went live:

Please feel free to share the link to get the word out.

Alex Ott | 14 Jul 19:33 2014

Mahout 0.9 & code from "Mahout in Action"

Hi all

For everyone interested in using the code from "Mahout in Action" with the
latest Mahout: the examples were updated to work with Mahout 0.9. You can
find the code in a separate branch:

I haven't tested all the examples; I tested chapters 2-5 and chapter 16. If you
find that something isn't working, please file an issue on GitHub.


With best wishes,                    Alex Ott
Twitter: alexott_en (English), alexott (Russian)