Renaud Grisoni | 2 Sep 00:56 2014

Wrong model when loading an AdaptiveLogisticRegression from a binary file


I encounter a problem when I apply an AdaptiveLogisticRegression model that
I have loaded from a binary file.
The AdaptiveLogisticRegression object I load is not the same as the one I
saved using the ModelSerializer class: for instance, the Beta matrices of
the OnlineLogisticRegression objects contained in the
AdaptiveLogisticRegression object are very different from those I have

I tried the ModelSerializer class, then the write() and read() methods of
the AdaptiveLogisticRegression object, but I always get the same bad
results.

Is there a bug during the model (de)serialization I should have heard about


alt | 1 Sep 14:45 2014

Re: UserBasedRecommender question

Ted, could you please explain this point:

> If you only have 20 categories, I would recommend that you consider using
> different technologies than recommendations.  Simply building 20
> classifiers is likely to be as effective or more so.

Suppose we want to build a classifier that predicts category N as the
"label", and we train it on the whole user data or on a representative
sample. The classifier then learns that every combination of features it
saw for users not interested in N implies that a user is not interested
in N. So if this classifier fits the data well, it will give us no
positive answers for users from the same data who are not yet interested
in N. Isn't that so?
Positive answers would only become possible after some time, when the data
has changed significantly. Right?

WBR Oleg

> From: Ted Dunning <ted.dunning <at>>
> To: "user <at>" <user <at>>
> Cc:
> Date: Wed, 6 Aug 2014 12:16:31 -0600
> Subject: Re: UserBasedRecommender question
> If you only have 20 categories, I would recommend that you consider using
> different technologies than recommendations.  Simply building 20
> classifiers is likely to be as effective or more so.

sara khodeir | 31 Aug 13:17 2014

Urgent questions

I'm curious what the difference between the predictors and features is in
the logistic regression model. What is the point of having features if it
predicts using the predictors?

Also, I am trying to include more features than can fit on the command
line, so I wrote a file of the parameters and I'm trying to run it like this:
./mahout "filepath", but this doesn't work. Does anyone know how I can do

Tom LAMPERT | 29 Aug 21:02 2014

lucene2seq and empty fields

Hi all,

I have run into a problem with lucene2seq and I'm wondering whether any of you can help me. I have a Solr
index in which the documents contain several fields, and some of these fields are empty in some of the
documents. The problem I am encountering is that lucene2seq gives the error:

Error: java.lang.IllegalArgumentException: Field 'XXXXXXX' does not exist in the index?

when an empty field is encountered. I have verified that the field exists by querying Solr (but as I
said, it is empty). Can anyone help with this?


Pat Ferrel | 29 Aug 20:27 2014


Great thanks,

Calculating TF-IDF on the input would seem to imply that the preference strength is being used as a proxy for
term count, right? As you say there is some question about whether this should be done. There certainly are
ways to preserve these “counts” when indexing happens in the search engine.

This issue seems to pop up a lot for two reasons:
1) people are still trying to use ratings as preference strengths; I tend to be dubious of this as a ranking
method since so much effort has gone to little effect here.
2) people want to mix actions; we have a good way to handle this now and weights often don’t work—mixing
actions may even make things worse.

Until your statements below I was strongly leaning towards ignoring preference strength and treating all
as boolean/binary. Can you think of a case that I missed? When I mine for preference data I get ratings and
keep only 4 or 5 out of 5 as unambiguous positive preferences tossing the rest as ambiguous.
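
The filtering step described above (keep 4- and 5-star ratings as boolean positives, drop the rest) could look roughly like the sketch below. It is plain Java rather than Mahout code; the class name, the Rating type, and the sample data are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BooleanPrefSketch {
    /** Hypothetical rating record: one user, one item, 1 to 5 stars. */
    static final class Rating {
        final String user;
        final String item;
        final int stars;

        Rating(String user, String item, int stars) {
            this.user = user;
            this.item = item;
            this.stars = stars;
        }
    }

    /** Keeps only ratings with at least minStars and drops the strength itself. */
    static Map<String, Set<String>> positives(List<Rating> ratings, int minStars) {
        Map<String, Set<String>> prefs = new TreeMap<>();
        for (Rating r : ratings) {
            if (r.stars >= minStars) {
                prefs.computeIfAbsent(r.user, u -> new TreeSet<>()).add(r.item);
            }
        }
        return prefs;
    }

    public static void main(String[] args) {
        List<Rating> ratings = Arrays.asList(
                new Rating("u1", "a", 5), new Rating("u1", "b", 3),
                new Rating("u2", "a", 4), new Rating("u2", "c", 1));
        // Only the 4- and 5-star ratings survive as boolean preferences.
        System.out.println(positives(ratings, 4));
    }
}
```

The 3-star and 1-star rows are simply discarded here, not turned into negatives, which matches the "tossing the rest as ambiguous" choice.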

As to normalizing the indicators, I have been turning that off for that field in Solr. Have to re-think that. 

On Aug 29, 2014, at 10:50 AM, Ted Dunning <ted.dunning <at>> wrote:

On Fri, Aug 29, 2014 at 9:48 AM, Pat Ferrel <pat <at>> wrote:
A[A’B] leaves users in rows so my mistake and  sparsifyWithLLR(A’B) is item similarity output.

Not sure what you mean by A diag(itemAweights) 

Intuitively speaking is this taking A and replacing any values (preference strengths) with IDF weights?
Not TF-IDF because TF = 1. If so wouldn’t it be binarize(A) diag(itemAweights)? R noob so bear with me.

binarize(A) diag(itemAweights) gives x.idf scores with no normalization
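
To make the notation concrete, here is a minimal plain-Java sketch of binarize(A) diag(itemAweights) on a tiny dense matrix. It is not Mahout or R code: binarize and timesDiag are hypothetical helper names, and the weights are invented numbers standing in for per-item IDF weights.

```java
import java.util.Arrays;

final class BinarizeWeightSketch {
    /** Replaces every non-zero preference strength with 1.0. */
    static double[][] binarize(double[][] a) {
        double[][] out = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            out[i] = new double[a[i].length];
            for (int j = 0; j < a[i].length; j++) {
                out[i][j] = a[i][j] != 0.0 ? 1.0 : 0.0;
            }
        }
        return out;
    }

    /** Right-multiplies by diag(w), i.e. scales column j by w[j]. */
    static double[][] timesDiag(double[][] a, double[] w) {
        double[][] out = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            out[i] = new double[a[i].length];
            for (int j = 0; j < a[i].length; j++) {
                out[i][j] = a[i][j] * w[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{5, 0, 3}, {0, 2, 0}};   // users in rows, items in columns
        double[] itemWeights = {0.5, 2.0, 1.0};  // assumed values, not real IDF scores
        System.out.println(Arrays.deepToString(timesDiag(binarize(a), itemWeights)));
    }
}
```

Each non-zero cell ends up holding only its item's weight, so the original preference strengths no longer influence the result.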


Pat Ferrel | 28 Aug 20:16 2014

Content-based recommendations

When we do cooccurrence recs with a search engine we index:

    itemID, list-of-indicator-items

Then search on the indicator field with user item history.

Could we use a similar approach for content-based recs? Imagine a content site where we have run the text
through a pipeline that narrows input to important tokens (Lucene analyzer + LLR with a threshold of some
kind). Then this goes into RowSimilarity.

docID, list-of-important-terms 

docID, list-of-similar-docs

Then index the list-of-similar-docs and query with the user doc history. The idea is to personalize the
content-based recs rather than just show "docs like this one".
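
The index-then-query idea can be sketched without a search engine. The code below is an in-memory stand-in, not Solr or Lucene: each doc carries its list-of-similar-docs as the indicator field, the user's doc history is the query, and the score is a plain overlap count where a real engine would apply its own relevance scoring. The class name, method name, and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IndicatorSearchSketch {
    /** Ranks unseen docs by how many history docs appear in their indicator field. */
    static List<String> recommend(Map<String, Set<String>> indicators,
                                  Set<String> history, int n) {
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : indicators.entrySet()) {
            if (history.contains(e.getKey())) {
                continue; // do not recommend what the user has already read
            }
            int overlap = 0;
            for (String h : history) {
                if (e.getValue().contains(h)) {
                    overlap++;
                }
            }
            if (overlap > 0) {
                scores.put(e.getKey(), overlap);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest overlap first; ties broken by doc ID for a stable order.
        ranked.sort((x, y) -> {
            int byScore = Integer.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("d1", new HashSet<>(Arrays.asList("d2", "d3")));
        index.put("d2", new HashSet<>(Arrays.asList("d1", "d3")));
        index.put("d3", new HashSet<>(Arrays.asList("d1")));
        index.put("d4", new HashSet<>(Arrays.asList("d3")));
        index.put("d5", new HashSet<>(Arrays.asList("d9")));
        Set<String> history = new HashSet<>(Arrays.asList("d2", "d3"));
        System.out.println(recommend(index, history, 2));
    }
}
```

With this toy data, d1 matches both history docs and d4 matches one, so they are returned in that order, while d5 shares nothing with the history and is left out.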

Jozua Sijsling | 26 Aug 12:51 2014

ItemSimilarity based on k-means results

Dear Mahout users,

I am using a GenericBooleanPrefItemBasedRecommender where item similarity
is predetermined and based on k-means clustering results. I'm finding it
hard to rate the similarity of items even with the cluster data.

Given two items in the same cluster, I attempt to rate their similarity
based on the following calculation:

double distanceSquared = vector1.getDistanceSquared(vector2);
double radiusSquared = radius.getLengthSquared();
double diameterSquared = radiusSquared * 4;
double distanceRate = (diameterSquared - distanceSquared) /
double similarity = distanceRate * 2 - 1;
double result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors where each vector will
have thousands of dimensions.

I find that with this simple calculation I am making the false assumption
that *radius.getLengthSquared() * 4* is a valid means of determining the
maximum squared distance for any two points in a cluster. This might have
worked if the clusters were n-spheres having the same radius on every axis.
But that is not the case. Often *distanceSquared* will be much larger than
diameterSquared, so my assumption fails and so does my calculation.
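
The failure can be reproduced with a small self-contained sketch that uses plain double[] arrays in place of Mahout vectors. The divisor of distanceRate is cut off in the snippet above, so dividing by diameterSquared is an assumption here, and the class and helper names are made up.

```java
final class InClusterSimilaritySketch {
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double lengthSquared(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    static double similarity(double[] v1, double[] v2, double[] radius) {
        double distanceSquared = distanceSquared(v1, v2);
        double radiusSquared = lengthSquared(radius);
        double diameterSquared = radiusSquared * 4;
        // ASSUMPTION: the original divisor is truncated; diameterSquared is a guess.
        double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
        double similarity = distanceRate * 2 - 1;
        return Math.max(-1, Math.min(similarity, 1));
    }

    public static void main(String[] args) {
        double[] radius = {1, 1}; // radiusSquared = 2, diameterSquared = 8
        double[] origin = {0, 0};
        System.out.println(similarity(origin, new double[]{0, 0}, radius)); // identical points
        System.out.println(similarity(origin, new double[]{2, 0}, radius)); // distanceSquared = 4
        System.out.println(similarity(origin, new double[]{4, 0}, radius)); // distanceSquared = 16 > 8
    }
}
```

In the last call distanceSquared exceeds diameterSquared, so the raw score falls below -1 and the clamp flattens it to -1. Every pair that breaks the assumed maximum distance therefore looks equally dissimilar.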

Given that my approach does not work, how should I determine a 'similarity
score' based on in-cluster distance? Working with ~40k dimensions I fear
that any optimal solution is still going to be very slow, so I am perfectly

Wei Li | 25 Aug 08:22 2014


Hi Mahout users:

    We have tried the item-based CF recommender with user_id, item_id,
rating data, but the recommendation output contains fewer records than we
expected. For example, if we have 1000 users, the output should have 1000
records, one for each user, right?

Sergi - | 24 Aug 03:19 2014

Java heap space using TrainNaiveBayesJob

Hi everybody,

I am implementing a classifier that handles a large amount of data using naive Bayes, with EMR as the "hadoop
cluster". By large amount of data I mean that the final models are around 45GB. While the feature
extraction step works fine, calling TrainNaiveBayesJob results in the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.RandomAccessSparseVector.setQuick(
	at org.apache.mahout.math.VectorWritable.readFields(
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(
	at Source)
	at Source)
	at$5.hasNext(Unknown Source)
	at Source)
	at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(

It took me a little while to realize that the MR job from Naive Bayes finished fine on each of the reducers, but
after the reduce step the namenode gets the models from the reducers, loads them into memory, validates them
and, after the validation, serializes the models. My second thought was to increase the heap memory of the
namenode (using bootstrap-actions in EMR,
s3://elasticmapreduce/bootstrap-actions/configure-daemons --namenode-heap-size=60000) but
even with this setup I am receiving the same exception.

Has somebody dealt with a similar problem? Any suggestion? (other than trying a better master node). 

jay vyas | 23 Aug 03:51 2014

Extremely Slow ALS Recommender in 0.8, but faster in 0.9.

Hi mahout. I'm finding that in 0.8, my ALS recommender goes extremely slow,
but it goes very fast in 0.9.

- At the time the job slows down, there is virtually no disk I/O.
- The CPU cycles fluctuate up and down during this time, but they aren't at
100% the whole time.
- Mapper percentages increase by about 2% every hour. So it's essentially
a hanging job (i.e. the input data set is super small, and the exact same
command finishes in about 2 minutes when using mahout 0.9).

To look into it,

- I saw that, between these two releases, there is an improvement that's been
made:, but there is no
associated patch.

- Doing a quick git diff, it's clear that the VectorSumCombiner seems to
have been removed (probably so that mahout could a utility class vector

I'm wondering: has anyone ever seen the 0.8 ALS Recommender job progress
*very* slowly, so slowly that it's essentially hanging? Like I said, this
bug appears to be gone in 0.9.

Any thoughts would be appreciated.




Wei Li | 22 Aug 10:56 2014

Mahout LDA issue: small word probability in each topic

Hi All:

    I have successfully compiled Mahout 0.9 on Hadoop and run the LDA CVB
job. Most of the parameters were left at their default values and
--maxIter was set to 25. After we got the model, we found that the word
probabilities in each topic are quite small; most of them are about 0.00001
and roughly equal. Is that OK? Can we change alpha and beta to change the
word distribution in each topic? Thanks all :)
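
For reference when reading those numbers: the word probabilities of a topic sum to 1 over the vocabulary, so a topic that spreads its mass evenly over V terms gives each term 1/V. The vocabulary size is not stated in this message, so V = 100,000 below is purely an illustrative assumption.

```latex
\sum_{w=1}^{V} p(w \mid k) = 1
\quad\Longrightarrow\quad
\bar{p} = \frac{1}{V},
\qquad
V = 10^{5} \;\Rightarrow\; \bar{p} = 10^{-5} = 0.00001
```

Under that assumption, values near 0.00001 for almost every word would correspond to a nearly uniform topic, not to an error in the probabilities themselves.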