Mario Levitin | 18 Apr 00:02 2014

org.apache.mahout.math.IndexException

Hi,

I'm trying to run the ALS algorithm. However, I get the following error:

Exception in thread "pool-1-thread-3" org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647)
    at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395)
    at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305)

At line 305 in ALSWRFactorizer.java, there is the following code:

ratings.set((int) preference.getItemID(), preference.getValue());

My suspicion is that the error results from the cast to int in the above
line. Item IDs in Mahout are longs, so casting an ID that does not fit
into an int can yield a negative number, and hence the error.

However, this explanation also seems implausible to me, since I don't think
such an error would exist in the Mahout code.
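The suspected narrowing can be checked in isolation, outside Mahout. The following is a minimal sketch, not Mahout code: the class and method names are illustrative, and the item ID 3603089757 is a hypothetical value chosen only because it is larger than Integer.MAX_VALUE (the actual ID in the data set is not known).

```java
public class LongToIntCastDemo {

    // A narrowing cast keeps only the low 32 bits of the long,
    // so values above Integer.MAX_VALUE can come out negative.
    public static int narrow(long id) {
        return (int) id;
    }

    // Math.toIntExact (Java 8+) throws ArithmeticException instead of
    // silently wrapping; this helper reports whether the ID fits in an int.
    public static boolean fitsInInt(long id) {
        try {
            Math.toIntExact(id);
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical item ID, not taken from the original data.
        long hypotheticalItemId = 3603089757L;
        System.out.println("cast result: " + narrow(hypotheticalItemId));
        System.out.println("fits in int: " + fitsInInt(hypotheticalItemId));
    }
}
```

Running a check like fitsInInt over the item IDs of the input file would show whether any ID is affected by the cast.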

Any help will be appreciated.
Thanks

tuxdna | 17 Apr 13:47 2014

Is there any website documentation repository or tool for Apache Mahout?

I have seen the instructions here [1], but I am not sure whether the
source for the website documentation is available.

So here are my questions:

 * Does the Apache Mahout project use a tool to generate the website
documentation as it currently appears at http://mahout.apache.org?

 * Suppose I want to contribute a correction or edit to the current Apache
Mahout documentation. Can I get read-only access to the source of the
website, so that I can immediately see how my edits will look once
they are accepted?

I am thinking of the way GitHub Pages [2] works. For example, if I
use Jekyll [3], I can view the changes on my machine exactly as they will
appear on the final website.

Regards,
Saleem

[1] http://mahout.apache.org/developers/how-to-update-the-website.html
[2] https://pages.github.com/
[3] http://jekyllrb.com/

Sebastian Schelter | 17 Apr 11:41 2014

Re: Performance Issue using item-based approach!

Could you take the output of the precomputation, feed it into a 
standalone recommender and test it there?

On 04/17/2014 11:37 AM, Najum Ali wrote:
> @sebastian
>
>> Are you sure that the precomputation is done only once and not in every request?
> Yes, a @Bean-annotated object is a singleton instance by default in Spring.
> I also just tested it using a System.out.println().
> Here is my log:
>
> System.out.println("----> precomputation done!") is called before returning the
> GenericItemSimilarity.
>
> The first two recommendations are item-based, using Pearson similarity.
> The third and fourth log entries are also item-based, using precomputed similarity.
> The last log entry is the user-based recommender using Pearson similarity.
>
> Look at the huge time difference!
>
> On 17.04.2014 at 11:23, Sebastian Schelter <ssc@apache.org> wrote:
>
>> Najum,
>>
>> this is really strange, feeding an ItemBased Recommender with precomputed
>> similarities should give you superfast recommendations.
>>
>> Are you sure that the precomputation is done only once and not in every request?
>>

Najum Ali | 17 Apr 11:17 2014

Performance Issue using item-based approach!

Hi guys,

I have created a precomputed item-item similarity collection for a GenericItemBasedRecommender.
Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (589.5 ms instead of 1222.9 ms).
The user-based recommender, however, is really fast: about 24.2 ms. How can this happen?

Here are more details about my implementation:

CSV file: 1M preferences, 6040 users, 3706 items

I'm showing my implementation as screenshots because of the better syntax highlighting.
My recommender runs inside a web server (Jetty) using Spring 4 and Java 8. I receive recommendations from a web service (JSON).

For the DataModel, I'm using FileDataModel.


The code below creates a precomputed ItemSimilarity when I start the web server and the property isItemPreComputationEnabled is set to true:


For time measurement I'm using AOP. I measure the whole time from entering my controller to sending the response,
based on the difference between two System.nanoTime() calls. The same measurement is used for the user-based recommender.
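The measurement described above, taking the difference between two System.nanoTime() readings around the work, can be sketched as follows. This is not the AOP aspect from the post (which is not shown); NanoTimer and Timed are hypothetical names used only for illustration.

```java
import java.util.function.Supplier;

public class NanoTimer {

    // Holds a computed value together with the elapsed time in nanoseconds.
    public static final class Timed<T> {
        public final T value;
        public final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        public double elapsedMillis() {
            return elapsedNanos / 1_000_000.0;
        }
    }

    // Runs the work once and measures it with two System.nanoTime() readings.
    public static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T value = work.get();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(value, elapsed);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a recommender call.
        Timed<Integer> result = time(() -> {
            int sum = 0;
            for (int i = 0; i < 1_000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.printf("value=%d, elapsed=%.3f ms%n", result.value, result.elapsedMillis());
    }
}
```

A single reading like this also includes JVM warm-up on early calls, so comparing recommenders is usually more meaningful after several repeated requests.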

I have tried caching the recommender and the similarity, with no big difference. I also tried using CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but again there was no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity) throws TasteException {
    final int numberOfUsers = dataModel.getNumUsers();
    final int numberOfItems = dataModel.getNumItems();
    CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
    MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
    return model -> new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based takes so much longer than user-based. User-based is extremely fast. I even tried the MovieLens data sets with 100k and 10 million preferences. Every time, user-based is much faster for any similarity.

I hope someone can help me understand this. Maybe I'm doing something wrong.

Thanks!! :))



Jay Vyas | 16 Apr 21:31 2014

simple idea for improving mahout docs over the next month?

Hi Mahout... I finally thought of a really easy way to improve the Mahout
docs ad hoc, one that can feed into the efforts to get the formal docs improved.

Any interest in creating a shared Mahout FAQ file in a Google Doc?

We can easily start adding questions to it that point to obviously missing
parts of the documentation, and Mahout committers can add responses inline
below. Then, over time, we can take those questions and answers and turn them
directly into real docs.

I think this will make it easier for a broader range of people to rapidly
improve the Mahout docs in an ad hoc sort of way. I, for one, volunteer to
help translate the Q&A stream into "real" documentation, JIRAs, etc.

-- 
Jay Vyas
http://jayunit100.blogspot.com

Sebastian Schelter | 13 Apr 13:49 2014

Documentation, Documentation, Documentation

Hi,

this is another reminder that we still have to finish our documentation
improvements! The website looks shiny now and there have been lots of
discussions about new directions, but we still have some work to do in
cleaning up web pages. We should especially make sure that the examples work.

Please help with that: anyone who is willing to sacrifice some time to go
through a web page and try out the steps described is of great help to
the project. It would also be awesome to get some help in creating a few
new pages, especially for the recommenders.

Here's the list of documentation-related JIRA issues for 1.0:

https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best,
Sebastian

Nicklas Nilsson | 13 Apr 08:52 2014

Distance Measure in K-Means Java

Hi,

I am trying to cluster text with Canopy and K-Means. This is what I have, and it works. But I'm curious
whether I should instead run K-Means with Tanimoto and Canopy with Euclidean. Which distance measure is
K-Means using in my setup? And why has the distance-measure parameter been removed from KMeansDriver's run method?

        //Generate input clusters for K-means (instead of using random K)
        CanopyDriver.run(conf,
                         TFIDF_VECTORS_PATH,
                         OUTPUT_PATH,
                         new TanimotoDistanceMeasure(),
                         t1,
                         t2,
                         runClusteringFalse,
                         clusterClassificationThreshold,
                         runSequential);

        //Generate K-Means clusters
        KMeansDriver.run(conf, 
                         TFIDF_VECTORS_PATH,
                         new Path(OUTPUT_PATH,"clusters-0-final"),
                         KMEANS_OUTPUT_PATH,
                         convergenceDelta,
                         maxIterations,
                         runClustering,
                         clusterClassificationThreshold,
                         runSequential);

I'm wondering about this because I read that Canopy works well with a fast distance measure, so I was thinking
of using Euclidean for Canopy and Tanimoto for K-Means. This is probably totally wrong, but it would be great
if someone could explain it.
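For reference, the two measures mentioned above can be written out on plain arrays. This is only a sketch of the standard formulas, not Mahout's DistanceMeasure implementations: the class name is illustrative, and the Tanimoto distance is taken here as 1 - (a.b) / (|a|^2 + |b|^2 - a.b). It does not say which measure suits Canopy or K-Means better.

```java
public class DistanceMeasureDemo {

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean distance: square root of the summed squared differences.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    public static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        double denominator = dot(a, a) + dot(b, b) - ab;
        if (denominator == 0.0) {
            return 0.0; // two zero vectors are treated as identical here
        }
        return 1.0 - ab / denominator;
    }

    public static void main(String[] args) {
        // Toy term-weight vectors, invented for illustration.
        double[] a = {1.0, 0.0, 2.0};
        double[] b = {2.0, 0.0, 4.0};
        System.out.println("euclidean = " + euclidean(a, b));
        System.out.println("tanimoto  = " + tanimoto(a, b));
    }
}
```

Both functions make one pass over the vectors; they differ in what they treat as "close", since the Tanimoto distance depends on the overlap of the two vectors relative to their magnitudes rather than on the absolute difference of their components.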

Thank you!

Best regards,
Nicklas

Mike Summers | 11 Apr 16:41 2014

PreferenceArray userID uniqueness?

Does the userID of a PreferenceArray need to be unique across all entries
in a FastByIDMap?

I'm comparing two types of objects that contain the same set of traits;
however, it's possible that the userID (primary key) is not unique, as the
objects come from two DB tables.

Thanks.

Terry Blankers | 11 Apr 05:33 2014

lucene2seq error: field does not exist in the index

Hi all, I'm very new to lucene2seq, so I'm not sure whether it's
just user error, but I'm seeing some unexpected behavior when
running lucene2seq against my Solr index (4.7.1). I've tried both
0.9 and the trunk build of Mahout. (By the way, I have been able to
run the Reuters example successfully as a test baseline.)

Here's the command I'm running:

    $MAHOUT_HOME/bin/mahout lucene2seq \
        -i /home/ec2-user/solr/solr-data/solrindex/index \
        -o solr/sequence -id key_sha1hex -f body -xm sequential \
        -q topics:diabetes -n 500

Excerpts from my Solr schema:

<field name="content" type="text" stored="false" indexed="true" multiValued="true"/>
<field name="body" type="string" stored="true" indexed="false"/>

<!-- Use the indexed/un-stored "content" field for searching -->
<copyField source="body" dest="content" />
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>content</defaultSearchField>

When I use SolrAdmin and specify fl=body, the search handler returns the
'body' field with data as expected. Yet I get the following error when
running lucene2seq with '-f body':

    IllegalArgumentException: Field 'body' does not exist in the index

And if I specify '-f content', lucene2seq runs without errors or 
warnings, but seqdumper output shows no values for any key:

    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
    Key: 96C4C76CF9D7449C724CA77CB8F650EAFD33E31C: Value:
    Key: D6842B81B8D09733B50BEDB4767C2A5C49E43B20: Value:
    Key: 61CB95FEE2C6BF0AC6E8A1F7738338CA36F42264: Value:
    Key: 0F9903B72A7C9F0373A5171403B3AAEB291B16E1: Value:

Can anyone give me any suggestions as to how to track down what might be 
happening here?

Many thanks,

Terry

Jessie Wright | 11 Apr 00:21 2014

Random Forest Example: trouble getting Describe to work

Currently I'm trying to follow the Random Forest example on the website
(https://mahout.apache.org/users/stuff/partial-implementation.html).

When I try to run Describe on the data, I get this error:
[values missing for attribute 1]
Error here: http://pastebin.com/QGCCqvc5

It seems that the possible values for the categorical attributes aren't
there, so I tried ignoring all categorical attributes, and then I got a
NullPointerException.

Error here: http://pastebin.com/56AEKhse

I want to use this with my own data set, and I get the same
errors with my data.

Can anyone advise if I'm doing something wrong, or if there's a workaround
I should be using?

Thank you,
Jessie

Harith Kumara | 10 Apr 11:01 2014

Mahout 0.9 examples assume MAHOUT_JAVA_HOME

When I downloaded Mahout 0.9 and started running the examples, they gave a "JAVA_HOME not found" error. However,
the scripts check MAHOUT_JAVA_HOME instead of JAVA_HOME. When I temporarily exported
MAHOUT_JAVA_HOME (pointing to JAVA_HOME), it worked. This is my first post here, so if
this should be posted to the dev list, please let me know.
