Suneel Marthi | 3 Aug 08:17 2015

[VOTE] Apache Mahout 0.11.0 Release Candidate

This is the vote for release 0.11.0 of Apache Mahout.

The vote will be open for at least 72 hours and will close on Thursday,
August 6th, 2015. Please download, test, and vote with:

[ ] +1, accept RC as the official 0.11.0 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.11.0 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1013/

The git tag to be voted upon is release-0.11.0
Suneel Marthi | 3 Aug 01:42 2015

[VOTE] Apache Mahout 0.11.0 Release candidate

This is the vote for release 0.11.0 of Apache Mahout.

The vote will be open for at least 72 hours and will close on Wednesday,
August 5th, 2015. Please download, test, and vote with:

[ ] +1, accept RC as the official 0.11.0 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.11.0 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1012

The git tag to be voted upon is release-0.11.0
Suneel Marthi | 1 Aug 05:56 2015

[VOTE] Apache Mahout 0.10.2 Release Candidate

This is a call for votes on the Mahout 0.10.2 release candidate, available at
https://repository.apache.org/content/repositories/orgapachemahout-1011

We need at least 3 PMC +1 votes for the RC to pass. Voting runs until Sunday,
Aug 2, 2015.

Please verify the following:

1. Sigs and hashes of release artifacts (Ted/Drew/Grant/Stevo)
2. AWS testing of {src, bin} * {tar, zip} (Andrew?)
3. Integration testing of {src, bin} * {tar, zip} (Suneel/AP/)
4. Run through the examples and scripts
Suman Somasundar | 28 Jul 01:09 2015

IndexOutOfBoundsException in ALS recommendation

Hi,

I am getting the following error when I try to run ALS.

I get this exception in the userRatings job, after the itemRatings job completes.

15/07/24 18:11:01 INFO mapreduce.Job: Task Id : attempt_1437779248212_0003_m_000064_6, Status : FAILED

Error: java.lang.ArrayIndexOutOfBoundsException: -718978944
        at org.apache.hadoop.io.WritableComparator.readInt(WritableComparator.java:211)
        at org.apache.hadoop.io.IntWritable$Comparator.compare(IntWritable.java:91)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1265)
        at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:99)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1594)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1482)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:720)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:790)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)

Eje Svensson | 27 Jul 16:40 2015

Setup questions for Mahout with Spark

Hello,

I'm planning to set up a system with Mahout and Spark and would like some
opinions regarding which versions of the components to use.

The target system is a Red Hat Enterprise Linux box.

What I would like to know is:
* What version of Java are people using? jre-1.7.0-openjdk?
* Should I use the last released binary, mahout-distribution-0.10.1, or
compile the latest git head?
* What matching Spark version, and with which backend engine, should I use?

Thanks in advance,
Eje Svensson

Ankit Goel | 23 Jul 02:53 2015

Mahout on the cloud

Hi,
After my runs on my laptop, I'm ready to port my work to the cloud, and I'm
planning to use Amazon. One thing I noticed when I started with Mahout is
that a lot of things were left unsaid on the site/wiki and took me a lot of
time to figure out. Pitfalls, if I may call them that. I will primarily be
doing clustering in the cloud, so the code to accept new data and run it is
what I have for now.

So before I port to the cloud, is there anything I should beware of or look
out for? Is AWS fine with Mahout? Are there any configurations I should
remember? Any advice on implementation to ease my transition and run Mahout
24 hours a day? Thanks

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
TheGeorge1918 . | 22 Jul 10:35 2015

The label of the output of random forest in Mahout

Hi all,

I have a question about the labels in the output of random forest. Suppose I
do a binary classification with labels 0 and 1. In my data description file,
I have something like
{"values":["1","0"],"label":true,"type":"categorical"}. Label 1 is at
index 0 and label 0 at index 1. Is it possible that in the output file (.out)
the labels are swapped?

I checked the source code of the Mahout Classifier
(mr/src/main/java/org/apache/mahout/classifier/df/mapreduce/Classifier.java).
In the "parseOutput" function, it writes the result directly to the file
without trying to map it back to the right label.

In TestForest
(examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/TestForest.java),
if I specify the -a parameter, it will output a confusion matrix. There, it
looks like the right label is obtained by calling dataset.getLabelString().
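
A minimal plain-Scala sketch of the index-to-label mapping described above,
assuming the descriptor from this post; the variable names are made up for
illustration:

    // From {"values":["1","0"], ...}: the position in "values" is the
    // internal index that parseOutput writes out
    val values = Array("1", "0")
    val predictedIndex = 1
    // Mapping back to the user-supplied label string, which is what the
    // confusion matrix does via dataset.getLabelString(...)
    val predictedLabel = values(predictedIndex)  // "0"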

So, my conclusion is that the confusion matrix is always right (the
user-provided labels are used to compute it). However, the prediction output
could carry a different label than the one the user supplied. Is that right?

Thanks a lot

Best

Xuan

Ankit Goel | 21 Jul 05:17 2015

Partial Solr Index Clustering

Hi,
I was wondering if it's possible to use only part of a Solr index for
clustering. For example, my crawler updates my Solr index every hour with
new documents, and I just want to cluster those new documents, not the old
ones. If I were programming normally, I could query Solr for the latest
documents with a time constraint and then pass them as vectors to my
clustering program. But since Mahout accepts Solr indices directly, I
thought there might be a simpler way.
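
If you do go the query route, here is a minimal SolrJ (4.x) sketch in Scala;
the core URL and the timestamp field name ("tstamp") are assumptions, not
something Mahout or Solr prescribes:

    import org.apache.solr.client.solrj.SolrQuery
    import org.apache.solr.client.solrj.impl.HttpSolrServer

    // Fetch only the documents indexed in the last hour
    val server = new HttpSolrServer("http://localhost:8983/solr/collection1")
    val query = new SolrQuery("*:*")
    query.addFilterQuery("tstamp:[NOW-1HOUR TO NOW]")  // field name is a guess
    query.setFields("id", "content")
    query.setRows(1000)
    // SolrDocumentList of recent docs, ready to vectorize for clustering
    val docs = server.query(query).getResults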

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
Ankit Goel | 21 Jul 02:18 2015

Kmeans clusterdump Interpretation

Hi,
I've been messing with Mahout 0.10 and k-means clustering with a Solr 4.6.1
index. The data is news articles. The --field option for k-means is set to
"content". The idField is set to "title" (just so I can analyse it faster).
The clusterdump of the k-means result gives me proper output, but I can't
figure out the id of the vector chosen as the center. There are only 14-15
articles, so I am not hung up on cluster performance at this time.

I used random seeds for the k-means command line.
For reference, this is the clusterdump command line I am executing:

bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
-p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5

The output I get is of the form:

:{"r":

top terms

xxxxx==>xxxxx

Weight : [props - optional]:  Point:

 1.0 : [distance=0.0]: [{"account":0.026}.......other features]

1.0 : [distance=0.3963903651622338]: [....]

So how exactly do I get the centroid id? I have even tried accessing it
with Java.
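
One way to inspect the cluster ids and centers directly is to read the final
clusters sequence file. A hedged sketch in Scala, assuming the usual k-means
output layout (IntWritable keys, ClusterWritable values) and an illustrative
part-file path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile}
    import org.apache.mahout.clustering.iterator.ClusterWritable

    val conf = new Configuration()
    val reader = new SequenceFile.Reader(conf,
      SequenceFile.Reader.file(new Path("testCluster/clusters-3-final/part-r-00000")))
    val key = new IntWritable()
    val value = new ClusterWritable()
    // Each entry is one cluster: its numeric id and its center vector
    while (reader.next(key, value)) {
      val cluster = value.getValue
      println(s"cluster id=${cluster.getId} center=${cluster.getCenter}")
    }
    reader.close()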

Rodolfo Viana | 20 Jul 22:40 2015

java.lang.OutOfMemoryError with Mahout 0.10 and Spark 1.1.1

I’m trying to run Mahout 0.10 with Spark 1.1.1.
I have input files of 8k, 10M, 20M, and 25M.

So far I have run with the following configurations:

8k with 1, 2, 3 slaves
10M with 1, 2, 3 slaves
20M with 1, 2, 3 slaves

But when I try to run
bin/mahout spark-itemsimilarity --master spark://node1:7077 --input
filein.txt --output out --sparkExecutorMem 6g

with the 25M file, I get this error:

java.lang.OutOfMemoryError: Java heap space

or

java.lang.OutOfMemoryError: GC overhead limit exceeded

Is that normal? When I was running 20M I didn’t get any error, and now
I have only 5M more.

Any ideas why this is happening?

-- 
Rodolfo de Lima Viana
Undergraduate in Computer Science at UFCG

Pat Ferrel | 17 Jul 19:21 2015

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I use it as a trait to extend your Scala main object, or wherever you have your entry point. Then just call
the method to get a SparkContext and MahoutDistributedContext created and made available as implicits.

This code creates a Spark context, so do it before any distributed operations that need a context, and do
not create one separately. You can just copy the code if you don’t want to use the trait.
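
A minimal sketch of what such a trait could look like against the Mahout
Spark bindings; the trait name, master URL, and app name below are
illustrative, not Pat's actual code:

    import org.apache.mahout.math.drm.DistributedContext
    import org.apache.mahout.sparkbindings._

    // Mixing this in gives the entry point an implicit Mahout
    // DistributedContext (which wraps the SparkContext it creates)
    trait MahoutContext {
      implicit lazy val mc: DistributedContext =
        mahoutSparkContext(masterUrl = "local[2]", appName = "RowSimTest")
    }

    object Main extends MahoutContext {
      def main(args: Array[String]): Unit = {
        // distributed operations that need the implicit context go here
      }
    }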

BTW, new versions of item and row similarity are going into the master branch this weekend that should run
a fair bit faster. Master now runs on Spark 1.3.1, and even 1.4, and includes many optimizations in the
matrix ops.

On Jul 17, 2015, at 7:03 AM, Hegner, Travis <THegner <at> trilliumit.com> wrote:

Pat,

I appreciate your trying to figure it out. I have also been unable to reproduce this error when using a local
(even threaded) master. I have only gotten it to occur when running on the YARN cluster. I am actually in
the process of building a new Hadoop/YARN/Spark cluster from scratch, and will test it out there also. My
old cluster is up to date, but has been upgraded many times. Perhaps I'll have better luck with the new one.

I'm a little confused about where to put, or how to use, the snippet you provided (sorry, still new to
Scala). Can you describe that in the context of the RowSimTest project on GitHub? Maybe even a pull request
to it, if you are really feeling generous! Just something to give me an idea of how to integrate it; even if
it's non-working, I can figure it out from there. I can then apply it to my actual codebase and see if it
makes a difference with a full dataset.

For the time being, I have reverted to swapping my <tag>, <document_ids> pairs in a map and running them
through cooccurrencesIDSs(), just to move forward with my project. I do want to get this solved, though,
for the betterment of the community.
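
A plain-Scala sketch of the swap being described; the sample pairs are
invented, and in the real job this would be an RDD whose swapped rows feed
SimilarityAnalysis.cooccurrencesIDSs():

    // (tag, docId) pairs as produced upstream -- invented sample data
    val tagToDoc = Seq(("scala", "doc1"), ("spark", "doc1"), ("scala", "doc2"))

    // Swap to (docId, tag): rows become documents, columns become tags,
    // the orientation needed to compute tag-to-tag cooccurrence
    val docToTag = tagToDoc.map { case (tag, doc) => (doc, tag) }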

Thanks again!


