Kevin Zhang | 28 Jan 00:07 2015

java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path

Thanks to Dmitriy for answering my previous question regarding the Spark version. I downgraded to
spark-1.1.0-bin-hadoop2.4 and ran my command "mahout spark-itemsimilarity -i
./mahout-input/order_item.tsv -o ./output -f1 purchase -f2 view -os -ic 2 -fc 1 -td ," again. This time I
got the error "java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path" as attached. I'm
on a Mac.

Thanks for the help

at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(
at java.lang.Runtime.loadLibrary0(
at java.lang.System.loadLibrary(
at org.xerial.snappy.SnappyNativeLoader.loadLibrary(
... 26 more
15/01/27 14:54:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(
at org.xerial.snappy.Snappy.<clinit>(
at org.xerial.snappy.SnappyOutputStream.<init>(

Kevin Zhang | 27 Jan 21:04 2015

spark-itemsimilarity: Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.event-handlers'


I'm new to Spark and Mahout. I just tried to run spark-itemsimilarity, but had no luck.

Here is what I did.

1. Cloned the Mahout git project
2. Ran "mvn install"
3. Set $MAHOUT_HOME to the project path
4. Added $MAHOUT_HOME/bin to the PATH variable
5. Downloaded Spark and extracted it to a directory
6. Set $SPARK_HOME to the directory ..../spark-1.2.0-bin-hadoop2.4
7. Ran the command "mahout spark-itemsimilarity -i ./mahout-input/order_item.tsv -o ./output -f1
purchase -f2 view -os -ic 2 -fc 1 -td ,"

Below is the output. It says "Exception in thread "main"
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.event-handlers'"

Any help or suggestion is highly appreciated.


MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
log4j:WARN No appenders could be found for logger (org.apache.mahout.sparkbindings).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
Using Spark's default log4j profile: org/apache/spark/
SLF4J: Class path contains multiple SLF4J bindings.

Marko Dinic | 23 Jan 11:48 2015

K-means implementation

Hello everyone,

I was digging through the K-means implementation on Hadoop and I'm a bit
confused about one thing, so I wanted to check.

To calculate the distance from a point to all centroids, the centroids need to
be accessible from every mapper. So it seemed logical to me to put the
centroids (a sequence file) into the Distributed Cache.

But it seems that it isn't implemented like that; the sequence file is used
like an ordinary file on HDFS. My understanding is that the centroids file is
distributed like any other file on HDFS, so every mapper needs to read it by
contacting each of the data nodes on which the file is stored.

Please correct me if I'm wrong or if I have misread the code, but if not, why
is it like that? Wouldn't it make more sense to use the Distributed Cache,
since every mapper needs the centroids file? I guess one problem is that you
would need to copy the centroids to the Distributed Cache in each iteration
(because they change), but that still seems faster than reading the file from
the distributed file system.

If I'm not right, can anyone please explain how it is really implemented?
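To make the idea concrete, the pattern I have in mind looks roughly like this against the Hadoop 2 cache API (just a sketch; the path and the readCentroids helper are made up):

```java
import java.io.IOException;
import java.net.URI;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Driver side would re-ship the current centroids every iteration, e.g.:
//   job.addCacheFile(new URI("/kmeans/centroids-iter7.seq"));  // made-up path

public class AssignToNearestMapper
        extends Mapper<LongWritable, VectorWritable, IntWritable, VectorWritable> {

    private List<Vector> centroids;

    @Override
    protected void setup(Context context) throws IOException {
        // Local replica of the cached file: read once per map task,
        // instead of pulling the HDFS file over the network.
        URI[] cached = context.getCacheFiles();
        centroids = readCentroids(new Path(cached[0].getPath()));  // hypothetical helper
    }

    private List<Vector> readCentroids(Path p) throws IOException {
        // ... open the local sequence file and collect the centroid vectors ...
        throw new UnsupportedOperationException("sketch only");
    }
}
```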


zjy | 23 Jan 09:54 2015

mahout 1.0 ERROR common.AbstractJob: Unexpected


Everyone, when I run the following command with compiled Mahout 1.0 on CentOS 6, I get an error. Does somebody know
how to solve this issue?

$MAHOUT cvb    \
 -i ${WORK_DIR}/reuters-out-matrix/matrix    \
 -o ${WORK_DIR}/reuters-lda -x 20 -k 21 -ow    \
 -dict ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-*     \
 -dt ${WORK_DIR}/reuters-lda-topics    \
 -mt ${WORK_DIR}/reuters-lda-model

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/hadoop/default/bin/hadoop and HADOOP_CONF_DIR=/opt/hadoop/default/etc/hadoop
MAHOUT-JOB: /home/mobile/zhangjianyi/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/01/23 16:46:38 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line
arguments only
15/01/23 16:46:38 ERROR common.AbstractJob: Unexpected 20 while processing Job-Specific Options:
Unexpected 20 while processing Job-Specific Options:         

Thank you !
David Benoff | 23 Jan 03:16 2015

No results using PlusAnonymousUserDataModel

Hi all,

I am not able to generate user recommendations for an anonymous user and would very much appreciate any
advice as to how to proceed.

Starting with a simple working example based on this, I get no results
returned when running the code below. I DO obtain results when I pass in a user ID that exists in the input file.

I also tried using PlusAnonymousConcurrentUserDataModel, without success.

        DataModel fileDataModel = new FileDataModel(new File(csvpath));
        PlusAnonymousUserDataModel plusModel = new PlusAnonymousUserDataModel(fileDataModel);
        UserSimilarity similarity = new PearsonCorrelationSimilarity(plusModel);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, plusModel);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(plusModel, neighborhood, similarity);

        PreferenceArray preferenceArray = new GenericUserPreferenceArray(6);
        preferenceArray.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
        preferenceArray.setItemID(0, 10);
        preferenceArray.setValue(0, 1f);

        Object result = recommender.mostSimilarUserIDs(PlusAnonymousUserDataModel.TEMP_USER_ID, 3);
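One thing I wasn't sure about: should the anonymous preferences be registered with the model first? Something like the following (assuming setTempPrefs/clearTempPrefs are the right calls for this; also note my GenericUserPreferenceArray is sized for 6 preferences but only one is filled in, which may itself be a problem):

```java
// Register the anonymous user's preferences with the model *before* asking
// for recommendations; without this the temp user has no preferences at all.
plusModel.setTempPrefs(preferenceArray);
try {
    List<RecommendedItem> items =
        recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 3);
} finally {
    // The temp-user slot is shared, so always clear it afterwards.
    plusModel.clearTempPrefs();
}
```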

Emre Sevinc | 18 Jan 21:56 2015

Is there a built-in command for making predictions based on a logistic regression model?


I have used the

  mahout trainlogistic

command-line tool to train a model on a CSV file, and then

  mahout runlogistic

to observe the model's performance on test data.

Now, I want to make predictions by using another CSV file that does not
include the target variable.

Is there such a command line tool in Mahout? Something that can be run like

  mahout makeprediction modelFile newFileWithoutTargetVariable.csv

and prints something like:

  output1, probability1
  output2, probability2
  output3, probability3

or maybe just the probabilities?
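If no such command exists, I could presumably write a small driver along the lines of what runlogistic does internally. A sketch of what I have in mind (class and method names are from the 0.9 SGD package; treat the exact calls as an assumption and check them against RunLogistic's source — one caveat is that CsvRecordFactory seems to expect the target column to be present, so a dummy target column may be needed):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class MakePrediction {
    public static void main(String[] args) throws IOException {
        // args[0] = model file written by trainlogistic, args[1] = new CSV file
        LogisticModelParameters lmp = LogisticModelParameters.loadFrom(new File(args[0]));
        CsvRecordFactory csv = lmp.getCsvRecordFactory();
        OnlineLogisticRegression lr = lmp.createRegression();

        BufferedReader in = new BufferedReader(new FileReader(args[1]));
        csv.firstLine(in.readLine());              // header row defines the columns
        String line;
        while ((line = in.readLine()) != null) {
            Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
            csv.processLine(line, v);              // encode one record
            System.out.println(lr.classifyScalar(v));  // probability of the positive class
        }
        in.close();
    }
}
```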


Emre Sevinç

Emre Sevinc | 18 Jan 16:39 2015

Why doesn't Mahout logistic regression give a good AUC when the model is tested on training data?


I'm using the logistic regression of Mahout (version 0.9), but when I evaluate
the created model on the same data set it was trained on, I do not see a
high value for AUC. I would expect it to be very high, since it is the same
data set.

My data set is a CSV file with about 7 million lines and has 18 attributes,
some numerical and some categorical.

This is how I create the model for logistic regression (I ignore some of
the attributes):

$ mahout trainlogistic --input train.csv \
--output ./model \
--categories 2 \
--predictors attribute1 ... attribute15 \
--types w w w n n w w w w w w w n n n \
--target is_delayed \
--rate 100 \
--passes 2 \
--features 500000

And then when I check the AUC value using the model on the same data set:

$ mahout runlogistic --input train.csv --model ./model --auc --confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar

Johannes Schulte | 16 Jan 09:56 2015

Regularize calls in AbstractOnlineLogisticRegression


I see that a single training instance is used for regularization twice.
First in train():

    // push coefficients back to zero based on the prior


and then inside the gradient implementation, when classifying the instance
with the current model:

    // what does the current model say?

    Vector v = classifier.classify(instance);

which then calls

  @Override

  public Vector classifyNoLink(Vector instance) {

    // apply pending regularization to whichever coefficients matter


Since there is no seal() or unseal() call, the regularization is applied two
times, right? Is this intended?
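To make the concern concrete with a toy example (plain Java, not Mahout's actual code): with a simple multiplicative-shrinkage prior, applying the regularization twice per step shrinks a coefficient by the square of the intended factor.

```java
public class DoubleShrink {
    // One regularization step: pull the weight toward zero by a factor (1 - lambda).
    static double regularize(double w, double lambda) {
        return w * (1.0 - lambda);
    }

    public static void main(String[] args) {
        double w = 1.0, lambda = 0.1;
        double once  = regularize(w, lambda);    // intended per-step shrinkage: 0.9
        double twice = regularize(once, lambda); // applied a second time: 0.81 = (1 - lambda)^2
        System.out.println(once + " vs " + twice);
    }
}
```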


Pasmanik, Paul | 15 Jan 18:46 2015

mahout 1.0 on EMR with spark

Has anyone tried running Mahout 1.0 on EMR with Spark?
I've used the instructions at to get an EMR cluster running Spark. I am now able to deploy an EMR cluster with Spark using the AWS Java APIs.
EMR allows running a custom script as a bootstrap action, which I can use to install Mahout.
What I'm trying to figure out is whether I need to build Mahout every time I start an EMR cluster, or whether I should
pre-build the artifacts and write an install script similar to the one awslabs uses for Spark.


ARROYO MANCEBO David | 15 Jan 14:11 2015

Own recommender

Hi folks,
How can I start building my own recommender system in Apache Mahout with my personal algorithm? I need a
custom UserSimilarity. Maybe a subclass of an existing UserSimilarity implementation like PearsonCorrelationSimilarity?

Regards :)
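To be concrete, is implementing the interface directly the intended route, rather than subclassing? A skeleton of what I mean (method names taken from the Taste API; the class itself is hypothetical):

```java
import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Hypothetical custom similarity implementing the Taste UserSimilarity interface.
public class MyUserSimilarity implements UserSimilarity {
    private final DataModel model;

    public MyUserSimilarity(DataModel model) {
        this.model = model;
    }

    @Override
    public double userSimilarity(long userID1, long userID2) throws TasteException {
        // The personal algorithm goes here; it must return a value in [-1, 1].
        return 0.0;
    }

    @Override
    public void setPreferenceInferrer(PreferenceInferrer inferrer) {
        // Optional: ignore if the algorithm does not infer missing preferences.
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // Reload any cached state when the underlying data changes.
    }
}
```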
unmesha sreeveni | 15 Jan 07:06 2015

How to partition a file to smaller size for performing KNN in hadoop mapreduce

In KNN-like algorithms we need to load the model data into the cache for predicting the records.

Here is the example for KNN.

So if the model is a large file, say 1 or 2 GB, we will be able to load it into the Distributed Cache.

One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e. 1st record <Distance> partition1, partition2, ...
2nd record <Distance> partition1, partition2, ...

This is what came to my mind.

Is there any other way?

Any pointers would help me.
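A local (non-MapReduce) sketch of the partition-and-merge idea above, with made-up data: each partition yields its k best (distance, label) candidates, and a final merge takes the global top k and votes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedKnn {

    // Euclidean distance between two feature vectors.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // For one partition of the model, return the k best {distance, label} pairs.
    static List<Object[]> partialKnn(double[] q, double[][] pts, String[] labels, int k) {
        List<Object[]> cand = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) cand.add(new Object[]{dist(q, pts[i]), labels[i]});
        cand.sort(Comparator.comparingDouble((Object[] o) -> (double) o[0]));
        return cand.subList(0, Math.min(k, cand.size()));
    }

    // Merge the per-partition candidates and majority-vote over the global top k.
    static String merge(List<List<Object[]>> partials, int k) {
        List<Object[]> all = new ArrayList<>();
        for (List<Object[]> p : partials) all.addAll(p);
        all.sort(Comparator.comparingDouble((Object[] o) -> (double) o[0]));
        Map<String, Integer> votes = new HashMap<>();
        for (Object[] c : all.subList(0, Math.min(k, all.size())))
            votes.merge((String) c[1], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[] q = {0, 0};
        // two "partitions" of the model data
        List<List<Object[]>> partials = Arrays.asList(
            partialKnn(q, new double[][]{{1, 0}, {5, 5}}, new String[]{"A", "B"}, 3),
            partialKnn(q, new double[][]{{0, 2}, {0, 1}}, new String[]{"A", "A"}, 3));
        System.out.println(merge(partials, 3)); // prints A
    }
}
```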

Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham