go canal | 3 Oct 11:14 2015

Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath"

Hello, I am running a very simple Mahout application in Eclipse, but got this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath". Is SPARK_HOME defined?
I have SPARK_HOME defined in Eclipse as an environment variable with the value /usr/local/spark-1.5.1.
What else do I need to include or set?

 thanks, canal
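
A minimal sketch for checking the environment that the Eclipse run configuration actually passes to the JVM. It assumes (this is not confirmed anywhere in this thread) that the Spark bindings locate and run bin/mahout under MAHOUT_HOME to obtain the classpath, so MAHOUT_HOME may be needed in addition to SPARK_HOME. The class name MahoutEnvCheck is hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MahoutEnvCheck {
    /** Returns human-readable problems found in the given environment map. */
    static List<String> problems(Map<String, String> env) {
        List<String> out = new ArrayList<>();
        // Assumption: both variables must point at existing directories.
        for (String name : new String[] {"SPARK_HOME", "MAHOUT_HOME"}) {
            String value = env.get(name);
            if (value == null || value.isEmpty()) {
                out.add(name + " is not set");
            } else if (!new File(value).isDirectory()) {
                out.add(name + " is not a directory: " + value);
            }
        }
        // Assumption: the launcher script lives at $MAHOUT_HOME/bin/mahout.
        String mahoutHome = env.get("MAHOUT_HOME");
        if (mahoutHome != null && !mahoutHome.isEmpty()) {
            File script = new File(new File(mahoutHome, "bin"), "mahout");
            if (!script.canExecute()) {
                out.add("not executable: " + script.getPath());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> found = problems(System.getenv());
        if (found.isEmpty()) {
            System.out.println("environment looks complete");
        }
        for (String p : found) {
            System.out.println(p);
        }
    }
}
```

Running main from the same Eclipse run configuration as the failing application shows whether the variables are visible to that JVM, which can differ from what a login shell sees.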
go canal | 3 Oct 05:47 2015

matrix inversion in plan ?

Hi, I saw that some distributed matrix functions are included in Samsara now. Is there a plan to support
matrix inversion? BTW, am I correct that it is distributed-memory based, not out-of-core? thanks, canal
Ankit Goel | 22 Sep 01:44 2015

Modifying kmeans algo

Hi all,
If one wanted to modify the kmeans algorithm given with the mahout package,
how would/should one go about doing it?

Also what is the function that can be used to find the median point between
2 or more vectors? As in I want the median point in vector format so that I
can use it as a new center maybe.
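
A minimal sketch of two readings of "median point", using plain double[] arrays rather than Mahout's vector classes (the class and method names here are hypothetical). The component-wise mean is the centroid that standard k-means uses as a center; the component-wise median is the variant used by k-medians.

```java
import java.util.Arrays;

class VectorCenters {
    /** Component-wise mean (centroid) of equally sized vectors. */
    static double[] mean(double[][] vectors) {
        int dim = vectors[0].length;
        double[] out = new double[dim];
        for (double[] v : vectors) {
            for (int j = 0; j < dim; j++) {
                out[j] += v[j];
            }
        }
        for (int j = 0; j < dim; j++) {
            out[j] /= vectors.length;
        }
        return out;
    }

    /** Component-wise median of equally sized vectors. */
    static double[] median(double[][] vectors) {
        int n = vectors.length;
        int dim = vectors[0].length;
        double[] out = new double[dim];
        double[] column = new double[n];
        for (int j = 0; j < dim; j++) {
            // Collect the j-th coordinate of every vector and sort it.
            for (int i = 0; i < n; i++) {
                column[i] = vectors[i][j];
            }
            Arrays.sort(column);
            // Middle element for odd n, average of the two middle ones for even n.
            out[j] = (n % 2 == 1)
                    ? column[n / 2]
                    : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return out;
    }
}
```

The mean is pulled toward outliers while the median is not, which is the usual reason to prefer one over the other as a new center.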



Ankit Goel
Disa Mhembere | 17 Sep 02:54 2015

examples/bin/cluster-reuters.sh fails on k-means: Option (1)

Hello all,

I'm running mahout 0.11.1 on an ubuntu 14.04 server. I compiled the library
with maven 3.3.3 under java 1.7 and tried to run the kmeans example
in examples/bin/cluster-reuters.sh (i.e., choice 1). The script fails with:

15/09/15 23:27:35 ERROR AbstractJob: Missing value(s) --input
Missing value(s) --input
 [--input <input> --output <output> --outputFormat <outputFormat>
<substring> --numWords <numWords> --pointsDir <pointsDir> --samplePoints
<samplePoints> --dictionary <dictionary> --dictionaryType <dictionaryType>
--evaluate --distanceMeasure <distanceMeasure> --help --tempDir <tempDir>
--startPhase <startPhase> --endPhase <endPhase>]
--input (-i) input    Path to job input directory.
15/09/15 23:27:35 INFO MahoutDriver: Program took 119 ms (Minutes:
cat: /tmp/mahout-work-disa/reuters-kmeans/clusterdump: No such file or

Any ideas how to resolve this? It seems like the script should provide the
correct input (-i), but it does not, and I'm not certain what the correct
input is myself.

FYI: Other options, such as 2 and 3, seem to work.

Sincere thanks,



hsharma mailinglists | 15 Sep 23:58 2015

[mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets


I have some questions around large-scale clustering. I would like to
arrive at a methodology that I can use to determine an appropriate
value of K to run K-means clustering for (at least for my scenario, if
not in general). More details follow below (apologies for the
verbosity, but I wanted to provide as much context as I could).

By 'large' I mean to imply:

* large in the number of points to cluster
* large in the dimensionality of the vector/feature-space
* large in the number of clusters

It would be doubly great if someone here has had to perform clustering
in a similar setting (similar in terms of the number of datapoints,
the nature/type of data being clustered, the number of clusters being
formed, etc.) and is willing to share their war story.

:: The context ::

I'm trying to cluster ~ 18 million sparse term-frequency vectors with
Mahout 0.9. Although these vectors were not derived from documents in
a text corpus, they were generated based on each data point being
associated with a finite number of discrete symbols. The size of the
overall symbol-set is quite large, hence the sparsity of the vector
representation. I assumed that I don't require idf-weighting because
none of these symbols can occur more than once per datapoint/vector.
Additionally, each vector has a fixed number of non-zero dimensions.


Daniel Korzekwa | 9 Sep 09:21 2015



I'm comparing the efficiency of sq_dist() from mahout to sq_dist()  from
gpml library that is based on bsxfun in octave/matlab.

It seems that computing the distance matrix in Octave is 5 times faster
than in Mahout. Why is that? Can we make it faster?

 x = [1:4000]

Scala (Mahout):
  // Array.range excludes the upper bound, so use 4001 to match Octave's 1:4000
  val x = Array.range(1, 4001, 1).map(i => i.toDouble)
  val A =  new DenseMatrix(Array(x)).transpose()
  val dM = sqDist(A)
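
For reference, a minimal plain-Java sketch of the quantity being compared: the matrix of pairwise squared Euclidean distances, computed through the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. This is not Mahout's implementation or gpml's; the class name is hypothetical, and it is only meant as an independent baseline to time against both.

```java
class SquaredDistances {
    /** Returns D with D[i][j] = squared Euclidean distance between rows i and j. */
    static double[][] sqDist(double[][] rows) {
        int n = rows.length;
        // Precompute the squared norm of every row once.
        double[] norms = new double[n];
        for (int i = 0; i < n; i++) {
            for (double v : rows[i]) {
                norms[i] += v * v;
            }
        }
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dot = 0.0;
                for (int k = 0; k < rows[i].length; k++) {
                    dot += rows[i][k] * rows[j][k];
                }
                // Clamp at zero to absorb small negative rounding errors.
                double value = Math.max(0.0, norms[i] + norms[j] - 2.0 * dot);
                d[i][j] = value;
                d[j][i] = value; // the matrix is symmetric
            }
        }
        return d;
    }
}
```

For the 4000 x 1 input above, the result is a dense 4000 x 4000 matrix, so allocating and filling 16 million doubles is likely a large share of the total time in any implementation.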


Daniel Korzekwa
Machine Learning Engineer
https://www.linkedin.com/in/danielkorzekwa <http://danmachine.com/>
Dulakshi Vihanga | 9 Sep 07:59 2015

FileNotFoundException for intro.csv in mahout userBasedCF recommendation

I created a maven project and in the pom file I added the following


I followed the example in mahout in action and added the RecommenderIntro
class to src/main/java and copied the intro.csv file into src/main/resource

Error occurs in:

DataModel model = new FileDataModel(new File("resources/intro.csv"));
Exception in thread "main" java.io.FileNotFoundException: resources/intro.csv
    at org.apache.mahout.cf.taste.impl.model.file.FileDataModel.<init>(FileDataModel.java:180)
    at org.apache.mahout.cf.taste.impl.model.file.FileDataModel.<init>(FileDataModel.java:167)
    at org.apache.mahout.cf.taste.impl.model.file.FileDataModel.<init>(FileDataModel.java:147)
    at recstest.UserBased.main(UserBased.java:19)

I also copied the file to src/main/java and checked, and the same error
still occurs.
What could be the reason for this?
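
A likely cause, stated as an assumption: a relative path such as "resources/intro.csv" is resolved against the JVM's working directory (in Eclipse, normally the project root), where no "resources" directory exists. Files placed in Maven's standard src/main/resources directory end up on the classpath instead. A minimal sketch of loading the file from the classpath and handing a real File to a File-based API; the class and method names are hypothetical:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class ResourceFiles {
    /** Copies a stream to a temporary file so that File-based APIs can read it. */
    static File copyToTempFile(InputStream in, String prefix, String suffix) throws IOException {
        File tmp = File.createTempFile(prefix, suffix);
        tmp.deleteOnExit();
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Looks up a classpath resource by name and materialises it as a File. */
    static File fromClasspath(String name) throws IOException {
        try (InputStream in = ResourceFiles.class.getClassLoader().getResourceAsStream(name)) {
            if (in == null) {
                throw new FileNotFoundException("not on classpath: " + name);
            }
            return copyToTempFile(in, "resource-", ".csv");
        }
    }
}
```

With this helper, the failing line would become new FileDataModel(ResourceFiles.fromClasspath("intro.csv")). A simpler alternative, when the program is always started from the project root, is to spell out the path relative to that directory, for example new File("src/main/resources/intro.csv").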
Dulakshi Vihanga | 9 Sep 06:22 2015

Dependencies for a simple recommender system

To create a simple user-based recommender system using the latest apache
mahout as a library, what dependencies should one add to the pom file of
the project? Is it ,


or some other artifact and a version?

Note that I have already cloned and built Apache Mahout with "mvn -U clean
install -Dmaven.test.skip=true". I created a Maven project in Eclipse to
try out the user-based CF algorithm, specifically the one given in
https://mahout.apache.org/users/recommender/userbased-5-minutes.html
My question is about the pom file of that Maven project.
Dulakshi Vihanga | 8 Sep 16:31 2015

Apache Mahout build failure mahout-hdfs

When I tried to build mahout using "mvn clean install
-Dmaven.test.skip=true" I got the following error and the build failed.

"Failed to execute goal on project mahout-hdfs: Could not resolve
dependencies for project org.apache.mahout:mahout-hdfs:jar:0.11.1-SNAPSHOT:
Could not find artifact
org.apache.mahout:mahout-math:jar:tests:0.11.1-SNAPSHOT in apache.snapshots
(http://repository.apache.org/snapshots)"

I have an Ubuntu 12.04 machine with Oracle Java SE, JDK 1.7 and Maven 3.3.3.
I cloned Apache Mahout using "git clone https://github.com/apache/mahout.git"
on 07.09.2015.
Could someone identify what went wrong?
Lee S | 6 Sep 11:28 2015

Does random forest support multiple inputs?

Hi all,
  I've read the Mahout random forest documentation at

In the "Known issues and limitations" section, it says

> The "Decision Forest" code is still "a work in progress", many features
> are still missing. Here is a list of some known issues: *For now, the
> training does not support multiple input files. The input dataset must be
> one single file (this support will be available with the upcoming release)*

Does the newest version of Mahout already support multiple input files in
the random forest algorithm?
Lee S | 31 Aug 11:21 2015

What does maxRating parameter in ALS recommend algorithm mean?

Is it the maximum rating of all the items in the input data?
Can anybody point me to a paper describing the algorithm?
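
One plausible reading, offered as an assumption and not confirmed in this thread: maxRating is the upper end of the rating scale, and it is used to cap predicted ratings. In a factorization model the prediction is the dot product of a user feature vector and an item feature vector, which can overshoot the scale. A minimal sketch with hypothetical names:

```java
class CappedPrediction {
    /** Dot product of the two feature vectors, capped at maxRating. */
    static double predict(double[] userFeatures, double[] itemFeatures, double maxRating) {
        double dot = 0.0;
        for (int k = 0; k < userFeatures.length; k++) {
            dot += userFeatures[k] * itemFeatures[k];
        }
        // The raw estimate may exceed the rating scale, so cap it.
        return Math.min(dot, maxRating);
    }
}
```

Under that reading, a 1-to-5 star dataset would use maxRating = 5, which is also the maximum rating present in the input data whenever at least one 5-star rating exists.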