Raghuveer | 22 May 13:42 2015

Analysing NDT data for POC

I am doing a POC with a dataset of the format (client_ip, timestamp, bytes_transferred), and I am trying
to implement the use case "predict bytes_transferred for a particular client at a given timestamp". For
example, one record says a client_ip downloaded 234 bytes at timestamp 1432292516696. Similarly,
let's say I have datasets for the 22nd morning, the 23rd evening and the 24th afternoon, and I want to apply
the use case to them. Therefore I pass the individual bytes_transferred values to classification to categorize
them into low_download (<2500), medium_download (>2500 and <5000) and high_download (>5000).
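For reference, the bucketing described above is a plain thresholding step that can be done before any Mahout processing. Here is a minimal sketch; the class and method names are mine, not Mahout API, and since the original thresholds leave the exact values 2500 and 5000 unspecified, this sketch puts them in the medium and high buckets respectively:

```java
// Hypothetical pre-processing sketch: label each record's bytes_transferred
// with the three categories described above before handing data to a classifier.
public class DownloadBucketer {

    // Maps a bytes_transferred value to one of the three category labels.
    // Boundary values 2500/5000 are not specified in the original post;
    // here they fall into medium_download and high_download.
    static String bucket(long bytesTransferred) {
        if (bytesTransferred < 2500) {
            return "low_download";
        } else if (bytesTransferred < 5000) {
            return "medium_download";
        } else {
            return "high_download";
        }
    }

    public static void main(String[] args) {
        // Example record from the post: client_ip downloaded 234 bytes.
        System.out.println(bucket(234));   // low_download
        System.out.println(bucket(5100));  // high_download
    }
}
```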

How can I pass this dataset to a classification algorithm like cBayes to categorize the client_ips based on timestamp?
Since the data file has 3 columns, should I pass the file as-is to sequence-file conversion and then to vectorization,
or is some pre-processing required? Also, since this is time-series data, are there any specific algorithms suited
to the job?

I need your help, kindly suggest.


lastarsenal | 22 May 08:50 2015

mahout spark FilteredRDD problem


Recently I tried Mahout on Spark, for example:

./bin/mahout spark-itemsimilarity -i ${input} -o ${output} --master $MyMaster --sparkExecutorMem 2g

then I hit an error: "Caused by: java.lang.ClassCastException:
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator cannot be cast to
org.apache.spark.serializer.KryoRegistrator". It seems that our Spark version is NOT compatible
with Mahout.

Since the Spark cluster is deployed by ops, all I can do is follow their version. So I did the following:
1. Modified <spark.version>1.1.1</spark.version> to <spark.version>1.3.0</spark.version> (our Spark
version) in Mahout's pom.xml.
2. Ran mvn -DskipTests clean install.
3. Got an error during the build:

error: value saveAsSequenceFile is not a member of org.apache.mahout.sparkbindings.DrmRdd[K]
[ERROR]     rdd.saveAsSequenceFile(path)
[ERROR]         ^
[ERROR]spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala:26: error:
object FilteredRDD is not a member of package org.apache.spark.rdd
[ERROR] import org.apache.spark.rdd.{FilteredRDD, RDD}

4. Checked Spark 1.3.0: FilteredRDD has been removed.
5. Checked Spark 1.1.1: FilteredRDD is still present.

So, my question is: how can I solve this?

zhonghongfei@yy.com | 21 May 09:30 2015

Some confusion about SSVD in class ABtDenseOutJob

I ran the SSVD CLI with the "-q 1" parameter, but I got a java.lang.ArrayIndexOutOfBoundsException.

I found that this happens because, when the vector is not dense, i is the index of a Vector.Element, and this
index is an item id, so it can exceed the vector size.
My question is: is this a bug in ABtDenseOutJob?

The following code is from ABtDenseOutJob --> ABtMapper --> map():

if (vec.isDense()) {
  for (int i = 0; i < vecSize; i++) {
    extendAColIfNeeded(i, aRowCount + 1);
    aCols[i].setQuick(aRowCount, vec.getQuick(i));
  }
} else if (vec.size() > 0) {
  for (Vector.Element vecEl : vec.nonZeroes()) {
    // i is an item id, so it will usually exceed vec.size(); when
    // extendAColIfNeeded(i, aRowCount + 1) is called it fails
    int i = vecEl.index();
    extendAColIfNeeded(i, aRowCount + 1);
    aCols[i].setQuick(aRowCount, vecEl.get());
  }
}

So how can this problem be fixed?

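To make the dense-vs-sparse distinction above concrete, here is a minimal Mahout-free sketch (a plain Java map stands in for a sparse Mahout vector; the class and method names are mine). In the dense branch the loop index runs 0..vecSize-1, but nonZeroes() yields each element's original index, which can be far larger than the number of stored entries:

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Mahout-free illustration of the indexing issue described above.
// A sparse vector stores only its non-zero entries, but each entry
// keeps its original (possibly very large) element index.
public class SparseIndexDemo {

    // Largest stored element index of a sparse "vector" (map stand-in).
    static int maxIndex(Map<Integer, Double> sparse) {
        return Collections.max(sparse.keySet());
    }

    public static void main(String[] args) {
        // Three non-zero entries, but the element indices are item ids
        // that can be far larger than the number of stored entries.
        Map<Integer, Double> sparse = new TreeMap<>();
        sparse.put(5, 1.0);
        sparse.put(70000, 2.0);
        sparse.put(123456, 3.0);

        System.out.println(sparse.size());     // 3 stored entries
        System.out.println(maxIndex(sparse));  // 123456
        // Any buffer sized from the entry count (as in the sparse branch
        // quoted above) will be overrun when addressed by such an index.
    }
}
```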
Mustafa Elbehery | 19 May 12:34 2015

TelephoneCall Logistic Regression Example

Hi Folks,

I have a question regarding the *TelephoneCall* class in the examples package. When we
load the training data from the CSV into the training matrix, we add a
weight for each feature field in the feature vector.

In the TelephoneCall code, we add the weight as *log(v)*, the logarithmic
value rather than the raw value. I cannot understand why. Please find a code
snippet below:

case "age": {
  double v = Double.parseDouble(fieldValue);
  featureEncoder.addToVector(name, Math.log(v), vector);

However, for the balance field, the value is clamped at -2000 when it is below
that threshold, like this:

case "balance": {
  double v;
  v = Double.parseDouble(fieldValue);
  if (v < -2000) {
    v = -2000;
  featureEncoder.addToVector(name, Math.log(v + 2001) - 8, vector);

Can anyone explain the logic? I am trying to run it on my own dataset.
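For what it's worth, the balance transform quoted above can be checked numerically: since balances are clamped at -2000, adding 2001 keeps the log argument positive, and subtracting 8 roughly centres the result around zero, because log(2001) is about 7.6. A small standalone sketch (the method name encodeBalance is mine, not from the example):

```java
public class BalanceTransform {
    // The transform used for the "balance" field, as quoted above:
    // clamp at -2000, shift so the log argument stays positive,
    // subtract 8 to roughly centre the encoded value.
    static double encodeBalance(double v) {
        if (v < -2000) {
            v = -2000;
        }
        return Math.log(v + 2001) - 8;
    }

    public static void main(String[] args) {
        // A clamped, strongly negative balance maps to log(1) - 8 = -8.0.
        System.out.println(encodeBalance(-5000)); // -8.0
        // A zero balance maps to log(2001) - 8, roughly -0.4.
        System.out.println(encodeBalance(0));
    }
}
```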

mw | 19 May 09:37 2015

Mahout 0.10.0 with yarn 2.6

Hi Mahout users,

Does anybody know if Mahout 0.10.0 runs on YARN 2.6?


Raghuveer | 18 May 13:14 2015

HMM Modeling

I am running the following with the HMM model:

Step 1:
echo "196 198 199 201 203 203 205 206 208 209 211 212 214 215 217 218 220 221 223 224 226 227 229 230 232 233 234 236
239 239 240 242 243 245 246" > hmm-input

Step 2:
mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 250 -e .01 -m 1000

Step 3:
mahout hmmpredict -m hmm-model -o hmm-predictions -l 1000

The output does not seem to have correct prediction values. In step 2, if I choose a very small -e such as
.0001, I see NaNs filling the emission matrix. Kindly suggest how to get the correct output. It would
also be helpful if you could explain what epsilon is and how it gets used. Thanks in advance.
Aykut Çayır | 17 May 23:07 2015

I have a weird exception in Online logistic regression

Hi Mahouters,
I have tried to implement an application to classify breast cancer using the
Wisconsin dataset (UCI). However, I get an ArrayIndexOutOfBoundsException in
the dense Vector setQuick method. Can anyone help me? Thanks.
Jonathan Seale | 14 May 05:15 2015

Row Similarity


I have an astrophysical application for Mahout that I need help with.

I have 1-dimensional stellar spectra for many, many stars. Each spectrum
consists of a series of intensity values, one per wavelength of light. I
need to be able to find the cosine similarity between ALL pairs of stars.
Seems to me this is simply a user-user similarity problem where I have
stars instead of users, wavelengths instead of items, and intensities
instead of ratings/clicks.

But I'm having difficulty using Mahout's row similarity package (I'm new to
this, and these days astronomers code pretty exclusively in Python). I know
that I must 1) create a sparse matrix where each row is a star, columns are
wavelengths, and the values are intensities, and 2) run row similarity. But
I'm just not sure how to do it. Does anyone have a good resource, or would
anyone be willing to help? I could probably offer some compensation to anyone
willing to provide a little focused, personalized assistance.
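For reference, the row-row comparison described above boils down to cosine similarity between intensity vectors. A minimal Mahout-free sketch of what row similarity computes for one pair of stars (class and method names are illustrative):

```java
public class CosineDemo {
    // Cosine similarity between two spectra: each array holds one
    // intensity value per wavelength, in the same wavelength order.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] star1 = {1.0, 2.0, 3.0};
        double[] star2 = {2.0, 4.0, 6.0}; // same spectral shape, just scaled
        // Cosine similarity ignores overall brightness, so this prints ~1.0.
        System.out.println(cosine(star1, star2));
    }
}
```

Mahout's row similarity does this for all row pairs at once; the sketch only shows the per-pair measure.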

Dan Dong | 13 May 17:10 2015

word2vec in mahout.

Does anyone know how to run word2vec in Mahout? I could not find docs
about it on the Mahout web site. Thanks.

Xavier Rampino | 13 May 11:21 2015

spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space


I've tried spark-rowsimilarity with an out-of-the-box setup (downloaded the
Mahout distribution and Spark, and set up the PATH), and I stumble upon a Java
heap space error. My input file is ~100MB. The various parameters I tried
don't seem to change this. I run:

~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
~/query_result.tsv --output ~/work/result -sem 24g

Do I just need to give it more memory, or is there another step I can take to
solve this?
Raghuveer | 13 May 08:21 2015

HMM model error

I am running the following with the HMM model:
Step 1:

echo "196 198 199 201 203 203 205 206 208 209 211 212 214 215 217 218 220 221 223 224 226 227 229 230 232 233 234 236
239 239 240 242 243 245 246" > hmm-input

Step 2:
mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 250 -e .01 -m 1000

Step 3:
mahout hmmpredict -m hmm-model -o hmm-predictions -l 1000
The output does not seem to have correct prediction values. In step 2, if I choose a very small -e such as
.0001, I see NaNs filling the emission matrix. Kindly suggest how to get the correct output.