Adamantios Corais | 23 May 2013 11:55

Implementing the General Bayesian Inference ML Algorithm

Hello,

I am trying to implement the General Bayesian Inference machine learning
algorithm. The algorithm is really simple as an idea, but since I am new to
the hadoop ecosystem, I lack some experience. Anyway, here is the idea:

On the one hand I have a stream of <user_id, site_id> data, and on the
other I have a static table of <site_id, probability> data. Let's also
suppose that the streaming data relate to one specific user only. What I
want to do is read the streaming data one record at a time (i.e. the pages
the user visits), look up the corresponding probability in the static table,
and do some very simple computations to obtain the user-specific
probability. Finally, I store the updated probability somewhere (in memory),
as it will be the one used when the next streaming record is processed and
the user-specific probability is updated again. This process
goes on until all <user_id, site_id> pairs have been read.

My questions are summarised as follows:

1) Is anything similar already implemented in Mahout?
2) Do I need Mahout to implement this algorithm at all?
3) Should I use Pig + UDF instead?
4) Or should I do everything in MapReduce?
5) How can I keep the static (and tiny) table in memory so as to avoid
loading it again and again?

Any help is very much appreciated.
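
A minimal sketch of the update loop described above. The post does not give the actual computation, so everything specific here is an assumption: each table entry is taken to be P(H | site) for some hypothesis H about the user, visits are combined naive-Bayes style by multiplying odds, and the starting prior is 0.5. The names `update_probability` and `run_stream` are hypothetical. The static table is a plain dict that is loaded once and reused for every record.

```python
def update_probability(current, site_prob):
    """Combine the running probability with one site's probability via odds.

    Assumption: site_prob is P(H | site) and lies strictly between 0 and 1,
    as does the running probability (a value of exactly 1.0 would divide by zero).
    """
    current_odds = current / (1.0 - current)
    site_odds = site_prob / (1.0 - site_prob)
    posterior_odds = current_odds * site_odds
    return posterior_odds / (1.0 + posterior_odds)


def run_stream(stream, site_table, prior=0.5):
    """Read <user_id, site_id> pairs one by one and keep the probability in memory."""
    probability = prior
    for _user_id, site_id in stream:
        # Sites missing from the static table are skipped (an assumption).
        if site_id in site_table:
            probability = update_probability(probability, site_table[site_id])
    return probability
```

For example, `run_stream([("u1", "a"), ("u1", "b")], {"a": 0.8, "b": 0.5})` processes two visits for one user. In a MapReduce or Pig job the same pattern would typically load the small table once per task rather than once per record.
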
qiaoresearcher | 23 May 2013 01:42

how to run mahout examples from eclipse in linux?

Hi all,

Assume we want to run mahout examples like:

$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar
org.apache.mahout.classifier.df.tools.Describe -p
testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N
C 8 N 2 C 19 N L

It works well from the command line, but how do I run it in Eclipse?

We can run the Describe class in Eclipse, but then Hadoop and HDFS
support seems to be lost: it only works with the local file system.

Any suggestions?

thanks,
yikes aroni | 22 May 2013 17:20

Hidden Markov Models and time series - 2 questions

I'm not knowledgeable about statistics or data analysis, so please be
gentle! I am using Mahout to predict when a time series goes out of control. I've
had a fair amount of success classifying with SGD and Adaptive regression
approaches but want to see if Hidden Markov Models can do a better job for
my purposes. I have two questions.

Question 1
I train the model using HmmTrainer.trainSupervisedSequence(). The hidden
state is the status: Out-of-Control (OOC) or Not-OOC for the next point in
time. Thus when I use HmmEvaluator.decode(model, observedSequence, false),
I look at the "state" associated with the last point in the
observedSequence and take *that* as my prediction of State at t+1. First of
all -- is this sensible? Or is there a better way to use the API to get a
prediction of State at t+1 given Observations 0 through t after training?

Question 2
Once I get my prediction (i.e., the state the model predicts will be
associated with the last observation in my observation sequence), how do I
use the API to get the probability of that predicted state being correct?
I've looked at various outputs from HmmUtils and HmmEvaluator, but not being
strong in my knowledge of HMMs, I'm not sure which (if any) are what I need.
Ultimately, I want to be able to say something like "The predicted next
state of this time series is OOC with a confidence of 0.37".

thank you
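
A decoded state path on its own carries no probability. One standard way to get a number like "OOC with a confidence of 0.37" is the normalized forward (filtering) recursion, followed by one transition step. The sketch below is plain Python, not the Mahout API: the function name `state_beliefs` and the list-of-lists layout for the initial, transition and emission probabilities are assumptions made for illustration.

```python
def _normalize(values):
    total = sum(values)
    if total == 0.0:
        raise ValueError("observation sequence has zero probability under the model")
    return [v / total for v in values]


def state_beliefs(initial, transition, emission, observations):
    """Return (filtered, predicted) state distributions.

    filtered[i]  = P(state_t   = i | obs_0..t)
    predicted[j] = P(state_t+1 = j | obs_0..t)
    transition[i][j] = P(next state j | state i); emission[i][o] = P(obs o | state i).
    """
    n = len(initial)
    # Condition the initial distribution on the first observation.
    belief = _normalize([initial[i] * emission[i][observations[0]] for i in range(n)])
    for obs in observations[1:]:
        # Move one step forward in time, then condition on the new observation.
        moved = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
        belief = _normalize([moved[j] * emission[j][obs] for j in range(n)])
    # One extra transition step with no observation gives the t+1 prediction.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return belief, predicted
```

If the hidden state at time t is already defined as the status of the point at t+1, as in the post, then `filtered` is the quantity of interest; otherwise `predicted` is. Either way, the entry for the OOC state can be read directly as the confidence.
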
Jakub Pawłowski | 22 May 2013 15:17

Re: mahout ssvd tuning problem

Yes, I was manipulating io.sort.factor too. It speeds up the reducer; values
around 30 give good results for me.
But my problem is not the reducer. My problem is the Bt-job map tasks that spill
to disk.

You mentioned a Combiner; how can I turn it on? I'm running my job from the
console like this:

mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3  
--input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/

The document at
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
doesn't mention anything about a combiner.

Thanks for your answer.

On 22.05.2013 14:59, Sean Owen wrote:
> I feel like I've seen this too and it's just a bug. You're not running
> out of memory.
>
> Are you also setting io.sort.factor? that can help too. You might try
> as high as 100.
>
> Also have you tried a Combiner? if you can apply it it should help too
> as it is designed to reduce the amount of stuff spilled.
>

Jakub Pawłowski | 22 May 2013 13:32

mahout ssvd tuning problem

Hi,

I'm trying to tune the mahout ssvd job so that it spills less. To do that I'm setting

<property>
  <name>io.sort.mb</name>
  <value>1047</value>
</property>

but when I try any bigger value, e.g.

<property>
  <name>io.sort.mb</name>
  <value>1247</value>
</property>

(according to the Hadoop source code this value can be as high as 2047, see

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/MapTask.java
line 771)

I'm getting:

java.io.IOException: Spill failed
     at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
     at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
     at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)

Pat Ferrel | 22 May 2013 05:47

Interpreting Cluster Dump Metrics

Doing some clustering on text docs. I iterated over a range of k using kmeans and each time got the average
intra and inter cluster density scores as shown below. 

Just want to make sure I'm interpreting these numbers correctly. The intra-cluster density should be high
since it indicates tighter groupings. I would assume the inter-cluster density should be low to indicate
more independent clusters--ones further apart. Is this correct? 

For this sample it looks like about 20-40 clusters is "best"? Looking at the results for k=40 by eyeball they
do seem pretty good. 

The number of docs is 7008

clusters   average intra-cluster density   average inter-cluster density
20         0.717687068                     0.803248063
40         0.65685332                      0.796054862
80         0.644257737                     0.854298388
100        0.64579391                      0.875902138
200        0.629816133                     0.910933972
500        0.591950257                     0.930948207

Stuti Awasthi | 21 May 2013 13:17

Feature vector generation from Bag-of-Words

Hi all,

I have a query regarding feature vector generation for text documents.
I have read Mahout in Action and understood how to convert a text document into a feature vector weighted by the TF or
TF-IDF schemes. My use case is slightly different.

I have a few keywords, say 100, and I want to create the feature vectors of the text documents using only these
100 keywords. So I would like to calculate the frequency of each keyword in each document and generate a
feature vector over the keywords with the frequencies as weights.

Is there an existing way to do this, or will I need to write custom code?

Thanks
Stuti Awasthi
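
The keyword-restricted vector described above can be sketched in a few lines. This is plain Python rather than a Mahout utility, and the function name `keyword_vector`, the lowercase alphanumeric tokenization and the assumption that every keyword is a single lowercase word are all illustrative choices, not something stated in the post.

```python
import re
from collections import Counter


def keyword_vector(text, keywords):
    """Return one term frequency per keyword, in the order the keywords are given.

    Assumption: keywords are single lowercase words; words in the text that are
    not keywords are ignored.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[keyword] for keyword in keywords]
```

Calling `keyword_vector(document_text, keywords)` for each document with the same fixed keyword list yields vectors of equal length (100 in the use case above), with position i always referring to keyword i.
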


Rajesh Nikam | 20 May 2013 13:13

convert input for SVD

Hello,

I have an ARFF / CSV file containing input data that I want to pass to SVD
(Lanczos Singular Value Decomposition).

Which tool should I use to convert it to the required format?

Thanks in Advance !

Thanks,
Rajesh
Sophie Sperner | 19 May 2013 23:22

mahout colt collections

Dear,

I'm experiencing difficulties with the hppc library
<http://labs.carrotsearch.com/hppc.html> that I'm using. My
algorithms work perfectly fine for small inputs,
but when I move to an Amazon machine and try to process larger inputs, my code
hangs forever as a result of some hidden bugs in that library (probably
it was not tested for big data).

I'm aware of the CERN Colt library for Java primitive collections. It's like
Trove, hppc, fastutil and so on. But Colt is dead. What I do know is that
Mahout uses old Colt code, so what I want now is to find any documents or
articles on how to use similar structures such as OpenIntSet,
OpenIntIntHashMap, OpenIntObjectHashMap and so on, in order to convert my
present code to the Mahout Colt library.

If there is no such document, *could you please point me to where this
library is? (I need the jar file with those collections.)* I hope the API will be
quite similar to the one I use, so that the conversion will be relatively
easy.

Thank you and best wishes.
-- 
Yours,
Sophie
Ahmet Ylmaz | 19 May 2013 18:52

Which database should I use with Mahout

Hi,
I would like to use Mahout to make recommendations on my web site. Since the data will hopefully become large,
I plan to use the Hadoop implementations of the recommender algorithms.

I'm currently storing the data in MySQL. Should I continue with it, or should I switch to a NoSQL database such
as MongoDB or something else?

Thanks
Ahmet
