Ferran Muñoz | 26 Feb 09:08 2015

Questions on Cooccurrence Recommenders with Spark

Hello,

I have read the "Intro to Cooccurrence Recommenders with Spark" of the
Mahout documentation and I have a question regarding the unified
recommender query. What does "user's-tags-associated-with-purchases"
exactly mean? Does it mean that I have to put tags or itemids?

I understand that the "tags" field of each item document contains the tags
of that particular item. What query do I then have to write in order to
get recommended items using the content-based indicator?

On the other hand, how can I use ratings when computing
spark-itemsimilarity? For example, how can I use spark-itemsimilarity
to get recommendations on the MovieLens dataset (it has ratings, not
boolean preferences)?
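
For illustration, one common way to fit explicit ratings into a boolean
cooccurrence input is to keep only the ratings at or above a cutoff and drop
the rating value. The sketch below is a minimal example under assumptions: the
cutoff of 4.0, the function names and the sample data are hypothetical, and it
assumes the resulting userID,itemID lines are what is then handed to
spark-itemsimilarity.

```python
def ratings_to_interactions(ratings, min_rating=4.0):
    """Keep (user, item) pairs whose rating meets the cutoff; drop the value.

    min_rating=4.0 is an assumed cutoff for a 1-5 scale, not a Mahout default.
    """
    return [(user, item) for user, item, rating in ratings if rating >= min_rating]


def to_lines(pairs, delimiter=","):
    """Format the boolean interactions as delimited text lines."""
    return [f"{user}{delimiter}{item}" for user, item in pairs]


# Hypothetical sample in MovieLens-like (user, movie, rating) form.
ratings = [
    ("u1", "m10", 5.0),
    ("u1", "m20", 2.0),
    ("u2", "m10", 4.0),
    ("u2", "m30", 3.5),
]
pairs = ratings_to_interactions(ratings)
lines = to_lines(pairs)
```

Whether low ratings should simply be dropped or treated as a separate
"dislike" interaction is a design choice this sketch does not settle.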

Thank you in advance.

Ferran
NewUser | 25 Feb 21:59 2015

Lucene index and mahout topic modeling

Hello,

I have a Lucene index with more than one field. I want to treat each field
of the index as a separate document for input to Mahout topic modeling. I want
to get a result like:
myField1 - probabilityT1, probabilityT2 ...
myField2 - probabilityT1, probabilityT2 ...
...

If I use lucene2seq and specify the -f option with a field name (-f "myField1",
-f "myField2" ...), I get sequence files with data from myField1, myField2
... But all of those files have the same keys, so I cannot tell which field
belongs to which topic. How can I do this? Is there a command that I
can use to rename the keys in sequence files?

Thanks!

chirag lakhani | 23 Feb 19:17 2015

java heap space error - Naive Bayes

I am trying to train a Naive Bayes model in Mahout and I keep getting a
Java heap space error. It is strange because I am using a hashing
vectorizer where I am fitting 1-gram and 2-gram tokens into a vector of
size 2^20. My cluster consists of 7 nodes with 16 GB of RAM, and each time
I run it I get this heap space error. I have reduced the
dimensions of the vectors to only 16 and I still get this error,
so it seems to be a more systematic issue.

I am running this on Cloudera CDH 5.3. Is there anything I could adjust in
terms of heap space that would allow this to work?
Urszula Kużelewska | 20 Feb 12:50 2015

refresh recommender


Hello,

I have a problem with refreshing a recommender: it doesn't work in my case. It refreshes only after at least
the second call to recommender.refresh(null).

I'm using Mahout on a medium-sized data set (~60 MB), which is stored in a file. New data are written very often, a
few preferences per second. I write them to separate files whose names begin the same way as the data file
(data.1.csv, data.2.csv for the input file data.csv).

I know that the problem has been discussed here, but I didn't find any solution that works for me; a MySQL database
is very slow.

Regards,

Ula

mw | 18 Feb 11:18 2015

LDA p(Topic|Document)

Hello,

I am using LDA to build a topic model over 30k articles. However, I have
a problem getting p(Topic|Document) for topics that have a relatively
low prior.

For example, for one article there are basically two relevant topics,
with p(10|article)=0.09802209698050128 and
p(111|article)=0.9001826471638066.

10

{medikamente:0.004022688319831924,krebs:0.00358368764510196,patient:0.003446116146728106,therapie:0.0033063920383982624,ebola:0.003246694015027121,krankheiten:0.003118233813442163,medizin:0.0030363609305258774,medizinische:0.002735692894516862,sierra:0.0025488043196802953,gehirn:0.0024702977723619,erkrankung:0.0024664127821501956,virus:0.0024636302428390705,weltgesundheitsorganisation:0.0024240035374940316,leone:0.0023821163154330817,medikament:0.0022636661241652403,symptome:0.0022605621780134753,erkrankungen:0.002236741916796797,ärzten:0.002201719708384397,infektion:0.0022005269799190478,diagnose:0.0021200205447816866}

111

{cia:0.0026345359875991226,foltermethoden:0.0014149174673260135,verhörmethoden:0.00140497301590793,waterboarding:0.0011580597910307684,folter:0.0011198728249932808,folterbericht:9.750488561609758E-4,ciafolter:9.63781735282257E-4,jauch:9.625685166431471E-4,methoden:9.52553863964292E-4,ussenats:9.467352925179054E-4,terrorverdächtigen:8.956828892077374E-4,folterpraktiken:8.547966922851059E-4,geheimgefängnissen:8.509344535068084E-4,senatsbericht:7.938900091637415E-4,usgeheimdienstes:7.686889634541992E-4,ciafolterbericht:7.593868675732018E-4,schlafentzug:7.423809643767796E-4,morales:6.574631769728719E-4,foltern:6.504933854409662E-4,geheimdienstausschuss:6.395705008858842E-4} 

The article is basically about allergies, so I would assume that
topic 10 would get a high probability. I built the model again and
found that the probabilistic assignment to the topic equivalent to topic 10
gets a similar value.
However, there is always another topic that gets a higher probabilistic
assignment, and semantically this topic changes every time.

E.g. another solution:

{virus:0.00411216419791256,medikamente:0.0038598661793248137,krebs:0.0035675996400685007,therapie:0.0033905540843555413,ebola:0.003375728280833271,patient:0.0032500773732781823,krankheiten:0.0032139112354069776,weltgesundheitsorganisation:0.0029118597430727276,medizin:0.0026998889092036088,infiziert:0.00267748391624766,infektion:0.002632653711889378,symptome:0.002575108706339337,gehirn:0.002461862973567568,erkrankungen:0.0024427886542873287,sierra:0.0024414884403193737,epidemie:0.002422322695382149,erkrankt:0.0023925116648543065,leone:0.0023567022504187318,zellen:0.002332477878623263,ärzten:0.002322240317943797}
[...]

unmesha sreeveni | 18 Feb 08:24 2015

Delete output folder automatically in CRUNCH (FlumeJava)

Hi,
I am new to FlumeJava. I ran wordcount with it. How can I
automatically delete the output folder from within the code, instead of
going back and deleting the folder manually?
Thanks in advance.

--
*Thanks & Regards *

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
mostafa ahmed | 16 Feb 21:03 2015

Dependency Problem

Hello,

I run the Mahout environment as a user on CentOS and it works fine for me.
Now I am trying to build Mahout from source, and every time I execute

"mvn clean compile"

I get the following error:
[ERROR] Failed to execute goal on project mahout-mrlegacy:
Could not resolve dependencies for project
org.apache.mahout:mahout-mrlegacy:jar:1.0-SNAPSHOT:
Could not transfer artifact commons-io:commons-io:jar:2.4
from/to central (http://repo.maven.apache.org/maven2):
Read timed out -> [Help 1]

I am using Maven 3.0.4 and Hadoop 2.0.0.
I read on https://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
that it could be a missing declaration in the POM. Can anyone help me with that?

--
Best Regards,

Mostafa Ramadan

Software Developer

mail:mostafa.ramadan7 <at> gmail.com


Apache Mahout Project for GSOC 2015

Hi,

I am interested in doing a project on a recommender system framework for
GSoC 2015. Can somebody tell me whether Apache is offering projects in this area?

Thanks

*Prasad Priyadarshana Fernando <http://www.linkedin.com/in/prasadfernando>*
Mobile: +1 330 283 5827
万代豊 | 14 Feb 03:09 2015

"SLF4J: Class path contains multiple SLF4J bindings." error when MAHOUT_LOCAL is "TRUE"

Hi
It looks like this is a common problem, but I haven't figured out how to
resolve it in my case.

I have not done anything explicitly regarding SLF4J.
Both the Hadoop and Mahout environments were set up by simply downloading
the jar files; they were not built locally.
Both Hadoop and Mahout have been working fine in pseudo-distributed mode
for quite a while...

I am also not sure what information would be required, but some of the
class path settings that might relate to this are as follows.

MAHOUT_HOME="/usr/local/mahout-distribution-0.7"
MAHOUT_LOCAL="TRUE"
CLASS_PATH="/usr/local/hadoop:/usr/local/hadoop/conf:/usr/local/mahout-distribution-0.7/conf"
HADOOP_CONF_DIR="/usr/local/hadoop/conf"
HADOOP_HOME="/usr/local/hadoop"
JAVA_HOME="/usr/java/latest"

The only thing I have changed in my existing healthy Hadoop/Mahout environment
was setting MAHOUT_LOCAL to "TRUE".

...
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/local/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[...]

Eugenio Tacchini | 12 Feb 15:24 2015

Documentation

Hi all,
I am new to Mahout; I started working with it a couple of days ago and
have found it very powerful.

I noticed, however, a general lack of documentation. I am working only with
the recommender system features and, correct me if I am wrong, I can't find
complete documentation about them anywhere. All I can find on the official
website are some quick-start tutorials.

Knowing the theory of recommender systems, I would expect a few very simple
documentation pages that tell me something like:

- The algorithms implemented are: 1) user-based, 2) item-based, 3)
..........
- To use 1) in your code, follow these steps:
  - choose a user similarity measure (available measures: 1) Pearson
    correlation 2) cosine similarity 3) ..... 4).......)
  - choose a neighbor selection technique (available techniques: 1)
    threshold 2) fixed number 3).... 4)....)
  - compute the neighbors using the following:
    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1,
    similarity, dm);.......

and so on until all the options available are covered.

I didn't find anything like that. For example, at the moment I am trying to
figure out how to compute a prediction (user-based algorithm), e.g. predict
the rating of user x for movie y, but I didn't find anything about that.
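
To make the kind of prediction meant here concrete, the following is a
minimal sketch of the textbook user-based estimate: a similarity-weighted
average of the ratings that sufficiently similar users gave to the item. It is
a generic illustration, not Mahout's implementation; the function names and
sample data are hypothetical, and the 0.1 threshold only mirrors the value in
the snippet above.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items both users rated; None if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


def predict(prefs, user, item, threshold=0.1):
    """Similarity-weighted average rating of `item` over the user's neighbors.

    Neighbors are users with similarity >= threshold who have rated the item.
    Returns None when no such neighbor exists.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_prefs in prefs.items():
        if other == user or item not in other_prefs:
            continue
        sim = pearson(prefs[user], other_prefs)
        if sim is None or sim < threshold:
            continue
        weighted_sum += sim * other_prefs[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical data: user "x" has not rated movie "y".
prefs = {
    "x": {"a": 5, "b": 3, "c": 4},
    "n1": {"a": 5, "b": 3, "c": 4, "y": 4},  # same taste as x
    "n2": {"a": 1, "b": 5, "c": 2, "y": 1},  # opposite taste, excluded
}
estimate = predict(prefs, "x", "y")
```

In this sample only n1 passes the threshold, so the estimate for movie y is
simply n1's rating.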

Forgive me if there is something I am missing, I just want to focus on the
[...]

unmesha sreeveni | 12 Feb 11:14 2015

Neural Network in hadoop

I am trying to implement a neural network in MapReduce. Apache Mahout
refers to this paper
<http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf>:

Neural Network (NN): We focus on backpropagation. By defining a network
structure (we use a three layer network with two output neurons classifying
the data into two categories), each mapper propagates its set of data
through the network. For each training example, the error is back
propagated to calculate the partial gradient for each of the weights in the
network. The reducer then sums the partial gradient from each mapper and
does a batch gradient descent to update the weights of the network.
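
The scheme in the quoted paragraph can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not the paper's or Mahout's
code: it uses a single sigmoid unit instead of a three-layer network, a
squared-error loss, plain Python lists in place of real map and reduce tasks,
and hypothetical function names and learning rate. The point it shows is that
every mapper reads the same weights within an iteration and only returns a
partial gradient; the weights change in one place, after the reducer has
summed the partial gradients.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def map_partial_gradient(weights, records):
    """Mapper: sum the squared-error gradient over one data split.

    The weights are read-only here; nothing is updated per record.
    """
    grad = [0.0] * len(weights)
    for features, target in records:
        x = [1.0] + list(features)  # x[0] = 1, so weights[0] acts as the bias
        out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        delta = (out - target) * out * (1.0 - out)
        for j, xj in enumerate(x):
            grad[j] += delta * xj
    return grad


def reduce_and_update(weights, partial_grads, learning_rate=0.5):
    """Reducer: sum the partial gradients, then take one batch descent step."""
    total = [sum(column) for column in zip(*partial_grads)]
    return [w - learning_rate * g for w, g in zip(weights, total)]


def train(splits, n_features, iterations=2000):
    """Driver: one map/reduce round per iteration, same weights for all splits."""
    weights = [0.0] * (n_features + 1)
    for _ in range(iterations):
        partials = [map_partial_gradient(weights, split) for split in splits]
        weights = reduce_and_update(weights, partials)
    return weights


# Hypothetical toy data (logical OR), divided into two "mapper" splits.
splits = [
    [((0, 0), 0), ((0, 1), 1)],
    [((1, 0), 1), ((1, 1), 1)],
]
weights = train(splits, n_features=2)
```

Because the gradient of a sum of per-record errors is the sum of the
per-record gradients, the summed result does not depend on how the records
are divided among the mappers.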

Here <http://homepages.gold.ac.uk/nikolaev/311sperc.htm> is a worked-out
example of the gradient descent algorithm.

Gradient Descent Learning Algorithm for Sigmoidal Perceptrons
<http://pastebin.com/6gAQv5vb>

   1. What is the better way to parallelize a neural network algorithm from
   a MapReduce perspective? In the mapper, each record has partial weights
   (from the above example: w0, w1, w2; I am not sure whether w0 is the bias).
   A random weight is assigned initially; the first record calculates the
   output (o) and the weights get updated, then the second record also computes
   the output and deltaW is updated on top of the previous deltaW value. In
   the reducer, the sum of the gradients is calculated, i.e. if we have 3
   mappers, we get 3 sets of w0, w1, w2. These are summed, and using batch
   gradient descent we update the weights of the network.
   2. In the above method, how can we ensure which previous weight is
   taken when there is more than one map task? Each map task has its own
   updated weights. How can it be accurate?
[...]

