jelmer | 23 Jun 15:47 2016

Scaling up Spark item similarity on big data sets

Hi,

I am trying to build a simple recommendation engine using Spark item
similarity (e.g. with
org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs).

Things work fine on a comparatively small dataset, but I am having difficulty
scaling up.

The input I am using is CSV data containing 19.988.422 view-item events
produced by 1.384.107 users, covering 5.135.845 distinct products.

The CSV data is stored on HDFS and is split over 15 files; consequently
the resulting RDD has 15 partitions.
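
For a sense of scale, the figures above can be turned into a few back-of-the-envelope ratios. This is only an arithmetic sketch: it assumes events are spread evenly over users and files, and it treats every event as a distinct (user, product) pair, so the density is an upper bound.

```python
# Figures quoted in the post above.
events = 19_988_422
users = 1_384_107
products = 5_135_845
partitions = 15  # one partition per input file, as stated in the post

# Average number of view events per user.
events_per_user = events / users

# Fraction of the user x product matrix that is non-zero
# (upper bound: assumes no user viewed the same product twice).
density = events / (users * products)

# Average number of events each partition has to process.
events_per_partition = events / partitions

print("events per user:      %.2f" % events_per_user)
print("matrix density:       %.2e" % density)
print("events per partition: %.0f" % events_per_partition)
```

The matrix is extremely sparse, and each of the 15 partitions carries well over a million events if the files are of similar size.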

After tweaking some parameters I did manage to get the job to run without
running out of memory, but the job takes a very long time to run.

After running for 15 hours it is still stuck on

org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
org.apache.mahout.sparkbindings.blas.AtA$.at_a_nongraph_mmul(AtA.scala:254)
org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:61)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:325)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:339)
org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:123)
org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:95)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:145)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:143)
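
The stack trace above stops inside the A'A step (AtA.scala). As a rough illustration of what that product represents for a binary user-by-item matrix, the sketch below counts, for each pair of items, how many users interacted with both. It is a single-machine toy, not Mahout's distributed implementation, and it leaves out the scoring and downsampling that the library applies; the function name and the sample events are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(events):
    """Count users shared by each item pair.

    events: iterable of (user, item) pairs.
    Returns {(item_a, item_b): n_users} with item_a < item_b,
    i.e. the off-diagonal entries of A'A for a binary matrix A.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Every pair of items seen by the same user co-occurs once.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical view events, for illustration only.
events = [
    ("u1", "p1"), ("u1", "p2"),
    ("u2", "p1"), ("u2", "p2"), ("u2", "p3"),
    ("u3", "p3"),
]
counts = cooccurrence_counts(events)
print(counts)
```

The inner loop visits every item pair per user, so the work for one user grows quadratically with the number of items that user viewed. A few very active users can therefore dominate the run time.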

Lee Ho Yeung | 21 Jun 09:30 2016

HMM prediction does not output normal characters when predicting binary data

I take Yahoo's finance CSV, convert the values to binary numbers, and
separate each one and zero with a space.

I then save this as hmm-input and run the example commands below.

However, the output numbers do not match the future values exactly and are
not even decimal numbers.

mahout/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001
-m 1000
mahout/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 216

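# Regroup the space-separated predictions into blocks of 8 symbols.
# Both files are opened in one with-statement so they are always closed.
with open('/home/martin/Downloads/hmm-predictions', 'r') as content_file, \
        open('/home/martin/Downloads/hmm-predictions2', 'a') as f:
    content = content_file.read()
    allones = content.split(" ")
    counter = 0
    for eachone in allones:
        print(eachone)
        f.write(eachone)
        counter = counter + 1
        if counter == 8:
            # Insert a separator after every 8th symbol.
            counter = 0
            print(' ')
            f.write(' ')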
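
The round trip described in this post (numbers written as space-separated bits, predictions regrouped eight at a time) can be sketched as below. The 8-bit width and both function names are assumptions made for illustration; the post does not say how the original values were encoded.

```python
def encode_bits(values, width=8):
    """Write each integer as `width` binary digits separated by spaces."""
    return " ".join(" ".join(format(v, "0{}b".format(width))) for v in values)


def decode_bits(text, width=8):
    """Regroup space-separated 0/1 symbols into integers, `width` bits each."""
    symbols = text.split()
    # A symbol other than 0 or 1 cannot be read as a binary digit.
    if any(s not in ("0", "1") for s in symbols):
        raise ValueError("sequence contains a non-binary symbol")
    # Ignore a trailing group that is shorter than `width`.
    usable = len(symbols) - len(symbols) % width
    return [int("".join(symbols[i:i + width]), 2)
            for i in range(0, usable, width)]


encoded = encode_bits([5, 200])
decoded = decode_bits(encoded)
print(encoded)
print(decoded)
```

With this grouping, a predicted sequence of length 216 (the -l value used above) corresponds to 27 eight-bit groups. If -no sets the number of observable symbols, a model trained with -no 4 may emit symbols other than 0 and 1, which decode_bits reports as an error instead of converting silently.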
Lee Ho Yeung | 21 Jun 07:20 2016

Which function can guess missing values in a matrix, like this SVD, and how is it used?

Is there an SVD function like the one below for guessing missing
floating-point values in a matrix?
If there is, how do I use it and where is the result stored?

import numpy as np
from scipy.sparse.linalg import svds
from functools import partial

def emsvd(Y, k=None, tol=1E-3, maxiter=None):
    """
    Approximate SVD on data with missing values via expectation-maximization

    Inputs:
    -----------
    Y:          (nobs, ndim) data matrix, missing values denoted by NaN/Inf
    k:          number of singular values/vectors to find (default: k=ndim)
    tol:        convergence tolerance on change in trace norm
    maxiter:    maximum number of EM steps to perform (default: no limit)

    Returns:
    -----------
    Y_hat:      (nobs, ndim) reconstructed data matrix
    mu_hat:     (ndim,) estimated column means for reconstructed data
    U, s, Vt:   singular values and vectors (see np.linalg.svd and
                scipy.sparse.linalg.svds for details)
    """

    if k is None:
        svdmethod = partial(np.linalg.svd, full_matrices=False)

Lee Ho Yeung | 21 Jun 07:16 2016

recommenditembased gives an error

I am just trying to guess values in a table or matrix.

First, I do not know where the result file is.
Second, there seems to be an error.

vi rdata.txt
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1

hadoop-2.7.2/bin/hadoop fs -rm -r temp

mahout/bin/mahout recommenditembased -s similarity_euclidean_distance -i
/home/martin/Downloads/rdata.txt -o /home/martin/Downloads/output.txt
--numRecommendations 5

...
16/06/20 22:13:47 INFO AbstractJob: Command line arguments:
{--endPhase=[2147483647], --excludeSelfSimilarity=[true],
--input=[temp/preparePreferenceMatrix/ratingMatrix],
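
To make the goal of "guessing a value" concrete, here is a small single-machine sketch of item-based prediction on the rdata.txt triples above. It assumes one common convention for Euclidean similarity, 1 / (1 + distance) over co-rating users, and a similarity-weighted average for the prediction. It is not Mahout's job pipeline, its results need not match the recommenditembased output, and all function names are made up for the example.

```python
import math
from collections import defaultdict

# (user, item, rating) triples copied from rdata.txt above.
RATINGS = [
    (1, 1, 5), (1, 2, 4), (1, 3, 5),
    (2, 1, 4), (2, 2, 5), (2, 3, 4),
    (3, 1, 5), (3, 2, 4),
    (4, 1, 1), (4, 2, 2),
    (5, 1, 2), (5, 2, 1), (5, 3, 1),
]


def build_item_vectors(ratings):
    """Map each item to {user: rating}."""
    vectors = defaultdict(dict)
    for user, item, rating in ratings:
        vectors[item][user] = float(rating)
    return dict(vectors)


def euclidean_similarity(a, b):
    """Assumed convention: 1 / (1 + Euclidean distance over shared users)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[u] - b[u]) ** 2 for u in shared))
    return 1.0 / (1.0 + distance)


def predict(vectors, user, item):
    """Similarity-weighted average of the user's ratings on other items."""
    target = vectors.get(item, {})
    weighted, total = 0.0, 0.0
    for other, by_user in vectors.items():
        if other == item or user not in by_user:
            continue
        sim = euclidean_similarity(target, by_user)
        weighted += sim * by_user[user]
        total += sim
    return weighted / total if total else None


vectors = build_item_vectors(RATINGS)
for user in (3, 4):  # the two users with no rating for item 3
    print(user, 3, predict(vectors, user, 3))
```

Users 3 and 4 are the only ones without a rating for item 3, so those are the two cells the sketch fills in.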

Lee Ho Yeung | 21 Jun 07:13 2016

Where is the result file after running parallelALS?

I followed this example to guess missing values in a matrix:
https://mahout.apache.org/users/recommender/intro-als-hadoop.html

mahout/bin/mahout parallelALS --input /home/martin/Downloads/rdata.txt
--output /home/martin/Downloads/output.txt --lambda 0.1 --implicitFeedback
true --alpha 0.8 --numFeatures 2 --numIterations 5  --numThreadsPerSolver 1
--tempDir tmp

mahout/bin/mahout recommendfactorized --input
/home/martin/Downloads/rdata.txt --userFeatures /home/martin/Downloads/U/
--itemFeatures /home/martin/Downloads/M/ --numRecommendations 1 --output
recommendations --maxRating 1

vi rdata.txt
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1
Suneel Marthi | 13 Jun 17:03 2016

[ANNOUNCE] Apache Mahout 0.12.2 Release

The Apache Mahout PMC is pleased to announce the release of Mahout 0.12.2
which is a minor release following 0.12.1 in May 2016.

Mahout's goal is to create an environment for quickly creating machine
learning applications that scale and run on the highest performance
parallel computation engines available. Mahout comprises an interactive
environment and library that supports generalized scalable linear algebra
and includes many modern machine learning algorithms.

Having evaluated Java Swing based plotting capabilities, the project has
decided that currently the best direction for data visualization is to
integrate with Apache Zeppelin.

Mahout developers are working towards a Mahout Zeppelin interpreter for
visualization and other notebook capabilities [1].

Mahout 0.12.2 is a maintenance release over Mahout 0.12.1 mainly focused on
providing helper functions for Mahout-Zeppelin integration with a couple of
minor bug fixes [2].

   1. MAHOUT-1866 Add matrix-to-tsv string function.
   2. MAHOUT-1863 cluster-syntheticcontrol.sh errors out with "Input path does
      not exist".
   3. MAHOUT-1868 purge smile plotting from the codebase.

Suneel Marthi | 11 Jun 02:59 2016

[VOTE] Mahout 0.12.2 Release Candidate 2

This is the vote for release 0.12.2 of Apache Mahout.

The vote will be going for at least 72 hours and will be closed on Sunday,
June 12th, 2016 or once there are at least 3 PMC +1 binding votes
(whichever occurs earlier).  Please download, test and vote with

[ ] +1, accept RC as the official 0.12.2 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.12.2 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1025/

The git tag to be voted upon is mahout-0.12.2
Suneel Marthi | 10 Jun 23:12 2016

[VOTE] Apache Mahout 0.12.2 Release Candidate

This is the vote for release 0.12.2 of Apache Mahout.

The vote will be going for at least 72 hours and will be closed on Sunday,
June 12th, 2016 or once there are at least 3 PMC +1 binding votes
(whichever occurs earlier).  Please download, test and vote with

[ ] +1, accept RC as the official 0.12.2 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.12.2 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1024/

The git tag to be voted upon is mahout-0.12.2
forme book | 4 Jun 11:14 2016

mahout tf-idf vs lucene tf-idf

Hi,

I'm starting to study text processing, and I see that to compare two texts
it is possible to obtain a vector model through the TF-IDF technique.

With Mahout it is possible to create vectors from text using
lucene.vector, which, if I have understood correctly, takes a Lucene index
and maps it to TF-IDF vectors.

Lucene already has this implementation by default, so what I struggle to
understand is the advantage of having lucene.vector in Mahout when Lucene
offers that feature out of the box.

Maybe I'm missing something big, but what is the connection between them?
Could you please explain a possible use case?

Thanks for your help

Richard
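
As background for the question, the TF-IDF vector model for comparing two texts can be sketched in a few lines. The sketch uses one textbook weighting, tf * log(N / df), and cosine similarity. Lucene and Mahout each apply their own variants, so the numbers here are illustrative only, and the function names and sample texts are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical texts, for illustration only.
docs = [
    "mahout builds vectors",
    "lucene builds an index",
    "mahout clusters vectors",
]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1), cosine(v0, v2))
```

Once documents are in this form they are plain numeric vectors, which is the kind of input that clustering or classification code consumes.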
Trevor Grant | 1 Jun 16:47 2016

Location of JARs

I'm trying to refactor the Mahout dependency from the pom.xml of the Spark
interpreter (adding Mahout integration to Zeppelin)

Assuming MAHOUT_HOME is available, I see that the jars in source build live
in a different place than the jars in the binary distribution.

I'm to the point where I'm trying to come up with a good place to pick up
the required jars while allowing for:
1. flexibility in Mahout versions
2. Not writing a huge block of code designed to scan several conceivable
places throughout the file system.

One thought was to put the onus on the user to move the desired jars to a
local repo within the Zeppelin directory.

Wanted to open up to input from users and dev as I consider this.

Is documentation specifying which JARs need to be moved to a specific
directory, and the places you are likely to find them, too much to ask of users?

Other approaches?

For background, Zeppelin starts a Spark Shell and we need to make sure all
of the required Mahout jars get loaded in the class path when spark starts.
The question is where do all of these JARs relatively live.

Thanks for any feedback,
tg

Trevor Grant


Re: Clustering options


Actually, I was confused between MLlib and Samsara, thanks Pat. I was already reading the documentation of
MLlib. 

Cheers.

> On May 24, 2016, at 10:48, Pat Ferrel <pat <at> occamsmachete.com> wrote:
> 
> Mahout Samsara is more about rolling your own algo, though it has already implemented several as
> examples. If you want to build your own clustering you will find a lot of what you need in the R-like DSL.
> 
> But if you want something already built you may want to look at Spark’s MLlib kmeans.
> 
> People often ask: what is the difference between Mahout and MLlib? MLlib is a collection of algos, Mahout
> is an optimized tensor math engine with many extensions and several algos. You can’t do the matrix A’B
> in MLlib because it’s not an algo, it’s a bit of math, a very useful bit.
> 
> 
> On May 23, 2016, at 8:10 PM, FRANCISCO XAVIER SUMBA TORAL <xavier.sumba93 <at> ucuenca.ec> wrote:
> 
> Hi Dmitriy,
> 
> Thanks for your clarification.
> 
> Cheers.
> 
> 
>> On May 23, 2016, at 12:00, Dmitriy Lyubimov <dlieu.7 <at> gmail.com> wrote:
>> 
>> Xavier,

