Raviteja Lokineni | 21 Jul 16:42 2016

Text clustering how to?

Hi all,

I am pretty new to Apache Mahout and am trying to figure out how to do text
clustering. I was following the book Taming Text (Manning), but when I tried
to run Mahout as the book describes I stumbled upon a version incompatibility
with the latest Lucene indexes. I therefore opened:
https://issues.apache.org/jira/browse/MAHOUT-1876

It looks like the code responsible for what I need is in the legacy
MapReduce code. Is there any supported (not deprecated or legacy)
approach to achieve this?

Was wondering if someone would push / kick me in the right direction ☺.

Thanks,
-- 
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokineni <at> gmail.com

Melissa Warnkin | 18 Jul 21:36 2016

ApacheCon: Getting the word out internally

Dear Apache Enthusiast,

As you are no doubt already aware, we will be holding ApacheCon in
Seville, Spain, the week of November 14th, 2016. The call for papers
(CFP) for this event is now open, and will remain open until
September 9th.

The event is divided into two parts, each with its own CFP. The first
part of the event, called Apache Big Data, focuses on Big Data
projects and related technologies.

Website: http://events.linuxfoundation.org/events/apache-big-data-europe
CFP:
http://events.linuxfoundation.org/events/apache-big-data-europe/program/cfp

The second part, called ApacheCon Europe, focuses on the Apache
Software Foundation as a whole, covering all projects, community
issues, governance, and so on.

Website: http://events.linuxfoundation.org/events/apachecon-europe
CFP: http://events.linuxfoundation.org/events/apachecon-europe/program/cfp

ApacheCon is the official conference of the Apache Software
Foundation, and is the best place to meet members of your project and
other ASF projects, and strengthen your project's community.

If your organization is interested in sponsoring ApacheCon, contact Rich Bowen
at evp <at> apache.org. ApacheCon is a great place to find the brightest
developers in the world, and experts on a huge range of technologies.

jelmer | 23 Jun 15:47 2016

Scaling up Spark item similarity on big data sets

Hi,

I am trying to build a simple recommendation engine using Spark item
similarity (e.g. with
org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs).

Things work fine on a comparatively small dataset, but I am having difficulty
scaling it up.

The input I am using is CSV data containing 19.988.422 view item events
produced by 1.384.107 users, covering 5.135.845 distinct products.

The CSV data is stored on HDFS and is split over 15 files; consequently
the resulting RDD will have 15 partitions.

After tweaking some parameters I did manage to get the job to run without
running out of memory, but the job takes a very long time to run.

After running for 15 hours it is still stuck on:

org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
org.apache.mahout.sparkbindings.blas.AtA$.at_a_nongraph_mmul(AtA.scala:254)
org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:61)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:325)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:339)
org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:123)
org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:95)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:145)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:143)
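
To put the input sizes quoted above in perspective: 19,988,422 events from 1,384,107 users is about 14 events per user on average, but the number of item pairs a user contributes to a co-occurrence count grows roughly with the square of that user's history, so a small number of very active users can account for much of the work. The sketch below is plain Python for illustration only, not Mahout or Spark code. `cooccurrence_counts` and `max_items_per_user` are hypothetical names, and the cap simply keeps the first items in sorted order.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(events, max_items_per_user=None):
    """Count how often two items appear in the same user's history.

    events: iterable of (user, item) pairs.
    max_items_per_user: optional cap on items kept per user (hypothetical knob).
    """
    by_user = {}
    for user, item in events:
        by_user.setdefault(user, set()).add(item)

    counts = Counter()
    for items in by_user.values():
        kept = sorted(items)
        if max_items_per_user is not None:
            kept = kept[:max_items_per_user]
        # A user with n kept items contributes n * (n - 1) / 2 pairs.
        for a, b in combinations(kept, 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    events = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u2", "b")]
    print(sum(cooccurrence_counts(events).values()))
    print(sum(cooccurrence_counts(events, max_items_per_user=2).values()))
```

Comparing the two totals on a sample of the real data would show how much of the pair count comes from long histories.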

Lee Ho Yeung | 21 Jun 09:30 2016

HMM prediction does not output normal characters when predicting binary data

I took Yahoo's finance CSV, converted the values to binary numbers, and
separated each one and zero with a space.

I then saved the result as hmm-input and ran the example commands below.

However, the output numbers do not match the future values, and they are not
even decimal numbers.

mahout/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000
mahout/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 216

# Read the predicted symbols (whitespace-separated).
with open('/home/martin/Downloads/hmm-predictions', 'r') as content_file:
    content = content_file.read()

# Append the symbols to a second file, adding a space after every 8 symbols.
with open('/home/martin/Downloads/hmm-predictions2', 'a') as f:
    counter = 0
    for eachone in content.split():
        print(eachone)
        f.write(eachone)
        counter += 1
        if counter == 8:
            counter = 0
            print(' ')
            f.write(' ')
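
The script above regroups the predicted symbols into blocks of eight. A minimal sketch of the next step, turning each complete block of eight binary symbols into a decimal number, could look like the following. `symbols_to_numbers` is a hypothetical helper and assumes the prediction file holds whitespace-separated symbols. It rejects any symbol other than 0 or 1, because a model trained with `-no 4` has four observable states and may therefore emit symbols beyond 0 and 1; that is an assumption worth checking against the actual output.

```python
def symbols_to_numbers(text, width=8):
    """Convert whitespace-separated 0/1 symbols into integers, `width` bits each."""
    tokens = text.split()
    bad = sorted({t for t in tokens if t not in ("0", "1")})
    if bad:
        raise ValueError("non-binary symbols in prediction: " + ", ".join(bad))

    numbers = []
    # Only complete groups are converted; a trailing partial group is dropped.
    for start in range(0, len(tokens) - width + 1, width):
        bits = "".join(tokens[start:start + width])
        numbers.append(int(bits, 2))
    return numbers


if __name__ == "__main__":
    # Two full bytes followed by two leftover symbols.
    print(symbols_to_numbers("0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1"))  # [65, 2]
```

If the helper raises a ValueError on the real prediction file, the grouping into bytes cannot work as intended and the encoding of the input needs another look.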
Lee Ho Yeung | 21 Jun 07:20 2016

Which function can guess missing values in a matrix, like this SVD?

Is there an SVD function like the one below that guesses missing
floating-point values in a matrix?
If there is such a function, how do I use it, and where is the result
stored?

import numpy as np
from scipy.sparse.linalg import svds
from functools import partial

def emsvd(Y, k=None, tol=1E-3, maxiter=None):
    """
    Approximate SVD on data with missing values via expectation-maximization

    Inputs:
    -----------
    Y:          (nobs, ndim) data matrix, missing values denoted by NaN/Inf
    k:          number of singular values/vectors to find (default: k=ndim)
    tol:        convergence tolerance on change in trace norm
    maxiter:    maximum number of EM steps to perform (default: no limit)

    Returns:
    -----------
    Y_hat:      (nobs, ndim) reconstructed data matrix
    mu_hat:     (ndim,) estimated column means for reconstructed data
    U, s, Vt:   singular values and vectors (see np.linalg.svd and
                scipy.sparse.linalg.svds for details)
    """

    if k is None:
        svdmethod = partial(np.linalg.svd, full_matrices=False)
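
The listing above is cut off, so what follows is a separate, simplified sketch of the same general idea rather than a completion of `emsvd`: fill the missing entries with column means, take a rank-k SVD, overwrite only the missing entries with the reconstruction, and repeat until the fill-in stops changing. `impute_low_rank` and its defaults are hypothetical; it uses only NumPy and, unlike the docstring above, does no mean-centering.

```python
import numpy as np


def impute_low_rank(Y, k=1, tol=1e-6, maxiter=500):
    """Fill NaN entries of Y using repeated rank-k SVD reconstructions."""
    Y = np.asarray(Y, dtype=float)
    missing = np.isnan(Y)

    # Initial guess: column means of the observed entries.
    col_means = np.nanmean(Y, axis=0)
    filled = np.where(missing, col_means, Y)

    for _ in range(maxiter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :k] * s[:k]) @ Vt[:k]
        # Observed entries are always kept; only the gaps are updated.
        updated = np.where(missing, approx, Y)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change < tol:
            break
    return filled


if __name__ == "__main__":
    Y = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [3.0, 6.0, np.nan]])
    print(impute_low_rank(Y, k=1))
```

The result is simply the returned array; nothing is written to disk unless the caller saves it, for example with `np.savetxt`.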

Lee Ho Yeung | 21 Jun 07:16 2016

recommenditembased has error

I am just trying to guess missing values in a table or matrix.

First, I do not know where the result file is.
Second, there seems to be an error.

vi rdata.txt
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1

hadoop-2.7.2/bin/hadoop fs -rm -r temp

mahout/bin/mahout recommenditembased -s similarity_euclidean_distance -i
/home/martin/Downloads/rdata.txt -o /home/martin/Downloads/output.txt
--numRecommendations 5

...
16/06/20 22:13:47 INFO AbstractJob: Command line arguments:
{--endPhase=[2147483647], --excludeSelfSimilarity=[true],
--input=[temp/preparePreferenceMatrix/ratingMatrix],
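
As a rough illustration of what an item-based recommender with a Euclidean-distance similarity does with the ratings in rdata.txt, here is a minimal pure-Python sketch. It is not Mahout's implementation: the similarity formula 1 / (1 + distance), the weighted-average prediction, and all helper names are assumptions chosen for simplicity, and Mahout's own scoring may differ.

```python
import math
from collections import defaultdict

# Same user,item,rating triples as rdata.txt above.
RATINGS = """1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1"""


def load(text):
    """Return {item: {user: rating}} from comma-separated lines."""
    by_item = defaultdict(dict)
    for line in text.strip().splitlines():
        user, item, rating = line.split(",")
        by_item[int(item)][int(user)] = float(rating)
    return by_item


def similarity(ratings_a, ratings_b):
    """Euclidean-distance similarity over users who rated both items."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dist = math.sqrt(sum((ratings_a[u] - ratings_b[u]) ** 2 for u in common))
    return 1.0 / (1.0 + dist)


def predict(by_item, user, item):
    """Similarity-weighted average of the user's ratings for other items."""
    num = den = 0.0
    for other, ratings in by_item.items():
        if other == item or user not in ratings:
            continue
        sim = similarity(by_item[item], ratings)
        num += sim * ratings[user]
        den += sim
    return num / den if den else None


if __name__ == "__main__":
    by_item = load(RATINGS)
    # Users 3 and 4 have not rated item 3.
    for user in (3, 4):
        print(user, 3, predict(by_item, user, 3))
```

Each prediction falls between the two ratings that user has already given, which is what a weighted average of known ratings would be expected to produce.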

Lee Ho Yeung | 21 Jun 07:13 2016

Where is the result file after running parallelALS?

I followed this example to guess missing values in a matrix:
https://mahout.apache.org/users/recommender/intro-als-hadoop.html

mahout/bin/mahout parallelALS --input /home/martin/Downloads/rdata.txt
--output /home/martin/Downloads/output.txt --lambda 0.1 --implicitFeedback
true --alpha 0.8 --numFeatures 2 --numIterations 5  --numThreadsPerSolver 1
--tempDir tmp

mahout/bin/mahout recommendfactorized --input
/home/martin/Downloads/rdata.txt --userFeatures /home/martin/Downloads/U/
--itemFeatures /home/martin/Downloads/M/ --numRecommendations 1 --output
recommendations --maxRating 1

vi rdata.txt
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1
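
For readers who want to see what the factorization step does with these ratings, below is a small pure-Python sketch of alternating least squares. It takes `--numFeatures 2`, `--lambda 0.1` and `--numIterations 5` from the command above, but it is only an approximation of the idea: it uses the plain explicit-feedback objective with ordinary L2 regularization, whereas the command above runs with `--implicitFeedback true`, so the numbers are not comparable to Mahout's output. All function names are hypothetical.

```python
import random

# (user, item, rating) triples from rdata.txt above.
RATINGS = [(1, 1, 5), (1, 2, 4), (1, 3, 5), (2, 1, 4), (2, 2, 5), (2, 3, 4),
           (3, 1, 5), (3, 2, 4), (4, 1, 1), (4, 2, 2), (5, 1, 2), (5, 2, 1),
           (5, 3, 1)]
LAMBDA = 0.1      # regularization weight
ITERATIONS = 5    # number of alternating sweeps


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def solve_2x2(a11, a12, a22, b1, b2):
    """Solve the symmetric system [[a11, a12], [a12, a22]] x = [b1, b2]."""
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def update(fixed, ratings_by_row):
    """Regularized least-squares update of one side with the other side fixed."""
    result = {}
    for row, pairs in ratings_by_row.items():
        a11 = a22 = LAMBDA
        a12 = b1 = b2 = 0.0
        for col, r in pairs:
            f1, f2 = fixed[col]
            a11 += f1 * f1
            a12 += f1 * f2
            a22 += f2 * f2
            b1 += f1 * r
            b2 += f2 * r
        result[row] = solve_2x2(a11, a12, a22, b1, b2)
    return result


def loss(users, items):
    """Squared error on known ratings plus the L2 penalty on all factors."""
    err = sum((r - dot(users[u], items[i])) ** 2 for u, i, r in RATINGS)
    factors = list(users.values()) + list(items.values())
    return err + LAMBDA * sum(x * x for f in factors for x in f)


def train(seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in RATINGS:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    # Random starting item factors; user factors come from the first update.
    items = {i: (rng.random(), rng.random()) for i in by_item}
    users = {}
    for _ in range(ITERATIONS):
        users = update(items, by_user)
        items = update(users, by_item)
    return users, items


if __name__ == "__main__":
    users, items = train()
    # Ratings missing from rdata.txt: users 3 and 4 on item 3.
    for u in (3, 4):
        print(u, 3, dot(users[u], items[3]))
```

In this sketch the "result" is just the two dictionaries of factors held in memory; a predicted rating is the dot product of a user's factors and an item's factors.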
Suneel Marthi | 13 Jun 17:03 2016

[ANNOUNCE] Apache Mahout 0.12.2 Release

The Apache Mahout PMC is pleased to announce the release of Mahout 0.12.2
which is a minor release following 0.12.1 in May 2016.

Mahout's goal is to create an environment for quickly creating machine
learning applications that scale and run on the highest performance
parallel computation engines available. Mahout comprises an interactive
environment and library that supports generalized scalable linear algebra
and includes many modern machine learning algorithms.

Having evaluated Java Swing based plotting capabilities, the project has
decided that, currently, the best direction for data visualization is to
integrate with Apache Zeppelin.

Mahout developers are working towards a Mahout Zeppelin interpreter for
visualization and other notebook capabilities [1].

Mahout 0.12.2 is a maintenance release over Mahout 0.12.1 mainly focused on
providing helper functions for Mahout-Zeppelin integration with a couple of
minor bug fixes [2].

   1. MAHOUT-1866 Add matrix-to-tsv string function.
   2. MAHOUT-1863 cluster-syntheticcontrol.sh errors out with "Input path does
      not exist".
   3. MAHOUT-1868 purge smile plotting from the codebase.

Suneel Marthi | 11 Jun 02:59 2016

[VOTE] Mahout 0.12.2 Release Candidate 2

This is the vote for release 0.12.2 of Apache Mahout.

The vote will run for at least 72 hours and will be closed on Sunday,
June 12th, 2016 or once there are at least 3 PMC +1 binding votes
(whichever occurs earlier).  Please download, test and vote with

[ ] +1, accept RC as the official 0.12.2 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.12.2 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1025/

The git tag to be voted upon is mahout-0.12.2
Suneel Marthi | 10 Jun 23:12 2016

[VOTE] Apache Mahout 0.12.2 Release Candidate

This is the vote for release 0.12.2 of Apache Mahout.

The vote will run for at least 72 hours and will be closed on Sunday,
June 12th, 2016 or once there are at least 3 PMC +1 binding votes
(whichever occurs earlier).  Please download, test and vote with

[ ] +1, accept RC as the official 0.12.2 release of Apache Mahout
[ ] +0, I don't care either way,
[ ] -1, do not accept RC as the official 0.12.2 release of Apache Mahout,
because...

Maven staging repo:

https://repository.apache.org/content/repositories/orgapachemahout-1024/

The git tag to be voted upon is mahout-0.12.2
