Miteshkumar Patel | 24 Jun 22:42 2015

Provide name-to-ID mapping in SpectralKMeansDriver.java example

Hi All,
I'm new to Mahout. I'm using the SpectralKMeansDriver.java example provided in the Mahout library. The
code works fine. 
After the code executes, I want to find which IDs belong to which clusters.
I'm using the 'mahout clusterdump' command to dump the clusters, and it dumps the centroid, radius and
points, but I can't find the information about which ID belongs to which cluster.
I found the snippet below in the code example 'SpectralKMeansDriver.java', which I presume is looking
for the input mapping file, but how can I provide this from the command line?

    // Restore name to id mapping and read through the cluster assignments
    Path mappingPath = new Path(new Path(conf.get("hadoop.tmp.dir")), "generic_input_mapping");
    List<String> mapping = new ArrayList<>();
    FileSystem fs = FileSystem.get(mappingPath.toUri(), conf);
    if (fs.exists(mappingPath)) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, mappingPath, conf);
      Text mappingValue = new Text();
      IntWritable mappingIndex = new IntWritable();
      while (reader.next(mappingIndex, mappingValue)) {
        String s = mappingValue.toString();
        mapping.add(s);
      }
      HadoopUtil.delete(conf, mappingPath);
    } else {
      log.warn("generic input mapping file not found!");
    }
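
For what it's worth, reading the mapping file on its own seems straightforward; a minimal
standalone reader would be roughly the following (a sketch only: I am assuming the file is
still under hadoop.tmp.dir and has not yet been deleted by the driver):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpInputMapping {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same location the driver uses; adjust if hadoop.tmp.dir differs.
    Path mappingPath = new Path(new Path(conf.get("hadoop.tmp.dir")), "generic_input_mapping");
    FileSystem fs = FileSystem.get(mappingPath.toUri(), conf);
    List<String> mapping = new ArrayList<String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mappingPath, conf);
    try {
      IntWritable index = new IntWritable();
      Text name = new Text();
      while (reader.next(index, name)) {
        // Row i of the input matrix corresponds to mapping.get(i).
        mapping.add(name.toString());
        System.out.println(index.get() + " -> " + name);
      }
    } finally {
      reader.close();
    }
  }
}
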
Would appreciate any pointers.
Regards,
Mitesh
Pat Ferrel | 24 Jun 15:39 2015

Re: FP-Growth deprecated

What do you mean by MR?

On Jun 22, 2015, at 11:22 PM, guo.weizhan <guo.weizhan <at> gmail.com> wrote:

We are using MR

On Jun 22, 2015, at 8:37 AM, Pat Ferrel <pat <at> occamsmachete.com> wrote:

What is your application? 

On Jun 17, 2015, at 7:06 AM, guo weizhan <guo.weizhan <at> gmail.com> wrote:

Hi All, 

I found that FP-Growth has been deprecated since 0.8, but we want this algorithm
to do the association analysis. Do I have to use the old version, or is
there any other association analysis I can use in the latest version?

Thanks, 
Guo 

James Donnelly | 19 Jun 11:35 2015

Realtime update of similarity matrices

Hi,

First of all, a big thanks to Ted and Pat, and all the authors and
developers around Mahout.

I'm putting together an eCommerce recommendation framework, and have a
couple of questions from using the latest tools in Mahout 1.0.

I've seen it hinted by Pat that real-time updates (incremental learning)
are made possible with the latest Mahout tools here:

http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/

But once I have gone through the first phase of data processing, I'm not
clear on the basic direction for maintaining the generated data, e.g. with
added products and incremental user behaviour data.

The only way I can see is to update my input data, then re-run the entire
process of generating the similarity matrices using the itemSimilarity and
rowSimilarity jobs. Is there a better way?
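
My reading of the blog post is that the matrices stay batch-computed, and the realtime
part comes from querying the search index with the user's most recent history. Something
roughly like this, I imagine (a sketch only: the Solr URL, core name and 'indicators'
field are made-up placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class RealtimeRecommendQuery {
  public static void main(String[] args) throws Exception {
    // The user's most recent behaviour, not yet in the batch-built matrix.
    String recentItemIds = "item42 item7 item99";
    // Search the indexed item-similarity "indicators" with that history;
    // the ranked hits come back as the realtime recommendations.
    String q = URLEncoder.encode("indicators:(" + recentItemIds + ")", "UTF-8");
    URL url = new URL("http://localhost:8983/solr/items/select?q=" + q + "&wt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line);
    }
    in.close();
  }
}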

James

Cannot Compile Mahout Math Module

Hi,

The Mahout Math module has compilation issues. Does anyone know the root
cause?

Thanks

Information:Using javac 1.8.0_45 to compile java sources
Information:java: Errors occurred while compiling module 'mahout-math'
Information:6/17/15 10:43 AM - Compilation completed with 100 errors and 0
warnings in 6s 7ms
/home/prasad/SrcCodes/apache-mahout-distribution-0.10.1/math/src/main/java/org/apache/mahout/math/Sorting.java
Error:(24, 39) java: cannot find symbol
  symbol:   class ByteComparator
  location: package org.apache.mahout.math.function
Error:(25, 39) java: cannot find symbol
  symbol:   class CharComparator
  location: package org.apache.mahout.math.function
Error:(26, 39) java: cannot find symbol
  symbol:   class DoubleComparator
  location: package org.apache.mahout.math.function
Error:(27, 39) java: cannot find symbol
  symbol:   class FloatComparator
  location: package org.apache.mahout.math.function
Error:(28, 39) java: cannot find symbol
  symbol:   class IntComparator
  location: package org.apache.mahout.math.function
Error:(29, 39) java: cannot find symbol
  symbol:   class LongComparator
  location: package org.apache.mahout.math.function

guo weizhan | 17 Jun 16:06 2015

FP-Growth deprecated

Hi All,

I found that FP-Growth has been deprecated since 0.8, but we want this algorithm
to do the association analysis. Do I have to use the old version, or is
there any other association analysis I can use in the latest version?

Thanks,
Guo
Gustavo Frederico | 12 Jun 03:59 2015

Mahout recommender scalability, hardware architecture

  I'm trying to understand if and how the Mahout recommender can be used in
production for an e-commerce site. My general question is: how does it
scale? I know I would need to make some assumptions, but roughly speaking,
how would it scale as a function of the number of products in the catalogue,
the number of shoppers, and page requests?
  I watched Ellen Friedman's presentation at
http://chariotsolutions.com/screencast/philly-ete-34-build-recommender-best-practices-apache-mahout-solr-ellen-friedman/.
It's interesting to see Solr there for offline training, but, given a
certain catalogue size, number of shoppers, and average daily page requests, can
I aim to keep everything behind the firewall, or will I have to resort to a hosted,
cloud-based architecture with on-demand provisioning of servers, etc.? For
example, for a catalogue of 10K products, 40K registered shoppers, and 3K
average daily page views (over all pages), what would be a feasible
architecture, and why?

Thanks

Gustavo Frederico
Patrice Seyed | 11 Jun 22:53 2015
Picon

populating and serializing large sparse matrices

Hi,

I'm looking for a good solution to populate and serialize a large
sparse matrix using Mahout or related libraries. I noticed
SparseMatrix is not serializable when I considered serializing this
Java object to a file. In an experiment serializing out to a sequence
file, after about 500k rows of my ~3M-row matrix (avg ~20 non-zero
columns per row), the sequence file was already taking about 115 GB of
disk space. Lucene is another idea, but it has a similar demand on disk space.
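
For reference, the sequence-file experiment looked roughly like this (a sketch: the
path, dimensions and population loop are placeholders; I also wonder whether converting
rows to SequentialAccessSparseVector before writing would change the picture):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteSparseRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/tmp/matrix.seq")), // placeholder path
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(VectorWritable.class));
    try {
      int numRows = 3000000;
      int numCols = 1000000; // placeholder cardinality
      for (int row = 0; row < numRows; row++) {
        Vector v = new RandomAccessSparseVector(numCols);
        // ... populate ~20 non-zeros per row here ...
        // Sequential-access form is the more compact one on disk.
        writer.append(new IntWritable(row),
            new VectorWritable(new SequentialAccessSparseVector(v)));
      }
    } finally {
      writer.close();
    }
  }
}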

Are there more efficient ways of serializing a matrix to disk? Is
there something akin to Python's ndarray? (which I have noticed
handles population and serialization of quite large sparse matrices well)

DistributedRowMatrix was mentioned to me, but 1) does it suit my use
case? Its constructor takes a sequence file as an argument (the
generation of which is exactly what I am having trouble with), and 2)
there is no method for accessing a row at an index, which I would need.

Thanks in advance for any suggestions,
Best,
Patrice

Raghuveer | 11 Jun 12:28 2015

Building Mahout Source

I took the mahout-0.10.0 source and built it using "mvn install -DskipTests". I don't see any jar in the root target
folder or the distribution target folder, nor in the local Maven repository under
~/.m2/repository/org/apache/mahout. I also don't see version 0.10.0 in the central Maven repo yet. Kindly
suggest where I should be looking, or how I can get the mentioned version.

Mihai Dascalu | 10 Jun 21:58 2015

Runtime Interner Exception

Hi!

After upgrading to Mahout 0.10.1, I have a runtime exception in the following Hadoop code in which I create
the input matrix for performing SSVD:

// prepare output matrix
81:		final Configuration conf = new Configuration();

83:		SequenceFile.Writer writer = SequenceFile.createWriter(conf,
				Writer.file(new Path(path + "/" + outputFileName)),
				Writer.keyClass(IntWritable.class),
				Writer.valueClass(VectorWritable.class));

while the console output shows:
...
[Loaded org.apache.hadoop.util.StringInterner from file:/Users/mihaidascalu/Dropbox%20(Personal)/Workspace/Eclipse/ReaderBenchDev/lib/Mahout/mahout-mr-0.10.1-job.jar]
...
java.lang.VerifyError: (class: com/google/common/collect/Interners, method: newWeakInterner
signature: ()Lcom/google/common/collect/Interner;) Incompatible argument to function
	at org.apache.hadoop.util.StringInterner.<clinit>(StringInterner.java:48)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2293)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2185)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2102)
	at org.apache.hadoop.conf.Configuration.get(Configuration.java:851)
	at org.apache.hadoop.io.SequenceFile.getDefaultCompressionType(SequenceFile.java:234)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:264)
	at services.semanticModels.LSA.CreateInputMatrix.parseCorpus(CreateInputMatrix.java:83)
	at services.semanticModels.LSA.CreateInputMatrix.main(CreateInputMatrix.java:197)

Any suggestions? I tried adding guava-14.0.1.jar as a dependency, but it did not fix the problem.
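
In case it helps with debugging, a quick way I know of to see which jar the conflicting
Guava class is actually loaded from (a sketch):

import com.google.common.collect.Interners;

public class WhichGuava {
  public static void main(String[] args) {
    // Prints the jar Interners was loaded from, which should reveal
    // the Guava version winning on the classpath.
    System.out.println(Interners.class.getProtectionDomain()
        .getCodeSource().getLocation());
  }
}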

Petr Shestov | 9 Jun 16:02 2015

Interpretation of SVDRecommender

Hi all!
I'm trying to build an SVDRecommender using Mahout. The code is simple:
DataModel model = new FileDataModel(new File("data.csv"));
SVDRecommender recommender = new SVDRecommender(model, new SVDPlusPlusFactorizer(model, 10, 20));
All my ratings are doubles between 0 and 1, yet the recommender in most cases predicts values above 1. How
can that happen? Is it a feature of SVD-based algorithms, since only an approximate decomposition is created?
The same problem occurs with ALSWRFactorizer.
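
For now I am clamping the estimates back into range myself, roughly like this (a sketch;
nothing Mahout-specific beyond the standard Taste Recommender interface):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.Recommender;

public final class ClampedEstimates {
  // Clamp a predicted preference back into the [0, 1] rating range.
  public static float clampedEstimate(Recommender recommender, long userID, long itemID)
      throws TasteException {
    float raw = recommender.estimatePreference(userID, itemID);
    return Math.max(0f, Math.min(1f, raw));
  }
}

But this feels like a hack, so I'd like to understand the underlying behaviour.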
Thanks!

Bhaskar Bagchi | 5 Jun 08:51 2015

Query about recommendation evaluator.

Hi,

I was working with the GenericRecommenderIRStatsEvaluator when I noticed
that the GenericRelevantItemsDataSplitter class only removes the *good
user preferences* of the user for which the evaluation is being run, keeps
all the other data points, and rebuilds the training data model for every
user separately before evaluation. This makes the loop O(n^2).

Why don't we make a single split of the data using the percentage provided by
the user and build a single recommender model from that split, which can
be used to evaluate all the users? This would make the evaluator considerably faster.
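
What I have in mind is roughly the following (a sketch: the seed is arbitrary, and
users left with no training preferences are simply dropped):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public final class SingleSplit {
  // Keep each preference with probability trainingPercentage, once,
  // for all users, and build one training model from the result.
  public static DataModel trainingModel(DataModel model, double trainingPercentage)
      throws TasteException {
    Random random = new Random(42L); // arbitrary seed
    FastByIDMap<PreferenceArray> training = new FastByIDMap<PreferenceArray>();
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      List<Preference> kept = new ArrayList<Preference>();
      for (Preference pref : model.getPreferencesFromUser(userID)) {
        if (random.nextDouble() < trainingPercentage) {
          kept.add(pref);
        }
      }
      if (!kept.isEmpty()) {
        training.put(userID, new GenericUserPreferenceArray(kept));
      }
    }
    return new GenericDataModel(training);
  }
}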

Can anyone help me with making a single data split for evaluation?

--
Thanks and Regards

Bhaskar Bagchi
Data Science Intern
TinyOwl
