Erdem Sahin | 4 May 16:05 2015

SGD IndexOutOfBoundsException

Hi Mahout users,

We are trying to train an SGD model with 2 classes and are running into an
IndexOutOfBoundsException.

It looks like the exception happens after TrainNewsGroups.java calls
learningAlgorithm.close(), while it is executing SGDHelper.dissect().

Any help would be appreciated.

Many thanks,
Erdem Sahin

Model Dissection
body=i    -0.0    nowifi    1.0    -0.023956501144642588    2.0    -0.023956501144642588
body=have    -0.0    nowifi    1.0    -0.016857144357448513    2.0    -0.016857144357448513
body=you    -0.0    nowifi    1.0    -0.01673702475723127    2.0    -0.01673702475723127
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at org.apache.mahout.classifier.sgd.SGDHelper.dissect(SGDHelper.java:73)
    at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:130)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
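
An "Index: 2, Size: 2" here typically means a loop is asking a 2-entry list for a
third element, e.g. code written with the 20 newsgroup categories in mind walking
a label list that only has 2 entries for a 2-class model. A minimal sketch of that
failure pattern (hypothetical labels and count, not the actual SGDHelper source):

    import java.util.Arrays;
    import java.util.List;

    public class DissectBoundsSketch {
      public static void main(String[] args) {
        List<String> labels = Arrays.asList("havewifi", "nowifi"); // only 2 classes
        int categoriesToShow = 3; // hard-coded with more categories in mind
        for (int i = 0; i < categoriesToShow; i++) {
          // Throws IndexOutOfBoundsException (Index: 2, Size: 2) at i == 2.
          System.out.println(labels.get(i));
        }
      }
    }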
(Continue reading)

Shushant Arora | 28 Apr 19:31 2015

default no of reducers

In a normal MR job, can I configure (cluster-wide) a default number of reducers
that applies if I don't specify any reducer count in my job?
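
If memory serves, the relevant knob is mapreduce.job.reduces: set it in
mapred-site.xml for a cluster-wide default (it otherwise falls back to 1), and
override it per job in code or with -D on the command line. A minimal sketch,
assuming Hadoop 2.x:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerDefaultSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads mapred-site.xml from the classpath
        Job job = Job.getInstance(conf, "example");
        // With no setNumReduceTasks() call, the cluster-wide default
        // (mapreduce.job.reduces) applies.
        System.out.println("reducers: " + job.getNumReduceTasks());
      }
    }
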
Mihai Dascalu | 28 Apr 09:25 2015

Problems running SSVD directly from Java

Hi!

I’ve been experimenting with the SSVDSolver and unfortunately I encounter this error at runtime:

10648576 [Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner  - job_local1958711697_0001
java.lang.NoClassDefFoundError: org/apache/commons/httpclient/HttpMethod
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:546)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.HttpMethod
	at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 1 more

Exception in thread "Thread-13" java.lang.NoClassDefFoundError: org/apache/commons/httpclient/HttpMethod
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:562)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.HttpMethod
	at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 1 more 
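
A NoClassDefFoundError for org/apache/commons/httpclient/HttpMethod means the
Apache commons-httpclient 3.x jar isn't on the runtime classpath; adding
commons-httpclient:commons-httpclient:3.1 to the project dependencies is the
usual fix. A small probe to confirm before launching the job (hypothetical
class name):

    public class HttpClientProbe {
      public static void main(String[] args) {
        try {
          // The same class the stack trace complains about.
          Class.forName("org.apache.commons.httpclient.HttpMethod");
          System.out.println("commons-httpclient is on the classpath");
        } catch (ClassNotFoundException e) {
          System.out.println("commons-httpclient 3.x is missing from the classpath");
        }
      }
    }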

The actual invocation is:
(Continue reading)

Erdem Sahin | 28 Apr 01:57 2015

trainnb labelindex not found error - help requested

Hi Mahout users,

I'm trying to run the classify-20newsgroups.sh script
<https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh>
and it fails with a FileNotFoundException when it gets to the "trainnb"
command. All prior steps run successfully. This happens with both algorithm 1
and algorithm 2.

I have modified the script slightly so that it reads my input data instead
of the canonical data set. I've created a "wifidata" folder on the local FS
with the following structure:
wifidata/havewifi
wifidata/nowifi

and within havewifi and nowifi there are files with plain-text names and
plain-text content. These eventually get copied to HDFS.

I'm not clear if the "labelindex" file, which cannot be found, is supposed
to be created by trainnb or by a prior step.
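
For reference, in the stock script the labelindex is written by trainnb itself:
the step passes -el (extract labels), roughly like this (paths are placeholders
and flags quoted from memory):

    ./bin/mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow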

Please see the details of the modified script and the error below. Any help
would be appreciated.

Thanks and best regards,
Erdem Sahin

Script:

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News
(Continue reading)

lastarsenal | 27 Apr 13:32 2015

Hadoop SSVD OutOfMemory Problem

Hi, All,

     Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). There's a Java heap space
OutOfMemory problem in ABtDenseOutJob. I found the reason; the ABtDenseOutJob map code is below:

    protected void map(Writable key, VectorWritable value, Context context)
      throws IOException, InterruptedException {

      Vector vec = value.get();

      int vecSize = vec.size();
      if (aCols == null) {
        aCols = new Vector[vecSize];
      } else if (aCols.length < vecSize) {
        aCols = Arrays.copyOf(aCols, vecSize);
      }

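      // Accumulate the incoming rows into per-column vectors: one Vector per
      // input column stays in memory for the lifetime of the mapper.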
      if (vec.isDense()) {
        for (int i = 0; i < vecSize; i++) {
          extendAColIfNeeded(i, aRowCount + 1);
          aCols[i].setQuick(aRowCount, vec.getQuick(i));
        }
      } else if (vec.size() > 0) {
        for (Vector.Element vecEl : vec.nonZeroes()) {
          int i = vecEl.index();
          extendAColIfNeeded(i, aRowCount + 1);
          aCols[i].setQuick(aRowCount, vecEl.get());
        }
      }
      aRowCount++;
(Continue reading)

shaikhah otaibi | 24 Apr 18:02 2015

item-item similarity

Hello

I have a problem with finding similarities between items in Mahout: most of
the similarity values are NaN.
In my work, I want to calculate the similarity between research papers that
users bookmark in their libraries. Those users are connected through three
different implicit social networks based on their bookmarking behavior in
CiteULike. I now want to show which social network connects the most similar
users (user-based), or show that those connected users share similar
information (item-based), so I need to compute the similarities between
users. Since there is no explicit rating, I tried Loglikelihood and Tanimoto
in Mahout, but I am getting lots of NaN values with both the user-based and
the item-based approaches. I am not sure my code is 100% correct since I am
new to Mahout, especially the item-based part, where I am not sure whether
Mahout builds the inverted matrix (the item-item matrix) itself.

I tried to build the model using:

DataModel model = new GenericBooleanPrefDataModel(
    GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("FILENAME.csv"))));

Then I calculate the similarity using:

ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);

Then I use the list of paper IDs to go through the matrix and print the
similarities. For instance, if I have paper IDs 1, 2, 3, 4, etc., I print the
similarity between paper 1 and paper 2 as:

System.out.println("item similarity:" + similarity.itemSimilarity(1, 2));
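
If I remember the Taste implementation correctly, TanimotoCoefficientSimilarity
returns NaN when the two items have no users in common (and when an item has no
preferences at all), so many NaNs usually point to sparse overlap rather than a
bug. A self-contained sketch of the setup above with a NaN guard (file and class
names are hypothetical):

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class PaperSimilarityCheck {
      public static void main(String[] args) throws Exception {
        // Boolean (no-rating) preferences built from user,paper pairs.
        DataModel model = new GenericBooleanPrefDataModel(
            GenericBooleanPrefDataModel.toDataMap(
                new FileDataModel(new File("FILENAME.csv"))));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        double s = similarity.itemSimilarity(1L, 2L);
        if (Double.isNaN(s)) {
          System.out.println("papers 1 and 2 share no users; similarity undefined");
        } else {
          System.out.println("item similarity: " + s);
        }
      }
    }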
(Continue reading)

mw | 21 Apr 12:37 2015

SparseVectorsFromSequenceFiles tfidf fail

Hello,

I am trying to get tfidf vectors from a corpus of 100k documents. I
noticed that the tfidf sequence file is empty, while the tf vectors are not.

Here is the log from SparseVectorsFromSequenceFiles:

INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: 
Maximum n-gram size is: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: 
Minimum LLR value: 1.0
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Number 
of reduce tasks: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: 
Tokenizing documents in /opt/seq
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: 
Creating Term Frequency Vectors
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: 
Calculating IDF
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Pruning

Here is the tfidf output dir:

root@test:[/opt/sparse/tfidf-vectors] # ll
total 20K
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
-rw-r--r-- 1 tomcat7 tomcat7   90 Apr 21 12:27 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7   12 Apr 21 12:27 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
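
The 90-byte part-r-00000 is roughly the size of a bare SequenceFile header,
which suggests the tfidf pass wrote no records at all (over-aggressive
document-frequency pruning, e.g. a very low --maxDFPercent, is one common
cause). A quick probe to count the records (hypothetical class name, Hadoop 2
reader API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    public class CountTfidfVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path part = new Path("/opt/sparse/tfidf-vectors/part-r-00000");
        int n = 0;
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
          Text key = new Text();
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {
            n++; // each record is one document's tfidf vector
          }
        }
        System.out.println(n + " tfidf vectors found");
      }
    }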
(Continue reading)

Pat Ferrel | 16 Apr 18:15 2015

Re: mahout 1.0 on EMR with spark item-similarity

OK, this is cool. Almost there!

In order to answer the question you have to decide how you will persist the indicators. The output of
spark-itemsimilarity can be indexed directly but requires phrase searches on text fields, and it
requires the item-id strings to be single tokens so they aren’t broken up by the analyzer used for Solr
phrase queries. The better way is to store the indicators as multi-valued fields in Solr or a DB. Many
ways to slice this.

If you want to index the output of Mahout directly, we will assume the item IDs are single tokens containing
no spaces, periods, commas, or other punctuation that would break a phrase, so we can encode the user
history as a single string of space-separated item tokens.

To do a query with something like “ipad iphone” we’ll need to set up Solr like this, which is a bit of a
guess since I use a DB, not the raw output files:

    <fieldType name="indicator" class="solr.TextField" omitNorms="false"/> <!-- NOTICE: NO TOKENIZER OR ANALYZER USED -->
    <field name="purchase" stored="true" type="indicator" multiValued="true" indexed="true"/>

“OR” is the default query operator, so unless you’ve messed with that it should be fine. You need that
because if you have multiple fields in the future you want them ORed as well as the terms. The query would
be something like:

q=purchase:("iphone ipad")

So you are applying the items the user has purchased only to the purchase indicator field. As you add
cross-cooccurrence actions the fieldType will stay the same and you will add a field for “views”.

q=purchase:("iphone ipad") view:("history of items viewed")
And so on. 
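
For completeness, the same query from Java via SolrJ might look like the sketch
below (assuming SolrJ 4.x, where the client class is HttpSolrServer, and a
hypothetical core URL):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class IndicatorQuerySketch {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // The user's purchase history as one space-delimited string of item tokens.
        SolrQuery q = new SolrQuery("purchase:(\"iphone ipad\")");
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id"));
        }
      }
    }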
(Continue reading)

Pat Ferrel | 14 Apr 20:26 2015

Error in AtA

Running cooccurrence calc on Spark 1.1.1 Hadoop 2.4 with Yarn

This isn’t a problem in our code, is it? I’ve never run into a TaskResultLost; not sure what can cause that.

TaskResultLost (result lost from block manager)

nivea.m [11:01 AM]
collect at AtA.scala:121    97/213 (25 failed)

nivea.m [11:01 AM]
org.apache.spark.rdd.RDD.collect(RDD.scala:774)
org.apache.mahout.sparkbindings.blas.AtA$.at_a_slim(AtA.scala:121)
org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:50)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:231)
org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:242)
org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:108)
org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:40)
org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:90)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:129)
org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:127)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
scala.collection.AbstractIterator.toList(Iterator.scala:1157)
(Continue reading)

Suneel Marthi | 12 Apr 12:48 2015

Apache Mahout 0.10.0 Released

The Apache Mahout PMC is pleased to announce the release of Mahout 0.10.0.
Mahout's goal is to create an environment for quickly creating machine
learning applications that scale and run on the highest performance
parallel computation engines available. Mahout comprises an interactive
environment and library that supports generalized scalable linear algebra
and includes many modern machine learning algorithms. This release has some
major changes from 0.9, including the new Apache Spark back-end (with H2O
in progress), a new matrix math DSL, streamlined content and bug fixes.

The Mahout Math environment we call “Samsara” for its symbol of universal
renewal. It reflects a fundamental rethinking of how scalable machine
learning algorithms are built and customized. Mahout-Samsara is here to
help people create their own math while providing some off-the-shelf
algorithm implementations. At its base are general linear algebra and
statistical operations along with the data structures to support them. It’s
written in Scala with Mahout-specific extensions, and runs most fully on
Spark.

To get started with Apache Mahout 0.10.0, download the release artifacts
and signatures from http://www.apache.org/dist/mahout/0.10.0/

Many thanks to the contributors and committers who were part of this
release. Please see below for the Release Highlights.

RELEASE HIGHLIGHTS

Mahout-Samsara has implementations for these generalized concepts:

(Continue reading)

PierLorenzo Bianchini | 10 Apr 14:49 2015

evaluating recommender

Hi all,
I have a question on the results of an evaluation (I'm using "RMSRecommenderEvaluator").

I'm getting a result of 0.7432629235004433 with one of the recommenders I'm testing. I read in several
places that 0.0 would be the perfect result, but I couldn't find which ranges are acceptable.
I've seen values ranging from 0.49 to 1.04 with different implementations (I mostly do user-based with
Pearson and model-based with SVD transformations) and different parameter settings. I've also seen
values up to 3.0, but that was when testing "bad" cases (little data, a poor percentage of training data,
etc.; I guess I could get even worse results, but I didn't try).
When can I consider my recommender "good enough", and when should I consider my evaluation too bad? (For
now I have arbitrarily assumed that 0.9 is a good value and I'm trying to stay around it.)
Perhaps someone knows where I could find documentation on this? Any help would be appreciated.
Thank you! Regards,

PL

FYI: I have a user/movie/rating dataset with 6,000 users and 3,900 movies. I have a static training file
with 800,000 triplets, and I'm using it to evaluate different types of recommenders (this is a university
requirement; I'm not talking about production environments).
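
A note on scale: RMSE is expressed in rating units, so on a 1-5 star scale a
value of 0.74 means predictions are off by roughly three quarters of a star on
average; as a rough anchor, results reported on MovieLens-style data often fall
in the 0.85-0.95 range. A minimal sketch of the evaluation (user-based with
Pearson, as described; the file name and neighborhood size are hypothetical):

    import java.io.File;

    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class RmseSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv")); // user,movie,rating
        RecommenderBuilder builder = m -> {
          PearsonCorrelationSimilarity sim = new PearsonCorrelationSimilarity(m);
          return new GenericUserBasedRecommender(
              m, new NearestNUserNeighborhood(50, sim, m), sim);
        };
        RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
        // Train on 80% of each user's ratings, score predictions on the rest.
        double rmse = evaluator.evaluate(builder, null, model, 0.8, 1.0);
        System.out.println("RMSE: " + rmse);
      }
    }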

