hdev ml | 1 Sep 2010 01:03

Re: Question about data warehousing and mining through Mahout

Thanks Chris for the answers.

1. The data is just going to grow. This 1.5TB of data is just from one
module. There are other modules, which may have a similar kind of data in
their logs. After discarding unneeded data, 3.0TB becomes 1.5TB. It is still
huge. Since this is just one month's data, and we would want to make use of
at least the past 6-12 months, the data size goes into the 10TB-20TB range.
So I am guessing Hadoop is the right answer. I am just not sure which
sub-project to use in this case.

2. When you say querying with Hive, note that I want to use the same Hive
data for future data mining, so my question was: can that be done with
Mahout integrating with the Hive layer instead of the Hadoop layer directly?
Or maybe we can use the Hive data directly, if I can reverse engineer the
data format that Hive uses internally. Hopefully it does not store
compressed data.

3. Hmm, that seems like a very good suggestion. I am not averse to the
idea of writing my own implementation of mining algorithms. I am just
worried about their accuracy and stability. So the summary is basically: do
the transformation and statistical part first. When it comes to data mining,
write your own algorithms or use Mahout (if Hive integration is possible at
all, or maybe reuse the raw text files or an output dump of the Hive tables).

On Tue, Aug 31, 2010 at 3:38 PM, Chris Bates <
christopher.andrew.bates <at> gmail.com> wrote:

> From my experience (merging machine learning with business goals), I'll
> offer a few pieces of advice that may help guide you.
>

hdev ml | 1 Sep 2010 01:10

Re: Question about data warehousing and mining through Mahout

Thanks Ted. That again validates my path.

Thanks Ted, Chris and Sean for your valuable inputs.

Community Rocks!!!! Off the topic - A few years back, I was on the JavaCC
mailing list. There were 2 guys - one from New Zealand and the other one
from France - replying to my problems. I was literally getting
round-the-clock support. More power to the community!!!!

-H

On Tue, Aug 31, 2010 at 3:48 PM, Ted Dunning <ted.dunning <at> gmail.com> wrote:

> For categorization, there are several different answers to the integration
> problem, but text export of a sampled and curated data file is pretty
> typical as a data path.
>
> The on-line sequential classifiers are a bit more flexible and would allow
> different input formats at the cost of coding on your part.
>
> Keep in mind that Hive is keeping flat files in HDFS anyway.  Adding an
> additional format so that you don't have to copy a Hive output file one
> extra time isn't hard, but neither is it hard to have Hive pop out
> something like comma separated values.
>
> On Tue, Aug 31, 2010 at 3:41 PM, hdev ml <hdevml <at> gmail.com> wrote:
>

Ted Dunning | 1 Sep 2010 02:34

Re: Question about data warehousing and mining through Mahout

I think that Chris was actually recommending stuff that is too simple to
call data-mining.

Basically this stuff is simpler than any machine learning algorithm so there
isn't anything really to write.

An example for recommendations is to simply recommend the most popular items
to everybody, possibly with a bit of dithering so it doesn't look so static.
This *is* actually a recommendation algorithm just like random selection is.
Both of these provide interesting baseline levels for clicks and engagement.
You *might* want to use Mahout to implement these, but it is probably better
to get the rest of the framework in place first.
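
The popularity baseline described above can be sketched in plain Java,
without Mahout. This is only an illustration under stated assumptions: the
class and method names are hypothetical, the click log is a simple list of
item ids, and the dithering scheme (score each item by the log of its rank
plus Gaussian noise, then re-sort) is one possible choice that the message
does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical baseline recommender; none of these names come from Mahout.
class PopularityBaseline {

    // Ranks items by how often they occur in a click log, most popular first.
    static List<String> rankByPopularity(List<String> clickedItems) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : clickedItems) {
            counts.merge(item, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending count; ties broken by item id so the order is stable.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }

    // Dithering (assumed scheme): score each item by log(rank) plus Gaussian
    // noise, then re-sort. A noiseScale of 0 leaves the order unchanged.
    static List<String> dither(List<String> ranked, double noiseScale,
                               Random rng) {
        int n = ranked.size();
        double[] score = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            score[i] = Math.log(i + 1) + noiseScale * rng.nextGaussian();
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(score[a], score[b]));
        List<String> result = new ArrayList<>(n);
        for (int index : order) {
            result.add(ranked.get(index));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> clicks = Arrays.asList("a", "b", "a", "c", "a", "b");
        List<String> ranked = rankByPopularity(clicks);
        System.out.println("static:   " + ranked);
        System.out.println("dithered: " + dither(ranked, 0.5, new Random()));
    }
}
```

Everybody gets the same ranked list; a larger noiseScale mixes items from
further down the list into the top positions, so the page looks less static
while still favouring popular items.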

On Tue, Aug 31, 2010 at 4:03 PM, hdev ml <hdevml <at> gmail.com> wrote:

> 3. Hmm, that seems like a very good suggestion. I am not averse to the
> idea of writing my own implementation of mining algorithms. I am just
> worried about their accuracy and stability. So the summary is basically: do
> the transformation and statistical part first. When it comes to data mining,
> write your own algorithms or use Mahout (if Hive integration is possible at
> all, or maybe reuse the raw text files or an output dump of the Hive tables).
>
hdev ml | 1 Sep 2010 04:00

Re: Question about data warehousing and mining through Mahout

I see what you are saying. Yes, I am going in the Hive direction for now,
and am currently installing and configuring the packages.

I think after a couple of months when things settle down with Hadoop and
Hive, I will take the Mahout course.

Thanks

-H

On Tue, Aug 31, 2010 at 5:34 PM, Ted Dunning <ted.dunning <at> gmail.com> wrote:

> I think that Chris was actually recommending stuff that is too simple to
> call data-mining.
>
> Basically this stuff is simpler than any machine learning algorithm so
> there isn't anything really to write.
>
> An example for recommendations is to simply recommend the most popular
> items to everybody, possibly with a bit of dithering so it doesn't look so
> static.  This *is* actually a recommendation algorithm just like random
> selection is.  Both of these provide interesting baseline levels for clicks
> and engagement.  You *might* want to use Mahout to implement these, but it
> is probably better to get the rest of the framework in place first.

Ted Dunning | 1 Sep 2010 04:15

Re: Question about data warehousing and mining through Mahout

The Manning book on Mahout will have caught up with you by then and will
have a killer section on building classification models to go with the
recommendations and clustering sections that are already available.

On Tue, Aug 31, 2010 at 7:00 PM, hdev ml <hdevml <at> gmail.com> wrote:

> I think after a couple of months when things settle down with Hadoop and
> Hive, I will take the Mahout course.
>
Sean Owen | 1 Sep 2010 09:58

Re: Question about data warehousing and mining through Mahout

Hive does something fairly unrelated to Mahout. It's an indexing and
query system. Both might start from the same source data, but to do
different things. There is no common format, no. Mahout generally
operates on text files or "Vectors" in SequenceFiles. So there's some
translation there at least.

But I think a message here is that there's more preparation and
thought necessary to start data mining. It's not like you point a data
mining tool at some data and answers start flowing automatically.
You'd have to be deliberately extracting and preparing data anyhow.
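
As a rough illustration of the translation step mentioned above, the sketch
below parses comma-separated lines into numeric feature arrays. It is plain
Java with hypothetical names and several assumptions: the export is
comma-separated (Hive's default text delimiter is different), every column
is numeric, and empty fields are treated as 0.0. Wrapping the arrays in
Mahout Vectors and writing them to a SequenceFile is not shown, since that
needs Hadoop and Mahout on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; a stand-in for the text-to-vector translation step.
class CsvToFeatures {

    // Parses one comma-separated line of numeric fields.
    // Assumption: an empty field means "missing" and is stored as 0.0.
    static double[] parseLine(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empties
        double[] features = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            String field = fields[i].trim();
            features[i] = field.isEmpty() ? 0.0 : Double.parseDouble(field);
        }
        return features;
    }

    // Parses every non-blank line; blank lines are skipped.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> export = Arrays.asList("1.5, 2,,4", "", "3,0.25,7,1");
        for (double[] row : parseAll(export)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Even in this small form the preparation decisions show up: which columns
become features and how missing values are handled have to be chosen
deliberately before any mining tool sees the data.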

On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <hdevml <at> gmail.com> wrote:
> Thanks Sean for the answers. Thanks to Ted for validation.
>
> Now my question is, since I want to do both reporting of large data/
> datawarehouse, let's assume I choose Hive for that.
>
> Now can Mahout integrate with Hive to make use of this data for learning,
> mining etc.? Or do I have to export the Hive data into text files which can
> be hosted by Hadoop/HDFS and which Mahout can later use for data mining?
>
> In short, can data warehousing part be done by Hive and then can data mining
> part be done by Mahout on this hive data?
>
> -H
>
> On Tue, Aug 31, 2010 at 3:03 PM, Sean Owen <srowen <at> gmail.com> wrote:
>
>> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <hdevml <at> gmail.com> wrote:
>> > Per my understanding of hive, we can do some statistical reporting, like

Sean Owen | 1 Sep 2010 10:07

Re: SVDRecommender convergence?

All in all that sounds roughly reasonable to me -- that's a reasonable
number of features and iterations, and the eval result is varying by
just about 3% (scale of 1-5).
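
The arithmetic behind that figure can be checked with a short sketch. It
assumes one reading of the numbers: the spread between the quoted scores of
.70 and .81 is divided by the width of the 1-5 rating scale. The names are
hypothetical. Measured against the scores themselves, the same gap is much
larger, which the second calculation shows.

```java
// Hypothetical helper for comparing two ways of expressing a score spread.
class SpreadOnScale {

    // Spread between two evaluation scores as a fraction of the scale width.
    static double relativeToScale(double low, double high,
                                  double scaleMin, double scaleMax) {
        return (high - low) / (scaleMax - scaleMin);
    }

    // Spread between two evaluation scores as a fraction of the lower score.
    static double relativeToScore(double low, double high) {
        return (high - low) / low;
    }

    public static void main(String[] args) {
        double low = 0.70;   // lowest evaluation score quoted in the thread
        double high = 0.81;  // highest evaluation score quoted in the thread
        // 0.11 / 4 is about 0.0275, i.e. roughly 3% of the 1-5 scale.
        System.out.println(relativeToScale(low, high, 1.0, 5.0));
        // 0.11 / 0.70 is about 0.157, i.e. roughly 16% of the score itself.
        System.out.println(relativeToScore(low, high));
    }
}
```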

On Tue, Aug 31, 2010 at 11:55 PM, Lance Norskog <goksron <at> gmail.com> wrote:
> About the SVDRecommender- 10 features and 50 iterations gave
> evaluation scores from .70 to .81, solely by changing the random
> seeds.

hdev ml | 1 Sep 2010 19:48

Re: Question about data warehousing and mining through Mahout

I agree with you that there is preparation needed for Mahout processing.

I was just trying to save on that effort by re-using the data in Hive
instead of double processing it.

I may have some more questions when I actually dive into the mining part
(possibly a couple of months down the line).

Thanks for your inputs.

On Wed, Sep 1, 2010 at 12:58 AM, Sean Owen <srowen <at> gmail.com> wrote:

> Hive does something fairly unrelated to Mahout. It's an indexing and
> query system. Both might start from the same source data, but to do
> different things. There is no common format, no. Mahout generally
> operates on text files or "Vectors" in SequenceFiles. So there's some
> translation there at least.
>
> But I think a message here is that there's more preparation and
> thought necessary to start data mining. It's not like you point a data
> mining tool at some data and answers start flowing automatically.
> You'd have to be deliberately extracting and preparing data anyhow.
>
> On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <hdevml <at> gmail.com> wrote:
> > Thanks Sean for the answers. Thanks to Ted for validation.
> >
> > Now my question is, since I want to do both reporting of large data/
> > datawarehouse, let's assume I choose Hive for that.
> >
> > Now can Mahout integrate with Hive to make use of this data for learning,

Valerio Ceraudo | 2 Sep 2010 03:05

from Arff to Vector

Hi all,
I'm at the last step of my thesis: converting an ARFF file into a vector.
I found this, as suggested in an old post:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-wekas-arff-format.html

I made two different attempts:

1) From the trunk folder I moved into trunk/utils and ran mvn install, but
when I take the snapshot and try to run Driver.class I get this error message:

java.lang.ClassNotFoundException:org.apache.commons.cli2.OptionException

So I tried another way:

I looked inside Driver.java to see the imports:

import org.apache.commons.cli2.CommandLine;
import org.apache.commons.cli2.Group;
import org.apache.commons.cli2.Option;
import org.apache.commons.cli2.OptionException;
import org.apache.commons.cli2.builder.ArgumentBuilder;
import org.apache.commons.cli2.builder.DefaultOptionBuilder;
import org.apache.commons.cli2.builder.GroupBuilder;
import org.apache.commons.cli2.commandline.Parser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.common.CommandLineUtil;

Lance Norskog | 2 Sep 2010 04:24

Re: About the SVDRecommender

Sorry, a 10-20% spread in evaluation results. The value goes from .6+
to .8+ each time I run SVDRecommender(datamodel, features=20,
iterations=50);

This is the value from AverageAbsoluteDifferenceRecommenderEvaluator.
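
For reference, the quantity being reported is the average absolute
difference between actual and estimated preferences on the held-out test
data. The sketch below computes that number in plain Java. It is not
Mahout's implementation: the class name, method name and sample values are
hypothetical, and the random train/test split that the evaluator performs is
left out.

```java
// Hypothetical stand-in for the metric, not Mahout's evaluator class.
class AverageAbsoluteDifference {

    // Mean of |actual - estimated| over paired preference values.
    static double evaluate(double[] actual, double[] estimated) {
        if (actual.length == 0 || actual.length != estimated.length) {
            throw new IllegalArgumentException(
                    "need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - estimated[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};     // held-out ratings (made up)
        double[] estimated = {3.5, 3.0, 4.0};  // model estimates (made up)
        // (0.5 + 0.0 + 1.0) / 3 = 0.5
        System.out.println(evaluate(actual, estimated));
    }
}
```

Because the held-out ratings are chosen at random, a small data set can give
noticeably different averages from run to run, which is consistent with the
seed sensitivity discussed in this thread.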

I'll buy that 10k is not enough. Right ho!

On Tue, Aug 31, 2010 at 10:16 AM, Sean Owen <srowen <at> gmail.com> wrote:
> Presumably in the result of the evaluation -- average absolute
> difference in actual/estimated preference.
>
> The eval trains with a random subset of the data and tests with the rest.
>
> I just realized from your other mail that you are using a data set
> with 10,000 ratings only. That's fairly small and I wouldn't be
> surprised if the random choice of training set begins to be
> significant to the model.
>
> You could try 100K ratings or more simply to see if that's the issue;
> I don't know that it is.
>
> On Tue, Aug 31, 2010 at 6:08 PM, Ted Dunning <ted.dunning <at> gmail.com> wrote:
>> A 20% spread in what?
>>
>> Speed?  Results?  Iterations?
>>
>> On Mon, Aug 30, 2010 at 11:26 PM, Lance Norskog <goksron <at> gmail.com> wrote:
>>
>>> SVDRecommender is really sensitive to the random number seed. AADRE