Lance Norskog | 1 Aug 05:29 2011

Re: Parallel FPGrowth driver - what is a good demo?

I've rewritten the FPGrowth wiki page. It is still a bit ragged. Please
critique for content.

https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining

On Thu, Jul 28, 2011 at 12:59 AM, Lance Norskog <goksron <at> gmail.com> wrote:
> Ok, now I've succeeded in running fpgrowth, both sequential and
> mapreduce, from the 'fpg' job and the flag that chooses between
> 'sequential' and 'mapreduce'. I've done this with two different datasets,
> accidents.dat and retail.dat. I only ran the first thousand lines of
> both datasets for time reasons.
>
> Both sequential and mapreduce locate the same ids as being in
> patterns. Examining the patterns in detail, they do not match, but
> patterns involving id X are generally the same size. Successive runs of
> each variant give exactly the same results, so having sequential and
> mapreduce give different result sets is puzzling. Pulling the
> distances is a little difficult with text processing.
>
> What can account for the different outputs of map/reduce and
> sequential (pseudo-distributed) modes?
>
>
>
>
> On 7/27/11, Lance Norskog <goksron <at> gmail.com> wrote:
>> I'll prep a current version.
>>
>> On 7/27/11, Robin Anil <robin.anil <at> gmail.com> wrote:
>>> On Tue, Jul 26, 2011 at 11:06 PM, Lance Norskog <goksron <at> gmail.com>
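One way to quantify how far the sequential and mapreduce result sets
diverge is to compare them as sets of itemsets. The sketch below is only
an illustration in Python and is not part of Mahout. It assumes the
patterns from each run have already been parsed into lists of item ids;
the parsing step and the names jaccard and compare_runs are hypothetical.
It reports the shared patterns, the patterns unique to each run, and a
Jaccard overlap score.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are considered identical
    return len(a & b) / len(a | b)


def compare_runs(seq_patterns, mr_patterns):
    """Compare two runs; each pattern is an iterable of item ids."""
    # frozenset makes the comparison independent of item order.
    seq = {frozenset(p) for p in seq_patterns}
    mr = {frozenset(p) for p in mr_patterns}
    return {
        "shared": seq & mr,
        "only_sequential": seq - mr,
        "only_mapreduce": mr - seq,
        "jaccard": jaccard(seq, mr),
    }


# Invented example data, standing in for parsed output of the two modes.
sequential = [[1, 2], [1, 3], [2, 3, 4]]
mapreduce = [[2, 1], [1, 3, 4], [2, 3, 4]]
report = compare_runs(sequential, mapreduce)
```

Patterns that appear in only one of the two sets are the ones worth
inspecting by hand.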

Sean Owen | 1 Aug 09:19 2011

Re: OSX/Hadoop problem: filename 'LICENSE' and dir 'license/' clash in mahout-examples-0.6-SNAPSHOT-job.jar

Great report here. I imagine the answer is to make 'license' into
'licenses'. Let me have a look and file a JIRA with patch.

Sean

On Sun, Jul 31, 2011 at 6:41 PM, Dan Brickley <danbri <at> danbri.org> wrote:
> With SVN 'At revision 1152597.', and freshly rebuilt:
>
> jar -tvf /Users/danbri/Documents/workspace/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> | grep -i license
>
>  19355 Sat Feb 26 19:16:30 CET 2011 META-INF/LICENSE.txt
>  11358 Sun Apr 11 21:45:12 CEST 2010 META-INF/LICENSE
>  1596 Mon Dec 20 15:47:30 CET 2010 LICENSE
>     0 Sun Dec 01 11:57:24 CET 2002 license/
>  4083 Sun Dec 01 11:57:24 CET 2002 license/LICENSE.dom-documentation.txt
>  3595 Sun Dec 01 11:57:24 CET 2002 license/LICENSE.dom-software.txt
>   804 Sun Dec 01 11:57:24 CET 2002 license/LICENSE.sax.txt
>  2827 Sun Dec 01 11:57:24 CET 2002 license/LICENSE.txt
>  1274 Sun Dec 01 11:57:24 CET 2002 license/README.dom.txt
>   715 Sun Dec 01 11:57:24 CET 2002 license/README.sax.txt
>   672 Sun Dec 01 11:57:24 CET 2002 license/README.txt
>
>
> This situation seems to quite confuse Hadoop. The underlying OSX
> filesystem doesn't support file and directory names differing only by
> case; see http://developer.apple.com/library/mac/#documentation/Java/Conceptual/Java14Development/01-JavaOverview/JavaOverview.html
>
> mahout  lucene.vector --dir solr/data/index/ --output bar/vecs --field
> label --idField id --dictOut bar/dict.out --norm 2
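A clash of this kind can be checked for before the jar is deployed. As a
rough sketch (Python standard library only; the helper name
case_collisions and the in-memory demo archive are made up for
illustration), the following lists archive paths, including implied
parent directories, that differ only by letter case and would therefore
collide when unpacked on a case-insensitive filesystem:

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def case_collisions(names):
    """Return {lowercased path: [spellings]} for paths differing only by case."""
    seen = defaultdict(set)
    for name in names:
        path = PurePosixPath(name.rstrip("/"))
        # Unpacking also creates every parent directory, so include those.
        for p in [path, *path.parents]:
            if str(p) != ".":
                seen[str(p).lower()].add(str(p))
    return {key: sorted(spellings)
            for key, spellings in seen.items() if len(spellings) > 1}


def demo():
    """Build a tiny in-memory archive that mimics the reported layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/LICENSE.txt", "")
        zf.writestr("LICENSE", "")
        zf.writestr("license/LICENSE.txt", "")
    with zipfile.ZipFile(buf) as zf:
        return case_collisions(zf.namelist())
```

For a real jar, the same function can be given
zipfile.ZipFile(path_to_jar).namelist(), since a jar is a zip archive.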

Dan Brickley | 1 Aug 10:50 2011

Re: OSX/Hadoop problem: filename 'LICENSE' and dir 'license/' clash in mahout-examples-0.6-SNAPSHOT-job.jar

On 1 August 2011 09:19, Sean Owen <srowen <at> gmail.com> wrote:
> Great report here. I imagine the answer is to make 'license' into
> 'licenses'. Let me have a look and file a JIRA with patch.

Thanks. I tried adding the following to ./examples/src/main/assembly/job.xml:

        <exclude>LICENSE</exclude>
        <exclude>license*</exclude>

...but that was a bad guess. Renaming to 'licenses' sounds better and
simpler; I just haven't tracked down yet where it's actually coming from.

cheers,

Dan

Jeff Eastman | 1 Aug 18:15 2011

RE: Analyzing the clusterdump output - kmeans clustering


-----Original Message-----
From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com] 
Sent: Friday, July 29, 2011 3:48 PM
To: user <at> mahout.apache.org
Subject: Analyzing the clusterdump output - kmeans clustering

Hi,

I managed to run the kmeans algorithm on a Cloudera VM, using the
help provided at the wiki and at the forum. I got my output and
am trying to use clusterdump to analyze my result.

(I seem to have given 5 iterations, but it seems to have formed only 4
clusters; I am also curious about that. I ran the command below.)

mahout kmeans -i hdfs://localhost/mahout_input/ip -o
hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c
hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl

After kmeans completed on the Hadoop Cloudera VM, I ran this command:

mahout clusterdump --seqFileDir
hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000
--pointsDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints
--output kmeans_07_29_1_cl5.tx

and when I look into the text file I see a structure like this


Jeff Eastman | 1 Aug 18:19 2011

RE: Analyzing the clusterdump output - kmeans clustering

It is possible for kmeans to fail to assign any points to one of its clusters, and I believe this is the
explanation for your seeing only 4 clusters when you requested 5. Also, looking at your clustered data,
the points are printed out in sparse vector notation (index:value) and your input vectors only appear to
have 2 or 3 nonzero elements. This could be the reason for dropping one of the clusters. 
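To illustrate how a cluster can end up empty: in the assignment step each
point goes to its nearest centroid, so a centroid that is nearest to no
point receives nothing. The toy sketch below is plain Python, not
Mahout's implementation; the points and centroids are invented, with five
starting centroids of which one lies far from all the data.

```python
import math
from collections import Counter


def assign(points, centroids):
    """For each point, return the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


# Five requested clusters, but the last centroid is far from every point.
points = [(0.1, 0.2), (9.8, 0.1), (0.3, 9.9), (10.2, 10.1), (0.0, 0.4)]
centroids = [(0, 0), (10, 0), (0, 10), (10, 10), (100, 100)]

labels = assign(points, centroids)
sizes = Counter(labels)          # cluster index -> number of assigned points
non_empty = len(sizes)           # clusters that received at least one point
```

Here only four of the five centroids receive any points, which matches
the symptom of requesting 5 clusters and seeing 4 in the dump.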


Jeff Eastman | 1 Aug 18:20 2011

RE: Doubt regarding the kmeans clustering results on mahout

It's really not possible for the clustering to produce NamedVectors but you are free to send it points which
are named. Those points will pass through the clustering process and be available in the output.

-----Original Message-----
From: Frank Scholten [mailto:frank <at> frankscholten.nl] 
Sent: Saturday, July 30, 2011 4:21 AM
To: user <at> mahout.apache.org
Subject: Re: Doubt regarding the kmeans clustering results on mahout

Maybe it should produce NamedVectors by default as well. This is
another of those optional settings
that is often needed in practice.

On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
> No problem. I really think the default needs to be changed anyway. Perhaps this will get me to do it.
>
> -----Original Message-----
> From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> Sent: Friday, July 29, 2011 1:48 PM
> To: user <at> mahout.apache.org
> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>
> Thanks a lot, I missed that part in the wiki. My mistake.
>
>

Abhik Banerjee | 1 Aug 18:40 2011

Re: Analyzing the clusterdump output - kmeans clustering

Hi Jeff, Thanks a lot for your help. 

Frank Scholten | 1 Aug 21:05 2011

Re: Doubt regarding the kmeans clustering results on mahout

Ah, I meant maybe seq2sparse could produce namedvectors by default.
There was a discussion on that some time ago on
https://issues.apache.org/jira/browse/MAHOUT-401

On Mon, Aug 1, 2011 at 6:20 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
> It's really not possible for the clustering to produce NamedVectors but you are free to send it points
> which are named. Those points will pass through the clustering process and be available in the output.
>
> -----Original Message-----
> From: Frank Scholten [mailto:frank <at> frankscholten.nl]
> Sent: Saturday, July 30, 2011 4:21 AM
> To: user <at> mahout.apache.org
> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>
> Maybe it should produce NamedVectors by default as well. This is
> another of those optional settings
> that is often needed in practice.
>
> On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
>> No problem. I really think the default needs to be changed anyway. Perhaps this will get me to do it.
>>
>> -----Original Message-----
>> From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
>> Sent: Friday, July 29, 2011 1:48 PM
>> To: user <at> mahout.apache.org
>> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>>
>> Thanks a lot, I missed that part in the wiki. My mistake.
>>
>>

Jeff Eastman | 1 Aug 21:57 2011

RE: Doubt regarding the kmeans clustering results on mahout

That issue has been closed and seq2sparse, in trunk at least, reports it has a -nv (--namedVector) option.
Have you tried it?

-----Original Message-----
From: Frank Scholten [mailto:frank <at> frankscholten.nl] 
Sent: Monday, August 01, 2011 12:06 PM
To: user <at> mahout.apache.org
Subject: Re: Doubt regarding the kmeans clustering results on mahout

Ah, I meant maybe seq2sparse could produce namedvectors by default.
There was a discussion on that some time ago on
https://issues.apache.org/jira/browse/MAHOUT-401

On Mon, Aug 1, 2011 at 6:20 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
> It's really not possible for the clustering to produce NamedVectors but you are free to send it points
> which are named. Those points will pass through the clustering process and be available in the output.
>
> -----Original Message-----
> From: Frank Scholten [mailto:frank <at> frankscholten.nl]
> Sent: Saturday, July 30, 2011 4:21 AM
> To: user <at> mahout.apache.org
> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>
> Maybe it should produce NamedVectors by default as well. This is
> another of those optional settings
> that is often needed in practice.
>
> On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
>> No problem. I really think the default needs to be changed anyway. Perhaps this will get me to do it.
>>

Jeff Eastman | 1 Aug 22:39 2011

RE: Doubt regarding the kmeans clustering results on mahout

Ah, ok, I guess you meant both should produce those behaviors by default rather than upon request. That
would require either changing the semantics of -cl & -nv to disable these behaviors vs. enabling them,
removing both options and introducing new options to disable them, or making both options require a
true/false value (default=true). None are particularly attractive, IMHO.
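For illustration, the three alternatives look roughly like this in
Python's argparse. This is a sketch only: Mahout's command line is parsed
in Java, the flag name --noNamedVector and the helper str2bool are
hypothetical, and only the -nv (--namedVector) spelling comes from this
thread.

```python
import argparse


def str2bool(text):
    """Parse 'true' or 'false' (case-insensitive) into a bool."""
    value = text.strip().lower()
    if value in ("true", "false"):
        return value == "true"
    raise argparse.ArgumentTypeError(f"expected true or false, got {text!r}")


def current_style():
    """Behavior is off unless the flag is given (opt-in)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-nv", "--namedVector", action="store_true")
    return parser


def disable_style():
    """Behavior is on by default; a new flag turns it off (opt-out)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--noNamedVector", dest="namedVector",
                        action="store_false")
    return parser


def valued_style():
    """The option takes an explicit true/false value, defaulting to true."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-nv", "--namedVector", type=str2bool, default=True)
    return parser
```

The first variant keeps existing command lines working; the other two
change the default and so would alter the behavior of scripts that omit
the flag today.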

-----Original Message-----
From: Jeff Eastman [mailto:jeastman <at> Narus.com] 
Sent: Monday, August 01, 2011 12:57 PM
To: user <at> mahout.apache.org
Subject: RE: Doubt regarding the kmeans clustering results on mahout

That issue has been closed and seq2sparse, in trunk at least, reports it has a -nv (--namedVector) option.
Have you tried it?

-----Original Message-----
From: Frank Scholten [mailto:frank <at> frankscholten.nl] 
Sent: Monday, August 01, 2011 12:06 PM
To: user <at> mahout.apache.org
Subject: Re: Doubt regarding the kmeans clustering results on mahout

Ah, I meant maybe seq2sparse could produce namedvectors by default.
There was a discussion on that some time ago on
https://issues.apache.org/jira/browse/MAHOUT-401

On Mon, Aug 1, 2011 at 6:20 PM, Jeff Eastman <jeastman <at> narus.com> wrote:
> It's really not possible for the clustering to produce NamedVectors but you are free to send it points
> which are named. Those points will pass through the clustering process and be available in the output.
>
> -----Original Message-----
> From: Frank Scholten [mailto:frank <at> frankscholten.nl]

