David Ciemiewicz (JIRA) | 1 Jun 06:56 2009

[jira] Created: (PIG-826) DISTINCT as "Function" rather than statement - High Level Pig

DISTINCT as "Function" rather than statement - High Level Pig
-------------------------------------------------------------

                 Key: PIG-826
                 URL: https://issues.apache.org/jira/browse/PIG-826
             Project: Pig
          Issue Type: New Feature
            Reporter: David Ciemiewicz

In SQL, a user would think nothing of doing something like:

{code}
select
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count
from
    server_logs;
{code}

But in Pig, we'd need to do something like the following.  And this is about the most
compact version I could come up with.

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);
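For comparison, the result the SQL query above asks for can be sketched in Python as a single pass that keeps one set per column. This is only an illustration of the semantics, not Pig code; the function name `count_distinct_columns` and the sample rows are made up for the example.

```python
from typing import Dict, Iterable, Sequence, Tuple


def count_distinct_columns(
    rows: Iterable[Tuple[str, ...]], names: Sequence[str]
) -> Dict[str, int]:
    """Return the number of distinct values seen in each named column."""
    seen = {name: set() for name in names}
    for row in rows:
        for name, value in zip(names, row):
            seen[name].add(value)
    return {f"{name}_count": len(values) for name, values in seen.items()}


# Hypothetical sample data standing in for server_logs (user, country, url).
logs = [
    ("alice", "US", "/index"),
    ("bob", "US", "/index"),
    ("alice", "DE", "/about"),
    ("carol", "US", "/index"),
]
counts = count_distinct_columns(logs, ["user", "country", "url"])
print(counts)
```

The point of the sketch is that all three distinct counts come out of one scan of the input, which is the compactness the SQL form has and the Pig script above lacks.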

David Ciemiewicz (JIRA) | 1 Jun 07:08 2009

[jira] Commented: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency


    [
https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714972#action_12714972
] 

David Ciemiewicz commented on PIG-801:
--------------------------------------

I'm very much beginning to like the idea of introducing some "syntactic sugar" in Pig for a "forall" or
"overall" statement that would allow one to write the "high level pig" for this case as:

{code}
Total = forall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

or as:

{code}
Total = overall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

Yeah, I know I could use the construct:

{code}
Total = foreach (group CountryPopulations all) generate SUM(CountryPopulations.population) as population;
{code}

 But I like syntactic sugar.
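
What the `group ... all` construct computes can be sketched in Python: every row lands in a single group, and the aggregate is then applied to that one bag. The helper name `group_all` and the sample populations are assumptions for illustration, not part of Pig, and the sketch only covers non-empty input.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def group_all(rows: List[Row]) -> List[Tuple[str, List[Row]]]:
    """Roughly mimic `group X all`: one record whose bag holds every row."""
    return [("all", list(rows))]


# Hypothetical stand-in for the CountryPopulations relation.
country_populations = [
    {"country": "A", "population": 10},
    {"country": "B", "population": 25},
    {"country": "C", "population": 7},
]

# Roughly `foreach (group ... all) generate SUM(...population) as population`.
total = [
    {"population": sum(row["population"] for row in bag)}
    for _key, bag in group_all(country_populations)
]
print(total)
```

A "forall" or "overall" keyword would only hide the `group_all` step; the computation stays the same.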

Then again, it would be really good if Pig just supported:  Since this would need to be done for SQL, it could be

David Ciemiewicz (JIRA) | 1 Jun 07:30 2009

[jira] Updated: (PIG-753) Provide support for UDFs without parameters


     [
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-753:
---------------------------------

    Summary: Provide support for UDFs without parameters  (was: Do not support UDF not providing parameter)

> Provide support for UDFs without parameters
> -------------------------------------------
>
>                 Key: PIG-753
>                 URL: https://issues.apache.org/jira/browse/PIG-753
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter.
> For example, the following statement:
>  B = FOREACH A GENERATE bagGenerator();  generates an error. I have to provide a parameter, as in the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

--

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
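
The behavior requested in PIG-753 can be modeled in Python as a FOREACH-like helper that passes a UDF only the columns it names, so naming no columns yields a call with no arguments. `foreach_generate`, `bag_generator`, and the sample relation are hypothetical names for this sketch; it does not show how Pig or any patch implements the change.

```python
from typing import Any, Callable, List, Sequence, Tuple


def foreach_generate(
    relation: Sequence[Tuple[Any, ...]],
    udf: Callable[..., Any],
    *columns: int,
) -> List[Any]:
    """Call `udf` once per row, passing only the requested column positions."""
    return [udf(*(row[i] for i in columns)) for row in relation]


def bag_generator() -> List[Tuple[int]]:
    """Hypothetical UDF that needs no input and returns a small fixed bag."""
    return [(1,), (2,)]


A = [("x", 1), ("y", 2)]

# Like `B = FOREACH A GENERATE bagGenerator();` with no parameter supplied.
B = foreach_generate(A, bag_generator)
print(B)
```

The zero-argument call still runs once per input row, which is why the UDF does not need a dummy `$0` just to be invoked.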


Jeff Zhang (JIRA) | 1 Jun 16:47 2009

[jira] Updated: (PIG-753) Provide support for UDFs without parameters


     [
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-753:
---------------------------

    Attachment: Pig_753_Patch.txt

Attached the patch.

> Provide support for UDFs without parameters
> -------------------------------------------
>
>                 Key: PIG-753
>                 URL: https://issues.apache.org/jira/browse/PIG-753
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>         Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter.
> For example, the following statement:
>  B = FOREACH A GENERATE bagGenerator();  generates an error. I have to provide a parameter, as in the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  


Jeff Zhang (JIRA) | 1 Jun 16:49 2009

[jira] Updated: (PIG-753) Provide support for UDFs without parameters


     [
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-753:
---------------------------

        Fix Version/s: 0.3.0
    Affects Version/s: 0.3.0
               Status: Patch Available  (was: Open)

Submitted the patch.
Now we no longer have to provide a parameter for a UDF; zero-parameter UDFs are OK too.

> Provide support for UDFs without parameters
> -------------------------------------------
>
>                 Key: PIG-753
>                 URL: https://issues.apache.org/jira/browse/PIG-753
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Jeff Zhang
>             Fix For: 0.3.0
>
>         Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter.
> For example, the following statement:

David Ciemiewicz (JIRA) | 1 Jun 18:27 2009

[jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


     [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-826:
---------------------------------

    Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig  (was:
DISTINCT as "Function" rather than statement - High Level Pig)

> DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
> -------------------------------------------------------------------------------
>
>                 Key: PIG-826
>                 URL: https://issues.apache.org/jira/browse/PIG-826
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following.  And this is about the most
> compact version I could come up with.

Santhosh Srinivasan (JIRA) | 1 Jun 18:29 2009

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer


    [
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715143#action_12715143
] 

Santhosh Srinivasan commented on PIG-697:
-----------------------------------------

The graph operation pushAfter was added as a complementary operation to pushBefore. Currently, on the
logical side, there are no concrete use cases for pushAfter. The only operator that truly supports
multiple outputs is split. Our current model for split is to have a no-op split operator that has multiple
successors, split outputs, each of which is the equivalent of a filter. The split output has inner plans
which could have projection operators that hold references to the split's predecessor. 

When an operator is pushed after split, the operator will be placed between the split and split output. As a
result, when rewire on split is called, the call is dispatched to the split output. The references in the
split output after the rewire will now point to split's predecessor instead of pointing to the operator
that was pushed after.

The intention of the pushAfter in the case of a split is to push it after the split output. However, the
generic pushAfter operation does not distinguish between split and split output. A possible way out is to
override this method in the logical plan and duplicate most of the code in the OperatorPlan and add new code
to handle split.

As of now, the pushAfter will not be used in the logical layer.

> Proposed improvements to pig's optimizer
> ----------------------------------------
>
>                 Key: PIG-697

Santhosh Srinivasan (JIRA) | 1 Jun 18:31 2009

[jira] Created: (PIG-827) Redesign graph operations in OperatorPlan

Redesign graph operations in OperatorPlan
-----------------------------------------

                 Key: PIG-827
                 URL: https://issues.apache.org/jira/browse/PIG-827
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.2.1
            Reporter: Santhosh Srinivasan
             Fix For: 0.2.1

The graph operations swap, insertBetween, pushBefore, etc. have to be re-implemented in a layered
fashion. The layering will facilitate the re-use of operations. In addition, use of operator.rewire in
the aforementioned operations requires transaction like ability due to various pre-conditions.
Often, the result of one of the operations leaves the graph in an inconsistent state for the rewire
operation. Clear layering and assignment of the ability to rewire will remove these inconsistencies.
For now, use of rewire has resulted in slightly less maintainable code along with the necessity to use
rewire with discretion.
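
One way to picture the layering and the transaction-like behavior described above is the following Python sketch. It is a toy model under stated assumptions: the `Plan` class, its method names, and the string node names are all hypothetical and are not Pig's OperatorPlan API, and the rewire step itself is not modeled. Edge primitives form the lower layer, and a composite operation built only from them rolls back if any precondition fails.

```python
import copy
from typing import Dict, List


class Plan:
    """Tiny directed graph; a hypothetical stand-in, not Pig's OperatorPlan."""

    def __init__(self) -> None:
        self.successors: Dict[str, List[str]] = {}

    # Layer 1: primitives that only touch nodes and edges.
    def add(self, node: str) -> None:
        self.successors.setdefault(node, [])

    def connect(self, src: str, dst: str) -> None:
        if src not in self.successors or dst not in self.successors:
            raise ValueError("both nodes must be in the plan")
        if dst in self.successors[src]:
            raise ValueError(f"{src} -> {dst} already connected")
        self.successors[src].append(dst)

    def disconnect(self, src: str, dst: str) -> None:
        if dst not in self.successors.get(src, []):
            raise ValueError(f"{src} -> {dst} is not connected")
        self.successors[src].remove(dst)

    # Layer 2: a composite operation built only from the primitives, run
    # like a transaction so a failed precondition leaves the plan unchanged.
    def insert_between(self, src: str, new: str, dst: str) -> None:
        snapshot = copy.deepcopy(self.successors)
        try:
            self.add(new)
            self.disconnect(src, dst)
            self.connect(src, new)
            self.connect(new, dst)
        except ValueError:
            self.successors = snapshot  # roll back partial changes
            raise


plan = Plan()
for name in ("load", "filter", "store"):
    plan.add(name)
plan.connect("load", "store")
plan.insert_between("load", "filter", "store")
print(plan.successors)
```

With this shape, a step such as rewire would only ever run against a plan that the composite operation has either fully updated or fully restored, never a half-edited one.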


Alan Gates (JIRA) | 1 Jun 18:53 2009

[jira] Resolved: (PIG-825) PIG_HADOOP_VERSION should be 18


     [
https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-825.
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0

Patch checked in.  Thanks Dmitriy.

> PIG_HADOOP_VERSION should be 18
> -------------------------------
>
>                 Key: PIG-825
>                 URL: https://issues.apache.org/jira/browse/PIG-825
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt
>            Reporter: Dmitriy V. Ryaboy
>             Fix For: 0.3.0
>
>         Attachments: pig-825.patch, pig-825.patch
>
>
> PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default.
> Patch coming.


Mridul Muralidharan | 1 Jun 20:55 2009

Re: A proposal for changing pig's memory management

Alan Gates wrote:
> 
> On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote:
> 
>>
>> I am still not very convinced about the value of this 
>> implementation - particularly considering the advances made since 1.3 
>> in memory allocators and garbage collection.
> 
> My fundamental concern is not with the slowness of garbage collection.  
> I am asserting (along with the paper) that garbage collection is not an 
> optimal choice for a large data processing system.  I don't want to 
> improve the garbage collector, I want to manage a subset of the memory 
> without it.

I should probably have elaborated better.
Most objects in Pig are in young generation (pls correct me if I am 
wrong) - so promoting them from there (which is handled pretty optimally 
and blazingly fast by vm) into slower/longer memory pools should be done 
with some thought (management of buffers, etc).

The only (corner) cases where this is not valid, off the top of my head, are 
when a single tuple becomes really large due to a bag (usually) with 
either large number of tuples in it, or tuples with larger payloads : 
and imo that results in quite similar costs with this proposal too - but 
I could be wrong.

> 
>>
>>