David Ciemiewicz (JIRA | 1 Jan 02:52 2009

[jira] Created: (PIG-596) Anonymous tuples in bags create ParseExceptions

Anonymous tuples in bags create ParseExceptions
-----------------------------------------------

                 Key: PIG-596
                 URL: https://issues.apache.org/jira/browse/PIG-596
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: David Ciemiewicz

{code}
One = load 'one.txt' using PigStorage() as ( one: int );

LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };

AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug

Tuples = union LabelledTupleInBag, AnonymousTupleInBag;

dump Tuples;
{code}

java.io.IOException: Encountered "{ tuple" at line 6, column 66.
Was expecting one of:
    "parallel" ...
    ";" ...
    "," ...
    ":" ...
    "(" ...
    "{" <IDENTIFIER> ...

Shubham Chopra (JIRA | 1 Jan 04:50 2009

[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.


    [
https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660178#action_12660178
] 

Shubham Chopra commented on PIG-572:
------------------------------------

1. One use case is when users have a shareable script that does some preprocessing on data, so that others can
reuse it and build on it when using embedded Pig. It's straightforward when using Grunt, but not so much when
using embedded Pig. I'll ask Chris to point out any other use cases that he might have in mind.
2. This was for the example generator. I need a mapping between logical operators and the generated
examples. For cases where the grids are behind firewalls and can only be accessed through gateways, I need
to run the example generator on the gateway and then serialize the maps to disk to get them to local
machines. This needs the logical plan to be serializable.

> A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
> ------------------------------------------------------------------------------------------------
>
>                 Key: PIG-572
>                 URL: https://issues.apache.org/jira/browse/PIG-572
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shubham Chopra
>            Priority: Minor
>             Fix For: types_branch
>
>         Attachments: registerScript.patch
>

David Ciemiewicz (JIRA | 1 Jan 18:06 2009

[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions


    [
https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660201#action_12660201
] 

David Ciemiewicz commented on PIG-596:
--------------------------------------

Note that specifying the tuple without the tuple designator doesn't work either.

{code}
One = load 'one.txt' using PigStorage() as ( one: int );

LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };

AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { ( a, b ) };

Tuples = union LabelledTupleInBag, AnonymousTupleInBag;

dump Tuples;
{code}

> Anonymous tuples in bags create ParseExceptions
> -----------------------------------------------
>
>                 Key: PIG-596
>                 URL: https://issues.apache.org/jira/browse/PIG-596
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch

David Ciemiewicz (JIRA | 1 Jan 18:10 2009

[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions


    [
https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660202#action_12660202
] 

David Ciemiewicz commented on PIG-596:
--------------------------------------

I think it is important to be able to create anonymous tuples because the tuples are anonymous
in the LOAD statements. Also, if you FLATTEN a bag such as mybag, any intermediate tuple label is
immediately lost: the results of the flatten are mybag::a, mybag::b, not
mybag::tuplelabel::a, mybag::tuplelabel::b.
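
The naming behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not Pig's implementation: the function name `flattened_aliases` and the dictionary layout used for the bag schema are assumptions made for the example.

```python
def flattened_aliases(bag_alias, bag_schema):
    """Return the field aliases produced by flattening a bag.

    bag_schema is a hypothetical stand-in for a Pig bag schema:
    {"tuple_label": <str or None>, "fields": [<field names>]}.
    The tuple label is deliberately ignored, mirroring the observation
    that FLATTEN yields mybag::a rather than mybag::tuplelabel::a.
    """
    return [f"{bag_alias}::{field}" for field in bag_schema["fields"]]


labelled = {"tuple_label": "tuplelabel", "fields": ["a", "b"]}
anonymous = {"tuple_label": None, "fields": ["a", "b"]}

# Both schemas flatten to the same aliases, so in this model the label
# carries no information once the bag has been flattened.
print(flattened_aliases("mybag", labelled))
print(flattened_aliases("mybag", anonymous))
```

In this model the labelled and the anonymous schema are indistinguishable after the flatten, which is the argument for allowing the label to be omitted.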

> Anonymous tuples in bags create ParseExceptions
> -----------------------------------------------
>
>                 Key: PIG-596
>                 URL: https://issues.apache.org/jira/browse/PIG-596
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: David Ciemiewicz
>
> {code}
> One = load 'one.txt' using PigStorage() as ( one: int );
> LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };
> AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug
> Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
> dump Tuples;
> {code}

Alan Gates (JIRA | 2 Jan 18:21 2009

[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions


    [
https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660351#action_12660351
] 

Alan Gates commented on PIG-596:
--------------------------------

Flattening a bag gets rid of two layers of containment, both the bag and the tuple.  So the result of
FLATTEN(bag(tuple(x, y, z))) is x, y, z, not tuple(x, y, z).

At this point I believe tuples must be named in the LOAD statement as well as in foreach.  I'm not necessarily
voting against anonymous tuples.  But I do believe Pig Latin is consistent in requiring names for tuples at
the moment.
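
The two layers of containment can be pictured with a small Python model of the data. It is a sketch under stated assumptions rather than Pig code: a row is modelled as a Python tuple, a bag as a list of tuples, and `flatten_bag` is a hypothetical helper name.

```python
def flatten_bag(row, bag_index):
    """Model FLATTEN applied to the bag stored at row[bag_index].

    Each tuple inside the bag yields one output row, with the tuple's
    fields spliced in where the bag used to be, so both the bag and the
    tuple containers disappear from the result.
    """
    before = row[:bag_index]
    bag = row[bag_index]
    after = row[bag_index + 1:]
    return [before + inner + after for inner in bag]


row = (7, [(1, 2, 3), (4, 5, 6)])
for out in flatten_bag(row, 1):
    print(out)
```

Each output row holds the plain fields x, y, z next to the other columns; no nested tuple survives, so a tuple label has nothing left to attach to.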

> Anonymous tuples in bags create ParseExceptions
> -----------------------------------------------
>
>                 Key: PIG-596
>                 URL: https://issues.apache.org/jira/browse/PIG-596
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: David Ciemiewicz
>
> {code}
> One = load 'one.txt' using PigStorage() as ( one: int );
> LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };
> AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug
> Tuples = union LabelledTupleInBag, AnonymousTupleInBag;

Alan Gates (JIRA | 2 Jan 22:19 2009

[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.


    [
https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660388#action_12660388
] 

Alan Gates commented on PIG-572:
--------------------------------

Passes all the tests.  I'd like to wait until the Christmas vacation is over to give other committers a chance
to comment before checking it in.  If I don't see any comments after a few days I'll check it in.

> A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
> ------------------------------------------------------------------------------------------------
>
>                 Key: PIG-572
>                 URL: https://issues.apache.org/jira/browse/PIG-572
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shubham Chopra
>            Priority: Minor
>             Fix For: types_branch
>
>         Attachments: registerScript.patch
>
>
> A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
> For example, say there's a script my_script.pig with the following content:
> a = load '/data/my_data.txt';
> b = filter a by $0 > '0';

Pradeep Kamath (JIRA | 3 Jan 00:59 2009

[jira] Updated: (PIG-558) Distinct followed by a Join results in Invalid size 0 for a tuple error


     [
https://issues.apache.org/jira/browse/PIG-558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-558:
-------------------------------

    Assignee: Pradeep Kamath
      Status: Patch Available  (was: Open)

The issue was that table1 has only one column, which is also the join key. Due to a recent optimization in which
the parts of the value that are already in the key are omitted, an empty tuple is sent as the
value from POLocalRearrange.  The POPackage following the POLocalRearrange looks at metadata
stored in itself to figure out how to reconstruct the value from the key if necessary. However, when the
POLocalRearrange is in a reduce and the POPackage is in the next map, the POLocalRearrange output gets
written to DFS in BinStorage format, so a tuple of size 0 is written out. When reading,
BinStorage considers a tuple of size 0 to be a fatal error.

Fix:
The patch fixes BinStorage to consider a tuple of size 0 to be a valid tuple which is reconstructed as such.
The POPackage then builds up the correct value from the key. The patch also has a unit test to test this.
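
The failure and the fix can be illustrated with a toy Python model. Everything in it is an assumption made for illustration: the helper names, the `key_cols` metadata, and the length-prefixed byte format are hypothetical and are not the actual POLocalRearrange, POPackage, or BinStorage code.

```python
import io
import struct


def local_rearrange(row, key_cols):
    """Split a row into (key, value), omitting key columns from the value."""
    key = tuple(row[i] for i in key_cols)
    value = tuple(v for i, v in enumerate(row) if i not in key_cols)
    return key, value


def package(key, value, key_cols, width):
    """Rebuild the full row from the key and the stripped value."""
    row = [None] * width
    for pos, k in zip(key_cols, key):
        row[pos] = k
    rest = iter(value)
    for pos in range(width):
        if pos not in key_cols:
            row[pos] = next(rest)
    return tuple(row)


def write_tuple(out, tup):
    """Write a tuple of strings in a toy length-prefixed format."""
    out.write(struct.pack(">i", len(tup)))
    for field in tup:
        data = field.encode("utf-8")
        out.write(struct.pack(">i", len(data)))
        out.write(data)


def read_tuple(inp, allow_empty=True):
    """Read a tuple back; allow_empty=False models the old fatal error."""
    (size,) = struct.unpack(">i", inp.read(4))
    if size == 0 and not allow_empty:
        raise IOError("Invalid size 0 for a tuple")
    fields = []
    for _ in range(size):
        (n,) = struct.unpack(">i", inp.read(4))
        fields.append(inp.read(n).decode("utf-8"))
    return tuple(fields)


# A one-column row whose only column is the join key: the value is empty.
key, value = local_rearrange(("apple",), key_cols=[0])

buf = io.BytesIO()
write_tuple(buf, value)      # a size-0 tuple goes to storage
buf.seek(0)
restored = read_tuple(buf)   # the tolerant reader accepts it

print(package(key, restored, key_cols=[0], width=1))
```

With `allow_empty=False` the same read raises an error, which corresponds to the behaviour before the patch; with the tolerant reader the empty value round-trips and the row is rebuilt from the key alone.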

The unit test depends on certain functions introduced in MiniCluster and
test/org/apache/pig/test/Util.java as of the patch in PIG-580. If PIG-580 is not committed before this
patch, then the "additional" patch ("PIG-558-additional.patch") attached here should also be applied.

> Distinct followed by a Join results in Invalid size 0 for a tuple error
> -----------------------------------------------------------------------
>
>                 Key: PIG-558

Pradeep Kamath (JIRA | 3 Jan 01:01 2009

[jira] Updated: (PIG-558) Distinct followed by a Join results in Invalid size 0 for a tuple error


     [
https://issues.apache.org/jira/browse/PIG-558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-558:
-------------------------------

    Attachment: PIG-558-additional.patch
                PIG-558.patch

> Distinct followed by a Join results in Invalid size 0 for a tuple error
> -----------------------------------------------------------------------
>
>                 Key: PIG-558
>                 URL: https://issues.apache.org/jira/browse/PIG-558
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Viraj Bhat
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-558-additional.patch, PIG-558.patch, table1, table2
>
>
> The following Pig script does a right outer join after the DISTINCT.
> {code}
> nonuniqtable1 = LOAD 'table1' AS (f1:chararray);
> table1 = DISTINCT nonuniqtable1;

Benjamin Reed (JIRA | 4 Jan 01:19 2009

[jira] Commented: (PIG-570) Large BZip files Seem to loose data in Pig


    [
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660518#action_12660518
] 

Benjamin Reed commented on PIG-570:
-----------------------------------

Ah sorry, i did the patch with respect to trunk. i'll regen. it is probably possible to create a smaller test
case, but it will take a while. i did simple brute-force trial and error to get the test case. (i thought it was
pretty good that i was able to keep it to under 2M :) are you concerned about the size or the run time?

> Large BZip files  Seem to loose data in Pig
> -------------------------------------------
>
>                 Key: PIG-570
>                 URL: https://issues.apache.org/jira/browse/PIG-570
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch, 0.0.0, 0.1.0, site
>         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
>            Reporter: Alex Newman
>             Fix For: types_branch, 0.0.0, 0.1.0, site
>
>         Attachments: bzipTest.bz2, PIG-570.patch
>
>
> So I don't believe bzip2 input to pig is working, at least not with large files. It seems as though map files
> are getting cut off. The maps complete way too quickly and the actual row of data that pig tries to process
> often randomly gets cut, and becomes incomplete. Here are my symptoms:

Olga Natkovich (JIRA | 5 Jan 19:15 2009

[jira] Commented: (PIG-570) Large BZip files Seem to loose data in Pig


    [
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660838#action_12660838
] 

Olga Natkovich commented on PIG-570:
------------------------------------

OK, let's keep the size and just make the patch against the types branch, thanks.

> Large BZip files  Seem to loose data in Pig
> -------------------------------------------
>
>                 Key: PIG-570
>                 URL: https://issues.apache.org/jira/browse/PIG-570
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch, 0.0.0, 0.1.0, site
>         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
>            Reporter: Alex Newman
>             Fix For: types_branch, 0.0.0, 0.1.0, site
>
>         Attachments: bzipTest.bz2, PIG-570.patch
>
>
> So I don't believe bzip2 input to pig is working, at least not with large files. It seems as though map files
> are getting cut off. The maps complete way too quickly and the actual row of data that pig tries to process
> often randomly gets cut, and becomes incomplete. Here are my symptoms:
> - Maps seem to be completing at an unbelievably fast rate
> With uncompressed data

