Mridul Jain (JIRA) | 17 Apr 04:36 2014

[Commented] (PIG-3453) Implement a Storm backend to Pig


    [
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972242#comment-13972242
] 

Mridul Jain commented on PIG-3453:
----------------------------------

We completed the implementation for "Pig on Storm" last month and are now testing it in production so that we
could potentially open-source it. Here is an abstract:
"We propose PIG as the primary language for expressing realtime stream processing logic and provide a
working prototype on Storm. We also illustrate how legacy code written for MR in PIG can run with minimal
to no changes on Storm. This includes running the existing PIG UDFs seamlessly on Storm. Though neither PIG
nor Storm takes any position on state, we have provided built-in support for advanced state semantics
like sliding windows, global mutable state etc., which are required in real world applications. We take a
detailed look into a prototype application (realtime anomaly detection/trending system), which
elucidates the performance characteristics of this framework and rich expressibility of complex
programming logic via PIG on streaming.
Finally, we propose a "Hybrid Mode" where a single PIG script can express logic for both realtime streaming
and batch jobs and also define data exchange mechanisms between the two, without breaking the semantic &
syntactic sanctity of PIG. The underlying system figures out what parts of this PIG script to run on MR
and what on Storm, automatically."

> Implement a Storm backend to Pig
> --------------------------------
>
>                 Key: PIG-3453
>                 URL: https://issues.apache.org/jira/browse/PIG-3453
>             Project: Pig
>          Issue Type: New Feature

jira | 17 Apr 03:03 2014

Subscription: PIG patch available

Issue Subscription
Filter: PIG patch available (20 issues)

Subscriber: pigdaily

Key         Summary
PIG-3901    Organize the Pig properties file and document all properties
            https://issues.apache.org/jira/browse/PIG-3901
PIG-3899    Fix memory leak with PigTezLogger
            https://issues.apache.org/jira/browse/PIG-3899
PIG-3894    Datetime function AddDuration, SubtractDuration and all Between functions don't check for
            null values in the input tuple.
            https://issues.apache.org/jira/browse/PIG-3894
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3874    FileLocalizer temp path can sometimes be non-unique
            https://issues.apache.org/jira/browse/PIG-3874
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3867    Added hadoop home to build classpath for build pig with unit test on windows
            https://issues.apache.org/jira/browse/PIG-3867
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3865    Remodel the XMLLoader to work to be faster and more maintainable
            https://issues.apache.org/jira/browse/PIG-3865
PIG-3861    duplicate jars get added to distributed cache
            https://issues.apache.org/jira/browse/PIG-3861
PIG-3825    Stats collection needs to be changed for hadoop2 (with auto local mode)
            https://issues.apache.org/jira/browse/PIG-3825
PIG-3772    Syntax error when casting an inner schema of a bag and line break involved

Cheolsoo Park (JIRA) | 17 Apr 02:41 2014

[Updated] (PIG-3898) Refactor PPNL for non-MR execution engine


     [
https://issues.apache.org/jira/browse/PIG-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3898:
-------------------------------

    Attachment: PIG-3898-1-tez.patch
                PIG-3898-1-trunk.patch

Thank you everyone for your comments. I am uploading my WIP patches.

The changes include:
# Update initialPlanNotification() in PPNL as suggested.
# Move non-MR specific code from MRScriptState to ScriptState. PIG-3419 moved too much code to
MRScriptState, so I am moving back to ScriptState whatever is applicable to both MR and Tez.

Regarding exposing OperatorPlan in the API, I agree that we should avoid it if possible. But since Ambrose
and Lipstick use it heavily, it won't be easy to take it back at this point. Nevertheless, it's definitely
worth discussing.

> Refactor PPNL for non-MR execution engine
> -----------------------------------------
>
>                 Key: PIG-3898
>                 URL: https://issues.apache.org/jira/browse/PIG-3898
>             Project: Pig
>          Issue Type: Task
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park

Philip (flip) Kromer (JIRA) | 17 Apr 02:17 2014

[Updated] (PIG-3901) Organize the Pig properties file and document all properties


     [
https://issues.apache.org/jira/browse/PIG-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3901:
--------------------------------------

    Status: Patch Available  (was: Open)

> Organize the Pig properties file and document all properties
> ------------------------------------------------------------
>
>                 Key: PIG-3901
>                 URL: https://issues.apache.org/jira/browse/PIG-3901
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: conf, config, documentation, properties, settings
>         Attachments: organize_pig_properties.patch
>
>
> The current pig.properties file can use some love. Each property should be introduced by a documentation
> string explaining
> * what the feature does,
> * what its default and other allowed values are,
> * why a user might change it from the default,
> * and what might go wrong with each.
> The documentation should follow a common format -- I propose the following guidelines:
> * Each property should supply either a bulleted list of acceptable values, indicating the default; or

Philip (flip) Kromer (JIRA) | 17 Apr 02:17 2014

[Updated] (PIG-3901) Organize the Pig properties file and document all properties


     [
https://issues.apache.org/jira/browse/PIG-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3901:
--------------------------------------

    Attachment: organize_pig_properties.patch

Work in progress: Organize the pig properties file and document all properties

> Organize the Pig properties file and document all properties
> ------------------------------------------------------------
>
>                 Key: PIG-3901
>                 URL: https://issues.apache.org/jira/browse/PIG-3901
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: conf, config, documentation, properties, settings
>         Attachments: organize_pig_properties.patch
>
>
> The current pig.properties file can use some love. Each property should be introduced by a documentation
> string explaining
> * what the feature does,
> * what its default and other allowed values are,
> * why a user might change it from the default,
> * and what might go wrong with each.

Philip (flip) Kromer (JIRA) | 17 Apr 02:11 2014

[Updated] (PIG-3901) Organize the Pig properties file and document all properties


     [
https://issues.apache.org/jira/browse/PIG-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3901:
--------------------------------------

    Status: Open  (was: Patch Available)

> Organize the Pig properties file and document all properties
> ------------------------------------------------------------
>
>                 Key: PIG-3901
>                 URL: https://issues.apache.org/jira/browse/PIG-3901
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: conf, config, documentation, properties, settings
>
> The current pig.properties file can use some love. Each property should be introduced by a documentation
> string explaining
> * what the feature does,
> * what its default and other allowed values are,
> * why a user might change it from the default,
> * and what might go wrong with each.
> The documentation should follow a common format -- I propose the following guidelines:
> * Each property should supply either a bulleted list of acceptable values, indicating the default; or
>   provide the default value inline with the description
> * Don't say 'This setting lets you control whether Pig will decide to use the Hemiconducer feature', say

Philip (flip) Kromer (JIRA) | 17 Apr 02:11 2014

[Updated] (PIG-3901) Organize the Pig properties file and document all properties


     [
https://issues.apache.org/jira/browse/PIG-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3901:
--------------------------------------

    Status: Patch Available  (was: Open)

> Organize the Pig properties file and document all properties
> ------------------------------------------------------------
>
>                 Key: PIG-3901
>                 URL: https://issues.apache.org/jira/browse/PIG-3901
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: conf, config, documentation, properties, settings
>
> The current pig.properties file can use some love. Each property should be introduced by a documentation
> string explaining
> * what the feature does,
> * what its default and other allowed values are,
> * why a user might change it from the default,
> * and what might go wrong with each.
> The documentation should follow a common format -- I propose the following guidelines:
> * Each property should supply either a bulleted list of acceptable values, indicating the default; or
>   provide the default value inline with the description
> * Don't say 'This setting lets you control whether Pig will decide to use the Hemiconducer feature', say

Philip (flip) Kromer (JIRA) | 17 Apr 02:09 2014

[Created] (PIG-3901) Organize the Pig properties file and document all properties

Philip (flip) Kromer created PIG-3901:
-----------------------------------------

             Summary: Organize the Pig properties file and document all properties
                 Key: PIG-3901
                 URL: https://issues.apache.org/jira/browse/PIG-3901
             Project: Pig
          Issue Type: Improvement
            Reporter: Philip (flip) Kromer
            Priority: Minor

The current pig.properties file can use some love. Each property should be introduced by a documentation
string explaining

* what the feature does,
* what its default and other allowed values are,
* why a user might change it from the default,
* and what might go wrong with each.

The documentation should follow a common format -- I propose the following guidelines:

* Each property should supply either a bulleted list of acceptable values, indicating the default; or
provide the default value inline with the description
* Don't say 'This setting lets you control whether Pig will decide to use the Hemiconducer feature', say
'Enables the hemiconducer feature, which [...]'
* Don't document the internals of the feature. Describe its impact on job execution or performance.
* Use consistent indentation, title formatting, and block delimiting. (The current patch does not yet do
so completely, as I'm figuring it out)
* Place each setting in the appropriate block according to its impact on the user experience.
* Call out Experimental features with `EXPERIMENTAL`, but group them with similar settings.

Philip (flip) Kromer (JIRA) | 17 Apr 01:46 2014

[Updated] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set


     [
https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3900:
--------------------------------------

    Description: 
SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large
input set. Although PIG-2965 allows the RANDOM function to be constructed with a seed, each mapper will
generate the same sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely non-deterministic. Testing
becomes difficult, and failed map tasks don't provide the same output from attempt to attempt, which
complicates debugging.

The most desirable implementation would provide a guarantee that a given seed and input data would produce
an identical result in any environment. I believe this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the output would be stable for most
practical purposes -- as long as the assignment of input splits to mappers doesn't change from job to job,
the number produced for each row won't change from job to job. Doing it this way would be backwards
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of a single mapper (which is
the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input
file path, the split offset, and the provided seed. Neither approach is stable if the
splitCombination logic is not stable.

Suggested documentation for new functionality of RANDOM:

{quote}

Philip (flip) Kromer (JIRA) | 17 Apr 01:44 2014

[Created] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set

Philip (flip) Kromer created PIG-3900:
-----------------------------------------

             Summary: SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large
input set
                 Key: PIG-3900
                 URL: https://issues.apache.org/jira/browse/PIG-3900
             Project: Pig
          Issue Type: Bug
            Reporter: Philip (flip) Kromer
            Priority: Minor

SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large
input set. Although PIG-2965 allows the RANDOM function to be constructed with a seed, each mapper will
generate the same sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely non-deterministic. Testing
becomes more complicated, and failed map tasks don't provide the same output from attempt to attempt,
making debugging difficult.

The most desirable implementation would provide a guarantee that a given seed and input data would produce
an identical result in any environment. I believe this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the output would be stable for most
practical purposes -- as long as the assignment of input splits to mappers doesn't change from job to job,
the number produced for each row won't change from job to job. Doing it this way would be backwards
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of a single mapper (which is
the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input
file path, the split offset, and the provided seed. Neither approach is stable if the
splitCombination logic is not stable.
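
The two seeding schemes described above can be sketched roughly as follows. This is only an illustration of the idea, not Pig code: it is written in Python for brevity, and the function names, the SHA-256 hash, and the example path are assumptions made for the sketch rather than anything taken from the Pig 0.12.0 implementation or a proposed patch.

```python
import hashlib
import random


def seed_from_task_index(base_seed, task_index):
    """Option 1 (hypothetical helper): add the mapper's task index to the seed."""
    return base_seed + task_index


def seed_from_split(base_seed, input_path, split_offset):
    """Option 2 (hypothetical helper): hash the path, split offset, and seed."""
    key = "{}\x00{}\x00{}".format(input_path, split_offset, base_seed)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 8 bytes so the value fits in a signed 64-bit seed.
    return int.from_bytes(digest[:8], "big", signed=True)


def random_stream(seed, count):
    """Return the first `count` values of a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


# Task index 0 leaves the seed untouched, so a single-mapper job sees the
# same sequence as a plain seeded generator.
single_mapper = random_stream(seed_from_task_index(42, 0), 3)

# A second mapper gets its own sequence instead of repeating the first one.
second_mapper = random_stream(seed_from_task_index(42, 1), 3)

# The split-based seed depends only on its inputs, so it repeats across runs
# as long as the same split is read.
split_seed = seed_from_split(42, "/data/part-00000", 0)
```

In both variants the result is only as repeatable as the inputs to the seed: if splits are assigned or combined differently from one job to the next, the task index or the (path, offset) pair changes, and so does the sequence.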

Philip (flip) Kromer (JIRA) | 17 Apr 01:44 2014

[Updated] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set


     [
https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated PIG-3900:
--------------------------------------

    Description: 
SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large
input set. Although PIG-2965 allows the RANDOM function to be constructed with a seed, each mapper will
generate the same sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely non-deterministic. Testing
becomes more complicated, and failed map tasks don't provide the same output from attempt to attempt,
making debugging difficult.

The most desirable implementation would provide a guarantee that a given seed and input data would produce
an identical result in any environment. I believe this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the output would be stable for most
practical purposes -- as long as the assignment of input splits to mappers doesn't change from job to job,
the number produced for each row won't change from job to job. Doing it this way would be backwards
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of a single mapper (which is
the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input
file path, the split offset, and the provided seed. Neither approach is stable if the
splitCombination logic is not stable.

Suggested documentation for new functionality of RANDOM:

{quote}

