Nhat Hoang | 24 Jul 14:06 2014

An optimization that makes the ROLLUP operation in Apache Pig 50% faster [PIG-4066]

Hello everyone,

I am currently a master's student at EURECOM (www.eurecom.fr). I am working
on a project related to Apache Pig as part of the EU-funded Bigfoot project
(www.bigfootproject.eu).

Based on our previous work: “Duy-Hung Phan, Matteo Dell’Amico, Pietro
Michiardi: On the design space of MapReduce ROLLUP aggregates” (
http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), I
am working on a new family of algorithms to address some limitations of the
current ROLLUP operator in Apache Pig: IRG (in-reducer grouping), hybrid
IRG, and chained IRG. My implementation shows better performance than the
existing ROLLUP implementation.
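
To make the idea of grouping inside the reducer more concrete, here is a
minimal sketch in Python. It is not the code from the PIG-4066 patch and not
necessarily the exact algorithm from the paper; it only illustrates how, once
records arrive sorted by their full key, every ROLLUP level can be aggregated
in a single pass. The function name `rollup_one_pass` and the sample data are
made up for illustration, and SUM is assumed as the aggregate.

```python
def rollup_one_pass(sorted_records):
    """Compute ROLLUP sums from (key_tuple, value) pairs sorted by key_tuple.

    sums[i] holds the running sum for the current key prefix of length i.
    When the key changes at position d, every prefix longer than d is
    finished, so its sum is emitted and reset.
    """
    results = []
    prev = None
    sums = []
    for key, value in sorted_records:
        if prev is None:
            sums = [0] * (len(key) + 1)
        else:
            # First position where the new key differs from the previous one.
            diff = next((i for i, (a, b) in enumerate(zip(prev, key)) if a != b),
                        len(key))
            # Flush all levels whose prefix has just ended (deepest first).
            for level in range(len(key), diff, -1):
                results.append((prev[:level], sums[level]))
                sums[level] = 0
        for level in range(len(key) + 1):
            sums[level] += value
        prev = key
    if prev is not None:
        # End of input: flush every level, including the grand total ().
        for level in range(len(prev), -1, -1):
            results.append((prev[:level], sums[level]))
    return results


# Hypothetical (year, month) -> amount records, sorted as a shuffle would deliver them.
records = sorted([((2014, 2), 2), ((2014, 1), 5), ((2015, 1), 4), ((2014, 1), 3)])
for group, total in rollup_one_pass(records):
    print(group, total)
```

For the sample data this emits one total per (year, month), one per year, and
a grand total of 14 under the empty key, without sending a separate copy of
each record for every ROLLUP level.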

You can find more information on this work here:
https://issues.apache.org/jira/browse/PIG-4066. I've also created a review
request on the review board: https://reviews.apache.org/r/23804/
It would be very helpful if someone could review this patch and give me
some feedback.

Looking forward to your feedback.

Regards,
Quang-Nhat HOANG-XUAN

Lorand Bendig (JIRA) | 24 Jul 11:04 2014

[Updated] (PIG-3365) Run as uber job if there is only one input split


     [
https://issues.apache.org/jira/browse/PIG-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lorand Bendig updated PIG-3365:
-------------------------------

    Attachment: PIG-3365-2.patch

[~rohini], thanks for having a look at the patch.
The aim of this initial patch was to follow the logic of Hadoop's ubertask decision on the Pig side.
{{JobControlCompiler#okToRunLocal}} just takes an additional parameter, {{totalInputFileSize}},
from {{JobControlCompiler#getJob}}, where the ubertask decision is made. Because totalInputFileSize
is needed for the decision, I passed it to okToRunLocal() so that it does not have to be recalculated.
I have attached a further patch based on your suggestion. As far as I can see, uber mode enabled in
{{PigInputFormat#getSplits}} will be picked up by the job.
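
The kind of size-based check described above can be sketched as follows. This
is an illustration in Python, not the code from either patch: the function
name `ok_to_run_uber`, its parameters, and the default thresholds are
assumptions. The thresholds are modelled on Hadoop's
mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces and
mapreduce.job.ubertask.maxbytes settings; the 128 MB byte limit used here is
only an assumed block size.

```python
# Hypothetical sketch of an uber-mode eligibility check; not the PIG-3365 patch.
MB = 1024 * 1024


def ok_to_run_uber(num_input_splits, num_reducers, total_input_bytes,
                   max_maps=9, max_reduces=1, max_bytes=128 * MB):
    """Return True when a job looks small enough to run inside the Application Master.

    total_input_bytes is passed in by the caller so that the input size
    does not have to be computed a second time.
    """
    return (num_input_splits <= max_maps
            and num_reducers <= max_reduces
            and total_input_bytes <= max_bytes)


# A single small split qualifies; a single oversized input does not.
print(ok_to_run_uber(num_input_splits=1, num_reducers=1, total_input_bytes=10 * MB))
print(ok_to_run_uber(num_input_splits=1, num_reducers=1, total_input_bytes=512 * MB))
```

Passing the already computed input size into the check reflects the point
made above about not recalculating totalInputFileSize.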

> Run as uber job if there is only one input split
> ------------------------------------------------
>
>                 Key: PIG-3365
>                 URL: https://issues.apache.org/jira/browse/PIG-3365
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Lorand Bendig
>              Labels: Performance
>         Attachments: PIG-3365-2.patch, PIG-3365.patch
>
>

Rohini Palaniswamy (JIRA) | 24 Jul 06:28 2014

[Commented] (PIG-3854) Simplify plan of Limit on Tez


    [
https://issues.apache.org/jira/browse/PIG-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072776#comment-14072776
] 

Rohini Palaniswamy commented on PIG-3854:
-----------------------------------------

1) In case of union - TestTezCompiler.testUnionLimit
2) In case of split and orderby - TestLimitVariable.testLimitVariable2

> Simplify plan of Limit on Tez
> -----------------------------
>
>                 Key: PIG-3854
>                 URL: https://issues.apache.org/jira/browse/PIG-3854
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>
> When checking out union followed by limit in PIG-3835, saw that the plan was very complex. We should certainly be able to simplify it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

Rohini Palaniswamy (JIRA) | 24 Jul 06:26 2014

[Updated] (PIG-3854) Simplify plan of Limit on Tez


     [
https://issues.apache.org/jira/browse/PIG-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3854:
------------------------------------

    Summary: Simplify plan of Limit on Tez  (was: Improve performance of Limit on Tez)

> Simplify plan of Limit on Tez
> -----------------------------
>
>                 Key: PIG-3854
>                 URL: https://issues.apache.org/jira/browse/PIG-3854
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>
> When checking out union followed by limit in PIG-3835, saw that the plan was very complex. We should certainly be able to simplify it.

Rohini Palaniswamy (JIRA) | 24 Jul 06:17 2014

[Updated] (PIG-3365) Run as uber job if there is only one input split


     [
https://issues.apache.org/jira/browse/PIG-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3365:
------------------------------------

    Status: Open  (was: Patch Available)

> Run as uber job if there is only one input split
> ------------------------------------------------
>
>                 Key: PIG-3365
>                 URL: https://issues.apache.org/jira/browse/PIG-3365
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Lorand Bendig
>              Labels: Performance
>         Attachments: PIG-3365.patch
>
>
> Hadoop 2 has support for uber mode (mapreduce.job.ubertask.enable=true), which runs the map and reduce tasks on the Application Master itself and reduces the overhead of launching a separate map/reduce task.

Rohini Palaniswamy (JIRA) | 24 Jul 05:46 2014

[Commented] (PIG-4002) Disable combiner when map-side aggregation is used


    [
https://issues.apache.org/jira/browse/PIG-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072755#comment-14072755
] 

Rohini Palaniswamy commented on PIG-4002:
-----------------------------------------

If we set it differently on InputDescriptor and OutputDescriptor, it should work, I believe.

> Disable combiner when map-side aggregation is used
> --------------------------------------------------
>
>                 Key: PIG-4002
>                 URL: https://issues.apache.org/jira/browse/PIG-4002
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.13.0
>            Reporter: Travis Woodruff
>            Assignee: Travis Woodruff
>            Priority: Minor
>         Attachments: PIG-4002-1.patch
>
>
> This may be controversial, so I'd like others' opinions on this.
> It is not currently possible to disable the combiner and use map-side aggregation at the same time. This is problematic since map-side aggregation effectively combines in the mapper, so running the combiner adds expensive combiner execution (combiner requires deserialization & reserialization) for little to no value.
> PIG-2829 had a patch to disable the combiner when map-side aggregation is used (along with some other

Rohini Palaniswamy (JIRA) | 24 Jul 05:41 2014

[Commented] (PIG-4064) Fix tez auto parallelism test failures


    [
https://issues.apache.org/jira/browse/PIG-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072753#comment-14072753
] 

Rohini Palaniswamy commented on PIG-4064:
-----------------------------------------

Looks good. Can you clarify one doubt, though? In TestEvalPipeline2.testLimitAutoReducer, why is the
parallelism produced by MR and Tez so different for the same reducer bytes? What is the differentiating behavior?

Also, can you create a separate JIRA to fix the secondary sort optimizer so that it supports multiple predecessors?

When checking in, could you also remove the comments below, which are no longer valid?
// MR SecondaryKeyOptimizer currently does not check for this case.  (Prashant fixed MR in PIG-3827)
//Let's add  combiner if possible.  (Copied over wrongly, I believe)
//TODO: Case of from plan leaf being POUnion. (Don't think we have POUnion in Tez plan at all)

> Fix tez auto parallelism test failures
> --------------------------------------
>
>                 Key: PIG-4064
>                 URL: https://issues.apache.org/jira/browse/PIG-4064
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>            Assignee: Daniel Dai
>             Fix For: 0.14.0
>

Cheolsoo Park (JIRA) | 24 Jul 04:07 2014

[Resolved] (PIG-4068) ObjectCache cause ClassCastException


     [
https://issues.apache.org/jira/browse/PIG-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park resolved PIG-4068.
--------------------------------

    Resolution: Fixed

Committed to trunk. Thank you Daniel!

> ObjectCache cause ClassCastException
> ------------------------------------
>
>                 Key: PIG-4068
>                 URL: https://issues.apache.org/jira/browse/PIG-4068
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4068-1.patch
>
>
> Here is stack trace-
> {code}
> 2014-07-23 20:19:41,503 INFO [TezChild] org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1405989231017_0003_1_10_000087_2

jira | 24 Jul 03:03 2014

Subscription: PIG patch available

Issue Subscription
Filter: PIG patch available (17 issues)

Subscriber: pigdaily

Key         Summary
PIG-4066    An optimization for ROLLUP operation in Pig
            https://issues.apache.org/jira/browse/PIG-4066
PIG-4065    Fix failing unit tests in Tez
            https://issues.apache.org/jira/browse/PIG-4065
PIG-4047    Break up pig withouthadoop and fat jar
            https://issues.apache.org/jira/browse/PIG-4047
PIG-4008    Pig code change to enable Tez Local mode 
            https://issues.apache.org/jira/browse/PIG-4008
PIG-4004    Upgrade the Pigmix queries from the (old) mapred API to mapreduce
            https://issues.apache.org/jira/browse/PIG-4004
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3861    duplicate jars get added to distributed cache
            https://issues.apache.org/jira/browse/PIG-3861

Daniel Dai (JIRA) | 24 Jul 02:46 2014

[Updated] (PIG-4057) Group All followed by CROSS with default parallelism produces wrong results


     [
https://issues.apache.org/jira/browse/PIG-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4057:
----------------------------

    Attachment: PIG-4057-2.patch

> Group All followed by CROSS with default parallelism produces wrong results
> ---------------------------------------------------------------------------
>
>                 Key: PIG-4057
>                 URL: https://issues.apache.org/jira/browse/PIG-4057
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Daniel Dai
>             Fix For: 0.14.0
>
>         Attachments: PIG-4057-1.patch, PIG-4057-2.patch
>
>
> SET default_parallel 199;
> ......
> by_size = ...
> uniq_vals = .....
> grpd = group uniq_vals all;
> all_vals = FOREACH grpd GENERATE uniq_vals;
> cross_result = CROSS by_size, all_vals;

Daniel Dai (JIRA) | 24 Jul 02:34 2014

[Commented] (PIG-4068) ObjectCache cause ClassCastException


    [
https://issues.apache.org/jira/browse/PIG-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072633#comment-14072633
] 

Daniel Dai commented on PIG-4068:
---------------------------------

+1

> ObjectCache cause ClassCastException
> ------------------------------------
>
>                 Key: PIG-4068
>                 URL: https://issues.apache.org/jira/browse/PIG-4068
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4068-1.patch
>
>
> Here is stack trace-
> {code}
> 2014-07-23 20:19:41,503 INFO [TezChild] org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1405989231017_0003_1_10_000087_2
> java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map