jira | 24 Oct 08:00 2014

Subscription: PIG patch available

Issue Subscription
Filter: PIG patch available (21 issues)

Subscriber: pigdaily

Key         Summary
PIG-4247    S3 properties are not picked up from core-site.xml in local mode
            https://issues.apache.org/jira/browse/PIG-4247
PIG-4246    HBaseStorage should implement getShipFiles
            https://issues.apache.org/jira/browse/PIG-4246
PIG-4241    Auto local mode mistakenly converts large jobs to local mode when using with Hive tables
            https://issues.apache.org/jira/browse/PIG-4241
PIG-4239    "pig.output.lazy" not works in spark mode
            https://issues.apache.org/jira/browse/PIG-4239
PIG-4224    Upload Tez payload history string to timeline server
            https://issues.apache.org/jira/browse/PIG-4224
PIG-4160    -forcelocaljars / -j flag when using a remote url for a script
            https://issues.apache.org/jira/browse/PIG-4160
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4103    Fix TestRegisteredJarVisibility(after PIG-4083)
            https://issues.apache.org/jira/browse/PIG-4103
PIG-4084    Port TestPigRunner to Tez
            https://issues.apache.org/jira/browse/PIG-4084
PIG-4066    An optimization for ROLLUP operation in Pig
            https://issues.apache.org/jira/browse/PIG-4066
PIG-4004    Upgrade the Pigmix queries from the (old) mapred API to mapreduce
            https://issues.apache.org/jira/browse/PIG-4004
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002

Daniel Dai (JIRA) | 24 Oct 07:36 2014

[Commented] (PIG-4246) HBaseStorage should implement getShipFiles


    [
https://issues.apache.org/jira/browse/PIG-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182463#comment-14182463
] 

Daniel Dai commented on PIG-4246:
---------------------------------

+1

> HBaseStorage should implement getShipFiles
> ------------------------------------------
>
>                 Key: PIG-4246
>                 URL: https://issues.apache.org/jira/browse/PIG-4246
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>         Attachments: PIG-4246-1.patch
>
>
> HBaseStorage.initializeHBaseClassLoaderResources() uses TableMapReduceUtil APIs to add
> dependency jars. That sets the tmpjars setting which makes JobClient ship the jars to hdfs and use that
> path in distributed cache. That bypasses the optimizations in PIG-2672 and PIG-3861 which avoid
> shipping the jars to hdfs. Instead it should implement the getShipFiles() API introduced in PIG-4141 so
> that PIG-2672 or PIG-3861 avoid shipping the same jar multiple times to hdfs for a job.
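
As a rough illustration of the idea in the description above (not the contents of PIG-4246-1.patch), the sketch below models functions that report their dependency jars through a getShipFiles()-style method, so that one caller can de-duplicate them before anything is shipped. ShipFileProvider, collectShipFiles, and the jar paths are hypothetical names chosen for this example; none of Pig's real classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a load/store function that declares the jars it needs.
interface ShipFileProvider {
    List<String> getShipFiles();
}

final class ShipFilesSketch {
    // Gathers jars from every provider, keeping first-seen order and dropping
    // repeats, so each jar would be shipped at most once per job.
    static List<String> collectShipFiles(List<ShipFileProvider> providers) {
        Set<String> unique = new LinkedHashSet<>();
        for (ShipFileProvider provider : providers) {
            unique.addAll(provider.getShipFiles());
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Two functions in the same script that share one dependency (made-up paths).
        ShipFileProvider loader = () -> Arrays.asList("/lib/hbase-client.jar", "/lib/zookeeper.jar");
        ShipFileProvider storer = () -> Arrays.asList("/lib/hbase-client.jar", "/lib/protobuf.jar");
        System.out.println(collectShipFiles(Arrays.asList(loader, storer)));
    }
}
```

The point of the design is that the functions only declare what they need; the decision about how and whether to ship each jar stays in one place, which is where a de-duplication or caching optimization can be applied.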


liyunzhang_intel (JIRA) | 24 Oct 04:32 2014

[Commented] (PIG-4168) Initial implementation of unit tests for Pig on Spark


    [
https://issues.apache.org/jira/browse/PIG-4168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182336#comment-14182336
] 

liyunzhang_intel commented on PIG-4168:
---------------------------------------

Hi [~rohini]:
Thanks for your comment. The low value of io.sort.mb is to avoid Pig running out of memory in local mode. If you
think 1 is too low, I can change it to another value such as "200".

> Initial implementation of unit tests for Pig on Spark
> -----------------------------------------------------
>
>                 Key: PIG-4168
>                 URL: https://issues.apache.org/jira/browse/PIG-4168
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Praveen Rachabattuni
>            Assignee: liyunzhang_intel
>         Attachments: PIG-4168.patch, PIG-4168_1.patch, PIG-4168_2.patch, PIG-4168_3.patch,
>                      PIG-4168_4.patch, PIG-4168_5.patch
>
>
> 1. ant clean jar; pig-0.14.0-SNAPSHOT-core-h1.jar will be generated by the command
> 2. export SPARK_PIG_JAR=$PIG_HOME/pig-0.14.0-SNAPSHOT-core-h1.jar
> 3. build the hadoop1 and spark env; spark runs in local mode
>   jps:

Rohini Palaniswamy (JIRA) | 24 Oct 00:44 2014

[Updated] (PIG-4246) HBaseStorage should implement getShipFiles


     [
https://issues.apache.org/jira/browse/PIG-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4246:
------------------------------------
    Fix Version/s: 0.14.0
         Assignee: Rohini Palaniswamy
           Status: Patch Available  (was: Open)

> HBaseStorage should implement getShipFiles
> ------------------------------------------
>
>                 Key: PIG-4246
>                 URL: https://issues.apache.org/jira/browse/PIG-4246
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>         Attachments: PIG-4246-1.patch
>
>
> HBaseStorage.initializeHBaseClassLoaderResources() uses TableMapReduceUtil APIs to add
> dependency jars. That sets the tmpjars setting which makes JobClient ship the jars to hdfs and use that
> path in distributed cache. That bypasses the optimizations in PIG-2672 and PIG-3861 which avoid
> shipping the jars to hdfs. Instead it should implement the getShipFiles() API introduced in PIG-4141 so
> that PIG-2672 or PIG-3861 avoid shipping the same jar multiple times to hdfs for a job.


Rohini Palaniswamy (JIRA) | 24 Oct 00:44 2014

[Updated] (PIG-4246) HBaseStorage should implement getShipFiles


     [
https://issues.apache.org/jira/browse/PIG-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4246:
------------------------------------
    Attachment: PIG-4246-1.patch

> HBaseStorage should implement getShipFiles
> ------------------------------------------
>
>                 Key: PIG-4246
>                 URL: https://issues.apache.org/jira/browse/PIG-4246
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>         Attachments: PIG-4246-1.patch
>
>
> HBaseStorage.initializeHBaseClassLoaderResources() uses TableMapReduceUtil APIs to add
> dependency jars. That sets the tmpjars setting which makes JobClient ship the jars to hdfs and use that
> path in distributed cache. That bypasses the optimizations in PIG-2672 and PIG-3861 which avoid
> shipping the jars to hdfs. Instead it should implement the getShipFiles() API introduced in PIG-4141 so
> that PIG-2672 or PIG-3861 avoid shipping the same jar multiple times to hdfs for a job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Rohini Palaniswamy (JIRA) | 24 Oct 00:42 2014

[Updated] (PIG-3456) Reduce threadlocal conf access in backend for each record


     [
https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3456:
------------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Checked into branch-0.14 and trunk. Thanks for the review Daniel.

> Reduce threadlocal conf access in backend for each record
> ---------------------------------------------------------
>
>                 Key: PIG-3456
>                 URL: https://issues.apache.org/jira/browse/PIG-3456
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.11.1
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>         Attachments: PIG-3456-1-no-whitespace.patch, PIG-3456-1.patch, PIG-3456-2.patch, PIG-3456-3.patch
>
>
> Noticed a few things while browsing the code
> 1) DefaultTuple has a protected boolean isNull = false; which is never used. Removing this gives ~3-5%
> improvement for big jobs

Rohini Palaniswamy (JIRA) | 23 Oct 23:03 2014

[Updated] (PIG-3456) Reduce threadlocal conf access in backend for each record


     [
https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3456:
------------------------------------
    Attachment: PIG-3456-3.patch

> Reduce threadlocal conf access in backend for each record
> ---------------------------------------------------------
>
>                 Key: PIG-3456
>                 URL: https://issues.apache.org/jira/browse/PIG-3456
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.11.1
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>         Attachments: PIG-3456-1-no-whitespace.patch, PIG-3456-1.patch, PIG-3456-2.patch, PIG-3456-3.patch
>
>
> Noticed a few things while browsing the code
> 1) DefaultTuple has a protected boolean isNull = false; which is never used. Removing this gives ~3-5%
> improvement for big jobs
> 2) Config checking with ThreadLocal conf is done repeatedly for each record, e.g. in createDataBag in
> POCombinerPackage, whereas other places such as POPackage and POJoinPackage initialize it only the first time.
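
To make the second point concrete, here is a minimal sketch, assuming a simplified setup rather than Pig's actual operators: one method consults a thread-local configuration for every record, the other reads the setting once and reuses it. ConfLookupSketch, the property key example.use.spillable.bag, and the lookup counter are all hypothetical and exist only to show the difference in lookup count.

```java
import java.util.Properties;

final class ConfLookupSketch {
    // Hypothetical stand-in for a per-thread job configuration.
    static final ThreadLocal<Properties> CONF = ThreadLocal.withInitial(Properties::new);
    // Counts how often the thread-local configuration is consulted.
    static int lookups = 0;

    static boolean readFlag() {
        lookups++;
        return Boolean.parseBoolean(CONF.get().getProperty("example.use.spillable.bag", "false"));
    }

    // Consults the thread-local configuration once per record.
    static int processPerRecord(int records) {
        int flagged = 0;
        for (int i = 0; i < records; i++) {
            if (readFlag()) {
                flagged++;
            }
        }
        return flagged;
    }

    // Reads the flag once before the loop and reuses the cached value.
    static int processCached(int records) {
        boolean flag = readFlag();
        int flagged = 0;
        for (int i = 0; i < records; i++) {
            if (flag) {
                flagged++;
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        CONF.get().setProperty("example.use.spillable.bag", "true");
        processPerRecord(1000);
        System.out.println("per-record lookups: " + lookups);
        lookups = 0;
        processCached(1000);
        System.out.println("cached lookups: " + lookups);
    }
}
```

Both methods return the same result; the cached variant is only valid when the setting cannot change while records are being processed, which is the assumption this sketch makes.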


Rohini Palaniswamy (JIRA) | 23 Oct 22:49 2014

[Updated] (PIG-3861) duplicate jars get added to distributed cache


     [
https://issues.apache.org/jira/browse/PIG-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3861:
------------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed to branch-0.14 and trunk. Thanks Mona for the patch and Daniel for the review.

> duplicate jars get added to distributed cache
> ---------------------------------------------
>
>                 Key: PIG-3861
>                 URL: https://issues.apache.org/jira/browse/PIG-3861
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Mona Chitnis
>            Assignee: Mona Chitnis
>            Priority: Minor
>             Fix For: 0.14.0
>
>         Attachments: PIG-3681-1.patch, PIG-3861-2.patch, PIG-3861-3.patch, PIG-3861-4.patch, PIG-3861-5.patch
>
>
> PigContext's scriptJars should handle de-duplication of jars to account for script engines e.g.
> JythonScriptEngine performing various jar loading for module and sometimes adding same jar twice.
> Also JobControlCompiler.shipToHdfs() needs a check against adding the same jar more than once, under

Cheolsoo Park (JIRA) | 23 Oct 22:18 2014

[Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables


     [
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-4241:
-------------------------------
    Attachment: PIG-4241-2.patch

Fixed {{TestInputSizeReducerEstimator}} in a new patch.

> Auto local mode mistakenly converts large jobs to local mode when using with Hive tables
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-4241
>                 URL: https://issues.apache.org/jira/browse/PIG-4241
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.15.0
>
>         Attachments: PIG-4241-1.patch, PIG-4241-2.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with non-file-based inputs into local mode
> unless the {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is particularly
> problematic when using Pig with Hive tables with custom LoadFuncs that did not implement LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like this. First, compute the total

Cheolsoo Park (JIRA) | 23 Oct 21:52 2014

[Updated] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode


     [
https://issues.apache.org/jira/browse/PIG-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-4247:
-------------------------------
    Status: Patch Available  (was: Open)

> S3 properties are not picked up from core-site.xml in local mode
> ----------------------------------------------------------------
>
>                 Key: PIG-4247
>                 URL: https://issues.apache.org/jira/browse/PIG-4247
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.15.0
>
>         Attachments: PIG-4247-1.patch
>
>
> Even in local mode, {{fs.s3}} properties need to be set if the job accesses s3 files (for eg, registering
> jars on s3, loading input files on s3, etc). In particular, {{fs.s3.awsSecretAccessKey}} and
> {{fs.s3.awsAccessKey}} are usually set in {{core-site.xml}}, but since local mode doesn't load
> {{core-site.xml}}, these properties have to be set again in {{pig.properties}}. This adds operational
> overhead because whenever rotating aws keys, {{pig.properties}} also needs to be updated. So it would be
> nice if {{fs.s3}} properties can be set even in local mode.
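
One way to picture the request, as a sketch only and not what PIG-4247-1.patch does: read a Hadoop-style configuration document and copy just the properties whose names start with a given prefix into a Properties object that local mode could then use. S3PropsSketch and extract are hypothetical names, the XML is a made-up sample, and the parsing uses only the JDK's XML classes rather than Hadoop's Configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class S3PropsSketch {
    // Copies every <property> whose <name> starts with the prefix into a Properties object.
    static Properties extract(String xml, String prefix) {
        Properties out = new Properties();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null && value != null && name.startsWith(prefix)) {
                    out.setProperty(name, value);
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
        return out;
    }

    // Returns the trimmed text of the first child element with the given tag, or null if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Placeholder values only; a real core-site.xml would be read from disk.
        String sample = "<configuration>"
                + "<property><name>fs.s3.awsSecretAccessKey</name><value>PLACEHOLDER</value></property>"
                + "<property><name>fs.s3.buffer.dir</name><value>/tmp/s3</value></property>"
                + "<property><name>fs.defaultFS</name><value>file:///</value></property>"
                + "</configuration>";
        System.out.println(extract(sample, "fs.s3").stringPropertyNames());
    }
}
```

Note that a plain "fs.s3" prefix also matches names such as fs.s3n.* or fs.s3a.*; whether that is desirable depends on which filesystems the job uses.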


Cheolsoo Park (JIRA) | 23 Oct 21:51 2014

[Updated] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode


     [
https://issues.apache.org/jira/browse/PIG-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-4247:
-------------------------------
    Attachment: PIG-4247-1.patch

Uploading a patch that loads s3-related properties from {{core-site.xml}} regardless of whether it's local
mode or not.

> S3 properties are not picked up from core-site.xml in local mode
> ----------------------------------------------------------------
>
>                 Key: PIG-4247
>                 URL: https://issues.apache.org/jira/browse/PIG-4247
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.15.0
>
>         Attachments: PIG-4247-1.patch
>
>
> Even in local mode, {{fs.s3}} properties need to be set if the job accesses s3 files (for eg, registering
> jars on s3, loading input files on s3, etc). In particular, {{fs.s3.awsSecretAccessKey}} and
> {{fs.s3.awsAccessKey}} are usually set in {{core-site.xml}}, but since local mode doesn't load
> {{core-site.xml}}, these properties have to be set again in {{pig.properties}}. This adds operational
> overhead because whenever rotating aws keys, {{pig.properties}} also needs to be updated. So it would be

