jira | 28 Jun 08:01 2016

Subscription: PIG patch available

Issue Subscription
Filter: PIG patch available (34 issues)

Subscriber: pigdaily

Key         Summary
PIG-4935    TEZ_USE_CLUSTER_HADOOP_LIBS is always set to true
            https://issues.apache.org/jira/browse/PIG-4935
PIG-4927    Support stop.on.failure in spark mode
            https://issues.apache.org/jira/browse/PIG-4927
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4922    Deadlock between SpillableMemoryManager and InternalSortedBag$SortedDataBagIterator
            https://issues.apache.org/jira/browse/PIG-4922
PIG-4919    Upgrade spark.version to 1.6.1
            https://issues.apache.org/jira/browse/PIG-4919
PIG-4918    Pig on Tez cannot switch pig.temp.dir to another fs
            https://issues.apache.org/jira/browse/PIG-4918
PIG-4903    Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
            https://issues.apache.org/jira/browse/PIG-4903
PIG-4897    Scope of param substitution for run/exec commands
            https://issues.apache.org/jira/browse/PIG-4897
PIG-4893    Task deserialization time is too long for spark on yarn mode
            https://issues.apache.org/jira/browse/PIG-4893
PIG-4886    Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
            https://issues.apache.org/jira/browse/PIG-4886
PIG-4871    Not use OperatorPlan#forceConnect in MultiQueryOptimizationSpark
            https://issues.apache.org/jira/browse/PIG-4871
PIG-4854    Merge spark branch to trunk
            https://issues.apache.org/jira/browse/PIG-4854

liyunzhang_intel (JIRA) | 28 Jun 07:01 2016

[Updated] (PIG-4797) JoinOptimization for spark mode


     [
https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4797:
----------------------------------
    Summary: JoinOptimization for spark mode  (was: Analyze JOIN performance and improve the same.)

> JoinOptimization for spark mode
> -------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: liyunzhang_intel
>              Labels: spork
>         Attachments: Join performance analysis.pdf, PIG-4797.patch, PIG-4797_2.patch
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);

Siddhi Mehta (JIRA) | 28 Jun 03:06 2016

[Updated] (PIG-4939) QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods


     [
https://issues.apache.org/jira/browse/PIG-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddhi Mehta updated PIG-4939:
------------------------------
    Description: 
{code}
A = load 'hbase://query/SELECT ID,NAME,DATE FROM HIRES WHERE DATE > TO_DATE('1990-12-21 05:55:00.000');
STORE A into 'output';
{code}

The above script throws an exception because Pig treats the location as an fs path and tries to convert it to a
URI after splitting it on commas.
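
A minimal Java sketch of that failure, using only java.net.URI. It is an illustration, not Pig or Hadoop code: the class CommaSplitDemo and its methods toUri and firstBadPiece are hypothetical, and toUri only approximates how a path string is divided into a scheme and a path before a URI is built. Under that assumption, the third comma-separated piece of the query contains "05:55", so the text before the colon is taken as a scheme and the rest is a relative path, which the URI constructor rejects.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class CommaSplitDemo {
    // Simplified stand-in for path-to-URI conversion (an assumption, not Pig's code):
    // text before the first ':' is treated as the scheme unless a '/' comes first.
    public static URI toUri(String piece) throws URISyntaxException {
        String scheme = null;
        String path = piece;
        int colon = piece.indexOf(':');
        int slash = piece.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = piece.substring(0, colon);
            path = piece.substring(colon + 1);
        }
        // The multi-argument constructor rejects a scheme combined with a relative path.
        return new URI(scheme, null, path, null, null);
    }

    // Splits the location on commas and returns the index of the first piece
    // that cannot be converted, or -1 if every piece converts.
    public static int firstBadPiece(String location) {
        String[] pieces = location.split(",");
        for (int i = 0; i < pieces.length; i++) {
            try {
                toUri(pieces[i]);
            } catch (URISyntaxException e) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String query = "hbase://query/SELECT ID,NAME,DATE FROM HIRES WHERE DATE > "
                + "TO_DATE('1990-12-21 05:55:00.000')";
        System.out.println("first failing piece: " + firstBadPiece(query));
        System.out.println("hdfs list: " + firstBadPiece("hdfs://nn:8020/data/a,hdfs://nn:8020/data/b"));
    }
}
```

A genuine comma-separated list of hdfs locations passes through the same steps without error, which is why the split is harmless for fs-based loaders.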

The code that does this is:

{code}
 String buildLoadOp(SourceLocation loc, String alias, String filename, FuncSpec funcSpec, LogicalSchema schema)
    throws ParserValidationException {
        String absolutePath;
        LoadFunc loFunc;
        try {
            // Load LoadFunc class from default properties if funcSpec is null.
            // Fallback on PigStorage if LoadFunc is not specified in properties.
            funcSpec = funcSpec == null
                ? new FuncSpec(pigContext.getProperties().getProperty(
                      PigConfiguration.PIG_DEFAULT_LOAD_FUNC, PigStorage.class.getName()))
                : funcSpec;
            loFunc = (LoadFunc)PigContext.instantiateFuncFromSpec(funcSpec);

Siddhi Mehta (JIRA) | 28 Jun 02:57 2016

[Commented] (PIG-4939) QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods


    [
https://issues.apache.org/jira/browse/PIG-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352143#comment-15352143
] 

Siddhi Mehta commented on PIG-4939:
-----------------------------------

[~prkommireddi]
Is there already a notion of non-fs loaders?

If not, for the approach I was thinking of:
1. A property that holds the comma-separated list of non-dfs paths. The property could be set by the loader in the
relativeToAbsolutePath mapping phase.
2. An interface that the load/store func implements to indicate 

Thoughts?
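
A rough Java sketch of how the two options above might gate the call. Everything here is hypothetical and not part of Pig: the marker interface NonFsLocation, the helper shouldSetHdfsServers, and the reading of option 1 as a comma-separated list of location prefixes (for example "hbase://") are assumptions made only for illustration.

```java
final class NonFsGuardSketch {
    /** Hypothetical marker interface a load/store func could implement (option 2). */
    public interface NonFsLocation {}

    /**
     * Returns true when the HDFS-server lookup should run for this location.
     * nonDfsPrefixes stands in for a hypothetical comma-separated property value (option 1).
     */
    public static boolean shouldSetHdfsServers(Object loader, String location, String nonDfsPrefixes) {
        if (loader instanceof NonFsLocation) {
            return false; // the loader declares that its locations are not fs paths
        }
        for (String prefix : nonDfsPrefixes.split(",")) {
            String p = prefix.trim();
            if (!p.isEmpty() && location.startsWith(p)) {
                return false; // the location matches a configured non-dfs prefix
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Object plainLoader = new Object();
        Object markedLoader = new NonFsLocation() {};
        System.out.println(shouldSetHdfsServers(plainLoader, "hdfs://nn:8020/data/a", "hbase://"));
        System.out.println(shouldSetHdfsServers(plainLoader, "hbase://query/SELECT ID,NAME", "hbase://"));
        System.out.println(shouldSetHdfsServers(markedLoader, "hbase://query/SELECT ID,NAME", ""));
    }
}
```

With either check in place, only locations that are still treated as fs paths would reach QueryParserUtils.setHdfsServers.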

> QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4939
>                 URL: https://issues.apache.org/jira/browse/PIG-4939
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Siddhi Mehta
>            Priority: Minor
>
> QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods

Siddhi Mehta (JIRA) | 28 Jun 02:50 2016

[Updated] (PIG-4939) QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods


     [
https://issues.apache.org/jira/browse/PIG-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddhi Mehta updated PIG-4939:
------------------------------
    Description: 
QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods

{code}
 String buildLoadOp(SourceLocation loc, String alias, String filename, FuncSpec funcSpec, LogicalSchema schema)
    throws ParserValidationException {
        String absolutePath;
        LoadFunc loFunc;
        try {
            // Load LoadFunc class from default properties if funcSpec is null.
            // Fallback on PigStorage if LoadFunc is not specified in properties.
            funcSpec = funcSpec == null
                ? new FuncSpec(pigContext.getProperties().getProperty(
                      PigConfiguration.PIG_DEFAULT_LOAD_FUNC, PigStorage.class.getName()))
                : funcSpec;
            loFunc = (LoadFunc)PigContext.instantiateFuncFromSpec(funcSpec);
            ......
            .......
            if (absolutePath == null) {
                absolutePath = loFunc.relativeToAbsolutePath( filename, QueryParserUtils.getCurrentDir( pigContext ) );

                if (absolutePath!=null) {
                    QueryParserUtils.setHdfsServers( absolutePath, pigContext );

Siddhi Mehta (JIRA) | 28 Jun 02:21 2016

[Created] (PIG-4939) QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for non-dfs methods

Siddhi Mehta created PIG-4939:
---------------------------------

             Summary: QueryParserUtils.setHdfsServers(QueryParserUtils.java:104) should not be called for
non-dfs methods
                 Key: PIG-4939
                 URL: https://issues.apache.org/jira/browse/PIG-4939
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Siddhi Mehta
            Priority: Minor

On Mon, Jun 27, 2016 at 1:04 PM, Prashant Kommireddi
<prash1784@...> wrote:
Agreed. This method call isn't needed for phoenix loader (or any such
non-direct-fs loaders). You should allow a config to handle it.

On Mon, Jun 27, 2016 at 12:14 PM, Siddhi Mehta <sm26217@...> wrote:

> Hello All,
>
> I am getting a URISyntaxException when I try to execute my pig script using
> PhoenixHBaseLoader. Trace attached below.
> Looking through the code, Pig splits multiple paths provided to it based on
> comma (','), and during the query parsing step
> QueryParserUtils.setHdfsServers(absolutePath, pigContext) tries to split
> paths based on comma (',') and create URIs/Paths for the same.
>
> Certain loaders like 'PhoenixHBaseLoader' do not pass hdfs locations and

Ivo Lenting (JIRA) | 27 Jun 22:34 2016

[Created] (PIG-4938) [PiggyBank] XPath returns empty values when using aggregation method

Ivo Lenting created PIG-4938:
--------------------------------

             Summary: [PiggyBank] XPath returns empty values when using aggregation method
                 Key: PIG-4938
                 URL: https://issues.apache.org/jira/browse/PIG-4938
             Project: Pig
          Issue Type: Bug
          Components: piggybank
    Affects Versions: 0.15.0
            Reporter: Ivo Lenting
            Priority: Minor

I have an XML file which I want to parse using the piggybank XPath UDF.

The xml is:
<Aa name="test1">	
	<Bb Cc="1"/>
	<Bb Cc="1"/>
	<Bb Cc="1"/>
	<Bb Cc="1"/>
	<Dd>test2</Dd>
</Aa>

The xpath contains a sum aggregate to sum all Cc values. 
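
For comparison outside Pig, here is a small Java sketch that evaluates a sum aggregate over the XML above with the JDK's javax.xml.xpath API. The expression sum(/Aa/Bb/@Cc) is an assumption, since the exact expression used in the script is not shown here, and the class XPathSumCheck is hypothetical. The sketch only shows what a standard XPath 1.0 evaluator returns for this input; it says nothing about how the piggybank UDF evaluates the expression.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathSumCheck {
    // Same structure as the XML in the report, written on one line.
    public static final String XML =
            "<Aa name=\"test1\">"
            + "<Bb Cc=\"1\"/><Bb Cc=\"1\"/><Bb Cc=\"1\"/><Bb Cc=\"1\"/>"
            + "<Dd>test2</Dd>"
            + "</Aa>";

    // Evaluates the assumed expression and asks for a NUMBER result,
    // because sum() yields a number rather than a node set.
    public static double sumCc(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Double) xpath.evaluate(
                    "sum(/Aa/Bb/@Cc)",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumCc(XML)); // four Cc="1" attributes, so this prints 4.0
    }
}
```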
The complete pig script:

REGISTER piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

Siddhi Mehta | 27 Jun 21:14 2016

URIException with PhoenixHBaseLoader

Hello All,

I am getting a URISyntaxException when I try to execute my pig script using
PhoenixHBaseLoader. Trace attached below.
Looking through the code, Pig splits multiple paths provided to it based on
comma (','), and during the query parsing step
QueryParserUtils.setHdfsServers(absolutePath, pigContext) tries to split
paths based on comma (',') and create URIs/Paths for the same.

Certain loaders like 'PhoenixHBaseLoader' do not pass hdfs locations and
instead work by passing PhoenixQueryStatement in the location.
e.g.
*A = load 'hbase://query/SELECT ID,NAME,DATE FROM HIRES WHERE DATE >
TO_DATE('1990-12-21 05:55:00.000')*

These locations need not be parsed to get HDFS server paths from them.
Does it make sense to introduce a config/loader property to annotate whether the
loader/store is dealing with hdfs locations and, based on the property, make
the function call to QueryParserUtils.setHdfsServers(absolutePath,
pigContext)?

*Thoughts?*

***** Stack trace *****

Caused by: Failed to parse: Pig script failed to parse:
<line 1, column 23> pig script failed to validate:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
path in absolute URI: CREATED_DATE FROM HIRES WHERE
CREATED_DATE>=TO_DATE('1990-12-21

Rohini Palaniswamy (JIRA) | 27 Jun 20:48 2016

[Commented] (PIG-4937) Pigmix hangs when generating data after rows is set as 625000000 in test/perf/pigmix/conf/config.sh


    [
https://issues.apache.org/jira/browse/PIG-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351594#comment-15351594
] 

Rohini Palaniswamy commented on PIG-4937:
-----------------------------------------

bq. Now it shows that 40 mappers are still running
  That is expected. The datagen takes a long time as it generates complex data structures and the data is all
randomly generated. It usually takes 7-9 hours for me with 90 tasks and a 10 or 20 node cluster. Since it is both
CPU and disk intensive, it is better to increase the number of nodes if you can.

> Pigmix hangs when generating data after rows  is set as 625000000 in  test/perf/pigmix/conf/config.sh
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4937
>                 URL: https://issues.apache.org/jira/browse/PIG-4937
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: pigmix1.PNG, pigmix2.PNG
>
>
> use the default setting in test/perf/pigmix/conf/config.sh, generate data by
> "ant -v -Dharness.hadoop.home=$HADOOP_HOME -Dhadoopversion=23  pigmix-deploy >ant.pigmix.deploy"
> it hangs in the log:
> {code}
>  [exec] Generating mapping file for column d:1:100000:z:5 into hdfs://bdpe41:8020/user/root/tmp/tmp-1056793210/tmp-786100428
>      [exec] processed 99%.

Rohini Palaniswamy (JIRA) | 27 Jun 18:28 2016

[Commented] (PIG-4937) Pigmix hangs when generating data after rows is set as 625000000 in test/perf/pigmix/conf/config.sh


    [
https://issues.apache.org/jira/browse/PIG-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351360#comment-15351360
] 

Rohini Palaniswamy commented on PIG-4937:
-----------------------------------------

How much disk space do you have? The input dataset is 1TB and a couple of tests generate the same amount of data.
There is an option to clean up output right after a test. If you use that you will need at least 3TB of hdfs space
(9TB including replication factor of 3). I would suggest adding more nodes if possible. 3 nodes is too
small and might take 1-2 days to finish running the tests.

For datagen, you don't need mapreduce.task.io.sort.mb set to 1G, so you can reduce that and also reduce
container size to 1G.

> Pigmix hangs when generating data after rows  is set as 625000000 in  test/perf/pigmix/conf/config.sh
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4937
>                 URL: https://issues.apache.org/jira/browse/PIG-4937
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: pigmix1.PNG, pigmix2.PNG
>
>
> use the default setting in test/perf/pigmix/conf/config.sh, generate data by
> "ant -v -Dharness.hadoop.home=$HADOOP_HOME -Dhadoopversion=23  pigmix-deploy >ant.pigmix.deploy"
> it hangs in the log:

liyunzhang_intel (JIRA) | 27 Jun 18:05 2016

[Updated] (PIG-4937) Pigmix hangs when generating data after rows is set as 625000000 in test/perf/pigmix/conf/config.sh


     [
https://issues.apache.org/jira/browse/PIG-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4937:
----------------------------------
    Attachment: pigmix2.PNG

> Pigmix hangs when generating data after rows  is set as 625000000 in  test/perf/pigmix/conf/config.sh
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4937
>                 URL: https://issues.apache.org/jira/browse/PIG-4937
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: pigmix1.PNG, pigmix2.PNG
>
>
> use the default setting in test/perf/pigmix/conf/config.sh, generate data by
> "ant -v -Dharness.hadoop.home=$HADOOP_HOME -Dhadoopversion=23  pigmix-deploy >ant.pigmix.deploy"
> it hangs in the log:
> {code}
>  [exec] Generating mapping file for column d:1:100000:z:5 into hdfs://bdpe41:8020/user/root/tmp/tmp-1056793210/tmp-786100428
>      [exec] processed 99%.
>      [exec] Generating input files into hdfs://bdpe41:8020/user/root/tmp/tmp-1056793210/tmp595036324
>      [exec] Submit hadoop job...
>      [exec] 16/06/25 23:06:32 INFO client.RMProxy: Connecting to ResourceManager at bdpe41/10.239.47.137:8032
>      [exec] 16/06/25 23:06:32 INFO client.RMProxy: Connecting to ResourceManager at bdpe41/10.239.47.137:8032
>      [exec] 16/06/25 23:06:32 INFO mapred.FileInputFormat: Total input paths to process : 90

