Updated: (PIG-1501) need to investigate the impact of compression on pig performance
Yan Zhou (JIRA) <jira@...>
2010-08-31 23:36:55 GMT
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yan Zhou updated PIG-1501:
--------------------------
Release Note:
This feature reduces the HDFS space used to store the intermediate data generated by PIG and can potentially improve
query execution speed. In general, the more intermediate data a query generates, the greater the storage and speed benefits.
There are no backward compatibility issues as a result of this feature.
Two Java properties control the behavior:
pig.tmpfilecompression, which defaults to false, determines whether the temporary files are compressed. If
true, then
pig.tmpfilecompression.codec specifies which compression codec to use. Currently, PIG accepts only
"gz" and "lzo" as possible values. Because LZO is licensed under the GPL, Hadoop may need to be configured
separately to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.
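As a minimal sketch of the two settings described above, the fragment below enables compression of temporary files with the gz codec. It is written in Java properties syntax. Whether these lines go into a properties file that PIG loads or are passed to the JVM as -D options depends on the installation, so the placement is an assumption here and not something stated in this release note.

```properties
# Turn on compression of PIG's temporary (intermediate) files; the default is false.
pig.tmpfilecompression=true
# Codec to use when compression is on; only "gz" and "lzo" are accepted.
pig.tmpfilecompression.codec=gz
```

Choosing "lzo" instead of "gz" additionally requires the LZO codec to be configured in Hadoop, as noted above.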
An example is the following "test.pig" script:
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;