[jira] [Created] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye (JIRA <jira@...
2014-11-27 01:59:12 GMT
Weichen Ye created HBASE-12590:
Summary: A solution for data skew in HBase-Mapreduce Job
Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Weichen Ye
In production environments, data skew is very common. An HBase table often contains many small
regions and a few large ones. Small regions waste computing resources: a job that scans a table
with 3000 small regions needs 3000 mappers. Large regions stall the job: if one region in a
100-region table is far larger than the other 99, then a job that takes the table as input
finishes 99 mappers quickly and must wait a long time for the last one.
Add two new configuration options:
hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce
jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) sets the target size of MapReduce splits.
If a region is larger than the target size, cut the region into two splits. If the combined size of
several small contiguous regions is less than the target size, combine those regions into one split.
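
The two rules above can be sketched in plain Java, independent of HBase. This is only an illustration of the proposed behaviour, not the patch itself: the class name SplitPlanner, the method plan, and the string labels for splits are hypothetical, and region sizes are assumed to be already known in bytes and listed in key order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed split planning; not HBase code.
public class SplitPlanner {

    /**
     * Plans splits for regions listed in key order.
     * A region larger than targetSize becomes two splits; a run of adjacent
     * regions whose combined size stays below targetSize becomes one split.
     * Labels: "r3" is region 3 alone, "r0-r2" is regions 0..2 combined,
     * "r3[1/2]" is the first half of region 3.
     */
    public static List<String> plan(long[] regionSizes, long targetSize) {
        List<String> splits = new ArrayList<>();
        int i = 0;
        while (i < regionSizes.length) {
            if (regionSizes[i] > targetSize) {
                // Large region: cut it into two splits.
                splits.add("r" + i + "[1/2]");
                splits.add("r" + i + "[2/2]");
                i++;
                continue;
            }
            // Small region: keep absorbing the following regions while the
            // running total stays strictly below the target size.
            int start = i;
            long total = regionSizes[i];
            i++;
            while (i < regionSizes.length && total + regionSizes[i] < targetSize) {
                total += regionSizes[i];
                i++;
            }
            int end = i - 1;
            splits.add(start == end ? "r" + start : "r" + start + "-r" + end);
        }
        return splits;
    }

    public static void main(String[] args) {
        long target = 1073741824L; // value of hbase.mapreduce.split.targetsize (1GB)
        long mb = 1024L * 1024L;
        long[] sizes = {256 * mb, 256 * mb, 256 * mb, 3072 * mb, 512 * mb, 128 * mb};
        System.out.println(plan(sizes, target));
        // prints [r0-r2, r3[1/2], r3[2/2], r4-r5]
    }
}
```

In this example six regions, which would otherwise need six mappers of very uneven duration, are planned as four splits: the three 256MB regions share one mapper, and the 3GB region is spread over two.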