Varun Sharma | 31 Jul 21:12 2014

Questions about custom helix rebalancer/controller/agent


I am trying to write a customized rebalancing algorithm. I would like to
run the rebalancer every 30 minutes inside a single thread. I would also
like to completely disable Helix triggering the rebalancer.

I have a few questions:
1) What's the best way to run the custom controller? Can I simply
instantiate a ZKHelixAdmin object and then keep running my rebalancer
inside a thread, or do I need to do something more?

Apart from rebalancing, I want to do other things inside the
controller, so it would be nice if I could simply fire up the controller
through code. I could not find this in the documentation.

2) Same question for the Helix agent. My Helix agent is a JVM process which
does other things apart from exposing the callbacks for state transitions.
Is there a code sample for this?

3) How do I disable Helix-triggered rebalancing once I am able to run the
custom controller?

4) During my custom rebalance run, how can I get the current cluster state
- is it through ClusterDataCache.getIdealState()?

5) For clients talking to the cluster, does Helix provide an easy
abstraction to find the partition distribution for a Helix resource?
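On question 1, the periodic single-threaded driver itself is plain Java; a minimal sketch follows. The Helix-specific wiring is left in comments because the exact calls (e.g. connecting a HelixManager of type CONTROLLER, or HelixControllerMain.startHelixController) depend on the Helix version and are assumptions here, not verified API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: one thread that runs a custom rebalance pass every 30 minutes.
// All Helix-specific calls are shown only as comments and are assumptions.
public class PeriodicRebalancer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(long period, TimeUnit unit) {
        // Before scheduling, you would connect to the cluster, e.g. (assumed):
        //   HelixManager manager = HelixManagerFactory.getZKHelixManager(
        //       clusterName, instanceName, InstanceType.CONTROLLER, zkAddr);
        //   manager.connect();
        scheduler.scheduleAtFixedRate(this::rebalanceOnce, 0, period, unit);
    }

    // Override with the real logic: read the current cluster state, compute
    // the new assignment, and write it back (e.g. via
    // ZKHelixAdmin.setResourceIdealState).
    protected void rebalanceOnce() {
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
// Usage: new PeriodicRebalancer().start(30, TimeUnit.MINUTES);
```

On question 3, as I recall Helix resources have a CUSTOMIZED rebalance mode in which the built-in rebalancer leaves your assignment map alone, but verify that mode name against your Helix version.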


CHU, THEODORA | 31 Jul 20:36 2014

HBase + PySpark

Does anyone know how to read data from OpenTSDB/HBase into PySpark? I’ve searched multiple forums and
can’t seem to find the answer. Thanks!
Parkirat | 31 Jul 17:20 2014

HBase MapReduce API - Reduce to a file is not working properly.

Hi All,

I am using the MapReduce API to read an HBase table, based on a scan
operation in the mapper, and to put the data into a file in the reducer.
I am using Hbase Version "Version".

Now in my job, my mapper outputs the key as Text and the value as Text,
and my reducer outputs the key as Text and the value as NullWritable, but
it seems the *HBase MapReduce API doesn't consider the reducer*, and outputs both key and value as

Moreover, if the same key comes twice, it goes to the file twice, even if my
reducer wants to write it only once.

Could anybody help me with this problem?
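For what it's worth, both symptoms (reducer apparently ignored, duplicate keys reaching the file) commonly mean the reducer class was never wired into the job, so the identity reducer runs. A hedged sketch of a job setup that reduces to a file follows; the table name, class names, and output path are placeholders, and the exact calls (e.g. Job.getInstance) should be checked against your Hadoop/HBase versions:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ScanToFile {

    public static class MyMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(Bytes.toString(row.get())), new Text("seen"));
        }
    }

    // Emits each distinct key exactly once, however many values arrived.
    public static class MyDedupReducer
            extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, NullWritable.get());
        }
    }

    public static Job createJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "scan-to-file");
        job.setJarByClass(ScanToFile.class);

        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob("myTable", scan,
                MyMapper.class, Text.class, Text.class, job);

        // Without this line the identity reducer runs and every mapper
        // record, duplicates included, reaches the file unchanged.
        job.setReducerClass(MyDedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Reduce to a file: use a file output format, and do NOT call
        // initTableReducerJob, which installs its own table-writing reducer.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/dir"));
        return job;
    }
}
```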

Parkirat Singh Bagga.

View this message in context:
Sent from the HBase User mailing list archive at

Wilm Schumacher | 31 Jul 17:17 2014

hbase and hadoop (for "normal" hdfs) cluster together?


I have a "conceptual" question and would appreciate hints.

My task is to save files to HDFS, to maintain some information about
them in an HBase database, and then to serve both to the application.

Per file I have around 50 rows with 10 columns (in 2 column families) in
the tables, with string values around 100 characters long.

The files are of ordinary size (perhaps a few kB up to 100 MB or so).

By this estimate the number of files is far smaller than the number of
rows (times columns), but the files take far more disk space than the
HBase data does. I would further estimate that for every get on a file
there will be on the order of hundreds of getRows on HBase.

For the files I want to run a Hadoop cluster (obviously). The question
now arises: should I run HBase on the same Hadoop cluster?

The pro of running them together is obvious: I would only have to run one
Hadoop cluster, which would save time, money and nerves.

On the other hand, it wouldn't be possible to make special adjustments
to optimize the cluster for one task or the other. E.g. if I want to
make HBase more "distributed" by raising the replication factor (to,
say, 6), I would have to use double the amount of disk for the
"normal" files, too.


Chandrashekhar Kotekar | 31 Jul 09:25 2014

Re: Could not resolve the DNS name of slave2:60020

No, we are using "hbase-0.90.6-cdh3u6"

Chandrash3khar Kotekar
Mobile - +91 8600011455

On Thu, Jul 31, 2014 at 12:49 PM, Qiang Tian <tianq01@...> wrote:

> see
> it looks you are using a very old release? 0.90 perhaps?
> On Thu, Jul 31, 2014 at 2:24 PM, Chandrashekhar Kotekar <
> shekhar.kotekar@...> wrote:
> > This is how /etc/hosts file looks like on HBase master node
> >
> > ubuntu <at> master:~$ cat /etc/hosts
> > master
> > # slave1
> > # slave1
> > slave1
> > slave2
> >
> > and the code which actually tries to connect is as shown below:
> >
> > hTableInterface=hTablePool.getTable("POINTS_TABLE");
> >
> >

Jianshi Huang | 31 Jul 05:01 2014

Best practice for writing to HFileOutputFormat(2) with multiple Column Families

I need to generate output from a 2TB dataset and explode it into 4 column families.

The resulting dataset is likely to be 20TB or more. I'm currently using Spark,
so I sorted the (rk, cf, cq) tuples myself. It's huge, and I'm considering how
to optimize it.

My question is:
Should I sort and write each column family one by one, or should I put them
all together then do sort and write?

Does my question make sense?
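For what it's worth, whichever way you batch the work, the sink's ordering requirement is the same: each output file needs its cells in (row key, column family, column qualifier) order, compared lexicographically as unsigned bytes. A pure-Java sketch of that comparator follows; the Cell class here is a stand-in for illustration, not the HBase type:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the sort order HFile writing expects: cells ordered by
// (row key, family, qualifier), each field compared as unsigned bytes.
public class CellSort {
    public static final class Cell {
        public final byte[] rk, cf, cq;
        public Cell(byte[] rk, byte[] cf, byte[] cq) {
            this.rk = rk; this.cf = cf; this.cq = cq;
        }
    }

    // Lexicographic comparison of unsigned byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void sort(List<Cell> cells) {
        cells.sort((x, y) -> {
            int c = compareBytes(x.rk, y.rk);
            if (c != 0) return c;
            c = compareBytes(x.cf, y.cf);
            if (c != 0) return c;
            return compareBytes(x.cq, y.cq);
        });
    }
}
```

So sorting the families separately and sorting everything together can both produce valid output, as long as whatever ends up in one file respects this order; the choice is then mostly about shuffle size and memory in Spark.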


Jianshi Huang

LinkedIn: jianshi
Twitter:  <at> jshuang
Github & Blog:
yl wu | 31 Jul 03:13 2014

Can I put all columns into a row key?

Hi all,

I am trying to design the row key for a table. Our application would
perform many queries on columns. My question is: is it a good idea to
put all column values into the row key? Then I could use filters like
FuzzyRowFilter to get rows and parse the values directly from the row keys.
How would this affect read performance with a long row key? Would it be
worse than using column filters to fetch data from regular columns?
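If the values do go into the key, fixed-width encoding keeps the byte offsets stable, which is what FuzzyRowFilter-style fixed-position matching relies on. A minimal pure-Java sketch follows; the field width and zero-padding scheme are arbitrary choices for illustration, not an HBase API:

```java
import java.nio.charset.StandardCharsets;

// Sketch: pack several column values into one row key as fixed-width,
// left-padded fields, so every field sits at a known byte offset.
public class CompositeKey {
    static final int FIELD_WIDTH = 8;

    // Left-pad each value to FIELD_WIDTH and concatenate. Values longer
    // than the field width would need truncation or hashing in a real design.
    public static byte[] build(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            for (int i = f.length(); i < FIELD_WIDTH; i++) sb.append('0');
            sb.append(f);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Parse a field back out of its fixed offset, as a reader would do
    // instead of fetching the column.
    public static String field(byte[] rowKey, int index) {
        String s = new String(rowKey, StandardCharsets.UTF_8);
        return s.substring(index * FIELD_WIDTH, (index + 1) * FIELD_WIDTH);
    }
}
```

The usual trade-off: long keys are repeated in every cell and in the block index, so they inflate storage and cache footprint, but they can turn column-filter scans into much cheaper row-key scans when the query pattern matches the key layout.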

Any opinion and experience would be really appreciated. Thank you very much

Thomas Kwan | 31 Jul 02:50 2014

Hbase MR Job with 2 OutputForm classes possible?

Hi there,

I have an HBase MR job that reads data from HDFS, does an HBase Get, and then
some data transformation. Then I need to put the data back into HBase as
well as write data to an HDFS file directory (so I can import it back into
The current job creation logic is similar to the following:

    public static Job createHBaseJob(Configuration conf, String []args)
    throws IOException {
        Path inputDir = new Path(args[0]);
        String tableName = args[1];
        String params = args[2];

        Job job = new Job(conf, NAME + "_" + tableName + " " + params);

        FileInputFormat.setInputPaths(job, inputDir);

        // No reducers.  Just write straight to the table.  Call
        // initTableReducerJob to set up the TableOutputFormat.
        TableMapReduceUtil.initTableReducerJob(tableName, null, job);

        return job;
    }
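One common pattern for a single job that writes to both HBase and an HDFS directory is MultipleOutputs layered over the table output: pass a reducer like the sketch below to initTableReducerJob instead of null, add a named file output in the job setup, and set a base output path. This is a hedged sketch; the named-output name "hdfs", the classes, the column-add call, and the extra path argument are placeholders to check against your Hadoop/HBase versions:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Job setup additions (alongside the existing initTableReducerJob call):
//   MultipleOutputs.addNamedOutput(job, "hdfs", TextOutputFormat.class,
//           Text.class, Text.class);
//   FileOutputFormat.setOutputPath(job, new Path(args[3])); // extra arg, assumed
public class DualWriteReducer
        extends Reducer<Text, Text, ImmutableBytesWritable, Put> {

    private MultipleOutputs<ImmutableBytesWritable, Put> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        byte[] row = Bytes.toBytes(key.toString());
        for (Text value : values) {
            Put put = new Put(row);
            // ... add columns to the Put from value ...
            context.write(new ImmutableBytesWritable(row), put); // -> HBase table
            mos.write("hdfs", key, value);                       // -> HDFS files
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // important: flushes the side files
    }
}
```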

Colin Kincaid Williams | 30 Jul 22:41 2014

Hbase / How to Migrate

I'm preparing to migrate an HBase database between clusters, both using the
same HBase version, 0.92.1-cdh4.1.3. I found the document had a link to an older
document about the design.
In that document was the statement:

Sometimes, the amount of on-filesystem data that needs to be changed will
be large so migration will need to run a MR job.

I am planning on migrating a large HBase database in the near future. Will
/bin/hbase migrate work for migrating a large database, or will it
require another method? If so, can anybody point me to a reference or give a

Here is my version information:

# Autogenerated build properties
Chandrashekhar Kotekar | 30 Jul 13:34 2014

Could not resolve the DNS name of slave2:60020

I have an HBase cluster on AWS. I have written a few REST services which are
supposed to connect to this HBase cluster and get some data.

My configuration is as below:

   1. Java code, eclipse, tomcat running on my desktop
   2. HBase cluster, Hadoop cluster sitting on AWS
   3. Can connect to HBase cluster, Hadoop cluster ONLY THROUGH VPN

Whenever the web service tries to do ANY operation on HBase, it throws a "Could
not resolve the DNS name of slave2:60020" error with the following stack trace.

java.lang.IllegalArgumentException: Could not resolve the DNS name of
    at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(
    at org.apache.hadoop.hbase.HServerAddress.<init>(
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.dataToHServerAddress(
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(

 at org.apache.hadoop.hbase.client.HTablePool.createHTable(
    at org.apache.hadoop.hbase.client.HTablePool.getTable(
    at com.shekhar.dao.AdminDAO.getAdminInfoByRowKey(
    at com.shekhar.auth.Authorization.isSystemAdmin(
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

varshar | 30 Jul 12:34 2014

Calculate number of records in write buffer


We are writing a billion records into HBase using multiple clients. Each
client is multithreaded.
Autoflush is set to false and the write buffer size is 12 MB.

The WriteRequestCount metric is incremented by only 1 for each batch insert,
not by the number of records inserted. Is there any API/method to
calculate the total number of puts in the write buffer before it is flushed?

We need to calculate how many writes are actually happening every second.

Another reason is to see whether all the records inserted from the client
side are actually inserted into the data store without any data loss (at
every second).
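Since the server-side WriteRequestCount treats one flushed batch as a single request, one workaround is to count individual puts on the client before they enter the write buffer. A minimal sketch follows; the wrapper shape is an assumption, and the actual HBase call is shown only as a comment:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a thread-safe client-side counter wrapped around the write path.
// Every worker thread calls put(...) here instead of the table directly.
public class CountingWriter {
    private final AtomicLong puts = new AtomicLong();

    public void put(Object put) {
        puts.incrementAndGet();
        // table.put(put);  // the real write, omitted in this sketch
    }

    // Sample this once per second to derive puts/sec, and compare the final
    // value against a post-load row count to check for data loss.
    public long total() {
        return puts.get();
    }
}
```

Sampling total() at a fixed interval and differencing gives the per-second write rate on each client; summing the final totals across clients gives the number you can reconcile against the table's row count.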

Any suggestions are welcome. 

