Varun Sharma | 31 Jul 21:12 2014

Questions about custom helix rebalancer/controller/agent

Hi,

I am trying to write a customized rebalancing algorithm. I would like to
run the rebalancer every 30 minutes inside a single thread, and I would
also like to completely disable Helix from triggering the rebalancer itself.

I have a few questions:
1) What's the best way to run the custom controller? Can I simply
instantiate a ZKHelixAdmin object and then keep running my rebalancer
inside a thread, or do I need to do something more?

Apart from rebalancing, I want to do other things inside the controller,
so it would be nice if I could simply fire up the controller through code.
I could not find this in the documentation.

2) Same question for the Helix agent. My Helix agent is a JVM process which
does other things apart from exposing the callbacks for state transitions.
Is there a code sample for this?

3) How do I disable Helix-triggered rebalancing once I am able to run the
custom controller?

4) During my custom rebalance run, how can I get the current cluster state?
Is it through ClusterDataCache.getIdealState()?

5) For clients talking to the cluster, does Helix provide an easy
abstraction to find the partition distribution for a Helix resource?
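
For context, what I have in mind for (1) and (2) is roughly the following
(an untested sketch on my side; the ZK address, cluster name, instance names
and MyStateModelFactory are all placeholders):

    // untested sketch - names and addresses below are placeholders
    HelixManager controller = HelixManagerFactory.getZKHelixManager(
        "MY_CLUSTER", "controllerNode", InstanceType.CONTROLLER, "zk1:2181");
    controller.connect();

    // run my own rebalancer every 30 minutes in a single thread
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    exec.scheduleAtFixedRate(new Runnable() {
        public void run() {
            // compute the new assignment and write it back as the ideal state
        }
    }, 0, 30, TimeUnit.MINUTES);

    // the "agent": a participant embedded in my existing JVM process
    HelixManager participant = HelixManagerFactory.getZKHelixManager(
        "MY_CLUSTER", "host1_12000", InstanceType.PARTICIPANT, "zk1:2181");
    participant.getStateMachineEngine().registerStateModelFactory(
        "OnlineOffline", new MyStateModelFactory());
    participant.connect();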

Thanks
Varun

CHU, THEODORA | 31 Jul 20:36 2014

HBase + PySpark

Does anyone know how to read data from OpenTSDB/HBase into PySpark? I’ve searched multiple forums and
can’t seem to find the answer. Thanks!
Parkirat | 31 Jul 17:20 2014

Hbase Mapreduce API - Reduce to a file is not working properly.

Hi All,

I am using the MapReduce API to read an HBase table with a scan operation
in the mapper and write the data to a file in the reducer.
I am using HBase version 0.94.5.23.

*Problem:*
In my job, the mapper outputs both the key and the value as Text, and the
reducer outputs the key as Text and the value as NullWritable, but it seems
the *HBase MapReduce API does not use my reducer* and both key and value end
up in the file as Text.

Moreover, if the same key appears twice it ends up in the file twice, even
though my reducer should write it only once.
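
For reference, this is roughly the kind of job setup I mean - I may well be
missing a step here (class names and paths are placeholders, not my actual
code):

    // sketch only - MyTableMapper and MyReducer are placeholder names
    Job job = new Job(conf, "scan-to-file");
    job.setJarByClass(MyTableMapper.class);

    Scan scan = new Scan();
    TableMapReduceUtil.initTableMapperJob("my_table", scan,
        MyTableMapper.class, Text.class, Text.class, job);

    // as far as I understand, if the reducer class and its output types are
    // not set on the job, the identity reducer runs and the mapper's
    // Text/Text output goes straight to the file
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(1);
    FileOutputFormat.setOutputPath(job, new Path("/out/dir"));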

Could anybody help me with this problem?

Regards,
Parkirat Singh Bagga.


Wilm Schumacher | 31 Jul 17:17 2014

hbase and hadoop (for "normal" hdfs) cluster together?

Hi,

I have a "conceptional" question and would appreciate hints.

My task is to save files to HDFS, maintain some information about them in
an HBase database, and then serve both to the application.

Per file I have around 50 rows with 10 columns (in 2 column families) in
the tables, with string values of around 100 characters each.

The files are of ordinary size (perhaps between a few kB and 100 MB or so).

By this estimate the number of files is much smaller than the number of
rows (times columns), but the disk space for the files is much larger than
the space for HBase. I would further estimate that for every get on a file
there will be around hundreds of row gets on HBase.
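
To put rough numbers on that (my own back-of-the-envelope estimate): 50 rows
x 10 columns x ~100 bytes is on the order of 50 kB of cell data per file,
versus anywhere from a few kB up to 100 MB for the file itself.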

For the files I want to run a Hadoop cluster (obviously). The question now
arises: should I run HBase on the same Hadoop cluster?

The pro of running them together is obvious: I would only have to run one
Hadoop cluster, which would save time, money and nerves.

On the other hand, it wouldn't be possible to tune the cluster specifically
for one task or the other. E.g. if I want to make HBase more "distributed"
by raising the replication factor (to, let's say, 6), I would have to spend
double the amount of disk on the "normal" files, too.


Chandrashekhar Kotekar | 31 Jul 09:25 2014

Re: Could not resolve the DNS name of slave2:60020

No, we are using "hbase-0.90.6-cdh3u6"

Regards,
Chandrash3khar Kotekar
Mobile - +91 8600011455

On Thu, Jul 31, 2014 at 12:49 PM, Qiang Tian <tianq01@...> wrote:

> see https://issues.apache.org/jira/browse/HBASE-3556
> it looks like you are using a very old release? 0.90 perhaps?
>
>
>
> On Thu, Jul 31, 2014 at 2:24 PM, Chandrashekhar Kotekar <
> shekhar.kotekar@...> wrote:
>
> > This is how /etc/hosts file looks like on HBase master node
> >
> > ubuntu@master:~$ cat /etc/hosts
> > 10.78.21.133 master
> > #10.62.126.245 slave1
> > #10.154.133.161 slave1
> > 10.224.115.218 slave1
> > 10.32.213.195 slave2
> >
> > and the code which actually tries to connect is as shown below:
> >
> > hTableInterface=hTablePool.getTable("POINTS_TABLE");
> >
> >

Jianshi Huang | 31 Jul 05:01 2014

Best practice for writing to HFileOutputFormat(2) with multiple Column Families

I need to generate HFiles from a 2TB dataset and explode it into 4 column
families.

The resulting dataset is likely to be 20TB or more. I'm currently using
Spark, so I sort the (rk, cf, cq) tuples myself. It's huge and I'm
considering how to optimize it.

My question is:
should I sort and write each column family one by one, or should I put them
all together and then sort and write?
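
To make it concrete, my assumption is that the HFile writers want the cells
in full KeyValue order, i.e. row first, then family, then qualifier - the
same order as KeyValue.COMPARATOR:

    // sketch of the sort order I produce in Spark, expressed with the HBase comparator
    byte[] row = Bytes.toBytes("rk1");
    KeyValue a = new KeyValue(row, Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("v"));
    KeyValue b = new KeyValue(row, Bytes.toBytes("cf2"), Bytes.toBytes("q1"), Bytes.toBytes("v"));
    int cmp = KeyValue.COMPARATOR.compare(a, b); // same row, cf1 < cf2, so cmp < 0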

Does my question make sense?

--

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
yl wu | 31 Jul 03:13 2014

Can I put all columns into a row key?

Hi all,

I am trying to design the row key for a table. Our application would
perform many queries on columns. My question is: is it a good idea to put
all the column values into the row key? Then I could use filters like
FuzzyRowFilter to select rows and parse the values directly from the row
keys. How would a long row key affect read performance? Would it be worse
than using column filters to fetch data from regular columns?
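
For context, the kind of access I have in mind is roughly the following
(the key layout here is just an example, not our real schema):

    // example composite row key: a 4-byte prefix I don't care about + a fixed "login" suffix
    byte[] rowPattern = Bytes.toBytes("????login");
    byte[] fuzzyMask  = new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0}; // 1 = any byte, 0 = must match
    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(rowPattern, fuzzyMask))));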

Any opinions and experience would be really appreciated. Thank you very
much!

Best,
Yanglin
Thomas Kwan | 31 Jul 02:50 2014

HBase MR Job with 2 OutputFormat classes possible?

Hi there,

I have an HBase MR job that reads data from HDFS, does an HBase Get, and
then does some data transformation. Then I need to put the data back into
HBase as well as write it to an HDFS directory (so I can import it back
into Hive).

The current job creation logic is similar to the following:

    public static Job createHBaseJob(Configuration conf, String []args)
    throws IOException {
        Path inputDir = new Path(args[0]);
        String tableName = args[1];
        String params = args[2];

        Job job = new Job(conf, NAME + "_" + tableName + " " + params);
        job.setJarByClass(MyMap.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMap.class);

        FileInputFormat.setInputPaths(job, inputDir);

        // No reducers.  Just write straight to table.  Call initTableReducerJob
        // to set up the TableOutputFormat.
        TableMapReduceUtil.initTableReducerJob(tableName, null, job);
        job.setNumReduceTasks(0);

        TableMapReduceUtil.addDependencyJars(job);
        return job;
    }
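
What I was considering on top of this (untested, and I have not checked how
well it plays with TableOutputFormat) is a named output via MultipleOutputs,
roughly as follows - "hive" and the output path are placeholders:

    // in createHBaseJob(): declare a file-based named output next to the table output
    MultipleOutputs.addNamedOutput(job, "hive",
        TextOutputFormat.class, NullWritable.class, Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/user/me/hive-export"));

    // in MyMap:
    private MultipleOutputs<ImmutableBytesWritable, Writable> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<ImmutableBytesWritable, Writable>(context);
    }

    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        Put put = transform(line); // placeholder for the Get + transformation
        context.write(new ImmutableBytesWritable(put.getRow()), put); // to the HBase table
        mos.write("hive", NullWritable.get(), line);                  // to the HDFS named output
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }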

Colin Kincaid Williams | 30 Jul 22:41 2014

Hbase / How to Migrate

I'm preparing to migrate an HBase database between clusters, using the same
HBase version, 0.92.1-cdh4.1.3. I found that the document
http://wiki.apache.org/hadoop/Hbase/HowToMigrate has a link to an older
document about the design, http://wiki.apache.org/hadoop/Hbase/Migration .
In that document was the statement:

Sometimes, the amount of on-filesystem data that needs to be changed will
be large so migration will need to run a MR job.

I am planning to migrate a large HBase database in the near future. Will
/bin/hbase migrate work for migrating a large database, or will it require
another method? If so, can anybody point me to a reference or give a
suggestion?
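
If the migrate script is not meant for this case (same version on both
sides), I was also wondering whether copying the tables between the live
clusters is the usual approach, e.g. with CopyTable (the destination
ZooKeeper quorum and table name below are placeholders):

    hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
        --peer.adr=dst-zk1,dst-zk2,dst-zk3:2181:/hbase my_table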

Here is my version information:

# Autogenerated build properties
version=0.92.1-cdh4.1.3
git.hash=6b9dd3937f29975967eedf47fddd0c0f0af845ff
cloudera.hash=6b9dd3937f29975967eedf47fddd0c0f0af845ff
cloudera.base-branch=cdh4-base-0.92.1
cloudera.build-branch=cdh4-0.92.1_4.1.3
cloudera.pkg.version=0.92.1+165
cloudera.pkg.release=1.cdh4.1.3.p0.23
cloudera.pkg.name=hbase
cloudera.cdh.release=cdh4.1.3
cloudera.build.time=2013.01.27-01:03:15GMT
Chandrashekhar Kotekar | 30 Jul 13:34 2014

Could not resolve the DNS name of slave2:60020

I have an HBase cluster on AWS. I have written a few REST services which
are supposed to connect to this HBase cluster and get some data.

My configuration is as below:

   1. Java code, eclipse, tomcat running on my desktop
   2. HBase cluster, Hadoop cluster sitting on AWS
   3. Can connect to HBase cluster, Hadoop cluster ONLY THROUGH VPN

Whenever the web service tries to do ANY operation on HBase, it throws a
"Could not resolve the DNS name of slave2:60020" error with the following
stack trace.

java.lang.IllegalArgumentException: Could not resolve the DNS name of
slave2:60020
    at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
    at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.dataToHServerAddress(RootRegionTracker.java:82)
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(RootRegionTracker.java:73)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:578)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)

 at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:129)
    at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:96)
    at com.shekhar.dao.AdminDAO.getAdminInfoByRowKey(AdminDAO.java:63)
    at com.shekhar.auth.Authorization.isSystemAdmin(Authorization.java:41)
    at com.shekhar.business.ReadingProcessor.getReadingInfo(ReadingProcessor.java:310)
    at com.shekhar.services.ReadingService.getReadings(ReadingService.java:543)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
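
One thing I am not sure about: does the desktop that runs tomcat also need
to be able to resolve the region server hostnames itself, e.g. by adding
the same slave entries to its own hosts file?

    # /etc/hosts on the desktop running eclipse/tomcat (same entries as on the master node)
    10.224.115.218 slave1
    10.32.213.195  slave2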

varshar | 30 Jul 12:34 2014

Calculate number of records in write buffer

Hi,

We are writing a billion records into HBase using multiple clients. Each
client is multithreaded.
Autoflush is set to false and the write buffer size = 12MB. 

The WriteRequestCount metric is incremented by only 1 for one batch insert
and not by the number of records inserted. *Is there any API/method to
calculate the total number of puts in the write buffer before it is
flushed?*

We need to calculate how many writes are actually happening every second.

Another reason is to check whether all the records inserted from the client
side actually make it into the data store without any data loss (every
second).
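
Right now the only fallback I can think of is counting on the client side,
before the buffer is flushed, with something like this rough sketch:

    // rough sketch: count the puts we hand to the client-side write buffer
    private static final AtomicLong putCount = new AtomicLong();

    void write(HTableInterface table, List<Put> batch) throws IOException {
        table.put(batch);                 // buffered client-side because autoflush is off
        putCount.addAndGet(batch.size());
    }

    // a scheduled task reports and resets the counter every second
    ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
    reporter.scheduleAtFixedRate(new Runnable() {
        public void run() {
            System.out.println("puts handed to buffer in the last second: " + putCount.getAndSet(0));
        }
    }, 1, 1, TimeUnit.SECONDS);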

Any suggestions are welcome. 

Thanks,
Varsha


