cem | 21 May 2013 22:43

Cassandra 1.2 TTL histogram problem

Hi all,


Why does Cassandra's single-table compaction skip the keys that are in other sstables? Please correct me if I am wrong.

I also don't understand why we have this line in the worthDroppingTombstones method:

double remainingColumnsRatio = ((double) columns) / (sstable.getEstimatedColumnCount().count() * sstable.getEstimatedColumnCount().mean());

In my case remainingColumnsRatio is always 0 and droppableRatio is 0.9. Cassandra skips all the sstables that have already expired.
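A minimal sketch of the arithmetic around that line, written in Python rather than Java. The final comparison against a threshold, and the 0.2 default, are assumptions recalled from the 1.2 compaction code and are not stated in this post; the function and parameter names are illustrative only.

```python
def worth_dropping_tombstones(remaining_keys, estimated_rows, mean_columns,
                              droppable_ratio, tombstone_threshold=0.2):
    """Rough model of the ratio check quoted above.

    remaining_keys: estimated keys of this sstable that do NOT fall inside
        the token ranges of overlapping sstables (hypothetical input).
    estimated_rows: stands in for getEstimatedColumnCount().count().
    mean_columns:   stands in for getEstimatedColumnCount().mean().
    tombstone_threshold: assumed default of 0.2.
    """
    # Columns expected in the non-overlapping part of the sstable.
    columns = mean_columns * remaining_keys
    remaining_columns_ratio = columns / (estimated_rows * mean_columns)
    # Assumed decision: compact only if enough droppable data is expected
    # outside the overlap.
    worth_it = remaining_columns_ratio * droppable_ratio > tombstone_threshold
    return remaining_columns_ratio, worth_it


# Every key overlaps another sstable, so the remaining ratio is 0.
print(worth_dropping_tombstones(0, 1000, 5.0, 0.9))
# Half of the keys do not overlap.
print(worth_dropping_tombstones(500, 1000, 5.0, 0.9))
```

Under these assumptions, a remaining ratio of 0 makes the product 0 regardless of droppableRatio, which would be consistent with the sstables being skipped.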


Best Regards,
Cem
Hiller, Dean | 21 May 2013 16:23

bootstrapping a new node...

We are running Cassandra 1.2.2 and have added 3 additional nodes to our 6-node cluster (9 in total so
far).  We are trying to add node 10, but during streaming a compaction kicked off, which seemed very odd
to us.  "nodetool netstats" still reported tons of files that had not been transferred yet.  Is it normal for
compaction to kick off while bootstrapping a new node?  Our node still says "Joining" in "nodetool
netstats" as well.  The ring does not show the new node yet either.  Lastly, "nodetool netstats" reports 0%
on EVERY single file and this doesn't seem to change.  The bootstrap node seems hung, so a few questions:

 1.  Is compaction supposed to run on a bootstrapping node?
 2.  I seem to recall a bootstrap setting in cassandra.yaml, but it was not one of the steps in the
datastax docs we followed.  In 1.2.2, is there any setting we need for a bootstrapping node
that we missed?  (Our other nodes joined just fine and seem to be working great.)
 3.  What can I do to get this node to start streaming files again?  Can I just restart Cassandra, or should I
start from scratch somehow?
 4.  If I need to start from scratch, I assume I a) stop the node, b) wipe the commitlog and data directories, c)
start the node back up.  Would that be correct?  After all, the other nodes don't seem to know about this new
node according to the "nodetool ring" command.

Thanks for any help on this one,
Dean

Vladimir Volkov | 21 May 2013 11:26

Cassandra hangs on large hinted handoffs

Hello.

I'm stress-testing our Cassandra (version 1.0.9) cluster, and tried turning off two of the four nodes for half an hour under heavy load. As a result I got a large volume of hints on the live nodes: HintsColumnFamily takes about 1.5 GB of disk space on each of them. It seems these hints are never replayed successfully.

After I bring other nodes back online, tpstats shows active handoffs, but I can't see any writes on the target nodes.
The log indicates memory pressure - the heap is >80% full (heap size is 8GB total, 1GB young).

A fragment of the log:
 INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
 INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
 INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
 INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
 INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
 WARN 18:34:09,805 Heap is 0.7978131149667941 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.
 WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
 INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
 INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
 INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
 INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
 INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
 INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
 INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
 INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
 INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP
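To make the trend in this fragment easier to see, here is a small sketch that extracts the pause time and heap occupancy from the GC lines. The regular expression is based only on the line format shown above; the function name and the two sample lines copied from the fragment are just for illustration.

```python
import re

# Matches lines such as:
#   GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
GC_LINE = re.compile(
    r"GC for (\w+): (\d+) ms for (\d+) collections, (\d+) used; max is (\d+)"
)


def heap_usage(log_lines):
    """Yield (collector, pause_ms, used_fraction) for every GC log line."""
    for line in log_lines:
        match = GC_LINE.search(line)
        if match:
            collector, pause_ms, _count, used, max_heap = match.groups()
            yield collector, int(pause_ms), int(used) / int(max_heap)


sample = [
    " INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552",
    " INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552",
]

for collector, pause_ms, fraction in heap_usage(sample):
    print(f"{collector}: {pause_ms} ms pause, heap {fraction:.1%} full")
```

Run over the whole log, this shows the heap filling up until the long ConcurrentMarkSweep pause, during which the peer was reported dead and then UP again.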

After taking off the load and restarting the service, I still see pending handoffs:
$ nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        1004257         0                 0
RequestResponseStage              0         0          92555         0                 0
MutationStage                     0         0              6         0                 0
ReadRepairStage                   0         0          57773         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         143332         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlusher               0         0              2         0                 0
StreamStage                       0         0              0         0                 0
FlushWriter                       0         0              2         0                 0
MiscStage                         0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     1         3             15         0                 0

These 3 handoffs remain pending for a long time (>12 hours).
Most of the time Cassandra uses 100% of one CPU core, the stack trace of the busy thread is:
"HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList$Itr.remove(ArrayList.java:808)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
        at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Heap usage is also rather high, though the node isn't doing anything, except the HH processing. Here is CMS output:
2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]

Eventually, after a few service restarts, the hints suddenly disappear. Probably, the TTL expires and the hints get compacted away.


Currently my best guess is the following. Hinted handoffs are stored as supercolumns, with one row per target node. The service tries to read them entirely into memory for replay and fails, because the volume is too large to fit in the heap at once.
Then the TTL expires, and the service starts to delete old subcolumns during read. Since the underlying storage is a huge ArrayList, the deletion is inefficient and takes forever.
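The cost difference behind that guess can be illustrated with a short sketch. It uses a Python list as a stand-in for java.util.ArrayList (both are array-backed, so removing an element shifts everything after it); whether this is what actually happens inside removeDeletedSuper is only the hypothesis above, and the helper names are made up for the illustration.

```python
import time


def purge_in_place(columns, is_expired):
    """Remove expired entries one by one.

    Each deletion shifts the tail of the array, so purging most of a large
    list costs roughly O(n^2) element moves.
    """
    i = 0
    while i < len(columns):
        if is_expired(columns[i]):
            del columns[i]
        else:
            i += 1
    return columns


def purge_by_rebuild(columns, is_expired):
    """Copy the survivors into a new list in a single O(n) pass."""
    return [c for c in columns if not is_expired(c)]


def expired(column):
    # Hypothetical rule for the demo: nine out of ten entries have expired.
    return column % 10 != 0


n = 20_000
start = time.perf_counter()
purge_in_place(list(range(n)), expired)
in_place_seconds = time.perf_counter() - start

start = time.perf_counter()
purge_by_rebuild(list(range(n)), expired)
rebuild_seconds = time.perf_counter() - start

print(f"in place: {in_place_seconds:.4f}s, rebuild: {rebuild_seconds:.4f}s")
```

The gap between the two timings grows with the list size, which is the kind of behaviour that would keep one core busy for a very long time on a row holding gigabytes of hints.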

So, it seems there are two problems here.
1) Hints are not paged correctly and cause significant memory pressure - that's actually strange, since the same issue was supposedly addressed in https://issues.apache.org/jira/browse/CASSANDRA-1327 and https://issues.apache.org/jira/browse/CASSANDRA-3624;
2) Deletion of outdated hints doesn't work well for large hint volumes.


Any suggestions on how to make the cluster more tolerant of downtime?

If I turn off the hinted handoff entirely, and manually run a repair after a downtime, will it restore all the data correctly?

--
Best regards, Vladimir

Arthur Zubarev | 20 May 2013 18:17

Unable to start Cassandra



Hello All:

I had Cassandra 1.2.4 running up until yesterday. Then, after I took an Ubuntu update (which may or may not be related), I suddenly wasn't able to connect any more.

The error was related to a lack of memory and suggested a modification to the config. I kept increasing the memory (I have 8 GB), but Cassandra would not start.
I read a post saying it might be OpenJDK related, so I installed Oracle Java 1.7.0_21 and made it the default, but the issue persisted.

I then decided to remove the Cassandra community package and get the DataStax one. Alas, it failed to install, citing an unmet dependency on the 1.2.4 C* package. I could not figure out how to get it, since the only valid option was deb12.

I thus opted to re-download the one from Apache. It installed with no problem, but to my dismay C* would NOT start!

The error I see when running sudo cassandra -f is:

Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Couldn't figure out log4j configuration: log4j-server.properties
    at org.apache.cassandra.service.CassandraDaemon.initLog4j(CassandraDaemon.java:82)
    at org.apache.cassandra.service.CassandraDaemon.<clinit>(CassandraDaemon.java:58)

The posts I found tell me to modify the log4j-server.properties file, but it is not present on my machine. In fact, the directory /etc/cassandra is empty.

I am out of ideas. What is wrong?

Regards,

Arthur
Pinak Pani | 20 May 2013 15:24

In a multiple data center setup, do all the data centers have complete data irrespective of RF?

I just wanted to verify: if I set up a multi-data-center Cassandra cluster, will each data center hold the complete data set?

Say I have two data centers, each with two nodes, and a partitioner that ranges from 0 to 100. The initial tokens are assigned this way:

DC1:N1 = 00
DC2:N1 = 25
DC1:N2 = 50
DC2:N2 = 75

where DCX is data center X and NX is node X. Which one of the following options is true?

Option #1: DC1 and DC2, each will hold complete dataset with keys bucketed as follows
DC1:N1 = (50, 00] => 50 keys
DC1:N2 = (00, 50] => 50 keys
----
Complete data set mirrored at DC1

DC2:N1 = (75, 25] => 50 keys
DC2:N2 = (25, 75] => 50 keys
----
Complete data set mirrored at DC2

Option #2: DC1 and DC2, each will hold 50% of the data with keys bucketed as follows (much the same way in a single C setup)
DC1:N1 = (75, 00] => 25 keys
DC2:N1 = (00, 25] => 25 keys
DC1:N2 = (25, 50] => 25 keys
DC2:N2 = (50, 75] => 25 keys
----
data is divided into the two data centers.
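The two bucketings can be reproduced with a short sketch. It only derives the ranges described in the two options from the tokens listed above; it does not say which one Cassandra applies, since that depends on the replication strategy and per-data-center replication factor, which the question does not state. The function names and the ring size of 100 come from this example, not from Cassandra.

```python
RING_SIZE = 100  # the hypothetical 0..100 partitioner from the question


def owned_ranges(tokens):
    """Map each node to the (start, end] range it owns on a ring.

    tokens: dict of node name -> token. A node owns everything after the
    previous token on the ring, up to and including its own token.
    """
    ordered = sorted(tokens.items(), key=lambda item: item[1])
    ranges = {}
    for i, (node, token) in enumerate(ordered):
        previous_token = ordered[i - 1][1]  # index -1 wraps around the ring
        ranges[node] = (previous_token, token)
    return ranges


def range_size(start, end):
    """Number of keys in (start, end], allowing for wrap-around."""
    return (end - start) % RING_SIZE or RING_SIZE


tokens = {"DC1:N1": 0, "DC2:N1": 25, "DC1:N2": 50, "DC2:N2": 75}

# Option #1: each data center is treated as its own ring.
option1 = {}
for dc in ("DC1", "DC2"):
    dc_tokens = {n: t for n, t in tokens.items() if n.startswith(dc)}
    option1.update(owned_ranges(dc_tokens))

# Option #2: all four nodes share a single ring.
option2 = owned_ranges(tokens)

for label, ranges in (("Option #1", option1), ("Option #2", option2)):
    print(label)
    for node in sorted(ranges):
        start, end = ranges[node]
        print(f"  {node} = ({start:02d}, {end:02d}] => {range_size(start, end)} keys")
```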

Thanks,
PP
Michael Theroux | 19 May 2013 17:30

Repair of tombstones

There has been a lot of discussion on the list recently concerning issues with repair, runtime, etc.

We have recently had issues with this Cassandra bug:

	https://issues.apache.org/jira/browse/CASSANDRA-4905

Basically, if you do regular staggered repairs, and you have tombstones that can be gc_graced, those
tombstones may never be cleaned up if those tombstones don't get compacted away before the next repair. 
This is because these tombstones are essentially recopied to other nodes during the next repair.  This has
been fixed in 1.2, however, we aren't ready to make the jump to 1.2 yet.

Is there a reason why this hasn't been back-ported to 1.1?  Is it a risky change? Although not a silver bullet,
it seems it may help a lot of people with repair issues (it certainly seems it would help us).

-Mike

Sylvain Lebresne | 19 May 2013 16:56

[RELEASE] Apache Cassandra 1.2.5 released


The Cassandra team is pleased to announce the release of Apache Cassandra
version 1.2.5.

Cassandra is a highly scalable second-generation distributed database,
bringing together Dynamo's fully distributed design and Bigtable's
ColumnFamily-based data model. You can read more here:


Downloads of source and binary distributions are listed in our download
section:


This version is a maintenance/bug fix release[1] on the 1.2 series. As always,
please pay attention to the release notes[2] and let us know[3] if you
encounter any problems.

Enjoy!

[1]: http://goo.gl/7quKk (CHANGES.txt)
[2]: http://goo.gl/6KxVq (NEWS.txt)
Kais Ahmed | 18 May 2013 15:28

Cassandra read repair

Hi all,

I have encountered a consistency problem on some keys, using phpcassa and Cassandra 1.2.3, since a server crash.

Only some keys of one CF are corrupt.

I launched a nodetool repair, which completed successfully but did not correct the issue.


When I try to get a corrupt key with:

CL ONE, the result contains 7 or 8 or 9 columns

CL QUORUM, result contains 8 or 9 columns

CL ALL, the data is consistent and returns always 9 columns


I thought that reading with CL ALL would correct the problem through read repair, but after returning to CL QUORUM the problem persists.
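One way to picture these observations is the toy model below. It assumes a replication factor of 3 and three replicas that hold 7, 8 and 9 of the columns respectively; neither is stated in this message. It also merges replica answers as a plain set union, ignoring timestamps and tombstones, so it is an illustration and not a diagnosis.

```python
from itertools import combinations

# Hypothetical replica contents: the full row has 9 columns, and two of the
# three replicas are missing some of them.
all_columns = {f"c{i}" for i in range(1, 10)}
replicas = {
    "A": all_columns - {"c8", "c9"},  # 7 columns
    "B": all_columns - {"c9"},        # 8 columns
    "C": set(all_columns),            # 9 columns
}


def possible_column_counts(replicas, contacted):
    """Column counts a read can return when `contacted` replicas answer.

    Every combination of replicas is tried, and their answers are merged
    with a simple union (a simplification of timestamp reconciliation).
    """
    counts = set()
    for group in combinations(replicas.values(), contacted):
        counts.add(len(set().union(*group)))
    return sorted(counts)


for level, contacted in (("ONE", 1), ("QUORUM", 2), ("ALL", 3)):
    print(f"CL {level}: {possible_column_counts(replicas, contacted)}")
```

Under these assumptions the model gives 7, 8 or 9 columns at CL ONE, 8 or 9 at CL QUORUM, and always 9 at CL ALL, the same pattern as reported above.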


Thank you for your help

Brian O'Neill | 17 May 2013 20:39

[BLOG] : Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine


FWIW, we were able to integrate Druid and Cassandra.  

It's only a PoC right now, but it seems like a powerful combination:
http://brianoneill.blogspot.com/2013/05/cassandra-as-deep-storage-mechanism-for.html

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
Tomàs Núnez | 17 May 2013 18:00

Logging Cassandra queries

Hi!

For quite some time I've been seeing unexpected load averages on the Cassandra servers. I suspect there are lots of uncontrolled queries to the Cassandra servers causing this load, but the developers say that there are none, and that the load is due to Cassandra's internal processes.

Trying to get to the bottom of this, I've been looking at the completed ReadStage and MutationStage counts through JMX, and the numbers seem to confirm my theory. But I'd like to go one step further and, if possible, list all the queries from the webservers to the Cassandra cluster (just one node would be enough).

I've been playing with Cassandra log levels, and I can see when a read or a write is done, but it would be better if I could know the CF of the query. For my tests I've put "log4j.rootLogger=DEBUG,stdout,R" in log4j-server.properties and then written to and read from a test CF, and I can't see its name anywhere.

For the test I'm using Cassandra 0.8.4 (yes, still), as on my production servers, and also 1.0.11. Maybe this changes in 1.1? Maybe I'm doing something wrong? Any hints?

And... could I be more precise when enabling logging? Right now, with log4j.rootLogger=DEBUG,stdout,R, I'm getting a lot of information I will never use, and I'd like to enable just what I need to see gets and sets.

Thanks in advance, 
Tomàs

Apostolis Xekoukoulotakis | 17 May 2013 16:42

C language - cassandra

Hello, I'm new here. What are my options for using Cassandra from a program written in C?

A)
Thrift has no documentation, so it will take me time to understand.
Thrift also doesn't have a balancing connection pool that asks a different node every time, which is a big problem.

B)
Should I use the Hector (Java) client and then send the data to my program with my own protocol?
That seems like a lot of unnecessary work.

Any other suggestions?


--

Sincerely yours, Apostolis Xekoukoulotakis
