Alain RODRIGUEZ | 22 Jul 15:56 2014

JSON to Cassandra ?

Hi guys, I know this topic has already been discussed many times, and I have read a lot of those discussions.

Yet, I have not been able to find a good way to do what I want.

We are receiving messages from our app that are complex, dynamic, nested JSON documents (from a few to thousands of attributes). The JSON is variable and can contain nested arrays or sub-JSONs.

Please, consider this example:

JSON

{
    "struct-id": 141241321,
    "nested-1-1": {
        "value-1-1-1": "36d1f74d-1663-418d-8b1b-665bbb2d9ecb",
        "value-1-1-2": 5,
        "value-1-1-3": 0.5,
        "value-1-1-4": ["foo", "bar", "foobar"],
        "nested-2-1": {
            "test-2-1-1": "whatever",
            "test-2-1-2": 42
        }
    },
    "nested-1-2": {
        "value-1-2-1": [{
            "id": 1,
            "deeply-nested": {
                "data-1": "test",
                "data-2": 4023
            }
        },
        {
            "id": 2,
            "data-3": "that's enough data"
        }]
    }
}

We would like to store those messages in Cassandra and then run Spark jobs over them. Storing each message as text (the full JSON in one column) would work, but it wouldn't be efficient: to count how many times "value-1-1-3" is greater than or equal to 1, I would have to read and parse every full JSON document first. I have read a lot about people using composite columns and dynamic composite columns, but found no precise example. I am also aware of collections support, yet nested collections are not currently supported.

I would like to have:

- 1 column per attribute
- typed values
- something that would be able to parse and store any valid JSON (with nested arrays of JSON or whatever).
- the most efficient model to use alongside Spark to query anything inside.

What would be the possible CQL schemas to create such a data structure?

What are the drawbacks of the following schema?

Cassandra

CREATE TABLE "test-schema" (
    "struct-id" int,
    "nested-1-1#value-1-1-1" text,
    "nested-1-1#value-1-1-2" int,
    "nested-1-1#value-1-1-3" float,
    "nested-1-1#value-1-1-4#array0" text,
    "nested-1-1#value-1-1-4#array1" text,
    "nested-1-1#value-1-1-4#array2" text,
    "nested-1-1#nested-2-1#test-2-1-1" text,
    "nested-1-1#nested-2-1#test-2-1-2" int,
    "nested-1-2#value-1-2-1#array0#id" int,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
    "nested-1-2#value-1-2-1#array1#id" int,
    "nested-1-2#value-1-2-1#array1#data-3" text,
    PRIMARY KEY ("struct-id")
);
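
For illustration, here is a minimal sketch of the flattening that would produce such column names, assuming Jackson for JSON parsing (the class and helper names are mine, purely hypothetical):

Java

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: flatten a nested JSON document into "#"-separated column names
// matching the schema above. Array elements become array0, array1, ...
public class JsonFlattener {

    public static Map<String, JsonNode> flatten(String json) throws Exception {
        Map<String, JsonNode> out = new LinkedHashMap<String, JsonNode>();
        walk("", new ObjectMapper().readTree(json), out);
        return out;
    }

    private static void walk(String prefix, JsonNode node, Map<String, JsonNode> out) {
        if (node.isObject()) {
            Iterator<Map.Entry<String, JsonNode>> it = node.fields();
            while (it.hasNext()) {
                Map.Entry<String, JsonNode> e = it.next();
                walk(prefix.isEmpty() ? e.getKey() : prefix + "#" + e.getKey(),
                        e.getValue(), out);
            }
        } else if (node.isArray()) {
            for (int i = 0; i < node.size(); i++) {
                walk(prefix + "#array" + i, node.get(i), out);
            }
        } else {
            out.put(prefix, node); // scalar leaf: one column per attribute
        }
    }
}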

I could use:

    "nested-1-1#value-1-1-4" list<text>,

instead of:

    "nested-1-1#value-1-1-4#array0" text,
    "nested-1-1#value-1-1-4#array1" text,
    "nested-1-1#value-1-1-4#array2" text,

yet it wouldn't work here:

    "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
    "nested-1-2#value-1-2-1#array1#id" int,
    "nested-1-2#value-1-2-1#array1#data-3" text,

since this is a nested structure inside the list.



To create this schema, could we imagine that the app logging the data tries to write to the corresponding column for each JSON attribute and, if the column is missing, catches the error, creates the column, and retries the write?

This exception would happen only once for each new field, and would modify the schema.
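
A minimal sketch of that retry loop, assuming the DataStax Java driver (the class name, and the assumption that a missing column surfaces as InvalidQueryException, are mine):

Java

import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.InvalidQueryException;

// Sketch of the write-catch-alter-retry idea above; assumes the DataStax
// Java driver and the "test-schema" table from this thread.
public class SchemaEvolvingWriter {
    private final Session session;

    public SchemaEvolvingWriter(Session session) {
        this.session = session;
    }

    public void writeAttribute(int structId, String column, String cqlType, Object value) {
        String insert = String.format(
                "INSERT INTO \"test-schema\" (\"struct-id\", \"%s\") VALUES (?, ?)", column);
        try {
            session.execute(insert, structId, value);
        } catch (InvalidQueryException e) {
            // Assumption: the failure means the column does not exist yet.
            // Add it, then retry the write once.
            session.execute(String.format(
                    "ALTER TABLE \"test-schema\" ADD \"%s\" %s", column, cqlType));
            session.execute(insert, structId, value);
        }
    }
}

One caveat with this approach: if many writers hit a new field at the same time, several of them may issue the same ALTER TABLE concurrently, so schema changes would probably need to be serialized somewhere.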

Any thoughts that would help us (and probably more people)?

Alain
M.Tarkeshwar Rao | 22 Jul 15:45 2014

I want either all the DML statements within the batch to succeed or all to roll back. Is it possible?

Hi all, 

In the Cassandra user guide I found information about batches for atomic DML operations.
I want either all the DML statements within the batch to succeed, or all of them to roll back.

Is that possible?
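
For reference, a sketch of the logged ("atomic") batch construct in question, using the DataStax Java driver; the accounts table is hypothetical:

Java

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch of a logged batch. Cassandra guarantees that either all mutations
// in a logged batch are eventually applied or none are, but this is not a
// transaction: there is no isolation and no rollback on application errors.
public class LoggedBatchExample {
    public static void writeBoth(Session session) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(new SimpleStatement("UPDATE accounts SET balance = 90 WHERE id = 1"));
        batch.add(new SimpleStatement("UPDATE accounts SET balance = 110 WHERE id = 2"));
        session.execute(batch);
    }
}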

Another question: in my case, can I use joins in Cassandra, or is there any other way to achieve them?


Regards
Tarkeshwar

Error: AssertionError => firstTokenIndex(TokenMetadata.java:845)

hi all,

I am trying to add a node to a Cassandra ring with only one seed node. I have the seed in EC2, and I get this error when I start Cassandra on the other node:



----



ERROR [Thrift:389] 2014-07-22 08:25:39,838 CassandraDaemon.java (line 191) Exception in thread Thread[Thrift:389,5,main]
java.lang.AssertionError
at org.apache.cassandra.locator.TokenMetadata.firstTokenIndex(TokenMetadata.java:845)
at org.apache.cassandra.locator.TokenMetadata.firstToken(TokenMetadata.java:859)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:106)
at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2681)
at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:376)
at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:191)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:866)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:849)
at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:749)
at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3690)
at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3678)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
(The same AssertionError and stack trace repeat for threads Thrift:390 at 08:25:41 and Thrift:391 at 08:25:44.)



------


I made an AMI from the original seed Cassandra EC2 instance, deleted all the data, set the listen address in cassandra.yaml to the new node's IP, and launched the instance. Cassandra on the new node starts and begins receiving data from the other instance, but the original seed gets the error above and the process of adding the instance to the ring stops.



The issue looks like this other one:


Any ideas?

I am using Cassandra version 1.2.15 and endpoint_snitch: Ec2Snitch.




Thanks in advance, and regards
Jeremy Jongsma | 22 Jul 00:05 2014

Authentication exception

I routinely get this exception from cqlsh on one of my clusters:

cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why='org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.')

The system_auth keyspace is set to replicate X times given X nodes in each datacenter, and at the time of the exception all nodes are reporting as online and healthy. After a short period (around 30 minutes), it lets me in again.
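
For context, the kind of replication setup described above can be expressed like this sketch (datacenter names and replication counts are hypothetical):

Java

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch: replicate system_auth to every node in each datacenter.
// The DC names and replication counts here are hypothetical.
public class SystemAuthReplication {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("ALTER KEYSPACE system_auth WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");
        cluster.close();
    }
}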

What could be the cause of this?
John Sanda | 21 Jul 17:44 2014

When will a node's host ID change?

Under what circumstances, if any, will a node's host ID change?

- John
Marcelo Elias Del Valle | 21 Jul 17:24 2014

map reduce for Cassandra

Hi, 

I need to execute a map/reduce job to identify data stored in Cassandra before indexing it into Elasticsearch.

I have already used ColumnFamilyInputFormat (before I started using CQL) to write Hadoop jobs for that, but I used to have a lot of trouble with tuning, as Hadoop depends on how map tasks are split in order to successfully execute things in parallel for IO-bound processes.

First question: am I the only one having problems with that? Is anyone else using Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw the new version of Spark will have Cassandra support, but through CqlPagingInputFormat, from Hadoop. I tried to use Hive with Cassandra Community, but it seems it only works with Cassandra Enterprise and doesn't do more than Facebook's Presto (http://prestodb.io/), which we have been using to read from Cassandra; so far it has been great for SQL-like queries. For custom map/reduce jobs, however, it is not enough.

Does anyone know of another tool that performs MR on Cassandra? My impression is that most tools were created to work on top of HDFS, and reading from a NoSQL DB is some kind of "workaround".

Third question is about how these tools work. Most of them write mapped data to an intermediate storage, then the data is shuffled and sorted, then reduced. Even when using CqlPagingInputFormat, if you are using Hadoop it will write files to HDFS after the mapping phase, shuffle and sort this data, and then reduce it.

I wonder if a tool supporting Cassandra out of the box wouldn't be smarter. Is it faster to write all your data to a file and then sort it, or to batch-insert the data and index it as you go, as happens when you store data in a Cassandra CF? I didn't do the calculations to compare the complexity of each approach; one should consider that an index in Cassandra would be really large, as the maximum index size will always depend on the capacity of a single host. But my guess is that a map/reduce tool written specifically for Cassandra from the beginning could perform much better than a tool written for HDFS and adapted. I hear people saying map/reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really make sense? Should we expect a result like this?

Final question: do you think writing a new M/R tool like the one described would be reinventing the wheel? Or does it make sense?

Thanks in advance. Any opinions about this subject will be very appreciated.

Best regards,
Marcelo Valle.
jcllings | 21 Jul 01:17 2014

Which way to Cassandraville?

So I'm a Java application developer and I'm trying to find entry points
for learning to work with Cassandra.
I just finished reading "Cassandra: The Definitive Guide", which seems
pretty out of date and, while very informative as to the technology
Cassandra uses, was not very helpful from the perspective of an
application developer.

Having said that, which Java clients should I be looking at? Are there
any reasonably mature POJO-mapping techs for Cassandra analogous to
Hibernate? I can't say that I'm looking forward to yet another *QL
variant, but I guess CQL is going to be a necessity. What GUI tools,
if any, are available for working with Cassandra and for data modelling?
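
(A minimal connect-and-query sketch with the DataStax Java driver, one commonly mentioned CQL client; the contact point is a placeholder:)

Java

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Minimal sketch: connect and run one CQL query with the DataStax Java
// driver. The contact point is a placeholder.
public class CassandraQuickStart {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        ResultSet rs = session.execute("SELECT release_version FROM system.local");
        for (Row row : rs) {
            System.out.println(row.getString("release_version"));
        }
        cluster.close();
    }
}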

Jim C.

DuyHai Doan | 20 Jul 23:03 2014

Re: estimated row count for a pk range

1) Use a separate counter to track the number of entries in each column family, but it will require you to manage the counting manually.
2) SELECT DISTINCT partitionKey FROM ... Normally this query is optimized and is much faster than a SELECT *. However, if you have a very big number of distinct partitions it can be slow.
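
A sketch of option 1, using a counter table (table and column names are hypothetical):

Java

import com.datastax.driver.core.Session;

// Sketch of option 1: a counter table updated on every write and queried
// for the per-family row count. Table and column names are hypothetical:
//   CREATE TABLE row_counts (family text PRIMARY KEY, n counter);
public class FamilyCounter {
    private final Session session;

    public FamilyCounter(Session session) {
        this.session = session;
    }

    public void increment(String family) {
        session.execute("UPDATE row_counts SET n = n + 1 WHERE family = ?", family);
    }

    public long count(String family) {
        return session.execute("SELECT n FROM row_counts WHERE family = ?", family)
                .one().getLong("n");
    }
}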



tommaso barbugli | 20 Jul 18:48 2014

estimated row count for a pk range

Hello,
Lately I collapsed several (around 1k) column families into a bunch (100) of column families.
To keep the data separated, I added an extra column (family) which is part of the PK.

While the previous approach allowed me to always have a clear picture of every column family's size, now I have no option other than selecting all the rows and making an estimation to guess the overall size used by one of the grouped datasets in these CFs.

eg.
SELECT * FROM cf_shard1 WHERE family = '1';

Of course this does not work really well when cf_shard1 has a lot of data in it; is there some way, perhaps, to get an estimated count of the rows matching this query?

Thanks,
Tommaso
Robert Stupp | 20 Jul 17:32 2014

Caffinitas Mapper - Java object mapper for Apache Cassandra

Hi all,

I've just released the first beta version of Caffinitas Mapper.

Caffinitas Mapper is an advanced Java object mapper for the Apache Cassandra NoSQL database. It offers an annotation-based declaration model with a wide range of built-in features like JPA-style inheritance with table-per-class and single-table models. Composites can be mapped either to Apache Cassandra's new UserType or as distinct columns in a table. Cassandra collections, user types and tuple types are directly supported, and collections can be loaded lazily. Entity instances can be automatically denormalized in other entity instances. CREATE TABLE/TYPE and ALTER TABLE/TYPE CQL DDL statements can be generated programmatically. Custom types can be integrated using a Converter API. All Cassandra consistency levels, serial consistency and batch statements are supported.

All Apache Cassandra versions 1.2, 2.0 and 2.1, as well as all DataStax Community and Enterprise editions based on these Cassandra versions, are supported. Java 6 is required at runtime.

Support for legacy, Thrift-style models is possible with Caffinitas Mapper, since it supports CompositeType and DynamicCompositeType out of the box. A special map-style entity type has been designed especially for accessing schema-less data models.

Caffinitas Mapper is open source and licensed using the Apache License, Version 2.0.

Website & Documentation: http://caffinitas.org/
API-Docs: http://caffinitas.org/mapper/apidocs/
Source Repository: https://bitbucket.org/caffinitas/mapper/
Issues: https://caffinitas.atlassian.net/
Mailing List: https://groups.google.com/d/forum/caffinitas-mapper

Eric Evans | 19 Jul 23:21 2014

[ANN] Apache Cassandra 2.1.0-rc4 release candidate


The Cassandra team is pleased to announce the release of Apache Cassandra
version 2.1.0-rc4.

If all goes well, this release candidate will become 2.1.0 final; until then it should not be considered for production use.

We'd appreciate testing; let us know of any problems[3]!

You can download 2.1.0-rc4 from the website: 
http://cassandra.apache.org/download.

Thanks, and happy testing!

[1]: http://goo.gl/wcH7ma (CHANGES.txt)
[2]: http://goo.gl/Ooros0 (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-2.1.0-rc4

-- 
Eric Evans
eevans <at> sym-link.com
