Getting the counters with the highest values
Robert Wille <rwille <at> fold3.com>
2014-11-23 15:41:36 GMT
I’m working on moving a bunch of counters out of our relational database to Cassandra. For the most part,
Cassandra is a very nice fit, except for one feature on our website. We manage a time series of view counts
for each document, and display a list of the most popular documents in the last seven days. This seems like a
pretty strong anti-pattern for Cassandra, but also seems like something a lot of people would want to do.
If you’re keeping counters, its pretty likely that you’d want to know which ones have the highest
Here’s what I came up with to implement this feature. Create a counter table with primary key (doc_id,
day) and a single counter. Whenever a document is viewed, increment the counter for the document for today
and the previous six days. Sometime after midnight each day, compile the counters into a table with
primary key (day, count, doc_id) and no additional columns. For each partition in the counter table, I
would sum up the counters, delete any counters that are over a week old, and put the sum into the second table
with day = today. When I query the table, i would ask for data where day = yesterday. During the compilation
process, I would delete old partitions. In theory I’d only need two partitions. One that is being built,
and one for querying.
I’d be interested to hear critiques on this strategy, as well as hearing how other people have
implemented a "most-popular" feature using Cassandra counters.