Cassandra at scale and beyond

Yuriy Yarosh
11 min read · Apr 17, 2021

I've had pretty mixed experiences and expectations when dealing with medium-sized and relatively large Cassandra and ScyllaDB clusters, as well as other in-house column-store solutions. I thought it would be great to reflect on those experiences and share what I've learned so far.

It was quite warm and comfy, inside the belly of the beast

RDBMS operations are tricky at both small and large scale, especially long term. So let's clear things up a bit for column-oriented databases and their respective datasets.

When we're talking about column stores, the first thing that comes to mind is Size, and the second one is Denormalization. But there's not much understanding of the actual cost across schema development, storage design, day-to-day operations, and architectural purpose.

Is it any good?

Well, yes. While there are simply no real long-term reliability guarantees, Cassandra can actually be pretty useful for some purposes. Let's clear up some common biases and focus on the good stuff for now.

Good stuff

Actor persistence

Cassandra, as a column store, is great for actor-model persistence. It can be a huge game-changer when used with Akka Projections and event-sourced behaviors under Cluster Sharding. In this case, Cassandra acts as a journal for incoming events and a store for pending actor behavior snapshots.

If you're not that familiar with Akka, there's a good Lightbend project called CloudState, which is essentially a Kubernetes Akka sidecar designed to provide CQRS read projections and event sourcing.

If you need to stream from a Knative-based FaaS app, it's also worth considering Cloudflow. It acts as a CloudState companion for streaming apps, for instance ML and batch data processing.

Not an ad, just something I’m used to working with.

Schema-proof(-ing)

Cassandra schema development is a bit tricky. It requires knowing, up front, all the queries the application will perform, in order to decide how the clustering key columns will actually be sorted.

Modeling a Cassandra schema differs a lot from well-known relational modeling, because you're modeling a denormalized schema with all the queries in mind. Chebotko diagrams are the preferred way of visually representing a Cassandra schema. By the way, the Hackolade design tool supports most NoSQL databases quite nicely.

Chebotko diagram example

Thus, a Cassandra model is just a set of views over your relational data, distributed across a DC cluster and ordered by the clustering key columns. Without prior knowledge of your domain model and proper formal requirements development, it's quite hard to design a distinct Cassandra schema. Just guessing might do the trick, but it's not advised.

Designing a distinct denormalized schema forces more in-depth requirements development, which reduces long-term risks and the sheer number of unknown factors.
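To make the query-first idea concrete, here's a minimal sketch under assumed names (a hypothetical chat keyspace and a messages_by_conversation table) showing how the clustering order gets baked into the table definition, using the DataStax Python driver:

```python
# A minimal sketch, assuming the DataStax Python driver (pip install cassandra-driver)
# and a locally reachable cluster; keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS chat
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

# The table is shaped by one query, "latest messages of a conversation":
# conversation_id is the partition key, message_ts the clustering column, and
# CLUSTERING ORDER BY bakes the sort order into the on-disk layout.
session.execute("""
    CREATE TABLE IF NOT EXISTS chat.messages_by_conversation (
        conversation_id uuid,
        message_ts      timeuuid,
        author          text,
        body            text,
        PRIMARY KEY ((conversation_id), message_ts)
    ) WITH CLUSTERING ORDER BY (message_ts DESC)
""")
```

A different access pattern, say "all messages by author", would get its own denormalized table rather than a secondary index, which is exactly why the full query list is needed up front.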

Cloud-connected

AWS Keyspaces is a managed Cassandra service with good enough performance and low operational requirements. It's a decent choice when you want to try Cassandra but don't want to get into all the bugs, lags, corruption, sleepless nights, and other nasty things. Keyspaces is not a silver bullet: there are a lot of limitations, but you get none of the compaction, repair, and backup hassle, and it has pretty decent point-in-time recovery.

You can treat AWS Keyspaces as DynamoDB with a CQL interface.

Plan B

Well, there's always ScyllaDB, which is an exceptional replacement for Cassandra with a much better developed life cycle; there's even a decent Kubernetes operator. After some struggle, you might also find out that Scylla covers most of Cassandra's design flaws and plain bugs. And there are a lot of bugs present.

Not great stuff

Schema-fool(-ing)

First and foremost, "schema-less" doesn't mean that changing the schema will be free. An additional DB accessor abstraction can help mitigate the added operational costs during migrations.

For PostgreSQL, that means adding a view that wraps both the old and the new version of the table and provides a way to populate missing data, only for active users, using stored functions, while the DDL runs in the background. Yes, there's a clear overhead involved, but it's manageable from both a development and an operations perspective.
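To illustrate that PostgreSQL pattern, here's a rough, hedged sketch with made-up users_v1/users_v2 tables; the stored-function backfill for active users is omitted, and psycopg2 is assumed purely for the example:

```python
# A rough sketch of the "view over old and new table versions" pattern; table
# and column names are hypothetical, and the stored-function backfill is omitted.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    # The new table version is created up front; heavy backfill runs in the background.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS users_v2 (LIKE users_v1 INCLUDING ALL);
        ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS display_name text;
    """)
    # The application only ever talks to the view, which prefers migrated v2 rows
    # and falls back to v1 for users that haven't been touched yet.
    cur.execute("""
        CREATE OR REPLACE VIEW users_current AS
            SELECT id, email, display_name FROM users_v2
            UNION ALL
            SELECT v1.id, v1.email, v1.email AS display_name
            FROM users_v1 v1
            WHERE NOT EXISTS (SELECT 1 FROM users_v2 v2 WHERE v2.id = v1.id);
    """)
```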

Changing the primary key of a column-store table requires a full table copy to redistribute all the data around the ring according to the new key definition.

Thus adding a new node, or making any schema change, requires propagating those changes to every node in the cluster. The fun part is that DDL propagation itself is handled by your CQL connection driver. Any writes during DDL propagation may cause consistency errors and half-applied writes, so writing a Cassandra driver wrapper on top to deal with these kinds of issues is fairly common.

Every major Cassandra consumer has its own flavor of this, with drivers and operations tooling developed from scratch in different variations.

Having multiple table versions with opposite column sort orders is also a massive hassle: table partitions grow indefinitely over time, and you'll have to decide which sort orders get managed manually at the application level.

Cassandra's "best practice" is to avoid schema migrations and have multiple tables with a version suffix instead, dealing with all the compatibility hassle at the application and driver-wrapper level.
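A minimal sketch of what that driver-wrapper level tends to look like, assuming hypothetical events_v1/events_v2 tables where writes go to the newest version and reads fall back to the older one:

```python
# A minimal sketch of application-level table versioning; the DataStax Python
# driver is assumed and the events_v1 / events_v2 tables are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")

WRITE_TABLE = "events_v2"                  # all new writes go to the latest version
READ_TABLES = ["events_v2", "events_v1"]   # reads fall back to older versions

def save_event(event_id, payload):
    session.execute(
        f"INSERT INTO {WRITE_TABLE} (event_id, payload) VALUES (%s, %s)",
        (event_id, payload),
    )

def load_event(event_id):
    # Check table versions from newest to oldest until the row is found.
    for table in READ_TABLES:
        row = session.execute(
            f"SELECT event_id, payload FROM {table} WHERE event_id = %s",
            (event_id,),
        ).one()
        if row is not None:
            return row
    return None
```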

Flaky Eventual consistency

You'll need at least 5 replicas of your production data to make it actually available for more or less decent operation. So when we're talking about time-series or just plain old analytical data, ask yourself: can your company or clients really afford to keep 5 copies of it?

Why 5?
Let's say you've got a single node down. The common LOCAL_QUORUM consistency level still requires a majority of replicas (⌊RF/2⌋ + 1, so 2 of 3 or 3 of 5) to acknowledge every mutation. On top of that, metadata loss or compaction/repair pressure can take out at least two more nodes. So during peak load, having 5 replicas at least ensures that a couple of nodes remain available if things really go sideways.

In practice, you'll get hinted-handoff pressure every time 2 out of 5 replica nodes are busy compacting or repairing. Hinted handoff is just a way to buffer writes for unavailable nodes, and it essentially doubles the network throughput requirements for every failed node, both during the outage and after the node is brought back online.

It's fairly common to search through existing Cassandra issues to get a general idea of why yet another interrupted repair or compaction caused a loss of table metadata… and there are clearly a lot of them.

If we're talking about the common replication factor of 3, you'll usually end up losing 2 nodes at some point, and this paralyzes all write operations: you'll be forced to use a consistency level of ONE instead of LOCAL_QUORUM.
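A hedged sketch of what that forced downgrade usually looks like at the application level, assuming the DataStax Python driver and a made-up metrics table:

```python
# A sketch of falling back from LOCAL_QUORUM to ONE when too many replicas are
# down; the DataStax Python driver is assumed and the metrics table is hypothetical.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("app")

def write_metric(name, value):
    query = "INSERT INTO metrics (name, value) VALUES (%s, %s)"
    try:
        stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        session.execute(stmt, (name, value))
    except Unavailable:
        # Not enough live replicas for a quorum: degrade to ONE so writes keep
        # flowing, at the price of much weaker consistency guarantees.
        stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)
        session.execute(stmt, (name, value))
```

The driver also ships retry policies that can downgrade consistency automatically, but doing it explicitly keeps the weaker guarantee visible in the code.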

Having all your data eventually consistent goes both ways: it will also become eventually corrupted due to bugs, IO pressure, OOMs, and hinted-handoff pressure, and you might not be ready for that. Some of these aspects are just poor design, while others are inherent to all DHT-based distributed column stores.

The other interesting aspect of Cassandra is that you can't really change the number of vnodes without rebuilding the whole cluster, so it's better to keep it high and token-balanced, meaning all token ranges are distributed evenly across the nodes.

Data kinds

Understanding what kinds of data are being stored, and for what purpose they're being indexed, is essential. All your data has some form of deprecation, and managing that deprecation is crucial for database performance and proper resource utilization.

Well, I keep telling folks to denormalize only the data of active users onto SSDs, while keeping everything else normalized on conventional HDDs. That way it's safe to archive later without affecting production performance much.

So, Cassandra is not great for any kind of partitioned, analytical data.
It's better to stick with something like Vertica at bigger scale, while Postgres is still a great choice at smaller scale (less than 10 TB per node for warm partitions).

tl;dr: it's not OK to use Cassandra as a long-term time-series database,
and maybe, just maybe, OK for the short term

Using Cassandra as a time-series database makes sense only for high-burst workloads, when it's crucial to redistribute a lot of writes across a high number of nodes during hard times. For instance, weekly chat logs are a pretty good fit. Even then it requires separate application-level bucketing metadata, which stores the time periods under which data is written at any given point in time… does that ring a bell? No?

Yes, you've heard it right: it's trigger-happy HTB-style scheduling and write-throughput throttling for your Cassandra app. Personally, I'd be glad if the database had a decent scheduler so I wouldn't have to throttle write transactions myself. Some folks might be OK with that; as long as it does the trick, I guess.
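To make the bucketing mentioned above concrete, here's a minimal sketch assuming a hypothetical chat_logs table keyed by room and ISO week, so a room's writes are spread across one partition per week:

```python
# A sketch of application-level time bucketing for a hypothetical chat_logs table;
# the (room_id, week_bucket) composite partition key keeps any single partition
# from growing without bound.
import datetime
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")

# Assumed schema:
# CREATE TABLE chat_logs (
#     room_id     uuid,
#     week_bucket text,          -- e.g. "2021-W15"
#     message_ts  timeuuid,
#     body        text,
#     PRIMARY KEY ((room_id, week_bucket), message_ts)
# ) WITH CLUSTERING ORDER BY (message_ts DESC);

def week_bucket(ts: datetime.datetime) -> str:
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"

def save_message(room_id: uuid.UUID, body: str):
    now = datetime.datetime.utcnow()
    session.execute(
        "INSERT INTO chat_logs (room_id, week_bucket, message_ts, body) "
        "VALUES (%s, %s, now(), %s)",
        (room_id, week_bucket(now), body),
    )
```

Reading a date range then means the application has to know which buckets to query, which is exactly the extra bucketing metadata mentioned above.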

Backups

There are two major data backup mechanisms in both Cassandra and ScyllaDB: incremental backups and snapshots. There's not that much difference between them: incremental backups are taken as tables are flushed and can safely be deleted after the actual flush, while a snapshot is a full copy of the table. Snapshots are usually created before a compaction, because the compaction might fail, which essentially doubles the disk space requirements. Using both snapshots and incremental backups helps reduce the disk space requirements, but most of the time they're simply ignored and deleted if there have been no major issues for the last ten days or so (roughly the default gc_grace_seconds window for tombstone collection).
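For reference, a hedged sketch of how these two mechanisms are usually driven with nodetool; the keyspace name and the off-node shipping step are assumptions, and your data directory layout may differ:

```python
# A sketch of driving Cassandra backups with nodetool; the keyspace is hypothetical
# and the snapshot/backup paths follow the default data directory layout.
import datetime
import subprocess

KEYSPACE = "app"
TAG = datetime.date.today().isoformat()

# Full snapshot: hard links to the current SSTables end up under
# <data_dir>/<keyspace>/<table>/snapshots/<tag>/
subprocess.run(["nodetool", "snapshot", "-t", TAG, KEYSPACE], check=True)

# Incremental backups (incremental_backups: true in cassandra.yaml) land in
# <data_dir>/<keyspace>/<table>/backups/ as SSTables are flushed. Ship both
# directories off the node with your tooling of choice, then clean up:
subprocess.run(["nodetool", "clearsnapshot", "-t", TAG, KEYSPACE], check=True)
```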

It might sound strange, but Cassandra itself is a good backup target for a main database, due to its self-healing properties, if you're lucky enough, or your budget is big enough, to store a couple of additional replicas.

There's always a slight risk of data loss, because Cassandra has no reliable internal means of data streaming; ScyllaDB does.

Streaming

Well, there are a few approaches to Cassandra streaming, but none of them is actually reliable enough:
• trigger-based streaming — you have no control over the internal thread pools, and the trigger implementation may cause performance loss
• Change Data Capture (CDC) streaming — in simple words, you stream a copy of all your flushed commit log entries, but there's the added hassle of merging and compacting commit logs manually; if the consumer is too slow, Cassandra refuses to accept new writes
• incremental backup streaming — it could work just great, except that the streamed data is not quite up to date and there's a potential for data loss; that's the approach Netflix has actually been using for quite a while

For now, only ScyllaDB has a reliable way of CDC streaming: it provides a dedicated CDC log table (with a "scylla_cdc_log" suffix) containing the data from before and after the actual mutation was applied. If the consumer is too slow there's a risk of losing entries from the stream, but there's no added risk for the data itself.
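A hedged sketch of what enabling and reading Scylla CDC looks like in CQL terms, via the Python driver; the accounts table is made up, and in practice a dedicated CDC consumer library beats polling the log table directly:

```python
# A sketch of Scylla CDC: enable it per table and read the auto-maintained
# <table>_scylla_cdc_log table; the accounts table is hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")

# Pre- and post-images make the log carry the row data from before and after
# each mutation, which is what makes the stream actually useful downstream.
session.execute("""
    ALTER TABLE accounts
    WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true}
""")

# Scylla maintains the log table automatically; the cdc$-prefixed columns
# describe the stream, timestamp, and operation type of every change.
for row in session.execute("SELECT * FROM accounts_scylla_cdc_log LIMIT 100"):
    print(row)
```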

Compaction

As you might already know, Cassandra uses an append-only storage structure called a Log-Structured Merge tree (LSM tree), together with an on-disk commit log.
All mutations are written to disk first, before being stored and processed in memory. The downside is that the commit log should sit on the drive with the fastest IOPS available, even better if it's a provisioned-IOPS device. The other caveat is that an LSM tree is append-only, so it needs to be compacted once too many deletes accumulate inside it.

Thus there are different compaction strategies that can be applied to a target table, and different side effects you might get if the respective compaction is interrupted: from a plain loss of table metadata, which can be restored, to the "Night of the Walking Dead" and "data resurrection".
There's also the "necropolis" situation, where the current row consists only of tombstones and further reads are just guesses.

The fun part starts when you're unable to actually repair it, because there clearly aren't enough live replicas to finish the repairs properly.

There are three major compaction strategies, each with its pros and cons for different usage purposes (a short sketch of setting them per table follows this list):
• Size-tiered — good for general usage where reads are almost as frequent as inserts, but it lacks flexibility in the case of frequent updates (space amplification can reach 11× in some production cases)
• Leveled — a bit of a weird strategy, optimized for read- and update-intensive workloads like the actor persistence mentioned above, but it causes write amplification
• Time-window — good for infrequent time-series data and backups, where there are mostly no deletes and no write spikes
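For reference, the strategy is declared per table and can be switched later with an ALTER TABLE; here's a minimal sketch with hypothetical tables and commonly cited options, not a tuned production config:

```python
# A sketch of setting and switching compaction strategies per table; table names
# are hypothetical and the options shown are illustrative, not tuning advice.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")

# Time-series style table compacted in daily windows (TWCS).
session.execute("""
    ALTER TABLE sensor_readings WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
""")

# Read- and update-heavy table switched to leveled compaction (LCS).
session.execute("""
    ALTER TABLE actor_snapshots WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    }
""")
```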

So I'd say that practical Cassandra usage is fairly limited by the three compaction strategies it ships with.

Scylla Enterprise additionally has Incremental Compaction, which fixes the size-tiered space amplification issue; it's also possible to temporarily switch from leveled to size-tiered compaction when performance bottlenecks appear.

Scylla has pretty decent documentation on the different compaction strategies; it's worth checking out.

Repairs

The rule of thumb: "in case of doubt and fear of consistency loss, run full repairs". Repairs are fairly frequent and can be compared to "full cluster scans" in relational-DB performance terms. Even though an incremental repair mechanism exists, running full cluster repairs node by node after restoring from a snapshot or bootstrapping a replacement node is a MUST. Cassandra may lose its consistency by itself due to multiple, relatively unpredictable factors. If a full repair fails, it's more convenient to recover the failed node from snapshots than to run the repairs again, due to excessive cluster gossiping and network overhead.
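A hedged sketch of that node-by-node routine driven through nodetool; the host list and keyspace are made up, and real setups usually delegate this to Cassandra Reaper or an operator:

```python
# A sketch of running full, primary-range repairs node by node; host addresses
# and the keyspace are hypothetical.
import subprocess

KEYSPACE = "app"
NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

for node in NODES:
    # -full forces a full (non-incremental) repair; -pr limits each node to its
    # primary token ranges so the same range isn't repaired multiple times.
    subprocess.run(
        ["nodetool", "-h", node, "repair", "-full", "-pr", KEYSPACE],
        check=True,
    )
```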

So, sometimes it's simply a matter of luck whether you have a sufficient number of replicas present to actually perform the repairs.

To sum it up

This all essentially means that:
1) You have to keep 50% of your disk space free, always
2) You have to keep 50% of your IOPS in reserve, always
3) You have to be able to double the networking bandwidth on demand
4) You have to be prepared for at least one node becoming inaccessible during repairs, and for possibly losing hinted handoffs, which would require an additional repair run
5) You have to be prepared to allocate a mirror cluster from scratch in case of major schema migrations or Hadoop integration

All of which can be really, really challenging by itself.
So, yeah, Cassandra operations are quite the pinch, and Scylla doesn't differ all that much either.

Conclusion

Sure, you can "just get things done" with Cassandra and it will work perfectly fine, if you control the growth of your cluster and fine-tune most of the stock parameters.

Compared with Scylla, there's a similar extra maintenance cost, which can be managed with cloud-hosted solutions or the Kubernetes operator.

Upgrading Cassandra on your own, even across minor versions, has proven to be a tedious and error-prone process with data corruption risks. Backups and snapshots are a must.

If you can't restore your column-store data at any given point in time, you're doing something very wrong. And keeping a few dozen replicas "just to make it work somehow" is not a reliable long-term plan.

It's good to treat Cassandra as a highly available read projection, in CQRS terms, on top of your relational database, and as an intermediate store for the current state of your business processes. If you can describe your business processes as FSMs, Cassandra is the proper place to store the FSM state-transition history until the given process is proven to have finished correctly and there's no further need for state reproducibility.

Hopefully I've cleared up a few biases regarding Cassandra operations and made its purpose a bit clearer.
