Apache Cassandra, bad assumptions, and unit test shortfalls

This is a story about a problem that took a long time to solve. As in, no one noticed it was a problem for a couple of years, and then it took another year to solve. It involves using Apache Cassandra as a time series store, some bad assumptions made by the initial programmers who were more used to relational databases, and a reliance on unit tests that cover a short period of time but miss long-term problems. And there is a bonus, somewhat related story at the end.

I was pondering this story today as I was mopping my floors (maybe because I'm wearing my Cassandra shirt today) and thought it was worth writing out. I will write about it in somewhat vague terms as I don't want to point fingers, but if you ask me in the comments I'll dig up the commits and show the code lines. (And no, I didn't author the bug.)

I worked on Monasca, an open source monitoring and metering project within OpenStack, for several years. This open source project was consumed by our corporate distribution of OpenStack, and many features were contributed by the corporation. Monasca originally used MongoDB or InfluxDB as the time series storage for the monitoring data. (Todo: insert a link to a good description of time series DBs.) Monasca quickly discovered that MongoDB couldn't scale up to the load that was being generated, and InfluxDB, while capable of handling the load if configured right, didn't have a freely available High Availability configuration that fit the corporate requirements for production. So a small team set to work finding an alternative and settled on Apache Cassandra.

Cassandra had some nice features and proved to handle the load well in testing. So after some work, the Cassandra Persister was merged, complete with unit tests and documented tests for generating large loads. All looked good.

But a strange problem came up over time. Some customers reported that the CLI for retrieving the monitoring data quit working. They had been using the system successfully for months and then suddenly just got errors. With a little investigation it was found that not all of the data was broken, just some queries. Very strange, and no useful error message came out of the code.

I guess I should mention here that the Cassandra Persister code had two versions: Java and Python. Much of Monasca was originally written in Java, as the original authors were familiar with Java and found it was faster. But the standard language for OpenStack projects is Python, so an optional Python version was also created. While our customers were using the Java version in production, the problem was common to both versions.

So after months of failed attempts to find a root cause, I was finally given the bug and looked closer at what data was really being stored in the Cassandra tables. I was given a dump from one of the affected systems and tried to pull out individual rows, but found that a particular cell, the metric ID, was missing. How was it possible to lose just one cell in a row?? It was maddening. But without knowing the root cause, I set about finding a workaround. Fortunately, the metric ID is actually calculated from other cells in the row, so it could be reconstructed.

I wrote a tool (in Python) to search for rows where the cell was missing and recreate the data. It wasn't as easy as it sounds, though. Cassandra does not use the common SQL language of relational databases, but instead has CQL, which looks a lot like it. You can write a select statement that will show cells from the row, and in the output you might spot that the metric ID cell is reported as "null". But if you try to write a query to find a cell with the value of "null", it will fail. Why? Because CQL is actually lying. The data stored in the row is sparse, meaning Cassandra only stores data that is present and doesn't store null values like MySQL might. (Todo: add a picture.) Effectively the cell doesn't exist for that row, so you can't write a query to find null cells because there aren't any.
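To make this concrete, here is a minimal sketch of the behavior through the Python driver. The table and column names are stand-ins for illustration, not the real Monasca schema:

```python
# Sketch only: assumes a local Cassandra and a toy "measurements" table with
# metric_name, metric_id, and value columns (not the actual Monasca schema).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("monasca")

# Reading a broken row shows the missing cell as None (cqlsh prints it as "null")...
row = session.execute(
    "SELECT metric_name, metric_id, value FROM measurements LIMIT 1"
).one()
print(row.metric_id)  # -> None for an affected row

# ...but you cannot *search* on it. The storage is sparse, so there is no null
# cell to match, and Cassandra rejects the query instead of returning anything.
session.execute(
    "SELECT * FROM measurements WHERE metric_id = null ALLOW FILTERING"
)  # raises an InvalidRequest error
```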

The way around that was to run through all the rows in the Python code and look for the ones that didn't have a value for the metric ID cell. Not super efficient, but it worked, and I was able to recreate the value and write it back to the database to get functional rows again.
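The repair pass boiled down to something like the sketch below. The schema details and the ID formula are assumptions for illustration; I'm treating the metric ID as a hash over the metric name and its dimensions, which is roughly how I remember it being derived:

```python
# Sketch of the repair tool. Table, columns, and the ID formula are assumptions,
# not the exact Monasca schema or code.
import hashlib
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("monasca")

def rebuild_metric_id(name, dimensions):
    # Assumption: the ID is a hash of the metric name plus its sorted dimensions.
    key = name + ",".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    return hashlib.sha1(key.encode("utf-8")).digest()

# CQL can't do "WHERE metric_id = null", so scan everything and filter client-side.
for row in session.execute("SELECT metric_name, dimensions, metric_id FROM measurements"):
    if row.metric_id is None:  # the sparse cell comes back as None in the driver
        session.execute(
            "UPDATE measurements SET metric_id = %s "
            "WHERE metric_name = %s AND dimensions = %s",
            (rebuild_metric_id(row.metric_name, row.dimensions),
             row.metric_name, row.dimensions),
        )
        # Note: this write carries no TTL, so the repaired cell won't expire with
        # the rest of the row; good enough for a workaround, not a real fix.
```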

While this recreate tool worked, it didn't fix the problem permanently. Before long, the problem would manifest again, just on different rows. Again, why??

The answer finally became clear when looking at the TTL values in Cassandra. Cassandra can keep a Time To Live (TTL) value for each cell, and a default for each table, but not for a row as a whole. When the TTL runs out, the data is automatically cleaned up. Sounds handy when you are working with a large, constantly changing set of data that you want to age out to save storage. But the problem is that the original authors of the Cassandra Persister didn't understand the subtle implications; it was a bad assumption that Cassandra would behave more like a relational database.

When a row was written, all of its cells got the same TTL, so they counted down and were cleaned up at the same time. From the outside it looked like the row had a single TTL, and all was fine. But the problem came when an update to a metric occurred. The authors, to be more efficient, didn't touch the metric ID in an update. Why would you? The value of the ID doesn't change. But because the cell with the ID didn't get rewritten, it kept its original TTL, while the cells with updated data had their TTLs reset. So across the row, the TTL values drifted apart. Eventually the ID's TTL ran out and it was deleted, but the rest of the row remained, and a query would find a partial row, causing errors.
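You can watch the mechanism happen with a few lines of CQL (run here through the Python driver). The schema below is a toy, but the per-cell TTL behavior is the point:

```python
# Toy demonstration of per-cell TTLs drifting apart. The table is made up.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# Initial write: every cell in the row gets the same 45-day TTL (3888000 seconds).
session.execute(
    "INSERT INTO measurements (metric_name, metric_id, value) "
    "VALUES ('cpu.idle', 0x01, 99.0) USING TTL 3888000"
)

# A later update that skips the "unchanging" metric_id cell: only the value
# cell gets a fresh 45-day TTL, while metric_id keeps counting down from before.
session.execute(
    "UPDATE measurements USING TTL 3888000 SET value = 42.0 "
    "WHERE metric_name = 'cpu.idle'"
)

# TTL() is reported per cell, so the two numbers now differ. Once the TTL on
# metric_id hits zero, that one cell disappears and the row is left partial.
print(session.execute(
    "SELECT TTL(metric_id), TTL(value) FROM measurements "
    "WHERE metric_name = 'cpu.idle'"
).one())
```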

The solution was pretty simple: any time an update was made to the metering data, ALL the cells had to be rewritten. The fix has since been included in the upstream code and rolled out to most customers (though it came up again just a couple of weeks ago for a customer who was way behind on updates).
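In sketch form, the fixed write path looks something like this; the schema is again a stand-in, but the point is simply that every cell, including the "unchanged" ID, gets rewritten on every update:

```python
# Sketch of the fix: rewrite every cell on each update so all TTLs reset together.
# Schema and names are placeholders, not the actual Monasca persister code.
from cassandra.cluster import Cluster

RETENTION_SECONDS = 45 * 24 * 3600  # the 45-day default TTL discussed in this post

session = Cluster(["127.0.0.1"]).connect("demo")
update = session.prepare(
    "UPDATE measurements USING TTL ? "
    "SET metric_id = ?, dimensions = ?, value = ? "  # metric_id rewritten every time
    "WHERE metric_name = ?"
)

def persist_measurement(name, metric_id, dimensions, value):
    # Even though metric_id never changes, writing it again refreshes its TTL,
    # so the whole row expires as one unit instead of losing a single cell.
    session.execute(update, (RETENTION_SECONDS, metric_id, dimensions, value, name))
```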

So why didn’t this problem come up in testing?

It wasn't that there weren't unit tests for the Cassandra Persister. This bad assumption was missed in testing because the unit tests don't run long enough. The default TTL for the cells was set to 45 days, and no one in their right mind would sit around for 45 days waiting on a unit test. Since the problem doesn't occur until the data times out of the cell, there was no indication that there was a problem.

And because there was no notion that this would be a problem, there were no long-term burn-in tests done before committing upstream and shipping out to customers. If this had been suspected as a problem, a test that set the TTL to 1 day could have caught it, though even 1 day is too long for unit tests or any Continuous Integration system I've encountered.
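For what it's worth, an integration test with an aggressively short TTL can expose this in seconds instead of days. This is a rough sketch against a throwaway Cassandra with a toy schema (all names assumed):

```python
# Rough test sketch: shrink the TTL to seconds so partial expiry shows up
# immediately. Assumes a disposable Cassandra instance and a toy schema.
import time
from cassandra.cluster import Cluster

def test_row_expires_as_a_unit():
    session = Cluster(["127.0.0.1"]).connect("demo")

    session.execute(
        "INSERT INTO measurements (metric_name, metric_id, value) "
        "VALUES ('cpu.idle', 0x01, 99.0) USING TTL 3"
    )
    time.sleep(2)
    # "Update" the way the buggy persister did: refresh value but skip metric_id.
    session.execute(
        "UPDATE measurements USING TTL 30 SET value = 42.0 "
        "WHERE metric_name = 'cpu.idle'"
    )

    time.sleep(2)  # the original 3-second TTL has now lapsed
    row = session.execute(
        "SELECT metric_id, value FROM measurements WHERE metric_name = 'cpu.idle'"
    ).one()
    # The row should either be wholly gone or wholly intact; with the bug,
    # metric_id expires alone and this assertion catches the partial row.
    assert row is None or row.metric_id is not None
```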

BONUS: If you read this far, congratulations and condolences. I want to mention another big problem we hit due to not having a wealth of experience with Cassandra (another bad assumption): disk storage. You see, counter to the intuition most people learn when dealing with file systems and just about any other application in the world, Cassandra tends to fail at cleaning up after itself once you use more than about 50% of the available storage space. There are lots of other articles about this, but if you don't know this is a problem you aren't likely to go seek them out.

The testing done for the Cassandra Persister in Monasca was extensive around ingesting large amounts of data very quickly, but didn't cover what happens over time (as with the TTL), what happens when data is deleted (tombstones are kept for some time rather than clearing out space right away), or what happens when the disks fill up.

The short version of the problem is that for Cassandra to clean up and run 'compaction' on a table, it makes a copy of all the data it wants to keep from that table. That means it could theoretically copy _all_ of the data in the table, and therefore needs up to double the space of the data being stored just to manage it. So if you aren't actively compacting and clearing out old data and your database creeps over 60% (or even up to 95%) of the available disk space, you are stuck and cannot run compaction to reclaim anything!
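As a rough worked example: on a 1 TB volume holding a 600 GB table, a worst-case compaction could need roughly another 600 GB of scratch space, about 1.2 TB in total, so the compaction simply cannot run and the old data never gets reclaimed.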

We’ve seen this happen to multiple customers who were unaware of this behavior, and the response from customers and support engineers is always shock and bewilderment at how bad that experience is.

If this had been a known consideration going into the product, then monitoring could have been set up to trigger an alarm when disk usage grew high, avoiding the problem. It is easy enough to handle if caught before the disks fill up: compaction can be run, data retention can be optimized, and TTL values can be shortened.
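Since Monasca is itself a monitoring system, the check can be as simple as alarming on the data volume's utilization. Here is a minimal standalone sketch; the path and the threshold are assumptions, and in practice you would wire the result into whatever agent and alerting you already run:

```python
# Minimal sketch: warn before Cassandra's data volume crosses the danger zone.
# The path and threshold are assumptions; feed the result into real alerting.
import shutil

CASSANDRA_DATA_DIR = "/var/lib/cassandra/data"  # typical default data location
THRESHOLD = 0.50  # leave headroom for compaction's temporary copies

def check_disk_headroom(path=CASSANDRA_DATA_DIR, threshold=THRESHOLD):
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction > threshold:
        print(f"WARNING: {path} is {used_fraction:.0%} full; "
              "compaction may not have room to run.")
    return used_fraction

if __name__ == "__main__":
    check_disk_headroom()
```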

This is something that might have been found if a long-term (maybe 2 months, since the default TTL was set to 45 days) test had been run under sufficient load. But most sane projects aren't going to sit and wait for two months before shipping a product, so that was never planned.

Here are a few good links on Cassandra data considerations:

Hardware guidelines from Cassandra (doesn’t give exact numbers)

Common Mistakes

About Deletes and Tombstones

I could/should separate these two stories into two posts, and maybe a third for all the links I have collected, but this is good enough for now. 🙂