MayaData Blog

Kubernetes storage performance myths

Written by Evan Powell | May 19, 2020 9:01:23 PM

Intro:  You’ve been had!

If you want to skip ahead - to the stuff about how the future is here now - please do.  Skip down for my introduction of Mayastor.  

Or … read more to learn how you may have been misled by your storage vendors when they asserted that characterizing performance on a per-workload basis for shared storage on Kubernetes was a tractable problem. I think it is worth a few minutes of your time. Thankfully, there is a way out that leverages Kubernetes itself.

Have you ever been involved in an analysis of the performance of a typical shared everything storage system?

How did it go?

Did you learn how your system will perform in reality?

More importantly, did you learn how your workloads would perform in reality?

If you answered yes then - well - you didn’t do a very good job of analyzing the performance under a variety of future conditions.  It turns out that you cannot get it done in a human time frame.

Performance tuning of shared-everything storage is hand-waving even before you introduce the dynamism of containerized workloads.

If you have been around long enough, you know all too well that this uncertainty at the foundation of the stack has a real-world impact on us all: your production workloads don’t behave as projected, so you firefight and often buy engineered systems just for the important workloads.

And by production workloads I mean for example e-commerce systems driving shopping carts or financial systems enabling you to get paid and pay the team.  Important stuff.

So by now you might be thinking you’d like to see some proof.

Let’s do it! Let’s benchmark for real

Let’s imagine that you’d like to do something fairly basic so you can have some good idea of how things will work in production.

Armed with your trusty FIO or another benchmarking tool, you sit down to figure out a reasonable plan.

You want to start with your common workloads, some message queues, DBs perhaps of different flavors or primary configurations and so on.  You can pretty easily get to dozens of workloads when you consider that they may be configured differently.  For this example let’s say there are 32 flavors of workloads.
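To make that concrete, here is a minimal sketch of what two such flavors might look like as FIO job definitions. The job names, block sizes, mix ratios, and queue depths are purely illustrative assumptions, not recommendations:

    ; Two hypothetical workload flavors for fio.
    ; All values below are illustrative assumptions, not tuning advice.
    [global]
    ioengine=libaio
    direct=1
    time_based
    runtime=300
    filename=/dev/nvme0n1

    [oltp-like-8k]
    ; mixed random read/write, roughly database-shaped
    rw=randrw
    rwmixread=70
    bs=8k
    iodepth=16
    numjobs=4

    [log-writer-seq]
    ; sequential append, roughly message-queue-shaped
    rw=write
    bs=1m
    iodepth=4
    numjobs=1

Multiply a handful of job files like this by the different ways each real workload can be configured and you get to 32 flavors quickly.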

Let’s say your storage system has approximately 10 different variables you can try when optimizing it.  In reality, the number is far higher - but let’s start with 10 that seem important.  As an example, block size or cache policies.

Further, let’s assume that you have 4 settings for each of these variables.  Again, this is an oversimplification, but perhaps it is close enough for this exercise.

If you are curious, here is an example config file for Ceph.  That's it - get your CRUSH rule and your chooseleaf on!

https://github.com/ceph/ceph/blob/master/src/sample.ceph.conf
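To give a sense of the knobs hiding in that file, a handful of commonly tuned options look something like the following; the values shown are illustrative only, not tuning advice:

    [global]
    osd_pool_default_size = 3
    osd_pool_default_pg_num = 128

    [osd]
    osd_memory_target = 4294967296
    bluestore_cache_size_ssd = 3221225472

    [client]
    rbd_cache = true
    rbd_cache_size = 33554432

Every one of these interacts with the others, and with the workloads on the other side of the wire, which is exactly why the test matrix below explodes.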

We don’t want to be here all day, so let’s assume you are good at automation: setup and teardown time is minimal, and you trust that, generally speaking, the storage system won’t suddenly degrade after a few minutes, so the runs take no more than 10 minutes each.  By the way, that depends on the storage system and the concurrent load, but again - simplify!

Last but not least, you know enough to know that a storage system in state A may behave very differently from a storage system in state B, so a single run is clearly not sufficient.  Again, you have real life to get back to.  Let’s say you decide on 10 test runs.

Before we go any further, I’d just point out that a test setup that might have been sufficient in the past would likely be a bit too simplistic in a world of ever-changing workloads moving on and off your containerized environment.

Secondly, note that nothing we are doing addresses the noisy neighbor problem.  Just to pick one example, a Ceph cluster with a few workloads on it is radically different from one that has been in production for a while and is under some load.  The neighbors are not just noisy in that case; they are firing grenade launchers at each other.  But that’s clearly tough to benchmark, so let’s skip it.

Let’s just run the tests as outlined above.

Set ‘em up before lunch, go on a good social distancing walk around the neighborhood, and come back to check on your results.

Here’s the thing: assuming your automation works flawlessly and is hands-off between workloads, runs, permutations, and so on, by my math it will take approximately 61 years to complete.

It’s actually pretty straightforward.

So to recap: 10 variables to set up per workload (e.g., cache policy, block size), 4 settings for each, 10 minutes of setup, run, and teardown per configuration, 10 runs apiece, and all of it repeated over 32 different workloads.  Treating the knob space, very generously simplified, as 10^4 = 10,000 configurations, we get 10^4 x 10 x 10 x 32 = 32 million minutes - roughly 61 years!
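If you would rather let a script do the arithmetic, here is a quick back-of-envelope sketch. It also computes the full cross-product of knobs and settings, which is one way to arrive at the much larger figure discussed below:

    # Back-of-envelope math for the benchmark matrix described above.
    MINUTES_PER_YEAR = 60 * 24 * 365

    workloads = 32         # flavors of workloads
    variables = 10         # storage-system knobs per workload
    settings = 4           # settings tried per knob
    runs = 10              # repeated runs per configuration
    minutes_per_run = 10   # setup + run + teardown

    # Simplified count used in this post: call the knob space 10^4.
    simple_configs = 10 ** 4
    simple_years = simple_configs * runs * minutes_per_run * workloads / MINUTES_PER_YEAR
    print(f"simplified: {simple_years:.0f} years")          # ~61 years

    # Full cross-product: 4 settings for each of 10 knobs is 4^10.
    full_configs = settings ** variables
    full_years = full_configs * runs * minutes_per_run * workloads / MINUTES_PER_YEAR
    print(f"full cross-product: {full_years:.0f} years")    # ~6,384 years, i.e., the 6,300+ figure below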

If you want to dig into the reasoning behind these parameters, I’d refer you to one of the best scientific thinkers on all things performance, storage, workloads, and more, Alex Aizman, who is now the lead storage architect for NVIDIA.

Specifically here: https://storagetarget.com/2017/07/07/four-decades-of-tangled-concerns/

I’m biased, since Alex was one of my co-founders at Nexenta; however, if you flip through his blog you’ll learn a lot and be entertained as well.

Also, his contention - and you might want to trust him instead of me - is that it actually isn’t 61 years, it is 6,300+ years: roughly what you get if you treat the 10 knobs with 4 settings each as the full 4^10 cross-product instead of the simplified 10,000 configurations.

Either way, the bottom line is this: if you tried to predict, bottom up, how your workloads would behave when the foundation is a shared storage system - you’ve been had.

If you read Alex’s blog, you’ll note that he also points out that we didn’t take the network into account.  For this and other reasons, our estimates - whether 61 years or the likely more accurate 6,300 years - are underestimates.  Again, I really encourage you to read his blog for a more complete understanding.

So how do we engineer Kubernetes storage performance?

So what can we do as engineers - or would-be engineers, in my case - to move things forward and make them more tractable?

The choice of most hyperscalers - and increasingly of the workloads they helped create, like Kafka, Cassandra, or even Elastic - is pretty clear.  They don’t use shared storage.  Just say no.

This makes all the more sense when you consider how storage media has changed over the years.  In short, it used to be constrained by the speed of sound (disk drives) and now is constrained by the speed of light more or less (electrons zipping around NVMe drives).

One reason for shared storage in the past was that you needed to stripe across many, many disks - each maxing out at 150 or so IOPS - to get to reasonable IOPS for your workloads.

Today all that striping and metadata lookup - and the network complexity - just adds latency compared to the millions of IOPS you can get out of a 2U server from a vendor like Supermicro.

So today most newer workloads just want the disk.  The heck with snapshots or replication, the database can handle that just fine.

The storage wars are over.  The users picked: D) none of the above, aka direct attached storage, or DAS.

Unfortunately, DAS wastes a lot of resources, time, and effort in the care and feeding of underlying cloud volumes or on-premises hardware.  Someone has to handle disk management, volume resizing, the life cycle of cloud volumes or NVMe drives, encryption, and more.  And, ideally, wouldn't it be good to get the sort of efficiency gains containers deliver for compute - for storage as well?  I mean, all the money goes to storage spending anyway....

The approach we’ve taken at MayaData in helping to invent and popularize the Container Attached Storage pattern, or CAS, is to partition the problem.  Instead of working the storage UP, we work the workload DOWN.  And instead of sharing everything, we share as little as possible - ideally nothing at all other than Kubernetes itself.

With CAS and OpenEBS, each workload essentially has its own storage system.  Instead of turning knobs on your storage system, your Kubernetes cluster, your network, your workload, and so on, you trust in Kubernetes.  The division of concerns is expressed via Storage Classes.

AND, if you just want performance, you get straight to the disk with the help of a flavor of dynamic Local PV via OpenEBS.  Let OpenEBS handle the taints and so forth - to keep your workload local - and let OpenEBS handle the life cycle of the underlying disks and cloud volumes as well.

And if you are willing to pay the storage tax in order to get snapshots, replication, and more - great - OpenEBS delivers a storage engine that does that.  We call it cStor.  Either way, your users don’t know the difference: you author and publish the storage classes, and on a per-workload basis the right bits are working for you.  This fits right into the Kubernetes way of managing the separation of concerns - and of course enables GitOps practices as well.
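As a concrete illustration of that authoring step, here is a minimal sketch of a StorageClass wired to the OpenEBS LocalPV hostpath provisioner. The class name and BasePath are illustrative; check the OpenEBS documentation for the exact options your version supports:

    # Illustrative StorageClass for OpenEBS dynamic Local PV (hostpath flavor).
    # The name and BasePath are examples only.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-hostpath-fast
      annotations:
        openebs.io/cas-type: local
        cas.openebs.io/config: |
          - name: StorageType
            value: hostpath
          - name: BasePath
            value: /var/openebs/local
    provisioner: openebs.io/local
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer

A workload that needs snapshots and replication would instead reference a cStor-backed StorageClass; the application manifests stay the same, only the storageClassName they point at changes.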

Every workload essentially has its own storage system running the flavor of storage engine that is best for that workload.

The above approach has led to OpenEBS becoming a de facto standard under a bunch of workloads, with dynamic Local PV being the top storage engine and cStor being used quite a bit as well.  Open source projects like Elastic, PostgreSQL, and others increasingly default to OpenEBS.

Nonetheless, we thought we could do better.  We knew Kubernetes storage performance would be an issue, so a couple of years ago we started working on a radically new approach, one that could let you have your cake and eat it too - maximum performance similar to direct attached storage PLUS key features.

We started by assuming that what is only now becoming common already existed.  Those assumptions included:

  1. That NVMe would spread and that SPDK from Intel would mature enough to be the right way to connect to NVMe environments.  ✅
  2. That Rust as a new systems language would also mature - and enable us to write software that could keep up with the potential performance of NVMe. ✅
  3. That Kubernetes itself would mature and win.  ✅
  4. That Kubernetes would get better at embracing data running on it as a workload. ✅

Incidentally, at MayaData we have contributed both to Kubernetes and to SPDK itself.

All of this brings us to Mayastor

Mayastor is the first of its kind - a breakthrough in Kubernetes storage performance.  Never before has a cloud native data engine of this type been available.  That much is clear in its elegant, clean architecture.  Perhaps more importantly, the difference is clear in its performance: it delivers close to the theoretical maximum performance of the underlying NVMe resources, while itself being composed of a set of containers able to deliver per-workload storage.

Mayastor retains the Container Attached Storage pattern - and it has just been included in OpenEBS for the first time as a storage engine option.  Learn more here: https://github.com/openebs/Mayastor

In closing, the future is here - though certainly unevenly distributed - and the challenges of getting predictable Kubernetes storage performance to your workloads are now more tractable. You no longer need to budget 61 or 6,300 years to sort through some of the ways your storage will impact your performance.

Give Mayastor a try - and please let us know what you think.  It remains in beta, and I truly think it is a breakthrough in Kubernetes storage performance.

And while you are at it, please join the many thousands who are using other flavors of OpenEBS, including dynamic Local PV, in production now.  Public references include Comcast, Arista, Bloomberg, and others.