You’ll recognize that there are a number of labored patterns for technology blogs. Well, today I’m going to use the “Tenuous Segue” template for this particular offering. Not, dear reader, because I wish to bore you, but because it’s been a while since we’ve blogged about our progress on, and intentions for, the Mayastor project, and I think some context would therefore be appropriate. This is also my first blog as a MayaData employee, so it presents me with an opportunity to introduce myself through the rambling stories of my youth. Let’s begin.
I spent my childhood, and later my adolescence, in the UK during the 1970s and 1980s. I’m not going to present the customary argument for “the best time to be a child”, but those decades formed a colorful, generally exciting, and optimistic backdrop to my formative years. As in any era, they bore witness to human tragedies, both international and domestic, but also to landmark moments of human advancement. From the comfort of my parents’ lounge, I observed Glasnost and Perestroika, the fall of the Berlin Wall, and the eventual end of the Cold War. I arrived too late for Apollo but was there to marvel at the birth of the Shuttle program. Though I took a circuitous route, it was the 8-bit home microcomputer invasion of the 1980s that ultimately led me to both a career and a life-long interest in technology.
When indulging myself in a spot of reminiscence, I find that my most vivid memories of that period, and those which had the greatest emotional impact upon me, aren’t necessarily the affairs of this blue planet and its fractious denizens. For me, it’s often the great movies of my childhood that linger. 1977 is the year I have in mind now, and with it, an iconic sci-fi release. No. Not that one. The other one.
If you remember anything of the blockbuster “Close Encounters of the Third Kind”, it’s going to be the iconic closing sequence, in which the humans use a sizable “light organ” in order to converse with the little grey folks, using the famous five-note hook.
[If you’ve not seen it, then you really should do so. Then go seek out all the other great movies on which special effects director Douglas Trumbull worked. If you skip 1983’s criminally overlooked “Brainstorm”, then do not pass Go and do not collect 200 monies.]
What you may not know, unless you too have carried that film as part of your developmental landscape and/or also hold a strong interest in electronic music, is that the sophisticated synthesizer stack on show in that film was not a faked prop but actually a real product which you could buy at the time - the ARP 2500.
So what’s the connection I’m trying to make? The connection is, quite literally, just that. ARP built a better mousetrap (specifically, better and more stable electronics than the market leader at the time, Moog) but failed to sell the 2500 in any significant numbers, in large part due to an innovative, but ultimately poor, signal path design choice. At the time of the 2500, it was established practice to interconnect the various signal generators, filters, and effects units comprising a modular analog synthesizer with patch cables - descendants of those used by human telephone operators in the era of manual exchanges, and ancestors of the RJ-45 network patch lead. This allowed the player to reconfigure the signal path to create new sounds but, just as in the server and telco rooms we’ve all seen and worked in, it almost instantly manifested an almighty rat's nest of cables. The size of the jack plug connectors, combined with the sheer number required, conspired to set a fairly high lower bound on an instrument’s size, limiting its portability.
ARP’s solution was to remove (most of) the need for bulky, external patch leads by delivering the same functionality (configuration of the signal path) using a system of internal bus-bars and switches arranged as a matrix. Think PLA/FPGA, but with mechanical switches instead of fusible links. This allowed them to shrink the physical size of their instrument considerably compared to one of similar capabilities from Moog. However, in practice the same switch matrix ‘innovation’ gave rise to significant signal contamination between adjacent signal lines which weren’t supposed to be connected (“crosstalk”), negating any quality improvement that ARP’s superior analog electronics should have delivered. Being physical switches, they were also prone to contact oxidation and contamination by dust bunnies et al., leading to reliability issues.
[Fun fact: the support technician whom ARP dispatched to the film set to install and look after the instrument during filming ended up being cast as the keyboard player in the film, after Spielberg overheard him playing it between takes. He’s also the only member of the cast whose name appears twice in the end credits.]
Whilst the Mayastor project was conceived and began life at MayaData before I joined the company this year, and to my knowledge no-one involved in that early design process was aware of the yarn I’ve just spun, if we squint our eyes up real tight then these two projects might look like siblings whose lives diverged on the basis of one fateful choice. With Mayastor we’ve prioritized a robust signal path (i.e. data plane) architecture which, to the maximum viable extent, doesn’t place upper limits on the performance of the Container Attached Storage (CAS) system it forms the basis of. We want that, but we also wish to preserve the flexibility to add new features without significant redesign and refactoring. We believe that we are succeeding in that goal.
In the remainder of this piece, I’d like to (re)acquaint you with this particular aspect of Mayastor’s design and then close out by describing what features and functionality we expect to deliver with the next release, version 0.3.0. Broader goals for Mayastor as MayaData’s next-gen storage engine for Kubera, and the significance of CAS and of Kubernetes as a data plane, have been discussed at length in previous articles, so we won’t repeat ourselves here. [Suggested additional reading: “Mayastor: Composable you keep saying that word”, “Mayastor: Crossing the Chasm to NVMF, Infinity and Beyond”, and “Container Attached Storage: A Primer”.]
We can consider CAS, and Mayastor as an example of it, as in many ways a more Kubernetes-native and per-workload-centric version of the older pattern of storage virtualization. In turn, storage virtualization is itself a flavor of hardware abstraction. The usefulness, and hence lifetime, of any implementation of such an abstraction is bounded both by its flexibility and by its “power”. In the CAS domain, what we’re looking for ideally is a very flexible, composable system that simultaneously delivers high throughput and low latency, whilst requiring only modest computing resources to do so.
Ten years ago, when I first started working for a company called DataCore, storage virtualization was such a new concept that even we, as a seminal vendor of such products, didn’t really appreciate that this was what we were actually building and marketing. When we spoke with potential customers, who would be accustomed to purchasing the custom hardware-based, monolithic storage arrays of the period, we needed to find a new lexicon to convey to them the differences in our approach, along with its technological and financial advantages. I often drew an analogy between what our software was actually doing and a conceptual “storage router”. In effect, we’re looking for high-performance, policy-based routing of I/O between the consumer of the storage and the actual persistence layer. We must accept requests for service on our “front-end” and, according to established rules relating to security, QoS, availability, and durability, redirect that exchange to the hardware device which will ultimately provide that service in some way. The architecture of the system performing that routing sets an ultimate limit on the performance of the user applications whose data traverses it. To continue the analogy: at the edge of the network we might just get away with trading off some performance against cost, but as we move inwards, up through the top-of-rack switches, to the distribution layer and ultimately into the network core, any compromise will always out itself, and usually at the worst possible moment (there was a reason that, once upon a decade, all the equipment in your core network was a nice shade of #15495d).
With Mayastor, our approach to meeting these challenges has been to predicate our data plane on NVMe and NVMe over Fabrics (NVMe-oF). I’m not going to go into detail on those technologies here, so if you’d like to know more or require a refresher, I suggest an excellent blog post by Chuck Piercey, “The NVMe-oF Boogie”. To borrow Chuck’s summary, these transports offer us greater protocol and command set efficiency than other, more commonly deployed, SCSI-based ones. They also confer the potential for significantly enhanced parallelism. Of equal importance, we’ve chosen to base our implementation of NVMe-oF on Intel’s Storage Performance Development Kit (SPDK). SPDK employs a polled, asynchronous, and lockless model; a salient feature of that approach is that throughput scales in a very linear fashion with respect to CPU utilization. This is a highly desirable characteristic for distributed, microservice-like applications - our resulting cluster’s I/O capabilities scale in a predictable fashion, making the life of the SRE tasked with meeting SLOs in the face of changing workloads just that little bit easier. This predictability compares extremely favorably with traditional shared-everything approaches to storage, which not only break the loosely coupled pattern of cloud-native workloads but also perform in a way that is essentially impossible to predict at scale; the challenges of benchmarking traditional shared-everything storage were discussed in a webinar about Mayastor that can be found here.
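To make the polled, run-to-completion idea a little more concrete, here is a deliberately simplified sketch in Rust (Mayastor’s implementation language). It is not SPDK’s API and not Mayastor code; the types and names are invented for illustration. The point is simply that a reactor which exclusively owns its queues and polls them on a dedicated core needs no locks and no interrupts, which is part of what makes throughput scale so predictably with CPU.

```rust
// A highly simplified, single-threaded illustration of a polled,
// run-to-completion I/O model. Names here are illustrative only.
use std::collections::VecDeque;

// A hypothetical I/O request: some data plus a completion callback.
struct IoRequest {
    data: Vec<u8>,
    on_complete: Box<dyn FnOnce(usize)>,
}

// The "reactor" owns its queue outright, so no locks are needed:
// one core, one reactor, run-to-completion.
struct Reactor {
    submission_queue: VecDeque<IoRequest>,
    completed: usize,
}

impl Reactor {
    fn new() -> Self {
        Reactor { submission_queue: VecDeque::new(), completed: 0 }
    }

    // Submitting work is just pushing onto a queue the reactor already owns.
    fn submit(&mut self, req: IoRequest) {
        self.submission_queue.push_back(req);
    }

    // One poll iteration: drain whatever is pending and run each request
    // to completion. No interrupts, no context switches, no blocking.
    fn poll(&mut self) -> usize {
        let mut processed = 0;
        while let Some(req) = self.submission_queue.pop_front() {
            let bytes = req.data.len(); // "perform" the I/O
            (req.on_complete)(bytes);   // invoke the completion callback inline
            processed += 1;
        }
        self.completed += processed;
        processed
    }
}

fn main() {
    let mut reactor = Reactor::new();
    reactor.submit(IoRequest {
        data: vec![0u8; 4096],
        on_complete: Box::new(|n| println!("wrote {} bytes", n)),
    });
    // In a real poller this loop would spin on a dedicated core;
    // here we simply poll until the queue is drained.
    while reactor.poll() > 0 {}
    println!("requests completed: {}", reactor.completed);
}
```

Because each core runs its own reactor over data it alone owns, adding cores adds capacity without contention, which is the intuition behind the near-linear scaling mentioned above.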
At the heart of Mayastor is a construct we call the “nexus”. For each Persistent Volume Claim (PVC) bound to a Mayastor Storage Class, the control plane (which we call “MOAC”), in concert with the Mayastor CSI plugins, creates a new nexus instance which acts as the “storage router” for the Persistent Volume (PV) being provisioned. This nexus instance accepts and manages I/O requests for that PV, directing them according to the configuration it currently holds for that PV, all via an SPDK-based poller (the “reactor”). In a Mayastor CAS deployment, the nexus forms the primary abstraction between PVs and the physical devices of the persistence layer. In the simplest case, this redirection will be to a block device attached to the same worker node on which the nexus instance is homed. But the nexus is also capable of performing transformations on the I/O passing through it. For example, for reasons of availability and durability, we might wish to maintain more than one copy of the data contained by a PV. The nexus supports this by dispatching copies of any writes received for the volume to replicas hosted on other Mayastor Storage Nodes within the cluster (the actual replica count is defined by the Volume’s Storage Class). Only when all replicas have acknowledged their writes will the nexus signal completion of the transaction back to the consumer. That is to say, policy-based workload protection in Mayastor is based on synchronous replication. The data paths between a nexus and any remote replicas use NVMe-oF as their transport. Furthermore, as is consistent across all OpenEBS components, both initiator and target are implemented as user-space entities; this means that there is no dependency on a host kernel’s support for NVMe-oF.
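As a rough illustration of that synchronous replication behavior, the Rust sketch below fans a write out to a set of replicas and only reports completion once every one of them has acknowledged it. The trait and types are hypothetical, not Mayastor’s real interfaces, and where the real nexus dispatches these writes asynchronously through its reactor, this sketch serializes them for brevity.

```rust
// Illustrative only: a write to the "nexus" completes only once
// every replica has durably acknowledged it.

trait Replica {
    // Returns Ok(()) once the replica has accepted the write.
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<(), String>;
}

struct Nexus {
    replicas: Vec<Box<dyn Replica>>,
}

impl Nexus {
    // Fan the write out to every replica; only signal success to the
    // consumer when all of them have acknowledged.
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<(), String> {
        for replica in self.replicas.iter_mut() {
            replica.write(offset, data)?; // any failure degrades the volume write
        }
        Ok(())
    }
}

// A toy in-memory "replica", standing in for a local block device
// or a remote NVMe-oF target.
struct MemReplica {
    storage: Vec<u8>,
}

impl Replica for MemReplica {
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<(), String> {
        let start = offset as usize;
        let end = start + data.len();
        if end > self.storage.len() {
            return Err("write beyond end of replica".to_string());
        }
        self.storage[start..end].copy_from_slice(data);
        Ok(())
    }
}

fn main() {
    let replicas: Vec<Box<dyn Replica>> = vec![
        Box::new(MemReplica { storage: vec![0; 1024] }),
        Box::new(MemReplica { storage: vec![0; 1024] }),
        Box::new(MemReplica { storage: vec![0; 1024] }),
    ];
    let mut nexus = Nexus { replicas };
    // The consumer sees completion only after all three replicas ack.
    nexus.write(0, b"hello").expect("replicated write failed");
    println!("write acknowledged by all replicas");
}
```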
Whilst 0.3.0 is designated an alpha release, it’s nonetheless an important milestone for the project, since it marks “the end of the beginning” for the delivery of the fundamental design and differentiating characteristics of Mayastor. With 0.3.0 we believe we have approached feature completeness; depending on feedback from users and what we learn in the coming weeks, 0.3.0 may be our last alpha release.
I/O flowing within a nexus to any of its remote replicas has used NVMe-oF exclusively since the initial release, but this hasn’t been the case when it comes to exporting the corresponding Volume back to Kubernetes for mounting and consumption. In previous versions of Mayastor it has been necessary for the user to choose either iSCSI or NBD (the Network Block Device protocol) for this, by specifying the desired transport type within the Storage Class on which the corresponding PVC is based. With the release of 0.3.0, the Mayastor CSI plugin will include the option to use NVMe-oF for “front-end” duties as well, allowing for an end-to-end high-performance data path. Whilst remote replicas are always shared with the nexus over a user-space implementation of NVMe-oF, the export of a Mayastor Volume requires that the host OS provide its own NVMe initiator to make the connection to the nexus’ NVMe-oF user-space target. This mandates the use of a kernel version which supports that feature. (Export over iSCSI provides a fall-back option, albeit with reduced performance, for those systems which are unable to provide the prerequisite functionality.)
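The decision involved can be pictured along the lines of the following Rust sketch: if the Storage Class asks for NVMe-oF but the worker node’s kernel can’t supply an initiator, fall back to iSCSI. The module-probe check, the enum, and the function names here are assumptions made for illustration, not the CSI plugin’s actual implementation.

```rust
// A hedged sketch of front-end transport selection with an iSCSI fall-back.
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ShareProtocol {
    NvmeOverFabrics, // requested via the Storage Class
    Iscsi,           // fall-back for hosts without an NVMe-oF initiator
}

// Crude, illustrative check: is the NVMe-over-TCP initiator module loaded?
fn host_supports_nvmf() -> bool {
    Path::new("/sys/module/nvme_tcp").exists()
}

fn choose_frontend(requested: ShareProtocol) -> ShareProtocol {
    match requested {
        ShareProtocol::NvmeOverFabrics if host_supports_nvmf() => ShareProtocol::NvmeOverFabrics,
        ShareProtocol::NvmeOverFabrics => {
            eprintln!("kernel lacks an NVMe-oF initiator; falling back to iSCSI");
            ShareProtocol::Iscsi
        }
        other => other,
    }
}

fn main() {
    let selected = choose_frontend(ShareProtocol::NvmeOverFabrics);
    println!("volume will be exported over {:?}", selected);
}
```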
This release also marks the introduction of autonomous rebuild capabilities for Volume replicas. Failure in production is a certainty, not merely a likelihood - whether through device failure or through wider-reaching issues such as node failure or network partition - and as an infrastructure component, Mayastor must detect and correct such failures when serving workloads that do not already address this level of resilience themselves, and do so without human intervention wherever possible. Basic error conditions, such as the count of error responses from a replica exceeding a defined threshold, or the replica becoming unresponsive, will now result in that replica being marked as faulted. MOAC’s response to the available replica count of a Volume falling below the desired count defined within its Storage Class is to instruct a Mayastor Storage Node to create a new, “empty” replica and to share it with the degraded nexus. The introduction of a new replica to the nexus causes it to start a rebuild process, bringing the replica into synchronization with the others. This process can complete successfully without having to suspend workload I/O to the affected PV. For now, the detection process is rudimentary: simple timeout and error thresholds, configurable by the user. However, the foundation has been laid for the introduction of more sophisticated, heuristic-based approaches in the future.
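In spirit, that detection logic amounts to something like the sketch below: a replica is marked faulted once its error count crosses a configurable threshold or it stops responding within a configurable timeout. The field names and values are invented for illustration; they are not Mayastor’s actual tunables.

```rust
// A minimal sketch of threshold-based fault detection for a replica.
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum ReplicaState {
    Online,
    Faulted,
}

struct ReplicaHealth {
    state: ReplicaState,
    io_errors: u32,
    last_response: Instant,
}

struct FaultPolicy {
    max_io_errors: u32,         // user-configurable error threshold
    response_timeout: Duration, // user-configurable unresponsiveness timeout
}

impl ReplicaHealth {
    // Re-evaluate the replica on every error report or health tick.
    fn evaluate(&mut self, policy: &FaultPolicy) {
        let unresponsive = self.last_response.elapsed() > policy.response_timeout;
        if self.io_errors > policy.max_io_errors || unresponsive {
            // Faulting the replica degrades the nexus; the control plane can
            // then provision a fresh replica and trigger a rebuild.
            self.state = ReplicaState::Faulted;
        }
    }
}

fn main() {
    let policy = FaultPolicy {
        max_io_errors: 10,
        response_timeout: Duration::from_secs(30),
    };
    let mut replica = ReplicaHealth {
        state: ReplicaState::Online,
        io_errors: 11,
        last_response: Instant::now(),
    };
    replica.evaluate(&policy);
    assert_eq!(replica.state, ReplicaState::Faulted);
    println!("replica state: {:?}", replica.state);
}
```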
As we move towards the beta release, our focus will be on continuing to strengthen Mayastor’s ability to detect and recover from failures, both internal and exogenous. For example, should a Mayastor node be lost or disconnected from the cluster, applications using Volumes hosted by any nexus on that failed node must be able to continue doing so uninterrupted. This calls for the introduction of multi-pathing. We’ll also be addressing some of the current limitations of the Mayastor Pool, such as membership count (a pool can only contain a single disk device right now). OpenEBS has a lot of prior experience in various approaches to pooling underlying devices - and managing them - which we plan to leverage at the device layer.
Mayastor version 0.3.0 will be released on the 15th of August as a critical part of the OpenEBS 2.0 release.