An Overview of Storage in Kubernetes
Kubernetes supports a powerful storage architecture that can be complex to implement correctly. The Kubernetes orchestrator relies on volumes, which are abstracted storage resources that help save and share data between ephemeral containers. Because these storage resources abstract the underlying infrastructure, volumes enable dynamic provisioning of storage for containerized workloads.
In Kubernetes, shared storage is typically achieved by mounting volumes and connecting to an external filesystem or block storage solution. Container Attached Storage (CAS) is a relatively new solution that allows Kubernetes administrators to deploy storage as containerized microservices in a cluster. The CAS architecture makes workloads more portable and makes it simpler to modify storage based on application needs. Because CAS is deployed per workload or per cluster, it also eliminates the cross-workload and cross-cluster blast radius of traditional shared storage.
This article compares CAS with traditional shared storage, exploring their architectures, similarities, and differences.
Container Attached Storage:
Container Attached Storage (CAS) is a solution for stateful workloads that deploys storage as a cluster running in the cloud or on-premises. Unlike traditional storage options, where storage is a shared filesystem or block storage running externally, CAS provides storage controllers that can be managed by Kubernetes. These storage controllers can run anywhere a Kubernetes distribution runs, whether on top of traditional shared storage systems or on managed storage services like Amazon EBS. Data stored in CAS is accessed directly from containers within the cluster, thereby significantly reducing read/write times.
CAS leverages the container orchestrator’s environment to enable persistent storage. The CAS software has storage targets in containers that run as services. If desired, these services are replicated as microservice-based storage replicas that can easily be scheduled and scaled independently of each other. CAS services can be orchestrated using Kubernetes or any other orchestration platform as containerized workloads, ensuring the autonomy and agility of software development teams.
For any CAS solution, the cluster is typically divided into two layers:
- The control plane consists of the storage controllers, storage policies, and instructions on how to configure the data plane. Control plane components are responsible for provisioning volumes and other storage-related tasks.
- The data plane components receive and execute instructions from the control plane on how to save and access container information. The main element of the data plane is the storage engine, which implements pooled storage and is responsible for the I/O path of the volume. For example, OpenEBS, a popular CAS solution, offers storage engines including Mayastor, cStor, Jiva and OpenEBS LocalPV. Prominent users of OpenEBS include the CNCF, ByteDance (TikTok), Optro, Flipkart, Bloomberg and others.
Key characteristics of the CAS architecture:
- Container Attached Storage is built primarily to run on Kubernetes and other cloud-native container orchestrators. This makes the solution inherently platform-agnostic and portable, so it can be deployed on any platform without the inconvenience of vendor lock-in.
- CAS decomposes storage controllers into constituent units that can be scaled and run independently.
- Every storage controller is attached to a Persistent Volume and typically runs in user space, achieving storage granularity and independence from the underlying operating system.
- Control plane entities are deployed as Custom Resource Definitions that deal with physical storage entities such as disks.
- Data plane entities are deployed as a collection of Pods running in the same cluster as the workload.
- The CAS architecture can offer synchronous replication to provide additional availability.
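As a rough illustration of the synchronous replication mentioned in the last point, the sketch below models a volume whose writes are acknowledged only after every healthy replica has stored the data. It is a toy in-memory model, not the implementation of OpenEBS or any other CAS engine; the `Replica` and `ReplicatedVolume` classes are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical data-plane replica backed by an in-memory block map."""
    name: str
    healthy: bool = True
    blocks: dict[int, bytes] = field(default_factory=dict)

    def write(self, block_id: int, data: bytes) -> bool:
        # A failed replica cannot acknowledge the write.
        if not self.healthy:
            return False
        self.blocks[block_id] = data
        return True


@dataclass
class ReplicatedVolume:
    """Hypothetical storage target that fans each write out to its replicas."""
    replicas: list[Replica]

    def write(self, block_id: int, data: bytes) -> None:
        healthy = [r for r in self.replicas if r.healthy]
        if not healthy:
            raise IOError("no healthy replica available")
        # Synchronous replication: return (acknowledge) only after every
        # healthy replica has stored the block.
        for replica in healthy:
            if not replica.write(block_id, data):
                raise IOError(f"replica {replica.name} failed during write")

    def read(self, block_id: int) -> bytes:
        # Any healthy replica that holds the block can serve the read.
        for replica in self.replicas:
            if replica.healthy and block_id in replica.blocks:
                return replica.blocks[block_id]
        raise KeyError(block_id)


volume = ReplicatedVolume([Replica("r1"), Replica("r2"), Replica("r3")])
volume.write(0, b"hello")
volume.replicas[0].healthy = False   # simulate losing one replica
data_after_failure = volume.read(0)  # served by one of the remaining replicas
```

Because the write returns only after all healthy replicas hold the block, losing a single replica afterwards does not lose acknowledged data. A real engine would also have to handle rebuilds, quorum, and partial failures, which this sketch leaves out.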
When to Use:
Container Attached Storage is steadily becoming the de facto standard for persistent storage of stateful Kubernetes workloads. CAS most closely resembles the Direct Attached Storage that many current workloads expect, including NoSQL databases, logging, machine learning pipelines, Kafka and Pulsar. Many workload communities and users have embraced CAS. CAS also allows small teams to retain control over their workloads. In short, CAS may be preferred where:
- The workloads expect local storage
- Teams want to be able to efficiently turn local storage, including disks or cloud volumes, into volumes on demand for Kubernetes workloads
- Performance is a concern
- The loose coupling of the architecture should be maintained at the storage layer
- Increased density of workloads on hosts is desired
- Small-team autonomy should be maintained
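To make "volumes on demand" more concrete, here is a minimal sketch of the two Kubernetes objects typically involved in dynamic provisioning: a StorageClass and a PersistentVolumeClaim that references it. The field names follow the standard Kubernetes API, but the provisioner string, object names, and size are assumptions chosen for illustration; a real CAS project documents its own provisioner and parameters.

```python
import json

# Hypothetical provisioner name; substitute the one your CAS solution documents.
PROVISIONER = "example.com/cas-local"

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "cas-local"},
    "provisioner": PROVISIONER,
    # Delay binding until a Pod is scheduled, so the volume can be created
    # on the node where the workload will actually run.
    "volumeBindingMode": "WaitForFirstConsumer",
    "reclaimPolicy": "Delete",
}


def make_claim(name: str, size: str, storage_class_name: str) -> dict:
    """Build a PersistentVolumeClaim that requests storage from a class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim for a Kafka broker's data directory.
claim = make_claim("kafka-data", "10Gi", storage_class["metadata"]["name"])

print(json.dumps(storage_class, indent=2))
print(json.dumps(claim, indent=2))
```

The dictionaries mirror the manifest structure and can be serialized to JSON or YAML. The `WaitForFirstConsumer` binding mode is commonly used with node-local storage because it lets the scheduler pick the node before the volume is created.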
Traditional Shared Storage:
Shared storage was designed to allow multiple users and machines to access and store data in a pool of devices. It provided additional availability to workloads that were unable to provide for their own availability. Shared storage also worked around the poor performance of the underlying disks, which at the time could deliver no more than 150 I/O operations per second. Today's drives can be 10,000 times more performant, far exceeding the performance requirements of most workloads.
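The arithmetic behind that comparison can be worked through in a few lines. The 150 IOPS figure and the 10,000x factor come from the paragraph above; the 20,000 IOPS workload is an assumed number used only to show the scale of the difference.

```python
legacy_disk_iops = 150      # per-disk figure quoted in the text
speedup_factor = 10_000     # "10,000 times more performant"

modern_drive_iops = legacy_disk_iops * speedup_factor
print(modern_drive_iops)    # 1500000

# Hypothetical workload requirement (an assumption, not from the article).
workload_iops = 20_000

# Ceiling division: number of legacy disks needed to reach that rate.
legacy_disks_needed = -(-workload_iops // legacy_disk_iops)
print(legacy_disks_needed)  # 134

# Share of a single modern drive consumed by the same workload.
print(f"{workload_iops / modern_drive_iops:.1%}")  # 1.3%
```

Under these assumptions, a workload that once needed an array of more than a hundred pooled disks fits within a small fraction of one modern drive. This is the reason pooling disks for performance matters less than it once did.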
A shared storage infrastructure typically consists of block storage systems in Storage Area Networks (SANs) or file system based storage devices in Network Attached Storage (NAS) configurations.
The storage industry was once growing rapidly, with growth rates of 30% to 50% year over year in the late 1990s and early 2000s. In the 2010s this growth rate moderated, and in certain years it stopped entirely. In the 2020s growth resumed, but at a rate much slower than the exponential growth in the amount of data stored. Meanwhile, Direct Attached Storage and cloud storage each grew more quickly in terms of capacity shipped and overall spending.
In traditional shared storage, all nodes in a network share the same physical storage resources but have their own private memory and processing devices. Files and other data can be accessed by any machine connected to the central storage.
For a Kubernetes application, traditional shared storage is first implemented by using monolithic storage software to virtualize physical storage resources, which could be bare-metal servers, SAN/NAS networks or block storage solutions. The software then connects to Persistent Volumes that store cluster data. Each Persistent Volume (PV) is bound to a Persistent Volume Claim (PVC), which application Pods use to request a portion of the shared storage.
Both CAS and shared storage can utilize the Container Storage Interface (CSI). CSI is used to issue commands to the underlying storage, such as requests to provision a PV or to expand or snapshot its capacity.
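The claim-and-mount relationship can be sketched as a pair of manifests: a PVC that requests capacity and a Pod that mounts it. Field names follow the standard Kubernetes API; the claim name, the `shared-san` storage class, the mount path, and the two helper functions are hypothetical and exist only to show how the pieces reference each other.

```python
import json

CLAIM_NAME = "app-data"  # hypothetical claim name

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": CLAIM_NAME},
    "spec": {
        "storageClassName": "shared-san",  # hypothetical class for a shared array
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "5Gi"}},
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "nginx",
            # The mount refers to a Pod-level volume by name.
            "volumeMounts": [{"name": "data", "mountPath": "/var/lib/app"}],
        }],
        # The Pod-level volume refers to the PVC by claim name.
        "volumes": [{
            "name": "data",
            "persistentVolumeClaim": {"claimName": CLAIM_NAME},
        }],
    },
}


def claims_used_by(pod_manifest: dict) -> set[str]:
    """Return the names of all PVCs that a Pod manifest references."""
    return {
        v["persistentVolumeClaim"]["claimName"]
        for v in pod_manifest["spec"].get("volumes", [])
        if "persistentVolumeClaim" in v
    }


def unresolved_mounts(pod_manifest: dict) -> set[str]:
    """Return volumeMount names that no Pod-level volume declares."""
    declared = {v["name"] for v in pod_manifest["spec"].get("volumes", [])}
    mounted = {
        m["name"]
        for c in pod_manifest["spec"]["containers"]
        for m in c.get("volumeMounts", [])
    }
    return mounted - declared


print(json.dumps(pod, indent=2))
```

The Pod never names a PV directly. It names the claim, and Kubernetes binds the claim to a suitable PV, which is what lets the same Pod definition run against either shared storage or CAS.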
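The three operations named above (provision, expand, snapshot) correspond to controller RPCs that the CSI specification calls `CreateVolume`, `ControllerExpandVolume`, and `CreateSnapshot`. The sketch below is a toy in-memory stand-in that borrows those names in snake_case. Its signatures are simplified assumptions and do not match the real gRPC messages, and the `InMemoryCsiDriver` class is hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

GiB = 1024 ** 3


@dataclass
class InMemoryCsiDriver:
    """Toy stand-in for a CSI controller plugin; not a real gRPC service."""
    volumes: dict[str, int] = field(default_factory=dict)    # volume id -> capacity in bytes
    snapshots: dict[str, str] = field(default_factory=dict)  # snapshot id -> source volume id
    _ids: count = field(default_factory=count, repr=False)

    def create_volume(self, name: str, capacity_bytes: int) -> str:
        """Provision a volume and return its identifier."""
        volume_id = f"vol-{next(self._ids)}-{name}"
        self.volumes[volume_id] = capacity_bytes
        return volume_id

    def controller_expand_volume(self, volume_id: str, capacity_bytes: int) -> int:
        """Grow a volume; this sketch rejects shrinking."""
        if capacity_bytes < self.volumes[volume_id]:
            raise ValueError("shrinking a volume is not supported")
        self.volumes[volume_id] = capacity_bytes
        return capacity_bytes

    def create_snapshot(self, volume_id: str, name: str) -> str:
        """Record a snapshot of an existing volume and return its identifier."""
        if volume_id not in self.volumes:
            raise KeyError(volume_id)
        snapshot_id = f"snap-{next(self._ids)}-{name}"
        self.snapshots[snapshot_id] = volume_id
        return snapshot_id


driver = InMemoryCsiDriver()
vol = driver.create_volume("pg-data", 10 * GiB)
driver.controller_expand_volume(vol, 20 * GiB)
snap = driver.create_snapshot(vol, "nightly")
```

Because the orchestrator talks only to this interface, the same provisioning, expansion, and snapshot requests can be served by a CAS engine or by a traditional array, provided each supplies a driver.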
A typical traditional shared storage architecture:
- Embraces centralized, consolidated storage for block and file storage systems, allowing administration from a single interface.
- Is divided into three distinct layers: the host tier, which contains client machines; the fabric layer, which includes switches and other networking devices; and the storage layer, which includes the controllers used to read and write data on physical disks.
- Integrates redundancy into the design of storage devices, allowing systems to withstand failure to a sizable degree.
To scale up traditional shared storage, additional storage devices must be deployed and configured into the existing array.
When to Use:
Shared storage is used to manage large amounts of data generated and accessed by a number of different machines, because traditional shared storage is designed to deliver high performance for large files without bottlenecks or downtime. Shared storage is also the go-to storage solution for organizations that depend on collaboration between teams. As data and files are managed centrally, shared storage allows efficient version control and consolidated information management. Traditional shared storage is also used to eliminate the need for multiple drives containing the same information, which reduces duplication and frees up storage capacity.
CAS vs. Shared Storage
The two storage options vary greatly in how they persist application data. While traditional shared storage relies on an external array of storage devices to persist data, CAS uses containers within an orchestrated environment.
The following are a few similarities and differences between CAS and traditional shared storage:
- Both CAS and traditional shared storage offer highly available storage for applications. CAS achieves high availability using data Pod replicas that keep storage available to the CAS cluster, while traditional shared storage uses a redundant design so that the storage system can withstand failure.
- Both options provide fast storage for critical applications. CAS uses agile microservices to ensure quick I/O times, while shared storage allows multiple machines to quickly read and write data on a shared pool of storage devices, reducing the need to create connections between individual machines.
- Both solutions accommodate software-defined storage, which combines the performance of physical devices with the agility of software.
- Both can utilize the Container Storage Interface (CSI) to issue the commands to the underlying storage.
- Both can be open source, extending the openness of Kubernetes to the data layer. Container Attached Storage appears somewhat more likely to be open source, although this has not been determined conclusively.
- CAS follows a container-based microservice framework for storage, which means teams can take advantage of the agility and portability of containers to achieve faster, more efficient storage. In contrast, traditional shared storage involves different virtual or physical machines reading from and writing to a shared pool of storage devices, which increases latency and reduces access speeds.
- CAS is platform-agnostic. This means CAS-based storage solutions can run either on-premises or in the cloud without requiring extensive configuration changes. Shared storage, by contrast, relies on kernel modifications, making it inefficient to deploy for workloads across different environments.
- While traditional shared storage relies on consolidated monolithic storage software, CAS runs in user space, enabling independent management capabilities for efficient storage administration at a granular level.
- CAS allows linear scalability, since storage containers can be brought up as required, while in traditional shared storage, scaling involves adding newer devices to an existing storage array.
Designed for Kubernetes, CAS enables agility, granularity and linear scalability, making it a favourite for cloud-native applications. Traditional shared storage offers a mature stack of storage technology, but it falls short in persisting storage for stateful applications because it inherently lacks linear scalability. CAS is a novel solution that allows storage controllers to run in user space, allowing maximum scalability.
OpenEBS, a popular CAS-based storage solution, has helped several enterprises run stateful workloads. Originally developed by MayaData, OpenEBS is now a CNCF project with a vibrant community of organizations and individuals alike. This popularity was also evident in CNCF's 2020 Survey Report, which highlighted MayaData (OpenEBS) in the top-5 list of popular storage solutions.
Canonical definition of Container Attached Storage:
To read Adopter use-cases or contribute your own, visit: https://github.com/openebs/openebs/blob/master/ADOPTERS.md.
CNCF 2020 Survey Report:
OpenEBS LocalPV Quick Start Guide: