Containerized applications are ephemeral, which means any data created by a container is lost as soon as the process terminates. This requires a pragmatic approach to data persistence and management when orchestrating containers using Kubernetes. To deal with this, the Kubernetes orchestration platform uses Volume plugins to isolate storage consumption from provisioned hardware.
A Persistent Volume (PV) is a Kubernetes API resource that provisions persistent storage for PODs. Cluster resources can use a PV construct to mount any storage unit -- file system folders or block storage options -- to Kubernetes nodes. PODs request for a PV using Persistent Volumes Claims (PVC). These storage integrations and other features make it possible for containerized applications to share data with other containers and preserve the container state.
PVs can be provisioned statically by the cluster administrator or dynamically using Storage Classes. Some important features that distinguish different Storage Classes include Capacity, Volume Mode, Access Modes, Performance and Resiliency. When a Local Disk is attached directly to a single Kubernetes node, it is known as a Local PV which provides the best performance and is only accessible from a single node where it is attached. This post explores why LocalPV and NVMe storage should be used for Kubernetes workloads.
NVMe is a high-speed access protocol that delivers low latency and high throughput for SSD storage devices by connecting them to the processor through a PCIe interface. Early SSDs connected to the CPU through SATA or Serial Attached SCSi (SAS). These relied on legacy standards customized for Hard Disk speeds which were considered inefficient since each connection to the processor remained limited by synchronized locking or the SAS Host Bus Adapter (HBA).
To overcome this challenge, NVMe unlocks the true potential of flash storage using the Peripheral Component Interconnect Express (PCIe) that supports high performance, Non-Uniform Memory Access (NUMA). NVMe also supports parallel processing, with 64K Input-Output queues with each queue having 64K entries. This high-bandwidth, low-latency storage hosts applications that can create as many I/O queues as system configuration, workload and the NVMe controller allows.
Following a NUMA based storage protocol, NVMe allows different CPUs to manage I/O queues, using various arbitration mechanisms. Modern enterprises are data-driven, with users and devices generating huge amounts of data that may overwhelm companies. By enhancing the capabilities of multi-core CPUs, NVMe provides low latency and fast transfer rates for better access and processing of large data sets. NVMe devices typically rely on NAND Flash Memory that can be hosted on various SSD form factors including normal SSDs, U2 Cards, M2 Cards, and PCIe Add-In Cards.
NVMe over Fabrics (NVMe-oF) extends the advantages of NVMe storage access by implementing the NVMe protocol for remotely connected devices. The architecture allows one node to directly access a storage device of another computer over several transport protocols.
In NVMe architecture, the host computer is connected to SSD storage devices via a high throughput Host-Controller Interface. The storage service is composed of three main elements:
To submit queues to the Input/Output, the NVMe controller utilizes Memory-Mapped Controller Registers and the host system’s DRAM. The number of mapped registers determines the number of parallel I/O operations the protocol can support.
A typical NVMe storage architecture
PCIe reduces the need for various abstract implementation layers, allowing for faster, efficient storage. Some benefits of using NVMe for storage include:
While most storage systems used to persist data for Kubernetes clusters are remote and independent of the source nodes, it is possible to attach a local disk directly to a single node. Locally attached storage typically guarantees higher performance and tighter security than remote storage.
A Kubernetes LocalPV represents a portion of local disk storage that can be used for data persistence in StatefulSets. With LocalPV, the local disk is specified as a persistent volume that can be consumed with the same PVC and Storage Class abstractions used for remote storage. This results in low latency storage that is suitable for fault-tolerant use-cases such as:
Before the introduction of LocalPV volumes, hostPath volumes were used for accessing local storage. There were certain challenges while orchestrating local storage with hostPath as it didn’t support important Kubernetes features, such as StatefulSets. Additionally, hostPath volumes required separate operators for disk management, POD scheduling and topology, making them difficult to use in production environments.
LocalPV volumes were designed in response to issues with the scheduling, disk accounting and portability of hostPath volumes. One of the major distinctions is that the Kubernetes control plane knows the Node that owns a LocalPV. With hostPath, data is lost when a POD referencing the volume is scheduled to a different node. LocalPV volumes can only be referenced using a Persistent Volume Claim (PVC) while hostPath volumes can be referenced both directly in the POD definition file and via PVC.
Workloads can be configured to access NVMe SSDs on a local machine using LocalPV and a Persistent Volume Claim, or StatefulSet with volumeClaim attributes. This section explores how to attach a local disk to a Kubernetes cluster with NVMe storage configured.
$ cat sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: openebs-device-sc
allowVolumeExpansion: true
parameters:
devname: "test-device"
provisioner: device.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
Check the doc on storageclasses to know all the supported parameters for Device LocalPV.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: openebs-device-sc
allowVolumeExpansion: true
parameters:
devname: "test-device"
provisioner: device.csi.openebs.io
allowedTopologies:
- matchLabelExpressions:
- key: kubernetes.io/hostname
values:
- device-node1
- device-node2
The above storage class tells that device with meta partition test-device is available on nodes device-node1 and device-node2 only. The Device CSI driver will create volumes on those nodes only.
The scheduling algorithm currently only accounts for either the number of volumes or total capacity occupied from a device and does not account for other factors like available cpu or memory while making scheduling decisions. So if you want to use node selector/affinity rules on the application pod, or have cpu/memory constraints, a Kubernetes scheduler should be used. To make use of kubernetes scheduler, you can set the volumeBindingMode as WaitForFirstConsumer in the storage class. This will cause a delayed binding, i.e Kubernetes scheduler will schedule the application pod first, and then it will ask the Device driver to create the PV. The driver will then create the PV on the node where the pod is scheduled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: openebs-device-sc
allowVolumeExpansion: true
parameters:
devname: "test-device"
provisioner: device.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
Please note that once a PV is created for a node, the application using that PV will always get scheduled to that particular node only, as PV will be sticky to that node. The scheduling algorithm by Device driver or kubernetes will come into picture only during the deployment time. Once the PV is created, the application can not move anywhere as the data is there on the node where the PV is.
$ cat pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: csi-devicepv
spec:
storageClassName: openebs-device-sc
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 4Gi
$ cat fio.yaml
apiVersion: v1
kind: Pod
metadata:
name: fio
spec:
restartPolicy: Never
containers:
- name: perfrunner
image: openebs/tests-fio
command: ["/bin/bash"]
args: ["-c", "while true ;do sleep 50; done"]
volumeMounts:
- mountPath: /datadir
name: fio-vol
tty: true
volumes:
- name: fio-vol
persistentVolumeClaim:
claimName: csi-devicepv
Some benefits of integrating LocalPV into clusters using NVMe for storage include:
Non-Volatile Memory Express (NVMe) enhances data storage and access by leveraging the performance benefits of flash memory for SSD based storage. By connecting storage devices to the CPU directly via the PCIe interface, data companies eliminate the bottlenecks associated with SATA or SAS based access. LocalPV reduces the data path between storage and Kubernetes nodes by mounting a volume directly on a Kubernetes node. This results in higher throughput and IOPS, suitable for fault-tolerant stateful applications.
OpenEBS by MayaData is one of the popular open-source, agile storage stacks for performance-sensitive databases orchestrated by Kubernetes. Mayastor, OpenEBS’s latest storage engine, delivers very low overhead versus the performance capabilities of underlying devices.. OpenEBS Mayastor does not require NVMe devices or that workloads consume NVMe, although in both cases performance will increase. OpenEBS Mayastor is unique currently amongst open source storage projects in utilizing NVMe internally, to communicate to options OpenEBS replicas. To learn more about how OpenEBS Mayastor, leveraging NVMe as a protocol, performs when leveraging some of the fastest NVMe devices currently available on the market, visit this article.
OpenEBS Mayastor builds a foundational layer that enables workloads to coalesce and control storage as needed in a declarative, Kubernetes-native way. While doing so, the user can focus on what's important, that is, deploying and operating stateful workloads.
If you’re interested in trying out Mayastor for yourself, instructions for how to set up your own cluster, and run a benchmark like `fio` may be found at https://docs.openebs.io/docs/next/mayastor.html.
Related Blogs:
https://blog.mayadata.io/the-benefits-of-using-nvme-for-kubernetes
https://blog.mayadata.io/mayastor-nvme-of-tcp-performance