OpenEBS, the leading containerized block storage solution for Kubernetes, is built on the Container Attached Storage (CAS) architecture and uses ‘cStor’ (pronounced “see-stor”) as one of its storage engines. This engine has been available in the field for more than half a year now.
cStor uses ZFS behind the scenes by running it in user space. In this post, I will highlight a few challenges our team faced while building this data engine.
When running in user space, ZFS relies on a user space binary called ‘ztest.’ Before we move on, let’s discuss a few details about this interesting binary. It is created by linking libraries such as libzpool and libzfs, which are built from the same ZFS code that runs in the kernel. These libraries contain the transactional, pooled storage layers, compiled from code nearly identical to the kernel code. This means that all the major functionality of kernel ZFS, such as creating/configuring pools and volumes, performing IOs, and managing disk space usage, is part of the ztest binary. However, it is worth noting that the layers that make volumes accessible, i.e., ZPL/ZVOL, are not part of ztest.
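To make this concrete, here is a minimal sketch (not taken from the cStor source) of how a binary linked against libzpool can bring up the pooled storage layers and open a pool directly, in the style of ztest. `kernel_init()`/`spa_open()` are the libzpool harness entry points, though their exact signatures have shifted across ZoL releases, so treat this as illustrative:

```c
/*
 * Minimal sketch of a user-space binary built on libzpool, in the
 * style of ztest. Not cStor code; signatures vary across ZoL versions.
 */
#include <sys/zfs_context.h>
#include <sys/spa.h>

int
main(void)
{
	spa_t *spa;

	/* Bring up the user-space "kernel" context that libzpool provides. */
	kernel_init(FREAD | FWRITE);

	/*
	 * Open an existing pool directly -- no /dev/zfs IOCTL involved.
	 * ztest similarly builds a vdev nvlist and calls spa_create().
	 */
	VERIFY0(spa_open("testpool", &spa, FTAG));

	/* ... transactional DMU operations against the pool go here ... */

	spa_close(spa, FTAG);
	kernel_fini();
	return (0);
}
```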
To perform configuration-related operations, ztest calls the handlers directly, whereas kernel-space ZFS is driven through the zpool/zfs CLI binaries. Disk operations are another interesting difference: kernel ZFS calls a disk driver (or its wrapper APIs), whereas ztest performs system library calls or IOCTLs on plain files and devices.
Disk operations
In cStor, we followed a similar approach and created a binary called ‘zrepl.’ It is built from libraries similar to those used for ztest and contains the transactional, pooled storage layers. It also must perform various tasks, such as:
- Creation/configuration of pools and zvols
- Supporting zvols to perform IOs on it
- AIO support
- Accessibility of the created zvols
- Replication
- Rebuilding of missing data, etc.
The first two items are related to the support and configuration of ZVOLs in cStor, which we will cover in this blog.
The first challenge is to create/configure pools and zvols for the user space binary ‘zrepl.’
As mentioned above, the zpool/zfs CLI binaries are used for kernel-space ZFS. These binaries send IOCTLs to kernel-space ZFS to create/configure pools. However, this approach won’t work for the user-space ‘zrepl.’
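For context, the existing kernel-space path looks roughly like this: the CLI opens /dev/zfs and issues an IOCTL carrying a `zfs_cmd_t` (in reality this is wrapped by libzfs/libzfs_core). A simplified sketch:

```c
/*
 * Simplified sketch of how the stock zpool/zfs CLIs reach kernel ZFS:
 * open /dev/zfs and issue an IOCTL with a zfs_cmd_t payload. The real
 * code lives behind libzfs/libzfs_core.
 */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/zfs_ioctl.h>	/* zfs_cmd_t, ZFS_IOC_* */

static int
zfs_ioctl_kernel(unsigned long request, zfs_cmd_t *zc)
{
	int fd = open("/dev/zfs", O_RDWR);
	int err;

	if (fd < 0)
		return (-1);
	err = ioctl(fd, request, zc);	/* the kernel handler runs here */
	close(fd);
	return (err);
}
```

With zrepl there is no /dev/zfs to open, so this path has to be replaced.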
One way to accomplish this in user space is to write a REST/gRPC server in `zrepl` that listens for configuration requests and performs the tasks handled today by the zpool/zfs CLI binaries. However, those binaries have evolved over many years, and replicating their functionality would be an immense task. It would also move the project out of sync with upstream ZoL, which likely would not be a good idea.
Instead, we came up with the idea of performing IOCTL redirection over Unix domain sockets. The ‘zrepl’ binary creates a server on a Unix domain socket. Upon receiving a message carrying the IOCTL information to be executed, this server calls the same handler that kernel-space ZFS would invoke on receiving that IOCTL. The zpool/zfs CLI binaries, rather than performing IOCTL calls, are modified to connect to this IOCTL server and send a message with the IOCTL information to be executed.
IOCTL redirection
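A sketch of what such a redirection server can look like is below. The message format and the dispatcher name `uzfs_handle_ioctl()` are illustrative assumptions, not the actual cStor wire protocol:

```c
/*
 * Sketch of the IOCTL redirection idea: zrepl listens on a Unix domain
 * socket, reads a message describing the IOCTL, and dispatches it to
 * the same handler kernel ZFS would run. The socket path, message
 * layout, and uzfs_handle_ioctl() are illustrative assumptions.
 */
#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

#define	UZFS_SOCK	"/tmp/uzfs.sock"

typedef struct uzfs_ioctl_msg {
	unsigned long	ioc_num;	/* e.g. ZFS_IOC_POOL_CREATE */
	char		ioc_buf[8192];	/* serialized zfs_cmd_t / nvlists */
} uzfs_ioctl_msg_t;

/* Assumed dispatcher that calls into the linked-in kernel handler. */
extern int uzfs_handle_ioctl(unsigned long num, void *buf);

void
uzfs_ioctl_server(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int srv = socket(AF_UNIX, SOCK_STREAM, 0);

	strncpy(addr.sun_path, UZFS_SOCK, sizeof (addr.sun_path) - 1);
	unlink(UZFS_SOCK);
	bind(srv, (struct sockaddr *)&addr, sizeof (addr));
	listen(srv, 8);

	for (;;) {
		int c = accept(srv, NULL, NULL);
		uzfs_ioctl_msg_t msg;

		/* A real server would loop on short reads. */
		if (read(c, &msg, sizeof (msg)) == sizeof (msg)) {
			/* Run the handler kernel ZFS would run, then
			 * send its return code back to the CLI. */
			int err = uzfs_handle_ioctl(msg.ioc_num, msg.ioc_buf);
			write(c, &err, sizeof (err));
		}
		close(c);
	}
}
```

On the other side, the modified zpool/zfs binaries replace their `ioctl()` call on /dev/zfs with a `connect()`/`write()` against the same socket path, then read back the handler’s return code. Everything above the transport stays untouched, which keeps the CLI code close to upstream.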
Now we come to the next task: supporting zvols in user space. We implemented this by linking the same ZVOL code of kernel ZFS that creates datasets, objsets, and objects into zrepl, while avoiding device creation over these zvols in user space. We added a 'zvol_state' structure similar to the existing one and overrode the functions that create zvols during pool import. We also added wrappers over the DMU layer to read/write IOs on zvols and enabled the ZIL-related APIs for log/replay. Aside from device creation, all features provided by ZoL over zvols, such as snapshotting, cloning, and send/receive, work with this approach.
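To illustrate, here is a rough sketch of the shape such a DMU write wrapper can take. `dmu_tx_create()`, `dmu_tx_hold_write()`, `dmu_write()`, and `zil_commit()` are the actual ZoL primitives the zvol code uses; the `uzfs_zvol_state_t` structure and the function name are illustrative assumptions, not the cStor source:

```c
/*
 * Sketch of a user-space write wrapper over the DMU layer for a zvol.
 * The dmu_tx_*, dmu_write(), and zil_commit() calls are real ZoL
 * primitives; uzfs_zvol_state_t and this function are assumptions.
 */
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
#include <sys/zil.h>

typedef struct uzfs_zvol_state {
	objset_t	*zv_objset;	/* backing DMU object set */
	zilog_t		*zv_zilog;	/* intent log for crash consistency */
	uint64_t	zv_obj;		/* object holding the volume data */
} uzfs_zvol_state_t;

int
uzfs_zvol_write(uzfs_zvol_state_t *zv, uint64_t off, uint64_t len, void *buf)
{
	dmu_tx_t *tx = dmu_tx_create(zv->zv_objset);
	int err;

	/* Declare the write so the DMU can reserve space in this txg. */
	dmu_tx_hold_write(tx, zv->zv_obj, off, len);
	err = dmu_tx_assign(tx, TXG_WAIT);
	if (err != 0) {
		dmu_tx_abort(tx);
		return (err);
	}
	dmu_write(zv->zv_objset, zv->zv_obj, off, len, buf, tx);
	/* ... a zvol_log_write()-style ZIL record would be queued here ... */
	dmu_tx_commit(tx);

	/* Flush the intent log for synchronous write semantics. */
	zil_commit(zv->zv_zilog, zv->zv_obj);
	return (0);
}
```

The read path is simpler: no transaction is needed, and a wrapper can call `dmu_read()` on the same objset/object directly.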
Before I finish this post, I want to give credit to ZoL (ZFS on Linux), from which this project has been forked.