ZFS-based cStor: the Storage Engine of OpenEBS (built on Kubernetes)

A small introduction about myself: I work for MayaData, a company that develops OpenEBS, an open source CNCF Sandbox project that simplifies the deployment of stateful applications on Kubernetes.


OpenEBS, the leading block storage solution on Kubernetes built on the Container Attached Storage (CAS) architecture, uses ‘cStor’ (pronounced see-stor) as one of its storage engines. cStor has been in use in the field for more than half a year now.

cStor uses ZFS behind the scenes, running it in user space. In this post, I will highlight a few challenges that our team faced while building this data engine.

When running in user space, ZFS uses a user space binary called ‘ztest’. Before we move on, let’s discuss a few details about this binary. It is created by linking libraries such as libzpool and libzfs, which are built from the same ZFS code that runs in the kernel. These libraries contain the transactional, pooled storage layers, compiled from code that is nearly identical to the kernel code. This means that all the major functionality available in kernel ZFS, such as creating and configuring pools and volumes, performing IOs, and managing disk space usage, is part of the ztest binary. However, it is worth noting that the layers that make volumes accessible, i.e., ZPL and ZVOL, are not part of ztest.

To perform configuration-related operations, ztest calls the handlers directly, whereas in the case of kernel space ZFS, the zpool/zfs CLI binaries are used. Another interesting difference is disk operations. ZFS in the kernel calls the disk driver or its wrapper APIs. This is in contrast to ztest, which performs system library calls or IOCTLs.

[Figure: Disk operations]
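As a rough illustration of the contrast above, a user space process can treat a plain file (or a device node) as its "disk" and reach it with ordinary system calls such as pread and pwrite, with no disk driver API involved. The sketch below is only an analogy written in Python, not ZFS code; the file name `vdev.img`, the block size, and the helper names are all hypothetical.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # hypothetical block size chosen for this sketch


def write_block(fd: int, block_no: int, data: bytes) -> None:
    """Write one fixed-size block at its offset via the pwrite system call."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    written = os.pwrite(fd, data, block_no * BLOCK_SIZE)
    if written != BLOCK_SIZE:
        raise OSError("short write")


def read_block(fd: int, block_no: int) -> bytes:
    """Read one fixed-size block from its offset via the pread system call."""
    return os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE)


def disk_demo() -> bytes:
    """Create a file-backed 'disk', write block 3, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vdev.img")  # hypothetical backing file
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, 16 * BLOCK_SIZE)  # size the backing file
            write_block(fd, 3, b"z" * BLOCK_SIZE)
            os.fsync(fd)  # ask the OS to flush the write to stable storage
            return read_block(fd, 3)
        finally:
            os.close(fd)
```

Calling `disk_demo()` returns the block that was just written, which shows the whole I/O path going through plain system calls from user space.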

In cStor, we followed a similar approach and created a binary called ‘zrepl’. It is built using libraries similar to those used for ztest and contains the transactional, pooled storage layers. It must also perform various tasks, such as:

  • Creation/configuration of pools and zvols
  • Support for performing IOs on zvols
  • AIO support
  • Accessibility of the created zvols
  • Replication
  • Rebuilding of missing data, etc.

 

The first two items concern the configuration and support of zvols in cStor, and they are what we will cover in this post.

The first challenge is to create/configure pools and zvols for the user space binary ‘zrepl.’

As mentioned above, the zpool/zfs CLI binaries are used for kernel space ZFS. These CLI binaries send IOCTLs to kernel space ZFS to create/configure pools. However, this approach won’t work for the user space ‘zrepl’, because there is no kernel module on the other end to receive those IOCTLs.

In user space, the most direct way to accomplish this would be to write a REST/gRPC server in `zrepl` that listens for config requests and performs the tasks handled by the current zpool/zfs CLI binaries. However, those CLI binaries have evolved over many years, and replicating their functionality would be an immense task. It would also move the project out of sync with upstream ZoL (ZFS on Linux), which would likely not be a good idea.

We came up with the idea of performing IOCTL redirection using Unix domain sockets. The ‘zrepl’ binary creates a server on a Unix domain socket. Upon receiving a message with the information for an IOCTL that needs to be executed, this server calls the same handler that kernel space ZFS would execute on receiving that IOCTL. The zpool/zfs CLI binaries, rather than performing IOCTL calls, are modified to connect to this IOCTL server and send a message with the information for the IOCTL that needs to be executed.

[Figure: IOCTL redirection]
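The redirection idea can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the actual zrepl code: the wire format (a command number and payload length, followed by the payload), the command numbers, the handler names, and the socket file name are all hypothetical. A server thread plays the role of the IOCTL server, and `redirected_ioctl` plays the role of the modified CLI sending a message instead of issuing an IOCTL.

```python
import errno
import os
import socket
import struct
import tempfile
import threading

HEADER = struct.Struct("!II")  # request: command number, payload length
REPLY = struct.Struct("!iI")   # reply: return code, payload length

# Hypothetical command numbers; real ZFS defines its own IOCTL numbers.
CMD_POOL_CREATE = 1
CMD_POOL_LIST = 2

pools = []  # stand-in for the state the real handlers would manage


def handle_pool_create(payload):
    name = payload.decode()
    if name in pools:
        return errno.EEXIST, b""
    pools.append(name)
    return 0, b""


def handle_pool_list(payload):
    return 0, ",".join(pools).encode()


HANDLERS = {CMD_POOL_CREATE: handle_pool_create, CMD_POOL_LIST: handle_pool_list}


def recv_exact(conn, n):
    """Read exactly n bytes, or raise if the peer closes the connection."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def serve_one_client(server_sock):
    """Server side: decode each message and call the matching handler."""
    conn, _ = server_sock.accept()
    with conn:
        try:
            while True:
                cmd, length = HEADER.unpack(recv_exact(conn, HEADER.size))
                payload = recv_exact(conn, length)
                handler = HANDLERS.get(cmd)
                rc, out = handler(payload) if handler else (errno.ENOTTY, b"")
                conn.sendall(REPLY.pack(rc, len(out)) + out)
        except ConnectionError:
            return  # client disconnected


def redirected_ioctl(sock, cmd, payload=b""):
    """Client side: send the request as a message instead of an IOCTL."""
    sock.sendall(HEADER.pack(cmd, len(payload)) + payload)
    rc, length = REPLY.unpack(recv_exact(sock, REPLY.size))
    return rc, recv_exact(sock, length)


def ioctl_demo():
    pools.clear()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ioctl.sock")  # hypothetical socket path
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)
        worker = threading.Thread(target=serve_one_client, args=(server,))
        worker.start()
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)
            results = [
                redirected_ioctl(client, CMD_POOL_CREATE, b"tank"),
                redirected_ioctl(client, CMD_POOL_CREATE, b"tank"),
                redirected_ioctl(client, CMD_POOL_LIST),
                redirected_ioctl(client, 99),  # unknown command
            ]
        worker.join()
        server.close()
        return results
```

The appeal of this design is that the client and the handlers stay almost unchanged: only the transport between them is swapped, which is what keeps a fork close to its upstream.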

Now we come to the next task of supporting zvols in user space. We implemented this by linking into zrepl the same ZVOL code of kernel ZFS that creates datasets, objsets, and objects. This let us avoid creating devices over these zvols in user space. We added a 'zvol_state' structure similar to the existing one and overrode the functions that create zvols. During the pool import process, we also added wrappers over the DMU layer to read/write IOs onto zvols and enabled the ZIL-related APIs to log and replay them. Aside from device creation, all features provided by ZoL over zvols, such as snapshots, clones, and send/receive, work with this approach.
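To make the shape of this concrete, here is a toy model in Python of a volume that is reached through read/write wrappers rather than through a device node, with an intent log that can be replayed. It is only a conceptual sketch: the `ZvolState` class, its fields, and the `replay` helper are hypothetical and do not mirror the real 'zvol_state' structure, the DMU, or the ZIL APIs.

```python
class ZvolState:
    """Hypothetical user space stand-in for a zvol (no device node involved)."""

    def __init__(self, name, volsize):
        self.name = name
        self.volsize = volsize
        self.data = bytearray(volsize)  # stand-in for the backing object
        self.intent_log = []            # stand-in for logged write records

    def _check(self, offset, length):
        if offset < 0 or length < 0 or offset + length > self.volsize:
            raise ValueError("I/O outside the volume")

    def write(self, offset, buf, log=True):
        """Write wrapper: record the write in the log, then apply it."""
        self._check(offset, len(buf))
        if log:
            self.intent_log.append((offset, bytes(buf)))
        self.data[offset:offset + len(buf)] = buf

    def read(self, offset, length):
        """Read wrapper: return the requested byte range."""
        self._check(offset, length)
        return bytes(self.data[offset:offset + length])


def replay(name, volsize, log_records):
    """Rebuild a volume by re-applying logged writes in order."""
    vol = ZvolState(name, volsize)
    for offset, buf in log_records:
        vol.write(offset, buf, log=False)
    return vol
```

For example, after `vol = ZvolState("pool/vol1", 64)` and `vol.write(8, b"hello")`, calling `replay("pool/vol1", 64, vol.intent_log)` yields a fresh volume whose contents match the original, which is the essence of log and replay.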

Before I finish this post, I want to give credit to ZoL (ZFS on Linux), from which this project has been forked.

Utkarsh Mani Tripathi
Utkarsh is a maintainer of the jiva project and has contributed to building both the control plane and the data plane of OpenEBS. He loves to learn about file systems, distributed systems, and networking. Currently, he is mainly focusing on enhancing jiva and maya-exporter. In his free time, he loves to write poems and make lip-smacking dishes.
Chuck Piercey
Chuck Piercey is a Silicon Valley product manager with experience shipping more than 15 products in several different market segments representing a total of $2.5Bn revenue under both commercial and open source business models. Most recently he has been working for MayaData, Inc. focused on software-defined storage, network, and compute for Kubernetes environments. Chuck occasionally writes articles about the technology industry.
Sagar Kumar
Sagar is a software engineer at MayaData who loves coding and solving real-world problems. He has been playing with Kubernetes for the last couple of years. Currently, he is focused on building OpenEBS Director as the go-to solution for OpenEBS users. In his free time, he loves playing cricket and traveling.