Let's add an opcode to io-uring

Tue, August 1, 2023

All I really wanted…

…was to understand how io-uring works. Honestly. I didn’t expect to end up writing this.

OK, that may be a lie. I probably would have written something and called it “Notes on io_uring implementation” or something. In fact, if I’m being honest, I wanted a follow-up to my checklist for learning io-uring.

That’s when the thought popped into my head: “Hey. What if I implemented a new opcode?”

Well, it’s two days later and here we are.

Fine, but what opcode?

io-uring has a pretty extensive set of supported operations. It can do open(), close(), read()/write(); there is an opcode for just about anything useful you can think of.

Challenge accepted then. Let’s do something useless.

Let’s write some zeroes at an offset in a file, as a single opcode.

You may think - hold on, that’s not useless. Sure, I could do this with a buffer and write(2), but a dedicated, fast opcode could be handy.

Good try, but no. There is already an opcode for it. Sort of.

The fallocate() syscall accepts a flag, FALLOC_FL_ZERO_RANGE, that zeroes out a range of a file at a given offset. And, you guessed it, io-uring has an opcode for fallocate.
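
For comparison, here is a minimal sketch of the "buffer and write(2)" approach mentioned above, using only the Rust standard library. The function name `zero_range_with_buffer` and the chunk size are my own choices for illustration. Unlike FALLOC_FL_ZERO_RANGE, this copies actual zero bytes from userspace instead of letting the filesystem handle the range itself.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Overwrite `len` bytes starting at `offset` with zeroes, using an ordinary
/// buffer and positional writes (the std equivalent of pwrite(2)).
/// Hypothetical helper, not part of any library discussed here.
fn zero_range_with_buffer(file: &File, offset: u64, len: u64) -> io::Result<()> {
    // One fixed-size block of zeroes, reused for every chunk.
    const CHUNK: usize = 4096;
    let zeroes = [0u8; CHUNK];

    let mut done: u64 = 0;
    while done < len {
        // Write at most CHUNK bytes per iteration.
        let n = std::cmp::min(CHUNK as u64, len - done) as usize;
        file.write_all_at(&zeroes[..n], offset + done)?;
        done += n as u64;
    }
    Ok(())
}
```

Calling `zero_range_with_buffer(&file, 0, 5)` on a file opened for writing replaces its first five bytes with NUL bytes and leaves the rest alone.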

But, for ~~a hack job~~ educational purposes, we can definitely mimic¹ how IORING_OP_FALLOCATE is implemented. All we need to do is hardwire the flag.

Sidebar: `file_operations`

file_operations is a structure that holds pointers to the functions a driver or filesystem implements for the files it manages.

In our case, we want to call fallocate() on a file, but we don’t want to worry about what device that file is on, or what filesystem is on that device. I just call file_operations->fallocate() and the correct driver will take care of things for me. Note that this applies to operations in kernel space. In userspace you make a system call like fallocate(2), or go through aio or io-uring, and these end up calling file_operations->fallocate(). As an application programmer, you never have direct access to file_operations.

In the kernel implementation below, you’ll see that we call vfs_fallocate() instead of file_operations->fallocate(). This is because vfs_fallocate() does a lot of checks on flags and permissions that the “raw” fallocate() call assumes have been done.
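
To make the idea concrete, here is a toy userspace model of that arrangement in Rust. It is not kernel code: `FileOperations`, `OpenFile`, `toyfs_fallocate` and `checked_fallocate` are hypothetical names, and the specific checks and error kinds are only illustrative. What it shows is the shape: a table of function pointers per filesystem, plus a checked wrapper that validates its arguments before dispatching through the table.

```rust
use std::io;

/// Toy stand-in for the kernel's `struct file_operations`: a table of
/// function pointers that each driver or filesystem fills in.
struct FileOperations {
    /// `None` models a filesystem that does not implement fallocate.
    fallocate: Option<fn(mode: i32, offset: i64, len: i64) -> io::Result<()>>,
}

/// Toy stand-in for an open file: it remembers how it was opened and which
/// operations table its (imaginary) filesystem provided.
struct OpenFile {
    writable: bool,
    ops: &'static FileOperations,
}

/// The "raw" operation of an imaginary filesystem. It trusts its caller.
fn toyfs_fallocate(_mode: i32, _offset: i64, _len: i64) -> io::Result<()> {
    Ok(())
}

static TOYFS_OPS: FileOperations = FileOperations { fallocate: Some(toyfs_fallocate) };
static NO_FALLOCATE_OPS: FileOperations = FileOperations { fallocate: None };

/// Checked entry point, in the spirit of `vfs_fallocate()`: validate first,
/// then dispatch through the table without knowing which filesystem it is.
fn checked_fallocate(file: &OpenFile, mode: i32, offset: i64, len: i64) -> io::Result<()> {
    if offset < 0 || len <= 0 {
        return Err(io::Error::from(io::ErrorKind::InvalidInput));
    }
    if !file.writable {
        return Err(io::Error::from(io::ErrorKind::PermissionDenied));
    }
    match file.ops.fallocate {
        Some(op) => op(mode, offset, len),
        None => Err(io::Error::from(io::ErrorKind::Unsupported)),
    }
}
```

The caller of `checked_fallocate` never names a filesystem; swapping `TOYFS_OPS` for another table changes the behaviour without touching the call site.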

Kernel Code

The changes we need to make in the kernel are surprisingly limited. At least I hope they are, otherwise I got really lucky.

First, and this should be obvious, add the new opcode:

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 08720c7bd92f..cb937b723f09 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -235,7 +235,7 @@ enum io_uring_op {
        IORING_OP_URING_CMD,
        IORING_OP_SEND_ZC,
        IORING_OP_SENDMSG_ZC,
-
+       IORING_OP_ZERO,
        /* this goes last, obviously */
        IORING_OP_LAST,
 };

Next, let’s create the function that will do the zeroing. As planned, we will duplicate the implementation of IORING_OP_FALLOCATE and hardwire the flag to FALLOC_FL_ZERO_RANGE.

The implementation is in io_uring/sync.c and the declaration in io_uring/sync.h. Duplicate the implementation of io_fallocate and rename the copy to io_zero. Then replace the mode argument passed to vfs_fallocate with a hardcoded FALLOC_FL_ZERO_RANGE. You will also need to include <linux/falloc.h>. In the end, this is the implementation you should have in io_uring/sync.c:

diff --git a/io_uring/sync.c b/io_uring/sync.c
index 255f68c37e55..fcc180019cde 100644
--- a/io_uring/sync.c
+++ b/io_uring/sync.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/kernel.h>
 #include <linux/errno.h>
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/mm.h>
@@ -110,3 +111,18 @@ int io_fallocate(struct io_kiocb *req, unsigned int issue_flags)
        io_req_set_res(req, ret, 0);
        return IOU_OK;
 }
+
+int io_zero(struct io_kiocb *req, unsigned int issue_flags)
+{
+       struct io_sync *sync = io_kiocb_to_cmd(req, struct io_sync);
+       int ret;
+
+       /* fallocate always requiring blocking context */
+       WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+
+       ret = vfs_fallocate(req->file, FALLOC_FL_ZERO_RANGE, sync->off, sync->len);
+       if (ret >= 0)
+               fsnotify_modify(req->file);
+       io_req_set_res(req, ret, 0);
+       return IOU_OK;
+}

And this should be your io_uring/sync.h header

diff --git a/io_uring/sync.h b/io_uring/sync.h
index e873c888da79..406b15a4453b 100644
--- a/io_uring/sync.h
+++ b/io_uring/sync.h
@@ -7,4 +7,5 @@ int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_fsync(struct io_kiocb *req, unsigned int issue_flags);

 int io_fallocate(struct io_kiocb *req, unsigned int issue_flags);
+int io_zero(struct io_kiocb *req, unsigned int issue_flags);
 int io_fallocate_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);

Sidebar: `io-uring` code organization

In the io_uring/ directory at the top of the kernel tree you will find what amounts to the implementation of the io-uring opcodes. The opcodes are separated into groups according to their functionality, so you have rw.c, which contains the implementation of IORING_OP_READ/IORING_OP_WRITE and friends, nop.c, which has the implementation for IORING_OP_NOP, etc.

Each file defines the struct that contains the relevant fields from the SQE, like file handles, user buffers and so on. In practice, this means that each opcode implementation needs to live in an appropriate place. I presume the choice is made based on the command struct that carries the relevant information, so if you need to pull user buffers, you should add your function to rw.c. I expect it may be necessary to create new files for some operations.

Thankfully, our code is the same as the generic fallocate code, so it fits nicely in sync.c/sync.h

Wiring the call

The io_zero function is currently not called from anywhere. We should probably do something about it.

There are two structures² we need to define to wire everything up, and they are found in io_uring/opdef.c. If we look at the implementation for IORING_OP_FALLOCATE, we can find its io_issue_def entry in the io_issue_defs array and its io_cold_def entry in the io_cold_defs array.

Duplicate them, and change io_issue_def.issue to io_zero and io_cold_def.name to "ZERO", like this:

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..1950f75c1c9f 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
                .prep                   = io_eopnotsupp_prep,
 #endif
        },
+       [IORING_OP_ZERO] = {
+               .needs_file             = 1,
+               .prep                   = io_fallocate_prep,
+               .issue                  = io_zero,
+       },
 };


@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
                .fail                   = io_sendrecv_fail,
 #endif
        },
+       [IORING_OP_ZERO] = {
+               .name                   = "ZERO",
+       },
 };

 const char *io_uring_get_opcode(u8 opcode)

The io_issue_def.prep field for IORING_OP_ZERO is still io_fallocate_prep, the same as for IORING_OP_FALLOCATE; it does basic validation of the SQE. We can just reuse it, no reason to implement our own.
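
As a rough picture of what such a prep step does, here is a simplified userspace model in Rust. It is based on my reading of the kernel's fallocate prep function, so treat the exact set of checks as an assumption; `Sqe`, `SyncCmd` and `fallocate_prep_model` are hypothetical names. The point to notice is the field mapping: the byte count travels in the SQE's `addr` field and the fallocate mode in its `len` field. Since io_zero never reads the mode, whatever ends up there is simply ignored.

```rust
/// The handful of SQE fields this opcode family looks at. Field names mirror
/// the kernel's `struct io_uring_sqe`; everything else is left out.
#[derive(Default)]
struct Sqe {
    off: u64,
    addr: u64,
    len: u32,
    rw_flags: u32,
    buf_index: u16,
    splice_fd_in: i32,
}

/// Per-request state, modelled on the `io_sync` struct used in sync.c.
#[derive(Debug, PartialEq)]
struct SyncCmd {
    off: i64,
    len: i64,
    mode: i32,
}

const EINVAL: i32 = 22;

/// Simplified model of a prep step: reject SQEs that set fields the opcode
/// does not use, then copy the interesting fields into the command struct.
fn fallocate_prep_model(sqe: &Sqe) -> Result<SyncCmd, i32> {
    if sqe.buf_index != 0 || sqe.rw_flags != 0 || sqe.splice_fd_in != 0 {
        return Err(-EINVAL);
    }
    Ok(SyncCmd {
        off: sqe.off as i64,
        len: sqe.addr as i64, // the byte count travels in `addr`
        mode: sqe.len as i32, // the fallocate mode travels in `len`
    })
}
```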

Obviously, the critical thing is setting the io_issue_def.issue field to io_zero - a function pointer to our implementation.
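
If the two-table arrangement feels abstract, this toy Rust model may help. It is a sketch, not the kernel's data structures: the opcode numbers, `IssueDef`, `ColdDef`, `Request` and the two issue functions are made up for illustration, and each issue function just returns the fallocate mode it would pass down. The "hot" table is indexed by opcode and called through a function pointer; the "cold" table is only consulted for the name.

```rust
// Toy opcode numbers; only their relative order matters here.
const OP_FALLOCATE: usize = 0;
const OP_ZERO: usize = 1;
const OP_LAST: usize = 2; // "this goes last, obviously"

/// Value of FALLOC_FL_ZERO_RANGE in <linux/falloc.h>.
const ZERO_RANGE: i32 = 0x10;

/// What an issue function gets to see: the mode requested in the SQE.
struct Request {
    mode: i32,
}

/// "Hot" per-opcode data, consulted on every submission.
struct IssueDef {
    needs_file: bool,
    issue: fn(&Request) -> i32,
}

/// "Cold" per-opcode data, only needed for things like diagnostics.
struct ColdDef {
    name: &'static str,
}

// Each issue function returns the fallocate mode it would pass down.
fn issue_fallocate(req: &Request) -> i32 {
    req.mode
}

fn issue_zero(_req: &Request) -> i32 {
    ZERO_RANGE // the mode is hardwired, whatever the request says
}

// Entries are positional here; the array index is the opcode number.
static ISSUE_DEFS: [IssueDef; OP_LAST] = [
    IssueDef { needs_file: true, issue: issue_fallocate },
    IssueDef { needs_file: true, issue: issue_zero },
];

static COLD_DEFS: [ColdDef; OP_LAST] = [
    ColdDef { name: "FALLOCATE" },
    ColdDef { name: "ZERO" },
];

/// Look the opcode up in the hot table and call through the pointer.
/// Opcodes at or beyond OP_LAST have no entry and are rejected.
fn dispatch(opcode: usize, req: &Request) -> Result<i32, &'static str> {
    let def = ISSUE_DEFS.get(opcode).ok_or("unknown opcode")?;
    Ok((def.issue)(req))
}

/// Look the opcode name up in the cold table.
fn opcode_name(opcode: usize) -> &'static str {
    COLD_DEFS.get(opcode).map_or("INVALID", |d| d.name)
}
```

In this model, `dispatch(OP_ZERO, ...)` always yields `ZERO_RANGE`, while `dispatch(OP_FALLOCATE, ...)` passes the caller's mode through. That is the whole difference between the two opcodes.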

And we’re done. Your kernel should compile. You can go ahead and run it - may I suggest starting a VM with it?

Userspace code

What good is all this if we can’t use it?

Frankly, it’s no good even if we can use it, but it’s educational so let’s do it anyway.

The standard interface for interacting with io_uring is, of course, liburing. We could add an io_uring_prep_zero() method there, and it would work just fine.

But I want to be with the cool kids³. So Rust it is. I mean, what did you expect?

Tokio’s io-uring

There are a few userspace io-uring libraries for Rust. I chose the one from tokio.rs because it seems well maintained and kind of popular. Rio is another option.

At a high level, we will need to make the library know about the new opcode, and have it pass the appropriate arguments.

This means two changes. One is adding the binding in the src/sys/sys.rs file:

diff --git a/src/sys/sys.rs b/src/sys/sys.rs
index 37cc111..ecd4121 100644
--- a/src/sys/sys.rs
+++ b/src/sys/sys.rs
@@ -1045,7 +1045,8 @@ pub const IORING_OP_SOCKET: io_uring_op = 45;
 pub const IORING_OP_URING_CMD: io_uring_op = 46;
 pub const IORING_OP_SEND_ZC: io_uring_op = 47;
 pub const IORING_OP_SENDMSG_ZC: io_uring_op = 48;
-pub const IORING_OP_LAST: io_uring_op = 49;
+pub const IORING_OP_ZERO: io_uring_op = 49;
+pub const IORING_OP_LAST: io_uring_op = 50;
 pub type io_uring_op = libc::c_uint;
 pub const IORING_MSG_DATA: _bindgen_ty_5 = 0;
 pub const IORING_MSG_SEND_FD: _bindgen_ty_5 = 1;

And the other is adding the opcode in opcode.rs. This is again a matter of copying/pasting the existing Fallocate opcode and removing the mode argument and its use.

Note that, because of the way the opcode! macro works, offset is generated as a setter, exactly as it is for Fallocate.

diff --git a/src/opcode.rs b/src/opcode.rs
index ffc5771..e834f43 100644
--- a/src/opcode.rs
+++ b/src/opcode.rs
@@ -740,6 +740,28 @@ opcode! {
     }
 }

+opcode! {
+    pub struct Zero {
+        fd: { impl sealed::UseFixed },
+        len: { u64 },
+        ;;
+        offset: u64 = 0,
+    }
+
+    pub const CODE = sys::IORING_OP_ZERO;
+
+    pub fn build(self) -> Entry {
+        let Zero { fd, len, offset } = self;
+
+        let mut sqe = sqe_zeroed();
+        sqe.opcode = Self::CODE;
+        assign_fd!(sqe.fd = fd);
+        sqe.__bindgen_anon_2.addr = len;
+        sqe.__bindgen_anon_1.off = offset;
+        Entry(sqe)
+    }
+}
+
 opcode! {
     /// Open a file, equivalent to `openat(2)`.
     pub struct OpenAt {
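
To make the setter remark concrete, here is a hand-written approximation of the API that the macro invocation above ends up providing. It is not the macro's real expansion: `RawEntry` is a hypothetical stand-in for the crate's SQE wrapper, and the file descriptor is a plain `i32` instead of the crate's fd types. As far as I can tell, fields listed before `;;` become constructor arguments, and fields after it get a default value and a chainable setter.

```rust
/// Hypothetical stand-in for the crate's SQE wrapper, reduced to the
/// fields the Zero opcode fills in.
#[derive(Debug, PartialEq)]
struct RawEntry {
    opcode: u8,
    fd: i32,
    addr: u64, // carries the length, as in Fallocate
    off: u64,
}

struct Zero {
    fd: i32,
    len: u64,
    offset: u64,
}

impl Zero {
    /// IORING_OP_ZERO in the patched bindings.
    const CODE: u8 = 49;

    /// Required fields (those before `;;`) are constructor arguments.
    fn new(fd: i32, len: u64) -> Self {
        Zero { fd, len, offset: 0 }
    }

    /// Optional fields (those after `;;`) start at their default and get a
    /// setter that consumes and returns the builder, so calls can be chained.
    fn offset(mut self, offset: u64) -> Self {
        self.offset = offset;
        self
    }

    /// Turn the builder into the entry that would be pushed on the ring.
    fn build(self) -> RawEntry {
        RawEntry {
            opcode: Self::CODE,
            fd: self.fd,
            addr: self.len,
            off: self.offset,
        }
    }
}
```

With this shape, `Zero::new(fd, 5).offset(10).build()` reads the same way as the real builder call in the driver program, and leaving out `.offset(...)` keeps the default of 0.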

That’s it, or at least that’s all I can think of right now. The code should compile and install, so let’s leave it like this and see what happens.

NOTE: Yes, obviously we could have skipped the kernel implementation and just added the ZERO opcode in the library with a fixed value for the mode field. But that wouldn’t be “adding an opcode to io-uring”, would it? Remember, it’s all about the journey, not the destination.

A simple driver program

What’s left is writing a simple application to take advantage of this new opcode. Here’s one that you should put in src/zero_file.rs:

use io_uring::{opcode, types, IoUring};
use std::fs::OpenOptions;
use std::io::{Error, Result};
use std::os::unix::io::AsRawFd;

fn main() -> Result<()> {
    // Zero the first five bytes of the file.
    let offset = 0;
    let length = 5;

    // A ring with a single submission queue entry is enough here.
    let mut ring = IoUring::new(1)?;

    let fd = OpenOptions::new().write(true).open("file_with_stuff")?;

    let zero_e = opcode::Zero::new(types::Fd(fd.as_raw_fd()), length as _)
        .offset(offset)
        .build();

    // Safety: the entry refers to no user buffers, and `fd` stays open
    // until the completion has been reaped below.
    unsafe {
        ring.submission()
            .push(&zero_e)
            .expect("submission queue is full");
    }

    // Submit the entry and block until one completion arrives.
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue is empty");

    // A negative result is a negated errno value.
    assert!(cqe.result() >= 0, "zero error: {}", Error::from_raw_os_error(-cqe.result()));

    Ok(())
}

and a Cargo.toml to go with it:

[package]
name = "zero"
version = "0.1.0"
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
io-uring = { path = "../io-uring" }


[[bin]]
name="zero"
path="src/zero_file.rs"

NOTE: The dependency on the modified io-uring library uses a relative path, as you can see. Make sure that you are pointing to the correct location, otherwise this won’t compile.

Once you get everything in place, do a

cargo build

and you should have an executable zero under target/debug/

I also created the file_with_stuff file like so

echo "1234567890"  > file_with_stuff

Then I ran zero from the directory with the file:

chris@desktop:~/sources/sample$ target/debug/zero
thread 'main' panicked at 'zero error: Invalid argument (os error 22)', src/zero_file.rs:34:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Oh right. My kernel doesn’t know about the ZERO opcode. At least I really, really hope that’s the reason.
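
The "os error 22" in that panic is EINVAL, and it comes out of the CQE result convention the driver program relies on: a non-negative result means success, a negative one is a negated errno. A small helper can make that explicit. This is a sketch using only the standard library, and `cqe_result_to_io` is a name I made up, not something the io-uring crate provides.

```rust
use std::io;

/// Interpret a completion result: non-negative means success, negative
/// values are a negated errno. Hypothetical helper for illustration.
fn cqe_result_to_io(res: i32) -> io::Result<u32> {
    if res >= 0 {
        Ok(res as u32)
    } else {
        // Flip the sign back to get the errno value the OS would report.
        Err(io::Error::from_raw_os_error(-res))
    }
}
```

With a helper like this, the assert in the driver program could become `cqe_result_to_io(cqe.result())?`, which returns the error from main instead of panicking.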

When I get the binary to a VM that is running the modified kernel, I get this instead:

ubuntu@ubuntu-20-cloud-image:~$ target/debug/zero
ubuntu@ubuntu-20-cloud-image:~$ cat file_with_stuff 
67890
ubuntu@ubuntu-20-cloud-image:~$

Hey, it worked!
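
The output of cat is a little misleading: presumably the first five bytes are still there as NUL bytes, which the terminal just doesn't show. For a stricter check than eyeballing it, something like the following sketch could be used. It relies only on the standard library, and `leading_zero_bytes` and `inspect_file` are hypothetical helper names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count how many bytes at the start of `bytes` are NUL.
fn leading_zero_bytes(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| b == 0).count()
}

/// Read the whole file and report (leading NUL count, total size in bytes).
fn inspect_file(path: &Path) -> io::Result<(usize, usize)> {
    let bytes = fs::read(path)?;
    Ok((leading_zero_bytes(&bytes), bytes.len()))
}
```

If the zeroing left the file size unchanged, `inspect_file` on file_with_stuff should report five leading NUL bytes and a total of eleven bytes (ten digits plus the newline from echo).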

If you build the executable on your host machine and copy it to the VM, you may get linking errors if the libc versions don’t match. At least I did. So instead, I copied the entire project (i.e. src/zero_file.rs and Cargo.toml) to the VM with scp -r sample/ user@host:/path and compiled it there. YMMV.

Epilogue

If you got this far down, I’m humbled and I hope you got something out of it.

For me this was useful because I avoided the intricacies of implementing a complex opcode, but still had to deal with the whole stack of io-uring. I didn’t know how to do that, but now I do. So yay!

Drop me a line on Mastodon if you’re into that sort of thing, I really would like to know what you think.

Thank you for reading.

¹ Copy-paste. We can definitely copy-paste.
² They are a single structure, really. The reason they are split (and why “cold” is in the name of one) is for performance reasons. Keeping the “hot” structure smaller means better cache locality and better performance.
³ You’re thinking about doing it directly from the raw syscall, aren’t you? Well done. It’s a job and a half, and you would basically re-implement parts of liburing. But you’ll learn a lot along the way.