So, you want to write your own page cache

Are you sure I can’t talk you out of it?

Ok, fine, I understand, you like a challenge. Concurrent programming, I/O, memory management, there are a lot of concepts that you need to master. It can be fulfilling.

But, before you start, take a moment to reflect on the hurdles you’ll face.

As you embark on this journey, the biggest obstacle on your way will be concurrency. You’ll find that your page cache will need a list of memory pages, a page lookup table, and other things that need to be concurrently accessed. This means memory fences, thread safe data structures and probably a lot of hard bugs to deal with.

You could, of course, make your life simpler, by having a single worker thread do all the I/O work and access the common page cache state. No more concurrency issues then.

But that’s slow, isn’t it?

It turns out, it doesn’t have to be. And that’s what this post is about. Making a fast enough page cache that is easy to debug, maintain and extend.

For that, we’ll need a simple state machine, which will be the main thing I’ll describe. But, first, let’s look at the tool that makes this technique work.

Here, take this ring with you

Our secret weapon is Linux’s io_uring.

Here’s a schematic of how it works:

[Image: a request (“plz read something”) goes into a box labelled io_uring; then the SpongeBob “a few moments later” card; then “i read something” comes back out of the io_uring box.]

Ok, that may be oversimplified, but it captures the gist of it. io_uring is basically an async runtime for I/O: you give it a request and then go about your business. At some later convenient point, you can check and see if your I/O is complete.

Of course, you don’t need to submit only one request at a time - you can do many at once and they will be executed in parallel, in kernel threads.
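To make that concrete, here is roughly what one read looks like through Rust's io-uring crate. The crate choice, the file name and the user_data value are my own illustration, not necessarily how any particular page cache drives it:

use std::os::unix::io::AsRawFd;
use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    let file = std::fs::File::open("data.bin")?;
    let mut buf = vec![0u8; 4096];

    // Build a read SQE and tag it with user_data so we can recognise it later.
    let sqe = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .offset(0)
        .build()
        .user_data(42);

    unsafe { ring.submission().push(&sqe).expect("submission queue is full") };
    ring.submit()?; // hand it to the kernel and go about your business

    // ...a few moments later: reap the completion (CQE) whenever it's convenient.
    // In a real loop you'd poll this repeatedly instead of checking just once.
    if let Some(cqe) = ring.completion().next() {
        assert_eq!(cqe.user_data(), 42);
        println!("read {} bytes", cqe.result()); // a negative result is -errno
    }
    Ok(())
}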

We can build on that. After all, the main problem with I/O is that it takes a while - which is why we normally want multiple threads doing work at the same time, and why page caches are usually multithreaded.

But if we can take advantage of concurrent, asynchronous I/O at the kernel level, many of our problems go away. We can have a single service thread reading requests from a queue, submit I/O requests (SQEs, as uring calls them) as needed, and at some later point, when the I/O is done, return the result to the caller. If the request is for a page that is already in the cache, then we can return that, of course - that’s what caches are for.

That’s a bit optimistic though, isn’t it? After all, most requests take multiple steps to execute, which means we’ll need to wait for I/O steps to complete anyway, doesn’t it?

It turns out, it doesn’t. We’ll use a state machine approach to park jobs as they wait for their I/O operations, so our service thread won’t have to block. Ever.

The state machine approach turns out to be very simple, and guarantees that progress will happen. I will describe how the state machine works soon, but before we do that, we have a more fundamental question to discuss.

What will our page cache do?

We are building a page cache because we have more data on disk than we have memory. We want to read and write that data, and we want to offload the juggling of file and memory pages to a component that will handle reads, writes, file flushes, locking, everything of that nature, for us.

In effect, a page cache is a facade over the filesystem API and, for the purposes of this article, I’ll assume two operations: read() and write().

Both take a file offset (plus a transaction and a request id, more on those below) and eventually yield a pointer to a buffer with the contents of the file at that offset:

fn read(offset: FileOffset, transaction_id: TransactionId, request_id: u64) -> PagePoll

fn write(offset: FileOffset, transaction_id: TransactionId, request_id: u64) -> PagePoll

enum PagePoll {
   Wait,
   ReadDone(PagePointer, u64),
   WriteDone(PagePointer, u64),
}

Some notes.

  1. Remember, this is asynchronous. Read and write requests do not always return a result immediately. As of this writing, both return a ReadDone or WriteDone (carrying the PagePointer) straight away if there is a cache hit; otherwise they return a Wait value to indicate that the caller needs to try again later.

  2. The page cache is transaction aware. All reads and writes happen in the context of a transaction, which allows threads to read their own writes in isolation. This is a concern not commonly attributed to the page cache, but I happen to think it’s an excellent idea. In any case, the complexity it adds is clearly demarcated, so you can remove it if you implement this for your own project. Which brings us to…

  3. The difference between read and write is that write will do effectively a copy-on-write operation. Writes will happen in a transaction private copy of the page and that will become visible to other transactions only on commit. Speaking of commit…

  4. We are missing two more methods to manage transactions - begin_tx() and commit(). These create and terminate transaction contexts in the page cache.

fn begin_tx(transaction: TransactionId)

fn commit(transaction: TransactionId)

  5. Finally, request_id is an identifier for the request - these need to be unique, so asynchronously received responses can be mapped to their requests.

There is one piece still missing. If read or write return Wait, then how do I get my buffer?

We need one more method.

fn progress() -> PagePoll

Calling this will try to find some work that can make progress and push it along. It may be reaping a response from uring (aka a CQE), it may be doing a copy for a write request, whatever is next.

It may even be nothing, in which case it will return Wait.

But if it completes something, then it will return the result (a PagePointer) along with the request_id of the request that this result is for.
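To make the calling convention concrete, here is a sketch of how a caller could block until its own page arrives, assuming the functions above live as methods on a PageCache type. The method form and the blocking wrapper are my own framing; as described further below, the real callers go through a worker/channel setup instead:

fn read_blocking(
    cache: &mut PageCache,
    offset: FileOffset,
    transaction_id: TransactionId,
    request_id: u64,
) -> PagePointer {
    // Cache hit: the page comes back immediately.
    if let PagePoll::ReadDone(page, _) = cache.read(offset, transaction_id, request_id) {
        return page;
    }
    // Cache miss: keep pushing work along until *our* request completes.
    loop {
        match cache.progress() {
            PagePoll::ReadDone(page, id) if id == request_id => return page,
            _ => {} // Wait, or a completion for some other request_id (a real caller would route it)
        }
    }
}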

In fact, the rest of this post is about what progress() does, starting with reads.

The simple case: Doing a read()

Ok, we have the API. I’ll assume we have a pool of pages in memory (that’s what a page cache is, really), and a file on disk. We also have a map from file+offset to pages, known as the Page Lookup Table (PLT), so we can serve data that’s already in memory. These are all the implementation details that come into the design, so I won’t spend more time on them.
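To make the rest easier to follow, here is a rough sketch of that state. The type names, the tuple standing in for FileOffset and the field layout are my guesses for illustration, not Glowdust’s actual definitions:

use std::collections::HashMap;

const PAGE_SIZE: usize = 4096; // assumed page size

type FileOffset = (u64, u64); // (file id, byte offset) - a stand-in for the real type

// One slot in the in-memory page pool.
struct Page {
    buf: Box<[u8; PAGE_SIZE]>,
    use_count: u32, // pinned while > 0, so the page can't be evicted
    dirty: bool,    // must be written back before the slot can be reused
}

// The cache itself: a fixed pool of pages plus the Page Lookup Table.
struct PageCache {
    pages: Vec<Page>,
    plt: HashMap<FileOffset, usize>, // file + offset -> index into `pages`
    // ...plus a queue of in-flight jobs and the io_uring instance, both used by progress().
}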

Let’s say that someone asks to read a page’s worth of data from some offset of the file into memory.

First, we may already have a copy of this data that we find via the PLT - a cache hit. In this case our work is done, we can return immediately. We just need to keep a use counter for each page, so we know which ones are in use. As is standard for page caches, a page must be released when the user is done, so we don’t leak memory1.

If, however, we cache miss, we have some work to do. First, we need to find a page we can read into, possibly evicting one that’s already there (but not in use, of course). Then, we’ll need to read the data into that page and finally return. Evictions of dirty pages will issue a write I/O request, so it will have to wait a bit. Page faults wait on read I/O.
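In code, the entry point could look something like this sketch, building on the PageCache outline above. The jobs queue, the Job type (sketched a bit further down, with the state machine) and PagePointer being a pool index are all assumptions of mine:

impl PageCache {
    fn read(&mut self, offset: FileOffset, transaction_id: TransactionId, request_id: u64) -> PagePoll {
        // Cache hit: pin the page for the caller and return it straight away.
        if let Some(&slot) = self.plt.get(&offset) {
            self.pages[slot].use_count += 1;
            return PagePoll::ReadDone(PagePointer(slot), request_id);
        }
        // Cache miss: create a job and let progress() drive it through the state machine.
        self.jobs.push_back(Job::new_read(offset, transaction_id, request_id));
        PagePoll::Wait
    }
}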

It turns out that this is all the I/O we need to do, and we do it constantly. It’s a pretty simple loop that looks like this:

[State diagram: WaitingForFreePage (white) has arrows to Evicting and Faulting (both orange); Evicting also points to Faulting; Faulting points to Done (white).]

The two orange states, Evicting and Faulting are where I/O happens and are asynchronous. When a request is in either of these states, it is waiting on io_uring to complete an I/O request, so we can skip it and continue looking for requests that are either WaitingForFreePage to kick off their I/O or in Done, so we can return to the caller.

The arrow from WaitingForFreePage to Faulting is an optimization if an already free page is found - then eviction is not needed and we can go ahead and fault from disk to memory.
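In Rust, the per-job state could look like this. The names mirror the diagram, and the slot field is my own addition to remember which pool page the job has claimed:

// The state machine from the diagram above. Evicting and Faulting are the two
// states with I/O in flight.
enum JobState {
    WaitingForFreePage,
    Evicting,
    Faulting,
    Done,
}

// One pending request, tracked by the service thread until it reaches Done.
struct Job {
    state: JobState,
    offset: FileOffset,
    transaction_id: TransactionId, // which transaction the request belongs to
    request_id: u64,
    slot: Option<usize>, // pool slot claimed once the job leaves WaitingForFreePage
}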

Once a read request is received, a job is created to track its state, set to WaitingForFreePage and placed in a job queue. Every time the progress() method is called, it checks a batch of jobs in that queue for their state and progresses them along the state machine. It never blocks, it just calls on io_uring to do the I/O with the request_id as an identifier2. It also checks the CQE for completed I/O, looks up the corresponding job and progresses it3.

There is also the case that it’s impossible for a job to proceed - for example, it’s WaitingForFreePage but it can’t find a page to evict and move to Evicting because all pages are in use. In this case, the service thread will just keep trying this job every now and then until it moves through. This design doesn’t guarantee that a specific job will make progress, just that some job will make progress. A Quality of Service API would fit nicely here, if the implementer wants to have special jobs that complete immediately, without yielding to other work.
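Here is how a single step of that might look in code, continuing the sketch above; find_free_slot, find_victim, submit_fault and submit_evict are hypothetical helpers (the last two would push the appropriate SQE tagged with the job’s request_id):

impl PageCache {
    // Try to move one job forward. This never blocks: jobs waiting on I/O are
    // skipped, and a job that can't claim a page simply stays where it is.
    fn step(&mut self, job: &mut Job) {
        match job.state {
            JobState::WaitingForFreePage => {
                if let Some(slot) = self.find_free_slot() {
                    // Free page found: skip eviction and fault straight into it.
                    job.slot = Some(slot);
                    self.submit_fault(slot, job.offset, job.request_id);
                    job.state = JobState::Faulting;
                } else if let Some(slot) = self.find_victim() {
                    // Write the dirty victim back before reusing its slot.
                    job.slot = Some(slot);
                    self.submit_evict(slot, job.request_id);
                    job.state = JobState::Evicting;
                }
                // Otherwise every page is pinned right now; try again on a later pass.
            }
            // I/O in flight: nothing to do here, the CQE handler advances these jobs.
            JobState::Evicting | JobState::Faulting => {}
            // Finished: progress() pops it and returns the result to the caller.
            JobState::Done => {}
        }
    }
}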

Once a job reaches Done, it is….done. This means it has a pointer to a page that contains whatever data the caller asked for, and it is pinned in place so it can’t be evicted. On the next call to progress() this job will be found, it will be removed from the queue, and the appropriate PagePoll will be returned.

This read request is now complete. We never blocked on it, and while we were waiting for its I/O to complete, we served requests that came in and progressed all the jobs we could.

Wait a minute. Who calls progress()?

All of the methods I mentioned above, including progress, are meant to be called from the same thread4.

In Glowdust, the way I do it is I have the page cache run in its own thread and do a loop that

  1. Reads read/write requests from an mpsc queue (these are added by the Glowdust worker threads) and creates the appropriate jobs
  2. Calls progress and sees if any results are returned
  3. Returns results to the worker threads.

This allows for control over how often to check for new requests vs progressing pending work, and makes it easier to refactor between a blocking and a non-blocking API. I have opted for blocking the worker threads for now, but it’s easy to imagine returning a Future-like object for every request that you can await on5.
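For reference, the skeleton of that loop might look like this; the Request type, the enqueue method and the single reply channel are simplifications on my part (the real code routes results back to the worker that asked):

use std::sync::mpsc::{Receiver, Sender};

fn run(mut cache: PageCache, requests: Receiver<Request>, results: Sender<PagePoll>) {
    loop {
        // 1. Drain any newly arrived read/write requests into jobs.
        while let Ok(request) = requests.try_recv() {
            cache.enqueue(request);
        }
        // 2. Push pending jobs along their state machines.
        match cache.progress() {
            PagePoll::Wait => {}
            // 3. Something finished: hand the result back to its caller.
            done => {
                let _ = results.send(done);
            }
        }
    }
}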

And what about writes?

Writes are not particularly complicated. Recall that write creates a private copy if one doesn’t exist, and returns it. This means there are two possibilities:

  1. The page we’re looking for is already faulted for reading. Then we need to evict to make space so we can create our copy.
  2. The page we’re looking for is not present in the page cache, so we need to fault it directly as a private copy. Faulting means we need to first evict, then fault and then return, which is exactly the same state machine as read.

The state transition for the first case is slightly different:

[State diagram: WaitingForFreePage → Evicting (orange) → Done.]

The main difference is that in the first case, we don’t fault, we copy, so there is no read I/O. The choice between the two cases happens at the time of handling the write request, when we check if the PLT has an entry for the requested offset. If it does, we increment the use count and kick off a Copy job. If it doesn’t, we kick off a DirectFault job. The kind of job determines which state transitions the job will go through. Other than that, the way Evicting and Faulting are handled is exactly the same as for read.
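One way to encode that is a job kind, chosen when the request is handled, which then selects the transitions; the names here are my own:

// Which path a job takes through the state machine above.
enum JobKind {
    Read,        // read miss: WaitingForFreePage -> (Evicting ->) Faulting -> Done
    DirectFault, // write miss: same transitions, into a transaction-private page
    Copy,        // write hit: WaitingForFreePage -> Evicting -> Done, memcpy instead of read I/O
}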

That’s all you’ll need

If you got this far, I salute your tenacity. Following this pattern will let you write a decently performing page cache that doesn’t even come close to having race conditions, deadlocks or any other gremlin associated with concurrency.

All of the code is available at the Glowdust repo where you can find all the gnarly details I omitted. None of them have to do with concurrency though, it’s just the usual tediousness of I/O code.

It works just fine, in case you’re wondering. I am using it to build the LSMTree store and everything fits very well so far. I expect changes will happen, especially as the API crystalizes, but the basic technique holds up.

I also glossed over the commit method, because it’s not traditionally a page cache thing. It is integral to my implementation, but you can skip it if you follow this method. I won’t judge you for it.

As a single threaded construct, I expect it to be really fast up until it maxes out its core, and then it doesn’t scale further. Keep in mind, performance is not a priority for me right now, so I haven’t benchmarked it. Is that a problem? Maybe, depending on your use case. For a small DBMS like Glowdust, it works really well, because what I care about the most is components that are easy to get right and even easier to debug. So for me, the scalability issue does not factor in.

But what I care about the most is the technique of using the state machine to keep a thread from blocking and use it at 100% to advance any job that can advance.

While I haven’t tested this yet, I am pretty sure that this design also exhibits excellent push-back behavior. The number of pending jobs in my implementation is fixed, which means I can dynamically adjust the amount of load put on I/O. I can also easily implement a job priority scheme and even create a simple scheduler.

What I want to look into further is how difficult it will be to get the code from this single-threaded state to a multi-threaded one, where all threads work on all jobs, regardless of ownership. That’s exactly the approach used by lock-free data structures like Harris’s non-blocking linked list, which uses CAS to progress what is effectively a state machine.


As always, let me know your thoughts on Mastodon

And, if you find this interesting enough, you may want to donate towards my costs as an independent developer.


  1. This is a good example of where the API can be improved for quality of life. Pages can have their use count decremented when they are drop()’d, so the user doesn’t accidentally leak them. Even further, page acquisition could be bound to Transaction objects, so they are all reaped on drop, commit or abort. I haven’t implemented this, but it’s not hard. The API will evolve as I use it. ↩︎

  2. io_uring allows for the attachment of a u64 as “user data” in every SQE, which is returned unmodified in the CQE. This is exactly so that requests and responses can be correlated, and that’s what we use it for here. ↩︎

  3. I am ignoring all error handling here for simplicity. There are also edge cases about extending file size, partial reads etc that I don’t cover here, they are beyond the scope of this introduction. ↩︎

  4. An exception to this is the page use count, which is an AtomicUsize for convenience. This way it can be released by the owning thread directly. ↩︎

  5. Right now there is only support for one page per read/write call, but that’s not necessary. It is not hard to support passing arrays of offsets and initiating multiple I/Os at once, perhaps accompanied by an array of request_ids so you can get back multiple results from a single call. The similarities with io_uring are not accidental. ↩︎