Perhaps the easiest thing in the world is coming up with grand designs when you don’t actually have to do the work.

It’s slightly more difficult to pull this off if you’re working alone. Difficult, but not impossible.

All you have to do is pretend that you won’t have to do the work, a method known as “making it my future self’s problem”. A demonstration of this technique can be found in my previous work, which includes hits like “my ideal transactional subsystem” (why, you don’t have an ideal transactional subsystem?).

It should come as no surprise then to learn that, after I was done with the first iteration of the LSM store implementation, I ran out of excuses - it was time to stop preaching and start writing some code.

Part 1: What makes a transaction, anyway?

For the purposes of this discussion, a transaction is an identifiable working context.

A transaction is basically a scratch space where changes to the database can be collected before they are complete. Changes to the database need to appear atomically, which means we need to first gather them, then validate them, and finally make them visible to all users.

Many of the words in the previous description should give you pause.

What constitutes a “change”?

Where are these changes stored?

What does it mean to validate them? Complete them? Make them visible?

Come to think of it, that’s all the words. So let’s try to do a bit better.

Meme picture of Inigo Montoya from The Princess Bride. The caption at the top says “let me explain”, and at the bottom, “even if there is too much”.

Making changes

A change in Glowdust means assigning a value to a key. It doesn’t matter if the key existed or not - if you do a value assignment, this implies changes to the store files. These changes cannot be visible to other transactions, because they may be valid only if certain conditions are met, and these conditions may not have been checked yet.

For example, and at the risk of explaining the obvious: Say a user has implemented a list in Glowdust and they do an insert. They of course want the new count and the new element to appear at the same time - it shouldn’t be possible for a transaction to see the new element but the wrong count.

That’s what transactions are for.

Note that I have not included schema updates in the definition for changes. Currently, these are treated separately, even if they appear in a Glowdust program with other transactional statements:

create function the_f(int) -> int   // this will be executed independently of what follows

the_f(1) = 3;

The function definition in the example above will be executed regardless of what happens to the transaction statement that follows. This is deliberate and pretty common in databases, since schema statements don’t generally appear in transactional contexts - they are executed rarely, usually as part of an install script, and their transactional semantics are kind of vague.

Storing the changes

As the transaction creates and modifies key/value pairs, these changes need to be kept somewhere so once the transaction commits we can, like, you know, have them so we can validate and apply them. Where to keep them, though?

Memory. The answer is memory, we keep them in memory.

But where in memory?

Good question. Glowdust keeps them in the page cache (or what will be the page cache once I get around to building one). This comes with some benefits:

  1. Most importantly, no memory allocations. Assigning already allocated pages to hold transaction state means we only allocate page cache, which happens very rarely (currently at system startup, but that will change). So, lower heap churn and fewer system calls.
  2. Transaction state is built up directly as pages, so they can be flushed to disk just like any other store page. In fact, there is no difference between the “real” store and the transaction state store, they can be treated exactly the same for reads and writes.
  3. Furthermore, once the tx completes, there is no further processing necessary for commit. All the data is right there, already in the store format and in pages, ready to flush.
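The page-assignment idea above can be sketched roughly as follows. This is an illustration, not Glowdust code - `Page`, `PagePool`, and the sizes are hypothetical names, standing in for whatever the actual page cache ends up looking like:

```rust
// Illustrative sketch: a pool of pages allocated once (e.g. at startup),
// which transactions borrow for their state and hand back when done.
// No per-transaction heap allocation happens.

const PAGE_SIZE: usize = 4096;

pub struct Page {
    pub data: Box<[u8; PAGE_SIZE]>,
}

pub struct PagePool {
    free: Vec<Page>, // allocated once, at system startup
}

impl PagePool {
    pub fn new(page_count: usize) -> Self {
        let free = (0..page_count)
            .map(|_| Page { data: Box::new([0u8; PAGE_SIZE]) })
            .collect();
        PagePool { free }
    }

    pub fn free_pages(&self) -> usize {
        self.free.len()
    }

    // A transaction takes a page for its state; no allocation here.
    pub fn acquire(&mut self) -> Option<Page> {
        self.free.pop()
    }

    // After commit, the pages (already in store format) can be flushed
    // and handed back to the pool.
    pub fn release(&mut self, page: Page) {
        self.free.push(page);
    }
}
```

The point of the sketch is the lifecycle: all allocation happens in `new()`, and transactions only ever move already-allocated pages around.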

Processing the changes - and how to push back load

With the above in place, it is really easy to move transaction state between threads. Everything can be looked up by an id: the pages with the state, the VM stack, and the cursor state.

This lets transactions be served by a fixed number of worker threads that pick them up, execute some bytecodes, and then put them back for another thread to pick up later.

Of course, transactions don’t have only their own state to worry about - there is a whole store that needs to be page faulted so data can be read. That’s also stored in the page cache, which creates an interesting dynamic with the transaction state: As more transactions are running (or the bigger they are), the less space is left for page faulting store files, so throughput will go down, effectively pushing load back.

CPU has a similar behavior - given a fixed number of worker threads, as load increases (more client connections, for example), transactions and their state will accumulate in the page cache, again pushing back on the load.

At least I think that’s what should happen. I haven’t implemented enough of the system to see this in action but, in theory, making the bottlenecks appear at a single spot and using that to push back on the load seems like it should work.

If you read the above paragraph and laughed out loud, please let me know why, because I’m about to spend a lot of time on this and I’d like to know if it’s doomed at conception.

This will be an important set of experiments to run, and in all likelihood I will focus on developing enough infrastructure to get them running and see how the system behaves.

And you bet I’ll write about it.

Committing the changes

Committing the changes mostly means making them visible to other transactions.

“Other transactions” means transactions that satisfy certain time horizon and visibility criteria, depending on the consistency model chosen. Glowdust hasn’t yet committed (pun intended) to a consistency model, although I am partial to Snapshot Isolation.

Part of commit is also constraint validation (which for Glowdust is still WIP) and mutual exclusion.

The latter is generally addressed in two ways - with locks (pessimistically) or with conflict resolution (optimistically - aka MVCC). I have not decided yet on which method to use - locks are easier, MVCC is sexier and fits much better with both snapshot isolation and the whole “transaction is in the page cache” thing.

For now though, the model is extremely simple. When a transaction commits, it makes its private state public, by moving it to a common vector. That vector is the store used by all transactions after they first exhaust their own private state.
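As a rough sketch of that model - with made-up names, and a plain `HashMap` standing in for the memtables - commit just moves the transaction’s private map into the shared vector:

```rust
// Sketch of the simple commit model: private state is per-transaction,
// and commit makes it public by moving it into a common vector that all
// transactions read after exhausting their own private state.

use std::collections::HashMap;

type TxId = u64;
type State = HashMap<String, i64>;

#[derive(Default)]
pub struct SketchStore {
    private: HashMap<TxId, State>, // one scratch space per transaction
    committed: Vec<State>,         // visible to every transaction
}

impl SketchStore {
    pub fn begin(&mut self, tx: TxId) {
        self.private.insert(tx, State::new());
    }

    pub fn set(&mut self, tx: TxId, key: &str, value: i64) {
        self.private
            .get_mut(&tx)
            .expect("transaction not started")
            .insert(key.to_string(), value);
    }

    // Commit: no copying or reformatting, just move the private state
    // wholesale into the shared vector.
    pub fn commit(&mut self, tx: TxId) {
        let state = self.private.remove(&tx).expect("transaction not started");
        self.committed.push(state);
    }

    pub fn committed_len(&self) -> usize {
        self.committed.len()
    }
}
```

Note that commit here is just a move - which is exactly the property the page cache design above is meant to preserve at scale.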

Part 2: The current implementation

Most of the above is implemented in a single commit, which is a straightforward implementation of what I described in Part 1.

It basically adds a context identifier to the ValueStore methods. This context id is synonymous with a transaction id, which in turn is a u64, although I expect this to grow more complicated.

The signatures for reading and writing values have now become

fn get_function_value(
    &self,
    context: TransactionId,
    argument: &[Value],
) -> Result<Option<Value>, StoreError>;

fn set_function_value(
    &mut self,
    context: TransactionId,
    argument: &[Value],
    value: Value,
) -> Result<(), StoreError>;

fn function_cursor(
    &self,
    context: TransactionId,
) -> Result<Box<dyn FunctionCursor>, StoreError>;

Naturally, a store implementation can choose to ignore the context, like the CSV reader implementation does - in essence this means that the store is read only.
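For illustration, here is roughly what such a context-ignoring, read-only implementation could look like. The trait below is a cut-down stand-in with my own names, not the actual Glowdust ValueStore trait:

```rust
// Hypothetical sketch of a read-only store that ignores the transaction
// context, in the spirit of the CSV reader mentioned above.

type TransactionId = u64;

#[derive(Debug)]
pub struct StoreError(pub String);

pub trait KvStore {
    fn get(&self, context: TransactionId, key: &str) -> Result<Option<String>, StoreError>;
    fn set(&mut self, context: TransactionId, key: &str, value: String) -> Result<(), StoreError>;
}

pub struct ReadOnlyStore {
    pub rows: Vec<(String, String)>, // e.g. parsed from a CSV file
}

impl KvStore for ReadOnlyStore {
    // The context is ignored: every transaction sees the same data.
    fn get(&self, _context: TransactionId, key: &str) -> Result<Option<String>, StoreError> {
        Ok(self.rows.iter().find(|(k, _)| k == key).map(|(_, v)| v.clone()))
    }

    // Writes are rejected, which is what "read only" amounts to here.
    fn set(&mut self, _context: TransactionId, _key: &str, _value: String) -> Result<(), StoreError> {
        Err(StoreError("store is read only".to_string()))
    }
}
```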

But do you know which store is not read-only?

That’s right, the LSM tree store.

Transaction support in the LSMTree store

The LSM store becomes essentially this:

pub struct LsmTreeStore {
    context_stores: DashMap<TransactionId, LsmTreeStoreModule>,
    committed_stores: Vec<LsmTreeStoreModule>,
}
which is the most naive way possible to implement this.

It’s also wrong, for reasons that will soon become clear.

Each LsmTreeStoreModule is a collection of OrderedMemtable for storing keys and UnorderedMemtable for storing values (as we’ll see in Part 2 of the store series). Each module stores the state of exactly one transaction.

Reading and writing values is dependent on a transaction id, as per the trait above, and follows a strict order of

  1. Look up the module for the given transaction id and try to find the key there. Not having a context for a transaction is an error, because it means the transaction has either not been started or has already been committed.
  2. If the key is not found in the module, look in the committed_stores vector. Return the value from there if the key is found; otherwise return None.

In other words, as the implementation is now, the committed_stores vector is a stand-in for what would be the “on-disk store” for the database.

Note, however, that all data is stored in the same “page cache”: vectors allocated by the store that hold both committed state (i.e. the store) and uncommitted state.
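The two-step lookup can be sketched like this - again with a plain `HashMap` standing in for the memtable modules, and hypothetical names throughout:

```rust
// Sketch of the read path: the transaction's own module first (a missing
// module is an error), then the committed vector, newest entry first.

use std::collections::HashMap;

type TxId = u64;
type Module = HashMap<String, i64>;

pub struct LookupSketch {
    pub modules: HashMap<TxId, Module>, // one module per live transaction
    pub committed: Vec<Module>,         // stand-in for the on-disk store
}

impl LookupSketch {
    pub fn get(&self, tx: TxId, key: &str) -> Result<Option<i64>, String> {
        // Step 1: the transaction's own module. No module means the
        // transaction was never started or has already committed.
        let module = self
            .modules
            .get(&tx)
            .ok_or_else(|| format!("no context for transaction {tx}"))?;
        if let Some(v) = module.get(key) {
            return Ok(Some(*v));
        }
        // Step 2: fall back to committed state, newest first; None if absent.
        Ok(self.committed.iter().rev().find_map(|m| m.get(key).copied()))
    }
}
```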

LSMTree store commit sequence

When the script ends successfully, the transaction is committed. This happens on the Server component, by calling the commit() method on the transaction state (currently represented by a VirtualMachine instance that holds the stack, cursors and TransactionId).

let mut vm = VirtualMachine::new_with_id(tx_id);
vm.run(&mut store, &mut sink);
vm.commit(&mut store);

These three lines effectively

  1. Start the transaction and allocate the VM stack for running the script bytecode
  2. Execute the bytecode
  3. Commit the changes

Rollback is not implemented yet. You can just crash the whole process instead, it’s cooler that way1.

Obviously, the important bit is the last line, the commit() call. Assuming the run() call returned properly, that’s where the changes of the transaction become visible. And what does that method do?

It will go over all stores in the store argument and call commit() on each one.

Only 2 problems? That’s not so bad, considering

In the current implementation, each function lives in its own store, which means the following race can happen:

  1. Transaction A begins
  2. Transaction A makes changes to functions fun1 and fun2
  3. Transaction B begins
  4. Transaction A begins commit
  5. Transaction A commits store for fun1
  6. Transaction B reads store for fun1, reads A’s changes
  7. Transaction B reads store for fun2, doesn’t see A’s changes
  8. …more stuff happens

Obviously, in this case, atomicity has been violated, because at step 7 transaction B saw changes that were in progress. This problem needs to be fixed, and there are some ways this can be done. This is future work, I apologise, there is only so much time available in the day.

There is another problem that needs fixing here - conflicting changes. Let’s do another sequence like above, but more devious this time

  1. Transaction A begins
  2. Transaction A sets fun(1) = 2
  3. Transaction A sets fun(2) = 4
  4. Transaction B begins
  5. Transaction B sets fun(1) = 3
  6. Transaction B sets fun(2) = 5
  7. Transaction A starts commit, sets fun(1) = 2
  8. Transaction B starts commit, sets fun(1) = 3
  9. Transaction B sets fun(2) = 5
  10. Transaction A sets fun(2) = 4


The end result is

fun(1) = 3
fun(2) = 4

But no transaction ever wrote this. There is no sequence in which transactions A and B could have executed that would produce this end result.

As you may have guessed, this is a problem. There is a conflict in what transactions A and B read and write that needs to be managed. As I wrote above, one way to deal with this is locks, the other is MVCC. Given that transaction ids are already present and integral in the design, and the state is already formatted as store entries, MVCC feels much more natural. Again, this is future work.

One final note

Above I mentioned the execution sequence for a transaction, but I skipped over some details. I’d like to make amends here and mention one of them - the sink argument of the method.

// let mut vm = VirtualMachine::new_with_id(tx_id);
// vm.run(&mut store, &mut sink); // <-- This one
// vm.commit(&mut store);

That’s where results from the script are written2, and thus they can be consumed by the caller.

The results are returned immediately, as the script is running. This means state can accumulate if the caller doesn’t consume the returned values, which can lead to memory pressure if the Sink doesn’t have a blocking implementation.

It also means that if a transaction produces results that are not consumed by the caller, then that transaction will block until the results are drained. This puts a cap on the amount of memory consumed by transactions, effectively pushing back on memory consumption.

Finally, one more cool thing I found: using a sync_channel(1) as the implementation creates an interesting behavior. For every return statement in a script, the worker thread will block until the result is consumed. So I can effectively create breakpoints in a Glowdust script to write integration tests for executing transactions.
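That behavior is easy to demonstrate with the standard library alone. In the sketch below, the spawned thread stands in for the VM worker, and each send() is a “return” that blocks until the consumer drains it:

```rust
// Demonstration of the sync_channel(1) behavior: the producer (standing
// in for the VM worker) blocks on each send until the consumer takes the
// previous result, so each result acts like a breakpoint.

use std::sync::mpsc::sync_channel;
use std::thread;

pub fn run_with_breakpoints(values: Vec<i64>) -> Vec<i64> {
    // Buffer of 1: the worker can be at most one result ahead of the consumer.
    let (sink, results) = sync_channel::<i64>(1);

    let worker = thread::spawn(move || {
        for v in values {
            // Blocks here until the previous value has been consumed.
            sink.send(v).unwrap();
        }
        // Dropping `sink` ends the consumer's iteration below.
    });

    // The consumer controls the pace; the worker cannot run ahead of it.
    let collected: Vec<i64> = results.iter().collect();
    worker.join().unwrap();
    collected
}
```

In a test, the consumer side can pause between receives and assert on intermediate store state, which is what makes these “breakpoints” useful.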

I’ll be committing some tests to this effect soon, I need to clean them up first.

Part 3: Next steps

I will focus on testing out the push back behavior of the system. I am very curious to see if it behaves like I expect it to. A robust load handling architecture is key for a reliable DBMS and probably the hardest part of building the transactional subsystem. It also means greatly simplified operation, because it reduces the need for configuration - technically, all the operator needs to worry about is how much memory to give to the system. The rest is balanced automatically.

If you’re interested in the design of the system, or even using it as a basis for your own products, get in touch. I am happy to discuss database design or get more involved in work that you do in this area.

As always, let me know your thoughts on Mastodon

And, if you find this interesting enough, you may want to donate towards my costs as an independent developer.

  1. Ok, fine, it’s not really cooler, it’s just that I need to decide on semantics and implement it. But in the meantime, just crash it, it’s not that hard anyway. ↩︎

  2. Results are produced by the script when the return statement is used. ↩︎