How Glowdust models CSV files as functions
DISCLAIMER
This post reflects the state of affairs at the time of publication. For up-to-date documentation on Glowdust’s CSV support, go to the book.
I’m going to split this post in two parts.
The first will be about the design decisions for CSV import in Glowdust as it was released yesterday.
The second will discuss storage more generally.
CSV files as functions
At first look, CSV files don’t seem to contain functions, do they?
Take a look at this CSV file:
fake_id
1
1
In what universe is this a function? It doesn’t even represent a mapping, much less have a key for the domain.
“Ah”, the shrewd designer will exclaim.
“There is a domain value here, hidden in plain sight. It’s the row id. So any CSV file is a mapping from row id to the row contents.”
Well, maybe so. But w-
“Moreover”, the designer of the astute persuasion continues, “you are forgetting the file URI itself. So the key is in reality a composite key of URI+row id”.
In an attempt to bring this silly bit to a lukewarm conclusion, I’ll note that yes, these are valid identifiers if we assume the structure of the CSV file to be relevant.
These identifiers are not from the data itself though, are they? There is no row id in that text snippet above. This exact data could be part of a stream that might or might not repeat - where is your row id then?
No, I don’t think that the mechanics of a CSV file are relevant for representation as a function.
So, how about we cheat a bit? Since the whole issue is around ids, maybe we could skip them?
And that, dear reader, is how CSV support is implemented in Glowdust.
In fact, as we’ll see in the second part of this post, the way Glowdust treats CSV is the basis for all storage. Every other store is at least CSV with some optional bells and whistles on top - you know, trivial things like identifiers and indexes.
For a more concrete example, let’s do a Cats.csv file:
id,name,color,birthdate,owner_name,personality,favorite_toy
1,Nata,calico,5/25/2013,Nata Dowdell,Playful,Crinkle ball
2,Winfred,calico,3/30/2011,Winfred Grima,Lazy,Laser pointer
3,Cherish,calico,12/10/2004,Cherish Eadie,Lazy,Crinkle ball
4,Chelsy,orange,4/4/2010,Chelsy Gribben,Curious,Feather wand
5,Oswald,calico,1/12/2014,Oswald Rickard,Curious,Laser pointer
(You can follow the exact instructions, if you like, as per the book)
Since Glowdust supports structs, we can just define one that holds the data for each row:
create type Cat {
    id : int,
    name : string,
    color : string,
    birthdate : string,
    owner_name : string,
    personality : string,
    favorite_toy : string,
};
And now we just need the function that returns each row.
Which, you’ll note, is the actual problem we’re trying to solve.
Ready?
Here it is:
create function Cat_CSV() -> Cat
No domain. Yup, it’s pretty much just a stream - an iterator, if you don’t mind me saying so. There is no input value, just returned values, which apparently, under some weird set-theoretic arguments, is nevertheless a valid function.
“Weird set-theoretic arguments” sounds a lot like hand-waving, I’ll admit, but hey, I’ll grasp any straws I can to make this work, and to be honest, I am not after mathematical rigour as much as I am about functionality.
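If the “function with no domain” idea feels slippery, a rough analogy outside Glowdust may help. The Python sketch below is not how Glowdust implements anything; the `Cat` dataclass, the `cat_csv` function and the inline `CATS_CSV` text are stand-ins assumed purely for illustration. The function takes no arguments and simply yields one typed value per row.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator

# Inline copy of the first two rows of Cats.csv, so the sketch runs without a file.
CATS_CSV = """id,name,color,birthdate,owner_name,personality,favorite_toy
1,Nata,calico,5/25/2013,Nata Dowdell,Playful,Crinkle ball
2,Winfred,calico,3/30/2011,Winfred Grima,Lazy,Laser pointer
"""


@dataclass
class Cat:
    # Field names mirror the CSV header; only id is converted from text.
    id: int
    name: str
    color: str
    birthdate: str
    owner_name: str
    personality: str
    favorite_toy: str


def cat_csv() -> Iterator[Cat]:
    """No parameters (no domain): each call streams the rows as Cat values."""
    for row in csv.DictReader(io.StringIO(CATS_CSV)):
        yield Cat(**{**row, "id": int(row["id"])})
```

Calling `cat_csv()` hands back a fresh iterator every time, which is about as close as Python gets to “just returned values”.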
The point is, Cat_CSV() can be called in a match clause just like any function, and have joins, filters and so on applied to the return values:
match Cat_CSV() -> cat, {cat.id > 3}, return cat.name
<<
["Chelsy"]
["Oswald"]
Now, will this work with an ActivityPub stream? Or a Kafka topic? Or any other data stream?
Of course it will. There is nothing CSV-specific about this. No row ids or any other file-specific artifact.
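To see why nothing in the query depends on CSV, here is the same filter written as a Python comprehension over a plain iterator. This is only an analogy under assumed names: `cat_csv` is a hypothetical stand-in that yields the five example rows (trimmed to id and name), and it could just as well be reading from a socket or a topic.

```python
from typing import Iterator, NamedTuple


class Cat(NamedTuple):
    id: int
    name: str


def cat_csv() -> Iterator[Cat]:
    # Stand-in source: any iterator of Cat values would do here.
    yield from [
        Cat(1, "Nata"),
        Cat(2, "Winfred"),
        Cat(3, "Cherish"),
        Cat(4, "Chelsy"),
        Cat(5, "Oswald"),
    ]


# Rough equivalent of: match Cat_CSV() -> cat, {cat.id > 3}, return cat.name
names = [cat.name for cat in cat_csv() if cat.id > 3]
print(names)  # ['Chelsy', 'Oswald']
```

The comprehension only touches the yielded values, never a row number or a file path.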
In fact, to make this work, we still have to connect the Cat_CSV function to the backing CSV file.
Currently, this is achieved with a native function:
function_csv("Cat_CSV", "/home/glowdust/Cats.csv");
The function_csv() native Glowdust function accepts a function name and a path to a CSV file.
It then maps the rows of the file to the return type of the function and returns an iterator from it.
Iteration is the only thing it can do, of course. I presume there are ways to create indexing structures over CSV files, or check them for constraints, even write to them, but at that point we are talking about a specialized store.
The point is, by having a Glowdust function be effectively an iterator backed by a CSV store, we can deal with any streaming input.
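The binding step can be pictured with a small Python sketch. To be clear, this is a guess at the shape of the idea, not Glowdust’s internals: the `FUNCTIONS` registry and this Python `function_csv` are hypothetical, and the rows come back as plain dicts rather than typed structs.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, Iterator

# Hypothetical registry: function name -> zero-argument callable
# that returns a fresh iterator over the backing rows.
FUNCTIONS: Dict[str, Callable[[], Iterator[dict]]] = {}


def function_csv(name: str, path: str) -> None:
    """Bind `name` to the rows of the CSV file at `path` (iteration only)."""

    def scan() -> Iterator[dict]:
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)

    FUNCTIONS[name] = scan


# Usage with a throwaway file, so the sketch does not depend on a real path.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("id,name\n4,Chelsy\n5,Oswald\n")

function_csv("Cat_CSV", tmp.name)
rows = list(FUNCTIONS["Cat_CSV"]())
os.remove(tmp.name)
```

Note that the registry stores a callable rather than an open iterator, so every lookup of `Cat_CSV` starts reading from the top of the file again.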
We have a nice primitive here. How does it translate to stores?
Every store is a function
Every value store can be treated in exactly this way. For example, as I am working on the native LSM Tree store for Glowdust, the temporary code I write uses a function_lsm("Function", "/path/to/store") native function with exactly the same semantics. The difference is that because the LSM store knows about keys, I can have Function accept a domain like any other function.
That is exactly how the import example in the book works, the difference being that a function by default is stored in memory, without needing an explicit store assignment.
This could change, and in fact it would make sense to have an optional clause in every function definition:
create function Cat_CSV() -> Cat AS CSV@/path/to/file.csv
or similar syntax.
More generally, any store that can present an iterator over its contents can be mapped as a function in Glowdust. Obviously, to get the full functionality of the database it helps to have ids, indexes and support for writes, but any network stream can be brought into a Glowdust query just by using the appropriate backing store implementation.
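A small Python sketch of that layering, with the usual caveat that every name here is made up for illustration: `StreamStore` is an iteration-only store like the CSV case, and `KeyedStore` is a toy key-aware store backed by a dict (not a real LSM tree). The query function only asks for `scan()`, so it runs against both.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (id, name), an assumed row shape for this sketch


class StreamStore:
    """Iteration-only store, like the CSV case: no keys, no lookups."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterator[Row]:
        return iter(self._rows)


class KeyedStore(StreamStore):
    """Key-aware store (a dict here): adds point lookups on top of scan()."""

    def __init__(self, rows: Iterable[Row]) -> None:
        super().__init__(rows)
        self._by_id: Dict[int, str] = dict(self._rows)

    def get(self, key: int) -> Optional[str]:
        return self._by_id.get(key)


def names_after(store: StreamStore, min_id: int) -> List[str]:
    # The same query works on either store because it only needs scan().
    return [name for cat_id, name in store.scan() if cat_id > min_id]


rows = [(3, "Cherish"), (4, "Chelsy"), (5, "Oswald")]
stream, keyed = StreamStore(rows), KeyedStore(rows)
```

Only `keyed` can answer `keyed.get(4)`, which is the Python-flavoured version of a function that accepts a domain value.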
I wanted to have CSV support first, before any “native” store, precisely because it is limited in ways that can reveal kinks and edge cases in the data model.
I think I did OK, and I am not too chuffed about the “no domain values” thing. It may not be very pure, but it’s workable and makes sense in a lot of contexts.
So next up, it’s the LSM Tree implementation and transaction state.
As always, let me know your thoughts on Mastodon.
And, if you find this interesting enough, you may want to donate towards my costs as an independent developer.