Discussion
michaeljelly: Really cool
alex_hirner: What are your thoughts on eviction, re how easy to add some basic policy?
russellthehippo: Great question. I have some eviction functions in the Rust library; I don’t expose them through the extension/VFS yet. The open question is less “can I evict?” and more “when should eviction fire?” — via user action, via policy, or both.

The obvious policy-driven versions are things like:

- when cache size crosses a limit
- on checkpoint
- every N writes (kind of like autocheckpoint)
- after some idle / age threshold

My instinct is that for the workload I care about, the best answer is probably a hybrid. The VFS should have a tier-aware policy internally that users can configure, with separate policies for interior/index/data pages. But the user/application may still be in the best position to say “this tenant/session DB is cold now, evict aggressively.”
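To make the hybrid idea concrete, here is a minimal sketch of what a tier-aware policy could look like. All names (`TierPolicy`, `EvictionPolicy`, `should_evict`) and the specific thresholds are hypothetical, not part of the actual library:

```rust
// Hypothetical sketch: per-tier eviction thresholds that a user could
// configure, combined with a sweep-every-N-writes trigger.
enum PageTier { Interior, Index, Data }

struct TierPolicy {
    max_bytes: u64,     // evict when the tier's cached bytes exceed this
    max_idle_secs: u64, // evict pages idle longer than this
}

struct EvictionPolicy {
    interior: TierPolicy,
    index: TierPolicy,
    data: TierPolicy,
    writes_per_sweep: u64, // also sweep every N writes, like autocheckpoint
}

impl EvictionPolicy {
    fn tier(&self, t: PageTier) -> &TierPolicy {
        match t {
            PageTier::Interior => &self.interior,
            PageTier::Index => &self.index,
            PageTier::Data => &self.data,
        }
    }

    /// Policy-driven decision: is this page a candidate for eviction?
    fn should_evict(&self, t: PageTier, tier_bytes: u64, idle_secs: u64) -> bool {
        let p = self.tier(t);
        tier_bytes > p.max_bytes || idle_secs > p.max_idle_secs
    }
}

fn main() {
    // Illustrative config: keep interior/index pages around much longer
    // than data pages, since they are touched by almost every query.
    let policy = EvictionPolicy {
        interior: TierPolicy { max_bytes: 64 << 20,  max_idle_secs: 3600 },
        index:    TierPolicy { max_bytes: 32 << 20,  max_idle_secs: 1800 },
        data:     TierPolicy { max_bytes: 256 << 20, max_idle_secs: 300 },
        writes_per_sweep: 1000,
    };
    // A data page idle for 10 minutes goes, even in an under-budget tier...
    assert!(policy.should_evict(PageTier::Data, 1 << 20, 600));
    // ...while an interior page with the same idle time stays cached.
    assert!(!policy.should_evict(PageTier::Interior, 1 << 20, 600));
    println!("ok");
}
```

The "evict aggressively for this cold DB" case would then just be the application swapping in a policy with much lower thresholds, or calling the eviction entry point directly.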
inferense: very cool!
agosta: This is awesome! With all of the projects/teams working on improving sqlite, it feels like it's just a matter of time before it becomes a better default than postgres for serious projects.

I do wonder - for projects that do ultimately enforce single-writer sqlite setups - it still feels to me as if it would always be better to keep the sqlite db local (and then rsync/stream backups to whatever S3 storage one prefers).

The nut I've yet to see anyone crack on this setup is figuring out a way to achieve zero-downtime deploys. For instance, adding a persistent disk to VMs on Render prevents zero-downtime deploys (see https://render.com/docs/disks#disk-limitations-and-considera...) which is a really unfortunate side effect. I understand that the reason for this is that a VM instance is attached to the volume and needs to be swapped with the new version of said instance...

There are so many applications where merely scaling up a single VM as your product grows simplifies devops / product maintenance so much that it's a very compelling choice vs managing a cluster / separate db server. But getting forced downtime between releases to achieve that isn't acceptable in a lot of cases.

Not sure if it's truly a cheaply solvable problem. One potential option is to use a tool like turbolite as a parallel data store and, only during deployments, use it to keep the application running for the 10 to 60 seconds during a release swap. During this time, writes to the db are slower than usual but entirely online. And then, when the new deployment instance is fully online, it can sync the difference of data written to S3 back to the local db. In this way, during regular operation, we get the performance and benefits of local IO, and only during upgrades the persistence benefits of S3-backed sqlite.

Sounds like a fraught thing to try to set up. But man, it really is hard to beat the speed of local reads.
jijji: I wonder how much that costs per hour to run under any normal load? What benefit does this have versus using mysql (or any similar rdbms) for the queries? mysql/pgsql/etc are free, remember, while S3 obviously charges by the request - or am I wrong?
carlsverre: You might be interested in taking a look at Graft (https://graft.rs/). I have been iterating in this space for the last year, and have learned a lot about it. Graft has a slightly different set of goals, one of which is to keep writes fast and small and to optimize for partial replication. That said, Graft shares several design decisions, including the use of framed ZStd compression to store pages.

I do like the B-tree-aware grouping idea. This seems like a useful optimization for larger scan-style workloads. It helps eliminate the need to vacuum as much.

Have you considered doing other kinds of optimizations? Empty pages, free pages, etc.
russellthehippo: Very cool, thanks. I hadn’t seen Graft before, but it sounds pretty adjacent in a lot of interesting ways. I looked at the repo and will see what I can apply.

I've tried out all sorts of optimizations - for free pages, I've considered leaving empty space in each S3 object and serving that as free pages, to get efficient writes without shuffling pages too much. My current bias has been to over-store a little if it keeps the read path simpler, since the main goal so far has been making cold reads plausible rather than maximizing space efficiency. Especially because free pages compress well.

I have two related roadmap items: hole-punching and LSM-like writing. For local, non-HDD storage, we can evict empty pages automatically by releasing empty page space back to the OS. For writes, LSM is best because it groups related things together, which is what we need - but that would mean doing a lot of rewriting on checkpoint. So both of these feel a little premature to optimize for vs other things.
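The "leave empty space in each S3 object" idea can be sketched roughly like this. Everything here is illustrative - `PAGES_PER_OBJECT`, `SLACK_PAGES`, and the packing function are made-up names and sizes, not the project's actual layout:

```rust
// Hypothetical sketch: pack pages into fixed-size S3 objects while
// deliberately reserving slack slots that can later be handed out as
// free pages, trading a little over-storage for a simpler read path.
const PAGES_PER_OBJECT: usize = 16;
const SLACK_PAGES: usize = 4; // reserved per object for future free pages

struct S3Object {
    pages: Vec<u32>,   // page numbers stored in this object
    free_slots: usize, // reserved space servable as free pages later
}

fn pack(page_nos: &[u32]) -> Vec<S3Object> {
    page_nos
        .chunks(PAGES_PER_OBJECT - SLACK_PAGES)
        .map(|chunk| S3Object {
            pages: chunk.to_vec(),
            free_slots: PAGES_PER_OBJECT - chunk.len(),
        })
        .collect()
}

fn main() {
    // 30 pages at 12 data pages per object -> 3 objects, each with slack
    let objs = pack(&(1..=30).collect::<Vec<u32>>());
    assert_eq!(objs.len(), 3);
    assert!(objs.iter().all(|o| o.free_slots >= SLACK_PAGES));
    println!("{} objects", objs.len());
}
```

Since free pages are mostly zeroes and compress extremely well, the slack costs very little on the wire, which is why over-storing can beat shuffling pages between objects.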
carlsverre: Both of those roadmap items make sense! Excited to see how you evolve this project!