Discussion
Do You Even Need a Database?
z3ugma: At some point, don't you just end up making a low-quality, poorly-tested reinvention of SQLite by doing this and adding features?
freedomben: Sometimes yes, I've seen it. It even tends to happen on NoSQL databases as well. Three times I've seen apps start on top of DynamoDB and then end up re-implementing relational databases at the application level anyway. Starting with Postgres would have been the right answer for all three of those. Initial dev went faster, but tech debt and complexity quickly started soaking up all those gains and left a hard-to-maintain mess.
gorjusborg: Only if you get there and need it.
upmostly: Exactly. And most apps don't get there and therefore don't need it.
fatih-erikli-cg: I agree. Databases are useless. You don't even need to load the data into memory; reading it from disk whenever there is a need to read something should be OK. I don't buy the case that there are billions of records so the database must be something optimized for handling them. That volume of records is most likely something like access logs, and I think those should not be stored at all in such a case. Even if it's Postgres, it is still a file on disk. If there is a need for something like partitioning the data, it is much easier to write the code that partitions the data yourself.
fifilura: Isn't this the same case the NoSQL movement made?
t3​12227: hello, as always: imho (!) afaik: parts (!) of the NoSQL movement made that claim. we have to remember: there are roughly 4 different types of so-called NoSQL db-types which are somewhat "mingled" together in this term. so ... yes ... i'd say: (JSON-based) document DBs a la MongoDB fit that claim. others are (much) more about different data structures and query languages. just my 0.02€
ForHackernews: Surprised to see this beating SQLite after previously reading https://sqlite.org/fasterthanfs.html
z3ugma: but it's so trivial to drop SQLite into almost any app or language... there are plenty of ORMs to do the joins if you don't like working with SQL directly... the B-trees are built in and you don't need to reason about binary search, and your app doesn't have 300% test coverage with fuzzing like SQLite does. You should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups.
moron4hire: Came here to also throw in a vote for it being so much easier to just use SQLite. You get so much for so very little. There might be a one-time up-front learning effort for tweaking settings, but that is a lot less effort than what you're going to spend on fiddling with stupid issues with data files all day, every day, for the rest of the life of your project.
randusername: Separate from performance, I feel like databases are a sub-specialty that has its own cognitive load. I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling. I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
jwitchel: This is a great, incredibly well-written piece. Nice work showing the under-the-hood build-up of how a DB works. It makes you think.
bsenftner: I worked as a software engineer for 30 years before being forced to use a database, and that was for a web site. I've been coding actively, daily, since the '70s. Forever we just wrote proprietary files to disk, and that was the norm for decades. Many a new developer can't even imagine writing their own proprietary file formats; the idea literally scares them. The engineers produced today are a shadow of what they used to be.
evanelias: Your article completely ignores operational considerations: backups, schema changes, replication/HA. As well as security, i.e. your application has full permissions to completely destroy your data file. Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.
kabir_daki: We built a PDF processing tool and faced this exact question early on. For our use case — merge, split, compress — we went fully stateless. Files are processed in memory and never stored. No database needed at all.

The only time a database becomes necessary is when you need user accounts, history, or async jobs for large files. For simple tools, a database is often just added complexity.

The real question isn't "do you need a database" but "do you need state" — and often the answer is no.
anonymars: Yeah, it scares me because I'm experienced enough to know all the difficulties involved in keeping durable data consistent, correct, and performant
forinti: Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.

As the years went by, I expected the client to move to something better, but he just stuck with it until he died, after about 20 years; then the family took over and had everything redone (it now runs WordPress).

The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.
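For anyone curious what the file-backed-hash approach looks like today: Perl's tied DBM hashes map almost directly onto Python's stdlib `shelve`. A minimal sketch (file and key names invented for illustration, not forinti's actual code):

```python
import shelve

# One file-backed hash per concern, as in the Perl setup described above:
# keys and values persist to a file on disk with no database server involved.
with shelve.open("orders") as orders:
    orders["1001"] = {"customer": "alice", "total": 42.50}
    orders["1002"] = {"customer": "bob", "total": 13.75}

# Reopening the shelf later reads straight from the file.
with shelve.open("orders") as orders:
    print(orders["1001"]["customer"])  # alice
    print(len(orders))                 # 2
```

Values are pickled behind the scenes, so arbitrary Python objects work, which is both the convenience and the lock-in of this approach.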
da02: What type of product or service were they selling?
9rx: > and your app doesn't have 300% test coverage with fuzzing like SQLite does

Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
RL2024: That's true, but relying on a highly tested component like SQLite means that you can focus your tests on the interface and your business logic, i.e. you can test that you are persisting to your datastore rather than testing that your datastore implementation is valid.
9rx: Your business logic tests will already, by osmosis, exercise the backing data store in every conceivable way to the fundamental extent that is possible with testing given finite time. If that's not the case, your business logic tests have cases that have been overlooked. Choosing SQLite does mean that it will also be tested for code paths that your application will never touch, but who cares about that? It makes no difference if code that is never executed is theoretically buggy.
wmanley: Business logic tests will rarely test what happens to your data if a machine loses power.
9rx: Then your business logic contains unspecified behaviour. Maybe you have a business situation where power loss conditions being unspecified is perfectly acceptable, but if that is so it doesn't really matter what happens to your backing data store either.
bevr1337: > The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

This is a solid takeaway and applies to a lot of domains. Great observation.
forinti: A calendar for cutting your hair according to the phases of the moon.
da02: Sounds like a tough business. The profit margins must have been razor thin.
vovanidze: people wildly underestimate the os page cache and modern nvme drives tbh. disk io today is basically ram speeds from 10 years ago. seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.

a jsonl file and a single go binary will literally outlive most startup runways.

also, the irony of a database gui company writing a post about how you don't actually need a database is pretty based.
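For what it's worth, the jsonl-file approach is about as small as a persistence layer gets. A hypothetical sketch (file name and record shape are made up), shown in Python rather than Go for brevity:

```python
import json

def append_record(path: str, record: dict) -> None:
    # Append-only JSONL: one JSON object per line. Appends from a single
    # process land sequentially, and a torn final line is easy to detect
    # and discard on the next read.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_records(path: str) -> list[dict]:
    # Re-read the whole file; perfectly fine at waitlist scale.
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The OS page cache keeps the whole file hot after the first read, which is the point being made above.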
upmostly: The irony isn’t lost on us, trust me. We spent a while debating whether to even publish this one.But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.
vovanidze: props for actually publishing it tbh. transparent engineering takes are so rare now, usually it's just seo fluff.

we've basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.

mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.
upmostly: I feel like I could write another post: "Do you even need serverless/cloud?" We've also been brainwashed into thinking we need to spend hundreds or thousands a month on AWS when a tiny VPS will do.

Similar sentiment.
hooverd: Serverless is cheap as hell at low volumes. Your tiny VPS can't scale to zero. If you're doing sustained traffic, your tiny VPS might win, though. The real value in the cloud is turning capex spend into opex spend. You don't have to wait weeks or months to requisition equipment.
swiftcoder: I feel like someone who works for a DB company ought to mention at least some of the pitfalls in file-based backing stores (data loss due to crashes, file truncation, fsync weirdness, etc)
bachmeier: Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.
locknitpicker: > Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.Why waste time screwing around with ad-hoc file reads, then?I mean, what exactly are you buying by rolling your own?
bachmeier: You can avoid the overhead of working with the database. If you want to work with json data and prefer the advantages of text files, this solution will be better when you're starting out. I'm not going to argue in favor of a particular solution because that depends on what you're doing. One could turn the question around and ask what's special about SQLite.
ablob: So you trade the overhead of SQL with the overhead of JSON?
locknitpicker: > You can avoid the overhead of working with the database.What overhead?SQLite is literally more performant than fread/fwrite.
leafarlua: This always confuses me, because we have decades of SQL and all its issues as well. Hundreds of experienced devs talking about all the issues in SQL and the quirks of queries when your data is not trivial.

One would think that for a startup of sorts, where things change fast and are unpredictable, NoSQL is the correct answer, and when things are stable and the shape of entities is known, going for SQL becomes a natural path. There are also cases for having both, and there are cases for graph-oriented databases or even columnar ones such as DuckDB.

Seems to me, with my very limited experience of course, that everything leads to the same boring fundamental issue: rarely does the issue lie in infrastructure; it's mostly bad design decisions and poor domain knowledge. Realistically, how many times is the bottleneck really the type of database versus the quality of the code and the implementation of the system design?
mike_hearn: Disclaimer: I work part time on the DB team.You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
locknitpicker: > weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.

I don't think it makes any sense to presume everyone around you is brainwashed and you are the only soul cursed with reasoning powers. Might it be possible that "we" are actually able to analyse tradeoffs and understand the value of, say, having complete control over deployments with out-of-the-box support for things like deployment history, observability, rollback control, and infrastructure as code? Or is it brainwashing?

Let's put your claim to the test: if you believe only brainwashed people could see value in things like SQLite or Kubernetes, what do you believe are reasonable choices for production environments?
koliber: I love this article as it shows how fast computers really are.

There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."

One of the constraints is "Multiple processes need to write at the same time." It turns out many early-stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.

So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And SQLite is a very sane choice. If you need scale, cache the most often accessed data in memory and you have the best of both worlds.

My winning combo is SQLite + an in-memory cache.
upmostly: Seeing the Rust 1M benches was an amazing reminder of how fast stuff really is.
randyrand: “You Aren’t Gonna Need It” - one of the most important software principles.

Wait until you actually need it.
upmostly: 100%. Premature optimisation, I believe that's called. I've seen it play out many times in engineering over the years.
tracker1: I'd argue for using LevelDB or similar if I just wanted to store arbitrary data based on a single indexable value like TFA. That said, I'd probably just default to SQLite myself since the access, backup, restore patterns are relatively well known and that you can port/grow your access via service layers that include Turso or Cloudflare D1, etc.
alexisread: Good points, but Postgres has all those, along with a much better local testing story, easier and more reliable CDC, better UDFs (in Python, Go, etc.), a huge ecosystem of extensions for e.g. GIS data, no licencing issues ever, API compatibility with DuckDB, Doris and other DBs, and (this is the big one) is not Oracle.
tracker1: I generally limit Oracle to situations where you're in a position to have a team dedicated to the design, deployment and management of just database operations. I'm not really a fan of Oracle in general, but if you're in a position to spend upwards of $1m/yr or more on dedicated DB staff, then it's probably worth considering. Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.
mike_hearn: That was true years ago but these days there's the autonomous database offering, where DB operations are almost all automated. You can rent them in the cloud and you just get the connection strings/wallet and go. Examples of stuff it automates: backups, scaling up/down, (as mentioned) adding indexes automatically, query plan A/B testing to catch bad replans, you can pin plans if you need to, rolling upgrades without downtime, automated application of security patches (if you want that), etc.So yeah running a relational DB used to be quite high effort but it got a lot better over time.
hackingonempty: Probably more like a low-quality, poorly-tested reinvention of BerkeleyDB.
stackskipton: SRE here. The "huh, neat" side of my brain is very interested. The SRE side of my brain is screaming "GOD NO, PLEASE NO."

Overhead in any project is understanding it and onboarding new people to it. Keeping to the "mainline" path is key to lowering friction here. All 3 languages have well-supported ORMs that support SQLite.
tracker1: I'm mostly with you here... it's amazing how many devs don't have the baseline knowledge to understand file IO, let alone thin abstractions for custom data and indexing like TFA's. Then again, most devs also don't understand the impacts of database normalization under load either.
ghc: I'm so old I remember working on databases that were designed to use RAW, not files. I'm betting some databases still do, but probably only for mainframe systems nowadays.
bob1029: https://docs.oracle.com/cd/B16276_01/doc/win.102/b14305/arch...
ghc: > Oracle® Database Platform Guide 10g Release 2 (10.2) for Microsoft Windows Itanium (64-Bit)

Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.

I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge Oracle cluster that still employed physical database admins and logical database admins as separate roles... I would bet they're still paying millions for an out-of-date version of Oracle and using RAW.
shafoshaf: Relational Databases Aren’t Dinosaurs, They’re Sharks. https://www.simplethread.com/relational-databases-arent-dino...

The very small bonus you get on small apps is hardly worth the time you spend reinventing the wheel.
adrian_b: Sharks vs. dinosaurs does seem an appropriate metaphor.

During the Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today; e.g., there were big sharks that differed very little from the white sharks and tiger sharks of today.

Then the dinosaurs disappeared, together with the pterosaurs and the mosasaurs, and they were replaced by other animals, but the sharks have continued to live until today with little change, because they had already reached an optimized design that was hard to improve.

Besides the sharks, during the Cretaceous there already existed alongside the dinosaurs 2 other groups of big predators that have changed little since then: crocodiles and big constrictor snakes similar to the pythons of today.

Therefore all 3 (sharks, crocodiles and big constrictor snakes) are examples of locally optimal designs that were reached more than 70 million years ago.
pseudosavant: SQLite has become my new go-to when starting any project that needs a DB. The performance is very fast, and if anything is ever successful enough to outgrow SQLite, it wouldn't be that hard to switch it out for Postgres. Not having to maintain/backup/manage a separate database server is cheaper and easier.
noveltyaccount: As soon as you need to do a JOIN, you're either rewriting a database or replatforming on SQLite.
goerch: a) Just heard today: JOINs are bad for performance b) How many columns can (an Excel) table have: no need for JOINs
traderj0e: It's a good article with interesting benchmarks, but the title is clickbait. "Database" = NoSQL here. Sure you can decently DIY a KV store, but it's not even comparable to SQLite.
rasengan: Sounds like a good way to waste the only scarce resource: time.
tracker1: At that point, you can say the same for PostgreSQL, which is more broadly supported across all major and minor cloud platforms with similar features, and I'm assuming a lower cost and barrier to entry. This is without signing with Oracle, Inc... which tends to bring a lot of lock-in behaviors along with those feature sets.

TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes. My experiences with the corporate entity itself leave a lot to be desired, let alone that just getting set up and started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get set up, but more recently has had a lot of difficulties, in particular with docker/dev instances, and more under ARM (Mac) than alternatives.

I'm a pretty big fan of PG, which is, again, very widely available and supported.
mike_hearn: I think PG doesn't have most of the features I named; I'm pretty sure it doesn't have integrated queues, for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system). But also, bear in mind the "Postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.

Local dev is the same as other DBs:

    docker run -d --name <oracle-db> container-registry.oracle.com/database/free:latest

See https://container-registry.oracle.com

Works on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.
AlotOfReading: There's plenty of middle ground between an unchanging SQL schema and the implicit schemas of "schemaless" databases. You can have completely fluid schemas with the full power of relational algebra (e.g. untyped datalog). You shouldn't be using NoSQL just because you want to easily change schemas.
throwway120385: I dunno. Even in embedded systems, every time I've started without a database I've eventually come to need something like one, and in every case I've found myself building an essentially ad hoc, poorly managed database into the application, including marshalling/unmarshalling, file management, notification, and so on, because each new feature over the top of regular files was just that much easier to add versus switching to a database system.

However, the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely reinvent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.

And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data, but now you have two copies of everything, and you have to somehow make the updates atomic so that either copy is consistent for a point in time. Why not use a database?

Knowing what I know today, I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification, because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run into without them.
freedomben: I avoided DBs like the plague early in my career, in favor of serialized formats on disk. I still think there's a lot of merit to that, but at this point in my career I see a lot more use cases for SQLite and the relational features it comes with. At the least, I've spent a lot less time chasing down data corruption bugs since changing philosophy.

Now that said, if there's value to the "database" being human readable/editable, JSON is still well worth a consideration. Dealing with even SQLite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
giva: > Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.

How? With SQL it's super easy to search, compare, and update data. That's what it's built for.
freedomben: Pain in the ass was way too strong, I retract that. Mainly I meant relative. For example `nvim <filename>.json` and then /search for what I want, versus tracking down the sqlite file, opening, examining the schema, figuring out where the most likely place is that I care about, writing a SQL statement to query, etc.
trgn: im sure, but honestly, i would love to have a db engine that just writes/reads csv or json. does it exist?
tonymet: If the cloud is just someone else’s hard disks (etc) then RDBMS is just someone else’s btree
oliviergg: Please … Every few years the pendulum swings. First it was "relational databases are too rigid, just use NoSQL." Then "NoSQL is a mess, just go back to Postgres." Now: "do you even need a database at all, just use flat files." Each wave is partially right. But each wave is about to rediscover, the hard way, exactly why the previous generation made the choices they did. SQLite is the answer to every painful lesson learned, every scar from a long debugging night the last time someone thought "a JSON file is basically a database."
goerch: Yes, but you are probably a bit too polite. And I'm not sure how to do justice to SQLite, Postgres and my new favourite toy DuckDB.
marcosdumay: No, when things change fast and unpredictably, NoSQL is worse than when they are well known and stable.

NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard-to-do data migration tasks for an insurmountable mess of data inconsistency bugs that you'll never actually get to the end of.

> is mostly bad design decisions and poor domain knowledge

Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
leafarlua: Makes sense. But in that case, why does NoSQL exist? What problems does it solve, and when should it be considered? I'm being naive, but a fast-changing environment has been one of the main advantages I was taught by devs when it comes to NoSQL vs SQL (NoSQL being the choice for flexible schemas). So is it more about BASE vs ACID?
debo_: Michael Stonebraker used to write long, scathing critiques of modern data storage/retrieval fads, and how they were forgetting important historical lessons.They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!
tracker1: I think part of it is the scale in terms of the past decade and a half... The hardware and vertical scale you could get in 2010 are dramatically different than today.

A lot of the bespoke NoSQL data stores really started to come to the forefront around 2010 or so. At that time, having 8 CPU cores and 10k RPM SAS spinning drives was a high-end server. Today, we have well over 100 cores, with TBs of RAM and PCIe Gen 4/5 NVMe storage (u.x) that is thousands of times faster, at a total cost lower than those 2010-era servers, which your average laptop can outclass today.

You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB, where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.

There's also the consideration of specialized or alternative databases where data is echoed for the purposes of logging, metrics or reporting. Not to mention certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
leafarlua: What about the microservices/serverless-functions world? This was another common topic over the years: that using SQL with this type of system was not optimal, I believe the issue being the number of connections to the SQL database and such.
tracker1: I think a lot of the deference to microservices/serverless is for similar reasons... you can work around some of this if you use a connection proxy, which is pretty common for PostgreSQL.

That said, I've leaned into avoiding breaking things up into lots of microservices unless/until you need them... I'm also not opposed to combining CQRS-style workflows if/when you do need microservices. Usually if you need them, you're breaking off certain compute/logic workflows first, where the async/queued nature lends itself to your needs. My limited experience with a heavily microserviced application combined with GraphQL was somewhat painful, in that the infrastructure and orchestration weren't appropriately backed by dedicated teams, leading to excess complexity and job duties for a project that would have scaled just fine in a more monolithic approach.

YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection-sharing heuristics, depending again on your infrastructure and needs... I've got worker pools that mostly operate off a queue, perform heavy compute loads, then interact with the same API service(s) as everything else.
OtomotO: If you have an option, never ever use Oracle!Never!
datadrivenangel: But if you have state, and that state needs to persist between program executions, then for the love of resiliency and all that is robust, consider using a SQL database!
ozgrakkurt: You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.

This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. RocksDB on Windows a couple of years back is a simple example: it regularly had corruption issues during development.
creatonez: For the simple case, it isn't necessarily that fragile. Write the entire database to a temp file, then after flushing, move the temp file to overwrite the old file. All Unix filesystems will ensure the move operation is atomic. Lots of "we dump a bunch of JSON to the disk" use cases could be much more stable if they just did this.

Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.
jeffffff: don't forget to fsync the file before the rename! and you also need to fsync the directory after the rename!
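Sketched out, the full write-temp-fsync-rename dance looks roughly like this. A sketch assuming POSIX semantics (the directory fsync in particular may not work on all platforms, and error handling is trimmed):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # contents durable on disk before the rename
        os.replace(tmp_path, path)  # atomic replacement on POSIX filesystems
        # fsync the directory so the rename itself survives a power loss.
        dir_fd = os.open(dir_name, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    except BaseException:
        # Best-effort cleanup if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers then see either the complete old file or the complete new one, never a torn write - which is exactly the failure mode being discussed above.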
james_marks: If the argument for NoSQL is, “we don’t know what our schema is going to be”, stop.Stop and go ask more questions until you have a better understanding of the problem.
jampekka: Oftentimes better understanding of the problem needs trying out solutions. Armchair architectures tend to blow up in contact with reality.
datadrivenangel: vlookups are bad for performance. recursive vlookups even more so.
marcosdumay: NoSQL was created to deal with scales where ACID becomes a bottleneck. It has also shown itself useful for dealing with data that doesn't actually have a schema. If you have either of those problems, you will know it very clearly. Also, ironically, Postgres became one of the most scalable NoSQL databases out there, and one of the most flexible for unstructured data too.
freedomben: Agreed. In my experience (YMMV), there was also a real adoption push in the js world from primarily front-end people that wanted to do some backend but didn't want to learn/deal with SQL databases. I don't say that with malice, I was also on-board the NoSQL train for a bit before I actually gained experience with the headaches it caused. The appeal of "just dump your JSON blob straight in" was (and still is) strong. Software is all about learning, and sometimes that learning is expensive. We've all built something we later regretted.
FpUser: I think this whole article and post is an attention/points-seeking exercise. It is hard to imagine a programmer who would not know the difference between a DBMS and just a bunch of files, and when to use which.
noselasd: Yes, the code in the article will at one unlucky point end up with an empty file after a power outage. At least write to a temp file (in the same filesystem), fsync the file and its folder, and rename it over the original.
theshrike79: I have a vague recollection that 4chan (at least at one point) didn't use any kind of backend database; they just rewrote the static pages with new content and that was it. That's why it could handle massive traffic with very few issues.
jerf: The one that gets me a lot, which is similar in practice to your point, is when I need server redundancy, even if one server is otherwise plenty for my task. As soon as I'm not running in one place, you need network data storage, and that kicks pretty hard in the direction of a network-accessible database. S3 works sometimes and the recent work on being able to atomically claim files has helped with some of the worst rough edges but it still doesn't take a lot to disqualify it, at least as the only store.
Sharlin: Not sure if sarcastic…
fatih-erikli-cg: It isn't sarcasm. I haven't really found a case where a database with its own query language, like SQL, is needed. It won't be different from storing a JSON file and filtering the content with a for loop; the dev (e.g. me) will be returning JSON on a REST API at the end. A query language may be a good thing if you are working in a team, that's it. SQL indeed isn't a good thing.
akdev1l: SQLite can do it
trgn: its storage file is a csv? or do you mean import/export to csv?
gf000: Probably the best use case would be something like a Facebook profile page for a given user. It may not have a very rigid schema; you may later add several other optional fields. You need very large scale (as in number of concurrent accesses), and you want to shard the data by e.g. location. But also, the data is not "critical": your high school not being visible temporarily for certain users is not an issue. You mostly use the whole dataset "at the same time"; you don't do a lot of WHERE or JOIN on some nested value. In every other case I would rather reach for Postgres with a JSONB column.
orthogonal_cube: SQLite did decently well but I think they should’ve done an additional benchmark with the database loaded completely into memory.Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
allthetime: Funny, I was just hard at work on my new article “Just use a database”.
gavinray: > Doing atomic writes is extremely fragile if you are just on top of the filesystem. This is not true, at least on Linux: pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC). The requirements are that the write must be block-aligned and no larger than the underlying FS's guaranteed atomic write size.
tclancy: Sure, but how many people using files as a data store even know to worry about atomicity?
sgarland: They will learn eventually, and then they’ll get to write a blogpost describing something that any sysadmin or kernel dev could’ve told them. Win-win!
sgarland: Unless I’ve missed something, Postgres doesn’t have automatic index creation, nor does it have JSON introspection to automatically convert it to a normalized schema (which is insane; I love it). It also doesn’t do any kind of sharding on its own, though of course forks like Citus exist. It definitely doesn’t do RAC / Exadata (not sure which part this falls under), where multiple nodes are connected and use RDMA to treat a bunch of SSDs as local storage.I love Postgres, and am not a huge fan of Oracle as a corporation, but I can’t deny that their RDBMS has some truly astounding capabilities.
dkarl: Honestly, at this point, if I had a design that required making atomic changes to files, I'd redo the design to use SQLite. The other way around sounds crazy to me."Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
ricardobeat: > They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.This, 100%. Development today is driven by appearances though, you can take advantage of that. Give it a cute name, make sure you have AI generate an emoji-rich README for it, publish it as an open source npm package, then trigger CI a few thousand times to get a pretty download count. They will happily continue using it without fear!
pythonaut_16: Heuristically they'd be right to say that though.If you start a new job and on your first day they go "Yeah the last guy said we don't need a database, so he rolled his own." are you gonna be excited, or sweating?Exception being perhaps "The last team chose to build their own data layer, and here's the decision log and architecture docs proving why it was needed."
Sharlin: Um, so your use cases are extremely narrow and limited. That's a failure of imagination and an astonishing lack of understanding of real-world computer systems if you cannot understand why people need both the power of SQL and the performance of RDBMSs.
devilsdata: This is a cool exercise, but I would hesitate to choose files over SQLite or another Dockerised relational database in production. They are overoptimising for the simplest part of writing the application: the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about, perhaps because they haven't experienced them yet. See: https://danluu.com/file-consistency/ What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs? Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long run? The answer is likely very few. Therefore, this is likely doubling up the work required. There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
bpavuk: and then there is a decent amount of software that's mostly "one and done" and has immense performance constraints - games and anything that has to do with real-time. for game engines, a custom data format from the very start makes a lot of sense because your budget is usually less than 17ms and 8 threads on low-end hardware and 8.(3)ms across 16 threads on high-end. there, "smart data structures and dumb code beat dumb data structures and smart algorithms" couldn't be more true.yet, for a generic app or server, just don't fuck your brains and go with SQLite
hunterpayne: Serious question, why are people here acting as if formatted files are somehow more reliable than a DB? That just simply isn't true. For most of software development's history, using flat files for persistence of data was the wrong thing to do with good reason. Flat files can easily be corrupted, and that happens much more often than a DB gets corrupted. The reason you might think otherwise is just sampling bias.
btilly: I do believe that you are missing a healthy dose of sarcasm. Such as faking downloads to give yourself inflated statistics so that your employer will trust untested and AI-written garbage.That said, there really are good use cases for readable formatted files. For example configuration files that are checked into source control are far more trackable than a SQLite database for the same purpose. For another example, the files are convenient for a lot of data transfer purposes.But for updateable data? You need a really good reason not to simply use a database. I've encountered such edge cases. But I've encountered a lot more people who thought that they had an edge case, than really did.
janalsncm: I have come at it from another angle, so maybe it’s useful perspective.A lot of data pipeline work can function pretty well with each step producing artifacts that the next can consume. If you need unique items, use uuids. If you want to check whether an item exists, use a hash of something (maybe a url). And each day goes in its own subdirectory. It works.Sometimes, though, you end up reinventing database constructs. Joins, indexing, versioning. A little of that is ok, but it gets gnarly fast. If you mess up, it could mean corrupting your data in a way that’s irreversible. Or, corrupting your data in a way that’s reversible but really hard.So a database has the benefit of enforcing consistency that a glob over json files doesn’t.
hunterpayne: "All Unix filesystems will ensure the move operation is atomic." This is false, but most filesystems will. However, there are a lot of fs calls you have to make, which you probably don't know about, to make the fs operations atomic. PS: The way you propose is probably the hardest way to do an atomic FS operation. It will have the highest probability of failure and the longest period of operations and service disruption. There is good reason we move rows one at a time or in batches sized to match OS buffers.
hunterpayne: "NoSQL is the correct answer."No, no it isn't. It never is. Just as building your house on a rubber foundation isn't the correct answer either. This is just cope. Unless your use cases don't care about losing data or data corruption at all, NoSQL isn't the correct answer.
nine_k: In short, once you need reliability, your complexity necessarily grows due to the redundancy and failover you need to introduce.If your downtime does not cost much, you can host many things on a single tiny computer.
briandw: The article is fine, but I wanted to call this out."Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open()."Technically not true. Applications like SQLite use mmap to map the file into a locally addressable memory space. This lets you skip the syscalls when reading and writing. The kernel can map that data in dynamically much faster than a userland process can.Later in the article they go over the process of reading in the entire file into memory, again mmap is much better at this. Would have been nice to see that approach used.
fg137: The article definitely oversimplifies the IO happening in a database.That said, depending who you are talking to, they may not agree with you on "mmap is much better than this". Some people will say you should do what you need in the application logic instead of depending on APIs from the OS. (although not necessarily for the specific example here)
hunterpayne: Wow, I'm sorry you have to work with such coworkers. For reference, joins are just an expensive use case. DBs do them about 10x faster than you can do them by hand. But if you need a join, you should probably either a) do it periodically and cache the result (making your data inconsistent) or b) just do it in a DB. Confusing caching the result with doing the join efficiently is an amazing misunderstanding of basic Computer Science.
creatonez: Correct, the file must be flushed to disk! I'm so used to libraries handling it that I forgot to mention that.I believe syncing the parent directory is only needed for durability and not atomicity, but if you're using this method you're probably not caring about durability in the first place.
yabutlivnWoods: linux + ext4 has atomic single and multi block writes as of 2025:https://docs.kernel.org/filesystems/ext4/atomic_writes.html
a34729t: I sympathize with this so hard. I frequently conduct system design interviews where the problem could easily be handled on a single machine with a flat file, let alone SQLite. Only the rare candidate mentions this; mostly I get a horde of microservices and queues and massive distributed databases that are totally unnecessary.
pdimitar: I mostly agree but I had plenty of cases where the project / team was forced to reinvent a query engine over flat files and/or in-memory caches.From that POV I moved from the extreme of "you don't need a state at all (and hence no database)" to the bit more moderate "you usually don't need a file or a DB but you almost certainly will want to query whatever state you store so just get a DB early".I strongly sympathize with "no microservices" of course. That's an overkill for at least 99% of all projects I've ever seen (was a contractor for a long time). But state + querying is an emergent property as a lot of projects move beyond the prototype phase.
fatih-erikli-cg: PostGIS is an extension of PostgreSQL. It claims to have some geography features. I think it is not really related to databases. It brings only a function (HAVERSINE) that gets the distance between two points on earth. It is a couple of lines of code. It is not really a software project, but a detail about how earth coordinates are calculated, and I think it is a totally made-up story. The real computed thing works like the math.sqrt function.
thutch76: I love reading posts like these. I will still reach for a database 99% of the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data: extracting insights, graphing trends, etc. It's not high-volume data, so I decided to use just the file system, with data backed by YAML files and some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume. In this particular case, having something that was human-readable, and more importantly diffable, was more valuable to me than outright performance. Having said that, I will still gladly reach for a database, with a query language and all the guarantees that come with it, 99% of the time.
mamcx: > people wildly underestimate the os page cache and modern nvme drives. And worse, overestimate how safe their data is! All this fancy stuff about not using a RDBMS could have been true only if the APIs and actual implementations across ALL of the IO path were robust and RELIABLE. But they are not! EVERY LAYER LIES. ALL of them. ALL OF THE TIME. That is the biggest reason building a real database (whatever the flavor) is hard: there is no way to avoid paying performance taxes all over the place, because you can't trust the IO, and having a single file (or a few files) getting hammered over and over makes this painfully obvious. One of the most sobering experiences is writing your IO with all the care in the world, letting your brand-new DB run for hours on good, great hardware, and finding in less than a week that it breaks in funny ways. P.S.: I was part of a team building a db.
perrygeo: Let's put it this way. I always end up needing the functionality and ACID guarantees of a database. I always wish I had a database. But some times I'm forced to use the project's legacy data stores (often flat-file data lakes) and watch every wheel get reinvented as we struggle to glue consistency, transactions, a bespoke query language, etc. onto an unwilling pile of unstructured data.
WatchDog: The sqlite benchmark is not very optimised. Just adding db.SetMaxOpenConns(runtime.NumCPU()) and db.SetMaxIdleConns(runtime.NumCPU()) made the performance on my machine go from 27,700.99 r/s to 89,687.36 r/s. I also tried making the get-user query a prepared statement, and storing the timestamp as a unix timestamp integer, but that didn't make much difference for me.
mikenikles: Of course it is :).
PaulDavisThe1st: > The kernel can map that data in dynamically much faster than a userland process can. Not necessarily. The kernel's mmap implementation has quite a strong bias towards certain kinds of access patterns; deviate from them and it can become slower than read(2). We tried using mmap(2) for audio file I/O in Ardour, and it got notably less bandwidth than just using read(2).
debatem1: I'm curious if you tried different madvise strategies and if any of them worked better than others?
tnelsond4: I love SQLite, I love the idea, I love having something mature and lightweight, but like the author I discovered it's overkill, or dare I say insufficient, for certain use cases. I was building a client-side dictionary app with search functionality and thought that using SQLite's WASM port would be the perfect solution. I used SQLite for a couple of years and it started to wear on me. The database files were bigger than they should be, they didn't compress super well, and the loading time was a bit slow. Linear search wasn't very fast either. Editing SQLite files and doing joins and such was slow and frustrating as well. I wanted something simpler. I didn't need atomicity, I didn't need writes, so I just hand-rolled a set of indexes from the source TSV files, compressed those with zstd, and decompressed them via WASM on every load. Decompressing and loading was now faster than straight loading via SQLite, and since my module was 52kb of WASM instead of 800kb for SQLite, I could load as many instances of the module as I wanted with no regrets. I use stringzilla for linear scans and it's ridiculously fast. SQLite is great, but it's not the solution for every problem.
jfreds: The author of stringzilla, Ash Vardanian, has a bunch of really cool talks/lectures/demos. Highly recommend checking him out if you like that kind of thing :)
tnelsond4: Yeah, I probably should since I want to add glob or regex support later.It's a little out of the scope of stringzilla but I should still be able to use it for the heavy lifting
ryang2718: Yeah, files as a database is fun, but I find you ultimately reinvent the wheel when sqlite is pretty battle tested, free and easier to get right or scale up.I can't talk though because I actually find myself doing this a lot
keithnz: so the article benchmarks speed, but I don't really think most people are making the choice based on speed. Robustness is really the driving force (well at least for me). So for CRUD operations I have way more confidence with something like SQLite than using files. Files are workable in many situations, but much easier if you default to things like SQLite and use files only when you think there is a specific advantage.
amarant: It's like the old words of wisdom say, the fastest way to manage your data is to pipe it to /dev/null
inasio: There's a whole thing these days about building solvers (e.g. SAT or Ising) out of exotic hardware that does compute in memory. A while back I wondered if one could leverage distributed DB logic to build solvers for massive problems: something like compute in the DB.
thegdsks: For anything that starts as a side project, SQLite covers it. The moment you need to ask this question the answer is usually "not yet." Ship first, migrate when it actually hurts
Waterluvian: My personal toolbox’s section for databases has three items: files (usually JSON), SQLite, Postgres. I haven’t and I doubt I will ever run into a situation where none of those fit. They exist, I’m sure, but I’m too general purpose to ever need anything else.
PaulDavisThe1st: I don't recall - it was several years ago. But glancing through what is left of the test program, it seems likely that we did not.
kantselovich: If there is no database, where should one persist state? Most software is stateful and needs to persist state across restarts, so I would argue that one needs at least SQLite. On SQLite being a safe default: in practice it means supporting multiple databases, say SQLite and Postgres, and this is more complicated than supporting just Postgres. As soon as a project leaves localhost and enters cloud development, you need to talk to a database over the network, which warrants MySQL or Postgres. Which is more complicated: supporting a docker container with MySQL or Postgres for local development, or supporting multiple databases in the project? Of course, the answer could be "it depends", but I would not call SQLite a default choice. It would be if you were writing a desktop or mobile app, but for anything like a web app it's a questionable choice.
koliber: Backups are super-simple as well.I'm also a convert.
oliver236: why sqlite over postgres?
afpx: Not directly related, but this is a good example of why I love dependency injection. In most systems, I typically define the interface, implement something super simple at first, and as I iterate I re-evaluate, and I can easily* swap between implementations.* Not always super easy
Croak: You remove a bit of complexity. Sure, Postgres is not hard to set up and to connect to, but SQLite is just opening a file. It being a file also makes it very easy to test or debug your application.
jwilliams: Is it just me or is this sorely lacking any consideration for inserts? Or durability?
alexisread: I think that's the beauty of PG here: you can find solutions to most of this. Index creation: https://stackoverflow.com/questions/23876479/will-postgresql... JSON->DB schema: https://jsonschema2db.readthedocs.io/en/latest/index.html Pg shared-disk failover is similar, but RAC is quite unique; you're not going to use that with a rented cluster though? https://www.postgresql.org/docs/current/different-replicatio... Personally, for me any technical advantages don't outweigh the business side, YMMV :)
mike_hearn: RAC is a default part of any cloud Oracle DB, I think! I must admit I'm not an expert in all the different SKUs so there might be some that aren't, but if you rent an autonomous DB in the cloud you're probably running on ExaData/RAC. That's why the uptime is advertised as 99.95% without DR (disaster recovery, a replica running in a separate region).> Index creation https://stackoverflow.com/questions/23876479/will-postgresql...I was probably ambiguous. That's just an answer telling how to create indexes manually, and saying that you get an index for primary keys and unique constraints automatically. Of course, all databases do that. Oracle can create arbitrary indexes for any relation in the background without it being requested, if it notices that common queries would benefit from them.Forgetting to create indexes is one of the most common issues people face when writing database apps because the performance will be fine on your laptop, or when the feature is new, and then it slows down when you scale up. Or worse you deploy to prod and the site tanks because a new query that "works fine on my machine" is dog slow when there's real world amounts of data involved. Oracle will just fix it for you, Postgres will require a manual diagnosis and intervention. So this isn't the same capability.
skapadia: The second paragraph sounds eerily AI-generated.
graemep: If you look at those "When do you actually need a database?" constraints, I think it's missing consistency, which prevents bugs and makes debugging easier. When you combine all of those, a database is a better alternative for all but the simplest cases.
Hendrikto: > The database files […] didn't compress super wellThat’s a good thing. If they were compressible that would indicate inefficiency in the SQLite format.
dxxvi: Before reading that post, I would use sqlite/duckdb for everything related to data. After reading that post, I will use sqlite/duckdb + Rust for everything related to data :-)
RugnirViking: have you ever run automated tests on postgres? how long did they take?
pdyc: what was your data size? i am surprised 800kb made a difference. using stringzilla was a smart approach; my guess is that it being unusually fast made all the difference.
direwolf20: If it's a server app, I almost always have to switch to Postgres eventually so now I start with Postgres.
wmanley: SQLite in mmap mode (not the default) will use mmap for reading. It will still use write using `pwrite` so it can detect and recover from write failures.See https://www.sqlite.org/mmap.html :> The default mechanism by which SQLite accesses and updates database disk files is the xRead() and xWrite() methods of the sqlite3_io_methods VFS object. These methods are typically implemented as "read()" and "write()" system calls which cause the operating system to copy disk content between the kernel buffer cache and user space.> Beginning with version 3.7.17 (2013-05-20), SQLite has the option of accessing disk content directly using memory-mapped I/O and the new xFetch() and xUnfetch() methods on sqlite3_io_methods.The principal advantage to mmap in my mind is that the cache is shared among your SQLite connections.
vovanidze: i think you missed the "on day 1" part of my comment. k8s, iac, and observability are incredible tools when you actually have the scale and team to justify them. my point is strictly about premature optimization. i've seen teams spend their first month writing helm charts and terraform before they even have a single paying user. if you have product-market fit and need zero-downtime rollbacks, absolutely use k8s. but if you're just validating an mvp, a vps and docker-compose (or sqlite) is usually enough to get off the ground. it's all about trade-offs tbh.
locknitpicker: > i think you missed the "on day 1" part of my comment. k8s, iac, and observability are incrdible tools when you actually have the scale and team to justifiy them.No, not really. It's counterproductive and silly to go out of your way to setup your whole IaC in any tool you know doesn't fit your needs just because you have an irrational dislike for a tool that does. You need to be aware that nowadays Kubernetes is the interface, not the platform. You can easily use things like minikube, k3s, microk8s, etc, or even have sandbox environments in local servers or cloud providers. It doesn't matter if you target a box under your desk or AWS.It's up to you to decide whether you want to waste your time to make your life harder. Those who you are accusing of being brainwashed seem to prefer getting stuff done without fundamentalisms.
Anamon: I try to always have these considerations in mind. DB-less is underrated.One of my favourite websites ever loads its entire main index database into the client as a JSON file. It's about 5 MBs I think, so maybe 10% of the ad and autoplay video crap most "modern" websites make you download in the first few seconds. Except that this initial download actually gives you an amazing UX in return, because all the searching, filtering, pagination etc. happens entirely client-side without further network requests. I don't know any other website that is as much of a joy to just browse through. You never wait for anything until the point that you found what you want and actually want to open a detail page. Just wonderful.
jmull: > Binary search beats SQLite... For a pure ID lookup, you're paying for machinery you're not using.You'll likely end up quite a chump if you follow this logic.sqlite has pretty strong durability and consistency mechanism that their toy disk binary search doesn't have.(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
WatchDog: It's also just wrong, their SQLite benchmark is only using a single thread for the SQLite connection. It's much faster with multiple connections.
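For concreteness, the "binary search over fixed-width records" idea being debated here has roughly this shape. This is a toy sketch with invented names, and it deliberately shows only the lookup; the index maintenance, durability, and compatibility work that jmull calls the actual hard part is exactly what is missing.

```go
package main

import (
	"encoding/binary"
	"sort"
)

// Each record is 16 bytes: an 8-byte big-endian ID followed by an
// 8-byte offset into some payload file. In practice the index slice
// would be a file read into memory (or mmap'd), sorted by ID at
// build time.
const recordSize = 16

// appendRecord is the build-time helper: records must be appended in
// ascending ID order for findOffset's binary search to be valid.
func appendRecord(index []byte, id uint64, offset int64) []byte {
	var rec [recordSize]byte
	binary.BigEndian.PutUint64(rec[:8], id)
	binary.BigEndian.PutUint64(rec[8:], uint64(offset))
	return append(index, rec[:]...)
}

// findOffset returns the payload offset for id, or -1 if absent.
func findOffset(index []byte, id uint64) int64 {
	n := len(index) / recordSize
	// sort.Search finds the first record whose ID is >= id.
	i := sort.Search(n, func(i int) bool {
		return binary.BigEndian.Uint64(index[i*recordSize:]) >= id
	})
	if i < n && binary.BigEndian.Uint64(index[i*recordSize:]) == id {
		return int64(binary.BigEndian.Uint64(index[i*recordSize+8:]))
	}
	return -1
}
```

The lookup really is a few lines; everything a database adds (safe updates to a sorted structure, crash recovery, format evolution) is what the thread argues you end up rebuilding.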
akdev1l: You can import CSV files into in-memory tables and query them, or you can use the CSV extension. With the CLI: sqlite3 :memory: then .import --csv myfile.csv mytable then SELECT * FROM mytable; Or with the csv extension loaded: CREATE VIRTUAL TABLE temp.mytable USING csv(filename='myfile.csv'); then SELECT * FROM mytable;
tnelsond4: Well, I went and looked at it again, and it seems SQLite databases, although larger than my custom format, do compress to a 10% smaller size provided there are no indexes. But my custom format uses indexes, so it's not really a good comparison. I guess I should see what it's like with indexes. But at the end of the day SQLite just doesn't meet the needs of my project, due to its size and lack of built-in decompression support. Adding a decompression stage before handing the data to SQLite is unwieldy: it requires a separate decompression module and incurs a copy operation. I don't know enough about SQLite to attempt to compile and integrate the zstd decompressor into it.
Panzerschrek: I agree that using a database in many cases is overkill. And I don't understand at all who uses so-called in-memory databases, or why.
koliber: The reality is that things will be blazing fast in any language if you save things by PK in HashMaps.
tstenner: In the benchmark Rust is more than 50% faster than the runner up
koliber: Correct. Order-of-magnitude-wise, it's roughly the same as the alternatives. In the context of writing a new service for a new company, you should not spend one second thinking about whether your technical choices will allow you to serve 100,000 requests per second or 150,000 requests per second. If you are, you are focusing on the wrong thing. If you get to 1,000 requests per second with a real paying client base, you have already achieved more than most dream of. On the other hand, if you are optimizing a mature distributed low-latency equity trading system that is consuming tens of thousands of market data ticks per second, a 50% improvement in performance on a 20-machine cluster might turn into some real $$$ savings. But that's not what this article is about.
ngc248: NoSQL works out really well when you know the access patterns you are working with; if not, it will be like forcing a square peg into a round hole.
tnelsond4: Usually I'm dealing with about 20mb of compressed data, almost 100mb uncompressed. Even with only a couple mb of data SQLite still has a startup time of a couple hundred milliseconds on my phone. But that's a couple hundred milliseconds when loading a database that's already decompressed. When loading 100mb SQLite usually took a second or so which I didn't really like for a pwa.It took me quite a few attempts to get something faster than SQLite. My new format loads instantly because I'm just casting the data to a struct. The only thing that takes time is decompressing, but that's still faster than loading the uncompressed via SQLite. My phone loads 100mb from 20mb compressed in about 400ms.But writing my own format gives other benefits like being able to extract all the HTML tags and capital letters beforehand for fast and sensible search and reconstructing it on render. It's also just way easier for me to edit tsvs with markers for what parts are indexed and have that transformed into an indexed format with 3 indexes.Also, with SQLite I was just running one module, but with my new format I'm running about 20 instances of it because it keeps the data nicer, more manageable and makes everything very parallel. Though I keep the number of web workers to 2 because it doesn't seem to benefit much to increase it more.https://github.com/tnelsond/peakslab
nchmy: This is really cool. I'm working on stuff that is somewhat aligned with this: an offline knowledge base/educational platform focused on things like appropriate technologies for rural people in the developing world. Storing it in the browser and, more importantly, searching it is definitely one of the major challenges. (It's also just a much more dynamic app.) My main question about this is whether it can be dynamically/incrementally updated within the browser? E.g. new material is available or edits have been made, so sync it from the backend and it gets merged in. I've been working on using rxdb to sync and store in the browser - it can use its own indexeddb abstraction, sqlite, or its own OPFS-based DB. It can also load any of these into memory in its memory-mapped mechanism. I've also made a mechanism to load everything into flexsearch in a sharedworker, so that you can do full-text search fairly performantly. It's a lot of complexity though. I'd be curious to hear any of your thoughts. Or even to chat if you're open to it!