Discussion
sourcegrift: We have everything optimized, and yet somehow DB queries need to be "interpreted" at runtime. There's no reason for DB queries to not be precompiled.
jpfr: The "byte-code" coming from the query planner typically only has a handful of steps in a linear sequence: joins, filters, and such. But the individual steps can be very costly. So there is not much to gain from JITing the query plan execution alone. JITing begins to make more sense when the individual query plan steps (join, filter, ...) can themselves be specialized/recompiled/improved/merged by knowing the context of the query plan.
SigmundA: PostgreSQL uses a process-per-connection model and has no way to serialize a query plan to some form that can be shared between processes, so the time it takes to make the plan, including JIT, is very important. Most other DBs cache query plans, including JITted code, so they are basically precompiled from one request to the next with the same statement.
hans_castorp: > and it has no way to serialize a query plan to some form that can be shared between processes

https://www.postgresql.org/docs/current/parallel-query.html

"PostgreSQL can devise query plans that can leverage multiple CPUs in order to answer queries faster."
SigmundA: Nothing to do with plan caching; that's just talking about plan execution of parallel operations. Is that thread-based or process-based in PG? If process-based, then they can send small parts of the plan across processes.
hans_castorp: Ah, didn't see the caching part. Plans for prepared statements are cached though.
SigmundA: Yes, if the client manually prepares the statement it will be cached, but only for that connection, because in PG a connection is a process; it won't survive from one connection to the next even in the same process. Other databases like MSSQL have prepared statements, but they are rarely used nowadays, since plan caching based on query text was introduced decades ago.
the_biot: What sort of things are people doing in their SQL queries that make them CPU bound? Admittedly I'm a meat-and-potatoes guy, but I like mine I/O bound. Really amazed to see not one but several generic JIT frameworks, though; I had no idea that was a thing.
martinald: Anything jsonb in my experience is quickly CPU bound...
jjice: Definitely. If you're doing regular queries with filters on jsonb columns, having the index directly on the JSON paths is really powerful. If I have a jsonb filter in the codebase at all, it probably needs an index, unless I know the result set is already very small.
martinald: Yeah, the other problem is I've really struggled to get Postgres to use multiple threads/cores on one query. It often maxes out one CPU thread while dozens go unused. I constantly have to fight loads of defaults to get this to change, and even then I never feel like I can get it working quite right (probably operator error to some extent). This compares to ClickHouse, which constantly uses the whole machine. Obviously it's easier to do that in a columnar database, but it seems that Postgres is actively designed to _not_ saturate multiple cores, which may have been a good assumption in the past but definitely isn't a good one now, IMO.
asah: awesome! I wonder if it's possible to point AI at this problem and synthesize a bespoke compiler (per-architecture?) for postgresql expressions?
kvdveer: Two things are holding back current LLM-style AI from being of value here:

* Latency. LLM responses are measured on the order of thousands of milliseconds, while this project targets tens of milliseconds; that's off by almost two orders of magnitude.

* Determinism. LLMs are inherently non-deterministic. Even with temperature=0, slight variations of the input lead to major changes in output. You really don't want your DB to be non-deterministic, ever.
qeternity: > LLMs are inherently non-deterministic.

This isn't true, and certainly not inherently so. Changes to input leading to changes in output do not violate determinism.
yomismoaqui: Quoting:

"But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first."

From https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
qeternity: Yes, lots of things can create indeterminism. But nothing is inherent.
magicalhippo: > This isn't true

From what I understand, in practice it often is true [1]:

"Matrix multiplication should be “independent” along every element in the batch — neither the other elements in the batch nor how large the batch is should affect the computation results of a specific element in the batch. However, as we can observe empirically, this isn’t true. In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism."

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
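The floating-point non-associativity mentioned in the quoted article is easy to demonstrate; a minimal sketch with arbitrary illustrative values:

```python
# Floating-point addition is not associative, so the order in which
# partial sums from concurrent cores are combined can change the result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # partial sums combined in one order
right = a + (b + c)  # the same values combined in another order

# left is 1.0 but right is 0.0: 1.0 is absorbed when added to -1e16,
# because doubles only carry ~16 significant decimal digits.
print(left, right)
```

This is why the summation order chosen by a scheduler (which core finishes first, how big the batch is) can change the numerical result even though each individual operation is deterministic.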
qeternity: Yes, lots of things can create indeterminism. But nothing is inherent.
fabian2k: The last time I looked into it my impression was that disabling the JIT in PostgreSQL was the better default choice. I had a massive slowdown in some queries, and that doesn't seem to be an entirely unusual experience. It does not seem worth it to me to add such a large variability to query performance by default. The JIT seemed like something that could be useful if you benchmark the effect on your actual queries, but not as a default for everyone.
pjmlp: That is quite strange, given that the big-boy RDBMSs (Oracle, SQL Server, DB2, Informix, ...) have all had JIT capabilities for several decades now.
SigmundA: The big boys all cache query plans, so the amount of time it takes to compile is not really a concern.
vladich: Postgres caches query plans too; the problem is you can only cache what you can share, and if your planner works well, you can share very little. There can be a lot of unique plans even for the same query.
SigmundA: No, it cannot cache query plans between processes (connections), and the only way it can cache within the same connection is by the client manually preparing the statement. This is how the big boys did it 30 years ago, not anymore. It was common guidance back in the day to use stored procedures for all application access code because they were cached in MSSQL (which PG doesn't even do). Then around 2000 it started caching based on statement text, and that became much less important. You would only use prepared statements now if doing a bunch of inserts in a loop or something, and the benefit nowadays is small: you're just not sending the same text over the network over and over and hashing it to look up the plan.
vladich: I didn't say it can cache between processes. The problem is not caching between processes, it's that caching itself is not very useful, because the planner creates different plans for different input parameters of the same query in the general case. So you can reliably cache plans only for the same sets of parameters. Or you can cache generic plans, which Postgres already does as well (and sharing that cache won't solve much of the problem too).
SigmundA: Other databases have cached plans for years because it's very useful; many (most?) apps run the same statements over and over with differing parameters, so it's a big win. They also do things like auto-parameterization if the statement doesn't have parameters, and parameter sniffing to make multiple different plans based on different values where it makes sense.

https://learn.microsoft.com/en-us/sql/relational-databases/q...

You can also add HINTs to control this behavior if you don't like it or it's causing a problem in production. Crazy, I know.

https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-tr...

PG is extremely primitive compared to these other systems in this area, and it has to be, since it doesn't cache anything unless specifically instructed to, and then only for a single connection.
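The text-keyed plan caching described above can be sketched in a few lines; this is a toy illustration (all names hypothetical), not how any real engine is implemented:

```python
# Toy sketch of statement-text-keyed plan caching, MSSQL-style:
# the expensive planner runs once per distinct statement text, and later
# executions with different parameter values reuse the cached plan.
import hashlib

plan_cache = {}  # in a threaded engine this map is shared server-wide


def build_plan(sql_text):
    # stand-in for the real (expensive) parse + optimize (+ JIT) step
    return {"text": sql_text, "steps": ["scan", "filter"]}


def get_plan(sql_text):
    key = hashlib.sha256(sql_text.encode()).hexdigest()
    if key not in plan_cache:
        plan_cache[key] = build_plan(sql_text)
    return plan_cache[key]


p1 = get_plan("SELECT * FROM t WHERE id = $1")
p2 = get_plan("SELECT * FROM t WHERE id = $1")
assert p1 is p2  # second execution skips planning entirely
```

In a process-per-connection model like PG's, `plan_cache` would live in a single backend process, which is exactly why it can't be reused across connections without extra serialization work.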
vladich: You make some unsubstantiated claims here. I assure you that it isn't as simple as you claim. And what Postgres does here is (mostly) the right thing, you can't do much better. You simply can't decide what plan you need to use based on the query and its parameters alone, unless you already cached that plan for those parameters (and even in that case you need to watch out for possible dramatic changes in statistics). Prepared statements != cached execution plans.
Asm2D: Many SQL engines have JIT compilers. The problems related to PostgreSQL are pretty much all described here: it's very difficult to do low-latency queries if you cannot cache the compiled code and must compile it over and over again. And once your JIT is slow, you need logic to decide whether to interpret or compile. I think it would be best to start interpreting the query and kick off compilation in another thread; once the compilation is finished, if the interpreter is still running, stop the interpreter and run the JIT-compiled code. This would give the best latency, because there would be no waiting for the JIT compiler.
chrisaycock: > I think it would be the best to start interpreting the query and start compilation in another thread

This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript. There can be many tiers: an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches into the faster tier once it becomes ready.

More info for the interested: https://ieeexplore.ieee.org/document/10444855
hinkley: It’s also common for JITs to sprout a tier and shed a tier over time, as the last and first tiers shift in cost/benefit. If the first tier works better you delay the other tiers. If the last tier gets faster (in run time or code optimization) you engage it sooner, or strip the middle tier entirely and hand half that budget to the last tier.
zaphirplane: What do you mean? Because the obvious thing is a shared cache, and if there is one thing the writers of a DB know, it is locking.
SigmundA: Sharing executable code between processes is not as easy as sharing data. AFAIK, unless something's changed recently, PG shares nothing about plans between processes and can't even share a cached plan between sessions/connections.
_flux: Write the binary to a file, call it `libquery-id1234.so`, and link that into whichever processes need it?
vladich: Won't work well if it executes 20k+ queries per second. The filesystem will be a bottleneck, among other things.
_flux: You can put more than one function in one file.
llm_nerd: Executable code is literally just data that you mark as executable. The engine already did the JIT compilation, and the idea that it can't then share the result between processes is incomprehensible. I was actually confused by this submission, as it puts so much emphasis on initial compilation time, when every DB (apparently except for pgsql) caches that result and shares/reuses it until invalidation. Invalidation can occur for a wide variety of reasons (data composition changing, age, etc.), but still, the idea of redoing it on every query, when most DBs see the same queries endlessly, is insane.
SigmundA: No, a lot of JITted code has pointers to addresses specific to that process, which make no sense in another process. Making code shareable between processes takes effort and has performance tradeoffs, since the code is no longer specialized to the process. If the query plan were at least serializable (it's more like an AST), then at least that part could be reused, and JITted code could then be cached in memory in each process and referenced from the plan by some key. DBs like MSSQL avoid the problem because they run a single OS process with multiple threads instead. This is also why they can handle more connections easily, since each connection is not a whole process.
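The distinction drawn above, a plan as pure serializable data versus compiled code holding process-local pointers, can be illustrated with a small sketch (using Python's `pickle` as a stand-in for cross-process serialization):

```python
# A plan expressed as plain data (an AST-like structure) can be serialized
# and handed to another process; "code" that captures process-local state
# (here: a closure over a live object) cannot.
import pickle

plan_ast = {"op": "filter", "column": "id", "value": 42}  # pure data
wire_bytes = pickle.dumps(plan_ast)                       # serializes fine
assert pickle.loads(wire_bytes) == plan_ast

env = {"table": object()}             # process-local state (an address)
compiled = lambda row: env["table"]   # code pointing into this process
try:
    pickle.dumps(compiled)
except Exception as e:
    # closures/lambdas are not picklable; analogous to JITted machine
    # code embedding absolute addresses from one process's memory
    print("not shareable:", type(e).__name__)
```

The analogy is loose (pickle restrictions are not the same mechanism as absolute addresses in machine code), but it shows why serializing the AST-level plan is the tractable part and sharing compiled code is the hard part.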
patagurbon: What does "specialized to the process" mean? Lots of JIT tooling these days readily supports caching and precompilation. Invalidation is hard, but things like reloading global references are hardly intractable problems, especially for an org as large as pgsql.
larodi: Sadly, no Windows version yet, AFAICT.
vladich: It will be added soon
swaminarayan: Have you tested this under high concurrency with lots of short OLTP queries? I’m curious whether the much faster compile time actually moves the point where JIT starts paying off, or if it’s still mostly useful for heavier queries.
vladich: It's not useful for sub-millisecond queries like point lookups, or other simple ones that process only a few records. The sljit option starts to pay off when you process (not necessarily return) hundreds of records; the more, the better. I'm still thinking about a caching option that would lift this requirement somewhat for cached plans. For non-cached ones it will stay.
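The break-even point described above follows from simple arithmetic: JIT pays off once the per-row savings outweigh the one-time compile cost. The numbers below are illustrative assumptions, not measurements from this project:

```python
# Back-of-the-envelope break-even for JIT compilation. Compiling wins once
# (rows processed) x (per-row saving) exceeds the one-time compile cost.
compile_cost_us = 200.0    # assumed one-time JIT compile time (microseconds)
interp_per_row_us = 0.50   # assumed interpreted cost per row
jit_per_row_us = 0.10      # assumed compiled cost per row

break_even_rows = compile_cost_us / (interp_per_row_us - jit_per_row_us)
print(break_even_rows)  # 500.0: below ~500 rows, interpreting wins
```

With these (made-up) constants the crossover lands in the hundreds of rows, consistent with the comment's observation; a slower compiler (e.g. LLVM) raises `compile_cost_us` and pushes the break-even point far higher.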
aengelke: > It's very difficult to do low-latency queries if you cannot cache the compiled code

This is not too difficult; it just requires a different execution style. Salesforce's Hyper, for example, relies very heavily on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has extremely low start-up latency; an interpreter exists but is practically never used. Postgres is very robust and very powerful, but simply not designed for fast execution of queries.

Disclosure: I work in the group that develops Umbra.

[1]: https://umbra-db.com/
Asm2D: If I recall the research papers regarding Umbra correctly, it also uses AsmJit as a JIT backend, which means that theoretically the compilation times would be comparable if you only consider code-emitting overhead. The problem will always be queries where the compilation is orders of magnitude more expensive than the query itself; think indexed lookups of one or a few entries, etc. Accessing indexed entries like that is already very well optimized by SQL query engines, and it possibly makes no sense to JIT-optimize it at all.
vladich: Interesting... AsmJit is pretty fast at compilation, but 3x-5x slower than sljit. The only way I can see to make it fast enough (i.e. without slowing down point-lookup queries and such) would be to fuse planning with code generation, i.e. essentially a single-pass plan builder + compiler. Not sure if Umbra does that, and AsmJit is too slow for that anyway, but with sljit it could be on par with the interpreter even for the fastest queries, I believe. Pretty hard to implement, though.
vladich: Sure, but not more than one query per file
_flux: Hm, what is preventing you from putting more than one query into the same file?
hinkley: I’m always surprised to learn LLVM is so slow given that was one of the original motivations for developing it. I don’t know if that’s down to feature creep or intrinsic complexity being higher than people presumed was the case for GCC.
Tanjreeve: It's a compiler backend for programming languages not a runtime JIT compiler. Especially inside a DBMS a lot of the assumptions it was built with don't hold. Some people in DBMS world (mostly at TUM with Umbra/CedarDB) have written their own and others tried multi pass approaches where you have an interpreter first then a more optimised LLVM pass later.
SigmundA: Pointers to process-specific memory addresses. I didn't say it was intractable, only that it takes more effort; other databases do it. The current PG query plan and JIT are designed around living in memory in a single process; this would need to be extracted into something not process-specific and shared between all processes. The plan itself is just a bunch of C structs, I believe.
Asm2D: Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers, as in the SLJIT case. There is always a cost for that comfort. I think AsmJit's strength is the completeness of its backends, as you can emit nice SIMD code with it (like AVX-512). But the compile-time performance could of course be better; making it 2x faster would be possible.
vladich: There are other issues with that auto-allocation. I tested all 3 backends on very large queries (hundreds of KB of generated code per query). Performance of all of them (+ LLVM) was abysmal: the compilation overhead was in seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms, while sljit scaled linearly and was almost as fast as for smaller queries. So yes, register allocation gives higher run-time performance, but the cost of that performance grows non-linearly with code size and complexity, while you can still get good performance with manual allocation.
vladich: SLJIT is a bit smarter than just using hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform; if not, it uses memory. That's why performance can differ between Windows and Linux on x64, for example: a different number of registers is available.
Asm2D: AsmJit has only one place where a lot of time is spent: bin-packing. It's the least optimized part, with quadratic complexity (at the moment), which starts to show when you have hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function with 16MB of machine code, 65k labels, and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is just the time to call `emit()`). If this part were optimized, or switched to some other implementation when there are tens of thousands of virtual registers, you would get orders-of-magnitude faster compilation.

But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example, we are talking about 25ms for a single function with 1MB of machine code, and sub-ms times when you generate tens of KB of machine code.

So from my perspective, the ability to generate SIMD code that the CPU executes fast in inner loops is much more valuable than anything else. Any workload that is CPU-bound just deserves this. The question is how CPU-bound the workload actually is. I would imagine databases like Postgres are more memory-bound if you are processing huge rows and accessing only a very tiny part of each row; that's why columnar databases are so popular, though of course they have different problems.

I worked on one project that tried to deal with this by using buckets and hashing, such that there were 16 buckets and each column landed in one of them, keeping columns closer to each other so the query engine only needs to load the buckets used in the query. But we are talking about gigabytes of raw throughput per core in that case.
vladich: Added Windows (x86_64 for now) support
catlifeonmars: This is a neat idea. I want to take it further and precompile the entire DBMS binary for a specific schema.
menaerus: Someone is already working on it: https://arxiv.org/pdf/2603.02081
catlifeonmars: [delayed]
menaerus: It has different limitations but inefficiency doesn't seem likely to be one of them. Did you read the Experimental Results section?> Figure 2 shows the experimental results, and GenDB outperforms all baselines on every query in both benchmarks. On TPC-H, GenDB achieves a total execution time of 214 ms across five representative queries.> This result is 2.8× faster than DuckDB (594 ms) and Umbra (590 ms), which are the two fastest baselines, and 11.2× faster than ClickHouse.> On SEC-EDGAR, GenDB achieves 328 ms, which is 5.0× faster than DuckDB and 3.9× faster than Umbra.> The performance gap increases with query complexity. For example, on TPC-H Q9, which is a five-way join with a LIKE filter, GenDB completes in 38 ms, which is 6.1× faster than DuckDB. GenDB uses iterative optimization with early stopping criteria.> On TPC-H, Q6 reaches a near-optimal time of 17 ms at iteration 0 with zone-map pruning and a branchless scan, and does not require further optimization. In contrast, Q18 starts at 12,147 ms and decreases to 74 ms by iteration 1, which is a 163× improvement. This gain comes from replacing a cache-thrashing hash aggregation with an index-aware sequential scan.> On SEC-EDGAR, Q4 decreases from 1,410 ms to 106 ms over three iterations, which is a 13.3× improvement, and Q6 decreases from 1,121 ms to 88 ms over four iterations, which is a 12.7× improvement. In Q6, the optimizer gradually fuses scan, compact, and merge operations into a single OpenMP parallel region, which removes three thread-spawn overheads. By iteration 1, GenDB already outperforms all baselines
vladich: That's all great, but sadly impractical. I looked at one of the first statements:

> GenDB is an LLM-powered agentic system that decomposes the complex end-to-end query processing and optimization task into a sequence of smaller and well-defined steps, where each step is handled by a dedicated LLM agent.

Knowing typical LLM latency, that's outside the realm of OLTP and probably even OLAP. You can't wait tens of seconds to minutes for an LLM to generate some optimal code that you then compile and execute.
menaerus: No, that's not how I believe they intended it to work. They generate the workload-specific engine up front, not when the query arrives.
vladich: Then why do they write the opposite?
menaerus: If you look at the results, you will see that they are able to execute 5 TPC-H queries in ~200ms (total). The dataset is not large, it's rather small (10GB), but nonetheless you wouldn't be able to run 5 queries in such a small amount of time if you had to analyze the workload, generate the code, build indices, start the agents/engine, and retrieve the results. I didn't read the whole paper, but this is why I think your understanding is wrong.
vladich: Considering it's just a single PhD student doing this work, I don't believe such a task can be realistically accomplished, even as a PoC / research.
menaerus: Why not? Even without LLMs, it is technically feasible to build a custom database engine that performs much better than general-purpose database kernels. And we see this happening all the time, with timeseries, BLOBs, documents, OLTP, OLAP, logging, etc. The catch is obviously that the development is way too expensive and takes a lot of technical capability, which isn't really all that common. The novelty this paper presents is that these two barriers might have come to an end: we could use LLMs and agents to build custom database engines for ourselves™ and our™ specific workloads, very quickly and for a tiny fraction of the development price.