Discussion
rs545837: Some context on the validation so far: Elijah Newren, who wrote git's merge-ort (the default merge strategy), reviewed weave and said language-aware content merging is the right approach, that he's been asked about it enough times to be certain there's demand, and that our fallback-to-line-level strategy for unsupported languages is "a very reasonable way to tackle the problem." Taylor Blau from the Git team said he's "really impressed" and connected us with Elijah. The creator of libgit2 starred the repo. Martin von Zweigbergk (creator of jj) has also been excited about the direction. We are also working with the GitButler team to integrate it as a research feature.

The part that's been keeping me up at night: this becomes critical infrastructure for multi-agent coding. When multiple agents write code in parallel (Cursor, Claude Code, and Codex all ship this now), they create worktrees for isolation. But when those branches merge back, git's line-level merge breaks on cases where two agents added different functions to the same file. Weave resolves these cleanly because it knows they're separate entities: 31/31 vs git's 15/31 on our benchmark.

Weave also ships as an MCP server with 14 tools, so agents can claim entities before editing, check who's touching what, and detect conflicts before they happen.
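To make the entity idea concrete, here is a toy sketch of an entity-level three-way merge. This is not weave's implementation: weave finds entities with tree-sitter, while the split_entities regex below is a naive stand-in for illustration only (it also ignores module-level preamble like imports).

    # Toy entity-level 3-way merge (illustration only; weave itself uses
    # tree-sitter to extract entities, not this naive regex splitter).
    import re

    def split_entities(source: str) -> dict[str, str]:
        """Naively map each top-level def/class name to its source text."""
        entities, name, buf = {}, None, []
        for line in source.splitlines(keepends=True):
            m = re.match(r"(?:def|class)\s+(\w+)", line)
            if m:                           # a new top-level entity starts here
                if name:
                    entities[name] = "".join(buf)
                name, buf = m.group(1), [line]
            elif name:
                buf.append(line)
        if name:
            entities[name] = "".join(buf)
        return entities

    def merge_entities(base: str, ours: str, theirs: str) -> str:
        b, o, t = map(split_entities, (base, ours, theirs))
        merged = []
        for name in dict.fromkeys([*o, *t]):     # union of names, order kept
            bv, ov, tv = b.get(name), o.get(name), t.get(name)
            if ov == tv:                         # identical on both sides
                merged.append(ov)
            elif ov == bv:                       # ours untouched: take theirs
                if tv is not None:
                    merged.append(tv)
            elif tv == bv:                       # theirs untouched: take ours
                if ov is not None:
                    merged.append(ov)
            else:                                # both changed it: real conflict
                raise ValueError(f"conflict in {name}")
        return "".join(merged)

    base = "def greet():\n    pass\n"
    ours = base + "def parse_args():\n    pass\n"        # agent A adds one function
    theirs = base + "def validate_token():\n    pass\n"  # agent B adds another
    print(merge_entities(base, ours, theirs))  # merges cleanly; a line-level
                                               # diff3 conflicts on this case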
tveita: > Elijah Newren, who wrote git's merge-ort (the default merge strategy), reviewed weave and said language-aware content merging is the right approach, that he's been asked about it enough times to be certain there's demand, and that our fallback-to-line-level strategy for unsupported languages is "a very reasonable way to tackle the problem." Taylor Blau from the Git team said he's "really impressed" and connected us with Elijah. The creator of libgit2 starred the repo. Martin von Zweigbergk (creator of jj) has also been excited about the direction.

Are any of these statements public, or is this all private communication?

> We are also working with the GitButler team to integrate it as a research feature.

Referring to this discussion, I assume: https://github.com/gitbutlerapp/gitbutler/discussions/12274
rs545837: Email conversations with Elijah and Taylor are private. Martin commented on our X post that went viral, and suggested a new benchmark design.
gritzko: At this point, the question is: why keep files as blobs in the first place? If a revision control system stores ASTs instead, all the work is AST-level. One can then run SQL-level queries to see what is changing where. Like:

 - do any concurrent branches touch this function?
 - what new uses did this function accrete recently?
 - did we create any actual merge conflicts?

Almost LSP-level querying, involving versions and branches. Beagle is a revision control system like that [1]. It is quite early stage, but the surprising finding is: instead of being a depository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open in the assumption that a lot of things can be built on top of it. Essentially, it is a key-value DB: keys are URIs and values are BASON (binary mergeable JSON) [2]. Can't be more open than that.

[1]: https://github.com/gritzko/librdx/tree/master/be
[2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md
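To make the "do any concurrent branches touch this function?" query concrete, here is a rough sketch over in-memory file versions using Python's ast module. This is illustration only, not Beagle's BASON machinery, and branches_touching is a made-up helper name.

    # Rough sketch of an AST-level branch query: which branches changed
    # this function relative to base? (Python's ast module stands in for
    # the actual AST store.)
    import ast

    def function_sources(source: str) -> dict[str, str]:
        """Map each top-level function name to its exact source segment."""
        return {
            node.name: ast.get_source_segment(source, node)
            for node in ast.parse(source).body
            if isinstance(node, ast.FunctionDef)
        }

    def branches_touching(name: str, base: str, branches: dict[str, str]) -> list[str]:
        """Return the branches whose version of `name` differs from base."""
        base_src = function_sources(base).get(name)
        return [
            branch for branch, src in branches.items()
            if function_sources(src).get(name) != base_src
        ]

    base = "def pay(x):\n    return x\n"
    branches = {
        "feature-a": "def pay(x):\n    return x + 1\n",  # touches pay
        "feature-b": "def pay(x):\n    return x\n",      # leaves pay alone
    }
    print(branches_touching("pay", base, branches))      # ['feature-a']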
samuelstros: How do you get blob file writes fast?

I built lix [0], which stores ASTs instead of blobs. Direct AST writing works for apps that are "AST-aware", and I can confirm it works great. But all software just writes bytes atm. The binary -> parse -> diff path is too slow. The parse and diff steps need to get out of the hot path. That semi-defeats the idea of a VCS that stores ASTs, though.

[0] https://github.com/opral/lix
gritzko: I only diff the changed files. Producing a blob out of a BASON AST is trivial (one scan). Things may get slow for larger files, e.g. the tree-sitter C++ parser is a 25MB C file, 750KLoC. It takes a couple of seconds to import, but it never changes, so no biggie. There is room for improvement, but that is not a show-stopper so far. I plan to round-trip the Linux kernel with full history, which should expose all the bottlenecks.

P.S. I checked lix. It uses a SQL database. That solves some things, but also creates an impedance mismatch: must be a 10x slowdown at least. I use key-value and a custom binary format, so it works nicely. Could go one level deeper still and use a custom storage engine; it would be even faster. Git is all custom.
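The hot-path trick here is mostly content addressing: key the parsed tree by the blob's hash, so an unchanged file is never re-parsed. A minimal sketch, with Python's ast module standing in for the real parser:

    # Keep parsing out of the hot path: cache parse results by content
    # hash, so a file is only re-parsed when its bytes actually change.
    import ast
    import hashlib

    _parse_cache: dict[str, ast.Module] = {}

    def parse_cached(blob: bytes) -> ast.Module:
        key = hashlib.sha256(blob).hexdigest()
        if key not in _parse_cache:          # cold path: parse once
            _parse_cache[key] = ast.parse(blob.decode())
        return _parse_cache[key]             # hot path: hash + lookup

    src = b"def f():\n    pass\n"
    tree1 = parse_cached(src)                # parses
    tree2 = parse_cached(src)                # cache hit, no re-parse
    assert tree1 is tree2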
rs545837: Good framing. Source code is already a serialization of an AST; we just forgot that and started treating it as text. The practical problem is adoption: every tool in the ecosystem reads bytes.
shubhamintech: The merge conflict is the symptom; the root problem is that parallel agents have no coordination primitives before edits happen. The MCP server angle is the more interesting long-term bet here because it moves conflict avoidance earlier in the workflow rather than cleaning up damage after the merge. Entity claiming as a first-class primitive is where this gets compelling for multi-agent coding. What do you think?
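To make the primitive concrete, here is a hypothetical in-process claim registry. The shape is made up for illustration; weave's actual MCP tools may look nothing like this.

    # Hypothetical sketch of entity claiming as a coordination primitive
    # (not weave's MCP API; its real tools and names may differ).
    import threading

    class ClaimRegistry:
        def __init__(self):
            self._lock = threading.Lock()
            self._claims: dict[str, str] = {}     # entity -> holding agent

        def claim(self, entity: str, agent: str) -> bool:
            """Atomically claim an entity before editing; False if another
            agent already holds it. Re-claiming your own entity succeeds."""
            with self._lock:
                return self._claims.setdefault(entity, agent) == agent

        def release(self, entity: str, agent: str) -> None:
            with self._lock:
                if self._claims.get(entity) == agent:
                    del self._claims[entity]

    registry = ClaimRegistry()
    assert registry.claim("auth.py::validate_token", "agent-a")      # granted
    assert not registry.claim("auth.py::validate_token", "agent-b")  # held
    registry.release("auth.py::validate_token", "agent-a")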
jerf: Everything on a disk ends up as a linear sequence of bytes. This is the source of the term "serialization", which I think is easy to hear as a magic word without realizing that it is actually telling you something important in its etymology: it is the process of taking an arbitrary data structure and turning it into something that can be sent or stored serially, that is, in an order, one bit at a time if you really get down to it. To turn something into a file, to send something over a socket, to read something off a sheet of paper to someone else, it has to be serialized.

The process of taking such a linear stream and reconstructing the arbitrary data structure used to generate it (or, in more sophisticated cases, something related to it if not identical) is deserialization. You can't send anyone a cyclic graph directly, but you can send them something they can deserialize into a cyclic graph if you arrange the serialization/deserialization protocol correctly. They may deserialize it into a raw string in some programming language so they can run regexes over it. They may deserialize it into a stream of tokens. This all happens from the same source of serialized data.

So let's say we have an AST in memory. As complicated as your language likes, however recursive, however cross-"module", however bizarre it may be. But you want to store it on a disk or send it somewhere else. In that case it must be serialized and then deserialized.

What determines what the final user ends up with is not the serialization protocol. What determines what the final user ends up with is the deserialization procedure they use. They may, for instance, drop everything except some declaration of what a "package" is if they're just doing some initial scan. They may deserialize it into a compiler's AST. They may deserialize it into tree-sitter's AST. They may deserialize it into some other proprietary AST used by a proprietary static code analyzer, with objects designed not just to represent the code but also to be immediately useful in complicated flow analyses that no other user of the data is interested in.

The point of this seemingly rambling description of what serialization is is that "why keep files as blobs in the first place? If a revision control system stores ASTs instead" doesn't correspond to anything actionable or real. Structured text files are already your programming language's code stored as ASTs. The corresponding deserialization format involves "parsing" them, which is a perfectly sensible and very, very common deserialization method. For example, the HTML you are reading was deserialized into the browser's data structures, which are substantially richer than "just" an AST of HTML due to all the stuff a browser does with the HTML, with a very complicated parsing algorithm defined by the HTML standard. The textual representation may be slightly suboptimal for some purposes, but it's pretty good at others (e.g., lots of regexes run against code over the years). If you want some other data structure in the consumer, the change has to happen in the code that consumes the serialized stream. There is no way to change the code as it is stored on disk to make it "more" or "less" AST-ish than it already is, and always has been.

You can see that in the article under discussion. You don't have to change the source code, which is to say the serialized representation of code on the disk, to get this new feature. You just have to change the deserializer, in this case to use tree-sitter to parse instead of deserializing into "an array of lines which are themselves just strings, except maybe we ignore whitespace for some purposes".

Once you see the source code as already being an AST, it is easy to see that there are multiple ways you could store it that could conceivably be optimized for other uses... but nothing you do to the serialization format is going to change what is possible at all, only adjust the speed at which it can be done. There is no "more AST-ish" representation that will make this tree-sitter code any easier to write. What is on the disk is already maximally "AST-ish" as it is today. There isn't any "AST-ish"-ness being left on the table. The problem was always the consumers, not the representation.

And as far as I can tell, it isn't generally the raw deserialization speed nowadays that is the problem with source code. Optimizing the format for any other purpose would break the simple ability to read it as source code, which is valuable in its own right. But then, nothing stops you from representing source code in some other way right now if you want... but that doesn't open up possibilities that were previously impossible; it just tweaks how quickly some things will run.
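To put that in miniature: the same serialized bytes, two deserializers, two data structures. Nothing about the on-disk form changes; only the consumer does.

    # One serialization, two deserializations: git's view of a file vs. a
    # compiler's view of the very same bytes.
    import ast

    source = "def add(a, b):\n    return a + b\n"

    as_lines = source.splitlines()         # git's view: an array of strings
    as_tree = ast.parse(source)            # a compiler's view: an AST

    print(as_lines)                        # ['def add(a, b):', '    return a + b']
    print(type(as_tree.body[0]).__name__)  # FunctionDef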
rs545837: Interesting read; will comment more once I go through everything in detail. Thanks.
rs545837: Prevention is better than cure, haha. That's exactly why weave ships an MCP server alongside the merge driver.
zokier: > At this point, the question is: why keep files as blobs in the first place? If a revision control system stores ASTs instead, all the work is AST-level.

The problem is that disks (and storage in general) store only bytes, so you inherently need to deal with bytes at some point. You could view source code files as the serialization of the AST (or other parse tree). This is especially apparent with Lisps and their sexprs, but it equally applies to other languages too.
rkagerer: No C#?
rs545837: C# is supported! It goes through sem-core's (the underlying parsing library we use in weave) tree-sitter-c-sharp plugin. Classes, methods, interfaces, enums, and structs are all extracted with it. Let me know if you hit anything.
rkagerer: Cool! I didn't see it listed on the main page, so that's why I asked. Are there a lot of languages similarly supported via plugins? Are they all listed somewhere?

Edit: Also, how are comments treated in general (especially if they exist outside the structures you mentioned)? E.g., does it somehow surface "contradictory" / conflicting edits made within comments? Or are they totally ignored?
keysersoze33: Interesting that weave tries to solve mergiraf's shortcomings (mergiraf is also tree-sitter based):

> git merges lines. mergiraf merges tree nodes. weave merges entities. [1]

I've been using mergiraf for ~6 months and tried to use it to resolve a conflict from multiple Claude instances editing a large bash script. Sadly, neither supports bash out of the box, which makes me suspect that classic merge is better in this/some cases... Will try adding the bash grammar to mergiraf or weave next time.

[1] https://ataraxy-labs.github.io/weave/
rs545837: Hey, author here. This comparison came up a lot when weave went viral on X (https://x.com/rs545837/status/2021020365376671820).

The key difference: mergiraf matches individual AST nodes (GumTree + PCS triples). Weave matches entities (functions, classes, methods) as whole units. Simpler, faster, and the conflicts are readable ("conflict in validate_token" instead of a tree of node triples).

The other big gap: weave ships as an MCP server with 14 tools for agent coordination. Agents can claim entities before editing and detect conflicts before they merge. That's the piece mergiraf doesn't have.

On bash: weave falls back to line-level merge for unsupported languages, so it'll work as well as git does there. Adding a bash tree-sitter grammar would unlock entity-level merge for it; I can work on it tonight if you want it urgently.

Cheers,
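P.S. For anyone who wants to wire weave (or any custom merge tool) into git themselves: the mechanism is a standard merge driver, i.e. a .gitattributes entry plus a merge.<name>.driver config. %O, %A, and %B are git's standard placeholders for the base, ours, and theirs files; the exact weave invocation below is a guess for illustration, so check the README for the real command.

    # Route Python files through a custom merge driver.
    echo "*.py merge=weave" >> .gitattributes

    # Tell git what the driver runs (the weave command here is illustrative).
    git config merge.weave.name "entity-level merge"
    git config merge.weave.driver "weave merge %O %A %B"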
keysersoze33: Thanks for the kind offer - no urgent rush though
kelseydh: Very cool, would love to see Ruby support added.
rs545837: Thanks for the request! Our team is already working on it, and in fact we were going to ship Ruby tonight.

Cheers,
igravious: Nice, thanks for the Ruby support!
rs545837: Of course!
igravious: Just got Kimi to use Weave to merge an official update with my (agent) modded installation (kimi-cli is open source!) and it worked a treat … kimi-cli is mostly Python (I think?!)Fair play. Great tool.
rs545837: Thanks for the feedback, pumped to hear that.
rs545837: You'll love this: weave now supports the bash tree-sitter grammar.
boogerlad: How can this be used with jj?
rs545837: Could you comment on the PR (https://github.com/jj-vcs/jj/pull/8833) if you want this integrated with jj?