Discussion
Aurornis: Although I'm interested in both topics (KV compression and attempts to stream MoE models from storage), this is at least the 10th vibecoded project on this topic I've seen today alone across HN, Twitter, and some subreddits I visit. At least this one gave credit to the upstream projects it used as a reference. The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly produced by pointing Claude at the repo and the original paper and having it generate something. Almost none of these attempts contain information that really matters, like actual benchmark tests at different KV quantization levels (not just perplexity or KLD).
_zoltan_: "vibe coded" is NOT the bad thing you think it is. Going from paper to implementation from scratch in half an hour or so is great.
brokencode: That’s a starting spot, but how about some testing and benchmarks? Where’s the value added if the person just tells Claude to do it and then submits a PR? The maintainers may as well vibe code it themselves if that’s all the work the would-be contributor is going to put into it.
mjr00: > "vibe coded" is NOT the bad thing you think it is.

It's not inherently bad, in the same way that a first draft of a novel is not inherently bad. But if someone asked me to read their novel and it was a first draft that they themselves had clearly not bothered reading or editing, I'd tell them to fuck off.
yieldcrv: If it works it works. We live in a wholly unoptimized world because the available resources have been so high while the benefits of optimizing have been so low. That has flipped now, and there is tons of low-hanging fruit to optimize. I agree that benchmarks would be great, but that's only relevant to this one topic, not to the overall agentic-coded pull request concept itself.
aegis_camera: We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon. Also tested Qwen 4B on iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM
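For readers unfamiliar with Lloyd-Max quantization: the core idea the comment above refers to is replacing each fp16 KV value with a small index into a codebook fitted to the value distribution. This is not SwiftLM's or TurboQuant's actual kernel (those are C++/Metal); it's just the classical 1-D Lloyd-Max scheme sketched in NumPy for illustration:

```python
import numpy as np

def lloyd_max_codebook(samples, levels=16, iters=50):
    """Fit a 1-D Lloyd-Max quantizer (k-means on scalars)."""
    # Initialize codewords at evenly spaced quantiles of the data.
    codebook = np.quantile(samples, np.linspace(0, 1, levels))
    for _ in range(iters):
        # Assignment: nearest codeword for each sample.
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        # Update: each codeword moves to the centroid of its cell.
        for k in range(levels):
            cell = samples[idx == k]
            if cell.size:
                codebook[k] = cell.mean()
    return np.sort(codebook)

def quantize(x, codebook):
    # 16 levels -> 4-bit indices; stored as uint8 here for simplicity.
    return np.abs(x[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize(idx, codebook):
    return codebook[idx]

rng = np.random.default_rng(0)
kv = rng.normal(size=4096).astype(np.float32)  # stand-in for one KV channel
cb = lloyd_max_codebook(kv, levels=16)
q = quantize(kv, cb)
recon = dequantize(q, cb)
mse = float(np.mean((kv - recon) ** 2))
# 4-bit indices vs fp16 values gives ~4x compression before codebook overhead.
```

Fusing the `dequantize` lookup into the attention kernel (as the Metal-shader approach describes) avoids ever materializing the fp16 cache in memory.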
altruios: What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.
gigatexal: yeah, this I'd like to see added to the README.
boogerlad: Does this use anything from the flash-moe project? https://github.com/Alexintosh/flash-moe
aegis_camera: Yes, this is a reference project; the main difference is that we don't use OS swap (it introduces latency). Will add https://github.com/danveloper/flash-moe to the original references as well.
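The distinction between swap-based and explicit expert streaming can be made concrete. A minimal sketch of the mmap-style approach (my illustration, not SwiftLM's actual Swift/Metal code; the toy file layout and sizes are assumptions): the checkpoint stays on disk, and only the byte ranges of the router-selected experts are ever touched, so untouched experts never leave the SSD and hot experts stay resident in the OS page cache.

```python
import mmap
import os
import numpy as np

EXPERTS, DIM = 8, 1024                # toy layout: 8 fp32 experts
EXPERT_BYTES = DIM * DIM * 4          # bytes per expert weight matrix

# Write a toy "checkpoint": expert matrices stored contiguously on disk.
path = "experts.bin"
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(np.full((DIM, DIM), float(e), dtype=np.float32).tobytes())

def load_experts(path, expert_ids):
    """Map the file read-only and materialize only the requested experts.

    With mmap, only the pages we slice are faulted in from the SSD;
    the OS page cache keeps them around, so a hot expert is cheap to
    re-load on the next forward pass.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        out = {}
        for e in expert_ids:
            start = e * EXPERT_BYTES
            buf = mm[start:start + EXPERT_BYTES]   # touches only these pages
            out[e] = np.frombuffer(buf, dtype=np.float32).reshape(DIM, DIM)
        mm.close()
        return out

# Router picked experts 2 and 5 for this token: stream just those ~4 MB each.
active = load_experts(path, [2, 5])
os.remove(path)  # clean up the toy file
```

Swap differs in that the kernel decides what to evict under memory pressure, which is where the latency spikes (and watchdog kills) come from; explicit streaming keeps resident memory bounded by construction.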
xiphias2: Another project without running real benchmarks. It's very easy to generate tokens, it's much harder to solve tasks locally.
aegis_camera: I'll add more details. We just wired up the pipeline on both macOS and iOS.
anemll: Check it out, you might be able to speed it up using this: https://github.com/Anemll/anemll-flash-mlx https://x.com/anemll/status/2038684375425200360
simonw: I couldn't get the downloadable binary to work, or the binary I compiled myself:

    ./SwiftLM \
      --model mlx-community/Qwen3.5-122B-A10B-4bit \
      --stream-experts \
      --port 5413

Error:

    [SwiftLM] Loading model: mlx-community/Qwen3.5-122B-A10B-4bit
    [SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08899964be5fdd069bb1b4bc8907fe68f5
    [SwiftLM] Memory strategy: FULL GPU (69.6GB model, 133.4GB available)
    [SwiftLM] Download: [===================>] 100% ⠋ (66395.4 MB / 66395.4 MB) | Speed: 0.0 MB/s

    MLX error: Failed to load the default metallib. library not found library not found library not found library not found at /Users/runner/work/SwiftLM/SwiftLM/LocalPackages/mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115
gervwyk: Anyone else looking at these developments and thinking that local LLMs are the future? There are so many advantages over remote, and the hardware is just not there yet, but one more leap like Apple Silicon and the tech is there. Of course large corps will have fancy proprietary models, but for everyday queries and tasks, local feels like a huge win that's just slightly out of reach. Am I missing something fundamental?
pqtyw: It might work, but what's the point in sharing it if anyone can do the same in those 30 minutes with minimal effort?
sumeno: At least in the novel example the author had the decency to write what they're asking you to read. These are more like sending someone an LMGTFY link to a question they never asked and expecting them to read all the results. Just a complete lack of awareness and respect for the maintainers.
daft_pink: Can this work on M1, M2, M3, M4?
simonw: Claude Code helped me figure out this recipe (inspired by a similar workaround in the CI scripts):

    git clone --recursive https://github.com/SharpAI/SwiftLM.git
    cd SwiftLM
    swift build -c release

    # Trick to copy in that missing mlx.metallib file
    uv run --with mlx-metal python -c "
    import importlib.metadata, pathlib, shutil
    d = importlib.metadata.distribution('mlx-metal')
    metallib = pathlib.Path(d._path).parent / 'mlx/lib/mlx.metallib'
    shutil.copy(metallib, '.build/release/')
    print(f'Copied {metallib} -> .build/release/mlx.metallib')
    "

    # Now start the server (downloads 69GB Qwen model)
    .build/release/SwiftLM \
      --model mlx-community/Qwen3.5-122B-A10B-4bit \
      --stream-experts \
      --port 5413

But the server crashed when I tried to run a prompt through it:

    freed pointer was not the last allocation
aegis_camera: the Python mlx-metal trick is actually what's crashing it. The mlx.metallib from pip is built against a different version of MLX than your Swift binary. It gets past the startup error but then corrupts the GPU memory allocator at inference time, hence "freed pointer was not the last allocation". Use the version-matched metallib that's already in the repo:

    cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
      .build/release/

    .build/release/SwiftLM \
      --model mlx-community/Qwen3.5-122B-A10B-4bit \
      --stream-experts \
      --port 5413

This is the exact metallib that was compiled alongside the Swift code, so there is no version mismatch. Future pre-built releases will bundle it automatically.
aegis_camera: https://www.sharpai.org/benchmark/ The MLX part is what we've done with SwiftLM; the local results are still being verified, and more details are on the way.
daft_pink: I’ve always believed local is the future. Just consider that your iPhone has a processor more powerful than very large machines from not too long ago.
aegis_camera: I've run this on an iPhone 13 Pro (6 GB memory), and Qwen 3 1.7B runs well. So local will soon be intelligent enough for the tasks you want done, if it isn't already.