Discussion
marksully: Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
Insanity: This is a pretty cool project! Essentially this is like using swap memory to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily. I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
tatef: Hypura is a storage-tier-aware inference scheduler for Apple Silicon. It reads the GGUF file, profiles your hardware, and solves a placement problem that assigns every tensor to GPU, RAM, or NVMe based on access frequency and tier bandwidth. No manual tuning: it picks the right mode automatically.

The core insight: most of a model's weights aren't needed every token. For MoE models like Mixtral, only 2 of 8 experts fire per token. Hypura keeps the non-expert tensors (~1 GB) on Metal and streams expert data from NVMe through a small pool buffer. A neuron cache hits 99.5% after warmup, so steady-state NVMe I/O is near-zero. Vanilla llama.cpp OOMs on the same model, because Metal counts the full mmap'd file against recommendedMaxWorkingSetSize even when only a fraction is GPU-offloaded.

For dense models (Llama 70B, 40 GB), it keeps attention and norms on GPU (~8 GB) and streams FFN tensors from NVMe with prefetch lookahead. Slower (0.3 tok/s), but the alternative is a crash.

Numbers on an M1 Max 32 GB (~5.1 GB/s NVMe):

- Mixtral 8x7B Q5_K_M (31 GB): 2.2 tok/s. llama.cpp: OOM at any ngl setting.
- Llama 3.3 70B Q4_K_M (40 GB): 0.3 tok/s. llama.cpp: OOM.
- Qwen 2.5 14B Q4_K_M (8.4 GB): 12.3 tok/s. No overhead when the model fits in memory.

It also exposes an Ollama-compatible HTTP API (/api/chat, /api/generate), so it's a drop-in for anything that talks to Ollama. Written in Rust, wraps llama.cpp via FFI, MIT licensed.

Honest disclosure: I directed the architecture and design decisions, but the code was largely written by LLMs (Claude). I used the Socratic method (asking questions, proposing approaches, evaluating tradeoffs) while the models did the implementation. I think this is worth being transparent about.
The hunch that motivated it: NVMe-backed inference is underutilized despite being a slow but perfectly valid memory tier, especially on Apple Silicon where unified memory + fast SSDs are the norm.

Limitations I won't bury: dense FFN-streaming is I/O-bound (~50 ms per-layer stalls on each of 80 layers). Co-activation predictions need ~100 tokens to warm up. The optimize command rewrites the full model file. This is early and rough.

Happy to answer questions about the placement LP, the custom GGML buffer type, or what I learned about Metal's mmap behavior on Apple Silicon (it's weird).
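For readers curious what the placement problem might look like in miniature, here's a toy greedy sketch. Hypura itself is written in Rust and the author mentions an LP solver, so this is only an illustrative approximation; all names, sizes, and frequencies below are made up:

```python
# Toy greedy placement: the hottest tensors claim the fastest tier that
# still has capacity. Hypura's real solver is an LP and far more
# sophisticated; this only illustrates the shape of the problem.

def place_tensors(tensors, tiers):
    """tensors: list of (name, size_bytes, accesses_per_token).
    tiers: list of (tier_name, capacity_bytes), fastest first.
    Returns {tensor_name: tier_name}."""
    remaining = {name: cap for name, cap in tiers}
    order = [name for name, _ in tiers]
    placement = {}
    # Sort by access frequency so frequently-touched tensors get fast memory.
    for name, size, freq in sorted(tensors, key=lambda t: -t[2]):
        for tier in order:
            if remaining[tier] >= size:
                remaining[tier] -= size
                placement[name] = tier
                break
        else:
            raise MemoryError(f"no tier can hold {name}")
    return placement

tensors = [
    ("attn.0", 1 << 30, 1.0),     # attention: touched every token
    ("norm.0", 1 << 20, 1.0),     # norms: tiny, touched every token
    ("expert.0", 4 << 30, 0.25),  # MoE expert: fires ~2/8 of the time
    ("expert.1", 4 << 30, 0.25),
]
tiers = [("gpu", 2 << 30), ("ram", 4 << 30), ("nvme", 1 << 40)]
# attn/norm land on gpu, expert.0 fills ram, expert.1 spills to nvme
print(place_tensors(tensors, tiers))
```

A real scheduler would also weigh tier bandwidth against read size (many small hot reads hurt more on NVMe than one large cold one), which is presumably where the LP formulation earns its keep.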
password4321: Don't post generated/AI-edited comments. HN is for conversation between humans.

https://news.ycombinator.com/item?id=47340079
causal: You need to change the title or actually include 1T parameter model content.
causal: Yeah title comes from nowhere in the link. No doubt it's possible but all that matters is speed and we learn nothing of that here...
monksy: There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance. (My understanding is better GPU/CPU splits, etc). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.
rubiquity: llama.cpp and llama-swap do this better than Ollama and with far more control.
baq: Intel Optane rolling in its grave.
zozbot234: It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.
salynchnew: It was written by an LLM, so... yeah.
liuliu: Still have 4 brand new ones in my storage unit. Just in case of moments like these. Joke aside (I do have them tho!), I don't think Optane is that much use (not to mention each of mine is only 256 GiB). It's a useful legacy crutch if you have legacy software that isn't designed to issue multiple reads/writes in parallel. If you do, it's really not faster than NVMe, especially the modern ones.
zozbot234: It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert layers immediately after routing). It's the wearout resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention model) and maybe even per-layer activations, though these have the least use given how ephemeral they are.
DennisP: That doesn't read like an AI-generated comment to me. He did mention he vibe-coded the project but that's not against the guidelines.
Forgeties79: gptzero says 99% chance it's AI-generated. It certainly has a lot of telltale signs.
speedgoose: I wonder how many minutes per token on GLM 5.
amelius: This is <1 tok/s for the 40 GB model. Come on, "Run" is not the right word. "Crawl" is.
0ptan3: pmem
root_axis: Are there any 1T parameter open source models?
zozbot234: Kimi 2.5?
vicchenai: The practical question is whether the read pattern is sequential enough to actually saturate NVMe bandwidth, or if the attention-layer access pattern ends up random enough to kill throughput. Sequential reads on a decent NVMe get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.

For a 1T model you'd need to stream something like 2 TB of weights per forward pass at fp16. Even at peak sequential that's 300+ seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.

Still a cool proof of concept though. The gap between 'can run' and 'runs usefully' is where things get interesting.
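The estimate above is easy to sanity-check; a quick calculation using the bandwidth figures quoted in the comment (dense fp16, everything streamed from disk):

```python
# Per-token cost of streaming a dense 1T-parameter fp16 model from NVMe,
# using the bandwidth numbers from the comment above.
params = 1e12
bytes_per_param = 2     # fp16
seq_bw = 7e9            # optimistic sequential NVMe, bytes/s
rand_bw = 0.5e9         # pessimistic random-read, bytes/s

weights = params * bytes_per_param           # 2 TB per forward pass
print(weights / seq_bw)   # ~286 s/token in the sequential best case
print(weights / rand_bw)  # ~4000 s/token if reads go fully random
```

So even in the best case the "300+ seconds per token" figure holds, and a random-access pattern makes it an order of magnitude worse.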
zozbot234: > for a 1T model youd need to stream something like 2TB of weights per forward pass

Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem of each individual expert layer being quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.
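To put a rough number on how much sparsity cuts the per-token read, here's a sketch with a Mixtral-style 2-of-8 routing scheme. The 90/10 expert/shared parameter split is a made-up assumption, not any real model's:

```python
# Rough per-token read for a sparse MoE vs. a dense model of the same
# total size. The expert/shared split below is illustrative only.
total_params = 1e12
expert_frac = 0.9               # assume 90% of params live in experts
active_experts, num_experts = 2, 8

shared = total_params * (1 - expert_frac)        # read every token
experts = total_params * expert_frac * active_experts / num_experts
active_bytes = (shared + experts) * 2            # fp16
print(active_bytes / 1e9)   # ~650 GB/token, vs. 2 TB for dense
```

Still heavy, but a ~3x reduction before any caching, and the neuron cache described upthread is what turns the remainder into near-zero steady-state I/O for experts that repeat across tokens.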
visarga: But across a sequence you still have to load most of them.
root_axis: Thanks, TIL.
smlacy: Yes, and with virtually zero context, which makes an enormous difference for TTFT on the MoE models.