Discussion
bigyabai: It's not the 5.9GB of RAM that concerns me, it's the 209GB of SSD space and constant swapping that is unsettling.
cogman10: Interesting, but what exactly did it do and what does it mean? Like, did it simply convert a 397B model into a 20B model? Or is this still a 397B model that now only uses around 6GB while running?
0x457: Interesting. Reminds me of how Gemma 3N with PLE caching works.
quietbuilder: 44% cache hit rate is low. Over half the expert loads are cold reads off SSD, so at 1.4 GB/s effective bandwidth and ~1.8GB I/O per token, 4.74 tok/s checks out, but it'll drop with longer context or heavier reasoning.

Running 397B on consumer hardware is genuinely impressive for a proof of concept. A year ago this wasn't a thing. But I keep wondering whether a well-quantized 70B that fits entirely in RAM would just be faster in practice. No I/O bottleneck, consistent throughput, smaller model but actually usable.
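The back-of-envelope math in that comment can be sketched as a small script. All figures (1.4 GB/s effective bandwidth, 44% hit rate, ~1.8GB I/O per token, 4.74 tok/s) come from the comment itself; the function names and the simplifying assumption that decoding is fully I/O-bound (every cold read must finish before the next token) are mine, and the estimate shifts depending on whether "~1.8GB I/O per token" means total expert traffic or cold reads only.

```python
# Back-of-envelope throughput model for SSD-offloaded expert weights.
# Assumption (mine, not from the thread): generation is fully I/O-bound,
# i.e. cold expert reads are the only bottleneck per decoded token.

def estimated_tps(bw_gbps: float, io_per_token_gb: float, hit_rate: float) -> float:
    """Tokens/sec if cache misses must be read from SSD before each token."""
    cold_io_gb = io_per_token_gb * (1.0 - hit_rate)  # GB actually read from SSD
    return bw_gbps / cold_io_gb

def implied_cold_io_gb(bw_gbps: float, tps: float) -> float:
    """GB of SSD reads per token implied by an observed throughput."""
    return bw_gbps / tps

# Plugging in the comment's figures both ways:
print(f"{estimated_tps(1.4, 1.8, 0.44):.2f} tok/s if 1.8 GB is total expert traffic")
print(f"{implied_cold_io_gb(1.4, 4.74):.2f} GB/token of cold reads implied by 4.74 tok/s")
```

Running it shows the two readings disagree by a few×, which is the kind of gap that prefetching or compute/I/O overlap would explain.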