Discussion
anemll: Large Language Models with 400 billion parameters can only be run on capable hardware with heaps of memory, as even a quantized or compressed version requires a minimum of 200GB RAM. Looking at these beefy requirements, the iPhone 17 Pro would never be the first choice to run a 400B LLM, but video evidence shows otherwise, as one person has demonstrated that Apple's current generation has accomplished the seemingly impossible.
ashwinnair99: A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
cogman10: This isn't a hardware feat, this is a software triumph. They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
firstbabylonian: > SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514
pdpi: It's both. We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400B LLM on a low-end laptop was feasible, and that is very much a software triumph.
simonw: Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
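The core idea in "LLM in a flash" is to keep the full weight file on flash storage and page only the slices needed for the current computation into RAM. A minimal sketch of that pattern using `mmap` (the file layout, function name, and dtype here are illustrative assumptions, not the project's actual format):

```python
import mmap
import numpy as np

def load_layer_slice(path, offset_bytes, num_params, dtype=np.float16):
    """Read one layer's weights from flash without loading the whole file.

    Hypothetical helper: real systems add quantized formats, alignment,
    and batched reads, but the on-demand paging idea is the same.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        itemsize = np.dtype(dtype).itemsize
        # Slicing the mmap only pages in the bytes actually touched.
        buf = mm[offset_bytes : offset_bytes + num_params * itemsize]
        return np.frombuffer(buf, dtype=dtype)
```

With a 400B-parameter file on SSD, only the layers (or experts) needed right now occupy RAM; the rest stays on flash until touched.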
cj00: It’s 400B but it’s a mixture-of-experts model, so how many parameters are active at any time?
simonw: Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App
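The 397B-total / 17B-active split is what makes the trick viable. A back-of-envelope sketch, assuming roughly 4-bit quantization (~0.5 bytes per parameter; the actual quantization used in the demo isn't confirmed here):

```python
# Illustrative memory math for a 397B-total / 17B-active MoE model.
BYTES_PER_PARAM = 0.5  # assumed ~4-bit quantization

total_params = 397e9
active_params = 17e9

full_model_gb = total_params * BYTES_PER_PARAM / 1e9   # lives on SSD
active_gb = active_params * BYTES_PER_PARAM / 1e9      # needed in RAM

print(f"full model on flash: ~{full_model_gb:.0f} GB")
print(f"active experts in RAM: ~{active_gb:.1f} GB")
```

So the ~200GB weight file stays on flash while only single-digit gigabytes of active experts need to sit in the phone's RAM per token, which is why SSD streaming makes this possible at all.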
rwaksmunski: Apple might just win the AI race without even running in it. It's all about the distribution.
raw_anon_1111: Apple is already one of the winners of the AI race. It’s making much more profit (ie it ain’t losing money) on AI off of ChatGPT, Claude, Grok (you would be surprised at how many incels pay to make AI generated porn videos) subscriptions through the App Store. It’s only paying Google $1 billion a year for access to Gemini for Siri.
detourdog: Apple’s entire yearly capex is a fraction of the AI spend of the presumed AI winners.
smallerize: The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
layer8: And hardly anyone would have considered that impossible to achieve a year ago.
zozbot234: A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM, which you still need for the active part of the model.
simonw: Yeah, this new post is a continuation of that work.
devmor: Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years. Apple’s bet is intelligent; the “presumed winners” are staking our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
causal: Run an incredible 400B parameters on a handheld device. 0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."
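The quoted throughput puts concrete numbers on the usability complaint. A quick sketch of the wait times implied by 0.6 tokens/second (reply length here is an arbitrary illustration):

```python
# How long a reply takes at the demo's quoted generation speed.
rate = 0.6          # tokens per second, from the demo
reply_tokens = 200  # an assumed short-ish answer

seconds = reply_tokens / rate
print(f"~{seconds / 60:.1f} minutes per 200-token reply")
```

At that rate even a brief answer takes over five minutes, which is the gap the later comments about "unusable speed" are pointing at.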
WarmWash: I don't think we are ever going to win this. The general population loves being glazed way too much.
qingcharles: Plus all those pricey 512GB Mac Studios they are selling to YouTubers.
dzikimarian: Because someone managed to run LLM on an iPhone at unusable speed Apple won AI race? Yeah, sure.
foobiekr: This is not entirely dissimilar to what Cerebras does with their weight streaming.
Aurornis: It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5. It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
foobiekr: Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature. Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
_air: This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
originalvichy: On smartphones? It’s not worth it to run a model this size on a device like this. A smaller model fine-tuned for specific use cases is not only faster, but possibly more accurate. All those gigs of unnecessary knowledge are useless for the tasks usually done on smartphones.
baal80spam: > The general population loves being glazed way too much.This is 100% correct!
WarmWash: Thanks for short warm blast of dopamine, no one else ever seems to grasp how smart I truly am!
pier25: https://xcancel.com/anemll/status/2035901335984611412
dang: Added to toptext. Thanks!
tombert: That's an astute point, and you're right to point it out.
actusual: You are thinking about this exactly the right way.
Tade0: Only way to have hardware reach this sort of efficiency is to embed the model in hardware. This exists [0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html
intrasight: I think for many reasons this will become the dominant paradigm for end user devices. Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in. Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
bigyabai: One big bottleneck is SRAM cost. Even an 8B model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.

> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
9dev: You’re absolutely right!
mannyv: The software has real software engineers working on it instead of researchers. Remember when people were arguing about whether to use mmap? What a ridiculous argument. At some point someone will figure out how to tile the weights and the memory requirements will drop again.
snovv_crash: The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
zozbot234: > maybe even learned prefetching for what the next experts will be

Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
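What systems like this typically do instead of prefetching is cache recently used experts in RAM and stream misses from flash on demand. A toy sketch of that LRU expert cache (class and loader names are hypothetical, not from the flash-moe codebase):

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache for MoE expert weights streamed from flash."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # callable: expert_id -> weights
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
        else:
            self.misses += 1                    # must stream from flash
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = self.loader(expert_id)
        return self.cache[expert_id]
```

If the router's expert choices cluster across consecutive tokens, the hit rate stays high and most tokens avoid an SSD read; if they don't, throughput collapses to flash bandwidth, which is consistent with the slow speeds discussed above.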