Discussion
Can I Run AI locally?
sxates: Cool thing! A couple of suggestions:

1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.
2. It'd be great if I could flip this around and choose a model, then see the performance for all the different processors. Would help with making buying decisions!
S4phyre: Oh how cool. Always wanted to have a tool like this.
phelm: This is awesome. It would be great to cross-reference some intelligence benchmarks, so that I can understand the trade-off between RAM consumption, token rate, and how good the model is.
mrdependable: This is great; I've been trying to figure this stuff out recently.

One thing I do wonder is what sort of solutions there are for running your own model but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
twampss: Is this just llmfit, but a web version of it? https://github.com/AlexsJones/llmfit
cortesoft: Ollama runs a web server that you use to interact with the models: https://docs.ollama.com/quickstart

You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/
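For example, a minimal call to that local server from Python (the model name is just an example; use anything you've already pulled, e.g. with `ollama pull llama3.2`):

```python
import requests

# Ollama's local HTTP API listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```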
havaloc: Missing the A18 Neo! :)
metalliqaz: Hugging Face can already do this for you (with a much more up-to-date list of available models), and so can LM Studio. They don't attempt to estimate tok/sec, though, so that's a cool feature here. That said, I don't really trust those numbers much, because they don't incorporate information about the CPU, etc. True full GPU offload often isn't possible on consumer PC hardware.
ge96: Raspberry Pi? Say a Pi 4B with 4GB of RAM.
arjie: Cool website. The one that I'd really like to see there is the RTX 6000 Pro Blackwell 96 GB, though.
deanc: Yes. But llmfit is far more useful as it detects your system resources.
dgrin91: Honestly, I was surprised about this. It accurately got my GPU and specs without asking for any permissions. I didn't realize I was exposing this info.
dekhn: How could it not? That information is always available to userspace.
varispeed: Does it make any sense? I tried a few models at 128GB and it's all pretty much rubbish. Yes, they do give coherent answers, and sometimes they are even correct, but most of the time it is just plain wrong. I find it a massive waste of time.
GrayShade: This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.
Aurornis: At what quantization and with what size context window?
rebolek: ssh?
boutell: I'm not sure how long ago you tried it, but look at Qwen 3.5 32b on a fast machine. Usually best to shut off thinking if you're not doing tool use.
AstroBen: This doesn't look accurate to me. I have an RX 9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32 tok/s ^.-
misnome: It seems to be missing a whole load of the quantized Qwen models; Qwen3.5:122b works fine on the 96GB GH200 (a machine that is also missing here...)
meatmanek: This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20B don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. GPT-OSS-20B is roughly as smart as a sqrt(20B × 3.6B) ≈ 8.5B dense model, but produces tokens 2x faster.)
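A back-of-the-envelope sketch of that estimate in Python (the ~12 GB quantized size and 450 GB/s bandwidth are illustrative numbers I picked, not measurements):

```python
import math

def estimate_moe(model_gb, total_params_b, active_params_b, bandwidth_gbs):
    # Token generation is roughly memory-bound: t/s ~ bandwidth / bytes read
    # per token. For a MoE, only the active experts are read each token.
    bytes_per_token_gb = model_gb * (active_params_b / total_params_b)
    tokens_per_sec = bandwidth_gbs / bytes_per_token_gb
    # Heuristic "effective intelligence": geometric mean of total and active.
    effective_dense_b = math.sqrt(total_params_b * active_params_b)
    return tokens_per_sec, effective_dense_b

# GPT-OSS-20B: 20B total, 3.6B active; size and bandwidth are made up.
tps, eff = estimate_moe(model_gb=12, total_params_b=20,
                        active_params_b=3.6, bandwidth_gbs=450)
print(f"~{tps:.0f} t/s, roughly as smart as a {eff:.1f}B dense model")
```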
freediddy: I think perplexity is more important than tokens per second; tokens per second is relatively useless in my opinion. There is nothing worse than getting bad results returned to you quickly and confidently.

I've been working with quite a few open-weight models for the last year, and especially for things like images, models from 6 months ago would return garbage data quickly. But these days Qwen 3.5 is incredible, even the 9B model.
boutell: I've been trying to get speech-to-text to work with a reasonable vocabulary on Pis for a while. It's tough. All the modern models just need more GPU than is available.
kylehotchkiss: My Mac mini rocks Qwen2.5 14B at a lightning-fast 11 tokens a second. Which is actually good enough for the long-term data processing I make it spend all day doing. It doesn't lock up the machine or prevent its primary purpose as a web server from being fulfilled.
amelius: What is this S/A/B/C/etc. ranking? Is anyone else using it?
relaxing: Apparently S being a level above A comes from Japanese grading. I’ve been confused by that, too.
bityard: "Available to userspace" is a much different thing than "available to every website that wants it, even in private mode".I too was a little surprised by this. My browser (Vivladi) makes a big deal about how privacy-conscious they are, but apparently browser fingerprinting is not on their radar.
vikramkr: Just a tier list I think
spudlyo: I run LibreWolf, which is configured to ask me before a site can use WebGL, which is commonly used for fingerprinting. I got the popup on this site, so I assume that's how they're doing it.
littlestymaar: While your remark is valid, there are two small inaccuracies here:

> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

First, the token generation speed is going to be comparable, but not the prefill speed (context processing is going to be much slower on a big MoE than on a small dense model).

Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same number of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to 3x the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same "experts", so you'd need to load more weights into the compute units, using more bandwidth).
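To illustrate that second point, here's a toy Python sketch of the speculative-decoding control flow. The "models" are random stand-ins, and real implementations verify all drafted tokens in a single target forward pass and compare probabilities rather than exact matches:

```python
import random

def draft_model(prefix):   # stand-in for a small, fast draft model
    return random.choice("abcd")

def target_model(prefix):  # stand-in for the big model we actually trust
    return random.choice("abcd")

def speculative_step(prefix, k=4):
    # 1. Draft k tokens cheaply with the small model.
    drafted, p = [], prefix
    for _ in range(k):
        t = draft_model(p)
        drafted.append(t)
        p += t
    # 2. Verify with the target model; keep the longest agreeing prefix.
    #    (A real implementation scores all k positions in one forward pass,
    #    which is why accepted tokens come almost for free.)
    accepted, p = [], prefix
    for t in drafted:
        if target_model(p) != t:
            break
        accepted.append(t)
        p += t
    # 3. Always emit one target-model token, so we progress even on a miss.
    accepted.append(target_model(p))
    return prefix + "".join(accepted)

text = "a"
for _ in range(10):
    text = speculative_step(text)
print(text)
```

With a MoE target, each verified token can activate a different set of experts, which is exactly the extra bandwidth cost described above.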
swiftcoder: It's very common in Japanese-developed video games as well
orthoxerox: For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open this on my Ryzen AI Max 395+ with 32 GB of unified memory, it thinks nothing will fit, because I've set the machine up to reserve only 512MB of RAM for the GPU.
bityard: Yeah, this site is iffy at best. I didn't even see Strix Halo on the list, so I selected 128GB and bumped up the memory bandwidth. It says gpt-oss-120b "barely runs" at ~2 t/s.

In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s, depending on context.
am17an: You can still run larger MoE models by offloading the expert weights to the CPU for token generation. They are by and large usable; I get ~50 tok/s on a Kimi Linear 48B (3B active) model on a potato PC + a 3090.
lambda: Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and it's saying I should have much lower performance than I actually get on models I've tested.

For MoE models, it should be using the active parameters in the memory bandwidth computation, not the total parameters.
amelius: It would be great if something like this were built into Ollama, so you could easily list available models based on your current hardware setup from the CLI.
jrmg: Is there a reliable guide somewhere to setting up local AI for coding? (Please don't say "just Google it" - that just results in a morass of AI slop/SEO pages with out-of-date, non-self-consistent, incorrect, or impossible instructions.)

I'd like to be able to use a local model (which one?) to power Copilot in VS Code, and run coding agents (not general-purpose OpenClaw-like agents) on my M2 MacBook. I know it'll be slow.

I suspect this is actually fairly easy to set up - if you know how.
AstroBen: Ollama or LM Studio are very simple to set up.

You're probably not going to get anything working well as an agent on an M2 MacBook, but smaller models do surprisingly well for focused autocomplete. Maybe the Qwen3.5 9B model would run decently on your system?
jrmg: Right - setting up LM studio is not hard. But how do I connect LM Studio to Copilot, or set up an agent?
brcmthrowaway: Basically, LM Studio has a server that serves models over HTTP (localhost). Configure/enable the server and connect OpenCode to it. Try this article: https://advanced-stack.com/fields-notes/qwen35-opencode-lm-s...

I'm looking for an alternative to OpenCode, though; I can barely see the UI.
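Once the server is enabled, anything that speaks the OpenAI API can talk to it. A minimal Python sketch (port 1234 is LM Studio's default; the model name below is a placeholder, so use whatever model ID your server lists):

```python
from openai import OpenAI

# LM Studio's local server exposes an OpenAI-compatible API;
# the API key is required by the client but ignored by the server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3.5-9b",  # placeholder; pick a model ID from your LM Studio server
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
)
print(resp.choices[0].message.content)
```

Pointing Copilot or OpenCode at it is the same idea: give the tool that base URL as a custom OpenAI-compatible endpoint.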
meatmanek: For ASR/STT on a budget, you want https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 - it works great on CPU.

I haven't tried it on a Raspberry Pi, but on Intel it uses a little less than 1s of CPU time per second of audio. Using https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/a... for chunked streaming inference, it takes 6 cores to process audio ~5x faster than realtime. I expect that with all cores on a Pi 4 or 5, you'd probably be able to at least keep up with realtime.

(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)

EDIT: there are also the multitalker-parakeet-streaming-0.6b-v1 and nemotron-speech-streaming-en-0.6b models, which have similar resource requirements but are built for true streaming inference instead of chunked inference. In my tests, these are slightly less accurate; in particular, they seem to completely omit any sentence at the beginning or end of a stream that was partially cut off.
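For the batch case, the basic NeMo usage is just a couple of lines. A sketch, where `meeting.wav` is a placeholder file and the exact return type of `transcribe` varies across NeMo versions:

```python
# pip install "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Downloads the parakeet checkpoint from Hugging Face on first run.
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)
# Plain strings or Hypothesis objects, depending on NeMo version.
outputs = model.transcribe(["meeting.wav"])
print(outputs[0])
```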
NortySpock: I tried the Zed editor and it picked up Ollama with almost no fiddling, so that has allowed me to run Qwen3.5:9B just by tweaking the Ollama settings (which had a few dumb defaults, I thought, like assuming I wanted to run 3 LLMs in parallel, initially disabling Flash Attention, and having a very short context window...).

Having a second pair of "eyes" to read a log error and dig into relevant code is super handy for getting ideas flowing.
AstroBen: It looks like Copilot has direct support for Ollama, if you're willing to set that up: https://docs.ollama.com/integrations/vscode

For LM Studio, under server settings you can start a local server that has an OpenAI-compatible API. You'd need to point Copilot at that. I don't use Copilot, so I'm not sure of the exact steps there.
golem14: Has anyone actually built anything with this tool? The website says that code export is not working yet. That's a very strange way to advertise yourself.
mkagenius: Literally made the same app two weeks ago - https://news.ycombinator.com/item?id=47171499