Discussion
Our most intelligent open models, built from Gemini 3 research and technology to maximize intelligence-per-parameter
flakiness: It's good they still have non-instruction-tuned models.
a7om_com: Gemma models are already in our AIPI inference pricing index. Open source models like Gemma run 70.7% cheaper than proprietary equivalents at the median across the 2,614 SKUs we track. With Gemma 4 hitting third-party platforms the pricing will be worth watching closely. Full data at a7om.com.
darshanmakwana: This is awesome! I will try to use them locally with opencode and see if they are usable as a replacement for Claude Code for basic tasks.
danielhanchen: Thinking / reasoning + multimodal + tool calling. We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well! Guide for those interested: https://unsloth.ai/docs/models/gemma-4
Also note to use temperature = 1.0, top_p = 0.95, top_k = 64, and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!
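For reference, a minimal sketch of wiring those recommended settings up with llama-cpp-python against one of the Unsloth GGUFs (the model filename and prompt are assumptions; substitute whichever quant you downloaded):

    from llama_cpp import Llama

    # Load a local GGUF quant (path/filename assumed for illustration).
    llm = Llama(model_path="gemma-4-26b-a4b-UD-Q4_K_XL.gguf", n_ctx=8192)

    # Sampling settings recommended above: temperature 1.0, top_p 0.95, top_k 64.
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the Gemma 4 release in two sentences."}],
        temperature=1.0,
        top_p=0.95,
        top_k=64,
    )
    print(out["choices"][0]["message"]["content"])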
l2dy: FYI, screenshot for the "Search and download Gemma 4" step on your guide is for qwen3.5, and when I searched for gemma-4 in Unsloth Studio it only shows Gemma 3 models.
danielhanchen: We're still updating it haha! Sorry! It's been quite complex to support new models without breaking old ones
mudkipdev: Can't wait for gemma4-31b-it-claude-opus-4-6-distilled on huggingface tomorrow
wg0: Google might not have the best coding models (yet), but they seem to have the most intelligent and knowledgeable models of all - Gemini 3.1 Pro especially is something. One more thing about Google is that they have everything that others do not:
1. Huge data: audio, video, geospatial.
2. Tons of expertise. "Attention Is All You Need" was born there.
3. Libraries that they wrote.
4. Their own data centers and cloud.
5. Most of all, their own hardware - TPUs that no one else has.
Therefore once the bubble bursts, the only player standing tall and above all would be Google.
rvz: Open weight models are once again marching on and slowly becoming a viable alternative to the larger ones. We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.
echelon: > We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.
Until they surpass what closed models can do today. By that time, closed models will be 4 years ahead. Google would not be giving this away if they believed local open models could win. Google is doing this to slow down Anthropic, OpenAI, and the Chinese, knowing that in the fullness of time they can be the leader. They'll stop being so generous once the dust settles.
originalvichy: The wait is finally over. One or two iterations, and I’ll be happy to say that language models are more than fulfilling my most common needs when self-hosting. Thanks to the Gemma team!
vunderba: Strongly agree. Gemma3:27b and Qwen3-vl:30b-a3b are among my favorite local LLMs and handle the vast majority of translation, classification, and categorization work that I throw at them.
james2doyle: Hmm just tried the google/gemma-4-31B-it through HuggingFace (inference provider seems to be Novita) and function/tool calling was not enabled...
linolevan: It's hosted on Parasail + Google themselves (both for free, as of now); I'd probably give those a shot.
bertili: Qwen: Hold my beer. https://news.ycombinator.com/item?id=47615002
xfalcox: Comparing a model you can download weights for with an API-only model doesn't make much sense.
evanbabaallos: Impressive
fooker: What's a realistic way to run this locally or on a single expensive remote dev machine (not clusters)?
jwr: Really looking forward to testing and benchmarking this on my spam filtering benchmark. gemma-3-27b was a really strong model, surpassed later by gpt-oss:20b (which was also much faster). qwen models always had more variance.
jeffbee: Does spam filtering really need a better model? My impression is that the whole game is based on having the best and freshest user-contributed labels.
VadimPR: Gemma 3 E4B runs very quickly on my Samsung S26, so I am looking forward to trying Gemma 4! It is fantastic to have local, offline alternatives to frontier models.
wolttam: Gemma 4 E2B instantly became my new laptop model, holy shit
adamtaylor_13: What sort of tasks are you using self-hosting for? Just curious as I've been watching the scene but not experimenting with self-hosting.
vunderba: Not OP but one example is that recent VL models are more than sufficient for analyzing your local photo albums/images for creating metadata / descriptions / captions to help better organize your library.
kejaed: Any pointers on some local VLMs to start with?
Imustaskforhelp: Daniel, I know you might hear this a lot, but I really appreciate what you have been doing at Unsloth and the way you handle your communication, whether on Hacker News or Reddit. I am not sure if someone has already asked you this, but out of curiosity: which open source model do you find best, and which AI training team (Qwen/Gemini/Kimi/GLM) has cooperated the most with the Unsloth team and is friendly to work with from that perspective?
danielhanchen: Thanks a lot for the support :) Tbh Gemma-4 haha - it's sooooo good!!! For teams - Google haha definitely hands down, then Qwen, Meta haha through PyTorch and Llama, and Mistral - tbh all labs are great!
Imustaskforhelp: Now you have gotten me a bit excited for Gemma-4. Definitely gonna see if I can run the Unsloth quants of this on my Mac Air - and thanks for responding to my comment :-)
chasd00: Not sure why you're being downvoted, the other thing Google has is Google. They just have to spend the effort/resources to keep up and wait for everyone else to go bankrupt. At the end of the day I think Google will be the eventual LLM winner. I think this is why Meta isn't really in the race and just releases open weight models, the writing is on the wall. Also, probably why Apple went ahead and signed a deal with Google and not OpenAI or Anthropic.
wg0: I don't know why I am downvoted, but Google has data, expertise, hardware and deep pockets. This whole LLM thing was invented at Google, and the machine learning ecosystem libraries come from Google. I don't know how people can be so irrational discounting Google's muscle. Others have just borrowed data, money, and hardware, and they would run out of resources for sure.
heraldgeezer: Gemma vs Gemini? I am only a casual AI chatbot user; I use what gives me the most and best free limits and versions.
antirez: Featuring the ELO score as the main benchmark in the chart is very misleading. The big dense Gemma 4 model does not seem to reach the Qwen 3.5 27B dense model in most benchmarks, which is obviously what matters. The small 2B / 4B models are interesting and may potentially be better ASR models than specialized ones (not just for performance, but because they can be easily served via llama.cpp / MLX and front-ends). They are also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.
nabakin: Public benchmarks can be trivially faked. Lmarena is a bit harder to fake and is human-evaluated. I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on Lmarena scores and private benchmarks.
canyon289: Hi all! I work on the Gemma team - one of many working on this one, as it was a bigger effort given it was a mainline release. Happy to answer whatever questions I can.
azinman2: How do the smaller models differ from what you guys will ultimately ship on Pixel phones? What's the business case for releasing Gemma and not just focusing on Gemini + cloud only?
minimaxir: The benchmark comparisons to Gemma 3 27B on Hugging Face are interesting: the Gemma 4 E4B variant (https://huggingface.co/google/gemma-4-E4B-it) beats the old 27B in every benchmark at a fraction of the parameter count. The E2B/E4B models also support voice input, which is rare.
regularfry: Thinking vs non-thinking. There'll be a token cost there. But still fairly remarkable!
k3nz0: How do you test codeforces ELO?
vunderba: The easiest way to get started is probably to use something like Ollama with the `qwen3-vl:8b` 4-bit quantized model [1]. It's a good balance between accuracy and memory, though in my experience it's slower than older model architectures such as LLaVA. Just be aware Qwen-VL tends to be a bit verbose [2], and you can't really control that reliably with token limits - it'll just cut off abruptly. You can ask it to be more concise, but it can be hit or miss. What I often end up doing (and I admit it's a bit ridiculous) is letting Qwen-VL generate its full detailed output, and then passing that to a different LLM to summarize.
[1] https://ollama.com/library/qwen3-vl:8b
[2] https://mordenstar.com/other/vlm-xkcd
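A rough sketch of that two-step pipeline with the ollama Python client (the summarizer model and file path are assumptions; use whatever you have pulled locally):

    import ollama

    # Step 1: let the VLM produce its full, verbose description of a local image.
    vision = ollama.chat(
        model="qwen3-vl:8b",
        messages=[{
            "role": "user",
            "content": "Describe this photo in detail for a photo-library caption.",
            "images": ["photos/IMG_0042.jpg"],  # path assumed
        }],
    )
    detailed = vision["message"]["content"]

    # Step 2: hand the verbose output to a second, text-only model to condense it.
    summary = ollama.chat(
        model="gemma3:4b",  # assumed summarizer; any small local model works
        messages=[{"role": "user",
                   "content": f"Summarize this image description in one sentence:\n\n{detailed}"}],
    )
    print(summary["message"]["content"])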
WarmWash: The rumor is also that Meta is looking to lease Gemini similar to Apple, as their recent efforts reportedly came up short of expectations.
azinman2: I find the benchmarks to be suggestive but not necessarily representative of reality. It's really best if you have your own use case and can benchmark the models yourself. I've found the results to be surprising and not what these public benchmarks would have you believe.
minimaxir: I can't find what ELO score specifically the benchmark chart is referring to; it's just labeled "Elo Score". It's not Codeforces ELO, as Gemma 4 31B has 2150 for that, which would be off the given chart.
tjwebbnorfolk: Will larger-parameter versions be released?
scrlk: Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

  | Model          | MMLUP | GPQA  | LCB   | ELO  | TAU2  | MMMLU | HLE-n | HLE-t |
  |----------------|-------|-------|-------|------|-------|-------|-------|-------|
  | G4 31B         | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
  | G4 26B A4B     | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% | 8.7%  | 17.2% |
  | G4 E4B         | 69.4% | 58.6% | 52.0% | 940  | 42.2% | 76.6% | -     | -     |
  | G4 E2B         | 60.0% | 43.4% | 44.0% | 633  | 24.5% | 67.4% | -     | -     |
  | G3 27B no-T    | 67.6% | 42.4% | 29.1% | 110  | 16.2% | 70.7% | -     | -     |
  | GPT-5-mini     | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
  | GPT-OSS-120B   | 80.8% | 80.1% | 82.7% | 2157 | --    | 78.2% | 14.9% | 19.0% |
  | Q3-235B-A22B   | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% | --    |
  | Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
  | Q3.5-27B       | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
  | Q3.5-35B-A3B   | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |

  MMLUP: MMLU-Pro
  GPQA: GPQA Diamond
  LCB: LiveCodeBench v6
  ELO: Codeforces ELO
  TAU2: TAU2-Bench
  MMMLU: MMMLU
  HLE-n: Humanity's Last Exam (no tools / CoT)
  HLE-t: Humanity's Last Exam (with search / tools)
  no-T: no thinking
kpw94: Wild differences in ELO compared to tfa's graph: https://storage.googleapis.com/gdm-deepmind-com-prod-public/... (comparing Q3.5-27B to G4 26B A4B and G4 31B specifically).
I'd assume Q3.5-35B-A3B would perform worse than the Q3.5 dense 27B model, but the cards you pasted above somehow show that for ELO and TAU2 it's the other way around...
Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.
Overall, great news if it's at parity or slightly better than Qwen 3.5 open weights; I hope to see both of these evolve in the sub-32GB-RAM space. Disappointed in Mistral/Ministral being so far behind these US & Chinese models.
coder543: > Wild differences in ELO compared to tfa's graph
Because those are two different, completely independent Elos... the one you linked is for LMArena, not Codeforces.
_boffin_: What was the main focus when training this model? Besides the ELO score, it's looking like the models (31B / 26B-A4B) are underperforming on some of the typical benchmarks by a wide margin. Do you believe there's an issue with the tests, or that the results are misleading (such as comparative models benchmaxxing)? Thank you for the release.
simonw: I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models, and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop. https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56...
The gemma-4-31b model is completely broken for me - it just spits out "---\n" no matter what prompt I feed it.
wordpad: Do you think it's just part of their training set now?
mentalgear: Adding to the Q: any good small open-source model that's reliable at reading/extracting tables and/or PDFs with more uncommon layouts?
nateb2022: > Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.
Same here. I can't wait until mlx-community releases MLX-optimized versions of these models as well, but happily running the GGUFs in the meantime!
chrislattner: If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast, check it out here: https://www.modular.com/blog/day-zero-launch-fastest-perform...
-Chris Lattner (yes, affiliated with Modular :-)
nabakin: Faster than TensorRT-LLM on Blackwell? Or do you not consider TensorRT-LLM open source because some dependencies are closed source?
DoctorOetker: Is there a reason we can't use thinking completions to train non-thinking models? I.e., gradient descent towards what thinking would have answered?
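That is roughly how reasoning distillation is typically done: strip the thinking trace and train the non-thinking model on the remaining answer with ordinary cross-entropy. A minimal data-prep sketch, assuming the trace is delimited by the tags quoted upthread (the exact format and the toy example are assumptions):

    # Turn (prompt, thinking completion) pairs into answer-only SFT targets.
    THOUGHT_OPEN = "<|channel>thought\n"   # thinking-trace marker quoted upthread (format assumed)
    END_OF_TURN = "<turn|>"                # EOS token quoted upthread (usage assumed)

    def strip_thinking(completion: str) -> str:
        """Keep only the text after the thinking trace as the training target."""
        if THOUGHT_OPEN in completion:
            completion = completion.split(END_OF_TURN, 1)[-1]
        return completion.strip()

    examples = [
        ("What is 17 * 23?", f"{THOUGHT_OPEN}17*23 = 17*20 + 17*3 = 391{END_OF_TURN}391"),
    ]
    # Standard SFT (cross-entropy on these targets) is the "gradient descent towards
    # what thinking would have answered" being asked about.
    sft_pairs = [(p, strip_thinking(c)) for p, c in examples]
    print(sft_pairs)  # [('What is 17 * 23?', '391')]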
indrora: gemma4-31b-it-claude-opus-4-6-distilled-abliterated-heretic-GGUF-q4-k-m
entropicdrifter: Your posting of the pelican benchmark is honestly the biggest reason I check the HackerNews comments on big new model announcements
jckahn: All hail the pelican king!
simonw: If it's part of their training set why do the 2B and 4B models produce such terrible SVGs?
DeepYogurt: Maybe a dumb question, but what does the "it" stand for in 31B-it vs 31B?
babelfish: Wow, 30B parameters as capable as a 1T parameter model?
mhitza: On the benchmarks compared above, it is closer to other larger open-weights models, and on par with GPT-OSS 120B, for which I also have a frame of reference.
canyon289: We are always figuring out what parameter size makes sense. The decision is always a mix between how good we can make the models from a technical aspect and how good they need to be to make all of you super excited to use them. And it's a bit of a challenge in what is an ever-changing ecosystem. I'm personally curious: is there a certain parameter size you're looking for?
jimbob45: > how good they need to be to make all of you super excited to use them
Isn't that more dictated by the competition you're facing from Llama and Qwen?
mohsen1: On LM Studio I'm only seeing models/google/gemma-4-26b-a4b. Where can I download the full model? I have a 128GB Mac Studio.
gusthema: They are all on Hugging Face.
whhone: The LiteRT-LM CLI (https://ai.google.dev/edge/litert-lm/cli) provides a way to try the Gemma 4 model:

    # with uvx
    uvx litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm
WarmWash: Mainline consumer cards are 16GB, so everyone wants models they can run on their $400 GPU.
NekkoDroid: Yea, I've been waiting a while for a model that is ~12-13GB so there is still a bit of extra headroom for all the different things running on the system that for some reason eat VRAM.
logicallee: Do any of you use this as a replacement for Claude Code? For example, you might use it with openclaw. I have a 48 GB integrated RAM Mac Mini M4 that I currently run Claude Code on; do you think I can replace it with OpenClaw and one of these models?
moffkalast: LM Arena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are correct or not.
jug: I agree; LMArena died for me with the Llama 4 debacle.
vessenes: We were promised full SVG zoos, Simon. I want to see SVG pangolins please
Analog24: So the "E2B" and "E4B" models are actually 5B and 8B parameters. Are we really going to start referring to the "effective" parameter count of dense models by not including the embeddings? These models are impressive, but this is incredibly misleading. You need to load the embeddings in memory along with the rest of the model, so it makes no sense to exclude them from the parameter count. This is why it actually takes 5GB of RAM to run the "2B" model with 4-bit quantization, according to Unsloth (when I first saw that I knew something was up).
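A back-of-the-envelope version of that complaint, using only the figures from the comment (the split between embedding and non-embedding weights is not spelled out, so this is purely illustrative):

    # Rough weight-memory estimate: params * bits / 8, ignoring KV cache and runtime overhead.
    def weight_gb(params_billion: float, bits: float) -> float:
        return params_billion * bits / 8

    advertised = 2.0   # the "E2B" effective-parameter count
    actual = 5.0       # total parameters including embeddings, per the comment

    print(weight_gb(advertised, 4))  # ~1.0 GB if the "2B" label told the whole story
    print(weight_gb(actual, 4))      # ~2.5 GB of weights you actually have to load
    # Add KV cache, activations, and any tensors kept at higher precision, and you
    # land near the ~5 GB figure Unsloth quotes - which is the point being made.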
zaat: Thank you for your work. You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify if, assuming 24GB of VRAM, I should pick a full precision smaller model or 4 bit larger model?
the_pwner224: > full precision smaller model or 4 bit larger model
I think you're getting the Bs in the model names mixed up with quantization. They're not too different in size. For example, for the 8-bit UD quants, the 26B-A4B MoE is 29 GB and the 31B dense is 35 GB.
26B-A4B is a 26 (B)illion param model but only 4 (B)illion are (A)ctivated (hence the 26B-A4B name). It uses the full 26B worth of memory but runs with the speed of a 4B model (= fast). 31B is a 31 billion param model. All parameters are active, so it'll be much slower.
From what I recall, the dense models tend to be better at creativity, logic, math, and niche insights, but the MoE model is way faster and generally "good enough."
Guesstimating with mental math: with your 24 GB of VRAM you might be able to get something like a 5-bit quantization of the 31B, or 6-bit of the 26B-A4B. Not a big difference in quantization level.
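The guesstimate above in code form; a rough rule of thumb (params x bits / 8) rather than exact GGUF sizes, since real quants mix bit widths per tensor:

    def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
        """Very rough quantized weight size; ignores metadata and mixed-precision tensors."""
        return params_billion * bits_per_weight / 8

    for name, params in [("gemma-4-31b (dense)", 31), ("gemma-4-26b-a4b (MoE)", 26)]:
        for bits in (4, 5, 6, 8):
            print(f"{name} @ {bits}-bit ~ {approx_size_gb(params, bits):.1f} GB")

    # With ~24 GB of VRAM (leaving headroom for KV cache and the desktop), that puts
    # the dense 31B around a 5-bit quant and the 26B-A4B around 6-bit, as estimated above.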
iamskeole: Are there any plans for QAT / MXFP4 versions down the line?
karimf: I'm curious about the multimodal capabilities of the E2B and E4B and how fast they are. In ChatGPT right now, you can have an audio and video feed for the AI, and the AI can respond in real time. Now I wonder if the E2B or the E4B is capable enough for this and fast enough to run on an iPhone - basically replicating that experience, but with all the computation (STT, LLM, and TTS) done locally on the phone. I just made this [0] last week, so I know you can run a real-time voice conversation with an AI on an iPhone, but it'd be a totally different experience if it could also process a live camera feed.
[0] https://github.com/fikrikarim/volocal
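The loop being described is essentially STT -> LLM -> TTS running on-device; a skeleton of one conversational turn, with all three component functions as hypothetical stand-ins for whatever local models get wired in:

    # Hypothetical local components - substitute real on-device models
    # (a local STT model, an E2B/E4B-sized LLM, a local TTS voice).
    def transcribe(audio_chunk: bytes) -> str: ...   # speech-to-text (placeholder)
    def generate(history: list[dict]) -> str: ...    # small local LLM (placeholder)
    def synthesize(text: str) -> bytes: ...          # text-to-speech (placeholder)

    def voice_turn(audio_chunk: bytes, history: list[dict]) -> bytes:
        """One conversational turn, entirely on-device: audio in, audio out."""
        user_text = transcribe(audio_chunk)
        history.append({"role": "user", "content": user_text})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        return synthesize(reply)

    # A live camera feed would mean also interleaving image frames into `history`
    # for a multimodal model - the part the comment is asking about.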
matt765: I'll wait for the next iteration
whimblepop: I recently canceled my Google One subscription because getting accurate answers out of Gemini for chat is basically impossible afaict. Whether I enable thinking makes no difference: Gemini always answers me super quickly, rarely actually looks something up, and lies to me. It has a really bad unchecked hallucination problem because it prioritizes speed over accuracy and (astonishingly, to me) is way more hesitant to run web searches than ChatGPT or Claude.Maybe the model is good but the product is so shitty that I can't perceive its virtues while using it. I would characterize it as pretty much unusable (including as the "Google Assistant" on my phone).It's extremely frustrating every way that I've used it but it seems like Gemini and Gemma get nothing but praise here.
staticman2: I've found Gemini works better for search when used through a Perplexity subscription. (Though these things can quickly change).
BoorishBears: Benchmarks are a pox on LLMs. You can use this model for about 5 seconds and realize its reasoning is in a league well above any Qwen model, but instead people assume benchmarks that are openly getting used for training are still relevant.
virgildotcodes: Downloaded through LM Studio on an M1 Max 32GB, 26B A4B Q4_K_M. First message: https://i.postimg.cc/yNZzmGMM/Screenshot-2026-04-03-at-12-44...
Not sure if I'm doing something wrong? This more or less reflects my experience with most local models over the last couple years (although admittedly most aren't anywhere near this bad). People keep saying they're useful and yet I can't get them to be consistently useful at all.
flux3125: You're not doing anything wrong, that's expected
bearjaws: The labels on the table read "Gemma 4 31B IT", which reads as a 431B-parameter model, not Gemma 4 - 31B...
stephbook: Kind of sad they didn't release stronger versions. $dayjob offers strong Nvidias that are hungry for models and are stuck running Llama, gpt-oss, etc. Seems like Google and Anthropic (which I consider the leaders) would rather keep their secret sauce to themselves - understandable.
kuboble: I'm really looking forward to trying it out. Gemma 3 was the first model that I liked enough to use a lot just for daily questions on my 32GB GPU.
daveguy: FYI, it took me a while to find the meaning of the "-it" in some models. That's how Google designates "instruction tuned". Come on, Google. Define your acronyms.
coder68: Are there plans to release a QAT model? Similar to what was done for Gemma 3. That would be nice to see!
coder68: 120B would be great to have if you have it stashed away somewhere. GPT-OSS-120B still stands as one of the best (and fastest) open-weights models out there. A direct competitor in the same size range would be awesome. The closest recent release was Qwen3.5-122B-A10B.
daemonologist: Gemma will give you the most, Gemini will give you the best. The former is much smaller and therefore cheaper to run, but less capable.Although I'm not sure whether Gemma will be available even in aistudio - they took the last one down after people got it to say/do questionable stuff. It's very much intended for self-hosting.
BoorishBears: Well, specifically, a congressperson got it to hallucinate stuff about them and then wrote an angry letter. But I checked and it's there... though in the UI web search can't be disabled (presumably to avoid another egg-on-face situation).
vessenes: I'll pipe in - a series of Mac-optimized MoEs which can stream experts just in time would be really amazing. And popular; I'm guessing in the next year we'll be able to run a very able openclaw with a stack like that. You'll get a lot of installs there. If I were a PM at Gemma, I'd release a stack for each Mac mini memory size.
zozbot234: Expert streaming is something that has to be implemented by the inference engine/library; the model architecture itself has very little to do with it. It's a great idea (for local inference; it uses too much power at scale), but making it work really well is actually not that easy. (I've mentioned this before, but AIUI it would require some new feature definitions in GGUF to allow for coalescing model data about any one expert-layer into a single extent, so that it can be accessed in bulk. That's what seems to make the new Flash-MoE work so well.)
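A toy illustration of what "streaming experts just in time" means at the lowest level: memory-map the weight file and only touch the experts the router selected for the current token, so unused experts never leave disk (shapes and layout are made up; the GGUF coalescing mentioned above is about making exactly this access pattern efficient):

    import numpy as np

    # Pretend weight file: 64 experts, each stored contiguously as one small matrix.
    n_experts, d_in, d_out = 64, 512, 512
    np.random.rand(n_experts, d_in, d_out).astype(np.float32).tofile("experts.bin")

    # Memory-map instead of loading everything; pages are only read when touched.
    experts = np.memmap("experts.bin", dtype=np.float32, mode="r",
                        shape=(n_experts, d_in, d_out))

    def moe_forward(x: np.ndarray, selected: list[int]) -> np.ndarray:
        """Apply only the router-selected experts; the rest stay on disk."""
        return sum(x @ experts[e] for e in selected) / len(selected)

    token_hidden = np.random.rand(d_in).astype(np.float32)
    print(moe_forward(token_hidden, selected=[3, 17]).shape)  # (512,)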
culi: You're conflating lmarena ELO scores. Qwen actually has a higher ELO there. The top Pareto frontier open models are:

  model                        | elo  | price
  qwen3.5-397b-a17b            | 1449 | $1.85
  glm-4.7                      | 1443 | 1.41
  deepseek-v3.2-exp-thinking   | 1425 | 0.38
  deepseek-v3.2                | 1424 | 0.35
  mimo-v2-flash (non-thinking) | 1393 | 0.24
  gemma-3-27b-it               | 1365 | 0.14
  gemma-3-12b-it               | 1341 | 0.11
  gpt-oss-20b                  | 1318 | 0.09
  gemma-3n-e4b-it              | 1318 | 0.03

https://arena.ai/leaderboard/text?viewBy=plot
What Gemma seems to have done is dominate the extreme cheap end of the market, which IMO is probably the most important and overlooked segment.
hypercube33: Mind if I ask what your laptop is, hardware- and configuration-wise?
gigatexal: Downloading the official ones for my M3 Max 128GB via LM Studio, I can't seem to get them to load. They fail for some unknown reason; I'll have to dig into the logs. Any luck for you?
meatmanek: The Unsloth llama.cpp guide[1] recommends building the latest llama.cpp from source, so it's possible we need to wait for LM Studio to ship an update to its bundled llama.cpp. Fairly common with new models.1. https://unsloth.ai/docs/models/gemma-4#llama.cpp-guide
nateb2022: LM Studio shipped this update. Under settings make sure you update your runtimes.
nateb2022: I'd recommend using the instruction tuned variants, the pelicans would probably look a lot better.