Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
blog.google | 266 points by theanonymousone 9 hours ago
I just ran one of these locally on a Mac like this:
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu \
--prompt="Generate an SVG of a pelican riding a bicycle"
The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm
It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu --vision-backend gpu \
--attachment image.jpg --prompt describe
And for audio:
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu --audio-backend cpu \
--attachment audio.wav --prompt transcribe
(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
Not to mention the text-only 0.8GB version. Just crazy. You can now have basic real-time conversations on-device that are video and audio aware.
Have you seen a 0.8GB model file floating around yet? I couldn't find one earlier.
I think this is the one but it’s 0.8GB VRAM not 0.8GB size.
https://huggingface.co/google/gemma-4-E2B-it-qat-mobile-ct
But they could be cooking up a smaller one because the model card lists the Q_4 quants as being bigger than the mobile or text-only so I think we’ll need to wait for the Q_2_Distilled_Mobile_Textformer version. Still, just amazing work.
Is that actually QAT? The MLX Community models have that in their names, but these don't, and the upload dates don't quite line up.
As an aside uvx is so pleasant to use... I wish Nvidia supported it as first-class rather than making folks jump through Docker hoops.
Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% of the accuracy of the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article.
Personally, I'm using the 2B model for web search and structured JSON output back via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.
Google's QAT claims to need 6.7 GB RAM, vs Unsloth's dynamic quants at 8GB. Would love to see some benchmarks. Both amazing for size.
you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.
meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.
Like storing small 8 bit numbers in full 32 bit integers.
So it's not close to 100% of unquantized BF16.
I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0? Seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.
That being said, I hate it that smol model makers, like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...
> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.
You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.
The Gemma 3 QAT report was a bit clearer:
https://developers.googleblog.com/en/gemma-3-quantized-aware...
"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."
The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
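To make "simulated quantization" concrete, here is a minimal sketch of the quantize-then-dequantize step that QAT-style training typically applies to weights in the forward pass. It assumes a simple symmetric per-tensor 4-bit grid. The function name and the scaling scheme are illustrative assumptions, not Google's recipe and not the actual Q4_0 block layout.

```python
def fake_quantize(weights, bits=4):
    """Snap floats to a signed integer grid, then map them back to floats.

    Illustrative only: symmetric per-tensor scaling is an assumption here.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)           # nearest grid index
        q = max(qmin, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)          # back to float, now on the grid
    return out


weights = [0.70, -0.35, 0.12, 0.01]
simulated = fake_quantize(weights)
errors = [abs(a - b) for a, b in zip(weights, simulated)]
print(simulated)
print(errors)
```

During training the loss would be computed with the `simulated` values while the optimizer keeps updating the full-precision `weights`, so the checkpoint stays in 16-bit but learns to tolerate the rounding. That is consistent with the point above that the later conversion to 4-bit loses less, but not zero.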
Is there evidence that this approach helps maintain "accuracy" performance when quantized? It sounds a bit like mxfp4 with gpt-oss, which was a confusing model upon release.
So what we want now is unsloth (or anyone) to release 4/6-bit quantized models of these releases?
I'm confused, the Unsloth model is ~600MB and the one from Google is 7GB?
It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple?
No knowledge, just speculation.
Very impressed with how much the Gemma ecosystem has advanced just this week.
Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!
It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.
It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B being 6.7GB, which will indeed fit Google's claims of fitting within 16GB comfortably, although it confirms that only the quantized version will do so.
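As a rough sanity check on figures like that, here is a back-of-the-envelope sketch of weight storage for the llama.cpp Q4_0 layout, which packs each block of 32 weights into 16 bytes of 4-bit values plus one 2-byte fp16 scale. Taking "12B" as exactly 12 billion parameters is an assumption, and the estimate ignores the KV cache, activations, and any tensors kept at higher precision.

```python
def q4_0_weight_bytes(n_params, block_size=32, scale_bytes=2):
    """Bytes for weights stored as 4-bit values plus one fp16 scale per block."""
    n_blocks = -(-n_params // block_size)    # ceiling division
    packed = n_blocks * (block_size // 2)    # two 4-bit values per byte
    return packed + n_blocks * scale_bytes


n_params = 12_000_000_000   # assumption: "12B" taken literally
q4 = q4_0_weight_bytes(n_params)
bf16 = n_params * 2         # 2 bytes per BF16 weight

print(f"Q4_0 weights: {q4 / 1e9:.2f} GB")    # Q4_0 weights: 6.75 GB
print(f"BF16 weights: {bf16 / 1e9:.2f} GB")  # BF16 weights: 24.00 GB
```

The Q4_0 number lands in the same ballpark as the quoted 6.7GB, though that may be partly coincidence since the real parameter count and runtime overhead are not accounted for. The BF16 number shows why the unquantized checkpoint would not fit in 16GB.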
Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.
The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.
Not sure if I understand you, but Q4 and QAT Q4 are different.
It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?
- Gemma 4 2B/4B/27BE3B/31B
- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)
- Gemma 4 12B (2 days ago? 1?)
- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)
It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.
Extremely glad for the output, not glad to have to chase it.
e.g. llama.cpp currently supports the originals but not the MTP predictors. There is a patch for the MTP predictors, but not for the small MoE models. I think it supports the 12B, but maybe not media for it yet. Now we have these too, and the blog says there are GGUFs (llama.cpp models), but there aren't any in the 12? repos I clicked through. And ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.
Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)
EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!
These models aren't products? They are open source ish (open weight I guess), research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.
> I believe it's on you to understand it.
This is exactly why Google has 10 messenger Apps.
I understand it. :)
And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)
You can stop after the first one. Choosing to repeat the process is on you, and probably because you see some benefit in using the variant(s) you build on top of.
Yes my framing was a little confusing. You were clear in that you are building products on them. I was more saying that because these gemma models are not products, and instead research outputs, the naming scheme should be more scientific rather than consumer friendly.
I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?
Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast.
Being able to run the 12B on 8GB VRAM is huge. It's crazy to see how fast these small local models have evolved.
had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI
The E4B model doesn’t fit on my phone TPU, so it swaps to RAM. The QAT version means more accuracy, good!
How were you getting anything useful out of that? We found the (unquantized!) E2B model to be completely useless at even the simplest real-world classification tasks.
How do you know it swaps to RAM vs. running on the TPU?
Would be interested in testing this on my pixel.
Once someone generates a MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5 year old GPU.
Models:
- Safetensors: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...
- GGUF: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/...
Note the README in the Unsloth list of files: llama.cpp is working on a PR to support the gemma4 drafters: https://github.com/ggml-org/llama.cpp/pull/23398. Also note the PR submitter didn't experience much speedup with 26B (seems typical that MoE models don't generally benefit from MTP).
Google already did
https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...
This is safetensors. Is there any way to run these on a Mac paired with the MLX QAT?
(Pardon my ignorance; this stuff moves so fast)
Did you see this?
https://point.free/blog/gemma-4-on-a-2016-xeon/
Xeon, but could be useful for MTP on Mac.
I hadn't seen this, thanks.
I do have the Qwen 3.6 (35B) MTP implementation running (in LM Studio; it doesn't need a separate drafter), along with non-MTP Gemma 4 26B, and I can see that Unsloth Studio can run the new QAT, but I can't see how you can run the assistant/drafter. Yet.
It's just a constantly changing landscape. Don't get me wrong, it's fascinating and for various reasons I am pleased I can keep up even slightly, but eeeehhh :-)
Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?
Google already released specialized drafters for Gemma 4.
The E2B ones? Or what do you mean by specialized drafters?
They have -assistant in the name, so e.g.: https://huggingface.co/google/gemma-4-31B-it-assistant
The “-assistant” models released by Google are specialised tiny MTP draft models :)
31b-it-assistant is what enables MTP
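For readers unfamiliar with how a drafter speeds up a larger model, here is a toy sketch of one generic draft-and-verify round with greedy decoding. It is not taken from Google's or llama.cpp's implementation, and whether the "-assistant" models follow exactly this scheme is an assumption. `target_next` and `draft_next` are hypothetical stand-ins for the large model and the drafter.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One greedy draft-and-verify round; returns the newly accepted tokens.

    target_next / draft_next map a list of tokens to the next token.
    """
    # 1. The small drafter proposes k tokens autoregressively.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # 2. The large model checks each position. A real implementation would do
    #    this in one batched forward pass; this sketch uses a plain loop.
    accepted = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)   # keep the target's token and stop
            return accepted
        accepted.append(token)

    # 3. Every draft matched, so the target's pass also yields a bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(tokens):
    # Toy "large model": always counts up by one.
    return tokens[-1] + 1


def draft_next(tokens):
    # Toy "drafter": agrees with the target except after multiples of 3.
    return tokens[-1] + 1 if tokens[-1] % 3 else tokens[-1] + 2


print(speculative_step(target_next, draft_next, [1]))  # [2, 3, 4]
```

Under greedy decoding the accepted tokens are the same ones the target would have produced alone, so any speedup depends on how often the drafter agrees and on how cheap it is to run.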
Google Pixel intelligence may beat Apple Intelligence.
For a moment I got excited thinking QAT is Intel Quick Assist Technology...
Same, I had to do a double take. Would be pretty humorous if they somehow took advantage of crypto offloading to accelerate AI inference.
How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)
@google.com'ers, there are no GGUFs (blog says there are)
Isn’t this it? https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf
Ah, nice, ty! My excuse is those repos were added to the collection after my comment, but perhaps not :3
I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.
Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.
Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.
I think that's probably true for the vast majority of Android phones. But if you have a SOTA expensive beast, I wonder if Gemma 4 12B at 4 bit could work? Maybe something like a Redmagic 11 pro or OnePlus 13 running NanoClaw?
But also maybe a few Qwen 3.6 or Qwen 3.5 variants can fit and can handle some simple tasks.
I think Gemma 4 12B is definitely possible to run on high-end phones; Google claims you need 16GB of memory. But it's probably not very usable, as you'll need to swap out most things other than the LLM.
When I tried E2B and E4B with Google Edge Gallery and added a web search skill from the skill list, E2B would fail (get stuck in a loop) and E4B would need a very specific instruction: "weather in [city name]" would not call the web search tool; I'd need "web search weather in [city name]". And the result was completely hallucinated and impossible. It claimed 14°C and feels like 4°C (which is impossible), and 10% humidity (which is almost impossible in this city).
Asking wikipedia level history questions (without any tool use), the results were awful as well.
I'm running a service in production using Gemma 4 models, to get structured JSON output back from web search tool calls using Unsloth Studio and its API, but it did require a rather large and detailed system prompt and tool call healing if the format wasn't JSON for example (retries, reprompting with feeding the error back into the model, etc, this is also what Unsloth Studio does for its self-healing tool call feature). But once I did that, it's been working quite well and on benchmarks I've made, it's about 97% accurate after the first time and basically 100% accurate after retries.
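The retry-and-reprompt idea described above can be sketched in a few lines. This is a minimal illustration, not Unsloth Studio's API. `generate` is a hypothetical callable standing in for whatever model endpoint is used, and the prompt wording is made up for the example.

```python
import json


def call_with_json_healing(generate, prompt, max_retries=2):
    """Ask for JSON; on a parse failure, feed the error back and retry.

    `generate` is a hypothetical stand-in: any function taking a prompt
    string and returning the model's reply as a string.
    """
    current_prompt = prompt
    for attempt in range(max_retries + 1):
        reply = generate(current_prompt)
        try:
            return json.loads(reply), attempt
        except json.JSONDecodeError as err:
            # Reprompt with the parser error and the bad reply attached.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({err.msg} at position {err.pos}). Previous reply:\n{reply}\n"
                "Reply again with valid JSON only."
            )
    raise ValueError(f"no valid JSON after {max_retries + 1} attempts")


# Stand-in for a real model call: fails once, then returns valid JSON.
replies = iter(['{"temp_c": 14,', '{"temp_c": 14}'])


def fake_generate(_prompt):
    return next(replies)


result, retries_used = call_with_json_healing(fake_generate, "Weather in X as JSON")
print(result, retries_used)  # {'temp_c': 14} 1
```

A production version would likely also validate the parsed object against a schema, since syntactically valid JSON can still have the wrong fields.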
This is running on a server though, not sure how well it'd work on a phone, I should try that. I used AI Edge Gallery on Android and it doesn't seem too good at the web search tool but maybe the web search tool itself, being a community made tool, is pretty bad, because tool calling via Unsloth Studio seems to work just fine with the exact same Gemma models on desktop/server vs the phone.
I agree that the web search tool probably is pretty bad. However a smart model would never hallucinate impossible weather data if the search tool failed.
I'm sure you can get something out of it if you babysit it with an optimized prompt, harness, etc. and you can tolerate some failures. But when I try to run the ChatGPT prompts from my history, even if I pick the easier ones, it's hopeless.
I'd like to have a local agent on the phone with wikipedia level knowledge. But you probably need more like 30B params.
I use the 4B on my phone and it seems to work fine without tool calls. So it's definitely an issue with that and not the model itself. I'll play around and see if I can fix that. You might also try the Searxng MCP, as it's a better web search tool.
I tried most prompts that didn't rely on recent knowledge on the basic "AI Chat", not the "Agent skills" version.
I just tested "List the 5 most recent Argentina vice presidents" on E4B and it literally got all 5 wrong
I use it for recommendations rather than knowledge, like recipes or basic stuff like that. It's likely due to its knowledge cutoff, so it's not necessarily accurate. But the agent skills section does have a query Wikipedia tool call.
Try this on Unsloth Studio, they seem to have fixed Gemma tool calling.
Argentina's vice presidents span from 2007 to 2023. Knowledge cutoff can't explain getting all 5 of them wrong.